Watsonx.ai comes with the popular inference stack vLLM out of the box. With the latest version it’s also possible to deploy custom inference servers.
Not every generative AI model can be run on vLLM. With the latest version (5.2.0) of watsonx.ai software, you can deploy custom containers with inference servers to run specific models. Models deployed via these custom inference servers look and feel like any other model on watsonx.ai; for example, you can use them in the watsonx.ai Prompt Lab and access them via APIs.
I like this powerful capability since watsonx.ai can be utilized as a platform to host all types of generative AI models.
There are several pages in the documentation that describe this new feature; a summary of the high-level steps follows further below.
- Building a custom inference runtime image
- Example Inference Server
- Deploying Custom Images in watsonx.ai
- Example Notebook for custom Images
- Setting up storage and uploading a model
- Registering a custom foundation model
- Requirements for deploying custom foundation models
Process
The following steps describe how to deploy custom inference servers.
- Upload the model files to a PVC on watsonx.ai
- Get the base image ‘runtime-24.1-py3.11-cuda'
- Develop the custom inference server Python code that implements the endpoint ‘/v1/chat/completions' (a local smoke test is sketched after this list)
- Build the image
- Push the image
- Create a custom software specification
- Create a custom runtime definition
- Choose the GPU hardware configuration
- Register your custom foundation model
- Test your model
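Before building the image, the inference server can be smoke-tested locally. Below is a minimal sketch using plain requests; it assumes the Flask app from the example in the next section is running locally on port 8080, and the model name is a placeholder.

import requests

# Placeholder endpoint; adjust host and port to wherever the server runs locally
url = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "provider/model",  # placeholder model identifier
    "messages": [
        {"role": "user", "content": "Hello, who are you?"}
    ]
}

response = requests.post(url, json=payload, timeout=30)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])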
Example
Custom inference servers need to implement the ‘/v1/chat/completions’ API from OpenAI. Check out the example.
from flask import Blueprint, request, jsonify
import uuid
import time

chat_bp = Blueprint('chat', __name__)

@chat_bp.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    data = request.get_json()
    # Return a static, OpenAI-compatible chat completion payload
    dummy_response = {
        "id": f"chatcmpl-{uuid.uuid4().hex[:24]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": data.get("model", "provider/model"),
        "choices": [
            {
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": "This is a dummy response from a mock API."
                },
                "finish_reason": "stop"
            }
        ],
        "usage": {
            "prompt_tokens": 10,
            "completion_tokens": 10,
            "total_tokens": 20
        }
    }
    return jsonify(dummy_response)
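The blueprint above is not a complete server on its own. Below is a minimal sketch of wiring it into a Flask application; the module name chat and port 8080 are assumptions and depend on how the container image is built and started.

from flask import Flask

# Assumption: the blueprint above lives in a module named chat (chat.py)
from chat import chat_bp

app = Flask(__name__)
app.register_blueprint(chat_bp)

if __name__ == "__main__":
    # Port 8080 is a placeholder; use the port the custom runtime expects
    app.run(host="0.0.0.0", port=8080)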
While the inference server implements the OpenAI API, applications access the deployed model through the same watsonx.ai endpoints as any other hosted model, including streaming:
- POST /ml/v1/deployments/{id_or_name}/text/chat
- POST /ml/v1/deployments/{id_or_name}/text/chat_stream
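Below is a minimal sketch of calling the deployed model with plain requests. The base URL, deployment id, bearer token, and version date are placeholders, and the response is assumed to follow the same chat completion structure as the example above.

import os
import requests

# Placeholders: set these for your environment
WATSONX_URL = os.environ["WATSONX_URL"]      # base URL of the watsonx.ai instance
DEPLOYMENT_ID = os.environ["DEPLOYMENT_ID"]  # id or serving name of the deployment
TOKEN = os.environ["WATSONX_TOKEN"]          # bearer token obtained separately

response = requests.post(
    f"{WATSONX_URL}/ml/v1/deployments/{DEPLOYMENT_ID}/text/chat",
    params={"version": "2024-05-01"},  # API version date (assumption; check the docs)
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "messages": [
            {"role": "user", "content": "Hello, who are you?"}
        ]
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])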
Next Steps
To learn more, check out the watsonx.ai documentation and the watsonx.ai landing page.