
Custom Inference Stacks in watsonx.ai

Watsonx.ai comes with the popular vLLM inference stack out of the box. With the latest version, it is also possible to deploy custom inference servers.

Not every Generative AI model can be run on vLLM. With the latest version (5.2.0) of watsonx.ai software, you can deploy custom containers with inference servers to run specific models. Models deployed with these custom inference servers look and feel like any other model on watsonx.ai; for example, you can use them in the watsonx.ai Prompt Lab and access them over APIs.

I like this powerful capability since watsonx.ai can be utilized as a platform to host all types of Generative AI models.

There are several pages in the documentation which describe this new feature. Below is a summary of the high-level steps.

Process

The following steps describe how to deploy custom inference servers.

  1. Upload the model files to a PVC on watsonx.ai
  2. Get the base image ‘runtime-24.1-py3.11-cuda’
  3. Develop the custom inference server Python code which implements the endpoint ‘/v1/chat/completions’
  4. Build the image
  5. Push the image (see the sketch after this list for both steps)
  6. Create a custom software specification
  7. Create a custom runtime definition
  8. Choose the GPU hardware configuration
  9. Register your custom foundation model
  10. Test your model
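
For steps 4 and 5, the image can be built and pushed with standard container tooling. The following is a minimal sketch using the Docker SDK for Python; the registry host, repository name, and tag are placeholders and must point to a registry the watsonx.ai cluster can pull from.

import docker

# Placeholders: adjust registry host, repository, and tag to an image registry
# the watsonx.ai cluster can pull from.
REPOSITORY = "registry.example.com/custom-inference/my-model-server"
TAG = "1.0.0"

client = docker.from_env()

# Step 4: build the custom inference server image from a local Dockerfile.
image, build_logs = client.images.build(path=".", tag=f"{REPOSITORY}:{TAG}")
for entry in build_logs:
    print(entry.get("stream", ""), end="")

# Step 5: push the image to the registry.
for line in client.images.push(REPOSITORY, tag=TAG, stream=True, decode=True):
    print(line)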

Example

Custom inference servers need to implement the ‘/v1/chat/completions’ API from OpenAI. Check out the example below.

from flask import Flask, Blueprint, request, jsonify
import uuid
import time

chat_bp = Blueprint('chat', __name__)

# Minimal implementation of the OpenAI-compatible chat completions endpoint.
@chat_bp.route('/v1/chat/completions', methods=['POST'])
def chat_completions():
    data = request.get_json()

    # Return a static, OpenAI-style response; a real inference server would
    # run the model on data["messages"] here.
    dummy_response = {
        "id": f"chatcmpl-{uuid.uuid4().hex[:24]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": data.get("model", "provider/model"),
        "choices": [
            {
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": "This is a dummy response from a mock API."
                },
                "finish_reason": "stop"
            }
        ],
        "usage": {
            "prompt_tokens": 10,
            "completion_tokens": 10,
            "total_tokens": 20
        }
    }
    return jsonify(dummy_response)


# Register the blueprint so the server can be started directly
# (the port is only an example).
app = Flask(__name__)
app.register_blueprint(chat_bp)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
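
Once the server is running locally, the endpoint can be exercised with a plain OpenAI-style request. The host and port below are assumptions matching the example app.run() call above.

import requests

# Local smoke test of the endpoint above; host and port are assumptions
# for local development.
payload = {
    "model": "provider/model",
    "messages": [{"role": "user", "content": "Hello!"}],
}
response = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=30)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])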

While the inference server itself implements the OpenAI API, applications access it in the same way as any other model hosted on watsonx.ai, including streaming, via the following endpoints:

  • POST /ml/v1/deployments/{id_or_name}/text/chat
  • POST /ml/v1/deployments/{id_or_name}/text/chat_stream
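
Below is a minimal sketch of calling the first endpoint with the requests library. The base URL, bearer token, deployment id, version date, and payload shape are assumptions; the exact request format is described in the watsonx.ai API reference.

import requests

# Placeholders: base URL of the watsonx.ai instance, a bearer token, and the
# deployment id or serving name of the custom foundation model.
WATSONX_URL = "https://<cluster-host>"
TOKEN = "<bearer-token>"
DEPLOYMENT = "<deployment-id-or-name>"

# The 'version' query parameter and the payload shape are assumptions based on
# general watsonx.ai REST conventions; check the API reference for details.
response = requests.post(
    f"{WATSONX_URL}/ml/v1/deployments/{DEPLOYMENT}/text/chat",
    params={"version": "2024-10-08"},
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"messages": [{"role": "user", "content": "Hello!"}]},
    timeout=60,
)
response.raise_for_status()
print(response.json())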

Next Steps

To learn more, check out the watsonx.ai documentation and the watsonx.ai landing page.

Disclaimer
The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.