Text Generation Inference for Foundation Models

Serving AI models is resource intensive. There are various model inference platforms that help operating these models as efficiently as possible. This post summarizes two platforms for classic ML and foundation models.

Kubeflow KServe

KServe enables inferencing on Kubernetes and provides performant, high abstraction interfaces for common machine learning frameworks like TensorFlow and PyTorch to solve production model serving use cases. Read my earlier post how to run Watson NLP on KServe ModelMesh.

There are two options how to run KServe. The easiest way is to use ‘Single Model Serving’ where each model is associated with one dedicated runtime container. For frequently-changing model use cases and to achieve better scalability, containers can also handle multiple models via ‘ModelMesh’.

Text Generation Inference

While KServe is a great platform to serve classic machine learning models, it has not been designed specifically for Large Language Models and other Foundation Models. These models have extended requirements that TGI addresses. For example, while CPUs can easily be virtualized, this hasn’t really been possible so far for GPUs. Additionally, batching and caching is performed heavily to optimize access to the GPUs. Here is a list of some of the features from the TGI documentation:

Simple launcher to serve most popular LLMs
Production ready
Tensor Parallelism for faster inference on multiple GPUs
Token streaming using Server-Sent Events
Continuous batching of incoming requests for increased total throughput
Quantization with bitsandbytes

IBM also uses TGI from Hugging Face and has extended features like gRPC.

The article Fine-tuning and Serving an open source foundation model with Red Hat OpenShift AI describes how TGI (TGIS) is integrated in KServe in Red Hat OpenShift AI.

Next Steps

To learn more, check out the Watsonx.ai documentation and the Watsonx.ai landing page.