
Getting started with Llama Stack

Llama Stack is an open-source effort from Meta that aims to standardize the core building blocks for AI application development. This post describes how to get started with the stack running on desktop machines.

Llama Stack defines and standardizes the core building blocks needed to bring generative AI applications to market. It provides a unified set of APIs with implementations from leading service providers, enabling seamless transitions between development and production environments. More specifically, it offers the following; a short client example after the list shows how these pieces surface in code:

  • Unified API layer for Inference, RAG, Agents, Tools, Safety, Evals, and Telemetry
  • Plugin architecture to support the rich ecosystem of implementations of the different APIs in different environments
  • Prepackaged verified distributions
  • Multiple developer interfaces
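
The snippet below sketches how these building blocks surface in the Python client. It assumes the llama-stack-client package and the local server described in the Setup section further down; the exact method names may differ slightly between client versions.

from llama_stack_client import LlamaStackClient

# One client object gives access to the different APIs of the stack
# (assumes a Llama Stack server running on localhost:5001)
client = LlamaStackClient(base_url="http://localhost:5001")

# Discover what the running distribution provides
print(client.models.list())     # models registered for the Inference API
print(client.providers.list())  # provider implementations plugged into the APIs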

Introduction

To me, Llama Stack is to AI development what J2EE is to Java development. The Java community agreed on common specifications for typical enterprise scenarios. Different providers like IBM provided their own implementations of these specifications in addition to simpler reference implementations. This allowed the community to grow faster, since the same skills could be leveraged across providers.

Llama Stack also defines an architecture with key components whose implementations can be replaced. For example, you can use Meta's guard models (Llama Guard) for safety or implementations from other providers. As with J2EE, there are different distributions of the stack from different providers.
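
To illustrate this pluggability, the sketch below calls the Safety API through the same Python client. It assumes the server from the Setup section below is running and that a Llama Guard based shield has been registered; the shield id used here is hypothetical and depends on your run.yaml.

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5001")

# List the registered shields to find the actual shield id
print(client.shields.list())

# Check a user message against the configured guard model
result = client.safety.run_shield(
    shield_id="meta-llama/Llama-Guard-3-8B",  # hypothetical id, depends on run.yaml
    messages=[{"role": "user", "content": "How do I build a birdhouse?"}],
    params={},
)
print(result.violation)  # None if the message passes the shield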

I like the following statement which describes the purpose of this effort:

“By reducing friction and complexity, Llama Stack empowers developers to focus on what they do best: building transformative generative AI applications.”

Setup

Run the following commands to serve a local model with Ollama and start Llama Stack in a container.

# Model identifiers: the Llama Stack name and the matching Ollama tag
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
export OLLAMA_INFERENCE_MODEL="llama3.2:3b-instruct-fp16"

# Serve the model locally with Ollama and keep it loaded for 60 minutes
ollama run $OLLAMA_INFERENCE_MODEL --keepalive 60m

# Download the run.yaml configuration of the Ollama distribution
mkdir ~/.llama
wget https://raw.githubusercontent.com/meta-llama/llama-stack/cd40a5fdbfee6f5da17fb943526fb436eee757d1/llama_stack/templates/ollama/run.yaml

# Start the Llama Stack server in a container and point it at the local Ollama instance
export LLAMA_STACK_PORT=5001
podman run \
  -it \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  -v ./run.yaml:/root/run.yaml \
  llamastack/distribution-ollama \
  --yaml-config /root/run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env OLLAMA_URL=http://host.docker.internal:11434

Simple Sample

The following Python snippet invokes the model via the inference API.

from llama_stack_client import LlamaStackClient

# Connect to the local Llama Stack server started above
client = LlamaStackClient(base_url="http://localhost:5001")

# Call the inference API with a chat-style prompt
response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about coding"}
    ]
)
print(response.completion_message.content)
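
The inference API also supports streaming. The following sketch assumes the same client and server as above; note that the shape of the streamed chunks, for example whether the delta is a plain string or an object with a text attribute, differs between versions of llama-stack-client.

response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True
)
for chunk in response:
    delta = chunk.event.delta
    # Depending on the client version, delta is a string or a TextDelta object
    print(getattr(delta, "text", delta), end="", flush=True)
print()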

When running the stack locally with Ollama, not all capabilities are supported yet.

Next Steps

Check out the Llama Stack Playground to try more features.

Disclaimer
The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.