Watsonx.ai is IBM’s next-generation enterprise studio for AI builders to train, validate, tune and deploy AI models. It comes with multiple Large Language Models that can be accessed via APIs.
Via the APIs, all models can be invoked in the same way to run inference. There are REST APIs, Python APIs, Node.js APIs and a CLI, all of which leverage the ‘/generate’ REST endpoint.
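For example, here is a minimal Python sketch that calls the endpoint via plain REST with the requests library. The endpoint URL and token are placeholders; the same call is shown with curl in the example further below.

import requests

# Placeholders: replace with your endpoint and API token.
API_URL = "https://workbench-api.res.ibm.com/v1/generate"
API_TOKEN = "your-token"

response = requests.post(
    API_URL,
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_TOKEN}",
    },
    json={
        "model_id": "tiiuae/falcon-40b",
        "inputs": ["Niklas Heidloff is a "],
        "parameters": {"max_new_tokens": 50},
    },
)
response.raise_for_status()
print(response.json()["results"][0]["generated_text"])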
The ‘/generate’ endpoint is intuitive, but let’s look at some details.
Input
In addition to the model, the decoding method is important. I described it in an earlier post, Decoding Methods for Generative AI. The related parameters are (a sketch follows the list):
- decoding_method
- beam_width
- temperature
- top_k
- top_p
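As an illustration, the following hypothetical parameter sets contrast greedy decoding with sampling. The values are examples, not recommendations, and the exact ‘decoding_method’ values should be checked against the API reference.

# Greedy decoding: deterministic; sampling parameters are ignored.
greedy_params = {
    "decoding_method": "greedy",
    "max_new_tokens": 50,
}

# Sampling: temperature, top_k and top_p control how much randomness is allowed.
sampling_params = {
    "decoding_method": "sample",
    "temperature": 0.7,  # < 1 sharpens the token distribution, > 1 flattens it
    "top_k": 50,         # only consider the 50 most likely tokens
    "top_p": 0.9,        # nucleus sampling: smallest token set covering 90% probability
    "max_new_tokens": 50,
}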
‘min_new_tokens’ and ‘max_new_tokens’ are relatively self-explanatory. Just note that ‘max_new_tokens’ does not automatically lead to the endings you might expect. For example, sentences can be cut off, so you might have to do some post-processing in your code, as sketched below.
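For example, a naive (hypothetical) post-processing step could trim an incomplete trailing sentence:

def trim_incomplete_sentence(text: str) -> str:
    # Cut the text at the last sentence-ending punctuation mark.
    # Real code would also need to handle abbreviations, quotes, etc.
    for index in range(len(text) - 1, -1, -1):
        if text[index] in ".!?":
            return text[: index + 1]
    return text  # no sentence boundary found, keep the text as-is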
‘length_penalty’ is an interesting parameter. Some people who use large language models to generate answers in Question Answering scenarios expect rather short answers with additional links to find out more. This setting can help shorten the answers for some models. Note that the length primarily depends on how models have been pretrained and fine-tuned. Decoder models with greedy decoding in particular only focus on the next token; in that case the ‘length_penalty’ parameter has no impact.
‘repetition_penalty’ is very useful since it addresses the common repetition issue. My understanding is that it only works when sampling is used as the decoding method.
‘stop_sequences’ are one or more strings that cause the text generation to stop when one of them is generated. This can be useful, for example, if the model generates certain unethical words.
‘time_limit’ and ‘truncate_input_tokens’ are basically convenience parameters: the former aborts generation after a certain time, the latter shortens prompts that exceed a given length.
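Put together, a request using these parameters might look as follows. This is a sketch; the parameter shapes and units are assumptions on my side (e.g. the exact schema of ‘length_penalty’ differs per API version), so check the API reference.

parameters = {
    "max_new_tokens": 50,
    "min_new_tokens": 10,
    "repetition_penalty": 1.2,      # values > 1 penalize repeated tokens
    "stop_sequences": ["\n\n"],     # stop once this string is generated
    "time_limit": 10000,            # assumed to be milliseconds
    "truncate_input_tokens": 1024,  # truncate overly long prompts
}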
Output
The output can include more or less information, depending on additional input parameters:
- generated_tokens: Include the list of individual generated tokens.
- input_tokens: Include the list of input tokens for decoder-only models.
- token_logprobs: Include the natural log of the probability for each returned token.
- token_ranks: Include the rank of each returned token.
- top_n_tokens: Include the top n candidate tokens at the position of each returned token.
The response also contains one of the following ‘stop_reason’ values (a handling sketch follows the list):
- NOT_FINISHED - Possibly more tokens to be streamed
- MAX_TOKENS - Maximum requested tokens reached
- EOS_TOKEN - End of sequence token encountered
- CANCELLED - Request canceled by the client
- TIME_LIMIT - Time limit reached
- STOP_SEQUENCE - Stop sequence encountered
- TOKEN_LIMIT - Token limit reached
- ERROR - Error encountered
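In client code, ‘stop_reason’ can be used to decide whether post-processing is needed, as in this sketch (reusing ‘response’ and ‘trim_incomplete_sentence’ from the snippets above):

result = response.json()["results"][0]

if result["stop_reason"] == "ERROR":
    raise RuntimeError("Text generation failed")
elif result["stop_reason"] == "MAX_TOKENS":
    # The output may have been cut off mid-sentence.
    text = trim_incomplete_sentence(result["generated_text"])
else:
    text = result["generated_text"]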
Example
Let’s look at an example.
curl https://workbench-api.res.ibm.com/v1/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer your-token' \
  -d '{
    "model_id": "tiiuae/falcon-40b",
    "inputs": ["Niklas Heidloff is a "],
    "parameters": {
      "temperature": 0,
      "max_new_tokens": 50,
      "return_options": {
        "generated_tokens": true,
        "input_tokens": true,
        "token_logprobs": true,
        "token_ranks": true,
        "top_n_tokens": 2
      }
    }
  }' | jq '.'
The following output is returned. Note that for each token the response also contains the log probability and other candidate tokens.
BTW: I’m neither 23, nor a photographer. Looks like Falcon doesn’t read my blog. However, since my last name occurs most often in Germany, it guessed that I’m from Germany.
{
  "id": "90fdff4f-d714-461d-866d-4712b20f3c8b",
  "model_id": "tiiuae/falcon-40b",
  "created_at": "2023-08-25T06:56:14.541Z",
  "results": [
    {
      "generated_text": "23-year-old German photographer who has been taking pictures since he was 15. He is currently studying photography at the University of Applied Sciences and Arts in Dortmund.\nNiklas’s work is a mixture of fashion and portrait photography. He",
      "generated_token_count": 50,
      "input_token_count": 9,
      "stop_reason": "MAX_TOKENS",
      "generated_tokens": [
        {
          "text": "23",
          "logprob": -2.724609375,
          "rank": 1,
          "top_tokens": [
            {
              "text": "23",
              "logprob": -2.724609375
            },
            {
              "text": "24",
              "logprob": -2.771484375
            }
          ]
        },
        {
          "text": "-",
          "logprob": -0.544921875,
          "rank": 1,
          "top_tokens": [
            {
              "text": "-",
              "logprob": -0.544921875
            },
            {
              "text": "Ġyear",
              "logprob": -1.30859375
            }
          ]
        },
        ...
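Since ‘token_logprobs’ returns natural logs, converting a value back into a probability is a one-liner:

import math

logprob = -2.724609375           # logprob of the token "23" from the output above
probability = math.exp(logprob)  # ~0.066, i.e. roughly a 6.6 % chance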
Next Steps
To learn more, check out the Watsonx.ai documentation and the Watsonx.ai landing page.