This post walks you through working with Ollama's REST APIs, focusing on the core endpoints. You'll learn how to generate AI model responses, understand the parameters you can use, and see an example request in action. This guide is ideal for developers looking to integrate Ollama's AI capabilities into their applications.
1. Introduction to the /api/generate Endpoint
The /api/generate endpoint in Ollama's REST API allows you to generate a response from a model for a given prompt. This endpoint supports both standard and advanced parameters, making it highly customizable for your needs. It is a streaming endpoint, which provides incremental data as the AI processes the request.
1.1 Required Parameters
· model: The name of the model to use (e.g., "llama3.2").
· prompt: The input text to generate a response for.
1.2 Optional Parameters
· suffix: Text appended after the model response.
· images: Base64-encoded images (used for multimodal models like llava).
1.3 Advanced Parameters
· format: The response format (json or a JSON schema).
· options: Additional model parameters like temperature.
· system: A system message to override what’s defined in the Modelfile.
· template: A custom prompt template to override the Modelfile template.
· stream: Set to false to get a single response object instead of a stream.
· raw: Prevents formatting of the prompt, ideal for templated prompts.
· keep_alive: Duration to keep the model in memory (default: 5 minutes).
· context (deprecated): The context array returned by a previous /api/generate request; it can be passed back to keep a short conversational memory.
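As a quick illustration, a request that combines several of these parameters might look like the sketch below (the system message, temperature, and keep_alive values are only examples, not values from the transcript later in this post):
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Tell me a joke",
  "system": "You are a comedian who only tells one-liner jokes.",
  "options": { "temperature": 0.7 },
  "keep_alive": "10m",
  "stream": false
}'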
1.4 Example Request
Below is an example of using the /api/generate endpoint to get a joke using the llama3.2 model.
curl http://localhost:11434/api/generate -d ' { "model": "llama3.2", "prompt": "Tell me a joke" } '
$ curl http://localhost:11434/api/generate -d '
> {
>   "model": "llama3.2",
>   "prompt": "Tell me a joke"
> }
> '
{"model":"llama3.2","created_at":"2025-01-18T05:54:00.465195Z","response":"Why","done":false}
{"model":"llama3.2","created_at":"2025-01-18T05:54:00.480944Z","response":" don","done":false}
{"model":"llama3.2","created_at":"2025-01-18T05:54:00.496633Z","response":"'t","done":false}
{"model":"llama3.2","created_at":"2025-01-18T05:54:00.512623Z","response":" eggs","done":false}
{"model":"llama3.2","created_at":"2025-01-18T05:54:00.527991Z","response":" tell","done":false}
{"model":"llama3.2","created_at":"2025-01-18T05:54:00.543666Z","response":" jokes","done":false}
{"model":"llama3.2","created_at":"2025-01-18T05:54:00.561027Z","response":"?\n\n","done":false}
{"model":"llama3.2","created_at":"2025-01-18T05:54:00.577466Z","response":"Because","done":false}
{"model":"llama3.2","created_at":"2025-01-18T05:54:00.5932Z","response":" they","done":false}
{"model":"llama3.2","created_at":"2025-01-18T05:54:00.609621Z","response":"'d","done":false}
{"model":"llama3.2","created_at":"2025-01-18T05:54:00.62539Z","response":" crack","done":false}
{"model":"llama3.2","created_at":"2025-01-18T05:54:00.640999Z","response":" each","done":false}
{"model":"llama3.2","created_at":"2025-01-18T05:54:00.657273Z","response":" other","done":false}
{"model":"llama3.2","created_at":"2025-01-18T05:54:00.673042Z","response":" up","done":false}
{"model":"llama3.2","created_at":"2025-01-18T05:54:00.688764Z","response":"!","done":false}
{"model":"llama3.2","created_at":"2025-01-18T05:54:00.705071Z","response":"","done":true,"done_reason":"stop","context":[128006,9125,128007,271,38766,1303,33025,2696,25,6790,220,2366,18,271,128009,128006,882,128007,271,41551,757,264,22380,128009,128006,78191,128007,271,10445,1541,956,19335,3371,32520,1980,18433,814,4265,17944,1855,1023,709,0],"total_duration":1950280125,"load_duration":564980625,"prompt_eval_count":29,"prompt_eval_duration":1143000000,"eval_count":16,"eval_duration":240000000}
The response is streamed as a series of JSON objects. Each object contains parts of the model's response (response field). The final object includes additional metadata like statistics and marks the end of the response with "done": true.
Each piece of the response contributes to the final joke.
{"response": "Why"} {"response": " don"} {"response": "'t"} {"response": " eggs"} ...
When concatenated, the full response is:
"Why don't eggs tell jokes?\n\nBecause they'd crack each other up!"
The final JSON object indicates the completion of the streaming process.
{ "response": "", "done": true, "done_reason": "stop", "context": [...], "total_duration": 1950280125, "load_duration": 564980625, "prompt_eval_count": 29, "prompt_eval_duration": 1143000000, "eval_count": 16, "eval_duration": 240000000 }
1.5 Key Fields in the Response JSON
· model: Specifies the model used to generate the response, e.g., "llama3.2".
· created_at: Timestamp of when each part of the response was generated.
Example: "2025-01-18T05:54:00.465195Z"
· response: A partial or complete text output generated by the model.
· done: Indicates whether the response is complete.
"done": false: Streaming is in progress.
"done": true: The response is complete.
· done_reason: Explains why the response stream stopped.
Example: "stop" indicates the end of the response generation.
· context: A list of token IDs representing the model's internal context (useful for debugging or keeping conversational memory).
· Timing Information:
o total_duration: Total time taken to process the request (in nanoseconds).
Example: 1950280125 (1.95 seconds).
o load_duration: Time taken to load the model into memory.
Example: 564980625 (0.56 seconds).
o prompt_eval_duration: Time spent evaluating the input prompt.
Example: 1143000000 (1.14 seconds).
o eval_duration: Time spent generating the model's response.
Example: 240000000 (0.24 seconds).
· prompt_eval_count: Number of tokens in the input prompt processed by the model.
Example: 29.
· eval_count: Number of tokens generated in the response.
Example: 16.
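These counters also let you estimate generation speed: tokens per second is eval_count / eval_duration × 10^9, since eval_duration is in nanoseconds. A rough sketch using jq (assumed to be installed) is shown below; with the sample values above (16 tokens generated in 240000000 ns) it works out to roughly 66 tokens per second.
# Request a non-streamed response and compute tokens/second from the final statistics
curl -s http://localhost:11434/api/generate \
  -d '{ "model": "llama3.2", "prompt": "Tell me a joke", "stream": false }' \
  | jq '{tokens_per_second: (.eval_count / .eval_duration * 1e9)}'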
1.6 Understanding the Final Joke
When concatenated, the response forms the following joke:
Q: Why don't eggs tell jokes? A: Because they'd crack each other up!
This demonstrates the streaming nature of the API: the model generates responses piece by piece.
2. Return the response as a single JSON object by setting stream to false
curl http://localhost:11434/api/generate -d ' { "model" : "llama3.2", "prompt" : "Tell me a joke", "stream" : false } '
$curl http://localhost:11434/api/generate -d ' { "model" : "llama3.2", "prompt" : "Tell me a joke", "stream" : false } ' {"model":"llama3.2","created_at":"2025-01-18T06:01:32.913057Z","response":"Here's one:\n\nWhat do you call a fake noodle?\n\nAn impasta.","done":true,"done_reason":"stop","context":[128006,9125,128007,271,38766,1303,33025,2696,25,6790,220,2366,18,271,128009,128006,882,128007,271,41551,757,264,22380,128009,128006,78191,128007,271,8586,596,832,1473,3923,656,499,1650,264,12700,46895,273,1980,2127,3242,14635,13],"total_duration":395412458,"load_duration":33141292,"prompt_eval_count":29,"prompt_eval_duration":94000000,"eval_count":18,"eval_duration":267000000}
The formatted response looks like this:
{ "model": "llama3.2", "created_at": "2025-01-18T06:01:32.913057Z", "response": "Here's one:\n\nWhat do you call a fake noodle?\n\nAn impasta.", "done": true, "done_reason": "stop", "context": [ 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 271, 128009, 128006, 882, 128007, 271, 41551, 757, 264, 22380, 128009, 128006, 78191, 128007, 271, 8586, 596, 832, 1473, 3923, 656, 499, 1650, 264, 12700, 46895, 273, 1980, 2127, 3242, 14635, 13 ], "total_duration": 395412458, "load_duration": 33141292, "prompt_eval_count": 29, "prompt_eval_duration": 94000000, "eval_count": 18, "eval_duration": 267000000 }
When "stream": false, the entire response is processed and delivered as a single JSON object instead of being broken into chunks. This simplifies handling responses for cases where streaming is unnecessary or unwanted.
3. The /api/chat Endpoint
The /api/chat endpoint is used to make the model generate a response in a conversation. For example, you can use it to ask questions or continue a chat based on previous messages.
3.1 How It Works
You send a POST request to /api/chat with some parameters, such as:
· The model name (e.g., "llama3.2")
· The messages in the chat (e.g., what you and the AI have said so far)
· Optional settings, like whether to stream the response or get it all at once.
The AI will process this and send back the next message in the chat.
3.2 Example
curl http://localhost:11434/api/chat -d '{ "model": "llama3.2", "messages": [ { "role": "user", "content": "why is the sky blue? Explain in 10 words maximum" } ] }'
What It Means
· You’re using the "llama3.2" model.
· You’re sending a single message: "Why is the sky blue? Explain in 10 words maximum."
· The AI will read this and send back a short response (10 words or fewer).
3.3 Required Parameters
· model: The name of the AI model you want to use (e.g., "llama3.2").
Example: "model": "llama3.2"
· messages: A list of all the chat messages so far. This lets the model understand the context of the conversation; a multi-turn example is sketched after this list.
Each message has:
o role: Who is speaking (system, user, assistant, or tool).
§ system: Instructions for how the AI should behave.
§ user: Messages sent by you.
§ assistant: Responses from the AI.
§ tool: The result of a tool call, passed back to the model.
o content: The actual text of the message.
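A multi-turn request might look like the sketch below; the earlier assistant message is included so the model can answer the follow-up question in context (the conversation content here is purely illustrative):
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "system", "content": "You are a concise science tutor." },
    { "role": "user", "content": "Why is the sky blue?" },
    { "role": "assistant", "content": "Air molecules scatter blue sunlight more than other colors." },
    { "role": "user", "content": "Then why are sunsets red?" }
  ],
  "stream": false
}'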
3.4 Optional Parameters
· stream: Controls how the response is delivered. If true, the response is sent in small chunks (useful for long messages). If false, the entire response is sent at once.
Default: true.
· tools: Lets the model use extra tools (function calling) to enhance its response. Only works when "stream" is false.
· format: The format to return the response in (json or a JSON schema). If omitted, the response is returned as plain text.
· options: Additional settings for the model, such as how creative the response should be. For example, temperature controls randomness (higher = more random responses).
· keep_alive: Specifies how long the model stays in memory after the request. Default is 5 minutes.
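As a sketch of how these optional parameters fit together (the temperature and keep_alive values are only examples), a chat request asking for a more creative answer might look like:
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Suggest a name for a pet goldfish" }
  ],
  "options": { "temperature": 1.2 },
  "keep_alive": "10m",
  "stream": false
}'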
3.5 Understanding the Response of the Chat API
$ curl http://localhost:11434/api/chat -d '{
>   "model": "llama3.2",
>   "messages": [
>     {
>       "role": "user",
>       "content": "why is the sky blue? Explain in 10 words maximum"
>     }
>   ],
>   "stream": false
> }'
{"model":"llama3.2","created_at":"2025-01-18T06:12:48.888132Z","message":{"role":"assistant","content":"Scattering of sunlight by atmospheric molecules and particles occurs."},"done_reason":"stop","done":true,"total_duration":354791125,"load_duration":27936708,"prompt_eval_count":37,"prompt_eval_duration":152000000,"eval_count":12,"eval_duration":174000000}
This response shows the result of an API call where the streaming option was disabled ("stream": false). As a result, the full response is delivered in one single JSON object, rather than in smaller chunks.
3.6 Key Elements in the Response
§ "model": "llama3.2": Indicates that the response was generated using the "llama3.2" AI model.
§ Response Message: "message": {"role": "assistant", "content": "Scattering of sunlight by atmospheric molecules and particles occurs."}.
This is the assistant's full response. It answers the user's question, "Why is the sky blue?" within 10 words, as requested. The explanation is concise and scientifically accurate.
· Done Signal:
"done": true: Indicates that the AI has finished generating its response.
"done_reason": "stop": The completion was achieved naturally, meaning the AI finished without hitting constraints like a token limit or timeout.
· Timing Information:
o "total_duration": 354791125 (nanoseconds): Total time taken for the response, which equals about 354 milliseconds.
o "load_duration": 27936708 (nanoseconds): Time spent loading the model into memory, which equals about 28 milliseconds.
o "prompt_eval_duration": 152000000 (nanoseconds): Time taken to process the input message (i.e., evaluate the question).
o "eval_duration": 174000000 (nanoseconds): Time taken to generate the response text.
· Token Statistics:
o "prompt_eval_count": 37: Number of tokens (words or pieces of words) in the input message (prompt) processed by the model.
o "eval_count": 12: Number of tokens generated by the assistant in its response.
4. Getting the response as JSON
curl http://localhost:11434/api/generate -d ' { "model" : "llama3.2", "prompt" : "Tell me a joke? RESPONSD using JSON", "format" : "json", "stream" : false } '
By setting the format property to json, we can ask the LLM to return the response as JSON.
$ curl http://localhost:11434/api/generate -d '
> {
>   "model": "llama3.2",
>   "prompt": "Tell me a joke? RESPONSD using JSON",
>   "format": "json",
>   "stream": false
> }
> '
{"model":"llama3.2","created_at":"2025-01-18T06:16:03.594088Z","response":"{\"setup\": \"Why did the scarecrow win an award?\", \"punchline\": \"Because he was outstanding in his field!\"}","done":true,"done_reason":"stop","context":[128006,9125,128007,271,38766,1303,33025,2696,25,6790,220,2366,18,271,128009,128006,882,128007,271,41551,757,264,22380,30,46577,715,5608,1701,4823,128009,128006,78191,128007,271,5018,15543,794,330,10445,1550,279,44030,52905,3243,459,10292,32111,330,79,3265,1074,794,330,18433,568,574,19310,304,813,2115,9135,92],"total_duration":626731875,"load_duration":34872000,"prompt_eval_count":35,"prompt_eval_duration":148000000,"eval_count":29,"eval_duration":442000000}
The formatted response looks like this:
{ "model": "llama3.2", "created_at": "2025-01-18T06:16:03.594088Z", "response": "{\"setup\": \"Why did the scarecrow win an award?\", \"punchline\": \"Because he was outstanding in his field!\"}", "done": true, "done_reason": "stop", "context": [ 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 271, 128009, 128006, 882, 128007, 271, 41551, 757, 264, 22380, 30, 46577, 715, 5608, 1701, 4823, 128009, 128006, 78191, 128007, 271, 5018, 15543, 794, 330, 10445, 1550, 279, 44030, 52905, 3243, 459, 10292, 32111, 330, 79, 3265, 1074, 794, 330, 18433, 568, 574, 19310, 304, 813, 2115, 9135, 92 ], "total_duration": 626731875, "load_duration": 34872000, "prompt_eval_count": 35, "prompt_eval_duration": 148000000, "eval_count": 29, "eval_duration": 442000000 }
References
https://github.com/ollama/ollama/blob/main/docs/api.md