LLMs generate text incrementally, one token at a time. To improve responsiveness, many LLM providers support streaming the output token by token instead of sending the full response at once. The StreamingChatModel and StreamingLanguageModel interfaces let you take advantage of this behavior in your applications. By implementing the StreamingChatResponseHandler interface, you can react to three key events:
· onPartialResponse(String partialResponse): Triggered as each token is generated, enabling real-time UI updates.
· onCompleteResponse(ChatResponse completeResponse): Called once the full response is ready.
· onError(Throwable error): Invoked if an error occurs during streaming.
This approach not only delivers a smoother user experience but also gives developers granular control over how responses are handled and displayed.
Below is a complete working application.
StreamingDemo.java
package com.sample.app.streaming;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.model.chat.response.StreamingChatResponseHandler;
import dev.langchain4j.model.ollama.OllamaStreamingChatModel;

public class StreamingDemo {

    public static void main(String[] args) throws InterruptedException {
        OllamaStreamingChatModel chatModel = OllamaStreamingChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("llama3.2")
                .temperature(0.3)
                .timeout(Duration.ofMinutes(1))
                .build();

        String prompt = "Tell me some Interesting Fact About LLMs in maximum 30 words";

        chatModel.chat(prompt, new StreamingChatResponseHandler() {

            @Override
            public void onPartialResponse(String partialResponse) {
                System.out.println("onPartialResponse: " + partialResponse);
            }

            @Override
            public void onCompleteResponse(ChatResponse completeResponse) {
                System.out.println("onCompleteResponse: " + completeResponse);
            }

            @Override
            public void onError(Throwable error) {
                error.printStackTrace();
            }
        });

        // Sleep for 2 minutes so the JVM stays alive while the response tokens stream in
        TimeUnit.MINUTES.sleep(2);
    }
}
Output
onPartialResponse: LL
onPartialResponse: Ms
onPartialResponse: (
onPartialResponse: Large
onPartialResponse: Language
onPartialResponse: Models
onPartialResponse: )
onPartialResponse: can
onPartialResponse: generate
onPartialResponse: human
onPartialResponse: -like
onPartialResponse: text
onPartialResponse: ,
onPartialResponse: summarize
onPartialResponse: content
onPartialResponse: ,
onPartialResponse: and
onPartialResponse: even
onPartialResponse: create
onPartialResponse: new
onPartialResponse: stories
onPartialResponse: .
onPartialResponse: They
onPartialResponse: learn
onPartialResponse: from
onPartialResponse: vast
onPartialResponse: amounts
onPartialResponse: of
onPartialResponse: data
onPartialResponse: ,
onPartialResponse: improving
onPartialResponse: over
onPartialResponse: time
onPartialResponse: with
onPartialResponse: user
onPartialResponse: feedback
onPartialResponse: .
onCompleteResponse: ChatResponse { aiMessage = AiMessage { text = "LLMs (Large Language Models) can generate human-like text, summarize content, and even create new stories. They learn from vast amounts of data, improving over time with user feedback." toolExecutionRequests = [] }, metadata = ChatResponseMetadata{id='null', modelName='llama3.2', tokenUsage=TokenUsage { inputTokenCount = 39, outputTokenCount = 38, totalTokenCount = 77 }, finishReason=STOP} }
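In the demo above, the main thread simply sleeps for two minutes so the JVM stays alive while the tokens arrive. If you prefer not to rely on a fixed sleep, one option is to block on a CompletableFuture that is completed from onCompleteResponse. The snippet below is a minimal sketch of that idea, reusing the same chatModel and prompt as in StreamingDemo.java; it additionally needs java.util.concurrent.CompletableFuture on the import list.

CompletableFuture<ChatResponse> futureResponse = new CompletableFuture<>();

chatModel.chat(prompt, new StreamingChatResponseHandler() {

    @Override
    public void onPartialResponse(String partialResponse) {
        System.out.print(partialResponse); // print tokens as they arrive
    }

    @Override
    public void onCompleteResponse(ChatResponse completeResponse) {
        futureResponse.complete(completeResponse); // unblock the waiting thread
    }

    @Override
    public void onError(Throwable error) {
        futureResponse.completeExceptionally(error);
    }
});

// Blocks until the full response (or an error) is available
ChatResponse response = futureResponse.join();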
A more concise way to handle streamed responses is to use the LambdaStreamingResponseHandler class. This utility offers static methods that create a StreamingChatResponseHandler from lambda expressions. To stream responses using lambdas, simply call the onPartialResponse() static method and pass a lambda that defines how to handle each partial response:
import static dev.langchain4j.model.LambdaStreamingResponseHandler.onPartialResponse;

chatModel.chat(prompt, onPartialResponse(System.out::print));
This pattern allows for clean, readable code while enabling real-time response streaming.
StreamingCompactWay.java
package com.sample.app.streaming;

import static dev.langchain4j.model.LambdaStreamingResponseHandler.onPartialResponse;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

import dev.langchain4j.model.ollama.OllamaStreamingChatModel;

public class StreamingCompactWay {

    public static void main(String[] args) throws InterruptedException {
        OllamaStreamingChatModel chatModel = OllamaStreamingChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("llama3.2")
                .temperature(0.3)
                .timeout(Duration.ofMinutes(1))
                .build();

        String prompt = "Tell me some Interesting Fact About LLMs in maximum 30 words";

        chatModel.chat(prompt, onPartialResponse(System.out::println));

        // Sleep for 2 minutes so the JVM stays alive while the response tokens stream in
        TimeUnit.MINUTES.sleep(2);
    }
}
Output
LL Ms ( Large Language Models ) can generate human -like text , but they lack common sense and understanding of the world , often producing nons ens ical or absurd responses to real -world scenarios .
The onPartialResponseAndError() method lets you specify handlers for both the onPartialResponse() and onError() events in a single, streamlined call.
import static dev.langchain4j.model.LambdaStreamingResponseHandler.onPartialResponseAndError;

chatModel.chat(prompt, onPartialResponseAndError(System.out::println, Throwable::printStackTrace));
StreamingCompactWayBothResponseAndError.java
package com.sample.app.streaming;

import static dev.langchain4j.model.LambdaStreamingResponseHandler.onPartialResponseAndError;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

import dev.langchain4j.model.ollama.OllamaStreamingChatModel;

public class StreamingCompactWayBothResponseAndError {

    public static void main(String[] args) throws InterruptedException {
        OllamaStreamingChatModel chatModel = OllamaStreamingChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("llama3.2")
                .temperature(0.3)
                .timeout(Duration.ofMinutes(1))
                .build();

        String prompt = "Tell me some Interesting Fact About LLMs in maximum 30 words";

        chatModel.chat(prompt, onPartialResponseAndError(System.out::println, Throwable::printStackTrace));

        // Sleep for 2 minutes so the JVM stays alive while the response tokens stream in
        TimeUnit.MINUTES.sleep(2);
    }
}
Output
LL Ms ( Large Language Models ) can generate human -like text , but their understanding is limited to patterns learned from vast datasets , making them prone to biases and inaccur acies .
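Since onPartialResponse() only hands you one token at a time, a common pattern when you want to display the text as it grows (for example in a UI) is to accumulate the tokens yourself. The snippet below is a minimal sketch of that idea, reusing the same handler interface, chatModel and prompt shown above; the commented-out render() call is a hypothetical placeholder for whatever display logic your application uses.

StringBuilder assembled = new StringBuilder();

chatModel.chat(prompt, new StreamingChatResponseHandler() {

    @Override
    public void onPartialResponse(String partialResponse) {
        assembled.append(partialResponse); // grow the text token by token
        // render(assembled.toString());   // hypothetical UI update hook
    }

    @Override
    public void onCompleteResponse(ChatResponse completeResponse) {
        // The final text is also available directly from the complete response
        System.out.println(completeResponse.aiMessage().text());
    }

    @Override
    public void onError(Throwable error) {
        error.printStackTrace();
    }
});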