LLMs generate text incrementally, one token at a time. To improve responsiveness, many LLM providers support streaming outputs instead of sending a full response at once. The StreamingChatModel and StreamingLanguageModel interfaces allow you to leverage this behavior in your applications. By implementing the StreamingChatResponseHandler, you can react to three key events:
· onPartialResponse(String partialResponse): Triggered as each token is generated, enabling real-time UI updates.
· onCompleteResponse(ChatResponse completeResponse): Called once the full response is ready.
· onError(Throwable error): Invoked if an error occurs during streaming.
This approach not only delivers a smoother user experience but also gives developers granular control over how responses are handled and displayed.
The complete working application is shown below.
StreamingDemo.java
package com.sample.app.streaming;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.model.chat.response.StreamingChatResponseHandler;
import dev.langchain4j.model.ollama.OllamaStreamingChatModel;

public class StreamingDemo {

    public static void main(String[] args) throws InterruptedException {
        OllamaStreamingChatModel chatModel = OllamaStreamingChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("llama3.2")
                .temperature(0.3)
                .timeout(Duration.ofMinutes(1))
                .build();

        String prompt = "Tell me some Interesting Fact About LLMs in maximum 30 words";

        chatModel.chat(prompt, new StreamingChatResponseHandler() {

            @Override
            public void onPartialResponse(String partialResponse) {
                System.out.println("onPartialResponse: " + partialResponse);
            }

            @Override
            public void onCompleteResponse(ChatResponse completeResponse) {
                System.out.println("onCompleteResponse: " + completeResponse);
            }

            @Override
            public void onError(Throwable error) {
                error.printStackTrace();
            }
        });

        // Sleep for 2 minutes to get the complete response tokens
        TimeUnit.MINUTES.sleep(2);
    }
}
Output
onPartialResponse: LL
onPartialResponse: Ms
onPartialResponse: (
onPartialResponse: Large
onPartialResponse: Language
onPartialResponse: Models
onPartialResponse: )
onPartialResponse: can
onPartialResponse: generate
onPartialResponse: human
onPartialResponse: -like
onPartialResponse: text
onPartialResponse: ,
onPartialResponse: summarize
onPartialResponse: content
onPartialResponse: ,
onPartialResponse: and
onPartialResponse: even
onPartialResponse: create
onPartialResponse: new
onPartialResponse: stories
onPartialResponse: .
onPartialResponse: They
onPartialResponse: learn
onPartialResponse: from
onPartialResponse: vast
onPartialResponse: amounts
onPartialResponse: of
onPartialResponse: data
onPartialResponse: ,
onPartialResponse: improving
onPartialResponse: over
onPartialResponse: time
onPartialResponse: with
onPartialResponse: user
onPartialResponse: feedback
onPartialResponse: .
onCompleteResponse: ChatResponse { aiMessage = AiMessage { text = "LLMs (Large Language Models) can generate human-like text, summarize content, and even create new stories. They learn from vast amounts of data, improving over time with user feedback." toolExecutionRequests = [] }, metadata = ChatResponseMetadata{id='null', modelName='llama3.2', tokenUsage=TokenUsage { inputTokenCount = 39, outputTokenCount = 38, totalTokenCount = 77 }, finishReason=STOP} }
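Note that chat() returns immediately and the tokens arrive on a background thread, which is why the example above sleeps for two minutes. As an alternative, you can complete a CompletableFuture from the handler and block on it until the response is ready. The sketch below shows one possible way to do this inside StreamingDemo's main method (the futureResponse variable and the two-minute timeout are illustrative, not part of the LangChain4j API); it also requires java.util.concurrent.CompletableFuture in the imports.

// A minimal sketch: wait for onCompleteResponse()/onError() instead of a fixed sleep
CompletableFuture<ChatResponse> futureResponse = new CompletableFuture<>();

chatModel.chat(prompt, new StreamingChatResponseHandler() {

    @Override
    public void onPartialResponse(String partialResponse) {
        System.out.print(partialResponse); // print tokens as they arrive
    }

    @Override
    public void onCompleteResponse(ChatResponse completeResponse) {
        futureResponse.complete(completeResponse); // unblocks the waiting main thread
    }

    @Override
    public void onError(Throwable error) {
        futureResponse.completeExceptionally(error);
    }
});

// Block until the stream finishes (at most 2 minutes), then read the assembled text
ChatResponse response = futureResponse.orTimeout(2, TimeUnit.MINUTES).join();
System.out.println();
System.out.println("Full text: " + response.aiMessage().text());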
A more concise way to handle streamed responses is the LambdaStreamingResponseHandler utility class, which offers static methods for creating a StreamingChatResponseHandler from lambda expressions. To stream responses using lambdas, call the static onPartialResponse() method and pass a lambda that defines how to handle each partial response:
import static dev.langchain4j.model.LambdaStreamingResponseHandler.onPartialResponse;

chatModel.chat(prompt, onPartialResponse(System.out::print));
This pattern allows for clean, readable code while enabling real-time response streaming.
StreamingCompactWay.java
package com.sample.app.streaming;

import static dev.langchain4j.model.LambdaStreamingResponseHandler.onPartialResponse;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

import dev.langchain4j.model.ollama.OllamaStreamingChatModel;

public class StreamingCompactWay {

    public static void main(String[] args) throws InterruptedException {
        OllamaStreamingChatModel chatModel = OllamaStreamingChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("llama3.2")
                .temperature(0.3)
                .timeout(Duration.ofMinutes(1))
                .build();

        String prompt = "Tell me some Interesting Fact About LLMs in maximum 30 words";

        chatModel.chat(prompt, onPartialResponse(System.out::println));

        // Sleep for 2 minutes to get the complete response tokens
        TimeUnit.MINUTES.sleep(2);
    }
}
Output
LL Ms ( Large Language Models ) can generate human -like text , but they lack common sense and understanding of the world , often producing nons ens ical or absurd responses to real -world scenarios .
The onPartialResponseAndError() method lets you specify handlers for both the onPartialResponse() and onError() events in a single, streamlined call.
import static dev.langchain4j.model.LambdaStreamingResponseHandler.onPartialResponseAndError;

chatModel.chat(prompt, onPartialResponseAndError(System.out::println, Throwable::printStackTrace));
StreamingCompactWayBothResponseAndError.java
package com.sample.app.streaming;

import static dev.langchain4j.model.LambdaStreamingResponseHandler.onPartialResponseAndError;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

import dev.langchain4j.model.ollama.OllamaStreamingChatModel;

public class StreamingCompactWayBothResponseAndError {

    public static void main(String[] args) throws InterruptedException {
        OllamaStreamingChatModel chatModel = OllamaStreamingChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("llama3.2")
                .temperature(0.3)
                .timeout(Duration.ofMinutes(1))
                .build();

        String prompt = "Tell me some Interesting Fact About LLMs in maximum 30 words";

        chatModel.chat(prompt, onPartialResponseAndError(System.out::println, Throwable::printStackTrace));

        // Sleep for 2 minutes to get the complete response tokens
        TimeUnit.MINUTES.sleep(2);
    }
}
Output
LL Ms ( Large Language Models ) can generate human -like text , but their understanding is limited to patterns learned from vast datasets , making them prone to biases and inaccur acies .
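The lambda-based handlers do not expose onCompleteResponse(), so if you also want the assembled text at the end, one option is to accumulate the partial tokens yourself. A rough sketch, reusing chatModel and prompt from the example above (the StringBuilder and the fixed sleep are illustrative):

// Accumulate streamed tokens while also printing them
StringBuilder assembled = new StringBuilder();

chatModel.chat(prompt, onPartialResponseAndError(
        partialResponse -> {
            System.out.print(partialResponse);
            assembled.append(partialResponse);
        },
        Throwable::printStackTrace));

// As in the examples above, wait for the stream to finish before reading the buffer
TimeUnit.MINUTES.sleep(2);
System.out.println();
System.out.println("Assembled text: " + assembled);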