Saturday, 28 June 2025

Streaming LLM Responses in Real-Time with LangChain4j using StreamingChatResponseHandler

LLMs generate text incrementally, one token at a time. To improve responsiveness, many LLM providers can stream the output token by token instead of sending the full response at once. LangChain4j exposes this capability through the StreamingChatModel and StreamingLanguageModel interfaces. By implementing a StreamingChatResponseHandler, you can react to three key events:

 

·      onPartialResponse(String partialResponse): Triggered as each token is generated, enabling real-time UI updates.

·      onCompleteResponse(ChatResponse completeResponse): Called once the full response is ready.

·      onError(Throwable error): Invoked if an error occurs during streaming.

 

This approach not only delivers a smoother user experience but also gives developers granular control over how responses are handled and displayed.

 

The complete working application is shown below.

 

StreamingDemo.java 

package com.sample.app.streaming;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.model.chat.response.StreamingChatResponseHandler;
import dev.langchain4j.model.ollama.OllamaStreamingChatModel;

public class StreamingDemo {

    public static void main(String[] args) throws InterruptedException {

        OllamaStreamingChatModel chatModel = OllamaStreamingChatModel.builder().baseUrl("http://localhost:11434")
                .modelName("llama3.2").temperature(0.3).timeout(Duration.ofMinutes(1)).build();

        String prompt = "Tell me some Interesting Fact About LLMs in maximum 30 words";

        chatModel.chat(prompt, new StreamingChatResponseHandler() {

            @Override
            public void onPartialResponse(String partialResponse) {
                System.out.println("onPartialResponse: " + partialResponse);
            }

            @Override
            public void onCompleteResponse(ChatResponse completeResponse) {
                System.out.println("onCompleteResponse: " + completeResponse);
            }

            @Override
            public void onError(Throwable error) {
                error.printStackTrace();
            }
        });

        // Keep the main thread alive so the asynchronous streaming callbacks have time to complete
        TimeUnit.MINUTES.sleep(2);

    }

}

Output

onPartialResponse: LL
onPartialResponse: Ms
onPartialResponse:  (
onPartialResponse: Large
onPartialResponse:  Language
onPartialResponse:  Models
onPartialResponse: )
onPartialResponse:  can
onPartialResponse:  generate
onPartialResponse:  human
onPartialResponse: -like
onPartialResponse:  text
onPartialResponse: ,
onPartialResponse:  summarize
onPartialResponse:  content
onPartialResponse: ,
onPartialResponse:  and
onPartialResponse:  even
onPartialResponse:  create
onPartialResponse:  new
onPartialResponse:  stories
onPartialResponse: .
onPartialResponse:  They
onPartialResponse:  learn
onPartialResponse:  from
onPartialResponse:  vast
onPartialResponse:  amounts
onPartialResponse:  of
onPartialResponse:  data
onPartialResponse: ,
onPartialResponse:  improving
onPartialResponse:  over
onPartialResponse:  time
onPartialResponse:  with
onPartialResponse:  user
onPartialResponse:  feedback
onPartialResponse: .
onCompleteResponse: ChatResponse { aiMessage = AiMessage { text = "LLMs (Large Language Models) can generate human-like text, summarize content, and even create new stories. They learn from vast amounts of data, improving over time with user feedback." toolExecutionRequests = [] }, metadata = ChatResponseMetadata{id='null', modelName='llama3.2', tokenUsage=TokenUsage { inputTokenCount = 39, outputTokenCount = 38, totalTokenCount = 77 }, finishReason=STOP} }
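
In the example above, the fixed two-minute sleep only keeps the JVM alive while the asynchronous callbacks run. If you would rather block only until the stream actually finishes, one option is to pair the handler with a java.util.concurrent.CountDownLatch and count it down from onCompleteResponse() and onError(). The sketch below shows the idea; the class name and the latch-based wait are illustrative choices, not part of LangChain4j.

StreamingWithLatchDemo.java

package com.sample.app.streaming;

import java.time.Duration;
import java.util.concurrent.CountDownLatch;

import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.model.chat.response.StreamingChatResponseHandler;
import dev.langchain4j.model.ollama.OllamaStreamingChatModel;

public class StreamingWithLatchDemo {

    public static void main(String[] args) throws InterruptedException {

        OllamaStreamingChatModel chatModel = OllamaStreamingChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("llama3.2")
                .temperature(0.3)
                .timeout(Duration.ofMinutes(1))
                .build();

        // Released once the stream either completes or fails
        CountDownLatch latch = new CountDownLatch(1);

        chatModel.chat("Tell me some Interesting Fact About LLMs in maximum 30 words",
                new StreamingChatResponseHandler() {

                    @Override
                    public void onPartialResponse(String partialResponse) {
                        System.out.print(partialResponse);
                    }

                    @Override
                    public void onCompleteResponse(ChatResponse completeResponse) {
                        System.out.println();
                        latch.countDown();
                    }

                    @Override
                    public void onError(Throwable error) {
                        error.printStackTrace();
                        latch.countDown();
                    }
                });

        // Block only until the stream finishes, instead of sleeping for a fixed time
        latch.await();
    }
}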

A more concise way to handle streamed responses is to use the LambdaStreamingResponseHandler class. This utility offers static methods for creating a StreamingChatResponseHandler from lambda expressions. To stream responses with lambdas, call onPartialResponse() and pass a lambda that defines how each partial response should be handled:

import static dev.langchain4j.model.LambdaStreamingResponseHandler.onPartialResponse;

chatModel.chat(prompt, onPartialResponse(System.out::print));

This pattern allows for clean, readable code while enabling real-time response streaming.

 

StreamingCompactWay.java

 

package com.sample.app.streaming;

import static dev.langchain4j.model.LambdaStreamingResponseHandler.onPartialResponse;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

import dev.langchain4j.model.ollama.OllamaStreamingChatModel;

public class StreamingCompactWay {

    public static void main(String[] args) throws InterruptedException {

        OllamaStreamingChatModel chatModel = OllamaStreamingChatModel.builder().baseUrl("http://localhost:11434")
                .modelName("llama3.2").temperature(0.3).timeout(Duration.ofMinutes(1)).build();

        String prompt = "Tell me some Interesting Fact About LLMs in maximum 30 words";

        chatModel.chat(prompt, onPartialResponse(System.out::println));

        // Keep the main thread alive so the asynchronous streaming callbacks have time to complete
        TimeUnit.MINUTES.sleep(2);

    }

}

Output

LL
Ms
 (
Large
 Language
 Models
)
 can
 generate
 human
-like
 text
,
 but
 they
 lack
 common
 sense
 and
 understanding
 of
 the
 world
,
 often
 producing
 nons
ens
ical
 or
 absurd
 responses
 to
 real
-world
 scenarios
.
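
Note that the lambda shortcut only exposes the partial tokens; there is no onCompleteResponse() callback handing you the assembled text. If you also need the full response, one simple option is to accumulate the tokens yourself inside the lambda, for example into a StringBuilder. A minimal sketch that could replace the chat() call in StreamingCompactWay.java (the variable names are illustrative, and the fixed sleep is kept only for demo purposes, since the callback runs on a background thread):

StringBuilder fullText = new StringBuilder();

chatModel.chat(prompt, onPartialResponse(token -> {
    System.out.print(token); // show each token as soon as it arrives
    fullText.append(token);  // keep building the complete answer
}));

// Demo-only wait; a latch or future is a better signal that the stream is done
TimeUnit.MINUTES.sleep(2);
System.out.println("\nAssembled response: " + fullText);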

The onPartialResponseAndError() method lets you specify handlers for both the onPartialResponse() and onError() events in a single, streamlined call.

import static dev.langchain4j.model.LambdaStreamingResponseHandler.onPartialResponseAndError;
chatModel.chat(prompt, onPartialResponseAndError(System.out::println, Throwable::printStackTrace));

StreamingCompactWayBothResponseAndError.java

package com.sample.app.streaming;

import static dev.langchain4j.model.LambdaStreamingResponseHandler.onPartialResponseAndError;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

import dev.langchain4j.model.ollama.OllamaStreamingChatModel;

public class StreamingCompactWayBothResponseAndError {

    public static void main(String[] args) throws InterruptedException {

        OllamaStreamingChatModel chatModel = OllamaStreamingChatModel.builder().baseUrl("http://localhost:11434")
                .modelName("llama3.2").temperature(0.3).timeout(Duration.ofMinutes(1)).build();

        String prompt = "Tell me some Interesting Fact About LLMs in maximum 30 words";

        chatModel.chat(prompt, onPartialResponseAndError(System.out::println, Throwable::printStackTrace));

        // Keep the main thread alive so the asynchronous streaming callbacks have time to complete
        TimeUnit.MINUTES.sleep(2);

    }

}

 

Output

LL
Ms
 (
Large
 Language
 Models
)
 can
 generate
 human
-like
 text
,
 but
 their
 understanding
 is
 limited
 to
 patterns
 learned
 from
 vast
 datasets
,
 making
 them
 prone
 to
 biases
 and
 inaccur
acies
.
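
To see the error path in action, you can deliberately misconfigure the client, for example by pointing it at a port where nothing is listening (the port below is a hypothetical, intentionally wrong value). The failure should then be delivered to the second lambda passed to onPartialResponseAndError() rather than surfacing as partial tokens; treat this as a sketch rather than a definitive recipe:

// Assumption: nothing is listening on port 11435, so the request fails
OllamaStreamingChatModel broken = OllamaStreamingChatModel.builder()
        .baseUrl("http://localhost:11435")
        .modelName("llama3.2")
        .timeout(Duration.ofSeconds(10))
        .build();

broken.chat(prompt, onPartialResponseAndError(
        System.out::print,
        error -> System.err.println("Streaming failed: " + error.getMessage())));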

 

  
