Springing into AI - Part 8: Chat Memory

    Welcome back. Growing up, or at some stage of your life, if you ever watched Batman or any of the numerous series and movies made about that superhero, you would have encountered the famous line "I am Batman", reiterated every time to criminals (Joker excluded). Like those forgetful criminals who needed a reminder of his identity every single time, when we interact with LLMs, our intelligent models also forget our interactions and have no recollection of what we previously asked them. This makes them completely stateless and prevents the end user from having a "conversational" chat. In this part of the series we solve this by learning about "Chat Memory", so let's get into it. You are welcome to skip the theory section and jump straight to the playground.

Chat Memory - Theory

    Spring offers the benefit of the mighty Advisor(s). Through this useful feature we can tap into the user's request before it reaches the LLM and enrich it with some past chat history, thereby giving the LLM more context and the end user a "conversational" experience. The figure below paints a bird's-eye view of what our application will look like.
 


    In the figure above, we introduce a "Chat Memory" segment. This segment can be either an "In-Memory" or a "Persistence" (relational or non-relational) driven solution, depending on the use case.
  • In-Memory: This mode stores the context of a conversation for a particular user session in a ConcurrentHashMap with a signature of type 'ConcurrentHashMap<String, List<Message>>'.
    • The key 'String' holds the conversationId, isolated to each user for a session.
    • The value 'List<Message>' holds the conversation belonging to a user session, including both requests to and responses from the LLM. As mentioned earlier, whenever a new user prompt is sent to the LLM, the conversational history is included as part of it, giving the LLM deeper context to respond accordingly.
    • By default, a maximum of 20 messages is held for a session per conversationId. This is of course configurable based on your requirements.
    • Caveats: For production, it may be worth considering some pitfalls of this approach:
      • Distributed Systems: In a distributed environment, each instance of the application holds its own state. When traffic is routed to a particular instance, its in-memory state may differ from another instance's, leading to an erratic experience.
      • State: Should a particular instance crash, the entire history is lost.
      • Memory: Since both requests and responses are stored in memory, the footprint can grow quite large depending on the volume of interaction between users and your application, and it won't scale for high demand.
  • Persistence: This mode of database storage offers a single source of truth, which leads to far better consistency of data. In a distributed system, all instances of the same application retrieve the same information. Spring AI offers a wide variety of options to choose from for either relational or non-relational databases. For demonstration purposes we will use the Postgres relational database in our playground, as it is quite commonly adopted in enterprise production applications.
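    The table appears here as a screenshot in the original post; as a rough sketch, the schema Spring AI's JDBC repository creates for Postgres looks approximately like this (reconstructed from memory of Spring AI 1.0, so treat names and types as indicative only):

        -- Approximate DDL auto-created by the JDBC chat memory starter for Postgres
        -- (sketch; verify against your Spring AI version)
        CREATE TABLE IF NOT EXISTS SPRING_AI_CHAT_MEMORY (
            conversation_id VARCHAR(36) NOT NULL,
            content         TEXT        NOT NULL,
            type            VARCHAR(10) NOT NULL,
            "timestamp"     TIMESTAMP   NOT NULL
        );

    Its columns: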

    • conversation_id: This holds a unique conversation identifier per user and can be used to identify a user's activity in our application, especially when we support multi-user interaction. If we don't explicitly specify a value, "default" is used.
    • content: As the name suggests, this holds the text of both requests to and responses from the LLM. Even though the content is "text", we should stay observant of the context window length we allow our LLM; this can be influenced by configuring ChatOptions, for example by capping token usage (a small sketch follows this list).
    • type: This holds the type of record and can be one of "USER, ASSISTANT, SYSTEM, TOOL". In the examples that follow we will see some of these in action.
    • timestamp: Merely an audit of when the particular activity took place in the application.
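    On that note, a hedged sketch of capping token usage via ChatOptions (assuming the Spring AI 1.0 API; 'chatClient' is a placeholder for your configured client and 512 is an arbitrary example value):

        import org.springframework.ai.chat.prompt.ChatOptions;

        // Cap the model's output tokens; trimming the replayed history controls the input side.
        ChatOptions options = ChatOptions.builder()
                .maxTokens(512)
                .build();

        String answer = chatClient.prompt()
                .user("Who am I?")
                .options(options)
                .call()
                .content();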

    Alright, there is just one more piece of theory remaining: the internals of the class layout used by Spring. This is useful as it shows our options for putting what we studied above into practice when we integrate chat memory into our application. The figure below represents a summarized layout of the various classes and interfaces (denoted by <I>):




     In the figure above:
  • Advisor: We enrich the request with conversational history using an Advisor. Spring offers two that we can use here, namely MessageChatMemoryAdvisor and PromptChatMemoryAdvisor. The difference is that the latter appends the conversational history into the system prompt text, while the former adds it to the prompt as individual messages.
  • ChatMemory: By default, Spring offers a concrete implementation of this type in MessageWindowChatMemory, which has a default rolling window of 20 messages, configurable to your needs. At any given time it holds at most that number of messages, discarding older ones from its buffer/persistence. There are caveats to consider depending on the application use case, as you may want to retain a larger or smaller amount of history. Keeping more of it feeds more tokens to the LLM, since the entire retained history is replayed on every call; that can mean unnecessary cost (vendor dependent), as tokens are once again the social currency of GenAI applications. Internally, this holds a contract to a ChatMemoryRepository.
  • ChatMemoryRepository: As discussed above, we have two different varieties of it. In the figure above, we can see the various concrete implementations. For relational databases, you can see the different vendors available to us as of this writing. The non-relational ones, such as "Apache Cassandra" and "Neo4j", have their own unique implementations. For each of the supported relational databases, Spring AI sets up the relevant schema out of the box (schema and database interaction), provided we configure the database correctly. A sample schema used for Postgres is shown in the "Persistence" section above, and a wiring sketch follows this list.
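    To make this concrete, here is a minimal wiring sketch, assuming Spring AI 1.0 APIs (builder method names may differ slightly in other versions):

        import org.springframework.ai.chat.client.ChatClient;
        import org.springframework.ai.chat.client.advisor.MessageChatMemoryAdvisor;
        import org.springframework.ai.chat.memory.ChatMemory;
        import org.springframework.ai.chat.memory.InMemoryChatMemoryRepository;
        import org.springframework.ai.chat.memory.MessageWindowChatMemory;
        import org.springframework.context.annotation.Bean;
        import org.springframework.context.annotation.Configuration;

        @Configuration
        class ChatClientConfig {

            @Bean
            ChatClient chatClient(ChatClient.Builder builder) {
                // Rolling window of 20 messages (the default), backed by the in-memory repository.
                ChatMemory chatMemory = MessageWindowChatMemory.builder()
                        .chatMemoryRepository(new InMemoryChatMemoryRepository())
                        .maxMessages(20)
                        .build();

                // The advisor replays the retained history into every outgoing prompt.
                return builder
                        .defaultAdvisors(MessageChatMemoryAdvisor.builder(chatMemory).build())
                        .build();
            }
        }

    Swapping InMemoryChatMemoryRepository for the auto-configured JDBC repository bean is roughly all it takes to move to the "Persistence" mode.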

Chat Memory - Playground

    If you came directly here and bypassed the theoretical section above, you're a legend, 100 points to Gryffindor this time. To demonstrate the before and after of Chat Memory integration, our application now comprises four different endpoints, each with its own version of "ChatClient". This keeps the logic simple and easy to manage for our playground purposes. Let's get some basic admin information out of the way first.
  • Source Code: Found here
  • Added Endpoints:
    • http://localhost:8080/chat/generic
    • http://localhost:8080/chat/in-memory
    • http://localhost:8080/chat/db-memory
    • http://localhost:8080/chat/db-user-memory
  • Added Dependencies: 
    • postgres
    • spring-ai-starter-model-chat-memory-repository-jdbc
  • Added container: Postgres
  • Configuration: a hedged sketch follows below
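    The container and configuration appear as images in the original post; as a hedged sketch, the application.properties for a locally running Postgres container might look like this (the URL, credentials and database name are made-up placeholders):

        # Hypothetical datasource pointing at the local Postgres container
        spring.datasource.url=jdbc:postgresql://localhost:5432/chat_memory
        spring.datasource.username=postgres
        spring.datasource.password=postgres

        # Let Spring AI create the SPRING_AI_CHAT_MEMORY table on startup
        spring.ai.chat.memory.repository.jdbc.initialize-schema=always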

Run 1: http://localhost:8080/chat/generic

This demonstrates the "before" end-user experience, where the LLM has no state about any prior conversation it had with that user.


    From the above, the image on the left is where we try to tell the LLM who we are. On the right, we ask the LLM again who we are, to see if it remembers us. Looking at the response obtained, we can see that in this particular case the LLM doesn't love us one bit; it didn't care who we were. How rude!!
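    For reference, the handler behind this run is just a plain ChatClient call with no memory advisor attached; a minimal sketch (the class and endpoint shape are assumptions, see the linked source for the real thing):

        import org.springframework.ai.chat.client.ChatClient;
        import org.springframework.web.bind.annotation.GetMapping;
        import org.springframework.web.bind.annotation.RequestParam;
        import org.springframework.web.bind.annotation.RestController;

        @RestController
        class GenericChatController {

            private final ChatClient chatClient; // built WITHOUT any memory advisor

            GenericChatController(ChatClient.Builder builder) {
                this.chatClient = builder.build();
            }

            // Stateless: every request is a brand-new conversation for the LLM.
            @GetMapping("/chat/generic")
            String generic(@RequestParam String message) {
                return chatClient.prompt().user(message).call().content();
            }
        }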

Run 2: http://localhost:8080/chat/in-memory

This demonstrates what happens when we put love into the LLM and make it compassionate about us. It's always nice when people remember who you are; has a nice feeling to it, ain't it?


    Such a wonderful LLM, it cares, it actually remembers our previous conversation now. How awesome is that!! A word of disclaimer: this isn't photoshopped to trick you into believing it, but a genuine response. On the right we can see that when we ask again, it can tell us our name, unlike the previous version. As Ed Sheeran sang, "I found a love for meeee". If you run the application and observe Spring AI's interaction with the LLM in the logs (yes, we still have the Logging Advisor), you will notice that it sends the previous conversation history as part of every new prompt to the LLM, and that is exactly how we made the LLM fall in love with us. Note that since this is in-memory, if we restart the application it will forget us 😡. So much for love.

Run 3: http://localhost:8080/chat/db-memory

None of that fake love bullshit we saw above; time to get proper love from it. We upgrade the love the LLM has for us by granting it persistence. For this we had to add the extra dependencies and configuration stated above. The results obtained for this run are the same as for Run 2, so I'm not uploading another image. The key interest to us, however, is how this is stored in the DB. The image below shows the persisted rows:


    In the above exciting result, we can see that a value of "default" is used for conversation_id, as we didn't specify our own. We can see the types "USER" and "ASSISTANT", indicating the request made and the response received; the content is the actual message text. As we can see, it was able to remember us, and even if we restart our application and try again, it will still remember. One drawback of this approach, however: if our application needs multi-user sessions, each session must be identifiable and isolated to a particular user, and a single "default" conversation can't provide that. Time to upgrade the LLM's love for us again...

Run 4: http://localhost:8080/chat/db-user-memory

You will always have use cases where multiple users use your application. Typically we can use some sort of sessionId or userId to identify these users in the backend. A similar setup is done in "ChatClient": in our case we take a "user-id" incoming request header and use that value to vary the conversation_id. Code for such a setup is shown below:

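    The original shows this as a screenshot; here is a minimal sketch of such a handler, assuming Spring AI 1.0's ChatMemory.CONVERSATION_ID advisor parameter (the method lives inside a @RestController whose ChatClient already has a memory advisor configured, and the endpoint shape is a placeholder):

        // Vary the conversation_id per caller using the incoming "user-id" header.
        @GetMapping("/chat/db-user-memory")
        String chatWithUserMemory(@RequestHeader("user-id") String userId,
                                  @RequestParam String message) {
            return chatClient.prompt()
                    .user(message)
                    // Route this request's memory to the caller's own conversation.
                    .advisors(a -> a.param(ChatMemory.CONVERSATION_ID, userId))
                    .call()
                    .content();
        }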

    The advisors(...) parameter above shows that we now vary the conversation_id based on the userId passed via the request header. Now, since Zack Snyder's Justice League 2 never got made and we have to settle for James Gunn's vision, our two demo users will be Superman and Batman. The images below show the outcome of each run for a different user. Pay special attention to the curl where I indicate the "user-id" value. (Please ignore the Cookie JSESSIONID; it has no bearing.)

    User: Batman


    Case: Asking the same question under a different conversation id (as Superman)


    New User: Superman


    I hope the images were self-explanatory. In summary, we first hit our endpoint providing our name and then asked in a subsequent request who we are. We ran it first for user Batman, and then for another user, Superman. To verify correct use of the per-user conversationId, while chatting as Superman we ask whether it falsely tells us we are Batman. Looking at the response, we can see it has no clue who we are, and that is exactly what we expect. From a database perspective, the results look as follows:


    We can see firstly that in "conversation_id" we now have the two different values we passed via the request header, namely "batman" and "superman". Each of these conversation_ids has its own content, isolated per request/response to and from the LLM. This definitely helps us: imagine a full-blown application with several users using our awesome service; we are now in a position to offer it at scale, accommodating each user with their own conversational history. You can further decorate the application to clear conversational history, give it a TTL, and so on (a one-line sketch follows).
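    As a hedged sketch, clearing a user's history is a single call on the ChatMemory contract (scheduling it behind a TTL is left to your imagination):

        // Wipe the stored history for one conversation, e.g. on logout or after a TTL expires.
        chatMemory.clear(userId);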

    Phew, that was a lot to go through, but something quite essential to support our modern enterprise applications, where users multiply faster than Donald Trump increases tariffs on every country in the world. Now that we have a foundation spanning basic chat, observability and conversational history, in the next part of the series we will look at "function calling". Curious what that may be? Stay tuned; as Ned Stark said, "Winter is coming..."

