Springing into AI - Part 7: Observability

Problem

Tokens, as we know, form the currency of our GenAI applications. Running LLMs locally on our machine is fine for experimentation, but production-grade applications often rely on external providers, and costs start to accrue. Wouldn't it be cool if we had some visibility into how our GenAI application is being used, in terms of tokens consumed and requests made, so that we can take remedial action should we wish?

Solution

Enter Observability, a fundamental utility to have in our arsenal that can act as the eyes of our application. It cannot be stressed enough how powerful Observability can be for enterprise applications. Someone please picture Darth Vader going "If only you knew the power of Observability". In the development community there exists a plethora of tools, some commercial and some open source.

    Prometheus and Grafana are among the most widely adopted tools for monitoring an application's state at a given time. While Prometheus allows us to collect, store and query the metric data exposed by the application, Grafana allows us to query, visualize and alert on that data so that we may take action where applicable. In the context of our application, the figure below presents an overview of how these components integrate into it.



  • Spring Actuator: The Spring framework provides actuator endpoints that expose key metrics about the state of our backend application, e.g. health and info. In our application we also enable the "prometheus" endpoint. This is of vital importance to the whole system, as the exposed endpoint is what Prometheus uses to collect the various metrics. Spring AI contributes a variety of AI-specific metrics here, such as model parameters and token usage. More about the exposed metrics can be found here. The image below shows the metrics gathered from our application after invoking a few prompts.

  • Prometheus: As mentioned above, the metrics exposed on that endpoint are scraped by this tool. For our AI application, we would definitely like to know the number of tokens used (input, output, total) so that we can take action (limiting the context window via a model parameter, for example) to prevent costs beyond what our pocket can afford. The image below is from Prometheus's user interface, where, for illustration, we have searched for a particular metric amongst the many options available.

  • Grafana: Okay, so Spring actuators expose the metrics to the outside world, and Prometheus has scraped and stored them and lets us query them. Wouldn't it be nice if we could visualize these metrics as a dashboard that we can visit to gain insight via different display widgets? Enter "Grafana". The image below shows a sample dashboard from our application, where we can already see some key information. As we use the application more and more, it updates in near real time based on the scraping interval.
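
    To make the pipeline concrete, here is a sketch of the data at each hop: sample token counters as a scrape of /actuator/prometheus might show them, and a PromQL query we could run over them in the Prometheus UI. The metric and tag names follow Spring AI's gen_ai observation conventions but may vary by version, and the values are made up for illustration:

```
# Sample lines from /actuator/prometheus (illustrative values)
gen_ai_client_token_usage_total{gen_ai_operation_name="chat",gen_ai_token_type="input"} 1250.0
gen_ai_client_token_usage_total{gen_ai_operation_name="chat",gen_ai_token_type="output"} 843.0

# PromQL in the Prometheus UI: tokens consumed per second over the
# last 5 minutes, split by token type
sum by (gen_ai_token_type) (rate(gen_ai_client_token_usage_total[5m]))
```

    Grafana then issues queries like the last one against Prometheus and renders the results as dashboard panels.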

        From the above dashboard:
  • Total AI Requests: The number of requests made by the application to the LLM.
  • Average Response Time: The average response time observed from the LLM.
  • Success Rate: Helps us spot errors, as the success rate drops when requests fail.
  • Token Usage: The tokens used per request by the LLM, on a real-time basis.
  • Response Time Distribution: Helps us spot major spikes, if any, for example under load once we offer our application to large-scale users in production.
  • Total Tokens Used: The total number of tokens used by the application overall.
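    For the curious, panels like the ones above are usually backed by simple PromQL over the timer that Spring AI registers per model call. The queries below are a sketch, assuming that timer surfaces in Prometheus as gen_ai_client_operation_seconds_*; check the metric names your Spring AI version actually exposes before copying them:

```
# Total AI Requests: how many model calls have been recorded
sum(gen_ai_client_operation_seconds_count)

# Average Response Time over the last 5 minutes:
# time spent in calls divided by number of calls
sum(rate(gen_ai_client_operation_seconds_sum[5m]))
  / sum(rate(gen_ai_client_operation_seconds_count[5m]))
```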
    Special mention here to Dan Vega, a Spring developer advocate. His tutorials, videos and articles have helped me learn through this journey. The dashboard above is sourced from his work, where he created a typical Prometheus-Grafana dashboard setup for us to use. For a more in-depth tutorial, you can watch his YouTube video. In the dashboard we can see the number of invocations made through our endpoint to prompt the LLM, the response times, the token usage and the response time distribution, amongst others, giving us insight into our application's usage.

Playground

    Alright, now that we understand what these tools are, how they work and how they expose data to us, let's look at how this appears from a code point of view. Most of the code remains the same as before, with a few enhancements added.
  • Source Code: Can be accessed here
  • Dependencies:
    • spring-boot-starter-actuator: Enables the actuator endpoints that can be scraped by Prometheus. Through configuration we enable "prometheus" as one of the endpoints.
    • spring-boot-docker-compose: Our setup makes use of a docker-compose file, namely "compose.yaml", to set up and run Prometheus and Grafana. As part of the container setup, we also load a pre-built Grafana dashboard template and the configured Prometheus datasource it uses for data population. This dependency automatically runs these containers at startup, instead of us manually managing them every time.
  • Configuration: 
    • Management: Using Spring properties we enable certain metrics and set up the time-capture intervals our tools rely on to help realize our vision.
  • Endpoints:
    • Prometheus Spring Actuator: http://localhost:8080/actuator/prometheus
    • Prometheus Tool: http://localhost:9090/
    • Grafana: http://localhost:3000/
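        As a sketch of the Management configuration mentioned above, the relevant application.properties entries typically look like this (these are standard Spring Boot property names, but the exact set used in the repository may differ):

```properties
# Expose the health, info and prometheus actuator endpoints for scraping
management.endpoints.web.exposure.include=health,info,prometheus

# Record percentile histograms for HTTP request timings, feeding the
# response-time distribution panel
management.metrics.distribution.percentiles-histogram.http.server.requests=true
```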
        If we try a few sample prompts and then visit our observability centre, we will notice the metrics being reported, which, as mentioned, give us insight into the behaviour of the application. This forms the basic, stock-standard setup for most of our experiments going forward. Feel free to experiment with various requests and watch them being monitored on the dashboard in real time.
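
        For completeness, the compose.yaml driving spring-boot-docker-compose could look roughly like this; the image tags, ports and provisioning paths here are assumptions, so treat it as a sketch rather than the actual file from the repository:

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      # Scrape config pointing at the app's /actuator/prometheus endpoint
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      # Pre-built dashboard and Prometheus datasource, provisioned at startup
      - ./grafana/provisioning:/etc/grafana/provisioning
```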
