Domo Arigato, Mr. Roboto – Creating an LLM Chatbot

DALL-E generated image inspired by the Styx Single cover for the hit Mr. Roboto

Today’s post builds on my last two, which showed the steps I took towards building a Chat API (which I call the Roboto API) using an OpenAI GPT-4 model and Retrieval Augmented Generation. If you have not had a chance to read them, please take a look at Part 1 and Part 2.

Once I had the Roboto Chat API responding to a single message, I went about building a React frontend to interact with the API. Having a working UI made testing the API much easier, especially when trying different follow-up prompts and watching the UI respond interactively.

I first generated the main Chat frontend (I called this project the Roboto UI) using Vercel V0, where I was able to put together a complete and working React interface in a fraction of the time it would have taken me by hand. I used the following three prompts in V0:

  • I would like to create a basic chat bot ui
  • can you add preselected prompts
  • I dont want to scroll right for the preselected prompts

That’s it! It then produced code for the following screen:

Figure 1. Vercel V0 generated React Chat Bot

After moving the code into Cursor AI, I made some modifications to the interface and wired up the Roboto Chat API using the following route handler:

import { API_BASE_URL } from "@/app/config/settings"

// Route handler that proxies chat requests from the Roboto UI to the Roboto Chat API
export async function POST(request: Request) {
  const body = await request.json()

  // Forward the request body on to the backend Chat endpoint
  const response = await fetch(`${API_BASE_URL}/api/Chat`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(body)
  })

  // Return the Chat API's response to the UI
  const data = await response.json()
  return Response.json(data)
}


Inside the main React page (page.tsx) of the Roboto UI app, we first configure the messages state variable using React’s useState hook:

const [messages, setMessages] = useState<Array<{ id: string; role: 'user' | 'assistant'; content: string }>>([])

We then populate the messages array with the response received from the Roboto API:

      const data = await response.json()
      
      setMessages(prev => [...prev, {
        id: (Date.now() + 1).toString(),
        role: 'assistant',
        content: data.response
      }])
    } catch (error) {
      console.error('Error:', error)
    } finally {
      setIsLoading(false)
      setInput('')
    }

Once the messages list is populated, the following code (generated by Vercel V0 from my earlier prompts) renders three React components, a CardHeader, a CardContent and a ChatMessage, all inside the main Card component.

Figure 2. Key React components inside main Roboto UI page.tsx

We then iterate through the messages list and pass the content (which at this point contains the Roboto Chat API’s response) to the ChatMessage component through its props. To give the interface a bit more of a ChatGPT feel, I got Cursor to add a TypewriterText component, which is called inside the ChatMessage component.

Here’s what the final version of the Mr Roboto Chatbot looks like:

Figure 3. Roboto UI after code adjustments in Cursor

Let’s take a look at how Mr Roboto engages in conversation in the following video:

Video 1. Mr Roboto conversation example

The full code for the Roboto UI can be found here: https://github.com/ShashenMudaly/RobotoChatProject/tree/master/roboto-ui

Final Thoughts

Developing the Chat API and integrating it with a Chat UI provided invaluable insights into the practical applications of language models, demonstrating how they can address challenges in ways beyond traditional logic-driven coding. This project not only improved my understanding of React but also highlighted the value of rapid prototyping with Vercel. There are a few more ideas I’m hoping to explore in the coming weeks, so I’m going to get back to coding those up!

Before I go, here’s the track that keeps with today’s theme of talking Robots:

Styx, one of rock’s most innovative bands, captured the imagination of a generation with the 1983 hit “Mr. Roboto” from their album Kilroy Was Here. Blending theatrical storytelling with futuristic themes, the song quickly became a cultural phenomenon, resonating with fans worldwide and sparking conversations about technology, identity, and human interaction. Its catchy synth riffs and memorable lyrics helped cement its status as an enduring pop culture icon, influencing countless artists and shaping discussions about the intersection of art and technology. Notably, its impact was further solidified when it was featured on an episode of Glee, introducing the classic to a whole new audience – Source ChatGPT

Don’t Stop Retrieving…Journeying into Retrieval Augmented Generation (Part 2)

DALL-E generated Album cover based on Journey’s Escape album

In my last post I discussed the concept of Retrieval Augmented Generation (RAG) and showed my first steps towards creating a Chat API that would serve as the backend to a chatbot UI that I am building. I ended that post by mentioning the limitations of that version of the API in the following areas:

  1. Context Window: The Chat API needed to draw on previous messages and responses in order to continue a conversation with the user
  2. Language Flow: The Chat API needed to direct the conversation based on whether the user mentioned a movie name or described aspects of a movie plot, and to respond appropriately if the user went off-topic during a movie discussion

For the context window problem, I chose an in-memory cache as my method of persisting the user’s conversation history beyond a single prompt. Having worked with Azure Cache for Redis in a recent post, I was now confident in using this method, though in most production Chat APIs we would likely use a database for long term persistence of a conversation (with permission from the user and the necessary privacy and SOC2 compliance considerations taken into account). I used the same cache that I had set up in the previous post, so if you want the details on how to set up an Azure Cache for Redis instance, I recommend taking a look at that post.
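
To make this concrete, here is a minimal sketch of how a conversation history could be persisted in the Redis cache, assuming StackExchange.Redis and a simple JSON-serialized message record (the class, method and key names are illustrative, not the actual Roboto API code):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;
using StackExchange.Redis;

// Illustrative message record cached for each conversation
public record CachedMessage(string Role, string Content);

public class ConversationCache
{
    private readonly IDatabase _db;

    public ConversationCache(IConnectionMultiplexer redis) => _db = redis.GetDatabase();

    // Append a user or assistant message to the conversation's history list
    public async Task AppendAsync(string conversationId, CachedMessage message)
    {
        var key = $"conversation:{conversationId}";
        await _db.ListRightPushAsync(key, JsonSerializer.Serialize(message));
        await _db.KeyExpireAsync(key, TimeSpan.FromHours(1)); // keep history only for the active session
    }

    // Read back just the most recent messages to keep the context window small
    public async Task<List<CachedMessage>> GetHistoryAsync(string conversationId, int lastN = 10)
    {
        var values = await _db.ListRangeAsync($"conversation:{conversationId}", -lastN, -1);
        return values.Select(v => JsonSerializer.Deserialize<CachedMessage>(v.ToString())!).ToList();
    }
}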

Having set up the cache, the next decision was to determine exactly what I planned to store. I thought about the conversation a user would have with the Chat API, something like the following example:

User: “What is the movie where a boy learns karate to defend himself from bullies?”

Chat API response: “The movie where a boy learns karate to defend himself from bullies is The Karate Kid.”

User: “What is the name of the main character in that movie?”

Chat API response: “The main character in The Karate Kid is Daniel LaRusso.”

This seems like a pretty straightforward interaction, right? Well, from the perspective of a stateless Model, things can get a little tricky.

We know from the previous post, which described the interactions between the backend components (the Model API, the Search API and the Chat API), that the model produces the first response in the above interaction after receiving the movie title, movie plot and user’s query as part of its prompt. In order for the model to respond appropriately to the second query, it would need access to the information from the first prompt as well. As the GPT model is stateless, it does not inherently have this capability, so the Chat API client must supply that information as part of the context.
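
One way to hand that history back to the stateless model is to replay the cached turns ahead of the new question. Here is a minimal sketch using the Azure.AI.OpenAI message types (variable names such as cachedHistory and currentQuery are assumptions for illustration, not the actual Roboto API code):

// Replay cached conversation turns so the stateless model can follow the thread
var options = new ChatCompletionsOptions
{
    DeploymentName = _deploymentName,
    Messages =
    {
        new ChatRequestSystemMessage("You are a movie assistant. Context: " + moviePlot)
    }
};

// Prior turns retrieved from the Redis-backed conversation history
foreach (var turn in cachedHistory)
{
    if (turn.Role == "user")
        options.Messages.Add(new ChatRequestUserMessage(turn.Content));
    else
        options.Messages.Add(new ChatRequestAssistantMessage(turn.Content));
}

// Finally, add the user's latest question and call the model
options.Messages.Add(new ChatRequestUserMessage(currentQuery));
var response = await _client.GetChatCompletionsAsync(options);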

It would therefore make sense for the Chat API to save into the Redis Cache the movie plot initially retrieved as part of the context window. In that way, when subsequent follow-up questions are asked by the user, they can be answered easily. It would also make sense to save the Model’s responses. Here’s an example where we can see the reason why:

User: “What was the second training method that Miyagi used to train Daniel?”

Chat API response: “The second training method that Miyagi used to train Daniel was having him sand a walkway that leads around Miyagi’s backyard. This was done with similar instructions on technique: moving his hands and arms in wide circles and deep breathing.”

User: “Were there any more training methods that Daniel used?”

Chat API response: “Daniel practiced defensive techniques, learned physical balance by standing in a rowboat and trying to stay upright in the surf of the ocean, and practiced Miyagi’s crane kick at the beach.”

So, as we can see, the user’s second question builds on the Model’s first response. The Model would need to produce a response that takes into consideration how it previously answered the user.

It made sense at this stage to include the conversation history as well as the context. But then things took a turn…

GPT Models and Large Context windows

For those of you who use ChatGPT or Microsoft Copilot for your day-to-day chats, you would very seldom encounter a token limit unless you really tested it with a large file where the whole content of the file was important. The reason you would not have noticed the underlying model’s token limits is that Chat systems like ChatGPT or Microsoft Copilot are not in themselves the large language models; they are systems that provide us, the users, with an interface into the underlying model. ChatGPT, for example, gives us the ability to send the model information in various formats even though the model itself may not recognize those formats. A GPT model does not inherently understand how to interpret a PDF; instead it relies on the calling system to extract, pre-process and tokenize the underlying text in that PDF document. In the end, the model processes the tokenized version of that document and returns a tokenized result that is then decoded back into human-readable text.

What is a token limit?

A token limit refers to the total combined input and output tokens a model can process in a single context window. For example, with a 32k-token limit, a prompt of around 30,000 tokens leaves only about 2,000 tokens for the model’s response.

Why does the token limit not usually get reached in ChatGPT?

Systems like ChatGPT prevent a user’s prompt from exceeding a model’s token limit by using various techniques to chunk the input so that it fits into a manageable context window for the underlying model to process. These approaches, broadly part of the RAG family of techniques, range from a very basic partitioning of the user’s input into similarly sized chunks to more complex partitioning where chunk boundaries are based on the semantic relationships between the sentences they contain. The overall intent is to have the model process the document one chunk at a time and combine the output of each processed chunk into a single coherent result. To the user, the RAG processing and the model’s limits are transparent.

This does not work in all cases, though. There are cases where processing the entire prompt is required to derive the correct output; in those cases, chunking would cause the model to generate a different output than intended.

Processing Movie Plots with GPT-4o-mini

So, as mentioned in the last post, I chose the cheapest model I had access to in the Azure model catalogue, GPT-4o-mini, for my preselected region (East US). When I started to include the movie plot in the context window, I noticed the model taking a very long time to return its response. Granted, the movie plots I was querying were in the range of 15,000 characters, but performance slowed the more interactions I had with the API, and it degraded considerably when I changed the topic to another movie.

Not all models are made the same

Initially, I assumed the slowdown was solely due to the increasing size of the context window. Although the response time did correlate with more plot data being added, switching to another model, GPT-4-32K, resulted in some performance improvement. Interestingly, GPT-4o-mini has a 128k token limit, whereas GPT-4-32K is limited to 32k tokens. At this stage, I could only presume that Azure’s API on the Standard tier throttles certain models when handling large requests, slowing their responses.

RAG Chunking

Dealing with a large context window was a challenge I would likely face at some point, so I decided to try a strategy to reduce it. I divided each movie plot into chunks of equal size and evaluated each chunk’s relevance to the user’s query. If a chunk was relevant, I added it to a shortlist; if not, I discarded it from the context for that query. To assess a chunk’s relevance, I made a call to the model and prompted it for a brief true/false response. By issuing many small requests to the model rather than one large one, I was ultimately able to reduce the overall context size for some queries. However, queries that required summarizing the entire movie plot still necessitated including the full text to generate a meaningful answer. In the future, I plan to refine this process further and explore techniques such as semantic chunking.
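
Here is a minimal sketch of that chunk-and-filter idea (the helper _chatClient.AskAsync and the method name are illustrative assumptions rather than the actual Roboto API code):

// Split the plot into equal-sized chunks, ask the model for a true/false
// relevance verdict on each chunk, and keep only the relevant ones as context.
public async Task<string> BuildReducedContextAsync(string plot, string userQuery, int chunkSize = 4000)
{
    var relevantChunks = new List<string>();

    for (int i = 0; i < plot.Length; i += chunkSize)
    {
        var chunk = plot.Substring(i, Math.Min(chunkSize, plot.Length - i));

        // One small, cheap model call per chunk instead of a single huge request
        var verdict = await _chatClient.AskAsync(
            $"Does the following movie plot excerpt help answer the question \"{userQuery}\"? " +
            $"Answer only true or false.\n\nExcerpt: {chunk}");

        if (verdict.Trim().StartsWith("true", StringComparison.OrdinalIgnoreCase))
            relevantChunks.Add(chunk);
    }

    // Fall back to the full plot for queries (such as summaries) where nothing was judged relevant
    return relevantChunks.Count > 0 ? string.Join("\n", relevantChunks) : plot;
}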

Figure 1. Chunking the movie plots

Dynamic RAG

Another strategy I applied was to dynamically fetch the movie plot based on the most current movie being discussed. This approach ensured that context was not built up unnecessarily for movies no longer in conversation.

Orchestrating a Conversation

This step took several tries before I evolved the code into a form of conversational orchestration. It turns out that getting a model to seamlessly maintain a conversation, or switch to a new one, is not as easy as chatbots make it look. The main challenge is that every user prompt in my API could fall into one of the following categories:

  • User query: “Tell me about tennis” – Category: Non movie related – Action: Inform the user this is a movie chatbot.
  • User query: “Tell me about the Karate Kid” – Category: Title of the movie is in the query – Action: Look up the plot based on the movie title.
  • User query: “Tell me about a movie where a boy learns karate to defend himself from bullies” – Category: Movie related, based on a semantic question – Action: Look up the plot based on the semantic search.
  • User query: “Who were the Cobra Kai students?” – Category: Movie related, based on the current conversation – Action: Look up the most recent movie title in the context window and dynamically fetch the plot to answer the question.

I therefore created a MovieConversationOrchestrator class that directs a user’s prompt to other specialized classes that perform the following functions:

  1. Checks if queries are movie-related
  2. Finds relevant movies from direct mentions of a movie name or context
  3. Applies different conversation strategies based on query type:
    • Single movie discussions
    • Similar movies comparisons
    • General movie conversations
  4. Maintains conversation history

Think of the MovieConversationOrchestrator as a smart movie expert that:

  • Understands when you’re talking about movies
  • Remembers your previous conversation
  • Provides relevant movie information
  • Keeps the conversation focused on movies

Figure 2. MovieConversationOrchestrator conversation flow

Orchestration was therefore a key component of the Chat API because it determines when to apply RAG and how to optimize the context window in the chat.
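
To make that routing concrete, here is a minimal sketch of the orchestration logic described above, reusing the illustrative helpers from the earlier sketches (all class and method names, such as _movieSearch.FindByTitleAsync, are assumptions rather than the actual Roboto API implementation):

public async Task<string> HandleAsync(string conversationId, string userQuery)
{
    // 1. Off-topic queries are answered without any retrieval
    if (!await _chatClient.IsMovieRelatedAsync(userQuery))
        return "I'm a movie chatbot - ask me about a movie!";

    // 2. Resolve a movie: direct title mention, semantic search,
    //    or the movie most recently discussed in the cached conversation
    var history = await _cache.GetHistoryAsync(conversationId);
    var movie = await _movieSearch.FindByTitleAsync(userQuery)
             ?? await _movieSearch.FindBySemanticSearchAsync(userQuery)
             ?? await _chatClient.InferMovieFromHistoryAsync(history);

    // 3. Build a reduced context from the plot chunks relevant to the query
    var context = movie is null
        ? string.Empty
        : await _plotProcessor.BuildReducedContextAsync(movie.Plot, userQuery);

    // 4. Answer with the context plus the replayed history, then persist the turn
    var answer = await _chatClient.AskWithContextAsync(userQuery, context, history);
    await _cache.AppendAsync(conversationId, new CachedMessage("user", userQuery));
    await _cache.AppendAsync(conversationId, new CachedMessage("assistant", answer));
    return answer;
}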

Let’s take a look at how the Chat API responds to a follow-up prompt: “In the movie, what methods did Miyagi use to train Daniel?”

Figure 3. Chat Response based on a follow up question

Looking at the Chat API logs we can see some of the underlying steps of the API:

Figure 4. Chat API logs showing the existing conversation being used to identify the movie name

The arrows in the above log highlight the following steps:

  1. Red Arrow: The user’s query
  2. Green Arrow: The Orchestrator searches the 10 prior messages of the current conversation
  3. Yellow Arrow: The system uses the AI model to confirm whether the query is part of the existing conversation
  4. Purple Arrow: The system detects that the conversation is about The Karate Kid

As we continue the logs, we see the next steps in the orchestration process:

Figure 5. Chat API logs showing the movie plot being fetched and broken into 4 chunks

  1. Red Arrow: The system calls the search API with the movie name to retrieve the plot data
  2. Green Arrow: Processing begins on the plot, which is 13,191 characters in length
  3. Yellow Arrow: The system breaks the plot down into 4 chunks

And then in the final stage of the Retrieval and Orchestration process, the chunks are processed and the conversation is saved to the conversation history:

Figure 6. Chat API logs showing chunks being processed and the user’s message being saved to the conversation history

  1. Red Arrow: The system detects 2 of the 4 chunks as relevant
  2. Green Arrow: A context of 7,282 characters is built for the model to process (reduced from the full plot size of 13,191 characters)
  3. Yellow Arrow: Finally, after the model’s response is returned to the user, the system stores the user’s message and the model’s response (without the plot) in the cached history

Here we can take a look at the complete Chat API Architecture:

├── 1. API Layer
│   └── Controllers/
│       ├── ChatController.cs         # Handles chat interactions
│       └── MovieController.cs        # Movie-related endpoints
│
├── 2. Core Services
│   ├── MovieConversationOrchestrator # Main coordinator of all operations
│   │   ├── Query Processing
│   │   ├── Intent Detection
│   │   └── Context Management
│   │
│   ├── ChatClient                    # AI Service Integration
│   │   ├── Movie Detection
│   │   ├── Intent Analysis
│   │   └── Response Generation
│   │
│   ├── MovieSearchService           # Movie Data Operations
│   │   ├── Movie Lookup
│   │   └── Similar Movies Search
│   │
│   └── PlotProcessor               # Plot Text Processing
│       ├── Chunk Management
│       └── Relevance Analysis
│
├── 3. Strategy Pattern
│   ├── ContextStrategyFactory      # Strategy Creation & Management
│   │
│   └── Strategies/
│       ├── SingleMovieStrategy     # Single movie context
│       ├── SimilarMoviesStrategy   # Movie comparisons
│       └── ConversationStrategy    # General discussions
│
├── 4. Models
│   ├── MovieSummary.cs            # Movie data structure
│   ├── ChatMessage.cs             # Message structure
│   └── QueryResult.cs             # API response format
│
├── 5. Interfaces
│   ├── IChatClient.cs
│   ├── IMovieSearchService.cs
│   ├── IContextStrategy.cs
│   └── IMovieConversationOrchestrator.cs
│
├── 6. Utilities
│   ├── MoviePlotChunker          # Text chunking utility
│   └── LoggingExtensions         # Logging helpers
│
└── 7. External Services Integration
    ├── Azure OpenAI              # AI processing
    └── Movie Database            # Movie data source

To take a look at the full codebase please visit: https://github.com/ShashenMudaly/RobotoChatProject/tree/master/RobotoAPI

As we can see, the process of managing a conversation with a Large Language Model is an intricate one, and I take my hat off to the Chat systems out there that have perfected this process. It is also worth noting that there are frameworks like LangChain and Microsoft’s Semantic Kernel (which I’m hoping to delve into further in future) that help with the orchestration flow by eliminating the need to write this code out by hand. Despite some of the struggles I describe in this post, it was a great learning experience to see how to apply a basic RAG strategy to reduce the context window. I can see myself refactoring the strategy at a later stage to apply semantic RAG chunking, which would further reduce the relevant chunks by first grouping sentences according to their semantic connection to each other.

Please join me in my next post, where we integrate this new Chat API with a chatbot interface that borrows its name from another 80s hit song.

And finally, before I say good night, let’s cue up an ’80s favorite and the inspiration for this post’s title:

“Don’t Stop Believin’” is an iconic hit single by American rock band Journey, released in 1981 as the lead single from their album Escape. Written by band members Jonathan Cain, Steve Perry, and Neal Schon, the song was inspired by the themes of hope and perseverance amid everyday struggles. It peaked at number nine on the Billboard Hot 100 and has since become one of rock’s most enduring anthems. Its appeal was further amplified by high-profile appearances in television shows such as The Sopranos and Glee, as well as in numerous films and commercials. In the digital age, “Don’t Stop Believin’” continues to thrive, garnering millions of streams on platforms like Spotify and Apple Music, cementing its status as a timeless cultural phenomenon and one of the best-selling digital tracks ever. – Source: ChatGPT

Journey’s Escape Album cover

Don’t Stop Retrieving…Journeying into Retrieval Augmented Generation (Part 1)

DALL-E generated Album cover based on Journey’s Escape album

In my two-part blog series a few weeks ago, we first explored a semantic search system, where I used Azure Database for PostgreSQL and uploaded a collection of approximately 4,000 movies into a database table. I then leveraged the pgvector extension to vectorize the plot of each movie and built an API to search the database using lexical, semantic, and hybrid search methods. In a follow-up post, I demonstrated an example using Azure Cache for Redis. These posts were really important building blocks I would need for what I planned to build next: my very first AI chatbot.

Initially, I envisioned a more sophisticated and fun chatbot. I wanted that chatbot to be a component of a larger application so I could learn how to really integrate the bot with application data. That may still be a longer-term objective, but there are elements of a chatbot architecture I would still need to learn and understand, at least from a practical point of view. Instead, I decided to reuse what I already had, and since I already had a database of movies with vectorized plots, it made sense to start with a Movie Expert Chatbot: one that could hold conversations about various movie plots from my dataset.

Building a chat interface for an OpenAI model is arguably the easier part of the task, and I will cover the interface design in an upcoming post. What I wanted to focus on first was applying Retrieval-Augmented Generation (RAG) techniques to the OpenAI model and experimenting with integrating custom data into its context.

What is RAG, and Why is it Important?

A Large Language Model (LLM) can often be considered a black box—we don’t necessarily know, nor do we need to know, its internal workings. What truly matters is that it receives human-readable input and generates relevant responses.

While we now have multimodal models that can generate images, audio, and video (e.g., DALL-E converts text into images), for the sake of this explanation, we will focus on single-modality models—those that process and generate only text.

During training, models ingest a large corpus of data to develop an internal neural network representation of the material. This is similar to how a human student progresses through different levels of schooling before graduating from university. However, just like a student’s knowledge is based on the textbooks available at the time of their studies, an AI model’s training data is static and limited to a specific point in time.

To illustrate this, imagine a Computer Science student who has recently graduated. Their knowledge of programming, databases, and operating systems is limited to what was available in their university textbooks. Their learning has been theoretical and often abstract. When this graduate enters the job market, they soon realize that technology has evolved, best practices have changed, and new programming languages have emerged.

At this point, they have two choices:

  1. Go back to university for additional training—this would expand their conceptual understanding, but it is time-consuming, costly, and impractical for most real-world scenarios.
  2. Apply their existing knowledge while supplementing it with additional context relevant to their work—such as consulting books, documentation, or online resources. This allows them to fill in knowledge gaps on demand, making them more effective without requiring formal retraining.

Similarly, LLMs face the same challenge. There are two main ways to enhance their knowledge:

  • Retraining (or fine tuning) the model with additional data, resulting in a new version of the model with expanded understanding. However, like going back to university, this process is expensive, time-consuming, and computationally intensive.
  • Retrieval-Augmented Generation (RAG)—instead of retraining the model, we fetch relevant external data dynamically and append it to the user’s prompt. This allows the model to apply its existing conceptual understanding while incorporating current and context-specific information, enabling more accurate and informed responses.

This approach expands the model’s generative results by dynamically retrieving supplementary knowledge for the model to reference. The primary benefit of RAG isn’t just keeping the model’s knowledge current—it is also valuable for domain-specific applications where industry expertise is required but was not part of the model’s original training data.

Basic Retrieval-Augmented Generation Workflow

A simplified RAG workflow consists of the following steps:

  1. User submits a query (prompt).
  2. The system evaluates the prompt and derives a search query.
  3. The system searches a data store, using either semantic, lexical, or hybrid search approaches.
  4. Relevant data is retrieved and appended as additional context to the prompt.
  5. The AI model processes the enriched prompt and generates a final response.

Of course, in a real-world chatbot implementation, many additional steps and optimizations are necessary. However, for now, we will stick to these five fundamental steps to establish a foundation for further discussion. Here is how that flow would look:

Figure 1. Basic RAG flow
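
In code, those five steps might look roughly like the following minimal sketch (the service names _movieSearch and _chatClient and their methods are placeholders, not the actual implementation):

// A bare-bones RAG round trip: search the movie store, append the retrieved
// plot to the prompt, and let the model generate the final answer.
public async Task<string> AskAsync(string userQuery)
{
    // Steps 1-2: take the user's prompt and derive a search query (here, used as-is)
    var searchQuery = userQuery;

    // Step 3: hybrid search against the movie plot store (the BONO search service)
    var movie = await _movieSearch.HybridSearchAsync(searchQuery);

    // Step 4: append the retrieved plot as additional context
    var prompt = $"Context: {movie?.Plot}\n\nQuestion: {userQuery}";

    // Step 5: the model processes the enriched prompt and generates the response
    return await _chatClient.CompleteAsync(prompt);
}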

With the basic architecture above in mind, I started building out the Chatbot API by following these steps:

  1. Deploy a GPT model using Azure OpenAI (via the Azure AI Foundry portal):
    • Create an Azure OpenAI resource
    • Select and deploy a model from the OpenAI model catalog (I initially chose GPT-4o-mini)
    • Save the API endpoint and key for later access
  2. Set up an API to interact with the Azure OpenAI model’s endpoint:
    • Create an ASP.NET API
    • Add the Azure.AI.OpenAI package to the project and instantiate an OpenAIClient instance referencing the API endpoint and key (a minimal sketch follows this list)
    • Set up an HTTP client to make calls to the BONO movie search service
    • Wire up the result from the BONO search service into the model context
    • Wire up the OpenAIClient to direct the user’s query and context as a prompt to the OpenAI model instance from step 1.
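
Here is a minimal sketch of that wiring in an ASP.NET minimal-hosting Program.cs (the configuration key names are my own assumptions, not the actual project settings):

using Azure;
using Azure.AI.OpenAI;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddControllers();

// Endpoint and key saved when the model was deployed in step 1
var endpoint = builder.Configuration["AzureOpenAI:Endpoint"];
var apiKey = builder.Configuration["AzureOpenAI:Key"];

// OpenAIClient pointed at the Azure OpenAI resource
builder.Services.AddSingleton(new OpenAIClient(new Uri(endpoint!), new AzureKeyCredential(apiKey!)));

// Named HTTP client used to call the BONO movie search service
builder.Services.AddHttpClient("bono", client =>
{
    client.BaseAddress = new Uri(builder.Configuration["Bono:BaseUrl"]!);
});

var app = builder.Build();
app.MapControllers();
app.Run();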

Model selection is region specific, so I went with the cheapest available model in the East US region, GPT-4o-mini. Pricing is based on the number of input tokens the model receives and the number of output tokens it returns, combined.

Building the Prompt

Using movie plots as my RAG context source did come with one downside. As these movies may already be known to the model internally, I had to ensure the model did not use its own training to answer movie-related prompts and instead used only what was returned from the BONO search API. Not doing so could give the false impression that the model’s response was derived from the dynamically retrieved context data when it had in fact been generated from the model’s own training. The model could then return a seemingly valid response even if the context was empty or incorrect. Here is how the Chat Client was set up (note the ChatRequestSystemMessage):

var response = await _client.GetChatCompletionsAsync(
    new ChatCompletionsOptions
    {
        DeploymentName = _deploymentName,
        Messages = {
            new ChatRequestSystemMessage(
                "You are a movie assistant. ONLY use the provided context to answer questions. " +
                "If the context doesn't contain enough information to answer, say so. " +
                "Do not use any external knowledge or training data. " +
                "Context: " + context),
            new ChatRequestUserMessage(query)
        },
        Temperature = 0.7f,
        MaxTokens = 500
    });

In this snippet, we set up our Chat Client to ensure the model uses only the dynamically retrieved context from the BONO search API rather than relying on its pre-existing training data. Here’s what’s happening step by step:

  • Sending the Chat Request:
    We call the asynchronous method GetChatCompletionsAsync on our OpenAI client instance (_client). This sends the chat request to the model.
  • Configuring the Request Options:
    Within the ChatCompletionsOptions object, we specify important parameters:
    • Deployment Name:
      The DeploymentName parameter identifies the specific GPT model deployment (in our case, GPT-4o-mini).
    • Message Setup:
      Two messages are defined:
      • A system message instructs the model that it’s acting as a movie assistant. It emphasizes that the model should ONLY use the provided context to answer questions and, if the context is insufficient, it should indicate so. This prevents the model from drawing on its built-in training data.
      • A user message contains the actual query, ensuring the model processes the question alongside the specific context.
    • Controlling Creativity and Response Length:
      The Temperature is set to 0.7, striking a balance between creativity and consistency, while MaxTokens limits the length of the generated response to 500 tokens.

Below is a screenshot of the Chat API endpoint after submitting the prompt: “What is the movie where a boy learns karate to defend himself from bullies?”

I included the context in the return response so we can confirm the model is using the plot data returned from the BONO Search service.

While this was a good milestone to have reached, the API at this point is far from complete. Here are the limitations:

  1. No Context Window:
    The API did not maintain a context window. If I asked a follow-up prompt to the initial one, the API would perform another hybrid search on the plot data. For example, if my second prompt was “Who played the main character?”, the system would search for a movie in the database with a plot related to that prompt instead of examining the plot of The Karate Kid and returning the correct answer.
  2. Inability to Identify Movie Names:
    The API did not recognize movie names in the user’s prompt. For instance, if the prompt was “Tell me about the plot for the movie The Karate Kid”, rather than looking up the movie by name, extracting its plot, and then returning a summary, the API would attempt a semantic search based on the entire prompt. This process was inefficient, and the results were unpredictable.
  3. Handling Non-Movie Prompts:
    If the user’s prompt was not related to movies, the API would still try to process it by calling the BONO Search API.

In the next post, I’ll explore the strategies I used to overcome these challenges and share the new obstacles I encountered with RAG and the GPT model I chose.