Before we dive into this article, I just need to clarify a few things.
I'm no AI or LLM expert. I'm just a humble Software Engineer who has spent the last 4 months at work on an exciting journey of exploration (and lots of stumbling), learning how to build solutions with OpenAI and LangChain. It's been quite pleasant to combine LLM capabilities with the novel architectural patterns emerging around them, such as vector databases and knowledge graphs. In this and the next articles, my goal is to share what I've been learning and the issues I've run into. The intent here is mainly to start documenting the learning journey.
First steps to a chat
On the journey to learn how I could extract business value from LLMs, one of the most common starting points, and mine as well, was the OpenAI cookbook example on how to build a chatbot agent.
This agent is built so that it has access to tools you define in your code, and the agent decides whether or not to use them based on how you describe each tool to it. These tools can be simple functions, API calls, DB queries, whatever you define them to be.
Being able to add your own custom tools opens up a lot of possibilities, since the agent can come up with interesting ways to solve your problem, and that can simplify a lot for you. For example, when I started working with the agent, I simply provided a tool that made a GraphQL API call, giving the agent data to answer questions about our data. The idea was to aggregate some transactional data according to the questions we received.
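To make the "tools" idea concrete, here is a minimal sketch of the pattern. The tool name, description, and the GraphQL stub are hypothetical stand-ins, not the actual tools from my project; in LangChain, the `Tool` abstraction works along similar lines, where the description text is what the agent reads to decide when to call the tool.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str  # the agent reads this text to decide when to use the tool
    func: Callable[[str], str]

def fetch_transactions(query: str) -> str:
    # In a real setup this would wrap a GraphQL API call; stubbed here.
    return '{"transactions": []}'

# Hypothetical tool registry for illustration.
tools = [
    Tool(
        name="get_transactions",
        description="Fetches transactional data for a customer. "
                    "Use when the user asks about orders, payments, or totals.",
        func=fetch_transactions,
    ),
]

def render_tool_prompt(tools: list[Tool]) -> str:
    """Build the tool section of the system prompt the agent sees."""
    lines = [f"- {t.name}: {t.description}" for t in tools]
    return "You can call the following tools:\n" + "\n".join(lines)
```

The key design point is that the tool description, not the code, is what steers the agent, which is exactly why adding many tools with overlapping descriptions degrades its choices.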
This yielded somewhat reasonable results, and naïvely we started experimenting further by adding more tools, expecting even better results. We saw that wasn't the case, at least not with GPT-3.5: introducing more tools decreased consistency. Previously we had consistency issues and hallucinations because the agent misinterpreted the data; now the agent would sometimes use the wrong tool, or couldn't decide which tool to use at all.
Yet another issue was that sometimes GPT would skip relevant data points. For example, say I had an array of items, each with a type. If I asked GPT to sum all the elements of type X and there were 10 such elements, sometimes it would include only 9. Or worse, it would sum things up the wrong way even after we added mathematical tools to our toolset.
So, despite the costs, we wanted to try GPT-4. The results improved significantly, but not enough to be viable. Even though it did provide better-quality output, the time to resolution spiked from ~3-7 seconds to somewhere between 20 seconds and a minute (sometimes even more). And we still had some issues with the data aggregation, where we realized we were hostages to its reasoning process.
Adjusting the approach
We knew our use case, aggregating data, was a bit too much for GPT-3.5, so we decided to go back to the fundamentals and simplify what we wanted to extract from it.
ChatGPT is a next-token predictor: a statistical model that tries to guess the next word of its output, and as a matter of fact, it does that surprisingly well. So something we had known from the start became a hard pill to swallow: aggregating data wouldn't be possible. However, we still wanted to use GPT to provide a nicer, automated experience.
That's when we decided to take control back from the agent and put it in our own hands. We moved to a solution where we first needed to answer the main question: "What does this user want?", which some people call "intent extraction".
To do so, we took two approaches. The first is the most obvious one: asking GPT. We created a prompt where we gave GPT all the intents we supported and asked it to read the conversation and decide which intent to use. We combined that with a semantic search: we stored a plethora of questions and phrases, the intents associated with them, and their text embeddings. We would then search for the most similar text within a threshold and use its intent.
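The semantic-search half can be sketched as nearest-neighbour lookup over embeddings with a similarity threshold. The phrases, intent names, threshold, and the tiny 3-dimensional vectors below are all illustrative stand-ins; in practice you would use real embeddings (e.g. from OpenAI's embeddings API) stored in a vector database.

```python
import math

# Toy 3-dimensional "embeddings" stand in for real text embeddings.
# Phrases, intents, and vectors are hypothetical examples.
EXAMPLES = [
    ("how much did I sell last month", "sales_summary",    [0.9, 0.1, 0.0]),
    ("show my pending payments",       "pending_payments", [0.1, 0.9, 0.1]),
    ("what is my best-selling item",   "top_products",     [0.2, 0.1, 0.9]),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def match_intent(query_embedding, threshold=0.8):
    """Return the intent of the most similar stored phrase, if above threshold."""
    best = max(EXAMPLES, key=lambda e: cosine(query_embedding, e[2]))
    score = cosine(query_embedding, best[2])
    return best[1] if score >= threshold else None
```

The threshold is the important knob: below it, you fall back to the GPT-based classification rather than guessing an intent from a weak match.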
From that point on, we knew what the user's intent was, and each intent was associated with a process to fulfil the data requirements for GPT to answer the problem. However, we needed input data to aggregate the data correctly from our APIs, and that data was inside the conversation. So, how do we extract it? Again, we used GPT. Using entity identification and extraction, we could define, for a given intent, what was required or optional to extract from the conversation and use that as input to our API calls. To do so, we used some convenient LangChain utilities built on Pydantic.
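The shape of this step is roughly the following. In our real setup the schema was a Pydantic model handed to LangChain's extraction utilities, with GPT filling in the values from the conversation; here is a stdlib-only sketch of the idea, where the intent name and entity fields are hypothetical examples.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical schema for a "sales_summary" intent: GPT would fill these
# fields in from the conversation via an extraction chain.
@dataclass
class SalesSummaryEntities:
    start_date: Optional[str] = None    # required
    end_date: Optional[str] = None      # required
    product_type: Optional[str] = None  # optional

def missing_required(entities: SalesSummaryEntities) -> list[str]:
    """List the required entities still missing, so we can ask the user for them."""
    required = ["start_date", "end_date"]
    return [name for name in required if getattr(entities, name) in (None, "")]
```

Distinguishing required from optional fields is what lets the flow ask a follow-up question instead of calling an API with incomplete input.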
We would then aggregate the data into a textual format and ask GPT to read the conversation, pick up its tone, and rewrite the text in a nice, explanatory way for the user.
After moving past the agent struggles, where GPT-4 was too expensive and slow and GPT-3.5 got too confused, we understood that simplification, combined with smaller operations, yields better results. So, our next steps all followed these principles:
- Smaller tasks yield better results with GPT.
- GPT cannot aggregate data; sometimes it misses data points that are crucial to your response (we already knew that, but we wanted to exploit OpenAI's tools as much as we could).
- GPT is powerful for extracting entities and intent from textual information; use that to your advantage.
- Combining GPT's knowledge with semantic search significantly improves its ability to produce good output.
Semantic Search and Vector Stores basics:
Entity identification and extraction