Transformer-based AI models can generate amazing answers to users’ questions. Even when the underlying Large Language Models are not retrained, the performance of Question Answering AI can be improved by running experiments with different hyperparameters.
This post describes how to optimize generated AI answers. When building systems that generate answers to users’ questions, you do not just pass the questions to generic Large Language Models. Instead, the models need to understand the context of the questions. Before the models are asked to generate answers, pre-processing is needed to describe the context efficiently. For example, full text searches and re-ranking can be used to identify the most relevant passages of documents, which are then passed to the answer-generating models.
Re-ranking uses neural information retrieval techniques to find answers to queries without relying on exact word matches. Instead, queries and passages are mapped to learned representations, and the passages whose representations are closest to the query are selected. As a result, two large language models are often used:
- Re-ranker which is typically an encoder-based transformer
- Answer generator which is typically a decoder-based transformer
Often three stages are run in pipelines to produce answers:
- Full text searches
- Re-ranking
- Answer generation
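The stages of such a pipeline can be sketched in a few lines of Python. This is only an illustration and not the code from the question-answering repo: the function names are hypothetical, a shared-word count stands in for a real full text search engine, and the re-ranker and answer generator are passed in as plain callables instead of real models.

```python
from typing import Callable, List


def full_text_search(query: str, corpus: List[str], top_k: int) -> List[str]:
    """Stand-in for a search engine: rank passages by number of shared words."""
    terms = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(terms & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def rerank(
    query: str,
    passages: List[str],
    score: Callable[[str, str], float],
    top_k: int,
) -> List[str]:
    """Re-order candidates with a relevance score (an encoder model in practice)."""
    ranked = sorted(passages, key=lambda passage: score(query, passage), reverse=True)
    return ranked[:top_k]


def answer_question(
    query: str,
    corpus: List[str],
    score: Callable[[str, str], float],
    generate: Callable[[str, List[str]], str],
    search_k: int = 10,
    rerank_k: int = 3,
) -> str:
    """Run the stages in order: search, re-rank, generate."""
    candidates = full_text_search(query, corpus, search_k)
    best_passages = rerank(query, candidates, score, rerank_k)
    return generate(query, best_passages)
```

The two cut-offs, `search_k` and `rerank_k`, are examples of the parameters that an experiment would vary.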
In ‘Question Answering’ tasks it is important that answers are correct and free of hallucinations, which is why the ‘temperature’ parameter is set to ‘0’. Other parameters should be optimized for the specific use case and data corpus, for example the choice of large language models and the size of the documents passed between the stages.
Read my previous posts for more context.
To measure the performance of different models and parameters, ground-truth-based approaches can be used. Experts for the specific domain and data provide at least 100 questions with expected answers, called ‘gold answers’.
Since the answer generation pipelines contain multiple steps, not only the final answers should be measured but also the earlier steps, such as re-ranking.
The following sample (or the table at the top of this post) shows all the information that needs to be defined in the ground truth file. In addition to the questions and gold answers, it defines the key passages that contain the information necessary to generate the answers. Note that good answers often require information from more than one passage.
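
| ID | Question | Passage 1 | Passage 1 ID | Passage 2 | Passage 2 ID | Passage 3 | Passage 3 ID | Gold answer |
|---|---|---|---|---|---|---|---|---|
| 1 | How can the printer be fixed? | Try to restart it | 1 | Connect it to a network | 2 | Ask an IT professional | 3 | Restart it, connect it to a network or ask an IT professional |
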
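In code, one row of such a ground truth file could be represented as follows. This is a sketch under assumptions: the class and field names are hypothetical and do not reflect the actual file format used in the question-answering repo.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class KeyPassage:
    passage_id: str  # identifier of the passage in the data corpus
    text: str        # passage content needed to answer the question


@dataclass(frozen=True)
class GroundTruthEntry:
    question_id: str
    question: str
    key_passages: List[KeyPassage]  # often more than one passage is required
    gold_answer: str


# The sample row from the post, expressed with the hypothetical classes above.
entry = GroundTruthEntry(
    question_id="1",
    question="How can the printer be fixed?",
    key_passages=[
        KeyPassage("1", "Try to restart it"),
        KeyPassage("2", "Connect it to a network"),
        KeyPassage("3", "Ask an IT professional"),
    ],
    gold_answer="Restart it, connect it to a network or ask an IT professional",
)
```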
To evaluate different combinations of parameters, the quality of the final answers as well as the found passages need to be checked.
BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another.
The same mechanism can be applied to compare gold answers with actual answers in the same language. In the simplest case, n-grams (sequences of n adjacent words) in the actual answer are compared with those in the gold answer.
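To make the n-gram comparison concrete, here is a simplified sketch. It is not the full BLEU algorithm (there is no brevity penalty and no averaging over several n-gram sizes); it only computes the clipped n-gram precision that BLEU builds on, plus the corresponding recall. The function names and the crude regex tokenization are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Crude tokenizer: lowercase word characters only, punctuation dropped."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every run of n adjacent tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(actual: str, gold: str, n: int) -> Tuple[float, float]:
    """Return (precision, recall) of the actual answer's n-grams against the gold answer."""
    actual_counts = ngrams(tokenize(actual), n)
    gold_counts = ngrams(tokenize(gold), n)
    # Counter intersection keeps the minimum count per n-gram ("clipping"),
    # so repeating a matching word does not inflate the score.
    overlap = sum((actual_counts & gold_counts).values())
    precision = overlap / max(sum(actual_counts.values()), 1)
    recall = overlap / max(sum(gold_counts.values()), 1)
    return precision, recall


precision, recall = ngram_overlap(
    "Try to restart it and connect it.",
    "Restart it, connect it to a network or ask an IT professional",
    n=2,
)
print(f"bigram precision={precision:.2f} recall={recall:.2f}")
```

For real experiments an established metrics library is the safer choice; the sketch only shows why a short but correct answer can still score far below the maximum.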
To compare experiments, the relative BLEU and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) values can be used. Note: Don’t expect them to reach the maximum (100 or 1, depending on the scale) in this scenario.
While these metrics give some indication of performance, you should also review the answers manually, since the final subjective assessment is important.
In the question-answering repo we automatically generate spreadsheets that can easily be read by humans.

| Question | Answer | Gold answer |
|---|---|---|
| How can the printer be fixed? | Try to restart it and connect it. | Restart it, connect it to a network or ask an IT professional |

For passages, matches and recall can be calculated. The goal is to check how many of the right passages have been found by the first steps of the pipeline (full text search and re-ranking).
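A minimal sketch of this calculation, assuming that passages are compared by their IDs and that recall means the share of expected key passages that were found. The function name is hypothetical.

```python
from typing import Iterable, Tuple


def passage_metrics(expected_ids: Iterable[str], found_ids: Iterable[str]) -> Tuple[int, float]:
    """Return (matches, recall) for one question.

    matches: number of expected key passages among the found passages
    recall:  matches divided by the number of expected key passages
    """
    expected = set(expected_ids)
    matches = len(expected & set(found_ids))
    recall = matches / len(expected) if expected else 0.0
    return matches, recall


# Three key passages are expected; the pipeline found two of them plus one other.
matches, recall = passage_metrics(["1", "2", "3"], ["3", "7", "1"])
print(matches, round(recall, 2))
```

Running it once on the full text search results and once on the re-ranker output shows at which stage the right passages get lost.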
To run experiments, pipelines need to be built that can be configured easily. The question-answering repo contains a Java application which exposes the pipeline parameters as environment variables.
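The idea of configuring a pipeline through environment variables can be sketched as follows. The example is in Python for brevity, whereas the application in the repo is written in Java. All variable names, field names and defaults below are hypothetical and are not the ones the repo uses.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ExperimentConfig:
    llm_name: str
    max_passages: int
    use_reranker: bool
    temperature: float


def load_config(env: Mapping[str, str] = os.environ) -> ExperimentConfig:
    """Read experiment parameters from environment variables (names are made up)."""
    return ExperimentConfig(
        llm_name=env.get("EXPERIMENT_LLM_NAME", "example-model"),
        max_passages=int(env.get("EXPERIMENT_MAX_PASSAGES", "3")),
        use_reranker=env.get("EXPERIMENT_USE_RERANKER", "true").strip().lower()
        in {"1", "true", "yes"},
        # Temperature stays at 0 by default, as discussed for question answering.
        temperature=float(env.get("EXPERIMENT_TEMPERATURE", "0")),
    )


config = load_config()
print(config)
```

Passing the environment in as a mapping keeps the function easy to test, and a frozen dataclass ensures a configuration cannot change halfway through an experiment.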
There are a lot of possible variations:
- Which Large Language Model is best for answer generation? How many passages should be passed in? Which prompt should be used?
- How exactly should full text search queries be formulated?
- Should the full documents of the data corpus be split first into smaller passages?
- Should a re-ranker be used? Which one? With how many passages?
- And much more …
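One way to enumerate such variations systematically is a simple grid over the parameters. The parameter names and values below are placeholders chosen for illustration; the list of combinations grows quickly, so in practice only a subset may be worth running.

```python
from itertools import product
from typing import Any, Dict, List

# Hypothetical search space; replace names and values with real options.
search_space: Dict[str, List[Any]] = {
    "llm": ["model-a", "model-b"],
    "passages_to_llm": [1, 3],
    "use_reranker": [True, False],
}


def experiment_grid(space: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return one configuration dict per combination of parameter values."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[key] for key in keys))]


for experiment_config in experiment_grid(search_space):
    print(experiment_config)
```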
To run the evaluations efficiently, I put metrics into the pipeline app. Every endpoint invocation creates a ‘last run’ markdown file.
Additionally, two CSV files are created that contain all metadata and all 100+ queries of an experiment.
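Writing such files takes only a few lines with Python’s standard `csv` module. The column names below are assumptions; the actual columns of the files produced by the pipeline app are not described in this post.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO


def write_csv(rows: Iterable[Dict[str, str]], fieldnames: List[str], out: TextIO) -> None:
    """Write a header row followed by one line per dict; quoting is handled by csv."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# One file for experiment metadata, one for the individual queries
# (written to in-memory buffers here instead of real files).
metadata_file = io.StringIO()
write_csv([{"llm": "example-model", "passages_to_llm": "3"}], ["llm", "passages_to_llm"], metadata_file)

queries_file = io.StringIO()
write_csv(
    [{
        "question": "How can the printer be fixed?",
        "answer": "Try to restart it and connect it.",
        "gold_answer": "Restart it, connect it to a network or ask an IT professional",
    }],
    ["question", "answer", "gold_answer"],
    queries_file,
)
print(queries_file.getvalue())
```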
Automation is required to run the queries defined in the ground truth file, to produce all assets automatically and to run the comparisons against the ground truth file. We’ve built two containers to execute the experiments:
- Container with the pipeline functionality
- Container that invokes the queries and tracks output data
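The job of the second container can be sketched as a loop over the ground truth. In this hypothetical example, `ask` stands for whatever call reaches the pipeline container (for instance an HTTP request); it is passed in as a function so that the sketch stays self-contained.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_experiment(
    ground_truth: Iterable[Tuple[str, str]],
    ask: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Send every ground truth question to the pipeline and collect the output.

    ground_truth: (question, gold_answer) pairs
    ask:          function that returns the pipeline's answer for a question
    """
    rows = []
    for question, gold_answer in ground_truth:
        rows.append({
            "question": question,
            "answer": ask(question),
            "gold_answer": gold_answer,
        })
    return rows


# A fake pipeline keeps the example runnable without any container.
rows = run_experiment(
    [("How can the printer be fixed?",
      "Restart it, connect it to a network or ask an IT professional")],
    ask=lambda question: "Try to restart it and connect it.",
)
print(rows)
```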
After an experiment, all assets are committed to Git to track progress and compare results.