Synthetic Data Generation with InstructLab

InstructLab is an open-source framework for fine-tuning smaller generative AI models with a new alignment technique. Models can be fine-tuned with relatively little real data. This post describes how InstructLab generates data synthetically.

InstructLab provides several pipelines for generating data, which are explained below.

Input Data

In all cases, sample data needs to be provided in YAML format. The file contains a description of the AI task and a list of triples:

Input - Context
Input - Instruction
Answer

Example:

  
version: 3
task_description: >-
  Summarization of phone call transcripts between human agents
  and clients of a telecommunication service provider about
  technical issues.
created_by: nheidloff
seed_examples:
  - context: >-
      Sara: (Dialing customer care) Hello, this is Sara, and I'm having some
      issues with my broadband connection. [cut ... cut]\n\nMike: Hi Sara, I'm
      Mike, a customer care agent [cut ... cut]
    question: >-
      Summarize the transcript of the call. Identify the agent and the
      subscriber. Add any specific issues mentioned by the subscriber. Add any
      corrective actions taken as directed by the agent. Please mention if the
      issue is resolved. Mention any follow-up actions and timelines. List the
      phone number of the subscriber at the end.
    answer: >-
      **Agent:** Mike\n**Subscriber:** Sara\n**Specific Issues:** Sara is
      experiencing slow internet speeds and frequent disconnections, especially
      during the evenings.\n**Corrective Actions:** Mike ran a diagnostic on
      Sara's connection and checked the signal strength in her area. He found
      that there was an issue affecting multiple customers in her area and
      informed Sara that technicians were already working to resolve
      it.\n**Issue Resolution:** The issue is not resolved yet, but the team is
      actively working on it.\n**Follow-up Actions and Timelines:** Sara was
      advised to reach out if she experiences any further issues or if there are
      updates. No specific timeline was given for the
      resolution.\n**Subscriber's Phone Number:** 204-555-1234

InstructLab generates additional data based on these seed examples. Check out the full example in the watsonx-ai-platform-demos repo.
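Before running a pipeline, it can help to check that every seed example carries all three fields. The following is a minimal sketch, assuming the YAML file has already been loaded into a Python dict by a YAML parser. The function name `check_seed_examples`, the shortened sample values, and the deliberately incomplete second example are illustrative and not part of InstructLab.

```python
REQUIRED_FIELDS = ("context", "question", "answer")


def check_seed_examples(doc):
    """Return a list of problems found in a loaded seed file (empty if none)."""
    problems = []
    if not str(doc.get("task_description") or "").strip():
        problems.append("task_description is missing or empty")
    examples = doc.get("seed_examples") or []
    if not examples:
        problems.append("seed_examples is missing or empty")
    for index, example in enumerate(examples):
        for field in REQUIRED_FIELDS:
            value = example.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"seed_examples[{index}]: '{field}' is missing or empty")
    return problems


# Shortened stand-in for the YAML shown above, already loaded into a dict.
seed_doc = {
    "version": 3,
    "task_description": "Summarization of phone call transcripts ...",
    "created_by": "nheidloff",
    "seed_examples": [
        {
            "context": "Sara: (Dialing customer care) Hello, this is Sara ...",
            "question": "Summarize the transcript of the call.",
            "answer": "**Agent:** Mike\n**Subscriber:** Sara",
        },
        # Hypothetical incomplete example: the answer is missing on purpose.
        {"context": "Jen (Customer Care Agent): Hello ...", "question": "Summarize."},
    ],
}

problems = check_seed_examples(seed_doc)
print(problems)
```

An empty list means every triple is complete. Here the second example is reported because it has no answer.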

Simple Pipeline

With the simple pipeline, question/answer pairs can be generated for given contexts.

  
ilab data generate \
--pipeline simple \
--sdg-scale-factor 5 \
--endpoint-url http://localhost:8080/v1 \
--output-dir ./datasets \
--chunk-word-count 1000 \
--num-cpus 10 \
--model meta-llama/llama-3-1-70b-instruct

Below, sample prompts and responses are shown which have been generated by Llama 3.1 70b running on watsonx.ai. A proxy has been used, as described in the previous post OpenAI API Proxy for watsonx.ai.

Input:

You are an AI language model developed by IBM Research. You are a cautious assistant.
You carefully follow instructions. You are helpful and harmless and you follow
ethical guidelines and promote positive behavior.

You are a very knowledgeable AI Assistant that will faithfully assist the user with
their task.
Develop a series of question and answer pairs to perform a task.
Here are the requirements:
1. Try not to repeat the verb for each instruction to maximize diversity.
2. The language used for the instruction also should be diverse. For example, you
should combine questions with imperative instructions.
3. The type of instructions should be similar to provided examples. The generated
instruction and the output should be grounded in the provided document.
4. A GPT language model should be able to complete the instruction. For example,
do not ask the assistant to create any visual or audio output. For another example,
do not ask the assistant to wake you up at 5pm or set a reminder because it cannot
perform any action.
5. The instructions should be in English.
6. The instructions should be 1 to 2 sentences long. Either an imperative sentence
or a question is permitted.
7. The output should be an appropriate response to the input and the instruction.
Long outputs are preferable.

The task is Summarization of phone call transcripts between human agents and clients
of a customer care department of a telecommunication service provider.

Here is some context for the example question:

Jen (Customer Care Agent): Hello, this is Jen from customer care. [cut ... cut]

Here is an example to help you understand the type of questions that are asked for:

Summarize the transcript of the call. Identify the agent and the subscriber. Add any
specific issues mentioned by the subscriber. Add any corrective actions taken as
directed by the agent. Please mention if the issue is resolved. Mention any follow-up
actions and timelines. List the phone number of the subscriber at the end.

**Agent:** Jen
**Subscriber:** Chris [cut ... cut]

Provide a single question and answer pair based on the example.

Output:

Here is a single question and answer pair based on the example:

**Question:** What is the main issue that Chris is facing with his new smartphone,
and how does Jen assist him in resolving it?

**Answer:** Chris is having trouble transferring data from his old iPhone to his new
Samsung phone using the Smart Switch app. Jen assists him by directing him to use
iCloud backup on his old iPhone and then transfer the data to his new Samsung phone
using the iCloud transfer option. This resolves the issue, and Chris is able to
successfully transfer his data.
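The simple pipeline returns free text with `**Question:**` and `**Answer:**` labels, so the pair has to be extracted before it can be used as training data. The sketch below shows one way to do that with a regular expression. It is not InstructLab's own parser, and the function name `parse_qa_pair` is an assumption made for illustration.

```python
import re

QA_PATTERN = re.compile(
    r"\*\*Question:\*\*\s*(?P<question>.+?)\s*\*\*Answer:\*\*\s*(?P<answer>.+)",
    re.DOTALL,
)


def parse_qa_pair(completion):
    """Return (question, answer) from a completion, or None if the labels are absent."""
    match = QA_PATTERN.search(completion)
    if match is None:
        return None
    # Collapse hard line wraps so each field becomes a single line.
    question = " ".join(match.group("question").split())
    answer = " ".join(match.group("answer").split())
    return question, answer


# Shortened copy of the model output shown above.
completion = """Here is a single question and answer pair based on the example:

**Question:** What is the main issue that Chris is facing with his new smartphone,
and how does Jen assist him in resolving it?

**Answer:** Chris is having trouble transferring data from his old iPhone to his new
Samsung phone using the Smart Switch app."""

pair = parse_qa_pair(completion)
print(pair)
```

Collapsing whitespace is a simplification. Answers that rely on line breaks, such as the structured summaries in the seed data, would need their newlines preserved.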

Full Pipeline

The full pipeline consists of five steps:

Contexts Generations
Instructions Generations
Instructions Evaluations
Answers Generations
Answers Evaluations
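The five steps can be read as a generate-then-filter chain: contexts and questions are generated, weak questions are dropped, answers are generated, and weak answers are dropped. The sketch below illustrates that flow with stand-in functions instead of real model calls. How InstructLab wires the steps internally is not covered in this post, so the function `run_full_pipeline`, the step names, and the `min_answer_score` threshold are assumptions.

```python
def run_full_pipeline(seed_context, steps, min_answer_score=2):
    """Chain the five steps and return the samples that pass both evaluations."""
    # Step 1: generate a new context from a seed context.
    context = steps["generate_context"](seed_context)
    samples = []
    # Step 2: generate candidate questions for that context.
    for question in steps["generate_questions"](context):
        # Step 3: binary 0/1 judgment on the question.
        if steps["evaluate_question"](context, question) != 1:
            continue
        # Step 4: generate an answer grounded in the context.
        answer = steps["generate_answer"](context, question)
        # Step 5: judgment on the answer using the 3-point scale.
        score = steps["evaluate_answer"](context, question, answer)
        if score < min_answer_score:
            continue
        samples.append({"context": context, "question": question, "answer": answer})
    return samples


# Stand-in steps so the sketch runs without a model; each would call the LLM in practice.
stub_steps = {
    "generate_context": lambda seed: "Raj (Customer Care Agent): Hello, this is Raj ...",
    "generate_questions": lambda context: ["Summarize the call.", "Set a reminder for 5pm."],
    "evaluate_question": lambda context, question: 0 if "reminder" in question else 1,
    "generate_answer": lambda context, question: "**Agent:** Raj ...",
    "evaluate_answer": lambda context, question, answer: 3,
}

samples = run_full_pipeline("Jen (Customer Care Agent): Hello ...", stub_steps)
print(len(samples))
```

Only the first stub question survives, because the second one asks for an action a language model cannot perform and is scored 0.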

1. Contexts Generations

Input:

You are an AI language model developed by IBM Research. You are a cautious
assistant. You carefully follow instructions. You are helpful and harmless and you
follow ethical guidelines and promote positive behavior.

You are a very knowledgeable AI Assistant that will faithfully assist the user with
their task.
You are asked to come up with a diverse context for - Summarization of phone call
transcripts between human agents and clients of a customer care department of a
telecommunication service provider.
Please follow these guiding principles when generating responses:
* Use proper grammar and punctuation.
* Always generate safe and respectful content. Do not generate content that is
harmful, abusive, or offensive.
* Always generate content that is factually accurate and relevant to the prompt.
* Strictly adhere to the prompt and generate responses in the same style and format
as the example.
* Return the context between [Start of Context] and [End of Context] tags.

To better assist you with this task, here is an example of a context:
[Start of Context]
Jen (Customer Care Agent): Hello, this is Jen from customer care. [cut ... cut]
[End of Context]

Now generate a context paragraph, remember to follow the principles mentioned above
and use the same format as the examples. Remember to use the same style and format
as the example above. Start your response with the tag [Start of Context] and end it
with the tag [End of Context].

Output:

[Start of Context]
Raj (Customer Care Agent): Hello, this is Raj from customer care. How can I assist
you today?

Samantha (Customer): Hi Raj, I'm having trouble with my internet connection. It's
been really slow lately [cut ... cut]
[End of Context]

2. Instructions Generations

Input:

You are an AI language model developed by IBM Research. You are a cautious
assistant. You carefully follow instructions. You are helpful and harmless and you
follow ethical guidelines and promote positive behavior.

You are a very knowledgeable AI Assistant that will faithfully assist the user with
their task.
You are asked to come up with a set of 3 diverse questions - Summarization of phone
call transcripts between human agents and clients of a customer care department of a
telecommunication service provider.

Please follow these guiding principles when generating responses:
* Use proper grammar and punctuation.
* Always generate safe and respectful content. Do not generate content that is
harmful, abusive, or offensive.
* Always generate content that is factually accurate and relevant to the prompt.
* The questions should be clear and human-like.
* The questions should be diverse and cover a wide range of topics.
* The questions should not be template-based or generic, it should be very diverse.
* Simply return the questions, do not return any answers or explanations.
* Strictly adhere to the prompt and generate responses in the same style and format
as the example.
* Most importantly all the questions should be grounded in the context provided and
should be answerable solely based on the provided context.
* The question should address the task described in the prompt.
* Return each question between [Start of Question] and [End of Question] tags.

To better assist you with this task, here is an example:

[Start of Context]
Sara: (Dialing customer care) Hello, this is Sara, and I'm having some issues with
my broadband connection. [cut ... cut]
[End of Context]
[Start of Question]
Summarize the transcript of the call. Identify the agent and the subscriber. Add any
specific issues mentioned by the subscriber. Add any corrective actions taken as
directed by the agent. Please mention if the issue is resolved. Mention any
follow-up actions and timelines. List the phone number of the subscriber at the end.
[End of Question]

Now generate 3 such questions, remember to follow the principles mentioned above and
use the same format as the examples. Remember to use the same style and format as
the example above. Do not return any contexts or answers, only the questions. Return
each question between [Start of Question] and [End of Question] tags.

[Start of Context]
Emily: (Dialing customer care) Hi, I'm Emily, and I'm having some issues with my
mobile phone bill. [cut ... cut]
[End of Context]

Output:

[Start of Question]
Summarize the call transcript between Emily and David. Identify the specific billing
issue raised by Emily and the resolution provided by David. Mention any follow-up
actions taken or recommended by David to help Emily monitor her data usage more
accurately in the future. List Emily's phone number at the end.
[End of Question]

[Start of Question]
What was the initial issue [cut ... cut]
[End of Question]

[Start of Question]
Provide a detailed summary [cut ... cut]
[End of Question]
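The prompts ask the model to wrap its output in `[Start of ...]` and `[End of ...]` tags, which makes the completions easy to split into individual items. The following is a minimal sketch of such an extraction. The helper name `extract_tagged` is an assumption and not taken from the InstructLab code base.

```python
import re


def extract_tagged(text, tag):
    """Return the stripped contents of every [Start of <tag>] ... [End of <tag>] span."""
    pattern = re.compile(
        r"\[Start of {0}\](.*?)\[End of {0}\]".format(re.escape(tag)),
        re.DOTALL,
    )
    return [body.strip() for body in pattern.findall(text)]


# Shortened copy of the model output shown above.
completion = """[Start of Question]
Summarize the call transcript between Emily and David.
[End of Question]

[Start of Question]
What was the initial issue [cut ... cut]
[End of Question]

[Start of Question]
Provide a detailed summary [cut ... cut]
[End of Question]"""

questions = extract_tagged(completion, "Question")
print(len(questions))
```

The same helper works for the other tags used in the prompts, for example `extract_tagged(text, "Context")` for the output of the first step.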

3. Instructions Evaluations

Input:

You are an AI language model developed by IBM Research. You are a cautious assistant.
You carefully follow instructions. You are helpful and harmless and you follow
ethical guidelines and promote positive behavior.

You are a very knowledgeable AI Assistant that will faithfully assist the user with
their task.
Please act as an impartial judge and evaluate the questions generated by an AI
assistant displayed below. Evaluate whether or not the question is a good question
of how AI Assistant should respond to the user's instruction. Please assign a score
using a binary 0/1 scale.

Here are the requirements:
* A large language model should be able to complete the question. For example, do
not ask the assistant to create any visual or audio output. For another example, do
not ask the assistant to wake you up at 5pm or set a reminder because it cannot
perform any action.
* The questions should be in English.
* The questions should be 1 to 2 sentences long and should be properly formatted.
* The question should not be offensive, abusive, or harmful. It should be safe and
respectful.
* The question should be relevant to the task given - Summarization of phone call
transcripts between human agents and clients of a customer care department of a
telecommunication service provider.
* Most importantly all the questions should be grounded in the context provided and
should be answerable solely based on the provided context.

If the question meets the above requirements, please rate it 1. If not, please rate
it 0.

For better understanding of the task, here are some examples:
[Start of Context]
In recent years, electric vehicles (EVs) have gained popularity due to their
environmental benefits and advancements in technology. EVs produce zero emissions
while driving, which helps reduce air pollution and combat climate change.
Additionally, with the development of more efficient batteries, the range and
performance of EVs have significantly improved, making them a more viable option for
consumers.
[End of Context]
[Start of Question]
How do electric vehicles help in reducing air pollution?
[End of Question]

[Start of Evaluation]
This question is properly formatted, respectful, and directly relevant to the task
of understanding the benefits of electric vehicles. It is grounded in the provided
context, which mentions that EVs produce zero emissions while driving, helping
reduce air pollution. A large language model can provide an answer to this question
based on the provided context.
[End of Evaluation]

[Start of Score]
1
[End of Score]

Here's the context and question you need to evaluate. Return the evaluation between
[Start of Evaluation] and [End of Evaluation] tags.

[Start of Context]
Alex: (Dialing customer care) Hi, I'm Alex, and I'm having some issues with my
mobile phone bill [cut ... cut]
[End of Context]
[Start of Question]
Provide a detailed summary of the conversation [cut ... cut]
[End of Question]

Begin your evaluation by providing a short explanation. Be as objective as possible.
After providing your explanation, you must rate the question on a scale of 0 or 1 as
mentioned above.
* Return the evaluation between [Start of Evaluation] and [End of Evaluation] tags.
* Return the score using a binary 0/1 scale between [Start of Score] and [End of
Score] tags.

Output:

[Start of Evaluation]
This question is properly formatted, respectful, and directly relevant to the task
of summarizing phone call transcripts between human agents and clients of a customer
care department of a telecommunication service provider. The question is grounded in
the provided context, which is a conversation between Alex and Rachel about a
billing issue. [cut ... cut]
[End of Evaluation]

[Start of Score]
1
[End of Score]

Reasoning Skill: This question requires the ability to analyze a conversation,
identify key elements such as problem-solving approaches, questions asked, and
corrective actions taken, and summarize the outcome of the call. It also requires
the ability to evaluate the relevance and answerability of the question based on the
provided context. The correct answer demonstrates an understanding of the
conversation and the ability to extract relevant information to provide a detailed
summary. [cut ... cut]
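In the output above, the model keeps writing after the score block ("Reasoning Skill: ..."), so a consumer of this output should read only the tagged score and ignore everything else. The sketch below shows one possible way to do that and to keep only questions rated 1. The function name `parse_binary_score` and the second and third candidates are hypothetical.

```python
import re

SCORE_PATTERN = re.compile(r"\[Start of Score\]\s*(\d+)\s*\[End of Score\]")


def parse_binary_score(completion):
    """Return 0 or 1 from the first score block, or None if no valid score is found."""
    match = SCORE_PATTERN.search(completion)
    if match is None:
        return None
    score = int(match.group(1))
    return score if score in (0, 1) else None


# Shortened copy of the model output shown above, including the trailing text.
completion = """[Start of Evaluation]
This question is properly formatted, respectful, and directly relevant to the task.
[End of Evaluation]

[Start of Score]
1
[End of Score]

Reasoning Skill: This question requires the ability to analyze a conversation."""

candidates = [
    ("Provide a detailed summary of the conversation [cut ... cut]", completion),
    # Hypothetical rejected question.
    ("Set a reminder for 5pm.", "[Start of Score]\n0\n[End of Score]"),
    # Hypothetical malformed evaluation without score tags.
    ("A question whose evaluation came back malformed.", "Score: 1"),
]
kept = [question for question, evaluation in candidates
        if parse_binary_score(evaluation) == 1]
print(kept)
```

Treating a missing or malformed score as a rejection is a conservative design choice: it drops some usable questions but avoids passing unchecked ones to the next step.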

4. Answers Generations

Input:

You are an AI language model developed by IBM Research. You are a cautious
assistant. You carefully follow instructions. You are helpful and harmless and you
follow ethical guidelines and promote positive behavior.

You are a very knowledgeable AI Assistant that will faithfully assist the user with
their task.
Your task is to faithfully follow the user's prompt, given context and generate a
response.
Please follow these guiding principles when generating responses:
* Use proper grammar and punctuation.
* Always generate safe and respectful content. Do not generate content that is
harmful, abusive, or offensive.
* Always generate content that is factually accurate and relevant to the prompt.
* Strictly adhere to the prompt and generate responses in the same style and format
as the example.
* Most importantly all the responses should be grounded in the context provided.
* Return the response between [Start of Response] and [End of Response] tags.

To better assist you with this task, here is an example:
[Start of Context]
Sara: (Dialing customer care) Hello, this is Sara, and I'm having some issues with
my broadband connection [cut ... cut]
[End of Context]

[Start of Question]
Summarize the transcript of the call. Identify the agent and the subscriber. Add
any specific issues mentioned by the subscriber. Add any corrective actions taken
as directed by the agent. Please mention if the issue is resolved. Mention any
follow-up actions and timelines. List the phone number of the subscriber at the end.
[End of Question]

[Start of Response]
**Agent:** Mike
**Subscriber:** Sara [cut ... cut]
[End of Response]

Now generate a response to the following prompt. Remember to use the same style and
format as the example above. 
Return the response between [Start of Response] and [End of Response] tags.

[Start of Context]
Emily: (Dialing customer care) Hi, I'm Emily, and I'm having some issues with my
mobile phone bill [cut ... cut]
[End of Context]
[Start of Question]
Evaluate the effectiveness of David's customer service skills in resolving Emily's
issue. How did he demonstrate empathy and understanding of Emily's concerns? What
specific actions did he take to resolve the issue promptly, and how did he ensure
Emily was satisfied with the outcome? What additional steps could David have taken
to further improve Emily's customer experience?
[End of Question]

Return the response between [Start of Response] and [End of Response] tags.

Output:

[Start of Response]
**Effectiveness of David's Customer Service Skills:** David demonstrated excellent
customer service skills in resolving Emily's issue. He was empathetic,
understanding, and prompt in addressing her concerns.

**Demonstration of Empathy and Understanding:** David showed empathy and
understanding by acknowledging Emily's frustration and concern about the
overcharge. He listened attentively to her explanation and apologized for the
inconvenience. [cut ... cut]
[End of Response]

5. Answers Evaluations

Input:

You are an AI language model developed by IBM Research. You are a cautious
assistant. You carefully follow instructions. You are helpful and harmless and you
follow ethical guidelines and promote positive behavior.

You are a very knowledgeable AI Assistant that will faithfully assist the user with
their task.
Please act as an impartial judge and evaluate the quality of the answer provided by
an AI assistant to the questions displayed below. Evaluate whether or not the
answer is a good example of how AI Assistant should respond to the user's
instruction. Please assign a score using the following 3-point scale.

1: It means the answer is incorrect, irrelevant, unsafe or provides incomplete and
garbage information. For instance, the answer may be factually wrong, off-topic, or
filled with irrelevant content that doesn't address the user's question or it could
be incomplete and hanging. It may also include any harmful, unethical, racist,
sexist, explicit, offensive, toxic, dangerous, or illegal content. If the answer is
not supported by the context given.

2: It means the answer provides the correct answer, but it is brief and to the
point without explanations. While it directly answers the user's question, it lacks
additional context or in-depth explanations.

3: It means the answer is a perfect answer from an AI Assistant. It intentionally
addresses the user's question with a comprehensive and detailed explanation. It
demonstrates expert knowledge in the area, is very well written, logical, easy to
follow, engaging, and insightful. And the answer is safe and does not include any
harmful content.

[Start of Context]
In recent years, electric vehicles (EVs) have gained popularity due to their
environmental benefits and advancements in technology. EVs produce zero emissions
while driving, which helps reduce air pollution and combat climate change.
Additionally, with the development of more efficient batteries, the range and
performance of EVs have significantly improved, making them a more viable option
for consumers.
[End of Context]

[Start of Question]
How do electric vehicles help in reducing air pollution?
[End of Question]

[Start of Answer]
Electric vehicles (EVs) help in reducing air pollution because they produce zero
emissions while driving.
[End of Answer]

[Start of Evaluation]
This answer is correct, directly addresses the user's question, and is supported by
the provided context. It is brief and to the point but lacks additional context or
in-depth explanations.
[End of Evaluation]

[Start of Score]
2
[End of Score]

Begin your evaluation by providing a short explanation. Be as objective as
possible. After providing your explanation, you must rate the answer on a scale of
1 to 3 as mentioned above. 
Here's the context, question and the answer you need to evaluate:

[Start of Context]
Ash (Customer Service Agent): Hello, this is Ash from customer service. How can I
assist you today? [cut ... cut]
[End of Context]

[Start of Question]
What kind of plan did the customer have and how long had they been experiencing
problems? What efforts had the customer made to troubleshoot the problem before
contacting customer service, and what was the agent's initial response to the
customer's concerns. What different approaches were used by the agent to resolve
the issue during the call?
[End of Question]

[Start of Answer]
**Customer's Plan:** Premium Plan\n\n\n**Duration of Problem:** The customer had
been experiencing problems with her internet connection for about a week.

**Customer's Initial Troubleshooting Efforts:** The customer had tried restarting
her router, but it didn't seem to make a difference [cut ... cut]
[End of Answer]

* Return the evaluation between [Start of Evaluation] and [End of Evaluation] tags.
* Return the score between [Start of Score] and [End of Score] tags.

Output:

[Start of Evaluation]
The answer thoroughly and accurately addresses each part of the question about the
conversation between Ash and Maria about the customer's internet connection issue.
The answer follows the provided context, explaining the plan the customer had, the
duration of the outage, troubleshooting steps the customer already took, and Ash's
initial response in that regard. 
[End of Evaluation]

[Start of Score]
3
[End of Score]
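Answers are rated on the 3-point scale, so a filter needs a cut-off rather than a simple pass/fail check. The sketch below keeps answers at or above a chosen score. Which threshold InstructLab applies is not stated in this post, so `min_score`, the function names, and the third sample are assumptions made for illustration.

```python
import re

SCORE_PATTERN = re.compile(r"\[Start of Score\]\s*(\d+)\s*\[End of Score\]")


def parse_scale_score(evaluation, low=1, high=3):
    """Return the score if it lies on the low..high scale, otherwise None."""
    match = SCORE_PATTERN.search(evaluation)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None


def keep_good_answers(samples, min_score):
    """Keep samples whose evaluation score is at least min_score."""
    kept = []
    for sample in samples:
        score = parse_scale_score(sample["evaluation"])
        if score is not None and score >= min_score:
            kept.append({**sample, "score": score})
    return kept


samples = [
    {
        "answer": "**Customer's Plan:** Premium Plan [cut ... cut]",
        "evaluation": "[Start of Evaluation]\nThe answer thoroughly and accurately "
                      "addresses each part of the question.\n[End of Evaluation]\n\n"
                      "[Start of Score]\n3\n[End of Score]",
    },
    {
        "answer": "Electric vehicles (EVs) help in reducing air pollution because "
                  "they produce zero emissions while driving.",
        "evaluation": "[Start of Score]\n2\n[End of Score]",
    },
    # Hypothetical answer that the judge rated as incorrect.
    {"answer": "An off-topic reply.", "evaluation": "[Start of Score]\n1\n[End of Score]"},
]

kept = keep_good_answers(samples, min_score=2)
print([sample["score"] for sample in kept])
```

Raising `min_score` to 3 keeps only the detailed answers, which trades dataset size for quality.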

Next Steps

To learn more, check out the watsonx.ai documentation and the watsonx.ai landing page.