Dataset Generation

Prompt-Completion Generators

class PromptCompletionGenerator(openai_key: str | None = None, enable_timeouts=False, timeouts_options=None)[source]

Bases: object

A class for generating prompt completions using the OpenAI API.

This class is designed to interact with the OpenAI API to generate responses based on provided prompts. It supports custom timeout settings and handles the API interactions to fetch prompt completions.

Method __init__:

Initialize the PromptCompletionGenerator instance.

Parameters:
  • openai_key (Optional[str], default=None) – An optional string representing the OpenAI API key. If not provided, the key is fetched from the environment variable “OPENAI_API_KEY”.

  • enable_timeouts (bool, default=False) – A flag to enable custom timeout settings for the OpenAI client.

  • timeouts_options (Optional[dict], default=None) – A dictionary specifying the timeout settings. It is only used if enable_timeouts is set to True.

Returns:

An instance of PromptCompletionGenerator.

Return type:

PromptCompletionGenerator

Raises:

ValueError – Raised if the provided or fetched OpenAI API key is invalid.

Example:

>>> generator = PromptCompletionGenerator(openai_key="sk-xxxxx")
>>> print(type(generator))
<class 'PromptCompletionGenerator'>
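
The key handling described above (explicit argument, environment-variable fallback, and the ValueError on an invalid key) can be sketched as follows. The helper name `resolve_openai_key` and the `sk-` prefix check are illustrative assumptions, not the library's actual implementation:

```python
import os

def resolve_openai_key(openai_key=None):
    """Return the explicit key, or fall back to the OPENAI_API_KEY env var."""
    key = openai_key or os.environ.get("OPENAI_API_KEY")
    # The exact validation is an assumption; here we require a non-empty
    # string with the conventional "sk-" prefix.
    if not key or not key.startswith("sk-"):
        raise ValueError("Invalid or missing OpenAI API key.")
    return key
```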
create_dataset(context_filename: str, output_filename: str, temperature: float = 0.1, model: str = 'gpt-3.5-turbo-1106', max_tokens: int = 1054, top_p: float = 1.0, frequency_penalty: float = 0.0, presence_penalty: float = 0.0, retry_limit: int = 3, num_questions: int = 3, question_types: List[str] = None, detailed_explanation: bool = True)[source]

Generates a dataset with various types of questions based on the provided context.

Method create_dataset:

Create a dataset by generating questions of specified types for each context in the provided CSV file.

Parameters:
  • context_filename (str) – Path to the CSV file containing context data.

  • output_filename (str) – Path to save the generated dataset.

  • temperature (float, optional) – Controls randomness in response generation. Defaults to 0.1.

  • model (str, optional) – The OpenAI model to be used for generating responses. Defaults to “gpt-3.5-turbo-1106”.

  • max_tokens (int, optional) – Maximum length of the generated output. Defaults to 1054.

  • top_p (float, optional) – Nucleus sampling parameter, controlling the range of tokens considered for generation. Defaults to 1.0.

  • frequency_penalty (float, optional) – Penalty applied in proportion to how often a token has already appeared, discouraging verbatim repetition. Defaults to 0.0.

  • presence_penalty (float, optional) – Penalty applied to any token that has already appeared at least once, encouraging the model to introduce new topics. Defaults to 0.0.

  • retry_limit (int, optional) – Maximum number of retries for API calls in case of failure. Defaults to 3.

  • num_questions (int, optional) – Number of questions to generate for each context. Defaults to 3.

  • question_types (List[str], optional) – Types of questions to generate (e.g., “open-ended”, “yes-no”). If not specified, a default set of question types is used.

  • detailed_explanation (bool, optional) – Flag to indicate whether to include detailed explanations in the generated content. Defaults to True.

Raises:

Exception – If an error occurs during question generation or file operations.

Example:

>>> generator = PromptCompletionGenerator(openai_key="your-api-key")
>>> generator.create_dataset(
...     context_filename="path/to/context.csv",
...     output_filename="path/to/generated_dataset.csv",
...     temperature=0.7,
...     model="gpt-3.5-turbo",
...     num_questions=5,
...     question_types=["open-ended", "yes-no"],
...     detailed_explanation=False
... )
# This example generates a dataset with the specified question types based on the contexts in 'path/to/context.csv'.
Notes:

  • The method iterates over each row in the context CSV file, generating questions of specified types for each context. The generated questions and answers are saved to the output CSV file.
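
The per-row loop described in the note can be sketched like this. The column name `context` and the shape of the work items are assumptions about the CSV layout, not the library's documented schema:

```python
import csv
import io

def iterate_contexts(context_csv_text, question_types):
    """Yield one (context, question_type) work item per row and type,
    mirroring the loop create_dataset runs over the context CSV."""
    reader = csv.DictReader(io.StringIO(context_csv_text))
    for row in reader:
        for question_type in question_types:
            yield row["context"], question_type  # "context" column is assumed

sample_csv = "context\nPhotosynthesis converts light into chemical energy.\n"
items = list(iterate_contexts(sample_csv, ["open-ended", "yes-no"]))
# Two work items: one per question type for the single context row.
```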

generate_prompt_completions(context_text: str, output_csv: str, temperature: float = 0.1, model: str = 'gpt-3.5-turbo-1106', max_tokens: int = 1054, top_p: float = 1.0, frequency_penalty: float = 0.0, presence_penalty: float = 0.0, retry_limit: int = 3, num_questions: int = 3, question_type: str = 'open-ended', detailed_explanation: bool = True) → List[Dict[str, str]][source]

Generates prompt completions using the OpenAI API and records them to a CSV file.

Method generate_prompt_completions:

Use the OpenAI model to generate responses based on provided context and record the prompt-completion pairs in a CSV file.

Parameters:
  • context_text (str) – The context based on which prompts are generated.

  • output_csv (str) – The file path for saving generated completions in CSV format.

  • temperature (float, optional) – The level of randomness in the output. Higher values lead to more varied outputs. Defaults to 0.1.

  • model (str, optional) – The OpenAI model used for generation. Defaults to “gpt-3.5-turbo-1106”.

  • max_tokens (int, optional) – The maximum length of the generated output. Defaults to 1054.

  • top_p (float, optional) – The proportion of most likely tokens considered for sampling. Defaults to 1.0.

  • frequency_penalty (float, optional) – Penalty applied in proportion to how often a token has already appeared, discouraging verbatim repetition. Defaults to 0.0.

  • presence_penalty (float, optional) – Penalty applied to any token that has already appeared at least once, encouraging the model to introduce new topics. Defaults to 0.0.

  • retry_limit (int, optional) – The maximum number of retries for API call failures. Defaults to 3.

  • num_questions (int, optional) – The number of questions to generate. Defaults to 3.

  • question_type (str, optional) – The type of questions to generate, such as “open-ended”, “yes/no”, etc. Defaults to “open-ended”.

  • detailed_explanation (bool, optional) – Flag indicating whether to include instructions for detailed explanations and arguments. Defaults to True.

Returns:

A list of dictionaries, each containing ‘prompt’ and ‘completion’ keys.

Return type:

List[Dict[str, str]]

Raises:
  • ValueError – If the OpenAI API key is invalid or not provided.

  • Exception – If the function fails after the specified number of retries.

Example:

>>> generator = PromptCompletionGenerator(openai_key="sk-xxxxx")
>>> generator.generate_prompt_completions("Discuss the impact of AI in healthcare.", "ai_healthcare.csv")
# Generates prompt completions based on the context about AI in healthcare and records them in 'ai_healthcare.csv'.
Notes:

  • Proper API key authorization is essential for successful API requests. Ensure the OpenAI key is valid and has the necessary permissions.
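
The retry behavior governed by `retry_limit` can be sketched as a generic wrapper. The helper name `call_with_retries` and the absence of backoff between attempts are illustrative assumptions; `flaky_api_call` stands in for the real OpenAI request:

```python
def call_with_retries(fn, retry_limit=3):
    """Call fn, retrying up to retry_limit times on any exception,
    then raise once the attempts are exhausted."""
    last_error = None
    for attempt in range(retry_limit):
        try:
            return fn()
        except Exception as error:
            last_error = error
    raise Exception(f"Failed after {retry_limit} attempts") from last_error

calls = {"count": 0}

def flaky_api_call():
    """Simulated API call that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient API error")
    return "completion text"
```

With `retry_limit=3`, `call_with_retries(flaky_api_call)` succeeds on the third attempt instead of surfacing the transient errors.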

generate_system_prompt(num_questions: int, question_type: str, detailed_explanation: bool = True)[source]

Generates a system prompt including instructions for the number of questions and their type.

This method creates a system prompt to guide the generation of questions of a specific type and quantity. It is useful for creating structured and targeted questions for AI model training or testing.

Parameters:
  • num_questions (int) – The number of questions to be included in the prompt.

  • question_type (str) – The type of questions, such as ‘open-ended’, ‘yes/no’, etc.

  • detailed_explanation (bool, optional) – Flag indicating whether to include instructions for detailed explanations and arguments. Defaults to True.

Returns:

A string containing the generated system prompt.

Return type:

str

Example:

>>> generator = PromptCompletionGenerator(openai_key="sk-xxxxx")
>>> system_prompt = generator.generate_system_prompt(5, 'open-ended', True)
>>> print(system_prompt)
# Outputs a generated system prompt with guidelines for 5 open-ended questions.
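
A minimal sketch of what this method likely assembles is shown below. Only the parameters come from the signature above; the prompt wording itself is an assumption:

```python
def build_system_prompt(num_questions, question_type, detailed_explanation=True):
    """Assemble a system prompt from the question count and type.
    The phrasing here is illustrative; the library's wording may differ."""
    prompt = (
        f"Generate {num_questions} {question_type} questions based on the "
        "provided context, and answer each question."
    )
    if detailed_explanation:
        prompt += " Include a detailed explanation and supporting arguments in each answer."
    return prompt
```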
parse_and_save_json(json_string, output_file, context=None)[source]

Parses a JSON-formatted string and persists it to a file with optional context.

This function parses a given JSON-formatted string into structured data and saves it into a JSON file. It allows the inclusion of additional contextual data, enriching the content when provided.

Parameters:
  • json_string (str) – A string formatted in JSON, representing structured data to be parsed.

  • output_file (str) – The destination file path where the parsed JSON data will be saved.

  • context (str, optional) – Optional context to be added to the parsed data, augmenting the information.

Note:

The context should align with the structure of the JSON string for consistency in the output file.

Example:

>>> json_str = '[{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]'
>>> output_path = 'output.json'
>>> context_info = 'Additional context information.'
>>> generator.parse_and_save_json(json_str, output_path, context_info)
# Parses the JSON string and saves it to 'output.json', enriched with the context if provided.
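
The behavior can be sketched as follows. Attaching the context as a `context` field on each record is an assumption about the enrichment format, and `parse_and_save_json_sketch` is a hypothetical stand-in for the real method:

```python
import json
import os
import tempfile

def parse_and_save_json_sketch(json_string, output_file, context=None):
    """Parse a JSON array string, optionally attach context to each record,
    and write the result to output_file."""
    records = json.loads(json_string)
    if context is not None:
        for record in records:
            record["context"] = context  # enrichment shape is assumed
    with open(output_file, "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2)
    return records

output_path = os.path.join(tempfile.mkdtemp(), "output.json")
saved = parse_and_save_json_sketch(
    '[{"name": "Alice", "age": 25}]', output_path, "Additional context."
)
```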