LLM Evaluation

class LLMEvaluation(model_name='gpt-3.5-turbo', openai_key=None, generator_system_prompt=None, eval_system_prompt=None, enable_timeouts=False, timeouts_options=None)[source]

Bases: object

A class for evaluating language model responses.

This class initializes an evaluation system for language models, particularly designed for assessing responses generated by models like GPT-3.5 Turbo. It can handle generation and evaluation of model responses, with options for customizing system prompts and handling timeouts.

Parameters

model_name : str, optional

The name of the model to be used for generating responses. Default is “gpt-3.5-turbo”.

openai_key : str, optional

The API key for authenticating requests to OpenAI. If not provided, it will attempt to use the key from environment variables.

generator_system_prompt : str, optional

Custom system prompt for generating responses. If not provided, a default prompt is used.

eval_system_prompt : str, optional

Custom system prompt for evaluation. If not provided, a default evaluation prompt is used.

enable_timeouts : bool, optional

Flag to enable or disable timeouts for API requests. Default is False.

timeouts_options : dict, optional

A dictionary specifying timeout options. Relevant only if enable_timeouts is True.

Raises

ValueError

If the OpenAI API key is invalid or not provided.

Attributes

client : OpenAI

The OpenAI client configured for interaction with the model.

Examples

>>> evaluator = LLMEvaluation(model_name="gpt-3.5-turbo", openai_key="your-api-key")
>>> evaluator.generator_system_prompt
"You are an intelligent assistant capable of generating responses based on prompts."

Notes

The openai_key is essential for accessing OpenAI’s API. Ensure the key is valid and has appropriate permissions.
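
A slightly fuller setup might override both system prompts, for example when the generator should answer in a particular persona and the evaluator should use a specific rubric. The prompt strings below are purely illustrative, and the API key is assumed to be available through the environment since openai_key is omitted.

>>> evaluator = LLMEvaluation(
...     model_name="gpt-3.5-turbo",
...     generator_system_prompt="You are a concise teaching assistant.",
...     eval_system_prompt="Score the answer from 0 to 10 and explain your reasoning briefly.",
... )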

evaluate_model(finetuned_model, dataset_csv_path, results_csv_path, temperature=1.0, max_tokens=150, top_p=1.0, frequency_penalty=0, presence_penalty=0, model_start_sequence='', llm_evaluator_model_name='gpt-3.5-turbo', dataset_size=100, finetuned_model_start_sequence_only_for_base_models='', experiment_id=1, save_immediately=False)[source]

Evaluates the performance of a specified model using a dataset and generates statistical insights.

This method assesses a model’s ability to respond to prompts, comparing generated responses with expected completions. It provides a systematic approach to evaluating model performance and saves the results for analysis.

Parameters:
  • finetuned_model (str) – The name of the model to be evaluated.

  • dataset_csv_path (str) – Path to the CSV file containing prompts and expected completions for evaluation.

  • results_csv_path (str) – Path where the evaluation results will be saved.

[Other parameters…]

Returns:

None. The evaluation results are saved to the specified CSV file.

Return type:

None

Raises:
  • FileNotFoundError – If the dataset CSV file is not found.

  • ValueError – If the dataset CSV file is not properly formatted or missing required columns.

  • Exception – For other exceptions that may occur during the evaluation process.

Example:

>>> finetuned_model = 'gpt-3.5-turbo'
>>> dataset_csv_path = 'path/to/dataset.csv'
>>> results_csv_path = 'path/to/results.csv'
>>> evaluator.evaluate_model(finetuned_model, dataset_csv_path, results_csv_path)
# This will evaluate the model using the dataset and save results to the specified path.
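
Since evaluate_model reads prompts and expected completions from a CSV file, a minimal end-to-end sketch might first write that file and then run the evaluation. The prompt/completion column names follow the convention used elsewhere in this class (confirm them against your dataset), the row contents are purely illustrative, and an LLMEvaluation instance named evaluator is assumed, as in the class-level example above.

>>> import pandas as pd
>>> rows = [
...     {"prompt": "What is overfitting?", "completion": "When a model memorizes the training data instead of generalizing."},
...     {"prompt": "Define recall.", "completion": "The fraction of relevant items that are retrieved."},
... ]
>>> pd.DataFrame(rows).to_csv("path/to/dataset.csv", index=False)  # write the evaluation dataset
>>> evaluator.evaluate_model("gpt-3.5-turbo", "path/to/dataset.csv", "path/to/results.csv", dataset_size=2)
# Scores for the two rows are written to 'path/to/results.csv'.
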
get_generated_completion(finetuned_model, prompt, temperature=1.0, max_tokens=256, top_p=1.0, finetuned_model_start_sequence_only_for_base_models='')[source]

Retrieves the generated completion from a specified model based on the given prompt.

This method interacts with the OpenAI API to generate a completion based on the provided prompt and model parameters. It is designed to work with both base and fine-tuned models, offering various customization options for the generation process.

Parameters:
  • finetuned_model (str) – The name of the fine-tuned or base model to use for generating the completion.

  • prompt (str) – The input prompt to which the model generates a completion.

  • temperature (float, optional) – The sampling temperature, controlling the randomness of the output. Defaults to 1.0.

  • max_tokens (int, optional) – The maximum number of tokens to generate in the response. Defaults to 256.

  • top_p (float, optional) – The top-p sampling parameter, controlling the range of token probabilities considered for sampling. Defaults to 1.0.

  • finetuned_model_start_sequence_only_for_base_models (str, optional) – A start sequence used only for base models, if applicable.

Returns:

The generated completion as a string.

Return type:

str

Raises:
  • ValueError – If an unknown model name is specified.

  • Exception – If there is an error during the generation process or with the OpenAI API interaction.

Example:

>>> finetuned_model = 'gpt-3.5-turbo'
>>> prompt = 'Translate the following English text to French: Hello, how are you?'
>>> completion = evaluator.get_generated_completion(finetuned_model, prompt)
>>> print(completion)
# Outputs the model-generated translation of the prompt.
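
For more deterministic output over several prompts, a lower temperature and a smaller max_tokens budget can be passed through. The prompts below are illustrative, and an evaluator instance is assumed, as in the class-level example.

>>> prompts = ["Summarize photosynthesis in one sentence.", "Name three sorting algorithms."]
>>> for p in prompts:
...     print(evaluator.get_generated_completion("gpt-3.5-turbo", p, temperature=0.2, max_tokens=64))
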
read_jsonl(file_path)[source]

Reads a JSONL (JSON Lines) file and returns the data as a list of dictionaries.

This method is designed to read and parse data from a JSONL file, where each line of the file is a separate JSON object. It is particularly useful for processing datasets stored in the JSONL format, commonly used in data processing and machine learning tasks.

Parameters:

file_path (str) – The path to the JSONL file to be read.

Returns:

A list of dictionaries, each representing a JSON object from a line in the JSONL file.

Return type:

List[dict]

Raises:
  • FileNotFoundError – If the specified file does not exist.

  • json.JSONDecodeError – If any line in the file is not a valid JSON object.

Example:

>>> file_path = 'data.jsonl'
>>> data = evaluator.read_jsonl(file_path)
>>> print(data[0])  # Display the first JSON object from the list.
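
As a self-contained illustration of the expected file layout, the sketch below writes a two-record JSONL file with the standard library and reads it back; the record fields are only examples, and an evaluator instance is assumed.

>>> import json
>>> records = [{"prompt": "What is AI?", "completion": "The study of intelligent agents."},
...            {"prompt": "Define ML.", "completion": "Learning patterns from data."}]
>>> with open("data.jsonl", "w") as f:
...     for record in records:
...         f.write(json.dumps(record) + "\n")  # one JSON object per line
>>> evaluator.read_jsonl("data.jsonl")[0]
{'prompt': 'What is AI?', 'completion': 'The study of intelligent agents.'}
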
rephrase_and_classify_prompts_in_dataset(input_csv, output_csv, model_name='gpt-3.5-turbo', classify=False, classes=None)[source]

Processes prompts from an input CSV file, optionally classifies them, and saves the results to an output CSV file.

Parameters:
  • input_csv (str) – The path to the input CSV file containing prompts and their corresponding completions.

  • output_csv (str) – The path to the output CSV file where the processed data will be saved.

  • model_name (str, optional) – The name of the language model to use for rephrasing prompts. Default is ‘gpt-3.5-turbo’.

  • classify (bool, optional) – Flag indicating whether to classify the rephrased prompts. Default is False.

  • classes (list of str, optional) – The list of classification categories to be used if classification is enabled. Default is None.

Returns:

None. The method does not return anything but saves the processed data to the specified output CSV file.

Return type:

None

Raises:
  • FileNotFoundError – If the input CSV file is not found.

  • Exception – For any other exceptions that may occur during the processing or file writing.

Example:

>>> evaluator.rephrase_and_classify_prompts_in_dataset("input_prompts.csv", "processed_prompts.csv", classify=True, classes=["class1", "class2"])
# This will read prompts from 'input_prompts.csv', process and optionally classify them, and save the results to 'processed_prompts.csv'.
Notes:

  • The method expects the input CSV to have columns named ‘prompt’ and ‘completion’.

  • Classification is optional and is performed only if the ‘classify’ parameter is set to True.
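
Because the input CSV must contain 'prompt' and 'completion' columns, a quick way to produce a compliant file is with pandas. The row below is illustrative; omitting classify and classes performs rephrasing only, and an evaluator instance is assumed.

>>> import pandas as pd
>>> pd.DataFrame([
...     {"prompt": "Explain gradient descent.", "completion": "An iterative optimization method that follows the negative gradient."},
... ]).to_csv("input_prompts.csv", index=False)
>>> evaluator.rephrase_and_classify_prompts_in_dataset("input_prompts.csv", "processed_prompts.csv")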

rephrase_and_optionally_classify(prompt, model_name='gpt-4', classify=False, classes=None, prompt_style='student-asking', temperature=1, max_tokens=256, top_p=1, frequency_penalty=0, presence_penalty=0)[source]

Rephrases a given prompt and optionally classifies it using a specified language model.

This method takes a sentence and rephrases it using the specified language model. It can also classify the rephrased sentence into provided categories, if classification is requested.

Parameters:
  • prompt (str) – The original sentence that needs to be rephrased.

  • model_name (str, optional) – The name of the language model to use, defaults to ‘gpt-4’.

  • classify (bool, optional) – Indicates whether to classify the rephrased sentence, defaults to False.

  • classes (list of str, optional) – The list of classification categories, used if classify is True, defaults to None.

  • prompt_style (str, optional) – The style for rephrasing the prompt, used if classify is False, defaults to ‘student-asking’.

  • temperature (float, optional) – Controls the randomness of the output, defaults to 1.

  • max_tokens (int, optional) – The maximum number of tokens to generate, defaults to 256.

  • top_p (float, optional) – Nucleus sampling parameter, defaults to 1.

  • frequency_penalty (float, optional) – Adjusts frequency of token usage, defaults to 0.

  • presence_penalty (float, optional) – Adjusts presence of tokens, defaults to 0.

Returns:

A tuple containing the rephrased prompt and its classification (or None if not classified).

Return type:

tuple

Raises:
  • ValueError – If unable to parse the model response as JSON.

  • Exception – If an error occurs during the API request or processing.

Example:

>>> prompt = "What is AI?"
>>> rephrased, classification = evaluator.rephrase_and_optionally_classify(prompt, classify=True, classes=["ACADEMIC", "RESEARCH"])
>>> print(f"Rephrased: {rephrased}, Classification: {classification}")
# Outputs the rephrased sentence and its classification.
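
When classification is not needed, the method can also be used for rephrasing alone, with prompt_style controlling the tone of the rewrite. The arguments below are illustrative, and an evaluator instance is assumed.

>>> rephrased, _ = evaluator.rephrase_and_optionally_classify(
...     "Explain backpropagation.", model_name="gpt-3.5-turbo", prompt_style="student-asking", temperature=0.7)
>>> print(rephrased)
# Outputs a rephrased version of the prompt; the second tuple element is None because classify is False.
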
save_dict_list_to_csv(data, output_file_path=None, output_folder='csv')[source]

Converts a list of conversation data into a CSV file, categorizing data into columns for system prompts, user prompts, and assistant completions.

Parameters:
  • data (list) – A list of dictionaries, each representing a conversation with messages categorized by roles (‘system’, ‘user’, ‘assistant’) and their respective content.

  • output_file_path (str, optional) – The file path for the output CSV file. Defaults to None, which uses a default filename.

  • output_folder (str, optional) – The directory to save the output CSV file. Defaults to ‘csv’.

Returns:

None. This method does not return anything but saves the processed data to a CSV file.

Return type:

None

Raises:

Exception – If any error occurs during the processing or file writing.

Example:

>>> data = [{'messages': [{'role': 'system', 'content': 'System message'}, {'role': 'user', 'content': 'User question'}, {'role': 'assistant', 'content': 'Assistant answer'}]}]
>>> evaluator.save_dict_list_to_csv(data, output_file_path='output.csv')
# This will process the provided data and save it as 'output.csv' in the specified output folder.
Notes:

  • The input data should be formatted correctly, with each conversation’s messages having designated roles (‘system’, ‘user’, ‘assistant’).
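
One plausible workflow, assuming a fine-tuning file in the chat-messages JSONL format shown above, is to load it with read_jsonl and flatten it to CSV in a single step; the file names are illustrative, and an evaluator instance is assumed.

>>> conversations = evaluator.read_jsonl("training_data.jsonl")
>>> evaluator.save_dict_list_to_csv(conversations, output_file_path="training_data.csv", output_folder="csv")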

save_random_prompts(input_file, output_file=None, output_format='csv', n_samples=100, output_folder='output')[source]

Selects random prompts from a given file and saves them in the specified format.

Parameters:
  • input_file (str) – The path to the input file, which can be in CSV, JSON, or JSONL format.

  • output_file (str, optional) – The base name of the output file without extension. If None, a name with a timestamp and the number of samples will be generated. Defaults to None.

  • output_format (str, optional) – The format for the output file, which can be ‘csv’, ‘json’, or ‘jsonl’. Defaults to ‘csv’.

  • n_samples (int, optional) – The number of random samples to select from the input file. Defaults to 100.

  • output_folder (str, optional) – The folder where the output file should be saved. Defaults to ‘output’.

Returns:

None. This method does not return anything but saves the extracted samples to a file.

Return type:

None

Raises:
  • ValueError – If the input file format is unsupported or if the output format is not one of ‘csv’, ‘json’, or ‘jsonl’.

  • Exception – If any other error occurs during the processing or file writing.

Example:

>>> evaluator.save_random_prompts("data.csv", output_file="sample_prompts", output_format='csv', n_samples=50, output_folder='output')
# This will select 50 random prompts from 'data.csv' and save them as 'sample_prompts_[timestamp]_50.csv' in the 'output' folder.
Notes:

  • Ensure that the input file is in one of the supported formats (CSV, JSON, or JSONL) for correct processing.
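
The same call also works with JSONL input and output, which can be convenient for sampling a held-out slice from a fine-tuning file; the file and folder names below are illustrative, and an evaluator instance is assumed.

>>> evaluator.save_random_prompts("training_data.jsonl", output_format='jsonl', n_samples=20, output_folder='samples')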

score_response(prompt, ground_truth_completion, generated_completion, llm_evaluator_model_name='gpt-3.5-turbo')[source]

Generates a scoring prompt, sends it to a language model, and parses the response to evaluate a given generated completion against a ground truth.

Parameters:
  • prompt (str) – The original prompt used in generating the completion.

  • ground_truth_completion (str) – The correct or expected response to the prompt.

  • generated_completion (str) – The response generated by the evaluated model.

  • llm_evaluator_model_name (str, optional) – Name of the large language model to be used for scoring. Defaults to “gpt-3.5-turbo”.

Returns:

A tuple containing the score (numeric) and the reasoning behind the score (string).

Return type:

tuple

Raises:
  • json.JSONDecodeError – If the response from the model is not in a valid JSON format.

  • Exception – For any other exceptions that may occur during API calls or processing.

Example:

>>> score, reason = evaluator.score_response("What is AI?", "Artificial Intelligence", "AI is a field in computer science.", "gpt-3.5-turbo")
# This evaluates the generated completion for accuracy and relevance, returning a score and reasoning.
Notes:

  • The method constructs an evaluation prompt combining the original prompt, ground truth completion, and the generated completion.

  • It then sends this prompt to the specified language model for scoring and parses the model’s response to extract the score and reasoning.
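
For an ad-hoc check outside of evaluate_model, scores can be collected over a few (prompt, ground truth, generated) triples and averaged. The triples below are illustrative, the averaging assumes the numeric score documented above, and the scoring scale itself is determined by the evaluation prompt; an evaluator instance is assumed.

>>> pairs = [
...     ("What is AI?", "Artificial Intelligence", "AI is a field in computer science."),
...     ("Define ML.", "Machine Learning", "ML means maximum likelihood."),
... ]
>>> scores = [evaluator.score_response(p, gt, gen)[0] for p, gt, gen in pairs]
>>> sum(scores) / len(scores)  # average score across the sampled pairs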