LLM Evaluation¶
- class LLMEvaluation(model_name='gpt-3.5-turbo', openai_key=None, generator_system_prompt=None, eval_system_prompt=None, enable_timeouts=False, timeouts_options=None)[source]¶
Bases:
object
A class for evaluating language model responses.
This class initializes an evaluation system for language models, particularly designed for assessing responses generated by models like GPT-3.5 Turbo. It can handle generation and evaluation of model responses, with options for customizing system prompts and handling timeouts.
Parameters¶
- model_name : str, optional
The name of the model to be used for generating responses. Default is “gpt-3.5-turbo”.
- openai_key : str, optional
The API key for authenticating requests to OpenAI. If not provided, it will attempt to use the key from environment variables.
- generator_system_prompt : str, optional
Custom system prompt for generating responses. If not provided, a default prompt is used.
- eval_system_prompt : str, optional
Custom system prompt for evaluation. If not provided, a default evaluation prompt is used.
- enable_timeouts : bool, optional
Flag to enable or disable timeouts for API requests. Default is False.
- timeouts_options : dict, optional
A dictionary specifying timeout options. Relevant only if enable_timeouts is True.
Raises¶
- ValueError
If the OpenAI API key is invalid or not provided.
Attributes¶
- client : OpenAI
The OpenAI client configured for interaction with the model.
Examples¶
>>> evaluator = LLMEvaluation(model_name="gpt-3.5-turbo", openai_key="your-api-key")
>>> evaluator.generator_system_prompt
"You are an intelligent assistant capable of generating responses based on prompts."
Notes¶
The openai_key is essential for accessing OpenAI’s API. Ensure the key is valid and has appropriate permissions.
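As a minimal, hedged sketch of constructing the evaluator (the import path is omitted; bring LLMEvaluation in from wherever this package exposes it, and note that OPENAI_API_KEY is simply the usual OpenAI environment-variable name, not necessarily the one this class reads internally):

>>> import os
>>> # Pass the key explicitly, or omit openai_key to let the class fall back
>>> # to the key stored in its environment variables.
>>> evaluator = LLMEvaluation(
...     model_name="gpt-3.5-turbo",
...     openai_key=os.environ.get("OPENAI_API_KEY"),
... )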
- evaluate_model(finetuned_model, dataset_csv_path, results_csv_path, temperature=1.0, max_tokens=150, top_p=1.0, frequency_penalty=0, presence_penalty=0, model_start_sequence='', llm_evaluator_model_name='gpt-3.5-turbo', dataset_size=100, finetuned_model_start_sequence_only_for_base_models='', experiment_id=1, save_immediately=False)[source]¶
Evaluates the performance of a specified model using a dataset and generates statistical insights.
This method assesses a model’s ability to respond to prompts, comparing generated responses with expected completions. It provides a systematic approach to evaluate model performance, saving the results for analysis.
- Parameters:
finetuned_model (str) – The name of the model to be evaluated.
dataset_csv_path (str) – Path to the CSV file containing prompts and expected completions for evaluation.
results_csv_path (str) – Path where the evaluation results will be saved.
[Other parameters…]
- Returns:
None. The evaluation results are saved to the specified CSV file.
- Return type:
None
- Raises:
FileNotFoundError – If the dataset CSV file is not found.
ValueError – If the dataset CSV file is not properly formatted or missing required columns.
Exception – For other exceptions that may occur during the evaluation process.
- Example:
>>> finetuned_model = 'gpt-3.5-turbo'
>>> dataset_csv_path = 'path/to/dataset.csv'
>>> results_csv_path = 'path/to/results.csv'
>>> evaluator.evaluate_model(finetuned_model, dataset_csv_path, results_csv_path)
This evaluates the model using the dataset and saves the results to the specified path.
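A slightly fuller hedged sketch of the same workflow, assuming the dataset CSV uses 'prompt' and 'completion' columns (the required column names are not spelled out for this method; that pairing matches the convention noted elsewhere in this class) and that an evaluator instance already exists:

>>> import csv
>>> rows = [
...     {"prompt": "What is overfitting?", "completion": "When a model memorizes its training data."},
...     {"prompt": "Define recall.", "completion": "True positives divided by all actual positives."},
... ]
>>> # Write a tiny illustrative dataset to disk.
>>> with open("dataset.csv", "w", newline="") as f:
...     writer = csv.DictWriter(f, fieldnames=["prompt", "completion"])
...     writer.writeheader()
...     writer.writerows(rows)
...
>>> evaluator.evaluate_model(
...     finetuned_model="gpt-3.5-turbo",
...     dataset_csv_path="dataset.csv",
...     results_csv_path="results.csv",
...     dataset_size=2,  # only the two rows written above
... )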
- get_generated_completion(finetuned_model, prompt, temperature=1.0, max_tokens=256, top_p=1.0, finetuned_model_start_sequence_only_for_base_models='')[source]¶
Retrieves the generated completion from a specified model based on the given prompt.
This method interacts with the OpenAI API to generate a completion based on the provided prompt and model parameters. It is designed to work with both base and fine-tuned models, offering various customization options for the generation process.
- Parameters:
finetuned_model (str) – The name of the fine-tuned or base model to use for generating the completion.
prompt (str) – The input prompt to which the model generates a completion.
temperature (float, optional) – The sampling temperature, controlling the randomness of the output. Defaults to 1.0.
max_tokens (int, optional) – The maximum number of tokens to generate in the response. Defaults to 256.
top_p (float, optional) – The top-p sampling parameter, controlling the range of token probabilities considered for sampling. Defaults to 1.0.
finetuned_model_start_sequence_only_for_base_models (str, optional) – A start sequence used only for base models, if applicable.
- Returns:
The generated completion as a string.
- Return type:
str
- Raises:
ValueError – If an unknown model name is specified.
Exception – If there is an error during the generation process or with the OpenAI API interaction.
- Example:
>>> finetuned_model = 'gpt-3.5-turbo'
>>> prompt = 'Translate the following English text to French: Hello, how are you?'
>>> completion = evaluator.get_generated_completion(finetuned_model, prompt)
>>> print(completion)
Prints the model-generated translation of the prompt.
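A hedged sketch of tightening the generation settings, for example a lower temperature for more repeatable output and a smaller token budget; the parameter semantics are the standard OpenAI sampling controls described above:

>>> completion = evaluator.get_generated_completion(
...     finetuned_model="gpt-3.5-turbo",
...     prompt="Summarize the concept of gradient descent in one sentence.",
...     temperature=0.2,   # lower randomness for a more deterministic answer
...     max_tokens=64,     # cap the length of the generated completion
... )
>>> print(completion)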
- read_jsonl(file_path)[source]¶
Reads a JSONL (JSON Lines) file and returns the data as a list of dictionaries.
This method is designed to read and parse data from a JSONL file, where each line of the file is a separate JSON object. It is particularly useful for processing datasets stored in the JSONL format, commonly used in data processing and machine learning tasks.
- Parameters:
file_path (str) – The path to the JSONL file to be read.
- Returns:
A list of dictionaries, each representing a JSON object from a line in the JSONL file.
- Return type:
List[dict]
- Raises:
FileNotFoundError – If the specified file does not exist.
json.JSONDecodeError – If any line in the file is not a valid JSON object.
- Example:
>>> file_path = 'data.jsonl'
>>> data = evaluator.read_jsonl(file_path)
>>> print(data[0])
Displays the first JSON object from the list.
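To illustrate the expected input, here is a hedged sketch that writes a two-line JSONL file (one JSON object per line) and reads it back; the file name and contents are illustrative only:

>>> import json
>>> records = [
...     {"prompt": "What is AI?", "completion": "The study of intelligent machines."},
...     {"prompt": "Define ML.", "completion": "Learning patterns from data."},
... ]
>>> with open("data.jsonl", "w") as f:
...     for record in records:
...         f.write(json.dumps(record) + "\n")
...
>>> data = evaluator.read_jsonl("data.jsonl")
>>> len(data)
2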
- rephrase_and_classify_prompts_in_dataset(input_csv, output_csv, model_name='gpt-3.5-turbo', classify=False, classes=None)[source]¶
Processes and classifies prompts from an input CSV file and saves the results to an output CSV file.
- Method rephrase_and_classify_prompts_in_dataset:
Process prompts from an input CSV file, potentially classify them, and save the results in another CSV file.
- Parameters:
input_csv (str) – The path to the input CSV file containing prompts and their corresponding completions.
output_csv (str) – The path to the output CSV file where the processed data will be saved.
model_name (str, optional) – The name of the language model to use for rephrasing prompts. Default is ‘gpt-3.5-turbo’.
classify (bool, optional) – Flag indicating whether to classify the rephrased prompts. Default is False.
classes (list of str, optional) – The list of classification categories to be used if classification is enabled. Default is None.
- Returns:
None. The method does not return anything but saves the processed data to the specified output CSV file.
- Return type:
None
- Raises:
FileNotFoundError – If the input CSV file is not found.
Exception – For any other exceptions that may occur during the processing or file writing.
- Example:
>>> evaluator.rephrase_and_classify_prompts_in_dataset(
...     "input_prompts.csv", "processed_prompts.csv",
...     classify=True, classes=["class1", "class2"])
This reads prompts from 'input_prompts.csv', rephrases and optionally classifies them, and saves the results to 'processed_prompts.csv'.
- Notes:
The method expects the input CSV to have columns named ‘prompt’ and ‘completion’.
Classification is optional and is performed only if the ‘classify’ parameter is set to True.
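A hedged end-to-end sketch, assuming the input CSV follows the 'prompt'/'completion' layout noted above; the file names and class labels are illustrative:

>>> import csv
>>> # Build a one-row input file in the expected layout.
>>> with open("input_prompts.csv", "w", newline="") as f:
...     writer = csv.DictWriter(f, fieldnames=["prompt", "completion"])
...     writer.writeheader()
...     writer.writerow({"prompt": "What is AI?", "completion": "The study of intelligent machines."})
...
>>> evaluator.rephrase_and_classify_prompts_in_dataset(
...     input_csv="input_prompts.csv",
...     output_csv="processed_prompts.csv",
...     classify=True,
...     classes=["ACADEMIC", "RESEARCH"],
... )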
- rephrase_and_optionally_classify(prompt, model_name='gpt-4', classify=False, classes=None, prompt_style='student-asking', temperature=1, max_tokens=256, top_p=1, frequency_penalty=0, presence_penalty=0)[source]¶
Rephrases a given prompt and optionally classifies it using a specified language model.
This method takes a sentence and rephrases it using the specified language model. It can also classify the rephrased sentence into provided categories, if classification is requested.
- Parameters:
prompt (str) – The original sentence that needs to be rephrased.
model_name (str, optional) – The name of the language model to use, defaults to ‘gpt-4’.
classify (bool, optional) – Indicates whether to classify the rephrased sentence, defaults to False.
classes (list of str, optional) – The list of classification categories, used if classify is True, defaults to None.
prompt_style (str, optional) – The style for rephrasing the prompt, used if classify is False, defaults to ‘student-asking’.
temperature (float, optional) – Controls the randomness of the output, defaults to 1.
max_tokens (int, optional) – The maximum number of tokens to generate, defaults to 256.
top_p (float, optional) – Nucleus sampling parameter, defaults to 1.
frequency_penalty (float, optional) – Adjusts frequency of token usage, defaults to 0.
presence_penalty (float, optional) – Adjusts presence of tokens, defaults to 0.
- Returns:
A tuple containing the rephrased prompt and its classification (or None if not classified).
- Return type:
tuple
- Raises:
ValueError – If unable to parse the model response as JSON.
Exception – If an error occurs during the API request or processing.
- Example:
>>> prompt = "What is AI?"
>>> rephrased, classification = evaluator.rephrase_and_optionally_classify(
...     prompt, classify=True, classes=["ACADEMIC", "RESEARCH"])
>>> print(f"Rephrased: {rephrased}, Classification: {classification}")
Prints the rephrased sentence and its classification.
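When classification is not needed, the same call can be used purely for rephrasing. A hedged sketch relying on the documented prompt_style default (other available style names are not listed here) and a lower temperature:

>>> rephrased, classification = evaluator.rephrase_and_optionally_classify(
...     "Explain what a neural network is.",
...     model_name="gpt-4",
...     prompt_style="student-asking",
...     temperature=0.7,
... )
>>> classification is None
True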
- save_dict_list_to_csv(data, output_file_path=None, output_folder='csv')[source]¶
Converts a list of conversation data into a CSV file, categorizing data into columns for system prompts, user prompts, and assistant completions.
- Method save_dict_list_to_csv:
Process and save conversation data in a structured CSV format.
- Parameters:
data (list) – A list of dictionaries, each representing a conversation with messages categorized by roles (‘system’, ‘user’, ‘assistant’) and their respective content.
output_file_path (str, optional) – The file path for the output CSV file. Defaults to None, which uses a default filename.
output_folder (str, optional) – The directory to save the output CSV file. Defaults to ‘csv’.
- Returns:
None. This method does not return anything but saves the processed data to a CSV file.
- Return type:
None
- Raises:
Exception – If any error occurs during the processing or file writing.
- Example:
>>> data = [{'messages': [
...     {'role': 'system', 'content': 'System message'},
...     {'role': 'user', 'content': 'User question'},
...     {'role': 'assistant', 'content': 'Assistant answer'}]}]
>>> evaluator.save_dict_list_to_csv(data, output_file_path='output.csv')
This processes the provided data and saves it as 'output.csv' in the specified output folder.
- Notes:
The input data should be formatted correctly, with each conversation’s messages having designated roles (‘system’, ‘user’, ‘assistant’).
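Because the expected structure matches chat-format JSONL files of the kind used for fine-tuning, a natural hedged combination is to load such a file with read_jsonl and flatten it with this method; the file names are illustrative:

>>> conversations = evaluator.read_jsonl("finetuning_data.jsonl")
>>> # Each entry is expected to look like:
>>> # {"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}]}
>>> evaluator.save_dict_list_to_csv(
...     conversations,
...     output_file_path="conversations.csv",
...     output_folder="csv",
... )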
- save_random_prompts(input_file, output_file=None, output_format='csv', n_samples=100, output_folder='output')[source]¶
Selects random prompts from a given file and saves them in the specified format.
- Method save_random_prompts:
Extract random samples from a data file and save them in a specified format.
- Parameters:
input_file (str) – The path to the input file, which can be in CSV, JSON, or JSONL format.
output_file (str, optional) – The base name of the output file without extension. If None, a name with a timestamp and the number of samples will be generated. Defaults to None.
output_format (str, optional) – The format for the output file, which can be ‘csv’, ‘json’, or ‘jsonl’. Defaults to ‘csv’.
n_samples (int, optional) – The number of random samples to select from the input file. Defaults to 100.
output_folder (str, optional) – The folder where the output file should be saved. Defaults to ‘output’.
- Returns:
None. This method does not return anything but saves the extracted samples to a file.
- Return type:
None
- Raises:
ValueError – If the input file format is unsupported or if the output format is not one of ‘csv’, ‘json’, or ‘jsonl’.
Exception – If any other error occurs during the processing or file writing.
- Example:
>>> evaluator.save_random_prompts(
...     "data.csv", output_file="sample_prompts", output_format='csv',
...     n_samples=50, output_folder='output')
This selects 50 random prompts from 'data.csv' and saves them as 'sample_prompts_[timestamp]_50.csv' in the 'output' folder.
- Notes:
Ensure that the input file is in one of the supported formats (CSV, JSON, or JSONL) for correct processing.
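A hedged sketch of downsampling a JSONL dataset into JSONL output (input and output formats can differ, since both sides support CSV, JSON, and JSONL); leaving output_file as its default produces the timestamped name described above, and the file names here are illustrative:

>>> evaluator.save_random_prompts(
...     input_file="finetuning_data.jsonl",
...     output_format="jsonl",
...     n_samples=20,
...     output_folder="samples",
... )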
- score_response(prompt, ground_truth_completion, generated_completion, llm_evaluator_model_name='gpt-3.5-turbo')[source]¶
Generates a scoring prompt, sends it to a language model, and parses the response to evaluate a given generated completion against a ground truth.
- Method score_response:
Evaluate the generated completion of a prompt against the ground truth completion using a large language model.
- Parameters:
prompt (str) – The original prompt used in generating the completion.
ground_truth_completion (str) – The correct or expected response to the prompt.
generated_completion (str) – The response generated by the evaluated model.
llm_evaluator_model_name (str, optional) – Name of the large language model to be used for scoring. Defaults to “gpt-3.5-turbo”.
- Returns:
A tuple containing the score (numeric) and the reasoning behind the score (string).
- Return type:
tuple
- Raises:
json.JSONDecodeError – If the response from the model is not in a valid JSON format.
Exception – For any other exceptions that may occur during API calls or processing.
- Example:
>>> score, reason = evaluator.score_response(
...     "What is AI?", "Artificial Intelligence",
...     "AI is a field in computer science.", "gpt-3.5-turbo")
This evaluates the generated completion for accuracy and relevance, returning a score and the reasoning behind it.
- Notes:
The method constructs an evaluation prompt combining the original prompt, ground truth completion, and the generated completion.
It then sends this prompt to the specified language model for scoring and parses the model’s response to extract the score and reasoning.
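A hedged sketch of using score_response in a small loop to compute an average score over a few prompt, ground-truth, and generated-completion triples; the numeric scale of the score is whatever the evaluation prompt defines and is not assumed here:

>>> examples = [
...     ("What is AI?", "Artificial Intelligence", "AI stands for Artificial Intelligence."),
...     ("Define ML.", "Machine Learning", "ML means Machine Learning."),
... ]
>>> scores = []
>>> for prompt, truth, generated in examples:
...     score, reason = evaluator.score_response(prompt, truth, generated)
...     scores.append(float(score))  # score is documented as numeric
...
>>> average = sum(scores) / len(scores)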