scalexi Python API

Simplifying LLM Development and Fine-Tuning with Python

Overview

scalexi is a versatile open-source Python library, optimized for Python 3.11+, focuses on facilitating low-code development and fine-tuning of diverse Large Language Models (LLMs). It extends beyond its initial OpenAI models integration, offering a scalable framework for various LLMs.

Key to scalexi is its low-code approach, significantly reducing the complexity of dataset preparation and manipulation. It features advanced dataset conversion tools, adept at transforming raw contextual data into structured datasets fullfilling LLMs fine-tuning requirements. These tools support multiple question formats, like open-ended, closed-ended, yes-no, and reflective queries, streamlining the creation of customized datasets for LLM fine-tuning.

A standout feature is the library’s automated dataset generation, which eases the workload involved in LLM training. scalexi also provides essential utilities for cost estimation and token counting, aiding in effective resource management throughout the fine-tuning process.

Developed by scalexi.ai, the library leverages a robust specification to facilitate fine-tuning context-specific models with OpenAI API. Alsom scalexi ensures a user-friendly experience while maintaining high performance and error handling.

Explore the full capabilities of Large Language Models with scalexi’s intuitive and efficient Python API with minimal coding for easy LLM development and fine-tuning from dataset creation to LLM evaluation.

Source Code

For API references, and source code, visit the scalexi GitHub Repository. It contains the source code, issues, and discussion around the development of the scalexi library for LLM development and fine-tuning.

Features

Low-Code Interface: scalexi offers a user-friendly, low-code platform that simplifies interactions with LLMs. Its intuitive design minimizes the need for extensive coding, making LLM development accessible to a broader range of users.
Automated Dataset Generation: The library excels in converting raw data into structured formats, aligning with specific LLM fine-tuning requirements. This automation streamlines the dataset preparation process, saving time and reducing manual effort.
Versatile Dataset Format Support: scalexi is designed to handle various dataset formats including CSV, JSON, and JSONL. It also facilitates effortless conversion between these formats, providing flexibility in dataset management and utilization.
Simplified Fine-Tuning Process: The library provides simplified interfaces for fine-tuning LLMs. These user-friendly tools allow for easy customization and optimization of models on specific datasets, enhancing model performance and applicability.
Efficient Model Evaluation: scalexi includes utilities for the automated evaluation of fine-tuned models. This feature assists in assessing model performance, ensuring the reliability and effectiveness of the fine-tuned models.
Token Usage Estimation: The library incorporates functions to accurately estimate token usage and associated costs. This is crucial for managing resources and budgeting in LLM projects, providing users with a clear understanding of potential expenses.

Installation

Easily install scalexi with pip. Just run the following command in your terminal:

pip install scalexi

This will install scalexi and its dependencies, making it ready for use with Python 3.11 and above (not tested on lower Python versions).

Usage

The scalexi toolkit offers comprehensive features for creating, evaluating, and fine-tuning Large Language Models (LLMs) with OpenAI’s API. It allows users to generate datasets from custom context entries, estimate costs for model training and inference, and convert datasets into formats suitable for fine-tuning. Users can fine-tune models with the FineTuningAPI, which includes a dashboard for managing fine-tuning jobs. Additionally, ScaleXI facilitates the evaluation of fine-tuned LLMs by generating random samples, rephrasing prompts for better generalization, and assessing model performance based on generated completions. This toolkit simplifies and streamlines the process of working with LLMs, making it more accessible and efficient for various applications in research, academia, and industry.

To experiment with scalexi, you can use the following Google Colab Project:

In what follow, we present the different use cases of scalexi.

I. Automated Dataset Generation

Context File Setup

To generate a dataset with scalexi, prepare a CSV file with a single column titled ‘context’. Populate this column with context entries, each in a new row, ensuring the content is within the LLM’s token limit. Save the file in a recognized directory before starting the dataset creation process.

Here is an illutrative example of a context.csv file

context,
"Your first context entry goes here. It can be a paragraph or a document that you want to use as the basis for generating questions or prompts.",
"Your second context entry goes here. Make sure that each entry is not too lengthy to stay within the token limits of your LLM."

Create your dataset

After installing scalexi, you can create a fine-tuning dataset for Large Language Models (LLMs) using your own context data. Below is a simple script demonstrating how to generate a dataset:

 import os
from scalexi.dataset_generation.prompt_completion import PromptCompletionGenerator

 # Ensure your OpenAI API key is set as an environment variable
 os.environ['OPENAI_API_KEY'] = 'your-api-key-here'

 # Instantiate the generator with desired settings
 generator = PromptCompletionGenerator(enable_timeouts=True)

 # Specify the path to your context file and the desired output file for the dataset
 context_file = 'path/to/your/context.csv'
 output_dataset_file = 'path/to/your/generated_dataset.csv'

 # Call the create_dataset method with your parameters
 generator.create_dataset(context_file, output_dataset_file,
                         num_questions=1,
                         question_types=["yes-no", "open-ended", "reflective"],
                         model="gpt-3.5-turbo-1106",
                         temperature=0.3,
                         detailed_explanation=True)

This script will generate a dataset with 'yes-no', 'open-ended' and 'reflective', type questions based on the context provided in your CSV file.

II.Cost Estimation and Dataset Formatting with ScaleXI

The ScaleXI library provides utilities for estimating the cost of using OpenAI’s models and converting datasets into the required formats.

Estimating Costs with OpenAIPricing

The OpenAIPricing class can estimate the costs for fine-tuning and inference. Here’s how you can use it:

import json
import pkgutil
from scalexi.openai.pricing import OpenAIPricing

# Load the pricing data
data = pkgutil.get_data('scalexi', 'data/openai_pricing.json')
pricing_info = json.loads(data)

# Create an OpenAIPricing instance
pricing = OpenAIPricing(pricing_info)

# Estimate cost for fine-tuning
number_of_tokens = 10000  # Replace with your actual token count
estimated_cost = pricing.estimate_finetune_training_cost(number_of_tokens, model_name="gpt-3.5-turbo")
print(f"Estimated cost for fine-tuning with {number_of_tokens} tokens: ${estimated_cost:.2f}")

# Estimate cost for inference
input_tokens = 10000  # Replace with your actual input token count
output_tokens = 5000  # Replace with your actual output token count
estimated_cost = pricing.estimate_inference_cost(input_tokens, output_tokens, model_name="gpt-3.5-turbo")
print(f"Estimated inference cost: ${estimated_cost:.2f}")

III. Converting Datasets with DataFormatter

The DataFormatter class can convert datasets from CSV to JSONL, which is the required format for fine-tuning datasets on OpenAI.

from scalexi.utilities.data_formatter import DataFormatter

# Initialize the DataFormatter
dfm = DataFormatter()

# Convert a CSV dataset to JSONL format
csv_dataset_path = "path/to/your/dataset.csv"  # Replace with your actual CSV file path
jsonl_dataset_path = "path/to/your/dataset.jsonl"  # Replace with your desired JSONL file path
dfm.csv_to_jsonl(csv_dataset_path, jsonl_dataset_path)

Fine-Tuning Dataset Conversion

ScaleXI also provides a method to convert a dataset from prompt completion to a conversation format, suitable for fine-tuning of OpenAI GPT-based conversational models:

# Convert prompt completion dataset to conversation format
prompt_completion_dataset_path = "path/to/your/generated_dataset.jsonl"  # Replace with your actual JSONL file path
conversation_dataset_path = "path/to/your/conversation_dataset.jsonl"  # Replace with your desired JSONL file path
dfm.convert_prompt_completion_to_conversation(prompt_completion_dataset_path, conversation_dataset_path)

# Calculate token usage for the conversation dataset
number_of_tokens = pricing.calculate_token_usage_for_dataset(conversation_dataset_path)
print(f"Number of tokens in the conversation dataset: {number_of_tokens}")

# Estimate fine-tuning cost for the conversation dataset
estimated_cost = pricing.estimate_finetune_training_cost(number_of_tokens, model_name="gpt-3.5-turbo")
print(f"Estimated fine-tuning cost for the conversation dataset: ${estimated_cost:.2f}")

IV.Fine-Tuning OpenAI Models with ScaleXI

ScaleXI simplifies the process of fine-tuning Large Language Models (LLMs) with OpenAI’s API. Here’s how you can start fine-tuning your models with the FineTuningAPI class.

Setting up FineTuningAPI

First, you need to set your OpenAI API key as an environment variable. You can do this within your script or in your shell environment.

import os
from scalexi.openai.fine_tuning_api import FineTuningAPI

# Prompt for the OpenAI API key and set it as an environment variable
api_key = input("Please enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = api_key

# Confirm that the API key has been set
print(f"OpenAI API key set: {os.getenv('OPENAI_API_KEY') is not None}")

V. Running the Fine-Tuning Dashboard

After setting up your API key, you can launch the fine-tuning dashboard provided by ScaleXI to monitor and manage your fine-tuning jobs.

from scalexi.openai.fine_tuning_api import FineTuningAPI

# Initialize the FineTuningAPI with the API key from the environment
api = FineTuningAPI(api_key=os.getenv("OPENAI_API_KEY"))

# Run the dashboard
api.run_dashboard()

This dashboard allows you to start new fine-tuning jobs, monitor the progress of ongoing jobs, and review past jobs.

Menu:
Create a fine-tune file
Create a fine-tuning job
List of tune-tune files
List 10 fine-tuning jobs
Retrieve the state of a fine-tune
Cancel a job
List up to 10 events from a fine-tuning job
Use a fine-tuned model
Delete a fine-tuned model
Exit

IV.Fine-Tuned LLM Evaluation

Step 1: Random Sample Creation

Generate a random sample from your dataset for evaluation purposes.

from scalexi.llm_evaluation.evaluate import LLMEvaluation
import os

# Set your OpenAI API key as an environment variable
os.environ['OPENAI_API_KEY'] = 'your-api-key-here'

# Initialize the LLMEvaluation object
llm_evaluation = LLMEvaluation(enable_timeouts=False)

# Define the path to your dataset
conversation_dataset_jsonl = 'path/to/your/files/conversation_dataset.jsonl'

# Specify the output details for the sample dataset
output_folder = 'path/to/output/folder/'
output_file = output_folder + 'random_prompts.csv'

# Create a CSV output needed for evaluation
llm_evaluation.save_random_prompts(conversation_dataset_jsonl, output_file, output_format='csv', n_samples=10)

Step 2: Rephrase Prompts

Improve your model’s ability to generalize by rephrasing prompts and optionally classifying them. You can choose any classes of your interest.

# Rephrase prompts and classify them if needed
rephrased_dataset_csv = output_folder + 'rephrased_dataset.csv'
llm_evaluation.rephrase_and_classify_prompts_in_dataset(output_file, rephrased_dataset_csv,
                                            classify=True,
                                            classes=['ACADEMIC', 'RESEARCH', 'ADMIN', 'SCIENCE', 'OTHERS'])

Step 3: Evaluate LLM

Assess the performance of your LLM by comparing the generated completions with ground truth.

# Define the fine-tuned model name for evaluation
finetuned_model = 'ft-gpt-3.5-turbo-your-model-id'
evaluation_results_csv = output_folder + 'evaluation_results.csv'
# Evaluate the fine-tuned model using the rephrased dataset
llm_evaluation.evaluate_model(finetuned_model,
                              rephrased_dataset_csv,
                              evaluation_results_csv,
                              temperature=0.3, max_tokens=250, top_p=1.0,
                              frequency_penalty=0, presence_penalty=0,
                              llm_evaluator_model_name='gpt-3.5-turbo',
                              experiment_id=1,  # Experiment identifier
                              save_immediately=False)

The evaluation_results.csv file compiles the scores from the evaluation, providing a clear metric of your LLM’s performance.

Contributing

We warmly welcome contributions to scalexi! Whether you’re fixing bugs, adding new features, or improving documentation, your help is invaluable.

Before you start, please take a moment to review our contribution guidelines. They provide important instructions and best practices to follow, ensuring a smooth and efficient contribution process.

You can find all the necessary details in our Contribution Guidelines. Thank you for your interest in enhancing scalexi!

License

scalexi is released under the ScaleXI License 1.0. This license ensures that the software can be freely used, reproduced, and distributed, both for academic and business purposes, while requiring proper attribution to ScaleX Innovation company in any derivative works or distributions.

For full details of the terms and conditions, please refer to the LICENSE file included with the software.

Contact

For support or queries, reach out to us at Anis Koubaa akoubaa@scalexi.com.

Indices and Tables

ScaleXI Documentation

Documentation