Utilities

Data Formatter

class DataFormatter[source]

Bases: object

convert_conversation_to_prompt_completion(input_file, output_file)[source]

Converts a conversational JSONL file to a format with prompt-completion pairs.

Method convert_conversation_to_prompt_completion:

Transform a conversational JSONL file by extracting user messages as prompts and assistant responses as completions, and save them in an output JSONL file.

Parameters:
  • input_file (str) – The path to the input conversational JSONL file containing messages.

  • output_file (str) – The path to the output JSONL file where prompt-completion pairs will be saved.

Returns:

None. The function writes the extracted prompt-completion pairs to the output file.

Return type:

None

Raises:
  • FileNotFoundError – If the input file specified does not exist.

  • Exception – If any error occurs during the file reading or writing process.

Example:

# Given an input JSONL file with conversational data:
# {"messages": [{"role": "user", "content": "What is AI?"}, {"role": "assistant", "content": "AI stands for Artificial Intelligence."}]}
# Running convert_conversation_to_prompt_completion will produce an output JSONL file containing:
# {"prompt": "What is AI?", "completion": "AI stands for Artificial Intelligence."}
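
The per-record transformation described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the helper name conversation_to_pairs is hypothetical, and the rule of pairing each user message with the next assistant message is an assumption about how multi-turn conversations are handled.

```python
import json


def conversation_to_pairs(record):
    """Pair each user message with the assistant message that follows it."""
    pairs = []
    pending_prompt = None
    for message in record.get("messages", []):
        role = message.get("role")
        if role == "user":
            pending_prompt = message.get("content", "")
        elif role == "assistant" and pending_prompt is not None:
            pairs.append({"prompt": pending_prompt, "completion": message.get("content", "")})
            pending_prompt = None  # wait for the next user message
    return pairs


line = (
    '{"messages": [{"role": "user", "content": "What is AI?"}, '
    '{"role": "assistant", "content": "AI stands for Artificial Intelligence."}]}'
)
for pair in conversation_to_pairs(json.loads(line)):
    print(json.dumps(pair))
```

System messages and unanswered user messages are skipped in this sketch, which keeps every emitted record a complete pair.
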
convert_llama2_instructions_to_prompt_completion(input_file, output_file)[source]

Converts data from the LLAMA2 instructions format to the original prompt-completion format.

Method convert_llama2_instructions_to_prompt_completion:

Transform data from LLAMA2 instructions format into prompt-completion pairs, and save them in a JSONL file.

Parameters:
  • input_file (str) – The path to the input JSONL file in LLAMA2 instructions format.

  • output_file (str) – The path to the output JSONL file where the converted prompt-completion pairs will be saved.

Returns:

None. The function writes the converted data to the output file.

Return type:

None

Raises:
  • FileNotFoundError – If the input file specified does not exist.

  • Exception – If any error occurs during the file reading or writing process.

Example:

# Given an input JSONL file in LLAMA2 format:
# {"instruction": "Explain what AI is.", "output": "AI stands for Artificial Intelligence."}
# Running convert_llama2_instructions_to_prompt_completion will produce an output JSONL file containing:
# {"prompt": "Explain what AI is.", "completion": "AI stands for Artificial Intelligence."}
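
As a rough sketch of the key mapping shown in the example, the generator below renames instruction to prompt and output to completion, one JSONL line at a time. The name convert_lines is hypothetical, and skipping blank lines is an assumption rather than documented behavior.

```python
import json


def convert_lines(lines):
    """Yield one prompt-completion JSON string per non-empty input line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        yield json.dumps({"prompt": record["instruction"], "completion": record["output"]})


source = [
    '{"instruction": "Explain what AI is.", "output": "AI stands for Artificial Intelligence."}\n',
    "\n",
]
converted = list(convert_lines(source))
print(converted[0])
```

Because the function accepts any iterable of strings, an open file object can be passed directly, which avoids loading the whole file into memory.
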
convert_prompt_completion_to_conversation(input_file, output_file, system_prompt='You are a chatbot assistant. respond to any question properly.')[source]

Converts a JSONL file containing prompt-completion pairs into a conversational JSONL format.

This method is particularly useful for transforming datasets with prompt and completion pairs into a format suitable for conversational AI model training or evaluation. The function reads from a JSONL file where each line is a JSON object with ‘prompt’ and ‘completion’ keys. It then formats these pairs into a conversational structure, adding a system prompt at the beginning of each conversation, and writes this new format to an output JSONL file.

Method convert_prompt_completion_to_conversation:

Transforms prompt-completion data into a conversational format.

Parameters:
  • input_file (str) – The file path to the input JSONL file with prompt-completion pairs.

  • output_file (str) – The file path to the output JSONL file for the conversational data.

  • system_prompt (str, optional) – A standard prompt to be included as the starting message in each conversation. Defaults to a generic chatbot assistant prompt.

Raises:
  • FileNotFoundError – If the specified input file cannot be found.

  • Exception – If there is an error during file processing.

Example:

>>> convert_prompt_completion_to_conversation("input.jsonl", "output.jsonl")
>>> # This will read from 'input.jsonl' and write the conversational format to 'output.jsonl'.
Example Input JSONL Line:
>>> {"prompt": "What is AI?", "completion": "AI stands for Artificial Intelligence."}
Example Output JSONL Line:
{
    "messages": [
        {"role": "system", "content": "You are a chatbot assistant. respond to any question properly."},
        {"role": "user", "content": "What is AI?"},
        {"role": "assistant", "content": "AI stands for Artificial Intelligence."}
    ]
}
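
The conversational structure above can be assembled from a single prompt-completion pair as sketched below. The helper pair_to_conversation and the constant DEFAULT_SYSTEM_PROMPT are hypothetical names; only the default prompt text and the message layout come from the documentation.

```python
import json

# Default text copied from the documented method signature.
DEFAULT_SYSTEM_PROMPT = "You are a chatbot assistant. respond to any question properly."


def pair_to_conversation(pair, system_prompt=DEFAULT_SYSTEM_PROMPT):
    """Wrap one prompt-completion pair in a system/user/assistant message list."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": pair["prompt"]},
            {"role": "assistant", "content": pair["completion"]},
        ]
    }


pair = {"prompt": "What is AI?", "completion": "AI stands for Artificial Intelligence."}
print(json.dumps(pair_to_conversation(pair)))
```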

convert_prompt_completion_to_llama2_instructions(input_file, output_jsonl_file, output_json_file)[source]

Converts prompt-completion data from a JSONL file to the LLAMA2 instructions format.

This function is designed to read a JSONL file containing prompt-completion pairs and convert each record into the LLAMA2 instructions format, which is a structured way to represent AI training data. It generates two output files: one in JSONL format and another in JSON format, each containing the converted data.

Method convert_prompt_completion_to_llama2_instructions:

Transforms data into LLAMA2 instructions format for AI training.

Parameters:
  • input_file (str) – The file path to the input JSONL file with prompt-completion pairs.

  • output_jsonl_file (str) – The file path to the output JSONL file which will contain the LLAMA2 formatted data.

  • output_json_file (str) – The file path to the output JSON file which will also contain the LLAMA2 formatted data.

Raises:
  • FileNotFoundError – If the specified input file cannot be found.

  • Exception – If there is an error during the conversion process or file writing.

Example:

>>> convert_prompt_completion_to_llama2_instructions("input.jsonl", "output.llama2.jsonl", "output.llama2.json")
# This will read prompt-completion pairs from 'input.jsonl' and write the LLAMA2 formatted data to 'output.llama2.jsonl' and 'output.llama2.json'.
Example Input JSONL Record:
>>> {"prompt": "Explain how a computer works.", "completion": "A computer is a machine that processes information."}
Example Output LLAMA2 JSONL Record:
>>> {
>>>     "instruction": "Explain how a computer works.",
>>>     "input": "",
>>>     "output": "A computer is a machine that processes information."
>>> }
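
A minimal sketch of the two-file output follows, using only the standard library. The helper names are hypothetical, and storing the JSON file as a single array of records is an assumption; the documentation states only that both files contain the converted data.

```python
import json
import os
import tempfile


def to_llama2_record(pair):
    """Map one prompt-completion pair onto the instruction/input/output keys."""
    return {"instruction": pair["prompt"], "input": "", "output": pair["completion"]}


def write_llama2_files(pairs, jsonl_path, json_path):
    records = [to_llama2_record(pair) for pair in pairs]
    # JSONL: one record per line.
    with open(jsonl_path, "w", encoding="utf-8") as jsonl_file:
        for record in records:
            jsonl_file.write(json.dumps(record) + "\n")
    # Assumption: the JSON file stores the same records as a single array.
    with open(json_path, "w", encoding="utf-8") as json_file:
        json.dump(records, json_file, indent=2)
    return records


pairs = [{"prompt": "Explain how a computer works.",
          "completion": "A computer is a machine that processes information."}]
with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = os.path.join(tmp, "output.llama2.jsonl")
    json_path = os.path.join(tmp, "output.llama2.json")
    write_llama2_files(pairs, jsonl_path, json_path)
    with open(jsonl_path, encoding="utf-8") as f:
        jsonl_records = [json.loads(line) for line in f]
    with open(json_path, encoding="utf-8") as f:
        json_records = json.load(f)

print(jsonl_records == json_records)  # both files hold the same records
```
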
csv_to_json(json_file_path)[source]

Converts data from a CSV file to a JSON file.

This function is designed to read data from a CSV file and convert it into a JSON format, saving the converted data to a JSON file. It’s useful for data format conversion, especially when dealing with data transformation and integration tasks in data processing and machine learning workflows.

Method csv_to_json:

Transforms data from CSV format into JSON format.

Parameters:
  • csv_file_path (str) – The file path to the CSV file containing the data to be converted.

  • json_file_path (str) – The file path where the resulting JSON file will be saved.

Example:

>>> csv_to_json("data.csv", "data.json")
# This will read data from 'data.csv' and convert it to JSON format, saving the result in 'data.json'.
Note:

The function reads the CSV data into a pandas DataFrame and then converts it to JSON using pandas’ to_json method.

csv_to_jsonl(csv_file_path, jsonl_file_path)[source]

Converts data from a CSV file to a JSON Lines (JSONL) file.

This function is designed to read data from a CSV file and convert it into the JSON Lines format, which is a JSON format where each line is a separate JSON object. This is particularly useful for handling large datasets or stream processing, as it allows for efficient data processing line by line.

Method csv_to_jsonl:

Transforms data from CSV format into JSON Lines format.

Parameters:
  • csv_file_path (str) – The file path to the CSV file containing the data to be converted.

  • jsonl_file_path (str) – The file path where the resulting JSON Lines file will be saved.

Example:

>>> csv_to_jsonl("data.csv", "data.jsonl")
# This will read data from 'data.csv' and convert it to JSON Lines format, saving the result in 'data.jsonl'.
Note:

The function reads the CSV data into a pandas DataFrame and then converts it to JSON Lines using pandas’ to_json method with the orient="records" and lines=True parameters.
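
To show what the JSON Lines output looks like, here is a sketch that uses the standard csv and json modules instead of pandas. The helper csv_text_to_jsonl is hypothetical. Unlike pandas, which infers numeric column types, csv.DictReader returns every value as a string, so the numbers appear quoted in this version.

```python
import csv
import io
import json


def csv_text_to_jsonl(csv_text):
    """Convert CSV text to JSON Lines text, one JSON object per data row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row) for row in reader)


print(csv_text_to_jsonl("name,age\nAlice,30\nBob,25\n"))
# {"name": "Alice", "age": "30"}
# {"name": "Bob", "age": "25"}
```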

csv_to_yaml(csv_file_path, yaml_file_path)[source]

Converts CSV data to a YAML file format.

Method csv_to_yaml:

Read data from a CSV file and convert it into a YAML file format.

Parameters:
  • csv_file_path (str) – The file path where the CSV data is located.

  • yaml_file_path (str) – The file path where the converted YAML data will be saved.

Returns:

None. The function writes the converted YAML data to the specified file.

Return type:

None

Raises:
  • FileNotFoundError – If the CSV file specified does not exist.

  • Exception – If any error occurs during the file reading or conversion process.

Example:

>>> csv_to_yaml("data.csv", "data.yaml")
# This will read 'data.csv', convert its contents to YAML format, and save it to 'data.yaml'.
df_to_csv(df, output_file)[source]

Appends a DataFrame to a CSV file or creates a new one if it doesn’t exist.

This function persists pandas DataFrames to CSV. It checks whether the specified CSV file exists. If it does not, the function creates a new file and writes the DataFrame with a header. If the file exists, it appends the DataFrame rows without writing another header, which avoids header duplication.

Method df_to_csv:

Handles the appending of DataFrame data to a CSV file or creates a new file if necessary.

Parameters:
  • df (pandas.DataFrame) – The DataFrame that needs to be written to a CSV file.

  • output_file (str) – The file path where the DataFrame data will be written or appended. This function takes care to avoid overwriting existing data.

Note:

The data in the existing CSV file will not be overwritten. New data from the DataFrame will be appended to ensure data integrity.

Example:

>>> df = pandas.DataFrame({"column1": [1, 2], "column2": [3, 4]})
>>> df_to_csv(df, "output.csv")
# This will append the DataFrame to 'output.csv' if it exists, or create a new file if it does not.
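
The append-or-create logic can be illustrated without pandas. The sketch below uses csv.DictWriter on plain dictionaries and writes the header only when the file does not yet exist. The helper append_rows is hypothetical and is not the library's code.

```python
import csv
import os
import tempfile


def append_rows(rows, output_file):
    """Append dict rows to a CSV file, writing the header only for a new file."""
    write_header = not os.path.exists(output_file)
    with open(output_file, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.csv")
    append_rows([{"column1": 1, "column2": 3}], path)  # creates the file with a header
    append_rows([{"column1": 2, "column2": 4}], path)  # appends without a second header
    with open(path, newline="", encoding="utf-8") as f:
        lines = f.read().splitlines()

print(lines)  # ['column1,column2', '1,3', '2,4']
```

With a DataFrame, a comparable effect is usually obtained through DataFrame.to_csv with mode="a" and the header argument switched off for existing files.
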
df_to_json(df, json_output)[source]

Converts a DataFrame to JSON format and saves it to a specified file.

This method is designed to take a pandas DataFrame, convert it into a JSON format, and save it to a file. It appends to the file if it already exists, or creates a new file if it does not. This function is useful for data serialization and storage, especially in data processing and machine learning workflows.

Method df_to_json:

Converts the given DataFrame to JSON format and saves it to the specified file path.

Parameters:
  • df (pandas.DataFrame) – The DataFrame that needs to be converted to JSON.

  • json_output (str) – The file path where the JSON data will be saved. The method appends to the file if it exists, or creates a new one if it does not.

Raises:

Exception – If there is an error during the conversion or file writing process.

Example:

>>> df = pandas.DataFrame({"column1": [1, 2], "column2": [3, 4]})
>>> df_to_json(df, "output.json")
# This will save the DataFrame as a JSON file named 'output.json'.
extract_json_array(input_text)[source]

Extracts a JSON array from a given text by removing specific markers.

Method extract_json_array:

Remove markers from text and parse it as a JSON array.

Parameters:

input_text (str) – The input text containing a JSON array with additional formatting markers.

Returns:

The extracted JSON array, or None if the text cannot be parsed as JSON.

Return type:

list | None

Raises:

json.JSONDecodeError – If there is an error in decoding the JSON from the provided text.

Example:

>>> extract_json_array('```json\n[{"name": "Alice"}, {"name": "Bob"}]\n```')
# This will return [{'name': 'Alice'}, {'name': 'Bob'}].
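
One way to picture this behavior is the sketch below, which strips Markdown code-fence markers and then parses the remainder. The helper extract_array is hypothetical. Returning None on a parse failure, and for JSON that is not a list, is an assumption based on the documented return type.

```python
import json

FENCE = "`" * 3  # the Markdown code-fence marker (three backticks)


def extract_array(text):
    """Strip fence markers, then parse the remaining text as a JSON array."""
    cleaned = text.replace(FENCE + "json", "").replace(FENCE, "").strip()
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        return None  # text is not valid JSON
    return parsed if isinstance(parsed, list) else None


wrapped = FENCE + 'json\n[{"name": "Alice"}, {"name": "Bob"}]\n' + FENCE
print(extract_array(wrapped))  # [{'name': 'Alice'}, {'name': 'Bob'}]
```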

format_prompt_completion(prompt, completion, start_sequence='\n\n###\n\n', end_sequence='END')[source]

Formats and structures a user-defined prompt and its corresponding completion into a pandas DataFrame.

Method format_prompt_completion:

Organizes a prompt and its completion into a structured DataFrame, aiding in data handling and analysis for machine learning tasks.

Parameters:
  • prompt (str) – The initial text or question presented to a model or system, providing context or a scenario for a subsequent completion or answer.

  • completion (str) – The text or response that follows the prompt, offering additional information or a resolution to the given context.

  • start_sequence (str, optional) – A character sequence denoting the beginning of the prompt, aiding in segmenting data entries in large datasets. Defaults to “\n\n###\n\n”.

  • end_sequence (str, optional) – A character sequence indicating the end of the completion, assisting in marking the conclusion of data entries in large datasets. Defaults to “END”.

Returns:

A DataFrame with two columns, ‘Prompt’ and ‘Completion’, facilitating easier data manipulation and analysis.

Return type:

pandas.DataFrame

Example:

>>> format_prompt_completion("How is the weather today?", "It is sunny and warm.")
# This will return a DataFrame with the prompt and completion structured in two separate columns.
format_prompt_completion_df(prompt, completion, start_sequence='\n\n###\n\n', end_sequence='END')[source]

Formats and structures a user-defined prompt and its corresponding completion into a pandas DataFrame.

Method format_prompt_completion_df:

Organizes a prompt and its completion into a structured DataFrame, aiding in data handling and analysis for machine learning tasks.

Parameters:
  • prompt (str) – The initial text or question presented to a model or system, providing context or a scenario for a subsequent completion or answer.

  • completion (str) – The text or response that follows the prompt, offering additional information or a resolution to the given context.

  • start_sequence (str, optional) – A character sequence denoting the beginning of the prompt, aiding in segmenting data entries in large datasets. Defaults to “\n\n###\n\n”.

  • end_sequence (str, optional) – A character sequence indicating the end of the completion, assisting in marking the conclusion of data entries in large datasets. Defaults to “END”.

Returns:

A DataFrame with two columns, ‘Prompt’ and ‘Completion’, facilitating easier data manipulation and analysis.

Return type:

pandas.DataFrame

Example:

>>> format_prompt_completion_df("How is the weather today?", "It is sunny and warm.")
# This will return a DataFrame with the prompt and completion structured in two separate columns.
json_to_csv(json_data, csv_file_path)[source]

Converts JSON data into a CSV file.

This function is specifically designed to take JSON data, represented as a list of dictionaries, and convert it into a CSV file. This can be particularly useful for data transformation tasks where JSON data needs to be presented or analyzed in tabular form.

Method json_to_csv:

Transforms JSON data into a CSV file.

Parameters:
  • json_data (list of dict) – A list of dictionaries, where each dictionary represents a row of data to be converted into CSV format.

  • csv_file_path (str) – The file path where the resulting CSV file will be saved.

Example:

>>> json_data = [
    {"name": "John", "age": 30, "city": "New York"},
    {"name": "Alice", "age": 25, "city": "Los Angeles"},
    {"name": "Bob", "age": 35, "city": "Chicago"}
]
>>> json_to_csv(json_data, "data.csv")
# This will create a CSV file 'data.csv' with the data from 'json_data'.
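
A minimal sketch of the list-of-dictionaries to CSV conversion, using csv.DictWriter from the standard library. The helper rows_to_csv_text is hypothetical, and building the header from the union of all keys in first-seen order is an assumption about how rows with differing keys might be handled.

```python
import csv
import io


def rows_to_csv_text(rows):
    """Render a list of dicts as CSV text; missing keys become empty cells."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


json_data = [
    {"name": "John", "age": 30, "city": "New York"},
    {"name": "Alice", "age": 25, "city": "Los Angeles"},
    {"name": "Bob", "age": 35, "city": "Chicago"},
]
print(rows_to_csv_text(json_data))
```
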
json_to_yaml(json_data)[source]

Converts a JSON object or string to a YAML-formatted string.

This function is designed to take a JSON object, either in the form of a Python dictionary or as a JSON-formatted string, and converts it into a YAML-formatted string. The conversion preserves the original structure and data types from the JSON input.

Method json_to_yaml:

Converts JSON data to YAML format.

Parameters:

json_data (dict or str) – The JSON data to be converted. It can be a dictionary representing the JSON object or a string containing JSON-formatted text.

Returns:

A string containing the data in YAML format.

Return type:

str

Note:

The function utilizes PyYAML’s dump method for conversion, which generates block-style YAML formatting by default and does not sort keys to maintain the order.

Example:

>>> json_obj = {"name": "John", "age": 30, "city": "New York"}
>>> yaml_str = json_to_yaml(json_obj)
>>> print(yaml_str)
name: John
age: 30
city: New York

>>> json_str = '{"name": "John", "age": 30, "city": "New York"}'
>>> yaml_str = json_to_yaml(json_str)
>>> print(yaml_str)
name: John
age: 30
city: New York
jsonl_to_csv(jsonl_file_path, csv_file_path)[source]

Converts JSON Lines (JSONL) data to a CSV file format.

Method jsonl_to_csv:

Read data from a JSONL file and convert it into a CSV file format.

Parameters:
  • jsonl_file_path (str) – The file path where the JSONL data is located.

  • csv_file_path (str) – The file path where the converted CSV data will be saved.

Returns:

None. The function writes the converted CSV data to the specified file.

Return type:

None

Raises:
  • FileNotFoundError – If the JSONL file specified does not exist.

  • Exception – If any error occurs during the file reading or conversion process.

Example:

>>> jsonl_to_csv("data.jsonl", "data.csv")
# This will read 'data.jsonl', convert its contents to CSV format, and save it to 'data.csv'.
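
The file-level flow, including the FileNotFoundError for a missing input, might look like the sketch below. The helper jsonl_file_to_csv is hypothetical. Taking the column names from the first record and ignoring extra keys in later records are assumptions made for the illustration.

```python
import csv
import json
import os
import tempfile


def jsonl_file_to_csv(jsonl_path, csv_path):
    """Convert a JSONL file to CSV and return the number of records written."""
    # open() raises FileNotFoundError if jsonl_path does not exist.
    with open(jsonl_path, encoding="utf-8") as src:
        records = [json.loads(line) for line in src if line.strip()]
    with open(csv_path, "w", newline="", encoding="utf-8") as dst:
        if records:
            # Assumption: the first record defines the columns.
            writer = csv.DictWriter(dst, fieldnames=list(records[0]), extrasaction="ignore")
            writer.writeheader()
            writer.writerows(records)
    return len(records)


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "data.jsonl")
    dst_path = os.path.join(tmp, "data.csv")
    with open(src_path, "w", encoding="utf-8") as f:
        f.write('{"name": "Alice", "age": 30}\n{"name": "Bob", "age": 25}\n')
    count = jsonl_file_to_csv(src_path, dst_path)
    with open(dst_path, newline="", encoding="utf-8") as f:
        csv_lines = f.read().splitlines()

print(count, csv_lines)  # 2 ['name,age', 'Alice,30', 'Bob,25']
```
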
list_to_csv(data_list, output_file)[source]

Saves a list of dictionaries to a CSV file, effectively serializing structured data for consistency, portability, and interchange.

Method list_to_csv:

Convert and store a list of dictionaries into a CSV file.

Parameters:
  • data_list (list of dict) – A list containing dictionaries, where each dictionary represents a record with keys as column names and values as data entries.

  • output_file (str) – The file path where the CSV file will be saved. If a file exists at this path, it will be overwritten.

Raises:

IOError – If an error occurs during the file writing process.

Example:

>>> list_to_csv([{"Name": "Alice", "Age": 30}, {"Name": "Bob", "Age": 25}], "people.csv")
# This saves the provided list of dictionaries to 'people.csv', with 'Name' and 'Age' as column headers.
remove_json_markers(input_text)[source]

Removes specific markers, typically used for JSON formatting, from the given text.

This method is designed to clean up text by removing certain markers that are often used to denote JSON content. It’s particularly useful in scenarios where the text includes these markers for formatting purposes, such as in markdown or documentation, and the raw text is required for further processing or analysis.

Method remove_json_markers:

Strips away specific markers from the input text.

Parameters:

input_text (str) – The text from which the JSON array markers need to be removed. These markers usually include ```json and ```.

Returns:

The cleaned text with all specified markers removed.

Return type:

str

Example:

>>> example_text = '```json\n{"key": "value"}\n```'
>>> remove_json_markers(example_text)
'{"key": "value"}'
simplify_data(data)[source]

Recursively simplifies data to ensure it only contains types that are serializable.

Method simplify_data:

Convert complex data structures into serializable formats.

Parameters:

data (dict | list | str | int | float | bool | Any) – The data structure (such as a dictionary, list, or a basic data type) that needs to be simplified.

Returns:

The simplified data where complex types are converted into serializable basic types.

Return type:

dict | list | str | int | float | bool

Example:

>>> simplify_data({"user": {"name": "Alice", "age": 30, "preferences": ["reading", "traveling"]}})
# This will return a simplified version of the nested dictionary, ensuring all elements are serializable.
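
The recursive idea can be sketched as follows. The helper simplify is hypothetical, and two of its choices are assumptions rather than documented behavior: tuples and sets are turned into lists, and any other unsupported object falls back to its string form.

```python
import json
from datetime import date


def simplify(value):
    """Recursively convert a value into JSON-serializable basic types."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    if isinstance(value, dict):
        return {str(key): simplify(item) for key, item in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [simplify(item) for item in value]
    # Assumption: anything else is represented by its string form.
    return str(value)


nested = {"user": {"name": "Alice", "joined": date(2024, 1, 15), "tags": ("a", "b")}}
print(json.dumps(simplify(nested)))
# {"user": {"name": "Alice", "joined": "2024-01-15", "tags": ["a", "b"]}}
```
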
xml_to_csv(xml_file_path, csv_file_path)[source]

Converts XML data to CSV format.

Method xml_to_csv:

Parse an XML file, extract its data, and save it in CSV format.

Parameters:
  • xml_file_path (str) – The file path where the XML data is located.

  • csv_file_path (str) – The file path where the converted CSV data will be saved.

Returns:

None. The function writes the converted CSV data to the specified file.

Return type:

None

Raises:
  • FileNotFoundError – If the XML file specified does not exist.

  • Exception – If any error occurs during the file parsing or conversion process.

Example:

>>> xml_to_csv("data.xml", "output.csv")
# This will read 'data.xml', extract its contents, and save it to 'output.csv' in CSV format.
Notes:

  • This function utilizes the xml.etree.ElementTree library for parsing the XML file.

  • The XML elements are transformed into CSV rows, with the header row in the CSV derived from the XML element names.

See also:

  • csv_to_xml: A function to convert CSV data to XML format.
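
As a rough illustration of the notes above, the sketch below parses XML with xml.etree.ElementTree and derives the CSV header from element names. The helper xml_text_to_csv_text is hypothetical. It assumes a flat layout in which each child of the root is one record and each grandchild is one field; the actual function may handle other layouts differently.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_text_to_csv_text(xml_text):
    """Convert flat record-style XML into CSV text."""
    root = ET.fromstring(xml_text)
    # Assumption: root children are records, their children are fields.
    rows = [{field.tag: (field.text or "").strip() for field in record} for record in root]
    header = list(rows[0]) if rows else []
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=header, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample = (
    "<people>"
    "<person><name>Alice</name><age>30</age></person>"
    "<person><name>Bob</name><age>25</age></person>"
    "</people>"
)
print(xml_text_to_csv_text(sample))
```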

yaml_to_csv(yaml_file_path, csv_file_path)[source]

Converts YAML data to a CSV file.

This function is specifically designed to read data from a YAML file and convert it into a CSV format, subsequently saving the converted data to a CSV file. This is useful in scenarios where YAML-formatted data needs to be represented in a tabular format for easier analysis or integration with other data processing tools.

Method yaml_to_csv:

Transforms data from YAML format into CSV format.

Parameters:
  • yaml_file_path (str) – The file path to the YAML file containing the data to be converted.

  • csv_file_path (str) – The file path where the resulting CSV file will be saved.

Example:

>>> yaml_to_csv("data.yaml", "data.csv")
# This will read data from 'data.yaml', convert it to CSV format, and save the result in 'data.csv'.
Note:

The function reads the YAML data, converts it into a suitable structure (like a pandas DataFrame), and then writes it to a CSV file.

yaml_to_json(yaml_data)[source]

Converts a YAML-formatted string to a JSON-compatible dictionary.

This function takes a YAML-formatted string, parses it, and returns the result as a Python dictionary that can be serialized to JSON. The conversion retains the original structure and data types present in the input YAML.

Method yaml_to_json:

Converts YAML data to a JSON format.

Parameters:

yaml_data (str) – The YAML data to be converted, provided as a string.

Returns:

A dictionary representing the JSON data.

Return type:

dict

Raises:

ValueError – If the YAML data does not represent a valid dictionary.

Example:

>>> yaml_str = "name: John\nage: 30\ncity: New York"
>>> json_obj = yaml_to_json(yaml_str)
>>> print(json_obj)
{'name': 'John', 'age': 30, 'city': 'New York'}
Note:

The function uses PyYAML’s safe_load method to parse the YAML data and convert it into a dictionary.