Data Processing

binaryrain_helper_data_processing is a Python package that simplifies common data processing tasks. It builds on the pandas library and adds functionality that makes data processing easier, reduces boilerplate code, and provides clear error messages.

Installation

To install the package, use your favorite Python package manager:

pip install binaryrain-helper-data-processing

uv add binaryrain-helper-data-processing

Supported File Formats

The FileFormat enum specifies the file format when creating or converting DataFrames. The supported formats are:

PARQUET: For efficient columnar storage
CSV: For common tabular data
JSON: For structured data exchange
DICT: For Python dictionary data

Key Functions

create_dataframe()

simplifies creating pandas DataFrames from various formats:

from binaryrain_helper_data_processing.dataframe import FileFormat, create_dataframe

# Create from CSV bytes

df = create_dataframe(csv_bytes, FileFormat.CSV)

# Create with custom options

df = create_dataframe(parquet_bytes, FileFormat.PARQUET,
                      file_format_options={'engine': 'pyarrow'})

Parameters:

file_bytes: bytes | dict | The bytes of the file (or, for FileFormat.DICT, the dictionary) to be converted into a DataFrame.
file_format: FileFormat | The format of the file (e.g., CSV, Parquet, JSON, or Dict).
file_format_options: dict | None | Optional dictionary of options for the file format (e.g., engine for Parquet).
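
The examples above assume that csv_bytes already exists. As a minimal sketch, one way to build such a payload in memory with only the standard library is shown below. The column names and rows are made up for illustration, and the commented-out call at the end simply repeats the usage shown above.

```python
import csv
import io

# Hypothetical sample rows, used only for illustration.
rows = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
]

# Write the rows as CSV text into an in-memory buffer.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(rows)

# Encode the text to bytes, matching the documented bytes type of file_bytes.
csv_bytes = buffer.getvalue().encode("utf-8")

# The resulting bytes could then be passed on as shown above:
# df = create_dataframe(csv_bytes, FileFormat.CSV)
```

Reading an existing file with pathlib.Path(...).read_bytes() would yield the same kind of value.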

from_dataframe_to_type()

handles converting DataFrames to different formats:

from binaryrain_helper_data_processing.dataframe import FileFormat, from_dataframe_to_type

# ....df is a pandas DataFrame

# Convert to CSV bytes

csv_bytes = from_dataframe_to_type(df, FileFormat.CSV)

# Convert with custom options

parquet_bytes = from_dataframe_to_type(df, FileFormat.PARQUET,
                                       file_format_options={'engine': 'pyarrow'})

Parameters:

dataframe: pd.DataFrame | The DataFrame to be converted.
file_format: FileFormat | The format to convert the DataFrame to (e.g., CSV, Parquet).
file_format_options: dict | None | Optional dictionary of options for the file format (e.g., engine for Parquet or compression).

combine_dataframes()

provides a simple way to combine multiple DataFrames:

from binaryrain_helper_data_processing.dataframe import combine_dataframes

# ....df1 and df2 are pandas DataFrames

# Combine DataFrames

combined_df = combine_dataframes(df1, df2, sort=True)

Parameters:

df_one: pd.DataFrame | The first DataFrame to combine.
df_two: pd.DataFrame | The second DataFrame to combine.
sort: bool | Optional boolean to sort the combined DataFrame. Default is False.

convert_to_datetime()

automatically detects and converts all date columns:

Supports common date formats:

%d.%m.%Y (e.g., “31.12.2023”)
%Y-%m-%d (e.g., “2023-12-31”)
%Y-%m-%d %H:%M:%S (e.g., “2023-12-31 23:59:59”)
%Y-%m-%dT%H:%M:%S (ISO format)

If you only want to check specific formats, you can select them manually.

from binaryrain_helper_data_processing.dataframe import convert_to_datetime

# ....df is a pandas DataFrame

# Convert date columns with automatic inference

df = convert_to_datetime(df)

# Try only specific formats:
df = convert_to_datetime(df, date_formats=["%d.%m.%Y", "%Y-%m-%d"])

Parameters:

df: pd.DataFrame | The DataFrame with date columns to be converted.
date_formats: list[str] | None | Optional ordered list of strftime-compatible date format strings to attempt. If None, a built-in set of common formats is used and inference is enabled as a first step.
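
To illustrate what trying an ordered list of formats means, here is a minimal standard-library sketch for a single value. It is not the package's implementation: parse_with_formats is a hypothetical helper, and the format list simply mirrors the four formats listed above.

```python
from datetime import datetime

# The formats listed above, tried in this order.
DATE_FORMATS = [
    "%d.%m.%Y",
    "%Y-%m-%d",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
]


def parse_with_formats(value, date_formats=DATE_FORMATS):
    """Return the first successful parse of value, or None if no format fits."""
    for fmt in date_formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            # This format does not match; try the next one.
            continue
    return None


parsed = parse_with_formats("31.12.2023")
unparsed = parse_with_formats("not a date")
```

Because the first matching format wins, the order of the list matters whenever a value could fit more than one format.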

format_datetime_columns()

formats specific datetime columns:

from binaryrain_helper_data_processing.dataframe import format_datetime_columns

# ....df is a pandas DataFrame

# Format date columns directly

df = format_datetime_columns(df, datetime_columns=['date_column1', 'date_column2'], datetime_format='%Y-%m-%d')

# Format date columns into string columns

df = format_datetime_columns(df, datetime_columns=['date_column1', 'date_column2'], datetime_format='%Y-%m-%d', datetime_string_columns=['string_column1', 'string_column2'])

Parameters:

df: pd.DataFrame | The DataFrame with datetime columns to be formatted.
datetime_columns: list[str] | List of columns to be formatted.
datetime_format: str | The format to apply to the datetime columns.
datetime_string_columns: list[str] | (Optional) List of columns to be formatted as strings. If not provided, the original columns will be replaced with formatted strings.
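
For orientation, applying a format string to a datetime column in plain pandas can be sketched as follows. This only illustrates the kind of formatting involved and is not the package's implementation; the sample data is made up, and the sketch does not attempt to reproduce the datetime_string_columns option.

```python
import pandas as pd

# Hypothetical sample data with one datetime column.
df = pd.DataFrame(
    {"date_column1": pd.to_datetime(["2023-12-31", "2024-01-15"])}
)

# Replace the datetime values with strings in the requested format.
df["date_column1"] = df["date_column1"].dt.strftime("%d.%m.%Y")
```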

clean_dataframe()

cleans DataFrames by removing duplicates and missing values:

from binaryrain_helper_data_processing.dataframe import clean_dataframe

# ....df is a pandas DataFrame

# Clean DataFrame

df = clean_dataframe(df)

Parameters:

df: pd.DataFrame | The DataFrame to be cleaned.
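
Removing duplicates and missing values in plain pandas can be sketched as follows. This is an assumption about the kind of steps involved, not the package's actual implementation, which may differ in order or options; the sample data is made up.

```python
import pandas as pd

# Hypothetical sample data with one duplicate row and one missing value.
raw = pd.DataFrame(
    {
        "name": ["Ada", "Ada", "Grace", None],
        "score": [1.0, 1.0, 2.0, 3.0],
    }
)

# Drop exact duplicate rows, then rows that contain any missing value.
cleaned = raw.drop_duplicates().dropna().reset_index(drop=True)
```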

remove_empty_values()

filters out empty values from specific columns:

from binaryrain_helper_data_processing.dataframe import remove_empty_values

# ....df is a pandas DataFrame

# Remove empty values with defaults

df = remove_empty_values(df, filter_column='column1')

# Remove empty values without dropping NaN or resetting index

df = remove_empty_values(df, filter_column='column1', dropna=False, reset_index=False)

Parameters:

df: pd.DataFrame | The DataFrame to be filtered.
filter_column: str | The column to filter out empty values.
dropna: bool | Optional boolean to drop rows where filter_column is NaN. Default is True.
reset_index: bool | Optional boolean to reset the index in the returned DataFrame. Default is True.
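
The effect of the dropna and reset_index options can be pictured with a plain pandas sketch. It assumes that an "empty" value means an empty string, which the description above does not specify, and filter_empty is a hypothetical stand-in rather than the package's function.

```python
import pandas as pd


def filter_empty(df, filter_column, dropna=True, reset_index=True):
    """Hypothetical stand-in: drop rows whose filter_column is an empty string."""
    # Keep every row whose value is not exactly the empty string.
    result = df[~df[filter_column].eq("")]
    if dropna:
        # Also drop rows where the filter column is NaN/None.
        result = result.dropna(subset=[filter_column])
    if reset_index:
        # Renumber rows 0..n-1 so gaps left by removed rows disappear.
        result = result.reset_index(drop=True)
    return result


sample = pd.DataFrame({"column1": ["a", "", None, "b"], "value": [1, 2, 3, 4]})

with_defaults = filter_empty(sample, "column1")
keep_nan = filter_empty(sample, "column1", dropna=False, reset_index=False)
```

With the defaults, only the rows for "a" and "b" remain and the index is renumbered. With both options disabled, the row holding the missing value is kept and the original index labels are preserved.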

format_numeric_values()

handles locale-specific number formatting:

from binaryrain_helper_data_processing.dataframe import format_numeric_values

# ....df is a pandas DataFrame

# Convert European number format (1.234,56) to standard format (1,234.56)

df = format_numeric_values(
    df,
    columns=['price', 'quantity'],
    swap_separators=True,
    old_decimal_separator=',',
    old_thousands_separator='.',
    decimal_separator='.',
    thousands_separator=',',
)

Parameters:

df: pd.DataFrame | The DataFrame with numeric values to be formatted.
columns: list[str] | List of columns to be formatted.
swap_separators: bool | (Optional) Boolean indicating whether to swap the decimal and thousands separators.
old_decimal_separator: str | (Optional) The old decimal separator to be replaced. The default is ','.
old_thousands_separator: str | (Optional) The old thousands separator to be replaced. The default is '.'.
decimal_separator: str | (Optional) The new decimal separator to be used. The default is '.'.
thousands_separator: str | (Optional) The new thousands separator to be used. The default is ','.
temp_separator: str | (Optional) Temporary separator used during the conversion process. The default is '|'.
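
Why a temporary separator is needed becomes clear when the swap is written out by hand. The following is a minimal string-level sketch of the idea, not the package's implementation. swap_separators_in_text is a hypothetical helper whose defaults mirror the ones documented above.

```python
def swap_separators_in_text(
    text,
    old_decimal_separator=",",
    old_thousands_separator=".",
    decimal_separator=".",
    thousands_separator=",",
    temp_separator="|",
):
    """Hypothetical helper: swap decimal and thousands separators in one string."""
    # Step 1: park the old decimal separator on a placeholder so it is not
    # confused with the new thousands separator in the next step.
    text = text.replace(old_decimal_separator, temp_separator)
    # Step 2: replace the old thousands separator with the new one.
    text = text.replace(old_thousands_separator, thousands_separator)
    # Step 3: turn the placeholder into the new decimal separator.
    return text.replace(temp_separator, decimal_separator)


converted = swap_separators_in_text("1.234,56")
```

Without the placeholder, replacing ',' with '.' first would turn "1.234,56" into "1.234.56", and the two kinds of separator could no longer be told apart.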

format_numeric_to_string()

formats specified columns as locale-style numeric strings with proper rounding and separator handling:

from binaryrain_helper_data_processing.dataframe import format_numeric_to_string

# ....df is a pandas DataFrame

# Format numeric columns to European format (1.234,56)

df = format_numeric_to_string(
    df,
    columns=['price', 'quantity'],
    decimal_separator=',',
    thousands_separator='.',
    old_decimal_separator='.',
    old_thousands_separator=',',
    decimal_places=2
)

# Format with different decimal places

df = format_numeric_to_string(
    df,
    columns=['percentage'],
    decimal_separator='.',
    thousands_separator=',',
    decimal_places=4
)

Parameters:

df: pd.DataFrame | The DataFrame with numeric values to be formatted.
columns: list[str] | List of columns to be formatted.
decimal_separator: str | (Optional) The decimal separator to use. The default is ','.
thousands_separator: str | (Optional) The thousands separator to use. The default is '.'.
old_decimal_separator: str | (Optional) The old decimal separator to replace. The default is '.'.
old_thousands_separator: str | (Optional) The old thousands separator to replace. The default is ','.
temp_separator: str | (Optional) Temporary separator used during the conversion process. The default is '|'.
decimal_places: int | (Optional) Number of decimal places to format to. The default is 2.
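
Rounding to a fixed number of decimal places and applying locale-style separators can be sketched for a single value with Python's built-in format specification. This is an illustration only, assuming the result should look like the European example above. to_locale_string is a hypothetical helper, not part of the package.

```python
def to_locale_string(
    value,
    decimal_separator=",",
    thousands_separator=".",
    decimal_places=2,
    temp_separator="|",
):
    """Hypothetical helper: format a number as a locale-style string."""
    # Built-in formatting yields ',' for thousands and '.' for decimals.
    text = f"{value:,.{decimal_places}f}"
    # Swap to the requested separators via a temporary placeholder.
    text = text.replace(".", temp_separator)
    text = text.replace(",", thousands_separator)
    return text.replace(temp_separator, decimal_separator)


european = to_locale_string(1234.5)
four_places = to_locale_string(
    0.5, decimal_separator=".", thousands_separator=",", decimal_places=4
)
```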