Data Processing
binaryrain_helper_data_processing is a Python package that simplifies common data processing tasks. It builds on the pandas library and adds functionality that makes data processing easier, reduces boilerplate code, and provides clear error messages.
Installation
To install the package, use your preferred Python package manager:
pip install binaryrain-helper-data-processing
uv add binaryrain-helper-data-processing
Supported File Formats
The FileFormat enum specifies the file format when creating or converting DataFrames. The supported formats are:
PARQUET: For efficient columnar storage
CSV: For common tabular data
JSON: For structured data exchange
DICT: For Python dictionary data
Key Functions
create_dataframe()
create_dataframe() returns a pd.DataFrame and simplifies creating pandas DataFrames from various formats:
from binaryrain_helper_data_processing.dataframe import FileFormat, create_dataframe
# Create from CSV bytes
df = create_dataframe(csv_bytes, FileFormat.CSV)
# Create with custom options
df = create_dataframe(parquet_bytes, FileFormat.PARQUET, file_format_options={'engine': 'pyarrow'})
Parameters:
file_bytes (bytes | dict): The bytes of the file to be converted into a DataFrame.
file_format (FileFormat): The format of the file (e.g., CSV, Parquet, JSON, or Dict).
file_format_options (dict | None): Optional dictionary of options for the file format (e.g., engine for Parquet).
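The example above assumes that csv_bytes already exists. As a minimal sketch using only pandas, the following shows what such bytes might look like and the plain-pandas call that would produce a comparable DataFrame. It is an assumption that create_dataframe() does something similar internally; the sample data is made up for illustration.

```python
import io

import pandas as pd

# Illustrative input: CSV content held in memory as bytes,
# for example read from a file opened in binary mode.
csv_bytes = b"name,price\napple,1.5\npear,2.0\n"

# Plain-pandas equivalent (assumption, not the package's actual code):
# wrap the bytes in a file-like object and let pandas parse it.
df = pd.read_csv(io.BytesIO(csv_bytes))
```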
convert_dataframe_to_type()
convert_dataframe_to_type() returns bytes | str | dict and handles converting DataFrames to different formats:
from binaryrain_helper_data_processing.dataframe import FileFormat, convert_dataframe_to_type
# ....df is a pandas DataFrame
# Convert to CSV bytes
csv_bytes = convert_dataframe_to_type(df, FileFormat.CSV)
# Convert with custom options
parquet_bytes = convert_dataframe_to_type(df, FileFormat.PARQUET, file_format_options={'engine': 'pyarrow'})
Parameters:
dataframe (pd.DataFrame): The DataFrame to be converted.
file_format (FileFormat): The format to convert the DataFrame to (e.g., CSV, Parquet).
file_format_options (dict | None): Optional dictionary of options for the file format (e.g., engine for Parquet or compression).
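To make the three possible return types (bytes, str, dict) more concrete, here is a rough sketch of comparable conversions written with pandas alone. Which return type the package produces for which FileFormat is not stated here, so the mapping below is only an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["apple", "pear"], "price": [1.5, 2.0]})

# bytes: CSV text encoded as UTF-8
csv_bytes = df.to_csv(index=False).encode("utf-8")

# str: JSON document with one object per row
json_str = df.to_json(orient="records")

# Python objects: one dictionary per row
records = df.to_dict(orient="records")
```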
combine_dataframes()
combine_dataframes() returns a pd.DataFrame and provides a simple way to combine multiple DataFrames:
from binaryrain_helper_data_processing.dataframe import combine_dataframes
# ....df1 and df2 are pandas DataFrames
# Combine DataFrames
combined_df = combine_dataframes(df1, df2, sort=True)
Parameters:
df_one (pd.DataFrame): The first DataFrame to combine.
df_two (pd.DataFrame): The second DataFrame to combine.
sort (bool): (Optional) Whether to sort the combined DataFrame. The default is False.
convert_to_datetime()
convert_to_datetime() returns a pd.DataFrame and automatically detects and converts all date columns. It supports these common date formats:
%d.%m.%Y (e.g., “31.12.2023”)
%Y-%m-%d (e.g., “2023-12-31”)
%Y-%m-%d %H:%M:%S (e.g., “2023-12-31 23:59:59”)
%Y-%m-%dT%H:%M:%S (ISO format)
If you only want to check specific formats, you can select them manually.
from binaryrain_helper_data_processing.dataframe import convert_to_datetime
# ....df is a pandas DataFrame
# Convert date columns
df = convert_to_datetime(df)
# Check only specific formats:
df = convert_to_datetime(df, ["%d.%m.%Y"])
Parameters:
df (pd.DataFrame): The DataFrame with date columns to be converted.
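The idea of format detection can be illustrated with the standard library alone. The sketch below tries each of the listed formats in turn and returns the first one that parses every value. The helper name detect_format and its behavior are hypothetical and are not taken from the package, which may detect date columns differently.

```python
from datetime import datetime

# The formats listed in this section
DEFAULT_FORMATS = [
    "%d.%m.%Y",
    "%Y-%m-%d",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
]


def detect_format(values, formats=DEFAULT_FORMATS):
    """Return the first format that parses every value, or None (hypothetical helper)."""
    if not values:
        return None
    for fmt in formats:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except (ValueError, TypeError):
            continue  # at least one value does not match this format
        else:
            return fmt
    return None
```

Passing a shorter formats list restricts the check, which mirrors the manual selection described above.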
format_datetime_columns()
format_datetime_columns() returns a pd.DataFrame and formats specific datetime columns:
from binaryrain_helper_data_processing.dataframe import format_datetime_columns
# ....df is a pandas DataFrame
# Format date columns directly
df = format_datetime_columns(df, datetime_columns=['date_column1', 'date_column2'], datetime_format='%Y-%m-%d')
# Write the formatted dates into separate string columns
df = format_datetime_columns(df, datetime_columns=['date_column1', 'date_column2'], datetime_format='%Y-%m-%d', datetime_string_columns=['string_column1', 'string_column2'])
Parameters:
df (pd.DataFrame): The DataFrame with datetime columns to be formatted.
datetime_columns (list[str]): List of columns to be formatted.
datetime_format (str): The format to apply to the datetime columns.
datetime_string_columns (list[str]): (Optional) List of columns to be formatted as strings. If not provided, the original columns will be replaced with formatted strings.
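For comparison, formatting a datetime column into a separate string column looks roughly like this in plain pandas. This is a sketch of the general technique, not the package's implementation, and the column names are placeholders.

```python
import pandas as pd

df = pd.DataFrame({"date_column1": pd.to_datetime(["2023-12-31", "2024-01-15"])})

# Keep the datetime column and write the formatted text into a new string column
df["string_column1"] = df["date_column1"].dt.strftime("%d.%m.%Y")
```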
clean_dataframe()
clean_dataframe() returns a pd.DataFrame and cleans DataFrames by removing duplicates and missing values:
from binaryrain_helper_data_processing.dataframe import clean_dataframe
# ....df is a pandas DataFrame
# Clean DataFrame
df = clean_dataframe(df)
Parameters:
df (pd.DataFrame): The DataFrame to be cleaned.
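Removing duplicates and missing values can be sketched in plain pandas as follows. The exact steps clean_dataframe() performs, and their order, are not documented here, so this is only an approximation with made-up data.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 3], "name": ["a", "a", None, "c"]})

# Drop exact duplicate rows, then drop rows that contain missing values
cleaned = df.drop_duplicates().dropna().reset_index(drop=True)
```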
remove_empty_values()
remove_empty_values() returns a pd.DataFrame and filters out empty values from specific columns:
from binaryrain_helper_data_processing.dataframe import remove_empty_values
# ....df is a pandas DataFrame
# Remove empty values
df = remove_empty_values(df, filter_column='column1')
Parameters:
df (pd.DataFrame): The DataFrame to be filtered.
filter_column (str): The column to filter out empty values.
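A comparable filter in plain pandas might look like the sketch below. It assumes that "empty" means either a missing value or an empty string; the package may define empty values differently.

```python
import pandas as pd

df = pd.DataFrame({"column1": ["x", "", None, "y"], "column2": [1, 2, 3, 4]})

# Keep only rows where column1 is neither missing nor an empty string
mask = df["column1"].notna() & (df["column1"] != "")
filtered = df[mask]
```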
format_numeric_values()
format_numeric_values() returns a pd.DataFrame and handles locale-specific number formatting:
from binaryrain_helper_data_processing.dataframe import format_numeric_values
# ....df is a pandas DataFrame
# Convert European number format (1.234,56) to standard format (1,234.56)
df = format_numeric_values(
    df,
    columns=['price', 'quantity'],
    swap_separators=True,
    old_decimal_separator=',',
    old_thousands_separator='.',
    decimal_separator='.',
    thousands_separator=',',
)
Parameters:
df (pd.DataFrame): The DataFrame with numeric values to be formatted.
columns (list[str]): List of columns to be formatted.
swap_separators (bool): (Optional) Whether to swap the decimal and thousands separators.
old_decimal_separator (str): (Optional) The old decimal separator to be replaced. The default is ",".
old_thousands_separator (str): (Optional) The old thousands separator to be replaced. The default is ".".
decimal_separator (str): (Optional) The new decimal separator to be used. The default is ".".
thousands_separator (str): (Optional) The new thousands separator to be used. The default is ",".
temp_separator (str): (Optional) Temporary separator used during the conversion process. The default is "|".
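The temporary separator matters because the two separators trade places: replacing one directly with the other would make them indistinguishable before the second replacement runs. The following pure-Python sketch shows the idea on a single string. The function swap_separators and its parameter names are hypothetical and only illustrate the technique, not the package's implementation.

```python
def swap_separators(value, old_decimal=",", old_thousands=".",
                    decimal=".", thousands=",", temp="|"):
    """Swap decimal and thousands separators in a numeric string (hypothetical helper)."""
    # 1. Park the old decimal separator on a placeholder
    parked = value.replace(old_decimal, temp)
    # 2. Replace the old thousands separator with the new one
    regrouped = parked.replace(old_thousands, thousands)
    # 3. Turn the placeholder into the new decimal separator
    return regrouped.replace(temp, decimal)


# Without a placeholder, both separators collapse into the same character
naive = "1.234,56".replace(",", ".").replace(".", ",")

converted = swap_separators("1.234,56")
```

Here naive ends up as "1,234,56", while converted is "1,234.56".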