Skip to content

Data Processing

PyPI VersionGithub

binaryrain_helper_data_processing is a python package that aims to simplify and help with common functions in data processing areas. It builds on top of the pandas library and provides additional functionality to make data processing easier, reduces boilerplate code and provides clear error messages.

To install the package you can use your favorite python package manager:

Terminal window
pip install binaryrain-helper-data-processing

Enum FileFormat is used to specify the file format when creating or converting DataFrames. The supported formats include:

  • PARQUET: For efficient columnar storage
  • CSV: For common tabular data
  • JSON: For structured data exchange
  • DICT: For Python dictionary data

pd.DataFrame

simplifies creating pandas DataFrames from various formats:

from binaryrain_helper_data_processing.dataframe import FileFormat, create_dataframe
# Create from CSV bytes
df = create_dataframe(csv_bytes, FileFormat.CSV)
# Create with custom options
df = create_dataframe(parquet_bytes, FileFormat.PARQUET,
file_format_options={'engine': 'pyarrow'})
  • file_bytes: bytes | dict | The bytes of the file to be converted into a DataFrame.
  • file_format: FileFormat | The format of the file (e.g., CSV, Parquet, JSON, or Dict).
  • file_format_options: dict | None | Optional dictionary of options for the file format (e.g., engine for Parquet).

bytes | str | dict

handles converting DataFrames to different formats:

from binaryrain_helper_data_processing.dataframe import FileFormat, convert_dataframe_to_type
# ....df is a pandas DataFrame
# Convert to CSV bytes
csv_bytes = convert_dataframe_to_type(df, FileFormat.CSV)
# Convert with custom options
parquet_bytes = convert_dataframe_to_type(df, FileFormat.PARQUET,
file_format_options={'engine': 'pyarrow'})
  • dataframe: pd.DataFrame | The DataFrame to be converted.
  • file_format: FileFormat | The format to convert the DataFrame to (e.g., CSV, Parquet).
  • file_format_options: dict | None | Optional dictionary of options for the file format (e.g., engine for Parquet or compression).

pd.DataFrame

provides a simple way to combine multiple DataFrames:

from binaryrain_helper_data_processing.dataframe import combine_dataframes
# ....df1 and df2 are pandas DataFrames
# Combine DataFrames
combined_df = combine_dataframes(df1, df2, sort=True)
  • df_one: pd.DataFrame | The first DataFrame to combine.
  • df_two: pd.DataFrame | The second DataFrame to combine.
  • sort: bool | Optional boolean to sort the combine DataFrame. Default is False.

pd.DataFrame

automatically detects and converts all date columns:

Supports common date formats:

  • %d.%m.%Y (e.g., “31.12.2023”)
  • %Y-%m-%d (e.g., “2023-12-31”)
  • %Y-%m-%d %H:%M:%S (e.g., “2023-12-31 23:59:59”)
  • %Y-%m-%dT%H:%M:%S (ISO format)

If you only want to check specific formats, you can select them manually.

from binaryrain_helper_data_processing.dataframe import convert_to_datetime
# ....df is a pandas DataFrame
# Convert date columns
df = convert_to_datetime(df)
# Format only specific formats:
df = convert_to_datetime(df, ["%d.%m.%Y"])
  • df: pd.DataFrame | The DataFrame with date columns to be converted.

pd.DataFrame

formats specific datetime columns:

from binaryrain_helper_data_processing.dataframe import format_datetime_columns
# ....df is a pandas DataFrame
# Format date columns directly
df = format_datetime_columns(df, datetime_columns=['date_column1', 'date_column2'], datetime_format='%Y-%m-%d')
# Format date columns to in string columns
df = format_datetime_columns(df, datetime_columns=['date_column1', 'date_column2'], datetime_format='%Y-%m-%d', datetime_string_columns=['string_column1', 'string_column2'])
  • df: pd.DataFrame | The DataFrame with datetime columns to be formatted.
  • datetime_columns: list[str] | List of columns to be formatted.
  • datetime_format: str | The format to apply to the datetime columns.
  • datetime_string_columns: list[str] | (Optional) List of columns to be formatted as strings. If not provided, the original columns will be replaced with formatted strings.

pd.DataFrame

cleans DataFrames by removing duplicates and missing values:

from binaryrain_helper_data_processing.dataframe import clean_dataframe
# ....df is a pandas DataFrame
# Clean DataFrame
df = clean_dataframe(df)
  • df: pd.DataFrame | The DataFrame to be cleaned.

pd.DataFrame

filters out empty values from specific columns:

from binaryrain_helper_data_processing.dataframe import remove_empty_values
# ....df is a pandas DataFrame
# Remove empty values
df = remove_empty_values(df, filter_column'column1')
  • df: pd.DataFrame | The DataFrame to be filtered.
  • filter_column: str | The column to filter out empty values.

pd.DataFrame

handles locale-specific number formatting:

from binaryrain_helper_data_processing.dataframe import format_numeric_values
# ....df is a pandas DataFrame
# Convert European number format (1.234,56) to standard format (1,234.56)
df = format_numeric_values(
df,
columns=['price', 'quantity'],
swap_separators=True,
old_decimal_separator=',',
old_thousands_separator='.',
decimal_separator='.',
thousands_separator=',',
)
  • df: pd.DataFrame | The DataFrame with numeric values to be formatted.
  • columns: list[str] | List of columns to be formatted.
  • swap_separators: bool | (Optional) Boolean indicating whether to swap the decimal and thousands separators.
  • old_decimal_separator: str | (Optional) The old decimal separator to be replaced. The default is ,.
  • old_thousands_separator: str | (Optional) The old thousands separator to be replaced. The default is ..
  • decimal_separator: str | (Optional) The new decimal separator to be used. The default is ..
  • thousands_separator: str | (Optional) The new thousands separator to be used. The default is ,.
  • temp_separator: str | (Optional) Temporary separator used during the conversion process. The default is |.