Data Processing

binaryrain_helper_data_processing is a Python package that simplifies common data processing tasks. It builds on the pandas library, reduces boilerplate code, and provides clear error messages.

To install the package, use your preferred Python package manager, for example pip:

pip install binaryrain-helper-data-processing

The FileFormat enum specifies the file format when creating or converting DataFrames. The supported formats are:

  • PARQUET: For efficient columnar storage
  • CSV: For common tabular data
  • JSON: For structured data exchange
  • DICT: For Python dictionary data

create_dataframe()

Returns: pd.DataFrame

Simplifies creating pandas DataFrames from various formats:

from binaryrain_helper_data_processing.dataframe import FileFormat, create_dataframe
# Create from CSV bytes
df = create_dataframe(csv_bytes, FileFormat.CSV)
# Create with custom options
df = create_dataframe(parquet_bytes, FileFormat.PARQUET,
                      file_format_options={'engine': 'pyarrow'})
  • file_bytes: bytes | dict | The file contents as bytes (or a dictionary for the DICT format) to be converted into a DataFrame.
  • file_format: FileFormat | The format of the file (e.g., CSV, Parquet, JSON, or Dict).
  • file_format_options: dict | None | Optional dictionary of options for the file format (e.g., engine for Parquet).
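
The examples above assume that csv_bytes already exists. As a minimal sketch, the snippet below builds such bytes by hand and shows what the CSV and DICT paths roughly correspond to in plain pandas. The helper's internals are not shown in this documentation, so the equivalence to pd.read_csv and pd.DataFrame is an assumption, not a description of the actual implementation.

```python
import io

import pandas as pd

# CSV content as bytes, e.g. as read from a file or a storage blob.
csv_bytes = b"id,name\n1,Alice\n2,Bob\n"

# Plain-pandas stand-in for create_dataframe(csv_bytes, FileFormat.CSV).
# Assumption: the helper wraps a reader like this and adds error handling.
df = pd.read_csv(io.BytesIO(csv_bytes))

# For the DICT format the input is a Python dict rather than bytes.
df_from_dict = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})

print(df)
print(df_from_dict)
```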

from_dataframe_to_type()

Returns: bytes | str | dict

Handles converting DataFrames to different formats:

from binaryrain_helper_data_processing.dataframe import FileFormat, from_dataframe_to_type
# ....df is a pandas DataFrame
# Convert to CSV bytes
csv_bytes = from_dataframe_to_type(df, FileFormat.CSV)
# Convert with custom options
parquet_bytes = from_dataframe_to_type(df, FileFormat.PARQUET,
                                       file_format_options={'engine': 'pyarrow'})
  • dataframe: pd.DataFrame | The DataFrame to be converted.
  • file_format: FileFormat | The format to convert the DataFrame to (e.g., CSV, Parquet).
  • file_format_options: dict | None | Optional dictionary of options for the file format (e.g., engine for Parquet or compression).
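
The return type depends on the target format (bytes, str, or dict). The sketch below shows plain-pandas conversions that produce each of these types. Which pandas call and which options (for example index=False or the JSON orientation) the helper actually uses is not stated here, so treat these as assumed counterparts.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})

# Assumed plain-pandas counterparts; the options are illustrative choices.
csv_bytes = df.to_csv(index=False).encode("utf-8")  # bytes
json_str = df.to_json(orient="records")             # str
data_dict = df.to_dict(orient="list")               # dict

print(type(csv_bytes).__name__, type(json_str).__name__, type(data_dict).__name__)
```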

combine_dataframes()

Returns: pd.DataFrame

Provides a simple way to combine two DataFrames:

from binaryrain_helper_data_processing.dataframe import combine_dataframes
# ....df1 and df2 are pandas DataFrames
# Combine DataFrames
combined_df = combine_dataframes(df1, df2, sort=True)
  • df_one: pd.DataFrame | The first DataFrame to combine.
  • df_two: pd.DataFrame | The second DataFrame to combine.
  • sort: bool | Optional boolean to sort the combined DataFrame. Default is False.

convert_to_datetime()

Returns: pd.DataFrame

Automatically detects and converts all date columns. The following common date formats are supported:

  • %d.%m.%Y (e.g., “31.12.2023”)
  • %Y-%m-%d (e.g., “2023-12-31”)
  • %Y-%m-%d %H:%M:%S (e.g., “2023-12-31 23:59:59”)
  • %Y-%m-%dT%H:%M:%S (ISO format)

If you only want to check specific formats, pass them via the date_formats parameter.

from binaryrain_helper_data_processing.dataframe import convert_to_datetime
# ....df is a pandas DataFrame
# Convert date columns with automatic inference
df = convert_to_datetime(df)
# Try only specific formats
df = convert_to_datetime(df, date_formats=["%d.%m.%Y", "%Y-%m-%d"])
  • df: pd.DataFrame | The DataFrame with date columns to be converted.
  • date_formats: list[str] | None | Optional ordered list of strftime-compatible date format strings to attempt. If None, a built-in set of common formats is used and inference is enabled as a first step.
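
To illustrate what "an ordered list of formats to attempt" means, here is a minimal standard-library sketch that tries each format in turn on a single value. The function name parse_with_formats is hypothetical, and the real helper works on whole DataFrame columns and may behave differently in detail.

```python
from datetime import datetime

# The four formats listed above, tried in this order.
DATE_FORMATS = [
    "%d.%m.%Y",
    "%Y-%m-%d",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
]


def parse_with_formats(value, formats=DATE_FORMATS):
    """Return the first successful parse of value, or None if no format fits."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
    return None


print(parse_with_formats("31.12.2023"))
print(parse_with_formats("2023-12-31T23:59:59"))
print(parse_with_formats("not a date"))
```

Because the first matching format wins, the order of the list matters when a value could be read under more than one format.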

format_datetime_columns()

Returns: pd.DataFrame

Formats specific datetime columns:

from binaryrain_helper_data_processing.dataframe import format_datetime_columns
# ....df is a pandas DataFrame
# Format date columns directly
df = format_datetime_columns(df, datetime_columns=['date_column1', 'date_column2'], datetime_format='%Y-%m-%d')
# Format date columns into string columns
df = format_datetime_columns(
    df,
    datetime_columns=['date_column1', 'date_column2'],
    datetime_format='%Y-%m-%d',
    datetime_string_columns=['string_column1', 'string_column2'],
)
  • df: pd.DataFrame | The DataFrame with datetime columns to be formatted.
  • datetime_columns: list[str] | List of columns to be formatted.
  • datetime_format: str | The format to apply to the datetime columns.
  • datetime_string_columns: list[str] | (Optional) List of columns to be formatted as strings. If not provided, the original columns will be replaced with formatted strings.
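
For orientation, the sketch below shows the plain-pandas operation that such formatting corresponds to: turning a datetime column into strings with a strftime pattern. That the helper uses .dt.strftime internally is an assumption; the column name is taken from the example above.

```python
import pandas as pd

df = pd.DataFrame(
    {"date_column1": pd.to_datetime(["2023-12-31 23:59:59", "2024-01-01 08:00:00"])}
)

# Replace the datetime column with strings in the requested format.
df["date_column1"] = df["date_column1"].dt.strftime("%d.%m.%Y")

print(df["date_column1"].tolist())
```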

clean_dataframe()

Returns: pd.DataFrame

Cleans DataFrames by removing duplicates and missing values:

from binaryrain_helper_data_processing.dataframe import clean_dataframe
# ....df is a pandas DataFrame
# Clean DataFrame
df = clean_dataframe(df)
  • df: pd.DataFrame | The DataFrame to be cleaned.
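
A rough plain-pandas sketch of this kind of cleaning is shown below. The order of the two steps, whether rows with any or only all-missing values are dropped, and the index reset are assumptions made for illustration; the documentation above does not specify them.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, None], "name": ["a", "a", "b", "c"]})

# Drop exact duplicate rows, then rows that contain any missing value.
# Resetting the index is an illustrative choice, not documented behaviour.
cleaned = df.drop_duplicates().dropna().reset_index(drop=True)

print(cleaned)
```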

remove_empty_values()

Returns: pd.DataFrame

Filters out rows with empty values in a specific column:

from binaryrain_helper_data_processing.dataframe import remove_empty_values
# ....df is a pandas DataFrame
# Remove empty values with defaults
df = remove_empty_values(df, filter_column='column1')
# Remove empty values without dropping NaN or resetting index
df = remove_empty_values(df, filter_column='column1', dropna=False, reset_index=False)
  • df: pd.DataFrame | The DataFrame to be filtered.
  • filter_column: str | The column to filter out empty values.
  • dropna: bool | Optional boolean to drop rows where filter_column is NaN. Default is True.
  • reset_index: bool | Optional boolean to reset the index in the returned DataFrame. Default is True.
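
The interaction of dropna and reset_index can be sketched in plain pandas as follows. The function remove_empty_sketch is a hypothetical stand-in, and it assumes that "empty" means an empty string; the real helper may use a broader definition.

```python
import pandas as pd

df = pd.DataFrame({"column1": ["a", "", None, "b"], "other": [1, 2, 3, 4]})


def remove_empty_sketch(frame, filter_column, dropna=True, reset_index=True):
    """Hypothetical stand-in: drop rows whose filter_column is empty."""
    result = frame[frame[filter_column] != ""]  # drop empty strings
    if dropna:
        result = result.dropna(subset=[filter_column])  # drop missing values
    if reset_index:
        result = result.reset_index(drop=True)  # renumber rows from 0
    return result


print(remove_empty_sketch(df, "column1"))
print(remove_empty_sketch(df, "column1", dropna=False, reset_index=False))
```

With dropna=False and reset_index=False, the row holding the missing value is kept and the surviving rows retain their original index labels.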

format_numeric_values()

Returns: pd.DataFrame

⚠️ DEPRECATED: This function is deprecated. Use format_numeric_to_string() instead.

Handles locale-specific number formatting:

from binaryrain_helper_data_processing.dataframe import format_numeric_values
# ....df is a pandas DataFrame
# Convert European number format (1.234,56) to standard format (1,234.56)
df = format_numeric_values(
    df,
    columns=['price', 'quantity'],
    swap_separators=True,
    old_decimal_separator=',',
    old_thousands_separator='.',
    decimal_separator='.',
    thousands_separator=',',
)
  • df: pd.DataFrame | The DataFrame with numeric values to be formatted.
  • columns: list[str] | List of columns to be formatted.
  • swap_separators: bool | (Optional) Boolean indicating whether to swap the decimal and thousands separators.
  • old_decimal_separator: str | (Optional) The old decimal separator to be replaced. The default is ",".
  • old_thousands_separator: str | (Optional) The old thousands separator to be replaced. The default is ".".
  • decimal_separator: str | (Optional) The new decimal separator to be used. The default is ".".
  • thousands_separator: str | (Optional) The new thousands separator to be used. The default is ",".
  • temp_separator: str | (Optional) Temporary separator used during the conversion process. The default is "|".
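
The temp_separator exists because the two separators trade places: replacing one directly would make it indistinguishable from the other. The following standard-library sketch shows the idea on a single string. The function swap_separators is hypothetical and only illustrates the technique, not the library's actual code.

```python
def swap_separators(text, old_decimal=",", old_thousands=".",
                    new_decimal=".", new_thousands=",", temp="|"):
    """Swap separators in a numeric string, e.g. '1.234,56' -> '1,234.56'."""
    # Park the old decimal separator on a placeholder so it is not clobbered
    # when the thousands separator is rewritten to the same character.
    text = text.replace(old_decimal, temp)
    text = text.replace(old_thousands, new_thousands)
    return text.replace(temp, new_decimal)


print(swap_separators("1.234,56"))  # 1,234.56
```

The placeholder must be a character that never occurs in the data, which is why it is configurable.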

format_numeric_to_string()

Returns: pd.DataFrame

Formats specified columns as locale-style numeric strings with proper rounding and separator handling:

from binaryrain_helper_data_processing.dataframe import format_numeric_to_string
# ....df is a pandas DataFrame
# Format numeric columns to European format (1.234,56)
df = format_numeric_to_string(
    df,
    columns=['price', 'quantity'],
    decimal_separator=',',
    thousands_separator='.',
    old_decimal_separator='.',
    old_thousands_separator=',',
    decimal_places=2,
)
# Format with different decimal places
df = format_numeric_to_string(
    df,
    columns=['percentage'],
    decimal_separator='.',
    thousands_separator=',',
    decimal_places=4,
)
  • df: pd.DataFrame | The DataFrame with numeric values to be formatted.
  • columns: list[str] | List of columns to be formatted.
  • decimal_separator: str | (Optional) The decimal separator to use. The default is ",".
  • thousands_separator: str | (Optional) The thousands separator to use. The default is ".".
  • old_decimal_separator: str | (Optional) The old decimal separator to replace. The default is ".".
  • old_thousands_separator: str | (Optional) The old thousands separator to replace. The default is ",".
  • temp_separator: str | (Optional) Temporary separator used during the conversion process. The default is "|".
  • decimal_places: int | (Optional) Number of decimal places to format to. The default is 2.
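
To make the combination of rounding and separator handling concrete, here is a minimal standard-library sketch for a single number. The function to_locale_string is hypothetical; it assumes the value is first rendered with Python's standard format specification and the separators are swapped afterwards, which may differ from how the library does it.

```python
def to_locale_string(value, decimal_separator=",", thousands_separator=".",
                     decimal_places=2, temp="|"):
    """Format a number with custom separators and a fixed number of decimals."""
    # Python's format spec always emits "," for thousands and "." for decimals.
    text = f"{value:,.{decimal_places}f}"
    # Swap via a placeholder so the two replacements do not interfere.
    text = text.replace(".", temp)
    text = text.replace(",", thousands_separator)
    return text.replace(temp, decimal_separator)


print(to_locale_string(1234.5))  # 1.234,50
print(to_locale_string(0.123456, decimal_separator=".",
                       thousands_separator=",", decimal_places=4))  # 0.1235
```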