Module milanesas.eda_helper

Functions

explode_pie(pie_size): Generates a list of values to explode slices of a pie chart.

Creates a list of random values between 0.01 and 0.05, suitable for visually exploding slices of a pie chart. The number of values in the list is determined by the pie_size argument.

Args:
    pie_size: An integer representing the number of slices in the pie chart.

Returns:
    A list of floating-point values between 0.01 and 0.05, with a length
    equal to `pie_size`.

Example:
    >>> import pandas as pd
    >>> imp_df = pd.DataFrame({'A': [10, 20, 30]})
    >>> explode_values = explode_pie(imp_df.size)
    >>> print(explode_values)  # Example output: [0.03546542, 0.01237543, 0.04892357]
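
A minimal sketch of how such a list could be generated, assuming a uniform draw per slice. The name `explode_pie_sketch` is hypothetical and this is not necessarily how the module implements `explode_pie`.

```python
import random


def explode_pie_sketch(pie_size):
    """Return `pie_size` random offsets in [0.01, 0.05] (hypothetical helper)."""
    # random.uniform(a, b) draws a float N with a <= N <= b.
    return [random.uniform(0.01, 0.05) for _ in range(pie_size)]


explode_values = explode_pie_sketch(3)
print(len(explode_values))  # 3
```

A list like this can be passed as the `explode` argument of `matplotlib.pyplot.pie`, with one offset per slice.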

get_column_uniques(df, col): Prints unique values in a DataFrame column, handling semicolon-separated lists.

Prints the unique values found within a specified column of a DataFrame. Treats semicolon-separated values within cells as individual elements.

Args:
    df (pandas.DataFrame): The DataFrame to analyze.
    col (str): The name of the column to extract unique values from.

Example:
    >>> import pandas as pd
    >>> df = pd.DataFrame({'exp_en_IT': ['A;B;C', 'A;B', 'D']})
    >>> get_column_uniques(df, "exp_en_IT")
    {'A', 'B', 'C', 'D'}

get_normal_uniques_col_count(df, col): Counts occurrences of unique values (including those within semicolon-separated lists), normalizing counts by row count.

Calculates the count of each unique value within a specified column of a DataFrame, handling cases where cells contain multiple values separated by semicolons. Normalizes the counts by dividing them by the total number of rows in the DataFrame.

Args:
    df (pandas.DataFrame): The input DataFrame.
    col (str): The name of the column to analyze.

Returns:
    dict: A dictionary where keys represent unique values from the column and values
        represent their normalized counts (fraction of total rows).

Example:
    >>> import pandas as pd
    >>> df = pd.DataFrame({'educacion': ['A;B', 'A', 'A;C', 'B']})
    >>> normalized_counts = get_normal_uniques_col_count(df, "educacion")
    >>> print(normalized_counts)
    {'A': 0.75, 'B': 0.5, 'C': 0.25}
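
The normalization described above can be sketched as follows. This is an illustration under stated assumptions, not the module's source: `normalized_counts_sketch` is a hypothetical name, missing cells are skipped, and the divisor is the total number of rows.

```python
from collections import Counter

import pandas as pd


def normalized_counts_sketch(df, col):
    """Count semicolon-separated values in `col`, divided by the row count (hypothetical helper)."""
    counts = Counter()
    for cell in df[col].dropna():
        # Split each cell on ';' and ignore empty fragments and stray whitespace.
        counts.update(part.strip() for part in str(cell).split(";") if part.strip())
    total_rows = len(df)
    return {value: n / total_rows for value, n in counts.items()}


df = pd.DataFrame({"educacion": ["A;B", "A", "A;C", "B"]})
print(normalized_counts_sketch(df, "educacion"))
# {'A': 0.75, 'B': 0.5, 'C': 0.25}
```

Because one row can contribute several values, the fractions need not sum to 1.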

get_percentage(value): Formats a value as a percentage string.

Converts a numerical value into a percentage representation, rounded to the nearest integer, and returns it as a formatted string with a percentage sign.

Args:
    value (float): The numerical value to convert to a percentage.

Returns:
    str: The formatted percentage string (e.g., "42%").

Example:
    >>> percentage_string = get_percentage(0.4235)
    >>> print(percentage_string)  # Output: "42%"
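
A small sketch of the formatting step, assuming the input is a fraction of 1 (as in the example above) that is scaled by 100 before rounding. `get_percentage_sketch` is a hypothetical name.

```python
def get_percentage_sketch(value):
    """Format a fraction such as 0.4235 as a whole-number percentage string (hypothetical helper)."""
    # Assumption: `value` is a fraction of 1, so it is scaled by 100 before rounding.
    return f"{round(value * 100)}%"


print(get_percentage_sketch(0.4235))  # 42%
```

Python's built-in `round` rounds exact halves to the nearest even integer, so under this sketch `0.125` would be formatted as `12%` rather than `13%`.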

get_uniques_col_count(df, col): Counts occurrences of unique values (including those within semicolon-separated lists).

Calculates the count of each unique value within a specified column of a DataFrame, handling cases where cells contain multiple values separated by semicolons.

Args:
    df (pandas.DataFrame): The input DataFrame.
    col (str): The name of the column to analyze.

Returns:
    dict: A dictionary where keys represent unique values from the column and values
        represent their counts.

Example:
    >>> import pandas as pd
    >>> df = pd.DataFrame({'educacion': ['A;B', 'A', 'A;C', 'B']})
    >>> counts = get_uniques_col_count(df, "educacion")
    >>> print(counts)
    {'A': 3, 'B': 2, 'C': 1}

make_custom_horizontal_bar(df, col, titulo, x_label, y_label, legend): Creates a horizontal bar chart from a pre-formatted DataFrame.

Generates a horizontal bar chart from a DataFrame that's already been prepared with specific column names ("Category" for categories and "count" for values).

Args:
    df (pandas.DataFrame): The input DataFrame, containing a 'Category' column
                            and a 'count' column.
    col (str): Unused in this function, but kept for consistency with other
                charting functions.
    titulo (str): The title of the chart.
    x_label (str): The label for the x-axis.
    y_label (str): The label for the y-axis.
    legend (bool): True to display a legend, False to hide it.

Example:
    >>> import pandas as pd
    >>> df = pd.DataFrame({'Category': ['A', 'B', 'C', 'D'], 'count': [4, 2, 3, 1]})
    >>> make_custom_horizontal_bar(df, "col", "Carreras o especialidades", "Total", "Carreras / Especialidades", False)

make_dataframe(df, col, cat_col, count_col):

make_df(df, col, x_label, y_label): Creates a DataFrame counting occurrences of unique values (including those within semicolon-separated lists).

Constructs a new DataFrame that tallies the number of occurrences of each unique value within a specified column of a given DataFrame. Handles cases where cells contain multiple values separated by semicolons.

Args:
    df (pandas.DataFrame): The input DataFrame.
    col (str): The name of the column to analyze.
    x_label (str): The label for the column containing unique values in the output DataFrame.
    y_label (str): The label for the column containing counts in the output DataFrame.

Returns:
    pandas.DataFrame: A new DataFrame with two columns:
        - x_label: Contains the unique values from the specified column.
        - y_label: Contains the counts of those values.

Example:
    >>> import pandas as pd
    >>> df = pd.DataFrame({'educacion': ['A;B', 'A', 'C;B', 'D']})
    >>> new_df = make_df(df, "educacion", "categories", "count")
    >>> print(new_df)
       categories  count
    0           A      2
    1           B      2
    2           C      1
    3           D      1
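
One possible way to build such a table with pandas is sketched below. It assumes the values are split with `Series.str.split`, flattened with `Series.explode`, and sorted alphabetically; `make_df_sketch` is a hypothetical name, and the module may order or build its result differently.

```python
import pandas as pd


def make_df_sketch(df, col, x_label, y_label):
    """Tally semicolon-separated values in `col` into a two-column DataFrame (hypothetical helper)."""
    # Split each cell into a list, then give every list element its own row.
    values = df[col].dropna().str.split(";").explode().str.strip()
    # Sort by value so the row order is deterministic.
    counts = values.value_counts().sort_index()
    return pd.DataFrame({x_label: counts.index.tolist(), y_label: counts.tolist()})


df = pd.DataFrame({"educacion": ["A;B", "A", "C;B", "D"]})
new_df = make_df_sketch(df, "educacion", "categories", "count")
print(new_df["categories"].tolist())  # ['A', 'B', 'C', 'D']
print(new_df["count"].tolist())       # [2, 2, 1, 1]
```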

make_horizontal_bar(df, col, titulo, x_label, y_label, legend): Creates a horizontal bar chart for a specified column in a DataFrame.

Generates a horizontal bar chart that visualizes the counts of unique values within a given column of a DataFrame.

Args:
    df (pandas.DataFrame): The input DataFrame.
    col (str): The name of the column to visualize.
    titulo (str): The title of the chart.
    x_label (str): The label for the x-axis.
    y_label (str): The label for the y-axis.
    legend (bool): True to display a legend, False to hide it.

Example:
    >>> import pandas as pd
    >>> df = pd.DataFrame({'carr_especialidades': ['A', 'B', 'A', 'C', 'B']})
    >>> make_horizontal_bar(df, "carr_especialidades", "Carreras o especialidades", "Total", "Carreras / Especialidades", False)

make_horizontal_grouped_chart(df, g1, g2, col, labels, config): Creates a horizontal grouped bar chart comparing values between two groups.

Generates a horizontal bar chart with two sets of bars, one for each group (g1 and g2), comparing their counts for unique values in a specified column. Labels, title, and other chart elements are customized using a configuration dictionary.

Args:
    df (pandas.DataFrame): The DataFrame containing the data.
    g1 (pandas.DataFrame): A subset of the DataFrame representing the first group.
    g2 (pandas.DataFrame): A subset of the DataFrame representing the second group.
    col (str): The name of the column to compare values for.
    labels (list): A list of unique values from the column to use as labels.
    config (dict): A configuration dictionary with keys:
        - title (str): The title of the chart.
        - c1_label (str): The label for the first group's bars.
        - c2_label (str): The label for the second group's bars.
        - xlabel (str): The label for the x-axis.
        - ylabel (str): The label for the y-axis.

Raises:
    ValueError: If the specified column does not exist in the DataFrame.

Example:
    >>> import pandas as pd
    >>> df = pd.DataFrame({'exp_en_IT': ['A', 'B', 'A', 'C', 'B'], 'gender': ['MAN', 'WOMAN', 'MAN', 'MAN', 'WOMAN']})
    >>> gen = df.groupby('gender')
    >>> group_config = {
    ...     'title': "exp_en_IT by Gender",
    ...     'c1_label': "MAN",
    ...     'c2_label': "WOMAN",
    ...     'xlabel': "Count",
    ...     'ylabel': "exp_en_IT level"
    ... }
    >>> make_horizontal_grouped_chart(df, gen.get_group("MAN"), gen.get_group("WOMAN"), "exp_en_IT", df["exp_en_IT"].unique(), group_config)
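
The grouped layout can be sketched with matplotlib by offsetting two `barh` calls around shared positions. This is an assumption about the general technique, not the module's code: `grouped_barh_sketch` is a hypothetical name, and the bar thickness and non-interactive backend are illustrative choices.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def grouped_barh_sketch(g1, g2, col, labels, config):
    """Draw two sets of horizontal bars side by side (hypothetical helper)."""
    if col not in g1.columns or col not in g2.columns:
        raise ValueError(f"Column {col!r} not found in both groups")
    labels = list(labels)
    # Count each label per group; labels missing from a group get 0.
    c1 = g1[col].value_counts().reindex(labels, fill_value=0)
    c2 = g2[col].value_counts().reindex(labels, fill_value=0)

    y = np.arange(len(labels))
    height = 0.4  # thickness of each bar (illustrative value)
    fig, ax = plt.subplots()
    # Shift the two series in opposite directions so they sit next to each other.
    ax.barh(y - height / 2, c1.to_numpy(), height, label=config["c1_label"])
    ax.barh(y + height / 2, c2.to_numpy(), height, label=config["c2_label"])
    ax.set_yticks(y)
    ax.set_yticklabels(labels)
    ax.set_title(config["title"])
    ax.set_xlabel(config["xlabel"])
    ax.set_ylabel(config["ylabel"])
    ax.legend()
    return fig, ax


df = pd.DataFrame({
    "exp_en_IT": ["A", "B", "A", "C", "B"],
    "gender": ["MAN", "WOMAN", "MAN", "MAN", "WOMAN"],
})
gen = df.groupby("gender")
group_config = {
    "title": "exp_en_IT by Gender",
    "c1_label": "MAN",
    "c2_label": "WOMAN",
    "xlabel": "Count",
    "ylabel": "exp_en_IT level",
}
fig, ax = grouped_barh_sketch(
    gen.get_group("MAN"), gen.get_group("WOMAN"),
    "exp_en_IT", df["exp_en_IT"].unique(), group_config,
)
```

Reindexing both count Series against the same `labels` keeps the bars aligned even when a value is absent from one group.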

make_normalized_df(df, col): Creates a DataFrame with normalized counts of unique values, handling semicolon-separated lists.

Constructs a new DataFrame that displays the percentage of occurrences for each unique value within a specified column of a given DataFrame. Values in cells can be separated by semicolons, and each unique value within a semicolon-separated list is counted separately.

Args:
    df (pandas.DataFrame): The input DataFrame.
    col (str): The name of the column to analyze.

Returns:
    pandas.DataFrame: A new DataFrame with two columns:
        - categories: Contains the unique values from the specified column.
        - total count: Contains the percentage of occurrences for each unique value.

Example:
    >>> import pandas as pd
    >>> df = pd.DataFrame({'imp_ed_formal': ['A;B', 'A', 'C;B', 'A;B']})
    >>> normalized_counts = make_normalized_df(df, "imp_ed_formal")
    >>> print(normalized_counts)
                total count
    categories
    A                  75.0
    B                  75.0
    C                  25.0

make_vertical_grouped_chart(df, g1, g2, col, labels, config): Creates a vertical grouped bar chart comparing values between two groups.

Generates a vertical bar chart with two sets of bars, one for each group (g1 and g2), comparing their counts for unique values in a specified column. Labels, title, and other chart elements are customized using a configuration dictionary.

Args:
    df (pandas.DataFrame): The DataFrame containing the data.
    g1 (pandas.DataFrame): A subset of the DataFrame representing the first group.
    g2 (pandas.DataFrame): A subset of the DataFrame representing the second group.
    col (str): The name of the column to compare values for.
    labels (list): A list of unique values from the column to use as labels.
    config (dict): A configuration dictionary with keys:
        - title (str): The title of the chart.
        - c1_label (str): The label for the first group's bars.
        - c2_label (str): The label for the second group's bars.
        - xlabel (str): The label for the x-axis.
        - ylabel (str): The label for the y-axis.

Raises:
    ValueError: If the specified column does not exist in the DataFrame.

Example:
    >>> import pandas as pd
    >>> df = pd.DataFrame({'edad_actual': [25, 30, 30, 25, 35], 'gender': ['MAN', 'WOMAN', 'MAN', 'MAN', 'WOMAN']})
    >>> gen = df.groupby('gender')
    >>> group_config = {
    ...     'title': "edad_actual by Gender",
    ...     'c1_label': "Hombres",
    ...     'c2_label': "Mujeres",
    ...     'xlabel': "edad_actual level",
    ...     'ylabel': "Count"
    ... }
    >>> make_vertical_grouped_chart(df, gen.get_group("MAN"), gen.get_group("WOMAN"), "edad_actual", df["edad_actual"].unique(), group_config)

percentage_to_normal(val): Formats a Series of fractional values as percentage strings.

Multiplies each value in the Series by 100, rounds it to one decimal place, and appends a percentage sign. Each element of the result is a string.

Args:
    val (pandas.Series): A Series containing numerical values.

Returns:
    pandas.Series: A Series with the same index as the input, but containing
    formatted percentage strings.

Example:
    >>> import pandas as pd
    >>> s = pd.Series([0.1234, 0.5678, 0.9012])
    >>> formatted_percentages = percentage_to_normal(s)
    >>> print(formatted_percentages)
    0    12.3 %
    1    56.8 %
    2    90.1 %
    dtype: object
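
A minimal sketch of this formatting, assuming the input holds fractions of 1 as in the example above. `percentage_to_normal_sketch` is a hypothetical name.

```python
import pandas as pd


def percentage_to_normal_sketch(val):
    """Format fractions in a Series as strings like '12.3 %' (hypothetical helper)."""
    # Scale to percent, then format each element with one decimal place.
    return (val * 100).map(lambda x: f"{x:.1f} %")


s = pd.Series([0.1234, 0.5678, 0.9012])
print(percentage_to_normal_sketch(s).tolist())
# ['12.3 %', '56.8 %', '90.1 %']
```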

print_column_uniques(df, col):

random_hex():

replace_column_content(df, col, repl): Replaces values in a DataFrame column using a replacement dictionary.

Modifies a DataFrame column in place by replacing values according to a provided dictionary that maps original values to their replacements. Dictionary keys can be regular expressions for more flexible matching.

Args:
    df (pandas.DataFrame): The DataFrame to modify.
    col (str): The name of the column to modify.
    repl (dict): A dictionary containing replacement mappings, where keys
        represent original values and values represent their replacements.

Raises:
    ValueError: If the specified column does not exist in the DataFrame.

Example:
    >>> import pandas as pd
    >>> df = pd.DataFrame({'genero': ['HOMBRE', 'MUJER', 'NO COMPARTO']})
    >>> gen_repl = {
    ...     "HOMBRE": "MAN",
    ...     "MUJER": "WOMAN",
    ...     "NO COMPARTO": "DONT SHARE",
    ... }
    >>> replace_column_content(df, "genero", gen_repl)
    >>> print(df)
           genero
    0         MAN
    1       WOMAN
    2  DONT SHARE
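
The regular-expression matching mentioned above can be sketched with `Series.replace` and `regex=True`. This is an assumption about one way to get that behavior, not the module's code, and `replace_column_content_sketch` is a hypothetical name.

```python
import pandas as pd


def replace_column_content_sketch(df, col, repl):
    """Replace values in `df[col]` in place using a regex-enabled mapping (hypothetical helper)."""
    if col not in df.columns:
        raise ValueError(f"Column {col!r} not found in DataFrame")
    # regex=True treats each dictionary key as a regular expression pattern.
    df[col] = df[col].replace(repl, regex=True)


df = pd.DataFrame({"genero": ["HOMBRE", "MUJER", "NO COMPARTO"]})
gen_repl = {
    "^HOMBRE$": "MAN",
    "^MUJER$": "WOMAN",
    "^NO .*": "DONT SHARE",
}
replace_column_content_sketch(df, "genero", gen_repl)
print(df["genero"].tolist())
# ['MAN', 'WOMAN', 'DONT SHARE']
```

Anchoring the patterns with `^` and `$` prevents partial matches: an unanchored key such as `"MAN"` would also match inside `"WOMAN"`.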