~/ Polars-Analysis-Toolkit (Custom Utility Package)

> tech_stack : Polars, SciPy, NumPy

Developed a specialized Python package designed to streamline the data analysis workflow with Polars. Data exploration often requires juggling multiple libraries (NumPy, Matplotlib, Polars) with inconsistent syntaxes and manual type conversions, leading to a fragmented workflow and a lot of googling. As I grew tired of it, I built a unified wrapper that centralizes the most common statistical tools.

The project idea :

This project is a self-built package designed to solve a personal frustration. I love doing Data Analysis, and I love the Polars library. But the thing is: to conduct a full data analysis on a dataset, you need to import at least 2 or 3 additional libraries.
As soon as you move beyond the mandatory yet limited "statistical description", the syntax and the data types expected by the packages you just pip-installed change.

So, I wanted to make things easier.

The pain points I tried to solve :

  • No more syntactic breaks: The writing logic should remain the same throughout the entire analysis.
  • Unclear docstrings: A mathematical function shouldn't just state what it expects and what it returns; no one remembers formulas by heart. It should explain what it does, why, and when it is appropriate to use it.
  • Full guidance: Easier syntax helps, but not everyone has a master's degree in statistics (I don't), so I wrote complete documentation on how to conduct a full analysis. It's written in French and works as a "what to do and in which order" guide, from the descriptive part to the modeling part.

A concrete example :

You just have one question to answer:

  • Am I analyzing a univariate dimension?

If yes, just use this wrapper :

import polars as pl
from polars_stats import Univariate

df = pl.read_csv('revenue_by_whatever.csv', schema_overrides={"this_annoying_weird_id_column": pl.Utf8})

uv = Univariate(df["revenue"])
uv.mean()  # Not a revolution so far
uv.shapiro_wilk()  # Now you get the idea
uv.ci_mean_bootstrap()  # OH WOW

# Bonus : 
uv.which_test("normality")  # tells you which test to use !!
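For what it's worth, a helper like which_test can be as simple as a lookup table from the statistical question to a recommended test. This is a hypothetical sketch, not the package's actual implementation; the real version may also look at sample size or dtype:

```python
# Hypothetical sketch of a which_test helper: a plain lookup table
# mapping a statistical question to a recommended test plus a hint.
_TEST_GUIDE = {
    "normality": ("shapiro_wilk", "use for n < 5000; otherwise prefer a Q-Q plot"),
    "variance_equality": ("levene", "robust to non-normality, unlike bartlett"),
    "median_difference": ("mann_whitney_u", "non-parametric, two independent samples"),
}

def which_test(question: str) -> str:
    """Return the recommended test name and a one-line usage hint."""
    try:
        test, hint = _TEST_GUIDE[question]
    except KeyError:
        raise ValueError(f"unknown question: {question!r}") from None
    return f"{test} ({hint})"

print(which_test("normality"))
```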

If no, just use this wrapper :

mv = Multivariate(df, ["revenue", "users", "sessions"])
mv.correlation_matrix()
mv.ols(target="revenue")
mv.pca(n_components=2)
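Under the hood, a wrapper like this can be little more than a thin class that stores the data once and forwards each call to the standalone module functions. A minimal sketch of that delegation pattern, with a made-up _mean function and a plain list standing in for the package's real functions and pl.Series:

```python
# Sketch of the wrapper pattern: store the data once, then forward
# attribute access to free functions via __getattr__.
# `_mean` is a stand-in for the package's real univariate functions.
from functools import partial

def _mean(values):
    return sum(values) / len(values)

_FUNCTIONS = {"mean": _mean}

class Univariate:
    def __init__(self, data):
        self._data = list(data)

    def __getattr__(self, name):
        # Bind the stored data as the first argument of the free function
        try:
            return partial(_FUNCTIONS[name], self._data)
        except KeyError:
            raise AttributeError(name) from None

uv = Univariate([1.0, 2.0, 3.0])
print(uv.mean())  # 2.0
```

The appeal of this design is that every statistical function stays a plain, individually documented function, while the wrapper only adds the fluent `uv.method()` syntax on top.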

But wrappers are there to build an analysis pipeline; for exploration, I suggest calling the functions directly, as the docstrings are, I believe, very useful.

def pearson(a: pl.Series, b: pl.Series, detail: bool = False) -> float | dict:
    """
    Pearson correlation coefficient between two variables.

    Measures the strength and direction of the LINEAR relationship.
    Ranges from -1 (perfect inverse) to +1 (perfect direct), 0 = no linear relationship.

    Assumes:
    - Both variables are continuous
    - The relationship is approximately linear
    - Both variables are approximately normal (for the p-value to be valid)

    Interpretation:
    - |r| < 0.3  : weak
    - |r| 0.3-0.7: moderate
    - |r| > 0.7  : strong

    :param a: first series
    :param b: second series
    :param detail: if True, returns r + p-value
    :return: correlation coefficient (float) or dict with r and p-value
    """
    stats = _require("scipy.stats")
    # Drop rows where either value is null, so the two series stay paired
    paired = pl.DataFrame({"a": a, "b": b}).drop_nulls()

    r, pvalue = stats.pearsonr(paired["a"].to_numpy(), paired["b"].to_numpy())
    if not detail:
        return float(r)
    else:
        return {
            "r": float(r),
            "pvalue": float(pvalue),
        }

The repo is organized in a very simple way; you can probably just browse it to find what you need :

polars_stats/
├── __init__.py                  # Exports Univariate, Multivariate
├── _utils.py                    # _require() for dependency injection
├── wrappers.py                  # Univariate and Multivariate classes
├── univariate/                  # All the univariate functions (with docstrings)
│   ├── __init__.py
│   ├── descriptive.py
│   ├── tests.py
│   ├── distribution.py
│   └── inference.py
└── multivariate/                # All the multivariate functions (with docstrings too)
    ├── __init__.py
    ├── descriptive.py
    ├── comparison.py
    ├── correlation.py
    ├── regression.py
    ├── dimension.py
    └── tests.py

Why not just use SciPy ?

SciPy is actually a dependency of the project; the functions just handle the Polars-to-NumPy conversion for you.

Repo & conclusion :

The whole repo is available here
The README will guide you through installation and specifics.
