Contributor

@raulk raulk commented Jan 14, 2026

Summary

Migrates all data transformation logic from pandas to polars for improved performance and more expressive data manipulation.

Base branch: feat/network-overview (PR #43)

Changes

  • Add polars>=1.0 to dependencies
  • Migrate all 9 notebooks to use polars for data transformations
  • Keep pandas only where needed for plotly compatibility (.to_pandas())

Migration pattern

import plotly.express as px
import polars as pl

# Before (pandas)
df = load_parquet("dataset", target_date)
df_grouped = df.groupby("col").agg({"value": "sum"})

# After (polars): load_parquet still returns a pandas DataFrame, so convert on load
df = pl.from_pandas(load_parquet("dataset", target_date))
df_grouped = df.group_by("col").agg(pl.col("value").sum())
fig = px.bar(df_grouped.to_pandas(), ...)  # Convert for plotly

Key polars patterns used

pandas                  polars
df.groupby().agg()      df.group_by().agg()
df[df["col"] > 0]       df.filter(pl.col("col") > 0)
df["col"] = ...         df.with_columns(...)
df.sort_values()        df.sort()
df.drop_duplicates()    df.unique()
df.fillna()             df.fill_null()
df["col"].map({...})    pl.when().then().otherwise()
df.merge()              df.join()
df.pivot()              df.pivot()
df.melt()               df.unpivot()

Benefits

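  • Performance: Polars runs on a multi-threaded, columnar engine and is typically faster than pandas on large datasets
  • Memory efficiency: Arrow-based columnar memory, with optional lazy evaluation (LazyFrame) that lets the query optimizer skip unused columns and rows
  • Expressiveness: Chainable expression API with clear intent
  • Type safety: Strict dtypes and a dedicated null value, distinct from NaN, make missing-data handling and type conversions explicit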

Test plan

  • Run just fetch <date> to fetch data
  • Run just render <date> to verify all notebooks render correctly
  • Verify visualizations display correctly in rendered HTML

Migrate data transformation logic from pandas to polars for improved
performance and more expressive data manipulation.

Changes:
- Add polars>=1.0 to dependencies
- Migrate all 9 notebooks to use polars for data transformations
- Keep pandas only where needed for plotly compatibility (.to_pandas())
- Pattern: pl.from_pandas(load_parquet(...)) -> transform -> .to_pandas()

Key polars patterns used:
- group_by().agg() instead of groupby().agg()
- filter(pl.col("col") > 0) instead of df[df["col"] > 0]
- with_columns() instead of df["col"] = ...
- sort() instead of sort_values()
- unique() instead of drop_duplicates()
- fill_null() instead of fillna()
- pl.when().then().otherwise() instead of map()