StreetEasy/dfs

DFS (aka Dataframe_Schema)

DFS is a lightweight validator for pandas.DataFrame objects. You can think of it as JSON Schema for dataframes.

Key features:

  1. Lightweight: depends only on pandas and pydantic (which itself depends only on typing_extensions).
  2. Explicit: inspired by JSON Schema; all schemas are stored as JSON (or YAML) files and can be generated or changed on the fly.
  3. Simple: easy to use, with no need to change your workflow or dive into implementation details.
  4. Comprehensive: reports all errors in a single summary exception, checks distributions, and works on subsets of the dataframe.
  5. Rapid: base schemas can be generated from a given dataframe or SQL query (using pd.read_sql).
  6. Handy: supports a command-line interface (with the [cli] extra).
  7. Extendable: the core idea is to validate dataframes of any type. Only pandas is supported for now, but we'll add abstractions to run the same checks on other types of dataframes (cuDF, Dask, Spark DataFrames, etc.).

QuickStart

1. Validate DataFrame

Via wrapper

import pandas as pd
import dfschema as dfs


df = pd.DataFrame({
  "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
  "b": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})

schema_pass = {
  "shape": {"min_rows": 10}
}

schema_raise = {
  "shape": {"min_rows": 20}
}


dfs.validate(df, schema_pass)   # passes: df has 10 rows, min_rows is 10
dfs.validate(df, schema_raise)  # raises DataFrameSchemaError: df has fewer than 20 rows

Alternatively (v2 optional), you can use the root class, DfSchema:

dfs.DfSchema.from_dict(schema_pass).validate(df)   # passes silently
dfs.DfSchema.from_dict(schema_raise).validate(df)  # raises DataFrameSchemaError
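
To illustrate the "single summary exception" idea behind the examples above, here is a minimal standalone sketch. It is not dfschema's implementation: `SchemaSummaryError`, `collect_errors`, and the `required_columns` key are hypothetical names made up for illustration; only `shape.min_rows` comes from the schemas shown above.

```python
from typing import Any, Dict, List, Sequence, Tuple


class SchemaSummaryError(Exception):
    """Hypothetical stand-in for a summary exception (not dfschema's own class)."""

    def __init__(self, errors: Sequence[str]) -> None:
        self.errors = list(errors)
        message = f"{len(self.errors)} check(s) failed: " + "; ".join(self.errors)
        super().__init__(message)


def collect_errors(
    shape: Tuple[int, int], columns: Sequence[str], schema: Dict[str, Any]
) -> List[str]:
    """Run every check and return all failures instead of stopping at the first."""
    errors: List[str] = []

    # "shape.min_rows" mirrors the schema key used in the QuickStart example.
    n_rows = shape[0]
    min_rows = schema.get("shape", {}).get("min_rows")
    if min_rows is not None and n_rows < min_rows:
        errors.append(f"expected at least {min_rows} rows, got {n_rows}")

    # "required_columns" is an assumed key, used only to produce a second error.
    for name in schema.get("required_columns", []):
        if name not in columns:
            errors.append(f"missing column {name!r}")

    return errors


def validate(shape: Tuple[int, int], columns: Sequence[str], schema: Dict[str, Any]) -> None:
    """Raise one exception that lists every failed check."""
    errors = collect_errors(shape, columns, schema)
    if errors:
        raise SchemaSummaryError(errors)
```

With a real dataframe, one would pass `df.shape` and `list(df.columns)`. Collecting failures in a list before raising is what lets a caller see every problem in one run.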

2. Generate Schema

dfs.DfSchema.from_df(df)

3. Read and Write Schemas

schema = dfs.DfSchema.from_file('schema.json')
schema.to_file("schema.yml")
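
Since schemas are plain JSON (or YAML) documents, a schema file such as the one read above can also be produced with the standard library alone. The sketch below writes the `schema_pass` dictionary from the QuickStart to a JSON file and reads it back; the temporary directory and the file name are arbitrary choices for the example, and nothing here calls dfschema itself.

```python
import json
import tempfile
from pathlib import Path

# Same dictionary as schema_pass in the QuickStart example.
schema = {"shape": {"min_rows": 10}}

with tempfile.TemporaryDirectory() as tmp_dir:
    schema_path = Path(tmp_dir) / "schema.json"

    # Write the schema as human-readable JSON.
    schema_path.write_text(json.dumps(schema, indent=2), encoding="utf-8")

    # Read it back to confirm the round trip preserves the content.
    loaded = json.loads(schema_path.read_text(encoding="utf-8"))

print(loaded == schema)
```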

4. Using CLI

Note: requires the [cli] extra, because the CLI relies on Typer and click.

Validate via CLI

dfschema validate --read_kwargs_json '{"delimiter": "|"}' FILEPATH SCHEMA_FILEPATH
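
Hand-quoting a JSON string on the command line is error-prone, so it can help to build the command programmatically. The sketch below assumes, based on the option's name, that `--read_kwargs_json` takes a JSON object of reader keyword arguments; `build_validate_command` and the file names are hypothetical and used only for illustration.

```python
import json
import shlex
from typing import Any, Dict


def build_validate_command(
    data_path: str, schema_path: str, read_kwargs: Dict[str, Any]
) -> str:
    """Return a shell-safe `dfschema validate` command line."""
    # json.dumps guarantees valid JSON (double-quoted keys and values).
    kwargs_json = json.dumps(read_kwargs)
    args = [
        "dfschema",
        "validate",
        "--read_kwargs_json",
        kwargs_json,
        data_path,
        schema_path,
    ]
    # shlex.join quotes each argument so the shell passes it through unchanged.
    return shlex.join(args)


# Placeholder paths; substitute your own data and schema files.
command = build_validate_command("data.csv", "schema.json", {"delimiter": "|"})
print(command)
```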

Supported file formats:

  • csv
  • xlsx
  • parquet
  • feather
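
Each of the formats above has a matching pandas reader function. As a rough sketch of how a file extension could be mapped to one, and not a description of how dfschema actually dispatches, consider the following; `READER_BY_SUFFIX` and `reader_name` are hypothetical names.

```python
from pathlib import Path

# Extension -> name of the pandas function that reads that format.
READER_BY_SUFFIX = {
    ".csv": "read_csv",
    ".xlsx": "read_excel",
    ".parquet": "read_parquet",
    ".feather": "read_feather",
}


def reader_name(path: str) -> str:
    """Return the pandas reader name for a file path, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return READER_BY_SUFFIX[suffix]
    except KeyError:
        supported = ", ".join(sorted(READER_BY_SUFFIX))
        raise ValueError(
            f"unsupported file type {suffix!r}; expected one of: {supported}"
        ) from None
```

With pandas imported as `pd`, the file could then be loaded with `getattr(pd, reader_name(path))(path, **read_kwargs)`, where `read_kwargs` holds any extra reader arguments such as `delimiter`.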

Generate via CLI

dfschema generate --format 'yaml' DATA_PATH > schema.yaml

Installation

WIP

Alternatives

Changes

  • [[changelog]]

Roadmap

  • Add tutorial Notebook
  • Support tableschema
  • Support Modin models
  • Support SQLAlchemy ORM models
  • Built-in Airflow Operator?
  • Interactive CLI/jupyter for schema generation