Merge pull request #6 from danielvdende/dvde-api-cleanup
API cleanup
danielvdende authored May 4, 2019
2 parents 38b897e + 6e4bb8e commit f90a08f
Showing 9 changed files with 321 additions and 99 deletions.
128 changes: 123 additions & 5 deletions README.md
@@ -1,9 +1,127 @@
[![Build Status](https://travis-ci.com/danielvdende/opulent-pandas.svg?token=km81qsbsLrgZWGfcfi7a&branch=master)](https://travis-ci.com/danielvdende/opulent-pandas)
[![PyPI version](https://badge.fury.io/py/opulent-pandas.svg)](https://badge.fury.io/py/opulent-pandas)
-# opulent-pandas
-Opulent-pandas is a schema validation packages aimed specifically at validating the schema of pandas dataframes.
-It takes heavy inspiration from [voluptuous](), and tries to stay as close as possible to the API defined in this package.
# Opulent-Pandas
Opulent-Pandas is a schema validation package aimed specifically at validating the schema of pandas DataFrames.
It takes heavy inspiration from [voluptuous](https://github.com/alecthomas/voluptuous) and tries to stay as close as possible to the voluptuous API. Opulent-Pandas
differs from voluptuous in that it relies heavily on [Pandas](https://pandas.pydata.org/) to perform the validation. This makes Opulent-Pandas considerably faster
than voluptuous on larger datasets. It does, however, mean that the input must be a Pandas DataFrame rather than a dict (as is the case for voluptuous).
A performance comparison of voluptuous and Opulent-Pandas will be added to this readme soon!

-## Documentation
## Example
Defining a schema in Opulent-Pandas is very similar to how you would in voluptuous. To make the similarities and differences clear, let's walk through the same example as is done in the voluptuous readme.

Twitter's [user search API](https://dev.twitter.com/rest/reference/get/users/search) accepts
query URLs like:

-## Examples
```
$ curl 'https://api.twitter.com/1.1/users/search.json?q=python&per_page=20&page=1'
```

To validate this we might use a schema like:

```pycon
>>> from opulent_pandas import Schema, TypeValidator, Required
>>> schema = Schema({
...     Required('q'): [TypeValidator(str)],
...     Required('per_page'): [TypeValidator(int)],
...     Required('page'): [TypeValidator(int)],
... })

```
Compared with voluptuous, you'll notice that the validators for each field are always specified as a list. Other than that,
the schema is defined much as it would be with voluptuous.

If we look at the more complex schema, as defined in the readme of voluptuous, we see very similar schemas:

```pycon
>>> from opulent_pandas import Schema, Required, RangeValidator, TypeValidator, ValueLengthValidator
>>> schema = Schema({
...     Required('q'): [TypeValidator(str), ValueLengthValidator(min_length=1)],
...     Required('per_page'): [TypeValidator(int), RangeValidator(min=1, max=20)],
...     Required('page'): [TypeValidator(int), RangeValidator(min=0)],
... })

```

One difference between Opulent-Pandas and voluptuous is that Opulent-Pandas has a `validate` function that is used
to validate a given data structure, rather than voluptuous' approach of passing the data directly to your schema as a parameter.
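
The difference in calling convention can be sketched with two stand-in classes. This is an illustrative sketch only: `CallableSchema`, `MethodSchema` and `MissingKeys` are hypothetical names, plain dicts stand in for DataFrames, and neither class is the actual voluptuous or Opulent-Pandas implementation.

```python
class MissingKeys(Exception):
    """Raised by both stand-ins when required keys are absent (hypothetical)."""


def _check(required, data):
    # Shared helper: report required keys that the data does not contain.
    missing = set(required) - set(data)
    if missing:
        raise MissingKeys(f"missing: {sorted(missing)}")


class CallableSchema:
    """voluptuous-style stand-in: the schema object itself is called with the data."""

    def __init__(self, required):
        self.required = required

    def __call__(self, data):
        _check(self.required, data)
        return data


class MethodSchema:
    """Opulent-Pandas-style stand-in: validation is an explicit method call."""

    def __init__(self, required):
        self.required = required

    def validate(self, data):
        _check(self.required, data)


data = {"q": ["python"], "per_page": [20], "page": [1]}
CallableSchema(["q", "per_page", "page"])(data)          # schema(data)
MethodSchema(["q", "per_page", "page"]).validate(data)   # schema.validate(data)
```

In both styles invalid data surfaces as an exception; only the way the schema is invoked differs.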

If you pass data in that does not satisfy the requirements specified in your Opulent-Pandas schema, you'll get a corresponding error message. Walking
through the examples provided in the voluptuous readme:

There are 3 required fields:
TODO: this example should also tell you which columns are missing. Seems to be a bug.
```pycon
>>> from opulent_pandas import MissingColumnError
>>> try:
...     schema.validate({})
...     raise AssertionError('MissingColumnError not raised')
... except MissingColumnError as e:
...     exc = e
>>> str(exc) == "Columns missing"
True

```

`q` must be a string:

```pycon
>>> import pandas as pd
>>> from opulent_pandas import InvalidTypeError
>>> try:
...     schema.validate(pd.DataFrame({'q': [123], 'per_page': [10], 'page': [1]}))
...     raise AssertionError('InvalidTypeError not raised')
... except InvalidTypeError as e:
...     exc = e
>>> str(exc) == "Invalid data type found for column: q. Required type: <class 'str'>"
True

```

...and must be at least one character in length:

```pycon
>>> from opulent_pandas import ValueLengthError
>>> try:
...     schema.validate(pd.DataFrame({'q': [''], 'per_page': 5, 'page': 12}))
...     raise AssertionError('ValueLengthError not raised')
... except ValueLengthError as e:
...     exc = e
>>> str(exc) == "Value found with length smaller than enforced minimum length for column: q. Minimum Length: 1"
True

```

"per\_page" is a positive integer no greater than 20:

```pycon
>>> from opulent_pandas import RangeError
>>> try:
...     schema.validate(pd.DataFrame({'q': ['#topic'], 'per_page': [900], 'page': [12]}))
...     raise AssertionError('RangeError not raised')
... except RangeError as e:
...     exc = e
>>> str(exc) == "Value found larger than enforced maximum for column: per_page. Required maximum: 20"
True

>>> try:
...     schema.validate(pd.DataFrame({'q': ['#topic'], 'per_page': [-10], 'page': [12]}))
...     raise AssertionError('RangeError not raised')
... except RangeError as e:
...     exc = e
>>> str(exc) == "Value found smaller than enforced minimum for column: per_page. Required minimum: 1"
True

```

"page" is an integer \>= 0:

```pycon
>>> try:
...     schema.validate(pd.DataFrame({'q': ['#topic'], 'per_page': [10], 'page': ['one']}))
...     raise AssertionError('InvalidTypeError not raised')
... except InvalidTypeError as e:
...     exc = e
>>> str(exc) == "Invalid data type found for column: page. Required type: <class 'int'>"
True

```
6 changes: 6 additions & 0 deletions opulent_pandas/__init__.py
@@ -0,0 +1,6 @@
+# flake8: noqa
+
+from opulent_pandas.schema import *
+from opulent_pandas.column import *
+from opulent_pandas.validator import *
+from opulent_pandas.error import *
1 change: 0 additions & 1 deletion opulent_pandas/column.py
@@ -1,4 +1,3 @@
-
 class ColumnType(object):
     def __init__(self, column_name, description=""):
         self.column_name = column_name
5 changes: 5 additions & 0 deletions opulent_pandas/error.py
@@ -5,6 +5,7 @@ class Error(Exception):
     """
     Base exception
     """
+
     def __init__(self, msg):
         self.msg = msg

@@ -13,6 +14,7 @@ class GroupError(Error):
     """
     Base exception class for group errors
     """
+
     def __init__(self, errors: List[Error]):
         self.errors = errors

@@ -28,6 +30,9 @@ class InvalidDataError(Error):
     Error indicating the data is not valid in some way.
     """

+    def __init__(self, msg):
+        Error.__init__(self, msg)
+

 class InvalidTypeError(InvalidDataError):
     """"""
7 changes: 5 additions & 2 deletions opulent_pandas/schema.py
@@ -19,14 +19,17 @@ def validate(self, df: pd.DataFrame):
         # now check any other restrictions on those columns
         # TODO: need to split out required vs optional
         for col, validators in self.schema.items():
-            if isinstance(col, Required) or (isinstance(col, Optional) and col.column_name in list(df)):
+            if isinstance(col, Required) or (
+                isinstance(col, Optional) and col.column_name in list(df)
+            ):
                 for validator in validators:
                     validator.validate(df[col.column_name])

     def check_column_presence(self, df: pd.DataFrame):
         # check if all Required columns are there
         if not set(df).issuperset(self.get_column_names(Required)):
-            raise MissingColumnError("Columns missing")
+            missing_columns = set(df) - self.get_column_names(Required)
+            raise MissingColumnError(f"Columns missing: {missing_columns}")

     def get_column_names(self, column_type: ColumnType) -> set:
         columns = set()
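
The direction of the set difference in `check_column_presence` decides which names end up in the error message. A minimal plain-Python sketch follows; the column names are illustrative, and `required` and `present` are stand-ins for `self.get_column_names(Required)` and `set(df)`. It suggests the operands may be the wrong way round in this hunk: `present - required` yields columns that are present but not required, while `required - present` yields the missing ones.

```python
# Stand-ins for self.get_column_names(Required) and set(df); values are illustrative.
required = {"q", "per_page", "page"}
present = {"q", "extra"}

# Operand order as written in the hunk: columns that are present but not required.
present_minus_required = present - required

# Reversed operand order: columns that are required but absent.
required_minus_present = required - present

print(sorted(present_minus_required))  # ['extra']
print(sorted(required_minus_present))  # ['page', 'per_page']
```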
48 changes: 37 additions & 11 deletions opulent_pandas/validator.py
@@ -1,7 +1,14 @@
 import pandas as pd

-from opulent_pandas.error import (AnyInvalidError, Error, InvalidTypeError, MissingTimezoneError, RangeError,
-                                  SetMemberError, ValueLengthError)
+from opulent_pandas.error import (
+    AnyInvalidError,
+    Error,
+    InvalidTypeError,
+    MissingTimezoneError,
+    RangeError,
+    SetMemberError,
+    ValueLengthError,
+)
 from typing import List


@@ -54,7 +61,10 @@ def __init__(self, valid_type: type):

     def validate(self, df_column: pd.Series):
         if not (df_column.apply(type) == self.valid_type).all():
-            raise InvalidTypeError(f"Invalid data type found. Required: {self.valid_type}")
+            raise InvalidTypeError(
+                f"Invalid data type found for column: {df_column.name}. "
+                f"Required type: {self.valid_type}"
+            )


 class RangeValidator(BaseValidator):
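
`TypeValidator` compares `type(value)` for equality with `valid_type`, which is stricter than an `isinstance` check. A small plain-Python sketch of that difference, with no pandas involved; the helper names are hypothetical and not part of the package:

```python
def exact_type_ok(values, valid_type):
    """True only if every value has exactly the given type (subclasses fail)."""
    return all(type(v) == valid_type for v in values)


def isinstance_ok(values, valid_type):
    """True if every value is an instance of the type, subclasses included."""
    return all(isinstance(v, valid_type) for v in values)


values = [1, 2, True]  # bool is a subclass of int

print(exact_type_ok(values, int))   # False
print(isinstance_ok(values, int))   # True
```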
@@ -69,10 +79,16 @@ def __init__(self, min: int = None, max: int = None):
     def validate(self, df_column: pd.Series):
         if self.min:
             if not (df_column >= self.min).all():
-                raise RangeError(f"Value found smaller than enforced minimum. Required minimum: {self.min}")
+                raise RangeError(
+                    f"Value found smaller than enforced minimum for column: {df_column.name}. "
+                    f"Required minimum: {self.min}"
+                )
         if self.max:
             if not (df_column <= self.max).all():
-                raise RangeError(f"Value found larger than enforced maximum. Required maximum: {self.max}")
+                raise RangeError(
+                    f"Value found larger than enforced maximum for column: {df_column.name}. "
+                    f"Required maximum: {self.max}"
+                )


 class ValueLengthValidator(BaseValidator):
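
The `if self.min:` and `if self.max:` guards skip the check whenever the bound is falsy, which includes `0`. With `RangeValidator(min=0)`, as used for `page` in the README, negative values would therefore not be rejected. A sketch of the difference on plain Python lists, assuming that an explicit `is not None` test is the intended behaviour; the function names are hypothetical:

```python
def below_min_truthy(values, minimum=None):
    """Mirrors the `if self.min:` guard: a bound of 0 disables the check."""
    if minimum:
        return any(v < minimum for v in values)
    return False


def below_min_explicit(values, minimum=None):
    """Assumed intent: only a missing bound (None) disables the check."""
    if minimum is not None:
        return any(v < minimum for v in values)
    return False


pages = [3, -1]

print(below_min_truthy(pages, minimum=0))    # False
print(below_min_explicit(pages, minimum=0))  # True
```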
@@ -88,12 +104,16 @@ def __init__(self, min_length=None, max_length=None):
     def validate(self, df_column: pd.Series):
         if self.min_length:
             if not (df_column.apply(len) >= self.min_length).all():
-                raise ValueLengthError(f"Value found with length smaller than enforced minimum length. "
-                                       f"Minimum Length: {self.min_length}")
+                raise ValueLengthError(
+                    f"Value found with length smaller than enforced minimum length for "
+                    f"column: {df_column.name}. Minimum Length: {self.min_length}"
+                )
         if self.max_length:
             if not (df_column.apply(len) <= self.max_length).all():
-                raise ValueLengthError(f"Value found with length larger than enforced maximum length. "
-                                       f"Maximum Length: {self.max_length}")
+                raise ValueLengthError(
+                    f"Value found with length larger than enforced maximum length for "
+                    f"column: {df_column.name}. Maximum Length: {self.max_length}"
+                )


 class SetMemberValidator(BaseValidator):
class SetMemberValidator(BaseValidator):
Expand All @@ -106,14 +126,20 @@ def __init__(self, values: set):

def validate(self, df_column: pd.Series):
if not df_column.isin(self.values).all():
raise SetMemberError(f"Value found outside of defined set. Allowed: {self.values}")
raise SetMemberError(
f"Value found outside of defined set for column: {df_column.name}. "
f"Allowed: {self.values}"
)


class TimezoneValidator(BaseValidator):
"""
Checks that all values in the dataframe column have timezone information
TODO: extend to allow required timezone as a parameter, e.g. check that the timezone is correct
"""

def validate(self, df_column: pd.Series):
if not (df_column.apply(lambda x: x.tz is not None)).all():
raise MissingTimezoneError(f"Non-timezone-aware dates found.")
raise MissingTimezoneError(
f"Non-timezone-aware dates found for column: {df_column.name}."
)
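
The check `x.tz is not None` relies on the `tz` attribute of pandas timestamps. The same idea can be sketched with the standard library's `datetime`, where the corresponding attribute is `tzinfo`. The function name below is hypothetical and the sketch does not use pandas:

```python
from datetime import datetime, timezone


def all_timezone_aware(values):
    """True if every datetime carries timezone information."""
    return all(v.tzinfo is not None for v in values)


aware = datetime(2019, 5, 4, 12, 0, tzinfo=timezone.utc)
naive = datetime(2019, 5, 4, 12, 0)

print(all_timezone_aware([aware]))         # True
print(all_timezone_aware([aware, naive]))  # False
```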
15 changes: 3 additions & 12 deletions setup.py
@@ -5,21 +5,12 @@

 setup(
     name="opulent-pandas",
-    version='0.0.3',
+    version="0.0.3",
     description="A package to validate the schema of a pandas dataframe",
     author="Daniel van der Ende",
     long_description=long_description,
     long_description_content_type="text/markdown",
     packages=["opulent_pandas"],
-    install_requires=[
-        "pandas==0.23.4",
-    ],
-    extras_require={
-        "test": [
-            "pytest==4.0.2"
-        ],
-        "lint": [
-            "flake8==3.5.0"
-        ],
-    }
+    install_requires=["pandas==0.23.4"],
+    extras_require={"test": ["pytest==4.0.2"], "lint": ["flake8==3.5.0"]},
 )