Merge pull request #6 from danielvdende/dvde-api-cleanup

API cleanup
danielvdende · May 4, 2019 · f90a08f · f90a08f
2 parents 38b897e + 6e4bb8e
commit f90a08f
Showing 9 changed files with 321 additions and 99 deletions.
diff --git a/README.md b/README.md
@@ -1,9 +1,127 @@
 [![Build Status](https://travis-ci.com/danielvdende/opulent-pandas.svg?token=km81qsbsLrgZWGfcfi7a&branch=master)](https://travis-ci.com/danielvdende/opulent-pandas)
 [![PyPI version](https://badge.fury.io/py/opulent-pandas.svg)](https://badge.fury.io/py/opulent-pandas)
-# opulent-pandas
-Opulent-pandas is a schema validation packages aimed specifically at validating the schema of pandas dataframes. 
-It takes heavy inspiration from [voluptuous](), and tries to stay as close as possible to the API defined in this package. 
+# Opulent-Pandas
+Opulent-Pandas is a schema validation package aimed specifically at validating the schema of pandas dataframes. 
+It takes heavy inspiration from [voluptuous](https://github.com/alecthomas/voluptuous), and tries to stay as close as possible to the API defined in that package. Opulent-Pandas
+is different from voluptuous in that it relies heavily on [Pandas](https://pandas.pydata.org/) to perform the validation. This makes Opulent-Pandas considerably faster
+than voluptuous on larger datasets. It does, however, mean that the input format is also a Pandas DataFrame, rather than a dict (as is the case for voluptuous).
+A performance comparison of voluptuous and Opulent-Pandas will be added to this readme soon!
 
-## Documentation
+## Example
+Defining a schema in Opulent-Pandas is very similar to defining one in voluptuous. To make the similarities and differences clear, let's walk through the same example used in the voluptuous readme.
+
+Twitter's [user search API](https://dev.twitter.com/rest/reference/get/users/search) accepts
+query URLs like:
 
-## Examples
+```
+$ curl 'https://api.twitter.com/1.1/users/search.json?q=python&per_page=20&page=1'
+```
+
+To validate this we might use a schema like:
+
+```pycon
+>>> from opulent_pandas import Schema, TypeValidator, Required
+>>> schema = Schema({
+...   Required('q'): [TypeValidator(str)],
+...   Required('per_page'): [TypeValidator(int)],
+...   Required('page'): [TypeValidator(int)],
+... })
+
+```
+Compared with voluptuous, you'll notice that the validators per field are always specified as a list. Other than that,
+it's very similar to how you would define the schema with voluptuous.
+
+If we look at the more complex schema, as defined in the readme of voluptuous, we see very similar schemas:
+
+```pycon
+>>> from opulent_pandas import Required, RangeValidator, Schema, TypeValidator, ValueLengthValidator
+>>> schema = Schema({
+...   Required('q'): [TypeValidator(str), ValueLengthValidator(min_length=1)],
+...   Required('per_page'): [TypeValidator(int), RangeValidator(min=1, max=20)],
+...   Required('page'): [TypeValidator(int), RangeValidator(min=0)],
+... })
+
+```
+
+One difference between Opulent-Pandas and voluptuous is that Opulent-Pandas has a `validate` function that is used
+to validate a given data structure, rather than voluptuous' approach of passing the data directly to your schema as a parameter. 
+
+If you pass data in that does not satisfy the requirements specified in your Opulent-Pandas schema, you'll get a corresponding error message. Walking
+through the examples provided in the voluptuous readme:
+
+There are 3 required fields:
+TODO: this example should also tell you which columns are missing. Seems to be a bug.
+```pycon
+>>> import pandas as pd
+>>> from opulent_pandas import MissingColumnError
+>>> try:
+...   schema.validate(pd.DataFrame())
+...   raise AssertionError('MissingColumnError not raised')
+... except MissingColumnError as e:
+...   exc = e
+>>> str(exc).startswith("Columns missing")
+True
+
+```
+
+`q` must be a string:
+
+```pycon
+>>> from opulent_pandas import InvalidTypeError
+>>> try:
+...   schema.validate(pd.DataFrame({'q': [123], 'per_page': [10], 'page': [1]}))
+...   raise AssertionError('InvalidTypeError not raised')
+... except InvalidTypeError as e:
+...   exc = e
+>>> str(exc) == "Invalid data type found for column: q. Required type: <class 'str'>"
+True
+
+```
+
+...and must be at least one character in length:
+
+```pycon
+>>> from opulent_pandas import ValueLengthError
+>>> try:
+...   schema.validate(pd.DataFrame({'q': [''], 'per_page': 5, 'page': 12}))
+...   raise AssertionError('ValueLengthError not raised')
+... except ValueLengthError as e:
+...   exc = e
+>>> str(exc) == "Value found with length smaller than enforced minimum length for column: q. Minimum Length: 1"
+True
+
+```
+
+"per\_page" is a positive integer no greater than 20:
+
+```pycon
+>>> from opulent_pandas import RangeError
+>>> try:
+...    schema.validate(pd.DataFrame({'q': ['#topic'], 'per_page': [900], 'page': [12]}))
+...    raise AssertionError('RangeError not raised')
+... except RangeError as e:
+...    exc = e
+>>> str(exc) == "Value found larger than enforced maximum for column: per_page. Required maximum: 20"
+True
+
+>>> try:
+...    schema.validate(pd.DataFrame({'q': ['#topic'], 'per_page': [-10], 'page': [12]}))
+...    raise AssertionError('RangeError not raised')
+... except RangeError as e:
+...    exc = e
+>>> str(exc) == "Value found smaller than enforced minimum for column: per_page. Required minimum: 1"
+True
+
+```
+
+"page" is an integer \>= 0:
+
+```pycon
+>>> try:
+...   schema.validate(pd.DataFrame({'q': ['#topic'], 'per_page': [10], 'page': ['one']}))
+...   raise AssertionError('InvalidTypeError not raised')
+... except InvalidTypeError as e:
+...   exc = e
+>>> str(exc) == "Invalid data type found for column: page. Required type: <class 'int'>"
+True
+
+```
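
Review note: the README specifies the validators for each column as a list, and `Schema.validate` runs them in order per column. The sketch below illustrates that shape in plain Python, without pandas or Opulent-Pandas. The names `is_type`, `at_most`, and `validate`, and the dict-of-lists stand-in for a DataFrame, are hypothetical and not part of the package.

```python
# Stand-in for a DataFrame: column name -> list of values (hypothetical data).
data = {"q": ["python"], "per_page": [900], "page": [1]}


def is_type(expected):
    """Build a check that every value has exactly the expected type."""
    def check(name, values):
        if not all(type(v) is expected for v in values):
            raise TypeError(f"Invalid data type found for column: {name}")
    return check


def at_most(maximum):
    """Build a check that every value is <= maximum."""
    def check(name, values):
        if not all(v <= maximum for v in values):
            raise ValueError(
                f"Value found larger than enforced maximum for column: {name}"
            )
    return check


# Each column maps to a list of checks, applied in order.
schema = {
    "q": [is_type(str)],
    "per_page": [is_type(int), at_most(20)],
    "page": [is_type(int)],
}


def validate(schema, data):
    for name, checks in schema.items():
        for check in checks:
            check(name, data[name])  # the first failing check raises


try:
    validate(schema, data)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

Because the first failing check raises, only one problem is reported per call, which matches the one-error-at-a-time behaviour shown in the README examples.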
diff --git a/opulent_pandas/__init__.py b/opulent_pandas/__init__.py
@@ -0,0 +1,6 @@
+# flake8: noqa
+
+from opulent_pandas.schema import *
+from opulent_pandas.column import *
+from opulent_pandas.validator import *
+from opulent_pandas.error import *
diff --git a/opulent_pandas/column.py b/opulent_pandas/column.py
@@ -1,4 +1,3 @@
-
 class ColumnType(object):
     def __init__(self, column_name, description=""):
         self.column_name = column_name

diff --git a/opulent_pandas/error.py b/opulent_pandas/error.py
@@ -5,6 +5,7 @@ class Error(Exception):
     """
     Base exception
     """
+
     def __init__(self, msg):
         self.msg = msg
 
@@ -13,6 +14,7 @@ class GroupError(Error):
     """
     Base exception class for group errors
     """
+
     def __init__(self, errors: List[Error]):
         self.errors = errors
 
@@ -28,6 +30,9 @@ class InvalidDataError(Error):
     Error indicating the data is not valid in some way.
     """
 
+    def __init__(self, msg):
+        Error.__init__(self, msg)
+
 
 class InvalidTypeError(InvalidDataError):
     """"""

diff --git a/opulent_pandas/schema.py b/opulent_pandas/schema.py
@@ -19,14 +19,17 @@ def validate(self, df: pd.DataFrame):
         # now check any other restrictions on those columns
         # TODO: need to split out required vs optional
         for col, validators in self.schema.items():
-            if isinstance(col, Required) or (isinstance(col, Optional) and col.column_name in list(df)):
+            if isinstance(col, Required) or (
+                isinstance(col, Optional) and col.column_name in list(df)
+            ):
                 for validator in validators:
                     validator.validate(df[col.column_name])
 
     def check_column_presence(self, df: pd.DataFrame):
         # check if all Required columns are there
         if not set(df).issuperset(self.get_column_names(Required)):
-            raise MissingColumnError("Columns missing")
+            missing_columns = self.get_column_names(Required) - set(df)
+            raise MissingColumnError(f"Columns missing: {missing_columns}")
 
     def get_column_names(self, column_type: ColumnType) -> set:
         columns = set()
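
Review note: the missing-column message depends on the direction of the set difference. The sketch below uses plain Python sets with hypothetical column names to show what each direction yields.

```python
# Hypothetical column names, for illustration only.
required = {"q", "per_page", "page"}  # columns the schema marks as Required
present = {"q", "extra"}              # columns actually found in the frame

missing = required - present     # required columns the frame lacks
unexpected = present - required  # columns present but not required

print(sorted(missing))
print(sorted(unexpected))
```

Only `required - present` names the columns a caller still has to supply. The reverse direction lists extra columns instead, and is empty whenever the frame has no columns at all.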

diff --git a/opulent_pandas/validator.py b/opulent_pandas/validator.py
@@ -1,7 +1,14 @@
 import pandas as pd
 
-from opulent_pandas.error import (AnyInvalidError, Error, InvalidTypeError, MissingTimezoneError, RangeError,
-                                  SetMemberError, ValueLengthError)
+from opulent_pandas.error import (
+    AnyInvalidError,
+    Error,
+    InvalidTypeError,
+    MissingTimezoneError,
+    RangeError,
+    SetMemberError,
+    ValueLengthError,
+)
 from typing import List
 
 
@@ -54,7 +61,10 @@ def __init__(self, valid_type: type):
 
     def validate(self, df_column: pd.Series):
         if not (df_column.apply(type) == self.valid_type).all():
-            raise InvalidTypeError(f"Invalid data type found. Required: {self.valid_type}")
+            raise InvalidTypeError(
+                f"Invalid data type found for column: {df_column.name}. "
+                f"Required type: {self.valid_type}"
+            )
 
 
 class RangeValidator(BaseValidator):
@@ -69,10 +79,16 @@ def __init__(self, min: int = None, max: int = None):
     def validate(self, df_column: pd.Series):
         if self.min:
             if not (df_column >= self.min).all():
-                raise RangeError(f"Value found smaller than enforced minimum. Required minimum: {self.min}")
+                raise RangeError(
+                    f"Value found smaller than enforced minimum for column: {df_column.name}. "
+                    f"Required minimum: {self.min}"
+                )
         if self.max:
             if not (df_column <= self.max).all():
-                raise RangeError(f"Value found larger than enforced maximum. Required maximum: {self.max}")
+                raise RangeError(
+                    f"Value found larger than enforced maximum for column: {df_column.name}. "
+                    f"Required maximum: {self.max}"
+                )
 
 
 class ValueLengthValidator(BaseValidator):
@@ -88,12 +104,16 @@ def __init__(self, min_length=None, max_length=None):
     def validate(self, df_column: pd.Series):
         if self.min_length:
             if not (df_column.apply(len) >= self.min_length).all():
-                raise ValueLengthError(f"Value found with length smaller than enforced minimum length. "
-                                       f"Minimum Length: {self.min_length}")
+                raise ValueLengthError(
+                    f"Value found with length smaller than enforced minimum length for "
+                    f"column: {df_column.name}. Minimum Length: {self.min_length}"
+                )
         if self.max_length:
             if not (df_column.apply(len) <= self.max_length).all():
-                raise ValueLengthError(f"Value found with length larger than enforced maximum length. "
-                                       f"Maximum Length: {self.max_length}")
+                raise ValueLengthError(
+                    f"Value found with length larger than enforced maximum length for "
+                    f"column: {df_column.name}. Maximum Length: {self.max_length}"
+                )
 
 
 class SetMemberValidator(BaseValidator):
@@ -106,14 +126,20 @@ def __init__(self, values: set):
 
     def validate(self, df_column: pd.Series):
         if not df_column.isin(self.values).all():
-            raise SetMemberError(f"Value found outside of defined set. Allowed: {self.values}")
+            raise SetMemberError(
+                f"Value found outside of defined set for column: {df_column.name}. "
+                f"Allowed: {self.values}"
+            )
 
 
 class TimezoneValidator(BaseValidator):
     """
     Checks that all values in the dataframe column have timezone information
     TODO: extend to allow required timezone as a parameter, e.g. check that the timezone is correct
     """
+
     def validate(self, df_column: pd.Series):
         if not (df_column.apply(lambda x: x.tz is not None)).all():
-            raise MissingTimezoneError(f"Non-timezone-aware dates found.")
+            raise MissingTimezoneError(
+                f"Non-timezone-aware dates found for column: {df_column.name}."
+            )
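
Review note: the guards in `RangeValidator.validate` and `ValueLengthValidator.validate` are written as `if self.min:` and `if self.min_length:`. Python treats `0` as false, so a bound of `0`, such as the README's `RangeValidator(min=0)`, would appear to be skipped. The sketch below shows the difference with plain lists. Both helper functions are hypothetical and only mimic the guard, not the package's classes.

```python
def check_min_truthy(values, minimum=None):
    """Mimics an `if minimum:` guard: a bound of 0 is falsy, so it is skipped."""
    if minimum:
        return all(v >= minimum for v in values)
    return True


def check_min_explicit(values, minimum=None):
    """Uses `is not None`, so a bound of 0 is still enforced."""
    if minimum is not None:
        return all(v >= minimum for v in values)
    return True


pages = [-1, 3]  # hypothetical page numbers, one of them negative

skipped = check_min_truthy(pages, minimum=0)     # lower bound is never checked
enforced = check_min_explicit(pages, minimum=0)  # -1 violates the bound

print(skipped, enforced)
```

If a bound of `0` is meant to be enforced, comparing against `None` explicitly is the usual way to tell "no bound given" apart from "bound is zero".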
diff --git a/setup.py b/setup.py
@@ -5,21 +5,12 @@
 
 setup(
     name="opulent-pandas",
-    version='0.0.3',
+    version="0.0.3",
     description="A package to validate the schema of a pandas dataframe",
     author="Daniel van der Ende",
     long_description=long_description,
     long_description_content_type="text/markdown",
     packages=["opulent_pandas"],
-    install_requires=[
-        "pandas==0.23.4",
-    ],
-    extras_require={
-        "test": [
-            "pytest==4.0.2"
-        ],
-        "lint": [
-            "flake8==3.5.0"
-        ],
-    }
+    install_requires=["pandas==0.23.4"],
+    extras_require={"test": ["pytest==4.0.2"], "lint": ["flake8==3.5.0"]},
 )