README.md

ORM
---
Since we're talking about using the data "outside" of Django, I monkey-patched things so that I could use the ORM from plain Python (the scraping, in other words). There are many, many possibilities here and they all depend on the requirements. Things that come to mind (a sketch of the standalone-ORM setup follows this list):
* Have the scraper write to a separate db and have things synchronized using RabbitMQ or something. The scrapers run in a "farm" collecting queries and reply back to the queue with some structured result. The web UI and/or the database would be participating in that queue, inserting the queries and waiting for responses (asynchronously).
* Use a separate ORM for scraper I/O. Something like sqlalchemy is robust and well-tested and would have fewer risks, since it doesn't look like the Django ORM is often used this way. The downside is maintaining two schemas (table classes), etc.
* The pie chart should be implemented in the browser. We could just return a single number and let JavaScript do the rest. I chose to implement everything in Python.
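
For concreteness, a minimal sketch of driving the ORM from plain Python via Django's standard ``django.setup()`` route (the repo's actual monkey-patch may differ; the settings path and model import here are assumptions):

```python
# Drive the Django ORM from a plain script: point Django at the settings
# module, call setup(), then import models as usual.
import os

import django

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "person_search.settings")  # assumed path
django.setup()  # initializes the app registry so models work outside the web UI

from person_search.db import Person  # assumed model location (see db.py below)

print(Person.objects.count())
```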
Scaling
-------
I have no idea how much scaling we'd be expected to do. Three per second or millions per second...? The answer varies. For a more robust PoC I would consider simple PostgreSQL clustering. That would not improve write performance, and writes might even be slower, but we might not care. Some kind of CDN for anything static or cacheable. Django would be running in some "clustered" fashion. Minimally Django would be running behind a fast, secure web server like NGINX (tbd). On the web side of things, db access would be read-only (so clustering improves performance there), so minimally we would have a separate Postgres user for that, but having something else running that has write access to the same database is sketchy. It needs to get updated *somehow*, but that is some security "surface area" to keep in mind.
Encryption
----------
Whether this is a good idea or not depends on the requirements. In most cases, I'd question the benefit. Some process has cleartext read access to that database (it holds the decryption key), and that's where an attacker would attack. So encryption gets you nothing unless someone breaks in the hard way (but why do that?).
That's not to say there are no use cases for it. It would be an interesting discussion and at worst it complicates and slows things down. I could be wrong after all!!
I chose to use symmetric encryption since having the private key and having a secret are pretty much the same thing. There are use cases for asymmetric encryption; I just don't know if this is one of them. Also, the **encryption should live in the database**, most likely. I wrote it into the ORM, which for performance and design reasons is a poor choice, at least in my implementation. Searching the data is greatly complicated: we can search by encrypted output, and that's what I did, but 1) it must be encryption that produces the same output for any given input (i.e., deterministic), otherwise searching and querying get **REALLY** complicated, and 2) it should probably be done at the RDBMS level. I didn't investigate that for the sake of time. A sketch of the determinism problem follows.
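
A minimal illustration of why non-deterministic encryption breaks equality lookups, using the `cryptography` package this project already depends on:

```python
# Fernet prepends a random IV (and a timestamp), so encrypting the same
# plaintext twice yields different ciphertexts -- an equality search on the
# stored ciphertext can never match.
from cryptography.fernet import Fernet

fernet = Fernet(Fernet.generate_key())
a = fernet.encrypt(b"jane@example.com")
b = fernet.encrypt(b"jane@example.com")
assert a != b  # same input, different output: WHERE email = <ciphertext> fails
```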
And because my first implementation doesn't satisfy the first point above, I was unable to use it. So I created a trivial ROT13-based "encryption" as a stand-in, so in some weak sense the data is "encrypted"; a sketch follows. Since this was fast, I had no caching concerns (see below).
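
A minimal sketch of such a stand-in (the actual helper in this repo may be shaped differently):

```python
# ROT13 is deterministic: the same input always produces the same stored
# string, so equality lookups on the "encrypted" column still work. It is,
# of course, not real encryption.
import codecs

def weak_encrypt(text: str) -> str:
    return codecs.encode(text, "rot13")

assert weak_encrypt("jane@example.com") == weak_encrypt("jane@example.com")
assert codecs.decode(weak_encrypt("jane@example.com"), "rot13") == "jane@example.com"
```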
In production I would investigate PostgreSQL's encryption options, which are [well documented](https://www.postgresql.org/docs/current/static/encryption-options.html).

Misc
----
A list of smaller "todos" and remarks about the implementation:
* Imports should be done in a smart, consistent way. I import ``person_search`` by name everywhere for now. It might be cleaved off as a separate package anyhow. All things being equal, just following "best practices" is a good idea.
* I assume that for a given email, the scraped data never changes. This both simplifies things and speeds things up. In production, we might want to periodically check for updates: if a record has "expired", add that email address to the queue of addresses to be scraped (see the sketch after this list).
* I also chose to use some off-the-shelf (OTS) libraries, which I think is bad practice. It's OK to use some libraries, but if they're not well maintained, be prepared to completely understand and *own* them indefinitely.
* Encryption as it's implemented adds a large amount of overhead. This would also be fixed by encrypting at the RDBMS level.
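
A hypothetical sketch of that refresh check (``updated_at`` and the queue interface are illustrative assumptions, not fields from this repo):

```python
# If a person's scraped record is older than some threshold, re-queue the
# email address for scraping.
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=30)  # assumed refresh interval

def maybe_enqueue(person, queue):
    """Re-queue ``person.email`` when the stored record has "expired"."""
    if datetime.now(timezone.utc) - person.updated_at > MAX_AGE:
        queue.put(person.email)
```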
"""``person_search``. Crawl web pages for profiles, record results in an encrypted database and generate a pie chart displaying "percent persons with masters".
3
+
"""
4
+
5
+
# INPROD: We would have a smarter logging system. I just set everthing to debug here.
# tl;dr -- Crypto is hard, and this would require some research to do properly.
#
# This code does what it says, but I'm under no illusion that this is safe to
# use in production. I chose the path of least resistance here and use symmetric
# encryption using the popular `cryptography` module. Depending on the use case,
# it might be better to focus on making sure no one ever gets to the data in the
# first place. One possible case where we'd want to use asymmetric encryption is
# where we have a multi-master cluster of database servers, in which case we
# could have only the public key on these hosts to permit writing. Reading is a
# different problem. If read performance is not critical, just have one host
# with the private key?
#
# After implementing I find this: ``Fernet seems not to be maintained
# anymore. There has been no updates for the spec in three years. Original
# developers are in radio silence`` [https://appelsiini.net/2017/branca-alternative-to-jwt/]
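#
# A hypothetical usage sketch (not part of this module; key handling is a
# placeholder -- in production the key would come from secure storage, not a
# fresh generate_key() call):
#
#     from cryptography.fernet import Fernet
#
#     fernet = Fernet(Fernet.generate_key())
#     token = fernet.encrypt(b"sensitive value")
#     assert fernet.decrypt(token) == b"sensitive value"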

person_search/db.py

# -*- mode: python; coding: utf-8 -*-
"""Use the Django ORM independently of the web UI. This involves some "tricks"
but has the benefit of there being only one place to keep the database i/o
logic. This is especially helpful if we're doing something to the data before
storing (like encryption).
"""

# INPROD: I would very seriously reconsider this. It may be fine, but it may not
# be! It might be better to teach Django to use a different ORM, like
# sqlalchemy. Whatever the case, it's very helpful not to re-create database
# logic in two places. That's why this file...

import os
import importlib
import django

# I have removed some comments and strings to keep things anonymous. It's pretty
# easy to follow by just reading the code... (again, needs testing and rigor)
"""Using available information about the ``Degree``, return ``True`` if master's degree.
57
+
"""Using available information about the ``Degree``, return ``True`` if
58
+
master's degree.
50
59
"""
51
-
# of course this is a rediculous implementation. It's hard to know without more information. Perhaps have a "master list" of regular expressions, where any match on the contents of the ``name`` column (the name of the degree, BS, etc) constitutes "is masters". Lots of possiblilities.
60
+
# Of course this is a ridiculous implementation. It's hard to know
61
+
# without more information. Perhaps have a "master list" of regular
62
+
# expressions, where any match on the contents of the ``name`` column
63
+
# (the name of the degree, BS, etc) constitutes "is masters". Lots of
64
+
# possibilities. This works for our PoC.
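        #
        # A hypothetical sketch of that regex idea (the patterns are
        # illustrative guesses, not a vetted list):
        #
        #     import re
        #     MASTERS_RE = re.compile(r"\b(master|m\.?s\.?c?|m\.?a|m\.?eng|mba)\b", re.I)
        #     return bool(MASTERS_RE.search(self.name))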
        return 'M' in self.name

    def __str__(self):

    class Meta:
        unique_together = (('name', 'institution'),)

class Person(models.Model):
    """The column widths are over-spec'd here to account for encoding overhead.