Skip to content

Commit 9889158

Browse files
committed
blarg ispell etc
1 parent 04e565c commit 9889158

17 files changed

+237
-143
lines changed

README.md

+8-6
Original file line numberDiff line numberDiff line change
@@ -131,24 +131,25 @@ ORM
131131

132132
Since we're talking about using the data "outside" of Django, I monkey-patched things so that I could use the ORM just using plain Python (the scraping in other words). There are many, many possibilities here and they all depend on the requirements. Things that come to mind:
133133

134-
* Have the scraper write to a separate db and have things synchronized using RabbitMQ or something. The scrapers run in a "farm" collecting queries and reply back to the queue with some structured result. The web UI and/or the database would be participating in that queue, inserting the queries and waiting for responses (asyncronously).
135-
* Use a separate ORM for scraper I/O. Something like sqlalchemy is robust and well-tested and would have fewer risks, sinice it doesn't look like the Django ORM is often used thi way. The downside is maintaining two schemas (table classes), etc.
134+
* Have the scraper write to a separate db and have things synchronized using RabbitMQ or something. The scrapers run in a "farm" collecting queries and reply back to the queue with some structured result. The web UI and/or the database would be participating in that queue, inserting the queries and waiting for responses (asynchronously).
135+
* Use a separate ORM for scraper I/O. Something like sqlalchemy is robust and well-tested and would have fewer risks, since it doesn't look like the Django ORM is often used this way. The downside is maintaining two schema (table classes), etc.
136+
* The pie chart should be implemented in the browser. We could just return a single number and let JavaScript do the rest. I chose to implement everything in Python.
136137

137138
Scaling
138139
-------
139140

140-
I have no idea how much scaling we'd be expected to do. Three per second or millions per second...? The answer varies. For a more robust PoC I would consider simple PostgreSQL clustering. That wold not improve write performance, but that would be slower so we might not care. Some kind of CDS for anything static or cachable. Django would be running in some "clustered" fashion. Minimally Django would be running behind a fast, secure web server like nginix (tbd). On the web side of things, db access would be read-only (so clustering improves performance there) so minimally we would have a separate Postgres user for that, but having something else running that has write access to the same database is sketchy. It needs to get updated *somehow* gut that is some security "surface area" to keep in mind.
141+
I have no idea how much scaling we'd be expected to do. Three per second or millions per second...? The answer varies. For a more robust PoC I would consider simple PostgreSQL clustering. That wold not improve write performance, but that would be slower so we might not care. Some kind of CDS for anything static or cachable. Django would be running in some "clustered" fashion. Minimally Django would be running behind a fast, secure web server like NGINX (tbd). On the web side of things, db access would be read-only (so clustering improves performance there) so minimally we would have a separate Postgres user for that, but having something else running that has write access to the same database is sketchy. It needs to get updated *somehow* gut that is some security "surface area" to keep in mind.
141142

142143
Encryption
143144
----------
144145

145146
Whether this is a good idea or not depends on the requirements. In most cases, I'd question the benefit. Some process has clear read access to that database via some kind of encryption and that's where an attacker would attack. So encryption gets you nothing unless someone breaks in the hard way (but why do that)?
146147

147-
That's not to say there are no use cases for it. It would be an interesting discussino and at worst it complicates and slows thigns down. I could be wrong afterall!!
148+
That's not to say there are no use cases for it. It would be an interesting discussion and at worst it complicates and slows things down. I could be wrong after all!!
148149

149150
I chose to use symmetric encryption since having the private key and having a secret are pretty much the same thing. There are use cases for asymmetric encryption. I just don't know if this is one of them. Also, the **the encryption should live in the database** most likely. I wrote it into the ORM, which for performance and design reasons is a poor choice, at least in my implementation. Searching the data is greatly complicated, we can search by encrypted output and that's what I did, but 1) it must be encryption that results in the same output for any given input, otherwise searching and querying get **REALLY** complicated and 2) it should probably be done at the RDBMS level. I didn't investigate that for the sake of time.
150151

151-
And because my first implementation doesn't satisfy the first point above, I was unable to use it. So I created a trival rot13 based "encryption" as a standin so in some weak sense, the data is "encrypted". Since this was fast, I had no caching concerns (see below.)
152+
And because my first implementation doesn't satisfy the first point above, I was unable to use it. So I created a trivial rot13 based "encryption" as a stand-in so in some weak sense, the data is "encrypted". Since this was fast, I had no caching concerns (see below.)
152153

153154
In production I would investigate PostgreSQL's encryption options, which are [well documented](https://www.postgresql.org/docs/current/static/encryption-options.html).
154155

@@ -170,8 +171,9 @@ Misc
170171
A list of smaller "todos" and remarks about the implementation:
171172

172173
* Imports should be done in a smart, consistent way. I import ``person_search`` by name everywhere for now. It might be cleaved off as a separate package anyhow. All things being equal, just "best practices" is a good idea.
173-
* I assume that for a given email, the scraped data never chances. This both simplifies things and speeds things up. In production, we might want to periodically chedck for updates: if a records has "expired", add that email address to the queue of addresses to be scraped.
174+
* I assume that for a given email, the scraped data never chances. This both simplifies things and speeds things up. In production, we might want to periodically check for updates: if a records has "expired", add that email address to the queue of addresses to be scraped.
174175
* I also chose to use some OTS libraries, which I think is bad practice. It's ok to use some libraries, but if they're not well maintained, be prepared to completely understand and *own* them indefinitely.
176+
* Encryption as it's implemented adds a large amount of overhead. This also fixed by encrypting at the RDBMS-level.
175177

176178
Resources
177179
=========

manage.py

+4
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,8 @@
11
#!/usr/bin/env python
2+
# -*- mode: python; coding: utf-8 -*-
3+
"""Helper script to use ``manage.py``. It might make sense to extend ``manage``
4+
so that it can run the scraper. e.g. ``python manage.py scrape``.
5+
"""
26
import os
37
import sys
48
from django.core.management import execute_from_command_line

person_search/__init__.py

+7
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
# -*- mode: python; coding: utf-8 -*-
2+
"""``person_search``. Crawl web pages for profiles, record results in an encrypted database and generate a pie chart displaying "percent persons with masters".
3+
"""
4+
5+
# INPROD: We would have a smarter logging system. I just set everthing to debug here.
6+
import logging
7+
logging.basicConfig(level=logging.DEBUG)

person_search/admin.py

+7
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,10 @@
1+
# -*- mode: python; coding: utf-8 -*-
2+
"""Register our models for the admin UI
3+
"""
4+
5+
# INPROD: I would find ways of divorcing the "admin" stuff from the production
6+
# instance.
7+
18
from django.contrib import admin
29
from .models import Person, Degree
310

person_search/crypto/__init__.py

+36-16
Original file line numberDiff line numberDiff line change
@@ -1,17 +1,32 @@
11
# -*- mode: python; coding: utf-8 -*-
2-
"""
2+
"""Simple cryptography helper functions
3+
4+
crypto('clear_string') --> 'encrypted_string'
5+
crypto('encrypted_string', decrypt=True) --> 'clear_string'
36
"""
47
# INPROD:
58
#
69
# tl;dr -- Crypto is hard, and this would require some research to do properly.
710
#
8-
# This code does what it says, but I'm under no illusion that this is safe to use in production. I chose the path of least resistance here and use symmetric encryption using the popular `cryptography` module. Depending on the use case, it might be better to focus on making sure no one ever gets to the data in the first place. One possible case where we'd want to use asymmetric encryption is where we have a multi-master cluster of database servers, in which case we could have only the public key on these hosts to permit writting. Reading is a different problem. If read performance is not critical, just have one host with the private key?
9-
# After implementing this I find ``Fernet seems not to be maintained anymore. There has been no updates for the spec in three years. Original developers are in radio silence`` [https://appelsiini.net/2017/branca-alternative-to-jwt/]
11+
# This code does what it says, but I'm under no illusion that this is safe to
12+
# use in production. I chose the path of least resistance here and use symmetric
13+
# encryption using the popular `cryptography` module. Depending on the use case,
14+
# it might be better to focus on making sure no one ever gets to the data in the
15+
# first place. One possible case where we'd want to use asymmetric encryption is
16+
# where we have a multi-master cluster of database servers, in which case we
17+
# could have only the public key on these hosts to permit writing. Reading is a
18+
# different problem. If read performance is not critical, just have one host
19+
# with the private key?
20+
#
21+
# After implementing I find this: ``Fernet seems not to be maintained
22+
# anymore. There has been no updates for the spec in three years. Original
23+
# developers are in radio silence``
24+
# [https://appelsiini.net/2017/branca-alternative-to-jwt/]
1025

11-
# this code is based upon example code in the docs: https://cryptography.io/en/latest/fernet/#using-passwords-with-fernet
26+
# this code is based upon example code in the docs:
27+
# https://cryptography.io/en/latest/fernet/#using-passwords-with-fernet
1228

1329
import logging
14-
logging.basicConfig(level=logging.DEBUG)
1530
logger = logging.getLogger(__name__)
1631
logger.setLevel(logging.DEBUG)
1732

@@ -32,7 +47,8 @@
3247

3348

3449
def log_io(crypt_func):
35-
"""Log i/o of crypto functions: Input (message) and output (return value) are logged in human-readable
50+
"""Log i/o of crypto functions: Input (message) and output (return value)
51+
are logged in human-readable
3652
"""
3753
def wrapper(message, decrypt=False):
3854
mode = ['encrypting', 'decrypting'][decrypt]
@@ -49,7 +65,8 @@ def wrapper(message, decrypt=False):
4965

5066

5167
def get_secret():
52-
"""Use the contents of `~/.ps_secret` if it exsists. Otherwise just use the string `secret`
68+
"""Use the contents of `~/.ps_secret` if it exists. Otherwise just use the
69+
string `secret`
5370
"""
5471
global SECRET
5572
if SECRET is not None:
@@ -64,7 +81,8 @@ def get_secret():
6481

6582

6683
def get_key():
67-
"""Get the "key" for symmetric Fernet encryption. If we've already calculated it, return that.
84+
"""Get the "key" for symmetric Fernet encryption. If we've already
85+
calculated it, return that.
6886
"""
6987
global KEY
7088
if KEY is not None:
@@ -95,7 +113,8 @@ def crypt13(message, decrypt=False):
95113
"""ROT13 cipher for testing
96114
"""
97115
def rot13(message):
98-
"""We put an `E:` in front of the "encrypted" version. This helper function does just ROT13
116+
"""We put an `E:` in front of the "encrypted" version. This helper
117+
function does just ROT13
99118
"""
100119
return codecs.getencoder('rot-13')(message)[0]
101120

@@ -107,10 +126,11 @@ def rot13(message):
107126

108127
@log_io
109128
def crypt(message, decrypt=False):
110-
"""Perform simple symmetric encryption of `message`. `decrypt=True` to decrypt
129+
"""Perform simple symmetric encryption of `message`. `decrypt=True` to
130+
decrypt
111131
112-
input can be bytes (utf-8) or a string, which is immedately converted to bytes
113-
output is always bytes (utf-8)
132+
Input can be bytes (utf-8) or a string, which is immediately converted to
133+
bytes. Return value is always bytes (utf-8)
114134
"""
115135

116136
if not isinstance(message, bytes):
@@ -122,7 +142,8 @@ def crypt(message, decrypt=False):
122142
result = f.decrypt(message)
123143
except InvalidToken as e:
124144
raise Exception(
125-
'Wrong secret. Did you lose `~/.ps_secret` or `~/.ps_salt`? [%s]' % repr(e))
145+
'Wrong secret. Did you lose `~/.ps_secret` or `~/.ps_salt`? '
146+
'[%s]' % repr(e))
126147
else:
127148
result = f.encrypt(message)
128149

@@ -131,9 +152,8 @@ def crypt(message, decrypt=False):
131152

132153
if __name__ == '__main__':
133154

134-
if False:
135-
assert crypt13('foo') == 'E:sbb', 'crypt13 appears to be broken.'
136-
assert crypt13(
155+
assert crypt13('foo') == 'E:sbb', 'crypt13 appears to be broken.'
156+
assert crypt13(
137157
'E:sbb', decrypt=True) == 'foo', 'crypt13 appears to be broken.'
138158

139159
# how I did very basic testing:

person_search/db.py

+10-3
Original file line numberDiff line numberDiff line change
@@ -1,14 +1,21 @@
11
# -*- mode: python; coding: utf-8 -*-
2-
"""Use the Django ORM independently of the web UI. This involves some "tricks" but has the benefit of there being only one place to keep the database i/o logic. This is especially helpful if we're doing something to the data before storing (like encryption)
2+
"""Use the Django ORM independently of the web UI. This involves some "tricks"
3+
but has the benefit of there being only one place to keep the database i/o
4+
logic. This is especially helpful if we're doing something to the data before
5+
storing (like encryption).
36
"""
47

5-
# INPROD: I would very seriously reconsider this. It may be fine, but it may not be! It might be better to teach Django to use a different ORM, like sqlalchemy. Whatever the case, it's very helpful not to re-create database logic in two places. That's why this file...
8+
# INPROD: I would very seriously reconsider this. It may be fine, but it may not
9+
# be! It might be better to teach Django to use a different ORM, like
10+
# sqlalchemy. Whatever the case, it's very helpful not to re-create database
11+
# logic in two places. That's why this file...
612

713
import os
814
import importlib
915
import django
1016

11-
# I have removed some comments and strings to keep things anonymous. It's pretty easy to follow by just reading the code... (again, needs testing and rigor)
17+
# I have removed some comments and strings to keep things anonymous. It's pretty
18+
# easy to follow by just reading the code... (again, needs testing and rigor)
1219

1320
try:
1421
dsm_name = os.environ['DJANGO_SETTINGS_MODULE']

person_search/models.py

+31-11
Original file line numberDiff line numberDiff line change
@@ -1,19 +1,27 @@
11
# -*- mode: python; coding: utf-8 -*-
2+
"""Django database model classes.
3+
4+
* ``Person`` -- Stores persons scraped from the web, all data encrypted here.
5+
* ``Degree`` -- A "degree name", institution tuple. A foreign key lets us assign
6+
(a single) degree to each person.
7+
8+
Encryption is seamlessly performed here using custom Django fields.
9+
"""
210

311
import logging
4-
logging.basicConfig(level=logging.DEBUG)
512
logger = logging.getLogger(__name__)
613
logger.setLevel(logging.DEBUG)
714

815
from django.db import models
9-
from person_search.crypto import crypt13 as crypt
1016
#from person_search.crypto import crypt
17+
# INPROD: In production we would use postgres-level encryption. See
18+
# ``README.md``
19+
from person_search.crypto import crypt13 as crypt
1120

1221

1322
class EncryptedCharField(models.CharField):
1423
"""An encrypted ``CharField``
1524
"""
16-
1725
def from_db_value(self, value, expression, connection, context): # DECRYPT
1826
logger.debug('calling from_db_value with `%s`' % value)
1927
return self.to_python(value)
@@ -32,9 +40,9 @@ def get_prep_value(self, value): # ENCRYPT
3240
return crypt(value, decrypt=False)
3341

3442
class EncryptedLowerCharField(EncryptedCharField):
35-
"""Encrypted ``CharField`` where values are always lower-cased before encryption (see remark about encryption in ``README.md``)
43+
"""An encrypted ``CharField`` where values are always lower-cased before
44+
encryption (see remark about encryption in ``README.md``)
3645
"""
37-
3846
def get_prep_value(self, value): # ENCRYPT
3947
logger.debug('calling get_prep_value with `%s`' % value)
4048
value = value.lower()
@@ -46,9 +54,14 @@ class Degree(models.Model):
4654
institution = models.CharField(max_length=200, blank=False)
4755

4856
def is_masters(self):
49-
"""Using available information about the ``Degree``, return ``True`` if master's degree.
57+
"""Using available information about the ``Degree``, return ``True`` if
58+
master's degree.
5059
"""
51-
# of course this is a rediculous implementation. It's hard to know without more information. Perhaps have a "master list" of regular expressions, where any match on the contents of the ``name`` column (the name of the degree, BS, etc) constitutes "is masters". Lots of possiblilities.
60+
# Of course this is a ridiculous implementation. It's hard to know
61+
# without more information. Perhaps have a "master list" of regular
62+
# expressions, where any match on the contents of the ``name`` column
63+
# (the name of the degree, BS, etc) constitutes "is masters". Lots of
64+
# possibilities. This works for our PoC.
5265
return 'M' in self.name
5366

5467
def __str__(self):
@@ -59,13 +72,20 @@ class Meta:
5972
unique_together = (('name', 'institution'),)
6073

6174
class Person(models.Model):
75+
"""The column widths are over-spec'd here to account for encoding overhead.
76+
See ``README.md``
77+
"""
6278

6379
full_name = EncryptedCharField(max_length=200, blank=False)
64-
# INPROD: for symplicity, one email per person
65-
email = EncryptedLowerCharField(db_index=True, max_length=100, unique=True, blank=False)
80+
# INPROD: for simplicity, one email per person
81+
email = EncryptedLowerCharField(db_index=True, max_length=100, unique=True,
82+
blank=False)
6683
gender = EncryptedCharField(db_index=True, max_length=10, blank=True)
67-
# INPROD: for symplicity one degree per person. Note that this is not encrypted. Posgres level encryption was the way to go. See ``README.md``
68-
degree = models.ForeignKey(Degree, on_delete=models.CASCADE, null=True, blank=True) # null=True -- person without degree
84+
# INPROD: for simplicity one degree per person. Note that this is not
85+
# encrypted. PostgreSQL level encryption was the way to go. See
86+
# ``README.md``
87+
degree = models.ForeignKey(Degree, on_delete=models.CASCADE, null=True,
88+
blank=True) # null=True - person without degree
6989

7090
def __str__(self):
7191
return '%s <%s>' % (self.full_name, self.email)

person_search/pie.py

+15-9
Original file line numberDiff line numberDiff line change
@@ -1,22 +1,25 @@
11
# -*- mode: python; coding: utf-8 -*-
2-
"""``get_pie()`` returns SVG for a single-parameter pie chart for a given ``percent`` (hardcoced size, color).
2+
"""``get_pie()`` returns SVG for a single-parameter pie chart with slice
3+
``color`` for a given ``percent`` (hardcoded size, color).
34
"""
45

5-
def get_pie(percent):
6-
"""returns SVG for a single-parameter pie chart for a given ``percent`` (hardcoced size, color).
6+
def get_pie(percent, color):
7+
"""returns SVG for a single-parameter pie chart with slice ``color`` for a
8+
given ``percent`` (hardcoded size, color).
79
"""
8-
# from https://www.smashingmagazine.com/2015/07/designing-simple-pie-charts-with-css/
10+
# from
11+
# https://www.smashingmagazine.com/2015/07/designing-simple-pie-charts-with-css/
912
pie = """
1013
<style>
1114
svg {
1215
width: 100px; height: 100px;
1316
transform: rotate(-90deg);
14-
background: yellowgreen;
17+
background: green;
1518
border-radius: 50%%;
1619
}
1720
circle {
18-
fill: yellowgreen;
19-
stroke: #655;
21+
fill: green;
22+
stroke: %(color)s;
2023
stroke-width: 32;
2124
stroke-dasharray: %(percent)s 100; /* for 38%% */
2225
}
@@ -25,7 +28,10 @@ def get_pie(percent):
2528
<circle r="16" cx="16" cy="16" />
2629
</svg>
2730
"""
28-
return pie % {'percent': str(percent)}
31+
return pie % {
32+
'percent': str(percent),
33+
'color': color,
34+
}
2935

3036

3137
if __name__ == '__main__':
@@ -36,7 +42,7 @@ def get_pie(percent):
3642
path = '/tmp/.pie.html'
3743

3844
print('Writing to file %s !!' % path)
39-
open(path, 'w').write(get_pie(33))
45+
open(path, 'w').write(get_pie(33, 'blue'))
4046

4147
import webbrowser
4248
webbrowser.open(path, new=0, autoraise=True)

0 commit comments

Comments
 (0)