README.md

ORM
---
Since we're talking about using the data "outside" of Django, I monkey-patched things so that I could use the ORM from plain Python (the scraping, in other words). There are many, many possibilities here and they all depend on the requirements. Things that come to mind (a sketch of the standalone-ORM setup follows this list):
* Have the scraper write to a separate db and have things synchronized using RabbitMQ or something. The scrapers run in a "farm" collecting queries and reply back to the queue with some structured result. The web UI and/or the database would be participating in that queue, inserting the queries and waiting for responses (asynchronously).
* Use a separate ORM for scraper I/O. Something like sqlalchemy is robust and well-tested and would have fewer risks, since it doesn't look like the Django ORM is often used this way. The downside is maintaining two schemas (table classes), etc.
* The pie chart should be implemented in the browser. We could just return a single number and let JavaScript do the rest. I chose to implement everything in Python.
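
For concreteness, a minimal sketch of driving the ORM from plain Python via Django's standard ``django.setup()`` route (the repo's actual monkey-patch may differ; the settings path and model import here are assumptions):

```python
# Drive the Django ORM from a plain script: point Django at the settings
# module, call setup(), then import models as usual.
import os

import django

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "person_search.settings")  # assumed path
django.setup()  # initializes the app registry so models work outside the web UI

from person_search.db import Person  # assumed model location (see db.py below)

print(Person.objects.count())
```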
Scaling
-------
I have no idea how much scaling we'd be expected to do. Three per second or millions per second...? The answer varies. For a more robust PoC I would consider simple PostgreSQL clustering. That would not improve write performance, and writes might even be slower, but we might not care. Some kind of CDN for anything static or cacheable. Django would be running in some "clustered" fashion. Minimally Django would be running behind a fast, secure web server like NGINX (tbd). On the web side of things, db access would be read-only (so clustering improves performance there), so minimally we would have a separate Postgres user for that, but having something else running that has write access to the same database is sketchy. It needs to get updated *somehow*, but that is some security "surface area" to keep in mind.
Encryption
----------
Whether this is a good idea or not depends on the requirements. In most cases, I'd question the benefit. Some process has cleartext read access to that database (it holds the decryption key), and that's where an attacker would attack. So encryption gets you nothing unless someone breaks in the hard way (but why do that?).
That's not to say there are no use cases for it. It would be an interesting discussion and at worst it complicates and slows things down. I could be wrong after all!!
I chose to use symmetric encryption since having the private key and having a secret are pretty much the same thing. There are use cases for asymmetric encryption; I just don't know if this is one of them. Also, the **encryption should live in the database**, most likely. I wrote it into the ORM, which for performance and design reasons is a poor choice, at least in my implementation. Searching the data is greatly complicated: we can search by encrypted output, and that's what I did, but 1) it must be encryption that produces the same output for any given input (i.e., deterministic), otherwise searching and querying get **REALLY** complicated, and 2) it should probably be done at the RDBMS level. I didn't investigate that for the sake of time. A sketch of the determinism problem follows.
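
A minimal illustration of why non-deterministic encryption breaks equality lookups, using the `cryptography` package this project already depends on:

```python
# Fernet prepends a random IV (and a timestamp), so encrypting the same
# plaintext twice yields different ciphertexts -- an equality search on the
# stored ciphertext can never match.
from cryptography.fernet import Fernet

fernet = Fernet(Fernet.generate_key())
a = fernet.encrypt(b"jane@example.com")
b = fernet.encrypt(b"jane@example.com")
assert a != b  # same input, different output: WHERE email = <ciphertext> fails
```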
And because my first implementation doesn't satisfy the first point above, I was unable to use it. So I created a trivial ROT13-based "encryption" as a stand-in, so in some weak sense the data is "encrypted"; a sketch follows. Since this was fast, I had no caching concerns (see below).
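
A minimal sketch of such a stand-in (the actual helper in this repo may be shaped differently):

```python
# ROT13 is deterministic: the same input always produces the same stored
# string, so equality lookups on the "encrypted" column still work. It is,
# of course, not real encryption.
import codecs

def weak_encrypt(text: str) -> str:
    return codecs.encode(text, "rot13")

assert weak_encrypt("jane@example.com") == weak_encrypt("jane@example.com")
assert codecs.decode(weak_encrypt("jane@example.com"), "rot13") == "jane@example.com"
```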
In production I would investigate PostgreSQL's encryption options, which are [well documented](https://www.postgresql.org/docs/current/static/encryption-options.html).

Misc
----
A list of smaller "todos" and remarks about the implementation:
* Imports should be done in a smart, consistent way. I import ``person_search`` by name everywhere for now. It might be cleaved off as a separate package anyhow. All things being equal, just following "best practices" is a good idea.
* I assume that for a given email, the scraped data never changes. This both simplifies things and speeds things up. In production, we might want to periodically check for updates: if a record has "expired", add that email address to the queue of addresses to be scraped (see the sketch after this list).
* I also chose to use some off-the-shelf (OTS) libraries, which I think is bad practice. It's OK to use some libraries, but if they're not well maintained, be prepared to completely understand and *own* them indefinitely.
* Encryption as it's implemented adds a large amount of overhead. This would also be fixed by encrypting at the RDBMS level.
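
A hypothetical sketch of that refresh check (``updated_at`` and the queue interface are illustrative assumptions, not fields from this repo):

```python
# If a person's scraped record is older than some threshold, re-queue the
# email address for scraping.
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=30)  # assumed refresh interval

def maybe_enqueue(person, queue):
    """Re-queue ``person.email`` when the stored record has "expired"."""
    if datetime.now(timezone.utc) - person.updated_at > MAX_AGE:
        queue.put(person.email)
```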
"""``person_search``. Crawl web pages for profiles, record results in an encrypted database and generate a pie chart displaying "percent persons with masters".
3
+
"""
4
+
5
+
# INPROD: We would have a smarter logging system. I just set everthing to debug here.
# tl;dr -- Crypto is hard, and this would require some research to do properly.
#
# This code does what it says, but I'm under no illusion that this is safe to
# use in production. I chose the path of least resistance here and use symmetric
# encryption using the popular `cryptography` module. Depending on the use case,
# it might be better to focus on making sure no one ever gets to the data in the
# first place. One possible case where we'd want to use asymmetric encryption is
# where we have a multi-master cluster of database servers, in which case we
# could have only the public key on these hosts to permit writing. Reading is a
# different problem. If read performance is not critical, just have one host
# with the private key?
#
# After implementing I find this: ``Fernet seems not to be maintained
# anymore. There has been no updates for the spec in three years. Original
# developers are in radio silence`` [https://appelsiini.net/2017/branca-alternative-to-jwt/]
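#
# A hypothetical usage sketch (not part of this module; key handling is a
# placeholder -- in production the key would come from secure storage, not a
# fresh generate_key() call):
#
#     from cryptography.fernet import Fernet
#
#     fernet = Fernet(Fernet.generate_key())
#     token = fernet.encrypt(b"sensitive value")
#     assert fernet.decrypt(token) == b"sensitive value"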

person_search/db.py

# -*- mode: python; coding: utf-8 -*-
"""Use the Django ORM independently of the web UI. This involves some "tricks"
but has the benefit of there being only one place to keep the database i/o
logic. This is especially helpful if we're doing something to the data before
storing (like encryption).
"""

# INPROD: I would very seriously reconsider this. It may be fine, but it may not
# be! It might be better to teach Django to use a different ORM, like
# sqlalchemy. Whatever the case, it's very helpful not to re-create database
# logic in two places. That's why this file...

import os
import importlib
import django

# I have removed some comments and strings to keep things anonymous. It's pretty
# easy to follow by just reading the code... (again, needs testing and rigor)
"""Using available information about the ``Degree``, return ``True`` if master's degree.
57
+
"""Using available information about the ``Degree``, return ``True`` if
58
+
master's degree.
50
59
"""
51
-
# of course this is a rediculous implementation. It's hard to know without more information. Perhaps have a "master list" of regular expressions, where any match on the contents of the ``name`` column (the name of the degree, BS, etc) constitutes "is masters". Lots of possiblilities.
60
+
# Of course this is a ridiculous implementation. It's hard to know
61
+
# without more information. Perhaps have a "master list" of regular
62
+
# expressions, where any match on the contents of the ``name`` column
63
+
# (the name of the degree, BS, etc) constitutes "is masters". Lots of
64
+
# possibilities. This works for our PoC.
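        #
        # A hypothetical sketch of that regex idea (the patterns are
        # illustrative guesses, not a vetted list):
        #
        #     import re
        #     MASTERS_RE = re.compile(r"\b(master|m\.?s\.?c?|m\.?a|m\.?eng|mba)\b", re.I)
        #     return bool(MASTERS_RE.search(self.name))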
        return 'M' in self.name

    def __str__(self):

    class Meta:
        unique_together = (('name', 'institution'),)

class Person(models.Model):
    """The column widths are over-spec'd here to account for encoding overhead.