Commit 46e33f6

New column metadata properties (opendp#435)

Authored by joshua-oss and victoria de sainte agathe
Co-authored-by: victoria de sainte agathe <[email protected]>
1 parent f36f8bb; commit 46e33f6

12 files changed (+289 -83 lines)

Diff for: sql/HISTORY.md (+5)

@@ -1,3 +1,8 @@
+# SmartNoise SQL v0.2.4 Release Notes
+
+* Support for nullable and fixed imputation
+* Allow override of sensitivity via metadata
+
 # SmartNoise SQL v0.2.3 Release Notes

 * Add scalar functions to AST

Diff for: sql/Metadata.md (+4 -1)

@@ -55,10 +55,13 @@ These overrides should be used with caution, because they may affect privacy if

 ## Column Options

-* `type`: Required. This type attribute indicates the simple type for all values in the column. Type may be one of “int”, “float”, “string”, “boolean”, or “date”. The “date” type includes date or time types. If type is set to "unknown", the column will be ignored by the system.
+* `type`: Required. The type attribute indicates the simple type for all values in the column. Type may be one of “int”, “float”, “string”, “boolean”, or “date”. The “date” type includes date or time types. If type is set to "unknown", the column will be ignored by the system.
 * `private_id`: Boolean. Default is `False`. Indicates that this column is the private identifier (e.g. “UserID”, “Household”). This column is optional. Only columns which have private_id set to ‘true’ are treated as individuals subject to privacy protection.
 * `lower`: Valid on numeric columns. Specifies the lower bound for values in this column.
 * `upper`: Valid on numeric columns. Specifies the upper bound for values in this column.
+* `nullable`: Boolean. Default is `True`. Indicates that this column can contain null values. If set to `False`, the system will assume that all values are set. This is useful when the data curator knows that all values are set, and will allow some budget to be preserved by sharing counts across columns.
+* `missing_value`: A value of the same type as the `type` for this column. Default is `None`. If set, the system will replace NULL with the specified value, ensuring that all values are set. If set, `nullable` will be treated as `False`, regardless of its value.
+* `sensitivity`: The sensitivity to be used when releasing sums from this column. Default is `None`. If not set, the system will compute the sensitivity from the upper and lower bounds. If `sensitivity` is set, the upper and lower bounds will be ignored for sensitivity, and this value will be used. The upper and lower bounds will still be used to clamp the columns. If this value is set, and no bounds are provided, the metadata must specify `clamp_columns` as `False`. Note that counts will always use a sensitivity of 1, regardless of the value of this attribute.
 * `cardinality`: Integer. This is an optional hint, valid on columns intended to be used as categories or keys in a GROUP BY. Specifies the approximate number of distinct keys in this column.

 ## Metadata
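The interaction between `nullable` and `missing_value` described above can be sketched in a few lines of Python. This is an illustrative helper, not part of the snsql API; the function name and dict layout are hypothetical.

```python
def normalize_column(meta):
    """Apply the documented defaults: nullable=True, missing_value=None,
    sensitivity=None; a set missing_value forces nullable to False."""
    col = dict(meta)
    col.setdefault("nullable", True)
    col.setdefault("missing_value", None)
    col.setdefault("sensitivity", None)
    if col["missing_value"] is not None:
        # NULLs are imputed with missing_value, so the column is
        # effectively non-nullable regardless of the declared flag.
        col["nullable"] = False
    return col

# A column that imputes NULL incomes to 0.0 is treated as non-nullable,
# even though nullable was declared True.
income = normalize_column({"type": "float", "lower": 0.0, "upper": 500_000.0,
                           "missing_value": 0.0, "nullable": True})
print(income["nullable"])  # False
```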

Diff for: sql/VERSION (+1 -1)

@@ -1 +1 @@
-0.2.2
+0.2.4

Diff for: sql/docs/source/metadata.rst (+5 -3)

@@ -95,10 +95,13 @@ These overrides should be used with caution, because they may affect privacy if

 Column Options
 --------------

-* ``type``: Required. This type attribute indicates the simple type for all values in the column. Type may be one of “int”, “float”, “string”, “boolean”, or “date”. The “date” type includes date or time types. If type is set to "unknown", the column will be ignored by the system.
+* ``type``: Required. The type attribute indicates the simple type for all values in the column. Type may be one of “int”, “float”, “string”, “boolean”, or “date”. The “date” type includes date or time types. If type is set to "unknown", the column will be ignored by the system.
 * ``private_id``: Boolean. Default is ``False``. Indicates that this column is the private identifier (e.g. “UserID”, “Household”). This column is optional. Only columns which have private_id set to true are treated as individuals subject to privacy protection.
 * ``lower``: Valid on numeric columns. Specifies the lower bound for values in this column.
 * ``upper``: Valid on numeric columns. Specifies the upper bound for values in this column.
+* ``nullable``: Boolean. Default is ``True``. Indicates that this column can contain null values. If set to ``False``, the system will assume that all values are set. This is useful when the data curator knows that all values are set, and will allow some budget to be preserved by sharing counts across columns.
+* ``missing_value``: A value of the same type as the ``type`` for this column. Default is ``None``. If set, the system will replace NULL with the specified value, ensuring that all values are set. If set, ``nullable`` will be treated as ``False``, regardless of its value.
+* ``sensitivity``: The sensitivity to be used when releasing sums from this column. Default is ``None``. If not set, the system will compute the sensitivity from the upper and lower bounds. If ``sensitivity`` is set, the upper and lower bounds will be ignored for sensitivity, and this value will be used. The upper and lower bounds will still be used to clamp the columns. If this value is set, and no bounds are provided, the metadata must specify ``clamp_columns`` as ``False``. Note that counts will always use a sensitivity of 1, regardless of the value of this attribute.


 Other Considerations
@@ -195,5 +198,4 @@ The following is an example of a collection containing 3 tables, representing Cr
 EndTrial:
   type: datetime
 TrialGroup:
-  type: int
-
+  type: int
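The budget-sharing rationale behind ``nullable`` (mentioned in the bullet above) can be illustrated with a toy accounting sketch. This is not snsql's actual accountant; the helper names and the per-column budget split are assumptions for illustration only. The idea: when no column can be NULL, COUNT(col) equals COUNT(*) for every column, so one noisy count at full epsilon can be reused rather than spending budget per column.

```python
import random

def laplace_noise(scale):
    # The difference of two exponential variates with mean `scale`
    # is Laplace-distributed with that scale.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def noisy_counts(n_rows, n_columns, epsilon, all_non_nullable):
    """Toy sketch: non-nullable columns share one noisy count at full
    epsilon; nullable columns each get a count from a split budget."""
    if all_non_nullable:
        shared = n_rows + laplace_noise(1.0 / epsilon)
        return [shared] * n_columns
    per_col = epsilon / n_columns  # more columns -> noisier counts
    return [n_rows + laplace_noise(1.0 / per_col) for _ in range(n_columns)]

counts = noisy_counts(1000, 3, epsilon=1.0, all_non_nullable=True)
print(len(set(counts)))  # 1: a single shared count serves all three columns
```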

Diff for: sql/pyproject.toml (+1 -1)

@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "smartnoise-sql"
-version = "0.2.3"
+version = "0.2.4"
 description = "Differentially Private SQL Queries"
 authors = ["SmartNoise Team <[email protected]>"]
 license = "MIT"

Diff for: sql/setup.py (+1 -1)

@@ -25,7 +25,7 @@

 setup_kwargs = {
     'name': 'smartnoise-sql',
-    'version': '0.2.3',
+    'version': '0.2.4',
     'description': 'Differentially Private SQL Queries',
'long_description': '[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Python](https://img.shields.io/badge/python-3.7%20%7C%203.8-blue)](https://www.python.org/)\n\n<a href="https://smartnoise.org"><img src="https://github.com/opendp/smartnoise-sdk/raw/main/images/SmartNoise/SVG/Logo%20Mark_grey.svg" align="left" height="65" vspace="8" hspace="18"></a>\n\n## SmartNoise SQL\n\nDifferentially private SQL queries. Tested with:\n* PostgreSQL\n* SQL Server\n* Spark\n* Pandas (SQLite)\n* PrestoDB\n\nSmartNoise is intended for scenarios where the analyst is trusted by the data owner. SmartNoise uses the [OpenDP](https://github.com/opendp/opendp) library of differential privacy algorithms.\n\n## Installation\n\n```\npip install smartnoise-sql\n```\n\n## Querying a Pandas DataFrame\n\nUse the `from_df` method to create a private reader that can issue queries against a pandas dataframe.\n\n```python\nimport snsql\nfrom snsql import Privacy\nimport pandas as pd\nprivacy = Privacy(epsilon=1.0, delta=0.01)\n\ncsv_path = \'PUMS.csv\'\nmeta_path = \'PUMS.yaml\'\n\npums = pd.read_csv(csv_path)\nreader = snsql.from_df(pums, privacy=privacy, metadata=meta_path)\n\nresult = reader.execute(\'SELECT sex, AVG(age) AS age FROM PUMS.PUMS GROUP BY sex\')\n```\n\n## Querying a SQL Database\n\nUse `from_connection` to wrap an existing database connection.\n\n```python\nimport snsql\nfrom snsql import Privacy\nimport psycopg2\n\nprivacy = Privacy(epsilon=1.0, delta=0.01)\nmeta_path = \'PUMS.yaml\'\n\npumsdb = psycopg2.connect(user=\'postgres\', host=\'localhost\', database=\'PUMS\')\nreader = snsql.from_connection(pumsdb, privacy=privacy, metadata=meta_path)\n\nresult = reader.execute(\'SELECT sex, AVG(age) AS age FROM PUMS.PUMS GROUP BY sex\')\n```\n\n## Querying a Spark DataFrame\n\nUse `from_connection` to wrap a spark session.\n\n```python\nimport pyspark\nfrom pyspark.sql import SparkSession\nspark = 
SparkSession.builder.getOrCreate()\nfrom snsql import *\n\npums = spark.read.load(...) # load a Spark DataFrame\npums.createOrReplaceTempView("PUMS_large")\n\nmetadata = \'PUMS_large.yaml\'\n\nprivate_reader = from_connection(\n spark, \n metadata=metadata, \n privacy=Privacy(epsilon=3.0, delta=1/1_000_000)\n)\nprivate_reader.reader.compare.search_path = ["PUMS"]\n\n\nres = private_reader.execute(\'SELECT COUNT(*) FROM PUMS_large\')\nres.show()\n```\n\n## Privacy Cost\n\nThe privacy parameters epsilon and delta are passed in to the private connection at instantiation time, and apply to each computed column during the life of the session. Privacy cost accrues indefinitely as new queries are executed, with the total accumulated privacy cost being available via the `spent` property of the connection\'s `odometer`:\n\n```python\nprivacy = Privacy(epsilon=0.1, delta=10e-7)\n\nreader = from_connection(conn, metadata=metadata, privacy=privacy)\nprint(reader.odometer.spent) # (0.0, 0.0)\n\nresult = reader.execute(\'SELECT COUNT(*) FROM PUMS.PUMS\')\nprint(reader.odometer.spent) # approximately (0.1, 10e-7)\n```\n\nThe privacy cost increases with the number of columns:\n\n```python\nreader = from_connection(conn, metadata=metadata, privacy=privacy)\nprint(reader.odometer.spent) # (0.0, 0.0)\n\nresult = reader.execute(\'SELECT AVG(age), AVG(income) FROM PUMS.PUMS\')\nprint(reader.odometer.spent) # approximately (0.4, 10e-6)\n```\n\nThe odometer is advanced immediately before the differentially private query result is returned to the caller. 
If the caller wishes to estimate the privacy cost of a query without running it, `get_privacy_cost` can be used:\n\n```python\nreader = from_connection(conn, metadata=metadata, privacy=privacy)\nprint(reader.odometer.spent) # (0.0, 0.0)\n\ncost = reader.get_privacy_cost(\'SELECT AVG(age), AVG(income) FROM PUMS.PUMS\')\nprint(cost) # approximately (0.4, 10e-6)\n\nprint(reader.odometer.spent) # (0.0, 0.0)\n```\n\nNote that the total privacy cost of a session accrues at a slower rate than the sum of the individual query costs obtained by `get_privacy_cost`. The odometer accrues all invocations of mechanisms for the life of a session, and uses them to compute total spend.\n\n```python\nreader = from_connection(conn, metadata=metadata, privacy=privacy)\nquery = \'SELECT COUNT(*) FROM PUMS.PUMS\'\nepsilon_single, _ = reader.get_privacy_cost(query)\nprint(epsilon_single) # 0.1\n\n# no queries executed yet\nprint(reader.odometer.spent) # (0.0, 0.0)\n\nfor _ in range(100):\n reader.execute(query)\n\nepsilon_many, _ = reader.odometer.spent\nprint(f\'{epsilon_many} < {epsilon_single * 100}\')\n```\n\n## Accuracy\n\nThe `get_simple_accuracy` method returns the column-wise accuracies for a given alpha for a given query.\n\n```python\nprivacy = Privacy(epsilon=1.0, delta=10e-6)\n\nreader = from_connection(conn, metadata=metadata, privacy=privacy)\n\nquery = \'SELECT COUNT(*) AS n, SUM(age) AS age FROM PUMS.PUMS\'\n\nacc95 = reader.get_simple_accuracy(query, alpha=0.05)\nprint(f\'n will be +/- {acc95[0]} in 95% of executions. Age will be +/- {acc95[1]}\')\n```\n\nThis method only returns simple accuracies, where the noise scale for each column is fixed and does not vary per row. Statistics like AVG and VARIANCE are computed from a quotient of noisy sum and noisy count, so the accuracy can vary widely per row. In these cases, a per-row accuracy can be obtained with `execute_with_accuracy`. \n\n## Histograms\n\nSQL `group by` queries represent histograms binned by grouping key. 
Queries over a grouping key with unbounded or non-public dimensions expose privacy risk. For example:\n\n```sql\nSELECT last_name, COUNT(*) FROM Sales GROUP BY last_name\n```\n\nIn the above query, if someone with a distinctive last name is included in the database, that person\'s record might accidentally be revealed, even if the noisy count returns 0 or negative. To prevent this from happening, the system will automatically censor dimensions which would violate differential privacy.\n\n## Private Synopsis\n\nA private synopsis is a pre-computed set of differentially private aggregates that can be filtered and aggregated in various ways to produce new reports. Because the private synopsis is differentially private, reports generated from the synopsis do not need to have additional privacy applied, and the synopsis can be distributed without risk of additional privacy loss. Reports over the synopsis can be generated with non-private SQL, within an Excel Pivot Table, or through other common reporting tools.\n\nYou can see a sample [notebook for creating private synopsis](samples/Synopsis.ipynb) suitable for consumption in Excel or SQL.\n\n## Limitations\n\nYou can think of the data access layer as simple middleware that allows composition of `opendp` computations using the SQL language. The SQL language provides a limited subset of what can be expressed through the full `opendp` library. For example, the SQL language does not provide a way to set per-field privacy budget.\n\nBecause we delegate the computation of exact aggregates to the underlying database engines, execution through the SQL layer can be considerably faster, particularly with database engines optimized for precomputed aggregates. However, this design choice means that analysis graphs composed with SQL language do not access data in the engine on a per-row basis. 
Therefore, SQL queries do not currently support algorithms that require per-row access, such as quantile algorithms that use underlying values. This is a limitation that future releases will relax for database engines that support row-based access, such as Spark.\n\nThe SQL processing layer has limited support for bounding contributions when individuals can appear more than once in the data. This includes ability to perform reservoir sampling to bound contributions of an individual, and to scale the sensitivity parameter. These parameters are important when querying reporting tables that might be produced from subqueries and joins, but require caution to use safely.\n\nFor this release, we recommend using the SQL functionality while bounding user contribution to 1 row. The platform defaults to this option by setting `max_contrib` to 1, and should only be overridden if you know what you are doing. Future releases will focus on making these options easier for non-experts to use safely.\n\n\n## Communication\n\n- You are encouraged to join us on [GitHub Discussions](https://github.com/opendp/opendp/discussions/categories/smartnoise)\n- Please use [GitHub Issues](https://github.com/opendp/smartnoise-sdk/issues) for bug reports and feature requests.\n- For other requests, including security issues, please contact us at [[email protected]](mailto:[email protected]).\n\n## Releases and Contributing\n\nPlease let us know if you encounter a bug by [creating an issue](https://github.com/opendp/smartnoise-sdk/issues).\n\nWe appreciate all contributions. Please review the [contributors guide](../contributing.rst). We welcome pull requests with bug-fixes without prior discussion.\n\nIf you plan to contribute new features, utility functions or extensions, please first open an issue and discuss the feature with us.\n',
     'author': 'SmartNoise Team',

Diff for: sql/snsql/_ast/ast.py (+21 -4)

@@ -431,8 +431,11 @@ def get_table_expr(name):
             colname=name,
             valtype=tc[name].typename(),
             is_key=tc[name].is_key,
-            lower=tc[name].lower if tc[name].typename() in ["int", "float"] else None,
-            upper=tc[name].upper if tc[name].typename() in ["int", "float"] else None,
+            lower=tc[name].lower if hasattr(tc[name], "lower") else None,
+            upper=tc[name].upper if hasattr(tc[name], "upper") else None,
+            nullable=tc[name].nullable if hasattr(tc[name], "nullable") else True,
+            missing_value=tc[name].missing_value if hasattr(tc[name], "missing_value") else None,
+            sensitivity=tc[name].sensitivity if hasattr(tc[name], "sensitivity") else None,
             max_ids=table.max_ids,
             sample_max_ids=table.sample_max_ids,
             row_privacy=table.row_privacy,
@@ -527,6 +530,7 @@ def __init__(
         tablename,
         colname,
         valtype="unknown",
+        *ignore,
         is_key=False,
         lower=None,
         upper=None,
@@ -535,6 +539,9 @@
         row_privacy=False,
         censor_dims=False,
         compare=None,
+        nullable = True,
+        missing_value = None,
+        sensitivity = None
     ):
         self.tablename = tablename
         self.colname = colname
@@ -547,6 +554,9 @@ def __init__(
         self.row_privacy = row_privacy
         self.censor_dims = censor_dims
         self.unbounded = lower is None or upper is None
+        self.nullable = nullable
+        self.missing_value = missing_value
+        self._sensitivity = sensitivity
         self.compare = compare

     def __str__(self):
@@ -564,9 +574,16 @@ def type(self):
     def sensitivity(self):
         if self.valtype in ["int", "float"]:
             if self.lower is not None and self.upper is not None:
-                return max(abs(self.upper), abs(self.lower))
+                bounds_sensitivity = max(abs(self.upper), abs(self.lower))
+                if self._sensitivity is not None:
+                    return self._sensitivity
+                else:
+                    return bounds_sensitivity
             else:
-                return np.inf  # unbounded
+                if self._sensitivity is not None:
+                    return self._sensitivity
+                else:
+                    return np.inf  # unbounded
         elif self.valtype == "boolean":
             return 1
         else:
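The precedence implemented by the `sensitivity` property in this diff can be summarized as a standalone sketch (simplified; the real code lives on the column expression class in `ast.py`, and the function below is hypothetical):

```python
import math

def resolve_sensitivity(valtype, lower=None, upper=None, override=None):
    """Mirror the precedence from the diff above: an explicit metadata
    override wins; otherwise numeric sensitivity is max(|upper|, |lower|),
    or infinite when the column is unbounded; booleans are 1."""
    if valtype in ("int", "float"):
        if override is not None:
            return override  # metadata `sensitivity` takes precedence
        if lower is not None and upper is not None:
            return max(abs(upper), abs(lower))
        return math.inf  # unbounded numeric column
    if valtype == "boolean":
        return 1
    return None  # strings/dates have no sum sensitivity

print(resolve_sensitivity("int", 0, 120))               # 120
print(resolve_sensitivity("int", 0, 120, override=50))  # 50
print(resolve_sensitivity("float"))                     # inf
```

Note that, as the metadata docs state, this override applies only to sums; counts keep sensitivity 1 regardless.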
