Add Fabrication mode to ACD (#87)

Co-authored-by: Darren Edge <[email protected]>
Co-authored-by: Dayenne Souza <[email protected]>
microsoft · Jan 15, 2025 · 276436f
1 parent 6d22ac0
commit 276436f
Showing 4 changed files with 42 additions and 3,971 deletions.
diff --git a/app/workflows/anonymize_case_data/README.md b/app/workflows/anonymize_case_data/README.md
@@ -142,6 +142,8 @@ The only user input required for synthesis is the `Epsilon` value, set to `12.00
 
 We recommend using the lowest `Epsilon` value that results in synthetic data with sufficient accuracy to support downstream analysis. Start with the default value, and reduce it in small increments. If the accuracy values with an `Epsilon` value of `12.00` are themselves too low, go back to the `Prepare sensitive data` tab and continue refining the sensitive dataset in ways that reduce the `Excess combinations ratio`.
 
+Optionally, you can set the desired `Fabrication mode` for the anonymization process and compare the data quality metrics from the various settings.
+
 After pressing `Anonymize data`, you will see two differential privacy parameters: the `Epsilon` value you set, and a `Delta` value that is generated based on the data (and indicates the very small theoretical chance that the `Epsilon` privacy guarantee does not hold). It is important to publish both values alongside any DP dataset for correct interpretation of the privacy protection provided.
 
 Once generated, you will see the `Aggregate data` and `Synthetic data` appear on the right hand side:

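Note on the `Delta` value mentioned in the README hunk: the `Epsilon` help text in `workflow.py` states that delta is set to `1/(protected_records*ln(protected_records))`. A minimal sketch of that arithmetic follows, assuming a hypothetical helper name `delta_for` and taking `protected_records` as a plain number. How the library derives `protected_records` from the 0.5% budget share is not shown in this diff and is not modeled here.

```python
import math


def delta_for(protected_records: float) -> float:
    """Illustrative only: delta = 1 / (n * ln(n)), per the help text.

    `delta_for` is a hypothetical name, not a function from the library.
    """
    if protected_records <= 1:
        # ln(n) is zero at n == 1 and negative below it, so the
        # formula is only meaningful for n > 1.
        raise ValueError("protected_records must be greater than 1")
    return 1.0 / (protected_records * math.log(protected_records))


if __name__ == "__main__":
    # Delta shrinks as the number of protected records grows.
    for n in (100, 1_000, 10_000):
        print(n, delta_for(n))
```

Under this formula a larger dataset yields a smaller delta, which is consistent with the README's description of delta as a very small theoretical chance that the `Epsilon` guarantee does not hold.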
diff --git a/app/workflows/anonymize_case_data/workflow.py b/app/workflows/anonymize_case_data/workflow.py
@@ -110,7 +110,7 @@ def create(sv: ds_variables.SessionVariables, workflow: None):
             c1, c2, c3 = st.columns([1, 1, 1])
             with c1:
                 st.markdown("#### Anonymize data")
-                b1, b2 = st.columns([1, 1])
+                b1, b2, b3 = st.columns([1, 1, 1])
 
                 with b1:
                     epsilon = st.number_input(
@@ -119,13 +119,26 @@ def create(sv: ds_variables.SessionVariables, workflow: None):
                         help="The privacy budget, under differential privacy, to use when synthesizing the aggregate dataset.\n\nLower values of epsilon correspond to greater privacy protection but lower data quality.\n\nThe delta parameter is set automatically as 1/(protected_records*ln(protected_records)), where protected_records is the count of sensitive records protected using 0.5% of the privacy budget.\n\n**Rule of thumb**: Aim to keep epsilon at **12** or below.",
                     )
                 with b2:
+                    fab_mode = st.selectbox(
+                        "Fabrication mode",
+                        options=["Balanced", "Progressive", "Minimized", "Uncontrolled"],
+                        help="Options for controlling the fabrication of attribute combinations in the anonymized data. Experiment with different settings and compare the resulting data quality."
+                    )
+                with b3:
                     if st.button("Anonymize data"):
                         sv.anonymize_epsilon.value = epsilon
                         df = sv.anonymize_sensitive_df.value
                         with st.spinner("Anonymizing data..."):
+                            fab_option = AnonymizeCaseData.FabricationStrategy.BALANCED if fab_mode == "Balanced" \
+                                else AnonymizeCaseData.FabricationStrategy.PROGRESSIVE if fab_mode == "Progressive" \
+                                else AnonymizeCaseData.FabricationStrategy.MINIMIZED if fab_mode == "Minimized" \
+                                else AnonymizeCaseData.FabricationStrategy.UNCONTROLLED
+
+
                             acd.anonymize_case_data(
                                 df=df,
                                 epsilon=epsilon,
+                                fabrication_mode=fab_option
                             )
                             sv.anonymize_synthetic_df.value = acd.synthetic_df
                             sv.anonymize_aggregate_df.value = acd.aggregate_df
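
Review note: the chained conditional that turns the `Fabrication mode` label into a strategy could arguably be expressed as a dictionary lookup, which keeps the labels and enum members side by side. The sketch below is an assumption-laden illustration: `FabricationStrategy` here is a stand-in enum (the real one is `AnonymizeCaseData.FabricationStrategy`, whose definition is not part of this diff), and `FAB_MODE_TO_STRATEGY` and `to_strategy` are hypothetical names.

```python
from enum import Enum, auto


class FabricationStrategy(Enum):
    """Stand-in for AnonymizeCaseData.FabricationStrategy (assumed shape)."""

    BALANCED = auto()
    PROGRESSIVE = auto()
    MINIMIZED = auto()
    UNCONTROLLED = auto()


# Hypothetical lookup table: selectbox label -> strategy member.
FAB_MODE_TO_STRATEGY = {
    "Balanced": FabricationStrategy.BALANCED,
    "Progressive": FabricationStrategy.PROGRESSIVE,
    "Minimized": FabricationStrategy.MINIMIZED,
    "Uncontrolled": FabricationStrategy.UNCONTROLLED,
}


def to_strategy(fab_mode: str) -> FabricationStrategy:
    # Mirrors the diff's behavior: any unrecognized label falls
    # through to UNCONTROLLED, like the final `else` branch.
    return FAB_MODE_TO_STRATEGY.get(fab_mode, FabricationStrategy.UNCONTROLLED)


if __name__ == "__main__":
    print(to_strategy("Progressive"))
```

A side benefit of this shape is that the selectbox `options` could be derived from the table's keys, so the labels would only be spelled once. Whether the silent fallback to `UNCONTROLLED` is desirable is a separate question; the sketch preserves it only to match the committed code.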