Interactive Privacy-Preserving Techniques Demo

Introduction

This demonstration showcases various techniques I've implemented to protect participant anonymity in large-scale surveys. The scenarios are based on my experience with a survey involving approximately 1000 participants, over 150 questions, and rich demographic data including Date of Birth, Gender, Education, Country, Ethnicity, Department, Site, Job Title, Employment Status, Job Level, Tenure, and Annual Salary, plus organizational hierarchy information ("who do you work for," "who works for you").

Each section will illustrate a specific privacy-enhancing technique using fictional, representative data samples to explain its core principles and "best case" application in protecting anonymity while preserving data utility.

Navigate using the menu above to explore each technique.

1. K-Anonymity: Ensuring Group Indistinguishability

K-Anonymity ensures that any individual in the dataset cannot be distinguished from at least 'k-1' other individuals based on their quasi-identifiers (QIs) – attributes that, in combination, could lead to re-identification. Common QIs from our survey include Department, Job Level, Site, and Age Range.

Original Fictional Data Sample

(QIs: Department, Job Level, Site, Age Range. Sensitive: Engagement Score)

ID | Department | Job Level | Site | Age Range | Engagement Score (Sensitive)

K-Anonymized Data (Illustrative)

ID | Department | Job Level | Site | Age Range | Engagement Score (Sensitive)

How K-Anonymity Was Applied (Best Case Scenario)

In our survey system, when generating reports or allowing data exploration with filters (e.g., "Show me engagement scores for Senior Managers in the London Engineering department aged 30-39"), the system would first check if the resulting group met the pre-defined 'k' value.

Best Case: If a query resulted in a group of 5 individuals who all matched on Department='Engineering', Job Level='Senior Manager', Site='London', and Age Range='30-39', and our 'k' was 3, this group would satisfy k-anonymity. Their (potentially different) engagement scores could be shown or aggregated.

If a group was smaller than 'k' (e.g., only 1 Director in HR at Site X aged 50-59), we applied:

  • Generalization: Broadening categories. E.g., 'Director' -> 'Senior Leadership', 'Site X' -> 'All Sites', '50-59' -> '50+'. The system had predefined hierarchies for generalization.
  • Suppression: In rare cases where generalization wasn't feasible without losing too much utility, the record(s) or specific sensitive values might be suppressed from that specific view.

This demo illustratively generalizes Job Level or Age Range if a (Department, Job Level, Site, Age Range) group is smaller than k. For simplicity, it suppresses the score if generalization is insufficient.
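
For illustration, a minimal Python sketch of this group-size check might look like the following. The field names and one-step hierarchies are hypothetical, not the production system:

    from collections import Counter

    # Hypothetical one-step generalization hierarchies (values are illustrative)
    JOB_LEVEL_UP = {"Director": "Senior Leadership", "Senior Manager": "Management"}
    AGE_RANGE_UP = {"50-59": "50+", "60-69": "50+"}

    def k_anonymize(records, qis, k=3):
        """Keep records whose QI combination appears at least k times;
        generalize the QIs of records in smaller groups by one hierarchy step."""
        group_sizes = Counter(tuple(r[q] for q in qis) for r in records)
        result = []
        for r in records:
            if group_sizes[tuple(r[q] for q in qis)] >= k:
                result.append(r)        # group already satisfies k-anonymity
                continue
            g = dict(r)                 # generalize one step up the hierarchy
            g["Job Level"] = JOB_LEVEL_UP.get(g["Job Level"], g["Job Level"])
            g["Age Range"] = AGE_RANGE_UP.get(g["Age Range"], g["Age Range"])
            result.append(g)
        # A complete pass would re-count the generalized groups and suppress the
        # sensitive value where a group is still smaller than k.
        return result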

My Innovation Highlight: I developed dynamic generalization hierarchies that adapted based on the data distribution and the QIs selected. For instance, 'Job Level' generalization might be more granular for large departments but broader for smaller ones to achieve 'k' with minimal information loss. We also provided users with feedback on why certain data might be generalized or suppressed, improving transparency.

2. L-Diversity: Ensuring Variety in Sensitive Attributes

L-Diversity is a crucial extension of K-Anonymity. It addresses the risk that even if 'k' individuals are indistinguishable by their quasi-identifiers (QIs), they might all share the same sensitive attribute value, effectively revealing that information. L-Diversity mandates that for each QI group (equivalence class), there must be at least 'l' distinct values for the sensitive attribute.

For this demo, our QIs are Department, Job Level, and Site. The sensitive attribute is 'Annual Salary Band'.

Original Fictional Data Sample

(QIs: Department, Job Level, Site. Sensitive: Salary Band)

ID | Department | Job Level | Site | Annual Salary Band (Sensitive)

L-Diverse Data (Illustrative)

ID | Department | Job Level | Site | Annual Salary Band (Sensitive)

How L-Diversity Was Applied (Best Case Scenario)

After ensuring k-anonymity for a set of QIs, my system would then check each resulting equivalence class for l-diversity with respect to specified sensitive attributes like 'Annual Salary Band' or 'Performance Rating'.

Best Case: Consider a group of 4 (k=4) 'Sales Managers' in 'New York'. If their salary bands were ['Low', 'Medium', 'Medium', 'High'], this group would satisfy l-diversity for l=3 (three distinct values: Low, Medium, High). If l was set to 2, it would also satisfy it.

If an equivalence class failed l-diversity (e.g., all 4 Sales Managers had a 'High' salary band, failing for l >= 2), the system would:

  • Further Generalize QIs: Try to merge the non-diverse group with other groups by generalizing QIs (e.g., 'Sales Manager' -> 'Managerial Staff', or 'New York' -> 'All US Sites') until the new, larger group satisfied l-diversity.
  • Suppression: If further generalization was not possible without excessive information loss, the sensitive data for that group, or sometimes the entire group, might be suppressed for that particular query or report. This demo primarily illustrates suppression for simplicity if a group fails l-diversity after initial k-grouping.
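
A minimal check for distinct l-diversity can be sketched as follows, using the Sales Manager example above (illustrative field names only):

    def is_l_diverse(group, sensitive_attr, l=2):
        """Distinct l-diversity: the equivalence class must contain at least
        l different values of the sensitive attribute."""
        return len({record[sensitive_attr] for record in group}) >= l

    # The Sales Manager example: salary bands Low, Medium, Medium, High
    group = [{"Salary Band": band} for band in ["Low", "Medium", "Medium", "High"]]
    print(is_l_diverse(group, "Salary Band", l=3))   # True - three distinct bands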

My Innovation Highlight: I implemented an 'adaptive l-diversity' mechanism. For highly sensitive attributes (like salary), a higher 'l' value could be enforced. The system also prioritized generalization strategies that were most effective at increasing diversity for the target sensitive attribute, minimizing overall data distortion. For example, it might try to merge groups by generalizing a QI known to correlate less with the sensitive attribute first.

3. T-Closeness: Matching Sensitive Attribute Distributions

T-Closeness enhances L-Diversity by requiring that the distribution of a sensitive attribute within any QI group (equivalence class) is "close" to its distribution in the overall dataset. This prevents inference attacks based on skewed distributions within a group, even if the group is l-diverse. Closeness is measured with a distance between the two distributions (commonly the Earth Mover's Distance), and this distance must not exceed a threshold 't'.

For this demo, QIs are Department and Site. The sensitive attribute is 'Salary Satisfaction' (Low, Medium, High). We'll use a simplified visual check for distribution similarity.

Original Fictional Data Sample

(QIs: Department, Site. Sensitive: Salary Satisfaction)

ID | Department | Site | Salary Satisfaction (Sensitive)

T-Close Data (Illustrative)

ID | Department | Site | Salary Satisfaction (Sensitive)

Overall Distribution of 'Salary Satisfaction' (Entire Dataset)

Distribution within Selected QI Group (click red row in table above)

Select a highlighted (red) QI group from the anonymized table to see its distribution.

How T-Closeness Was Applied (Best Case Scenario)

After achieving k-anonymity and l-diversity, my system would further analyze equivalence classes for t-closeness with respect to critical sensitive attributes. The goal was to ensure that the distribution of, say, 'Employee Engagement Scores' or 'Salary Satisfaction Levels' within any identifiable subgroup closely mirrored the overall company distribution.

Best Case: Imagine an equivalence class of 5 'Engineers in London'. If the overall company has 30% 'High', 50% 'Medium', and 20% 'Low' Salary Satisfaction, this group of 5 engineers would ideally also reflect a similar distribution (e.g., 1 High, 3 Medium, 1 Low, which is 20%H, 60%M, 20%L). If the chosen 't' (e.g., max 0.1 deviation per category) was met, the group is t-close.

If a group failed t-closeness (e.g., all 5 engineers reported 'Low' satisfaction, which is very different from the overall distribution), the system would employ:

  • Targeted Generalization/Merging: Attempting to merge the non-t-close group with other groups that could balance out its distribution, ideally by generalizing QIs that are less correlated with the sensitive attribute.
  • Suppression: As a last resort for highly skewed groups where generalization would destroy too much utility, the sensitive data for that group might be suppressed. This demo primarily illustrates suppression for groups failing t-closeness.
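
The demo's simplified check compares per-category frequencies rather than computing a full Earth Mover's Distance. A rough sketch of that comparison (helper names are illustrative) could look like this:

    from collections import Counter

    def distribution(values):
        """Relative frequency of each category in a list of values."""
        counts = Counter(values)
        total = len(values)
        return {category: n / total for category, n in counts.items()}

    def is_t_close(group_values, overall_values, t=0.1):
        """Simplified t-closeness check: every category's frequency in the group
        must be within t of its frequency in the overall dataset."""
        group_dist = distribution(group_values)
        overall_dist = distribution(overall_values)
        categories = set(group_dist) | set(overall_dist)
        return all(
            abs(group_dist.get(c, 0.0) - overall_dist.get(c, 0.0)) <= t
            for c in categories
        )

    # The London engineers example: 1 High, 3 Medium, 1 Low vs. 30/50/20 overall
    group = ["High"] + ["Medium"] * 3 + ["Low"]
    overall = ["High"] * 30 + ["Medium"] * 50 + ["Low"] * 20
    print(is_t_close(group, overall, t=0.1))   # True - deviations of at most 0.1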

My Innovation Highlight: I developed a method to calculate an 'impact score' for t-closeness failures, which helped prioritize which QI to generalize to most effectively restore t-closeness with minimal data loss. We also used different 't' thresholds for different sensitive attributes based on their perceived sensitivity and baseline distribution variance. For attributes with naturally high variance, a slightly larger 't' might be acceptable.

4. K-Map Risk Visualization: Identifying High-Risk Data Cells

"K-Map" (in this context, referring to a risk map based on k-anonymity principles) helps visualize how combinations of Quasi-Identifiers (QIs) distribute individuals across a dataset. It identifies "cells" or unique QI combinations that contain very few individuals, making those individuals potentially re-identifiable. This step is crucial for pre-anonymization risk assessment.

Select up to 3 QIs from the survey data to generate a risk map. The map will show unique combinations and the count of individuals in each. Cells below the 'Risk Threshold' are highlighted.


Risk Map: Unique QI Combinations and Counts

How K-Map Risk Visualization Was Used

Before applying anonymization techniques like generalization or suppression, it was vital to understand the inherent re-identification risks in the raw data. I used a "k-map" approach (conceptually) to analyze the dataset by selecting various combinations of QIs (e.g., Department + Job Title + Country + Age Bracket).

The system would then compute how many unique combinations of these selected QIs existed and, crucially, how many individuals fell into each unique combination (each "cell" in the map).

Best Case Identification: This process would clearly highlight, for example, that while "Engineers in London" might be a large group, "Female Directors in the Berlin Sales office with a PhD" might only have 1 individual. This '1' is a high-risk cell. The visualization would flag all cells with counts below a defined minimum 'k' (the risk threshold). This directly informed which QIs or specific values needed the most aggressive generalization or where data might need to be suppressed to achieve the target k-anonymity across the board.
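
As an illustration, such a risk map amounts to a group-by and count over the selected QIs. The sketch below assumes the survey data sits in a pandas DataFrame with illustrative column names, which is an assumption rather than the actual tooling:

    import pandas as pd

    def risk_map(df: pd.DataFrame, qis: list, k: int = 3) -> pd.DataFrame:
        """Count individuals in every unique combination of the selected QIs
        and flag 'risky' cells whose count falls below the threshold k."""
        cells = df.groupby(qis, dropna=False).size().reset_index(name="count")
        cells["risky"] = cells["count"] < k
        return cells.sort_values("count")

    # Hypothetical usage:
    # cells = risk_map(survey_df, ["Department", "Job Title", "Country"], k=5)
    # print(cells[cells["risky"]])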

This interactive demo lets you select QIs and see a simplified version of this risk map. Cells with counts below your chosen threshold are highlighted as risky.

My Innovation Highlight: I developed an automated risk-profiling tool that could iterate through many common QI combinations, generating these "k-maps" and producing a prioritized list of the riskiest combinations across the entire 1000-participant dataset. This allowed us to proactively address the most significant vulnerabilities first and tailor generalization hierarchies more effectively, rather than applying blanket rules. The tool also suggested potential generalization strategies based on the identified risky cells.

5. Differential Privacy: Formal Guarantees via Calibrated Noise

Differential Privacy (DP) provides a strong, mathematical definition of privacy. It ensures that the outcome of any analysis is essentially the same, whether or not any single individual's data is included in the computation. This is typically achieved by adding a carefully calibrated amount of random noise to the true result of a query (e.g., a count, sum, or average).

The amount of noise is controlled by a "privacy budget" parameter, epsilon (ε). Smaller epsilon values mean more noise and thus stronger privacy guarantees, but potentially less accurate results.

This demo will calculate the average "Hours worked from home last week" from a sample. You can adjust epsilon to see its effect on the noise added to the true average.

(Lower ε = More Privacy, More Noise)

Fictional Data Sample: Hours Worked from Home Last Week

Participant ID | Hours From Home

Query Results: Average Hours From Home

True Average: -

Differentially Private Average (True + Noise): -

Note: The DP Average will vary slightly each time you calculate due to random noise addition.

How Differential Privacy Was Applied (Best Case Scenario)

In my work, particularly when generating aggregate statistics for large groups or for public release from the 1000-participant survey, I designed systems to incorporate Differential Privacy. This was crucial for answering questions like "What is the average number of training hours completed by employees in the Sales department?" or "What proportion of employees use benefit X?"

Best Case Application:

  1. A query for an aggregate statistic (e.g., average annual salary for a large department) would be submitted.
  2. The system calculates the true average.
  3. It then determines the "sensitivity" of the query (how much the result could change if one person's data was removed/altered). For an average, this relates to the range of possible salary values.
  4. Using a pre-defined epsilon (ε) from the allocated privacy budget, the system generates noise from a Laplace distribution (scaled by sensitivity and epsilon) and adds it to the true average.
  5. The resulting noisy average is released.

This ensures that an observer seeing the released average cannot confidently infer whether any specific individual (e.g., the CEO with a very high salary) was included in that department's average calculation for that query. The key is that the noise masks individual contributions.

This demo uses a simplified Laplace noise addition for an average. The "sensitivity" is assumed based on the range of data for simplicity.
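
A minimal sketch of that simplified mechanism is shown below. It assumes each value is clipped to a known range so the sensitivity of the mean over n records is (upper - lower) / n; it is not the production implementation:

    import numpy as np

    def dp_average(values, epsilon, lower, upper, rng=None):
        """Differentially private mean via the Laplace mechanism.
        With each value clipped to [lower, upper], the sensitivity of the
        mean over n records is (upper - lower) / n."""
        rng = rng or np.random.default_rng()
        values = np.clip(np.asarray(values, dtype=float), lower, upper)
        sensitivity = (upper - lower) / len(values)
        noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
        return values.mean() + noise

    # Fictional "hours worked from home" sample
    hours = [0, 2, 5, 8, 10, 3, 0, 7]
    print(dp_average(hours, epsilon=1.0, lower=0, upper=10))

Running it repeatedly with the same epsilon gives slightly different answers each time, which is exactly the behaviour the demo's "calculate" button illustrates.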

My Innovation Highlight: I developed a privacy budget management system that tracked epsilon expenditure across multiple queries and datasets. This allowed us to provide robust DP guarantees over time, even when analysts performed various exploratory queries. I also worked on techniques to optimize query sensitivity calculations for different types of survey questions (e.g., Likert scales, numerical inputs, categorical choices) to add the minimum necessary noise while still satisfying the DP definition.
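
A bare-bones version of such epsilon accounting, using simple sequential composition, might look like the sketch below (an illustration only, not the actual budget manager):

    class PrivacyBudget:
        """Minimal epsilon accounting: each query spends part of a fixed total
        budget, and further queries are refused once it is exhausted."""

        def __init__(self, total_epsilon):
            self.total = total_epsilon
            self.spent = 0.0

        def charge(self, epsilon):
            if self.spent + epsilon > self.total:
                raise RuntimeError("Privacy budget exhausted for this dataset")
            self.spent += epsilon

    # Usage: allow at most a total epsilon of 2.0 across all queries
    # budget = PrivacyBudget(2.0)
    # budget.charge(0.5)   # first query
    # budget.charge(0.5)   # second query, and so on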

6. Graph-Based Privacy: Anonymizing Organizational Structures

Organizational hierarchy data ("who reports to whom") is a form of graph data. Individuals can sometimes be re-identified based on their unique position within this structure, such as having a unique number of direct reports or a very specific role in a small team. Graph-based privacy techniques aim to mitigate these risks.

This demo visualizes an expanded fictional org chart with improved styling. We'll illustrate how structural properties (like the number of direct reports or specific roles) can be generalized to protect individuals. Note that the visual complexity of sites like 'theorg.com' typically relies on specialized JavaScript libraries for advanced rendering and layout, which are beyond the scope of this self-contained demo.

Original Fictional Org Chart (Expanded)

Anonymized Org Chart (Illustrative)

How Graph-Based Privacy Was Applied (Best Case Scenario)

The "who do you work for" and "who works for you" data from the 1000-participant survey formed a complex organizational graph. Simply removing names wasn't enough, as structural information could still identify people. For example: "the only VP with exactly 2 direct reports, one of whom is a Director with 7 reports."

My approach involved several layers:

  • Degree Anonymization (k-degree anonymity): Ensuring that for any given number of direct reports (degree), there were at least 'k' managers with that same degree, or their degree was generalized into a bucket (e.g., "1-3 reports", "4-7 reports"). This demo illustrates this by "bucketing" report counts.
  • Role/Attribute Generalization: If a role was too unique within a specific branch of the hierarchy (e.g., "Lead Data Privacy Officer for EU Operations"), it would be generalized (e.g., "Senior Compliance Officer").
  • Structural Anonymization (Conceptual): For more advanced protection, techniques like adding/removing carefully selected (false) edges or nodes can be used to break unique structural patterns, though this is highly complex. My work focused on practical degree and role generalization for our survey data.

Best Case: An analyst could query "average team size for Senior Managers" and get a useful, slightly generalized result (e.g., "average 5-8 reports") without being able to pinpoint a specific Senior Manager who might be the only one with exactly 6 reports. Unique roles in small teams would be shown with generalized titles.
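
The report-count bucketing used in this demo can be sketched as follows; the bucket boundaries and field names are illustrative rather than the production rules:

    from collections import Counter

    def bucket_report_count(n_reports):
        """Generalize an exact direct-report count into a coarse bucket
        (bucket boundaries are illustrative)."""
        if n_reports == 0:
            return "0 reports"
        if n_reports <= 3:
            return "1-3 reports"
        if n_reports <= 7:
            return "4-7 reports"
        return "8+ reports"

    def anonymize_degrees(managers, k=3):
        """Keep an exact report count only if at least k managers share it;
        otherwise replace it with a bucket (a simple take on k-degree anonymity)."""
        degree_counts = Counter(m["reports"] for m in managers)
        return [
            dict(m, reports=m["reports"]) if degree_counts[m["reports"]] >= k
            else dict(m, reports=bucket_report_count(m["reports"]))
            for m in managers
        ]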

My Innovation Highlight: I developed a 'structural risk score' for each node in the hierarchy, considering its degree, the uniqueness of its role title within its department, and the size of its team relative to others. This score helped prioritize which parts of the org graph needed anonymization most urgently. We then applied iterative generalization (first to report counts, then to roles) until the risk scores for all nodes fell below a defined threshold, balancing privacy with the utility of the hierarchical data for analysis.