This demonstration showcases various techniques I've implemented to protect participant anonymity in large-scale surveys. The scenarios are based on my experience with a survey involving approximately 1000 participants, over 150 questions, and rich demographic data including Date of Birth, Gender, Education, Country, Ethnicity, Department, Site, Job Title, Employment Status, Job Level, Tenure, and Annual Salary, plus organizational hierarchy information ("who do you work for," "who works for you").
Each section will illustrate a specific privacy-enhancing technique using fictional, representative data samples to explain its core principles and "best case" application in protecting anonymity while preserving data utility.
Navigate using the menu above to explore each technique.
K-Anonymity ensures that any individual in the dataset cannot be distinguished from at least 'k-1' other individuals based on their quasi-identifiers (QIs) – attributes that, in combination, could lead to re-identification. Common QIs from our survey include Department, Job Level, Site, and Age Range.
(QIs: Department, Job Level, Site, Age Range. Sensitive: Engagement Score)
Original data sample (rows are populated in the interactive demo):

| ID | Department | Job Level | Site | Age Range | Engagement Score (Sensitive) |
|---|---|---|---|---|---|

Anonymized data sample:

| ID | Department | Job Level | Site | Age Range | Engagement Score (Sensitive) |
|---|---|---|---|---|---|
In our survey system, when generating reports or allowing data exploration with filters (e.g., "Show me engagement scores for Senior Managers in the London Engineering department aged 30-39"), the system would first check if the resulting group met the pre-defined 'k' value.
Best Case: If a query resulted in a group of 5 individuals who all matched on Department='Engineering', Job Level='Senior Manager', Site='London', and Age Range='30-39', and our 'k' was 3, this group would satisfy k-anonymity. Their (potentially different) engagement scores could be shown or aggregated.
If a group was smaller than 'k' (e.g., only 1 Director in HR at Site X aged 50-59), we applied generalization (e.g., widening the Age Range or reporting Site at the country level) or suppression of that group's results entirely.
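A minimal sketch of that gate in Python (pandas), assuming a DataFrame with the column names from the table above; the function name and 'k' value are illustrative:

```python
import pandas as pd

K = 3  # minimum group size required before results are released

def safe_group_average(df: pd.DataFrame, filters: dict,
                       value_col: str = "Engagement Score", k: int = K):
    """Release an aggregate only if the matching group has at least k members."""
    mask = pd.Series(True, index=df.index)
    for col, val in filters.items():
        mask &= df[col] == val
    group = df[mask]
    if len(group) < k:
        return None  # suppress: fewer than k individuals match this query
    return group[value_col].mean()

# Example query from the text:
# safe_group_average(survey_df,
#     {"Department": "Engineering", "Job Level": "Senior Manager",
#      "Site": "London", "Age Range": "30-39"})
```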
L-Diversity is a crucial extension of K-Anonymity. It addresses the risk that even if 'k' individuals are indistinguishable by their quasi-identifiers (QIs), they might all share the same sensitive attribute value, effectively revealing that information. L-Diversity mandates that for each QI group (equivalence class), there must be at least 'l' distinct values for the sensitive attribute.
For this demo, our QIs are Department, Job Level, and Site. The sensitive attribute is 'Annual Salary Band'.
(QIs: Department, Job Level, Site. Sensitive: Salary Band)
Original data sample:

| ID | Department | Job Level | Site | Annual Salary Band (Sensitive) |
|---|---|---|---|---|

Anonymized data sample:

| ID | Department | Job Level | Site | Annual Salary Band (Sensitive) |
|---|---|---|---|---|
After ensuring k-anonymity for a set of QIs, my system would then check each resulting equivalence class for l-diversity with respect to specified sensitive attributes like 'Annual Salary Band' or 'Performance Rating'.
Best Case: Consider a group of 4 (k=4) 'Sales Managers' in 'New York'. If their salary bands were ['Low', 'Medium', 'Medium', 'High'], this group would satisfy l-diversity for l=3 (three distinct values: Low, Medium, High). If l was set to 2, it would also satisfy it.
If an equivalence class failed l-diversity (e.g., all 4 Sales Managers had a 'High' salary band, failing for l >= 2), the system would generalize the QIs further, merging the class into a larger, more diverse group, or suppress the sensitive values for that class.
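A minimal sketch of the l-diversity check under the same assumptions (pandas DataFrame, column names as in the table; names are illustrative):

```python
import pandas as pd

QIS = ["Department", "Job Level", "Site"]
SENSITIVE = "Annual Salary Band"

def l_diversity_violations(df: pd.DataFrame, l: int = 2,
                           qis=QIS, sensitive: str = SENSITIVE) -> pd.Series:
    """Count distinct sensitive values per QI group; return groups below l."""
    distinct = df.groupby(qis)[sensitive].nunique()
    return distinct[distinct < l]

# Any group returned here would then be further generalized or suppressed.
```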
T-Closeness enhances L-Diversity by requiring that the distribution of a sensitive attribute within any QI group (equivalence class) is "close" to its distribution in the overall dataset. This prevents inference attacks based on skewed distributions within a group, even if it's l-diverse. The "closeness" is measured by a distance metric not exceeding a threshold 't'.
For this demo, QIs are Department and Site. The sensitive attribute is 'Salary Satisfaction' (Low, Medium, High). We'll use a simplified visual check for distribution similarity.
(QIs: Department, Site. Sensitive: Salary Satisfaction)
Original data sample:

| ID | Department | Site | Salary Satisfaction (Sensitive) |
|---|---|---|---|

Anonymized data sample:

| ID | Department | Site | Salary Satisfaction (Sensitive) |
|---|---|---|---|
Select a highlighted (red) QI group from the anonymized table to see its distribution.
After achieving k-anonymity and l-diversity, my system would further analyze equivalence classes for t-closeness with respect to critical sensitive attributes. The goal was to ensure that the distribution of, say, 'Employee Engagement Scores' or 'Salary Satisfaction Levels' within any identifiable subgroup closely mirrored the overall company distribution.
Best Case: Imagine an equivalence class of 5 'Engineers in London'. If the overall company has 30% 'High', 50% 'Medium', and 20% 'Low' Salary Satisfaction, this group of 5 engineers would ideally also reflect a similar distribution (e.g., 1 High, 3 Medium, 1 Low, which is 20%H, 60%M, 20%L). If the chosen 't' (e.g., max 0.1 deviation per category) was met, the group is t-close.
If a group failed t-closeness (e.g., all 5 engineers reported 'Low' satisfaction, which is very different from the overall distribution), the system would employ further generalization, merging the group until its distribution approached the overall one, or suppress that group's published results.
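The demo's simplified "max deviation per category" rule could be checked as follows; this is a sketch only, since the formal t-closeness definition measures closeness with a distance such as Earth Mover's Distance:

```python
import pandas as pd

def is_t_close(group_values: pd.Series, overall_values: pd.Series,
               t: float = 0.1) -> bool:
    """Simplified check: each sensitive category's proportion inside the
    group may deviate from the overall proportion by at most t."""
    g = group_values.value_counts(normalize=True)
    o = overall_values.value_counts(normalize=True)
    cats = o.index.union(g.index)
    max_dev = (g.reindex(cats, fill_value=0)
               - o.reindex(cats, fill_value=0)).abs().max()
    return max_dev <= t

# Worked example from the text: overall 30%/50%/20% vs. group 20%/60%/20%
# gives a max deviation of 0.10, which passes for t = 0.1.
```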
"K-Map" (in this context, referring to a risk map based on k-anonymity principles) helps visualize how combinations of Quasi-Identifiers (QIs) distribute individuals across a dataset. It identifies "cells" or unique QI combinations that contain very few individuals, making those individuals potentially re-identifiable. This step is crucial for pre-anonymization risk assessment.
Select up to 3 QIs from the survey data to generate a risk map. The map will show unique combinations and the count of individuals in each. Cells below the 'Risk Threshold' are highlighted.
Before applying anonymization techniques like generalization or suppression, it was vital to understand the inherent re-identification risks in the raw data. I used a "k-map" approach (conceptually) to analyze the dataset by selecting various combinations of QIs (e.g., Department + Job Title + Country + Age Bracket).
The system would then compute how many unique combinations of these selected QIs existed and, crucially, how many individuals fell into each unique combination (each "cell" in the map).
Best Case Identification: This process would clearly highlight, for example, that while "Engineers in London" might be a large group, "Female Directors in the Berlin Sales office with a PhD" might only have 1 individual. This '1' is a high-risk cell. The visualization would flag all cells with counts below a defined minimum 'k' (the risk threshold). This directly informed which QIs or specific values needed the most aggressive generalization or where data might need to be suppressed to achieve the target k-anonymity across the board.
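A sketch of this cell-counting step, assuming the same survey DataFrame layout (the function name and threshold are illustrative):

```python
import pandas as pd

def risk_map(df: pd.DataFrame, qis: list, threshold: int = 3) -> pd.DataFrame:
    """Count individuals per unique QI combination; flag cells below threshold."""
    cells = df.groupby(qis, dropna=False).size().reset_index(name="count")
    cells["risky"] = cells["count"] < threshold
    return cells.sort_values("count")

# e.g. risk_map(survey_df, ["Department", "Job Title", "Country"], threshold=3)
# Risky cells (count < 3) show exactly where generalization or suppression
# is needed before release.
```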
This interactive demo lets you select QIs and see a simplified version of this risk map. Cells with counts below your chosen threshold are highlighted as risky.
Differential Privacy (DP) provides a strong, mathematical definition of privacy. It ensures that the outcome of any analysis is essentially the same, whether or not any single individual's data is included in the computation. This is typically achieved by adding a carefully calibrated amount of random noise to the true result of a query (e.g., a count, sum, or average).
The amount of noise is controlled by a "privacy budget" parameter, epsilon (ε). Smaller epsilon values mean more noise and thus stronger privacy guarantees, but potentially less accurate results.
This demo will calculate the average "Hours worked from home last week" from a sample. You can adjust epsilon to see its effect on the noise added to the true average.
| Participant ID | Hours From Home |
|---|---|
True Average: -
Differentially Private Average (True + Noise): -
Note: The DP Average will vary slightly each time you calculate due to random noise addition.
In my work, particularly when generating aggregate statistics for large groups or for public release from the 1000-participant survey, I designed systems to incorporate Differential Privacy. This was crucial for answering questions like "What is the average number of training hours completed by employees in the Sales department?" or "What proportion of employees use benefit X?"
Best Case Application: releasing aggregate statistics (counts, averages, proportions) for large groups or for public consumption, where the calibrated noise is small relative to the true value, yet the privacy guarantee holds regardless of an attacker's background knowledge.
This demo uses a simplified Laplace noise addition for an average. The "sensitivity" is assumed based on the range of data for simplicity.
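A minimal sketch of that simplified Laplace mechanism, assuming values are clamped to a known range [lower, upper], so the sensitivity of the average of n values is (upper - lower) / n:

```python
import numpy as np

def dp_average(values, epsilon: float, lower: float, upper: float) -> float:
    """Differentially private average via the Laplace mechanism (simplified)."""
    vals = np.clip(np.asarray(values, dtype=float), lower, upper)
    sensitivity = (upper - lower) / len(vals)  # max change from one record
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return vals.mean() + noise

# Smaller epsilon -> larger noise scale -> stronger privacy, lower accuracy.
# e.g. dp_average(hours_from_home, epsilon=0.5, lower=0, upper=60)
```

Because the noise is drawn fresh on each call, repeated runs return slightly different results, which is exactly the behavior the demo's "Calculate" button exhibits.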
Organizational hierarchy data ("who reports to whom") is a form of graph data. Individuals can sometimes be re-identified based on their unique position within this structure, such as having a unique number of direct reports or a very specific role in a small team. Graph-based privacy techniques aim to mitigate these risks.
This demo visualizes a fictional org chart and illustrates how structural properties (like the number of direct reports or specific roles) can be generalized to protect individuals. Note that the rich visuals of sites like 'theorg.com' rely on specialized JavaScript libraries for rendering and layout, which are beyond the scope of this self-contained demo.
The "who do you work for" and "who works for you" data from the 1000-participant survey formed a complex organizational graph. Simply removing names wasn't enough, as structural information could still identify people. For example: "the only VP with exactly 2 direct reports, one of whom is a Director with 7 reports."
My approach involved several layers. The first was generalizing structural properties: exact direct-report counts were replaced with coarse bands (e.g., '4-10 reports' rather than exactly 7), and unique roles were grouped into broader categories so that no node's position alone could single out a person.
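A hypothetical sketch of that report-count banding; the band boundaries and function names are illustrative:

```python
from collections import Counter

# Illustrative bands: coarse enough that "exactly 2 direct reports"
# can no longer single out one VP.
BANDS = [(1, 3, "1-3"), (4, 10, "4-10"), (11, float("inf"), "11+")]

def band(count: int) -> str:
    """Map an exact direct-report count to a coarse band label."""
    for lo, hi, label in BANDS:
        if lo <= count <= hi:
            return label
    return "0"

def generalize_report_counts(edges):
    """edges: iterable of (manager_id, report_id) pairs from the org graph."""
    counts = Counter(manager for manager, _ in edges)
    return {manager: band(n) for manager, n in counts.items()}

# e.g. generalize_report_counts([("VP1", "D1"), ("VP1", "D2"), ("D1", "E1")])
# -> {"VP1": "1-3", "D1": "1-3"}
```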