Table of Contents
- Data Collection, Storage, and Retention
- De-identification Strategies by Risk Level
- Securing De-identified Data and Preventing Re-linkage
- Storing Original Identifiable Data and Crosswalk Files
- Security Controls for Systems Storing PII
- Resources
Data Collection, Storage, and Retention
The primary strategy for mitigating sensitive data risk is data minimization: evaluating what is strictly necessary and reducing the volume, identifiability, and retention period of project data.
Scrutinize Collection
Before starting your research, audit every intended data field. If a variable is not essential to your analysis, do not collect it. Avoid gathering highly sensitive Personally Identifiable Information (PII) such as Social Security Numbers, financial data, or Protected Health Information (PHI) unless there is a compelling rationale for doing so.
When handling PHI, follow the HIPAA Safe Harbor method, which requires removing 18 specified categories of identifiers (for example names, contact details, and dates more specific than the year) to prevent the re-identification of participants.
Note: Unauthorized exposure of this data constitutes a breach, carrying severe financial and reputational consequences for the University.
Purge Metadata and “Junk Data”
Regularly review datasets for unintentionally captured information that could compromise anonymity. This includes:
- Digital Identifiers: Timestamps, IP addresses, and specific dates.
- Socio-Demographic Outliers: Categories with small sample sizes that could lead to “deductive identification.”
- Accidental PII: Incidental personally identifiable information entered by participants (e.g., in open-text fields).
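A review for accidental PII in open-text fields can be partly automated. The following is a minimal sketch, assuming responses are available as plain strings; the pattern set and the function names `find_pii` and `flag_responses` are illustrative assumptions, not part of any University tool. It only catches a few obvious formats, so flagged and unflagged responses alike may still need manual review.

```python
import re

# Illustrative patterns only (assumption): real PII takes many more forms.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_pii(text):
    """Return a dict mapping each pattern name to the matches found in text."""
    hits = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[name] = matches
    return hits


def flag_responses(responses):
    """Return (index, hits) pairs for responses that contain a likely identifier."""
    flagged = []
    for index, text in enumerate(responses):
        hits = find_pii(text)
        if hits:
            flagged.append((index, hits))
    return flagged


# Hypothetical survey responses used only to demonstrate the scan.
sample = [
    "The workshop was helpful.",
    "Contact me at jane.doe@example.org for follow-up.",
    "My SSN is 123-45-6789, please update my record.",
]
for index, hits in flag_responses(sample):
    print(index, hits)
```

A scan like this is a screening aid rather than a guarantee: names, addresses, and free-form descriptions will not match simple patterns.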
Define and Enforce Retention Periods
Data risk grows with time: the longer data is held, the greater the chance of exposure.
- Establish a clear timeline for how long data will be kept post-analysis.
- Delete identifiable data once it is no longer required.
- Conduct periodic audits to ensure stored data complies with your retention policy.
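The periodic audit described above can be sketched as a small check over a dataset inventory. This is a minimal illustration, assuming the inventory is a list of dictionaries; the field names (`name`, `analysis_completed`, `identifiable`) and the 365-day retention period are hypothetical and should be replaced with whatever your approved retention policy specifies.

```python
from datetime import date, timedelta


def overdue_datasets(inventory, today, retention_days):
    """Return names of identifiable datasets held past the retention period.

    Each inventory entry is a dict with the hypothetical keys 'name',
    'analysis_completed' (a date, or None if analysis is ongoing), and
    'identifiable' (True if the dataset still contains identifiers).
    """
    cutoff = today - timedelta(days=retention_days)
    overdue = []
    for entry in inventory:
        completed = entry["analysis_completed"]
        if completed is None:
            continue  # analysis still in progress, retention clock not started
        if entry["identifiable"] and completed < cutoff:
            overdue.append(entry["name"])
    return overdue


# Hypothetical inventory used only to demonstrate the check.
inventory = [
    {"name": "survey_2023_raw", "analysis_completed": date(2023, 6, 1), "identifiable": True},
    {"name": "survey_2024_raw", "analysis_completed": date(2024, 9, 1), "identifiable": True},
    {"name": "interviews_ongoing", "analysis_completed": None, "identifiable": True},
    {"name": "survey_2022_deidentified", "analysis_completed": date(2022, 3, 1), "identifiable": False},
]
print(overdue_datasets(inventory, today=date(2025, 1, 1), retention_days=365))
```

Running a check like this on a schedule turns the retention policy into a recurring task instead of a one-time decision.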
Maintain Data Oversight
Maintain an accurate, up-to-date inventory of where all research data is stored. Establish a “transfer of custody” process to ensure that datasets are not orphaned as team members or student researchers depart. Data should always be associated with a current, responsible lead to prevent security gaps.
De-identification Strategies by Risk Level
Following collection, prioritize the immediate removal or replacement of PII. De-identifying data breaks the link between sensitive variables and individual identities. Depending on your research requirements, you may apply multiple methods in tandem.
| Strategy | Risk Level | Definition/Process |
|---|---|---|
| Anonymization | Low | The permanent, irreversible destruction of identifiers. Fully remove all direct and indirect identifiers so the individual can no longer be identified by any means. |
| Pseudonymization | Medium | Replacing direct identifiers with a “key” or artificial code (e.g., UID-12345). Use a pseudo-random number generator to create unique IDs. Never use sensitive data (like SSNs) to derive these keys. The risk increases significantly if the “key” (the link between the code and the identity) is stored insecurely, so store the key separately from the research dataset in an encrypted, access-restricted location. |
| Generalization & K-Anonymity | Medium | Reducing the precision of data to make individuals harder to distinguish within a group. Convert specific values into broader ranges (e.g., recording “Age: 20–30” instead of “Age: 25”). A dataset is k-anonymous when every combination of quasi-identifiers is shared by at least k records. Re-identification is still possible if the “pool” or sample size of a specific demographic is too small (socio-demographic outliers), so use the coarsest level of detail your analysis can tolerate. |
| Redaction & Masking | High | Obscuring specific portions of a data field (e.g., XXX-XX-1234) or applying digital “black boxes”. Masking is often superficial and can be reversed (e.g., removing a shape layer in a PDF) or re-linked via quasi-identifiers in the surrounding text. Use only for visual presentation, not as a primary method for securing raw datasets. |
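Pseudonymization and generalization from the table above can be combined in a few lines. The sketch below is illustrative only: the function names, the `UID-` prefix format, and the record layout are assumptions. It uses Python's `secrets` module, a cryptographically secure generator, in place of a plain pseudo-random generator so that codes are not predictable, and it never derives a code from the identifier itself. Age bands are written as non-overlapping ranges (20-29, 30-39) so that each age falls into exactly one band.

```python
import secrets


def build_crosswalk(identifiers, prefix="UID-"):
    """Map each direct identifier to a random code unrelated to its value."""
    crosswalk = {}
    used = set()
    for identifier in identifiers:
        if identifier in crosswalk:
            continue  # the same person keeps the same code
        code = prefix + secrets.token_hex(8)
        while code in used:  # collisions are very unlikely, but check anyway
            code = prefix + secrets.token_hex(8)
        used.add(code)
        crosswalk[identifier] = code
    return crosswalk


def generalize_age(age, width=10):
    """Replace an exact age with a range label such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def pseudonymize(records, id_field, age_field):
    """Return (de-identified records, crosswalk).

    The crosswalk is the re-identification key and should be stored
    separately from the de-identified records.
    """
    crosswalk = build_crosswalk(record[id_field] for record in records)
    cleaned = []
    for record in records:
        row = dict(record)  # copy so the original record is not modified
        row[id_field] = crosswalk[record[id_field]]
        row[age_field] = generalize_age(record[age_field])
        cleaned.append(row)
    return cleaned, crosswalk


# Hypothetical records used only to demonstrate the transformation.
records = [
    {"name": "Participant A", "age": 25, "score": 4},
    {"name": "Participant B", "age": 31, "score": 5},
    {"name": "Participant A", "age": 25, "score": 3},
]
cleaned, crosswalk = pseudonymize(records, id_field="name", age_field="age")
print(cleaned)
```

Returning the crosswalk as a separate object makes it natural to write it to a different, encrypted location than the research dataset.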
Securing De-identified Data and Preventing Re-linkage
After performing de-identification, implement these safeguards to ensure data cannot be re-linked or de-anonymized by unauthorized parties.
- Implement Separation Architecture: Never store de-identified datasets in the same environment as their “re-identification keys” (e.g., crosswalk files, salts, mapping tables, or tokens). Maintain a physical or logical barrier between the data and the keys to ensure that a breach of one does not compromise the other.
- Sanitize File Paths and Metadata: Folder hierarchies and filenames can inadvertently leak context. For example, storing a de-identified file in /Oahu/Honolulu/participant_data.csv reveals the geographic origin of the subjects. Use generic or coded naming conventions for directories and files.
- Prevent Cross-Dataset Linkage: Individuals can often be re-identified by cross-referencing multiple “anonymized” datasets. For example, a pseudonymized patient ID in one file might be linked to another dataset containing the same birthdate or ZIP code. Use unique, non-overlapping keys for different projects and scrutinize how external datasets could be combined with your own.
- Mitigate Quasi-Identifiers and Outliers: Data points that seem anonymous can become identifiable if the sample size is small. A “socio-demographic outlier” (e.g., a specific ethnicity in a small town) acts as a fingerprint. Review datasets for rare attributes and apply further generalization (e.g., grouping data into broader categories) to protect participant privacy.
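The review for rare attribute combinations can be approximated with a simple group-size count. The following sketch assumes records are dictionaries and that you have already decided which fields count as quasi-identifiers; the function name `small_groups`, the field names, and the threshold `k=3` are illustrative assumptions, not recommended values.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k):
    """Return quasi-identifier combinations shared by fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical de-identified rows used only to demonstrate the check.
rows = [
    {"zip": "96822", "age_band": "20-29", "score": 4},
    {"zip": "96822", "age_band": "20-29", "score": 5},
    {"zip": "96822", "age_band": "20-29", "score": 3},
    {"zip": "96813", "age_band": "60-69", "score": 2},
]
print(small_groups(rows, ["zip", "age_band"], k=3))
# → {('96813', '60-69'): 1}
```

Any combination this returns is a candidate for further generalization, such as widening the age band or truncating the ZIP code, before the data is shared.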
Storing Original Identifiable Data and Crosswalk Files
If original identifiers or “crosswalk” files (used to re-link data) must be retained, they require the highest level of security and access control. You must minimize the number of storage locations and, where possible, maintain these files in an offline environment. All digital PII and mapping files must be encrypted, whether stored online or offline.
Storage Options and Requirements
- Physical Media (Paper): Keep printed surveys or identifiable materials in a locked safe or high-security cabinet. Access must be strictly limited and logged.
- Removable Digital Media (External Drives/USBs): Use Full Disk Encryption (FDE) or containerized encryption for all devices. When not actively in use, these devices must be stored in a secured physical location (e.g., a locked safe).
- Live Networked Systems: Ideally, keep systems containing PII offline. If network connectivity is required:
  - Network Restriction: Limit server access to specific, authorized hosts.
  - Account Separation: Use unique credentials for the crosswalk database. Do not share administrative rights or accounts between identifiable and de-identified environments.
If you manage a system that stores, processes, or transmits PII, you must ensure it meets the Security Controls For Systems Storing Research Data with PII.
Security Controls for Systems Storing PII
Technical Configuration Requirements
| Category | Requirement |
|---|---|
| Patching | Apply security patches monthly; use automated patching where available. No End-of-Life (EoL) devices are allowed (End-of-Life refers to hardware or software that no longer receives security fixes from its manufacturer). |
| Authentication | Passwords must be at least 14 characters long and meet complexity requirements. Multi-factor authentication (MFA) is required for remote and privileged access. |
| Endpoint Security | SentinelOne must be installed on all devices. |
| Data Discovery and Management | Perform monthly Spirion scans for unknown PII. |
| Encryption | Use strong encryption (e.g., AES-256) for volumes. Use Full Disk Encryption (FDE) (BitLocker/FileVault) for physical media and container encryption (Cryptomator/VeraCrypt) for files. For databases, implement Transparent Data Encryption (TDE) and Column-Level Encryption for sensitive fields. |
| Vulnerability Management | Perform weekly vulnerability scans with an approved vulnerability scanner. Remediate vulnerabilities based on severity: Critical (15 days), High (30 days), and Medium (90 days). |
| System Hardening | Implement CIS Level 2 Benchmarks at the OS, database, and application levels. |
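| Backups | Maintain encrypted backups of data. Follow the 3-2-1-0 rule for effective backups (commonly stated as three copies of the data, on two different types of media, with one copy off-site and zero errors when restores are verified). |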
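The remediation windows in the Vulnerability Management row (Critical 15 days, High 30 days, Medium 90 days) translate directly into due dates. The sketch below is one way to track them, assuming findings are exported as dictionaries; the field names `id`, `severity`, and `detected_on` are hypothetical and do not correspond to any particular scanner's output format.

```python
from datetime import date, timedelta

# Remediation windows in days, taken from the Vulnerability Management row.
REMEDIATION_DAYS = {"critical": 15, "high": 30, "medium": 90}


def remediation_due(severity, detected_on):
    """Return the remediation due date, or None if no window is defined."""
    days = REMEDIATION_DAYS.get(severity.lower())
    if days is None:
        return None  # the table sets no deadline for this severity
    return detected_on + timedelta(days=days)


def overdue_findings(findings, today):
    """Return the ids of findings whose remediation window has passed."""
    overdue = []
    for finding in findings:
        due = remediation_due(finding["severity"], finding["detected_on"])
        if due is not None and today > due:
            overdue.append(finding["id"])
    return overdue


# Hypothetical scan findings used only to demonstrate the calculation.
findings = [
    {"id": "F-1", "severity": "Critical", "detected_on": date(2025, 3, 1)},
    {"id": "F-2", "severity": "High", "detected_on": date(2025, 3, 1)},
    {"id": "F-3", "severity": "Low", "detected_on": date(2025, 1, 1)},
]
print(remediation_due("Critical", date(2025, 3, 1)))
# → 2025-03-16
print(overdue_findings(findings, today=date(2025, 3, 20)))
# → ['F-1']
```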
Access Control and User Management Requirements
| Category | Requirement |
|---|---|
| Access Control Lists (ACLs) | Manage user permissions to sensitive information. Document every account, who owns it, and its specific access level (View, Edit, Delete). Document which accounts have administrative access to systems. |
| Principle of Least Privilege | Grant only the minimum data access required for a specific task. Use secure analysis environments that prevent downloads if possible. |
| Need-to-Know & Timely Revocation | Grant access per-study and revoke it immediately upon role changes, project completion, or University departure. |
Network Architecture Requirements
| Category | Requirement |
|---|---|
| Logical Separation | Isolate systems containing research data from the general internet and the broader UH network. |
| Subnetting | Keep untrusted devices (printers, IoT) on separate subnets from sensitive data servers. |
| Network Boundaries | Use a firewall with a “Default Deny-all” rule. Log and monitor all database queries for security events such as high-volume extraction or repeated targeting of individual records. |
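Monitoring query logs for the two event types named in the Network Boundaries row can start with simple per-user counts. This is a minimal sketch, assuming log entries have already been parsed into dictionaries with the hypothetical keys `user` and `rows_returned`; the thresholds are placeholders, and treating a one-row result as a single-record lookup is a simplifying assumption.

```python
from collections import defaultdict


def flag_suspicious_users(query_log, max_rows=10_000, max_single_lookups=20):
    """Return {user: [reasons]} for users whose query pattern exceeds a threshold."""
    rows_by_user = defaultdict(int)
    lookups_by_user = defaultdict(int)
    for entry in query_log:
        rows_by_user[entry["user"]] += entry["rows_returned"]
        if entry["rows_returned"] == 1:
            # Assumption: a one-row result indicates a lookup of one individual.
            lookups_by_user[entry["user"]] += 1

    flagged = {}
    for user, total_rows in rows_by_user.items():
        reasons = []
        if total_rows > max_rows:
            reasons.append("high volume")
        if lookups_by_user[user] > max_single_lookups:
            reasons.append("repeated single-record lookups")
        if reasons:
            flagged[user] = reasons
    return flagged


# Hypothetical parsed log entries used only to demonstrate the check.
query_log = (
    [{"user": "analyst1", "rows_returned": 500}] * 3
    + [{"user": "analyst2", "rows_returned": 1}] * 25
    + [{"user": "analyst3", "rows_returned": 12_000}]
)
print(flag_suspicious_users(query_log))
```

Fixed thresholds are a starting point; in practice they would need tuning against each system's normal query volume.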
External Partnership Management and Security Requirements
| Category | Requirement |
|---|---|
| Data Use Agreements (DUA) | Ensure signed agreements are in place for all third-party collaborators. |
| Application Audits | Verify that any third-party app integrating with your data meets the same security standards as your internal systems. |