Protecting Research Data

Data Collection, Storage, and Retention

The primary strategy for mitigating sensitive data risk is data minimization: evaluating what is strictly necessary and reducing the volume, identifiability, and retention period of project data.

  1. Scrutinize Collection

    Before starting your research, audit every intended data field. If a variable is not essential to your analysis, do not collect it. Avoid gathering highly sensitive Personally Identifiable Information (PII) such as Social Security Numbers, financial data, or Protected Health Information (PHI) unless there is a compelling rationale for doing so.

    When handling PHI, follow the HIPAA Safe Harbor method. This requires removing 18 specific identifiers to prevent the re-identification of participants.

    Note: Unauthorized exposure of this data constitutes a breach, carrying severe financial and reputational consequences for the University.

  2. Purge Metadata and “Junk Data”

    Regularly review datasets for unintentionally captured information that could compromise anonymity. This includes:

    • Digital Identifiers: Timestamps, IP addresses, and specific dates.
    • Socio-Demographic Outliers: Categories with small sample sizes that could lead to “deductive identification.”
    • Accidental PII: Incidental personally identifiable information entered by participants (e.g., in open-text fields).
  3. Define and Enforce Retention Periods

    Data risk is a function of time: the longer data is held, the higher the risk.

    • Establish a clear timeline for how long data will be kept post-analysis.
    • Delete identifiable data once it is no longer required.
    • Conduct periodic audits to ensure stored data complies with your retention policy.
  4. Maintain Data Oversight

    Maintain an accurate, up-to-date inventory of where all research data is stored. Establish a “transfer of custody” process to ensure that datasets are not orphaned as team members or student researchers depart. Data should always be associated with a current, responsible lead to prevent security gaps.
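The retention audit and inventory described above could be partly automated. The following is a minimal sketch, assuming a hypothetical inventory in which each dataset records its responsible lead, the date analysis finished, and an agreed retention period in days. The `DatasetRecord` fields and the sample values are illustrative assumptions, not part of any University system.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class DatasetRecord:
    # Hypothetical inventory entry; field names are illustrative.
    name: str
    owner: str                 # current responsible lead
    analysis_completed: date   # when analysis finished
    retention_days: int        # retention period agreed for the project
    identifiable: bool         # True if the dataset still contains PII


def overdue_for_deletion(inventory, today):
    """Return identifiable datasets whose retention period has expired."""
    overdue = []
    for record in inventory:
        expires = record.analysis_completed + timedelta(days=record.retention_days)
        if record.identifiable and today > expires:
            overdue.append(record)
    return overdue


inventory = [
    DatasetRecord("survey_raw", "lead_a", date(2020, 1, 15), 365, True),
    DatasetRecord("survey_deidentified", "lead_a", date(2020, 1, 15), 365, False),
    DatasetRecord("interviews_raw", "lead_b", date(2024, 6, 1), 730, True),
]

flagged = overdue_for_deletion(inventory, today=date(2025, 1, 1))
print([record.name for record in flagged])
```

Here only `survey_raw` is flagged: it is identifiable and its retention window has passed. A real audit would also need to confirm that each `owner` is still an active project member.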

De-identification Strategies by Risk Level

Following collection, prioritize the immediate removal or replacement of PII. De-identifying data breaks the link between sensitive variables and individual identities. Depending on your research requirements, you may apply multiple methods in tandem.

| Strategy | Risk Level | Definition/Process |
| --- | --- | --- |
| Anonymization | Low | The permanent, irreversible destruction of identifiers. Fully remove all direct and indirect identifiers so the individual can no longer be identified by any means. |
| Pseudonymization | Medium | Replacing direct identifiers with a “key” or artificial code (e.g., UID-12345). Use a pseudo-random number generator to create unique IDs. Never use sensitive data (like SSNs) to derive these keys. The risk increases significantly if the “key” (the link between the code and the identity) is stored insecurely, so store the key separately from the research dataset in an encrypted, access-restricted location. |
| Generalization & K-Anonymity | Medium | Reducing the precision of data to make individuals harder to distinguish within a group. Convert specific values into broader ranges (e.g., recording “Age: 20–30” instead of “Age: 25”). Re-identification is still possible if the “pool” or sample size of a specific demographic is too small (socio-demographic outliers), so use the coarsest level of granularity your analysis can tolerate. |
| Redaction & Masking | High | Obscuring specific portions of a data field (e.g., XXX-XX-1234) or applying digital “black boxes”. Masking is often superficial and can be reversed (e.g., removing a shape layer in a PDF) or re-linked via quasi-identifiers in the surrounding text. Use only for visual presentation, not as a primary method for securing raw datasets. |
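Pseudonymization and generalization, as described in the table above, can be sketched in a few lines. This example is illustrative only: it uses Python's `secrets` module (a cryptographically secure generator) as one reasonable way to produce random IDs, assumes one record per participant, and uses non-overlapping ten-year age bins. The function names and sample records are hypothetical.

```python
import secrets


def new_pseudonym(existing):
    """Generate a random ID that is not derived from any participant data."""
    while True:
        candidate = "UID-" + secrets.token_hex(8)
        if candidate not in existing:   # avoid the (unlikely) collision
            return candidate


def pseudonymize(records, id_field):
    """Split records into a de-identified dataset and a separate crosswalk.

    Assumes each record belongs to a different participant.
    """
    crosswalk = {}      # pseudonym -> original identifier
    deidentified = []
    for record in records:
        pseudonym = new_pseudonym(crosswalk)
        crosswalk[pseudonym] = record[id_field]
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["uid"] = pseudonym
        deidentified.append(cleaned)
    return deidentified, crosswalk


def generalize_age(age, width=10):
    """Replace an exact age with a non-overlapping range, e.g. 25 -> '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


records = [
    {"name": "Participant A", "age": 25},
    {"name": "Participant B", "age": 41},
]
deidentified, crosswalk = pseudonymize(records, id_field="name")
for row in deidentified:
    row["age"] = generalize_age(row["age"])
```

The `crosswalk` dictionary is the re-identification key: in practice it would be written to an encrypted, access-restricted location that is separate from wherever `deidentified` is stored.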

Securing De-identified Data and Preventing Re-linkage

After performing de-identification, implement these safeguards to ensure data cannot be re-linked or de-anonymized by unauthorized parties.

  • Implement Separation Architecture: Never store de-identified datasets in the same environment as their “re-identification keys” (e.g., crosswalk files, salts, mapping tables, or tokens). Maintain a physical or logical barrier between the data and the keys to ensure that a breach of one does not compromise the other.
  • Sanitize File Paths and Metadata: Folder hierarchies and filenames can inadvertently leak context. For example, storing a de-identified file in /Oahu/Honolulu/participant_data.csv reveals the geographic origin of the subjects. Use generic or coded naming conventions for directories and files.
  • Prevent Cross-Dataset Linkage: Individuals can often be re-identified by cross-referencing multiple “anonymized” datasets. For example, a pseudonymized patient ID in one file might be linked to another dataset containing the same birthdate or ZIP code. Use unique, non-overlapping keys for different projects and scrutinize how external datasets could be combined with your own.
  • Mitigate Quasi-Identifiers and Outliers: Data points that seem anonymous can become identifiable if the sample size is small. A “socio-demographic outlier” (e.g., a specific ethnicity in a small town) acts as a fingerprint. Review datasets for rare attributes and apply further generalization (e.g., grouping data into broader categories) to protect participant privacy.
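One way to review a dataset for rare attribute combinations is a simple k-anonymity check: count how many rows share each combination of quasi-identifiers and flag any combination held by fewer than k rows. The sketch below is a minimal illustration; the column names, sample rows, and the threshold `k=3` are assumptions chosen for the example, not recommended values.

```python
from collections import Counter


def small_groups(rows, quasi_identifiers, k):
    """Return quasi-identifier combinations shared by fewer than k rows."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}


rows = [
    {"age_range": "20-29", "region": "A"},
    {"age_range": "20-29", "region": "A"},
    {"age_range": "20-29", "region": "A"},
    {"age_range": "60-69", "region": "B"},
]

risky = small_groups(rows, ["age_range", "region"], k=3)
print(risky)
```

In this sample the single row with `("60-69", "B")` is flagged, which suggests that group needs further generalization (for example, a wider age range or a merged region) before the data is shared.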

Storing Original Identifiable Data and Crosswalk Files

If original identifiers or “crosswalk” files (used to re-link data) must be retained, they require the highest level of security and access control. You must minimize the number of storage locations and, where possible, maintain these files in an offline environment. All digital PII and mapping files must be encrypted, whether stored online or offline.

Storage Options and Requirements

  1. Physical Media (Paper): Keep printed surveys or identifiable materials in a locked safe or high-security cabinet. Access must be strictly limited and logged.
  2. Removable Digital Media (External Drives/USBs): Use Full Disk Encryption (FDE) or containerized encryption for all devices. When not actively in use, these devices must be stored in a secured physical location (e.g., a locked safe).
  3. Live Networked Systems: Ideally, keep systems containing PII offline. If network connectivity is required:

    • Network Restriction: Limit server access to specific, authorized hosts.
    • Account Separation: Use unique credentials for the crosswalk database. Do not share administrative rights or accounts between identifiable and de-identified environments.


If you manage a system that stores, processes, or transmits PII, you must ensure that it meets the Security Controls For Systems Storing Research Data with PII.


Security Controls for Systems Storing PII

Technical Configuration Requirements

| Category | Requirement |
| --- | --- |
| Patching | Apply security patches monthly; use automated patching where available. No End-of-Life (EoL) devices are allowed (End-of-Life refers to hardware or software that no longer receives security fixes from its manufacturer). |
| Authentication | Passwords must be 14+ characters with complexity. MFA is required for remote and privileged access. |
| Endpoint Security | SentinelOne must be installed on all devices. |
| Data Discovery and Management | Perform monthly Spirion scans for unknown PII. |
| Encryption | Use strong encryption (i.e., AES-256) for volumes. Use Full Disk Encryption (FDE) (BitLocker/FileVault) for physical media and container encryption (Cryptomator/VeraCrypt) for files. For databases, implement Transparent Data Encryption (TDE) and Column-Level Encryption for sensitive fields. |
| Vulnerability Management | Perform weekly vulnerability scans with an approved vulnerability scanner. Remediate vulnerabilities based on severity: Critical (15 days), High (30 days), and Medium (90 days). |
| System Hardening | Implement CIS Level 2 Benchmarks at the OS, database, and application levels. |
| Backups | Maintain encrypted backups of data. Follow the 3-2-1-0 rule for effective backups: |

  • 3 Copies of data: Maintain the primary source plus two encrypted backups
  • 2 Different media types: Use at least 2 distinct formats (e.g., local disk and a secure cloud)
  • 1 Offline/Off-site Copy: Keep one copy offline and immutable to protect against ransomware or a physical disaster
  • 0 Errors: Regularly perform recovery testing to ensure the integrity of the restoration process.
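The “0 Errors” step of the backup rule can be supported by comparing checksums of the original data and a test restore. The following is a minimal sketch using SHA-256 from Python's standard library; the helper names and the temporary files are hypothetical and stand in for a real dataset and a real restored copy.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original, restored):
    """Return True if the restored file is byte-for-byte identical."""
    return sha256_of(original) == sha256_of(restored)


# Demonstration with temporary files standing in for real data.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "dataset.bin"
    good_restore = Path(tmp) / "restore_ok.bin"
    bad_restore = Path(tmp) / "restore_bad.bin"
    original.write_bytes(b"example payload")
    good_restore.write_bytes(b"example payload")
    bad_restore.write_bytes(b"example payl0ad")   # simulated corruption

    ok = restore_matches(original, good_restore)
    corrupted = restore_matches(original, bad_restore)

print(ok, corrupted)
```

A checksum match only shows that the bytes survived the round trip; a full recovery test would also confirm that the restored data can be decrypted and opened by the tools that depend on it.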

Access Control and User Management Requirements

| Category | Requirement |
| --- | --- |
| Manage user permissions to sensitive information [Maintain Access Control Lists (ACLs)] | Document every account, who owns the account, and its specific access level (View, Edit, Delete). Document accounts with administrative access to systems. |
| Principle of Least Privilege | Grant only the minimum data access required for a specific task. Use secure analysis environments that prevent downloads if possible. |
| Need-to-Know & Timely Revocation | Grant access per-study and revoke it immediately upon role changes, project completion, or University departure. |
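A documented ACL makes timely revocation easier to check. The sketch below assumes a hypothetical ACL record that stores the account, its owner, the study it was granted for, and the access level, and compares it against a list of active members and open studies. All names and values are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclEntry:
    account: str   # login or service account name
    owner: str     # person responsible for the account
    study: str     # study the access was granted for
    level: str     # "View", "Edit", or "Delete"


def entries_to_revoke(acl, active_members, open_studies):
    """Return ACL entries whose owner has left or whose study has closed."""
    return [
        entry for entry in acl
        if entry.owner not in active_members or entry.study not in open_studies
    ]


acl = [
    AclEntry("jdoe", "J. Doe", "study-01", "Edit"),
    AclEntry("asmith", "A. Smith", "study-01", "View"),
    AclEntry("jdoe-old", "J. Doe", "study-00", "Delete"),
]

stale = entries_to_revoke(
    acl,
    active_members={"J. Doe"},
    open_studies={"study-01"},
)
print([entry.account for entry in stale])
```

In this example `asmith` is flagged because the owner is no longer active, and `jdoe-old` because its study has closed. Role changes within an active project would need an additional field that this sketch does not model.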

Network Architecture Requirements

| Category | Requirement |
| --- | --- |
| Logical Separation | Isolate systems containing research data from the general internet and the broader UH network. |
| Subnetting | Keep untrusted devices (printers, IoT) on separate subnets from sensitive data servers. |
| Network Boundaries | Use a firewall with a “Default Deny-all” rule. Log and monitor all database queries for security events such as high-volume or granular targeting of individual records. |
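Monitoring query logs for high-volume access can start with a simple per-account total. The sketch below assumes a hypothetical log format in which each event records the account and the number of rows returned; the threshold of 5000 rows is an arbitrary value for illustration, and the check covers only the high-volume case, not targeted lookups of individual records.

```python
from collections import defaultdict


def flag_high_volume(query_log, max_rows_per_account):
    """Return accounts whose total returned rows exceed the threshold."""
    totals = defaultdict(int)
    for event in query_log:
        totals[event["account"]] += event["rows_returned"]
    return sorted(
        account for account, total in totals.items()
        if total > max_rows_per_account
    )


query_log = [
    {"account": "analyst1", "rows_returned": 120},
    {"account": "analyst2", "rows_returned": 40},
    {"account": "analyst1", "rows_returned": 9500},
]

suspicious = flag_high_volume(query_log, max_rows_per_account=5000)
print(suspicious)
```

In practice the threshold would be tuned to normal usage for each study, and flagged accounts would be reviewed rather than blocked automatically.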

External Partnership Management and Security Requirements

| Category | Requirement |
| --- | --- |
| Data Use Agreements (DUA) | Ensure signed agreements are in place for all third-party collaborators. |
| Application Audits | Verify that any third-party app integrating with your data meets the same security standards as your internal systems. |

Resources