Installing and Using Find_SSNs on Linux and Solaris
This article is intended for faculty and staff who may have certain types of PI (personal information) stored on a computer running the Linux or Solaris operating system.
Find_SSNs is a piece of software written in Python at Virginia Tech that searches a computer's files for Social Security numbers and credit card numbers. It requires Python version 2.4+ to run. By default, Find_SSNs searches the following file types: doc, docx, xlsx, xls, rtf, zip, text files (e.g. html, xml, txt) and Open Office 2 documents. It can additionally search pdf files when the pdftotext binary, which is part of the poppler package, is installed. We provide two versions of Find_SSNs: one that searches pdfs and another that doesn't (in case you can't install the poppler package). The instructions below include the steps needed to install the poppler package.
The Find_SSNs software webpage at Virginia Polytechnic Institute is located here: https://security.vt.edu/software/Find_SSNs.html
The full Find_SSNs documentation at Virginia Polytechnic Institute is located here: https://security.vt.edu/software/Find_SSNs/find_ssns_referance_manual.html
Note: While these install steps should work on any modern Linux distro, we've only verified that they work on RHEL5, RHEL6 and Ubuntu 11.04.
The requirements to run Find_SSNs on Linux are:
- Python 2.4+
- pdftotext binary, which is part of the poppler-utils package on both RHEL and Ubuntu.
Note: A default install of RHEL or Ubuntu comes with Python installed.
- Grab a copy of Find_SSNs here: http://www.hawaii.edu/its/docs/find_ssns.tar
- Extract Find_SSNs to the root user's home directory, or to another location that only root can access.
- Before you run Find_SSNs you need to have the poppler-utils package installed.
On RHEL5 install it with this command:
yum install poppler-utils
On Ubuntu install it with this command:
apt-get install poppler-utils
Note: If for some reason you can't install poppler-utils to scan pdf files, you can grab a copy of Find_SSNs with pdf searching turned off: http://www.hawaii.edu/its/docs/find_ssns_nopdf.tar
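The extraction step above asks for a location that only root can access. As a minimal sketch of that step, the Python 3 helper below creates a directory that only its owner can read and unpacks a tar archive into it. It is our own illustration and not part of Find_SSNs; the function name extract_private is hypothetical, and it is written for Python 3 rather than the Python 2.4 minimum that Find_SSNs itself needs. Only extract archives that you downloaded from a source you trust.

```python
import os
import tarfile


def extract_private(archive_path, dest_dir):
    """Extract a tar archive into a directory only its owner can access."""
    # Create the destination directory if it does not exist yet.
    os.makedirs(dest_dir, exist_ok=True)
    # Set owner-only permissions (rwx------) explicitly, because makedirs
    # is subject to the umask and does not change an existing directory.
    os.chmod(dest_dir, 0o700)
    with tarfile.open(archive_path) as archive:
        archive.extractall(dest_dir)
    return dest_dir
```

Run as root, a call such as extract_private("find_ssns.tar", "/root/find_ssns") would unpack the download into a directory that other users cannot enter. The destination path here is only an example.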
The requirements to run Find_SSNs on Solaris are:
- Python 2.4+ (part of a standard Solaris 10 install)
- pdftotext binary, which is part of the poppler package (only if you need to search pdfs)
The poppler package is not part of the default install on Solaris, so it and its dependencies need to be installed from third-party packages. To ease the process, we've created a tar download that bundles the poppler package and its dependencies together with an install script that automates the installation. The script checks whether each package is already installed on the system and installs only the missing ones, placing them in the /usr/local directory structure.
- Download the Solaris poppler install bundle here: http://www.hawaii.edu/its/docs/poppler_install.tar.gz
- Extract it and run 'install_poppler.sh', which installs poppler and any of its dependencies that are not already installed.
- Download the Find_SSNs package here: http://www.hawaii.edu/its/docs/find_ssns.tar
- Extract the Find_SSNs package.
Note: If you cannot install poppler and its dependencies to search pdfs, you can download a copy of Find_SSNs with pdf support turned off: http://www.hawaii.edu/its/docs/find_ssns_nopdf.tar
Scanning your filesystem(s) for files that contain SSNs or credit card (CC) numbers works the same way on all Unix/Linux boxes.
Note: Find_SSNs uses a few innovative methods to reduce false positives, but it *will* still find some false positives when it scans your computer.
We've found that the best way to reduce the number of false positives is to scan only the locations on the server that could hold personal information, for example /home or /fileshare.
We've included the false positives that Find_SSNs finds on a full scan of a default install of RHEL5 and Solaris 10 in the Find_SSNs packages in the directory named "default_false_positives".
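To illustrate the kind of filtering that cuts down false positives, here is a minimal Python 3 sketch that keeps only numbers that could be real SSNs. It is not the Find_SSNs implementation: Find_SSNs checks candidates against SSN pattern data that it downloads at startup, while this sketch only applies fixed rules (area numbers 000, 666 and 900-999, group 00 and serial 0000 are never issued). The function name plausible_ssns and the restriction to the dashed NNN-NN-NNNN format are assumptions made for the example.

```python
import re

# Candidate pattern: three digits, two digits, four digits, dash-separated.
SSN_PATTERN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")


def plausible_ssns(text):
    """Return dashed SSN-like numbers in `text` that pass basic validity rules."""
    found = []
    for match in SSN_PATTERN.finditer(text):
        area, group, serial = match.groups()
        # Area numbers 000, 666 and 900-999 are never issued.
        if area in ("000", "666") or int(area) >= 900:
            continue
        # Group 00 and serial 0000 are never issued either.
        if group == "00" or serial == "0000":
            continue
        found.append(match.group(0))
    return found
```

For example, plausible_ssns("id 123-45-6789, ref 000-12-3456") returns ["123-45-6789"], because the second number uses an area that is never issued. A number that passes these rules can still be a false positive, which is why the scan results need a manual review.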
Find_SSNs needs the following in order to run successfully:
- An active Internet connection. At program startup, Find_SSNs contacts a Virginia Tech webserver to pull down the latest SSN patterns. This feature greatly reduces the number of false positives, since Find_SSNs only flags number patterns that could possibly make a real SSN.
- Python interpreter in your $PATH environment variable.
- pdftotext binary in your $PATH environment variable (unless you are using the version of Find_SSNs with pdf support disabled).
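Before starting a long scan, it can help to confirm that the required programs are reachable through $PATH. The following is a minimal sketch of such a check, assuming Python 3.3 or later (for shutil.which); the function name missing_tools is hypothetical and the check is not part of Find_SSNs.

```python
import shutil


def missing_tools(tools=("python", "pdftotext")):
    """Return the names from `tools` that cannot be found on $PATH."""
    return [name for name in tools if shutil.which(name) is None]
```

An empty list means every listed program was found. If you are using the version of Find_SSNs with pdf support disabled, call missing_tools(("python",)) instead so that a missing pdftotext is not reported.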
Note: For the full documentation on Find_SSNs, please refer to the Find_SSNs official documentation, located here: https://security.vt.edu/software/Find_SSNs/find_ssns_referance_manual.html.
A basic usage scenario:
To scan your whole computer for SSNs and CC numbers, use this command:
python Find_SSNs.pyw -p / -o /root/find_ssns/ -t csv -a
- '-p' sets the starting path for the scan.
- '-o' sets the directory where the scan results are written.
- '-t' tells Find_SSNs that you want your results in a csv file.
- '-a' tells Find_SSNs to search for both SSNs and CC numbers.
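If you scan several locations instead of the whole computer, the same options can be assembled once per starting path. The Python 3 sketch below only builds the argument lists; it uses the options described above and nothing else. The function names and the choice of a separate output subdirectory per starting path are our assumptions (we do not know how Find_SSNs names its output files, so separate directories are simply a cautious way to keep results apart).

```python
import os


def build_scan_command(start_path, output_dir, script="Find_SSNs.pyw"):
    """Build the argument list for one scan, mirroring the command above."""
    return ["python", script, "-p", start_path, "-o", output_dir, "-t", "csv", "-a"]


def build_scan_plan(paths, output_root):
    """Build one command per starting path, each with its own output directory."""
    plan = []
    for path in paths:
        # Turn "/home/staff" into "home_staff"; "/" becomes "root".
        label = path.strip("/").replace("/", "_") or "root"
        plan.append(build_scan_command(path, os.path.join(output_root, label)))
    return plan
```

Each list returned by build_scan_plan(["/home", "/fileshare"], "/root/find_ssns") can be passed to subprocess.call from the directory that holds Find_SSNs.pyw.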
Find_SSNs outputs two files:
- A csv file which lists the files that have suspicious numbers.
- A txt file which lists the filenames and the actual suspicious numbers.
After you have reviewed the two output files, securely delete them from the computer, since the txt file contains the actual numbers that were found.
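One way to approach secure deletion is to overwrite a file's contents before removing it. The Python 3 sketch below does that with random data; the function name overwrite_and_delete and the default of three passes are our assumptions. Overwriting in place is not guaranteed to destroy the data on journaling or copy-on-write filesystems or on SSDs, so treat this as an illustration and prefer a dedicated tool (for example shred from GNU coreutils on Linux) where one is available.

```python
import os


def overwrite_and_delete(path, passes=3):
    """Overwrite a file with random bytes `passes` times, then remove it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as handle:
        for _ in range(passes):
            handle.seek(0)
            remaining = size
            while remaining > 0:
                # Write in chunks of at most 1 MiB to limit memory use.
                chunk = min(remaining, 1024 * 1024)
                handle.write(os.urandom(chunk))
                remaining -= chunk
            # Push the overwritten data out to the storage device.
            handle.flush()
            os.fsync(handle.fileno())
    os.remove(path)
```

Call it once for the csv file and once for the txt file after you have finished reviewing them.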
Note: If you receive the error "Error - Cannot load SSA areas to groups information. Are you connected to the Internet?", first confirm that the computer has a working Internet connection.
If the error persists, replace the URL on line 89 of numbers.py with http://www.hawaii.edu/