Data Anonymization: What It Is And What It Is Not
For data to be used effectively and shared between business entities, it must be stripped of all identifiable characteristics in a way that does not render it useless to analytics algorithms. The practice of data anonymization and de-identification walks a thin line between data that can be used for analysis and data that divulges too much personal information and violates privacy rights.
Unfortunately, we have learned through experience that there is no such thing as “completely secure” when it comes to computers and the internet. Companies that take advantage of data analytics must continually refine and adapt data procedures for the most recent threats and advancing technologies.
To keep up with the anonymization of data, tech firms and data-heavy organizations are looking to hire professionally trained business data analytics personnel: individuals who are able to track and interpret data and stay on top of the latest trends in the digital analytics landscape. Those looking to advance in this career field often hold a graduate degree, such as a Master of Science in Business Data Analytics, which can arm you with the knowledge and skills to help lead new approaches to data collection and usage.
What Data Must Be Anonymized?
Healthcare and financial information are generally considered two of the most sensitive types of data available. For data in these two categories to be used legally and without fear of violating a person’s right to privacy, all identifying information, or identifiers, must be removed through a process called de-identification.
Anonymization is particularly important when sharing information that pertains to victims of sexual assault, minor children, and other delicate matters. There’s also the COVID-19 effect: a noticeable surge in online shopping, banking, education, and telehealth. Keeping this information safe and out of the reach of hackers requires a highly specialized set of skills that protects identifying information from being misused.
A list of these identifiers is provided by the U.S. Department of Health and Human Services’ Guidance Regarding Methods for De-Identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. While this list is designed to apply to healthcare-specific data, most data analytics organizations aim to anonymize these same identifiers:
- Geographical subdivisions smaller than a state
- Dates more specific than a year
- Telephone numbers
- Vehicle identifiers, including license plate numbers
- Fax numbers
- Device identifiers and serial numbers
- Email addresses
- Social Security numbers
- IP addresses
- Medical record numbers
- Biometric identifiers (such as fingerprints and voiceprints)
- Health plan beneficiary numbers
- Full-face photographs
- Account numbers
- Certificate or license numbers
- Any other unique number, characteristic, or code
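Several of the identifiers above follow recognizable patterns, so a first pass at scrubbing free text can be sketched with regular expressions. The following is a minimal illustration, not a complete de-identification tool: the `PATTERNS` table and the `redact` function are hypothetical names, the sample note is made up, and the patterns are deliberately simple and will miss many real-world formats.

```python
import re

# Hypothetical, deliberately simple patterns for a few identifier types.
# Real data needs far more robust matching than this.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def redact(text: str) -> str:
    """Replace each matched identifier with a bracketed type label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Made-up sample text for illustration only.
note = (
    "Reach Pat at pat@example.com or 555-867-5309; "
    "SSN 123-45-6789, last login from 192.168.0.12."
)
print(redact(note))
# Reach Pat at [EMAIL] or [PHONE]; SSN [SSN], last login from [IP].
```

Notice that the name "Pat" survives the scrub. Identifiers without a fixed format, such as names, cannot be caught by pattern matching alone, which is one reason this kind of pass is only a starting point.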
When faced with attributable data elements, you can de-identify the data in several ways, though none is an absolute guarantee of security. “Data anonymization reduces the risk of unintended disclosure when sharing data between countries, industries, and even departments within the same company,” explains Investopedia on its “Data Anonymization” page. “Anonymization of data is done in various ways, including deletion, encryption, generalization, and a host of [other methods].”
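To make a few of these methods concrete, here is a minimal sketch applied to a single record. It shows deletion and generalization directly. In place of encryption it uses a keyed hash (HMAC-SHA256) to replace a record number with a pseudonym, which is a related but different technique. The field names, the `SECRET_KEY` value, and the `deidentify` function are all assumptions made for illustration.

```python
import hashlib
import hmac

# Assumption: a secret key stored separately from the shared data set.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(value: str) -> str:
    """Return a keyed hash (HMAC-SHA256) of the value as a hex string."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record: dict) -> dict:
    """Apply deletion, generalization, and pseudonymization to one record."""
    out = dict(record)  # work on a copy so the input is left untouched

    # Deletion: drop direct identifiers outright.
    for field in ("name", "email"):
        out.pop(field, None)

    # Generalization: keep only the year, and widen age to a 10-year band.
    out["visit_year"] = out.pop("visit_date")[:4]
    decade = (out.pop("age") // 10) * 10
    out["age_band"] = f"{decade}-{decade + 9}"

    # Pseudonymization: swap the record number for a keyed hash.
    out["patient_key"] = pseudonymize(out.pop("medical_record_number"))
    return out


# Made-up record for illustration only.
record = {
    "name": "Pat Doe",
    "email": "pat@example.com",
    "visit_date": "2021-03-14",
    "age": 47,
    "medical_record_number": "MRN-001234",
    "diagnosis_code": "J45",
}
print(deidentify(record))
```

The keyed hash lets analysts link records belonging to the same person without seeing the original number, but anyone holding the key can recompute the mapping, so the key needs the same protection as the identifiers themselves.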
To make matters even more complicated, identifiable data can be categorized by degree. Explicitly identifiable information includes obvious identifiers such as a person’s name, address, and Social Security number. Potentially identifiable data can include elements such as browser cookies, MAC addresses, and other information that can easily be used to single out a user, client, or patient.
Improving the anonymous collection and storage of data in the future will require oversight from privacy experts, and it is an area where you can focus your professional efforts. Certified information privacy professionals (CIPPs) Jules Polonetsky and Kelsey Finch, along with privacy expert Omer Tene, offer a possible regulatory answer in their International Association of Privacy Professionals article “PII, Cookies and De-ID: Shades of Gray.”
“A regulatory approach recognizing the complete spectrum of data categories will create incentives for organizations to avoid explicit identification and to deploy safeguards and controls, while at the same time allowing them to maintain the utility of data sets,” claim the authors.
Still, despite every effort to protect privacy, shared data can never be protected against 100% of threats.
Is Complete Anonymization Possible?
“[The] anonymization process is an illusion,” claims deep learning authority Pete Warden in his post “Why You Can’t Really Anonymize Your Data” on O’Reilly Media’s blog. “Precisely because there are now so many different public datasets to cross-reference, any set of records with a non-trivial amount of information on someone’s actions has a good chance of matching identifiable public records.”
What Warden is saying in his post is that re-identification of de-identified data is often possible by cross-referencing different sources of information. For example, if a de-identified bank account holder made a purchase at a specific restaurant last Friday, it may be possible to find out who that person was by scouring social media for the phrase, “I went to [specific restaurant’s name] for dinner tonight.” Obviously, this is a simplified example, but it illustrates exactly how cross-referencing re-identification is accomplished.
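The restaurant scenario can be sketched in a few lines of code. This is a toy illustration on made-up data, not a description of any real attack tool: the `link` function and every record below are hypothetical. It joins de-identified transactions to public posts on merchant and date, and treats a match as a re-identification only when exactly one account fits.

```python
from collections import defaultdict

# Made-up "de-identified" bank transactions: no names, only account codes.
transactions = [
    {"account": "acct_7f3a", "merchant": "Luigi's", "date": "2021-03-12"},
    {"account": "acct_22c9", "merchant": "Taco Hut", "date": "2021-03-12"},
    {"account": "acct_9b10", "merchant": "Taco Hut", "date": "2021-03-12"},
]

# Made-up public social media posts that mention a restaurant visit.
public_posts = [
    {"user": "Pat Doe", "merchant": "Luigi's", "date": "2021-03-12"},
    {"user": "Sam Roe", "merchant": "Taco Hut", "date": "2021-03-12"},
]


def link(transactions, posts):
    """Map accounts to names where merchant and date match exactly one account."""
    by_key = defaultdict(list)
    for t in transactions:
        by_key[(t["merchant"], t["date"])].append(t["account"])

    matches = {}
    for p in posts:
        candidates = by_key.get((p["merchant"], p["date"]), [])
        if len(candidates) == 1:  # a unique match singles out one account
            matches[candidates[0]] = p["user"]
    return matches


print(link(transactions, public_posts))  # {'acct_7f3a': 'Pat Doe'}
```

Only one account is matched here because two accounts share the second restaurant and date. The more people who share the same combination of attributes, the harder this kind of cross-referencing becomes, which is the reasoning behind generalizing dates and locations.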
As new security breaches take place and new methods of cross-identification are developed, business data analytics professionals will need to continually adapt to evolving circumstances and acknowledge an ever-present margin of error. Additionally, if you pursue this career path, you will need to stay fluent in new data analytics and security technologies as the requirements of your role continue to change.
The presence of threats doesn’t mean that data analytics should be scrapped and never attempted again. On the contrary, data analytics services, companies, and departments should continue their work, constantly looking for new ways to protect sensitive information and anonymized data. And companies dealing in data should always be upfront with clients, patients, and customers that their data will never be 100% secure. Also, details in data should be limited to only what is absolutely necessary for analytics purposes.
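Limiting detail to what the analysis needs can be as simple as keeping an explicit allow list of fields and dropping everything else before data is shared. The sketch below assumes hypothetical field names and a hypothetical `minimize` function. In practice, the allow list would come from the requirements of the specific analysis.

```python
# Assumption: the only fields this particular analysis needs.
ALLOWED_FIELDS = frozenset({"diagnosis_code", "visit_year", "age_band"})


def minimize(record: dict, allowed=ALLOWED_FIELDS) -> dict:
    """Keep only the allow-listed fields; everything else is dropped."""
    return {key: value for key, value in record.items() if key in allowed}


# Made-up record for illustration only.
record = {
    "diagnosis_code": "J45",
    "visit_year": "2021",
    "age_band": "40-49",
    "zip_code": "63141",
    "device_id": "A1B2C3",
}
print(minimize(record))
# {'diagnosis_code': 'J45', 'visit_year': '2021', 'age_band': '40-49'}
```

An allow list fails safe: a new field added upstream stays out of the shared data until someone decides it is needed, whereas a block list would let it through by default.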
Maryville University’s Master’s Degree in Business Data Analytics
The demand for business analytics experts lies at the heart of Maryville University’s online Master of Science in Business Data Analytics degree, and you can learn to lead improved strategies for information management. As a program graduate, you can prepare to enter the workforce as a statistician, data scientist, data analyst, or actuary.
At Maryville University, you can learn how to handle data sets, orchestrate multiple infrastructures, monetize data, and make decisions based on valuable analytics insights. Courses will expose you to the training needed to combine business operational data with the latest analytical tools, helping you develop the skills and experience to position yourself as a valuable asset to today’s employers.
U.S. Department of Health and Human Services, “Guidance Regarding Methods For De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule”