Data Anonymization: What It Is And What It Is Not

For data to be used effectively and shared between business entities, it must be stripped of all identifiable characteristics in a way that does not render the data useless to analytics algorithms. The practice of data anonymization and de-identification walks a thin line between data that can be used for analysis and data that divulges too much information and subsequently violates privacy rights.

Unfortunately, the world has learned through bitter experience that there is no such thing as “completely secure” when it comes to computers and the Internet. Companies that take advantage of data analytics must continually refine and adapt their data procedures to keep pace with the most recent threats and advancing technologies.

To keep on top of data anonymization, tech firms and data-heavy organizations are looking to hire professionally trained business data analytics personnel. These individuals typically have a Master of Science degree in Business Data Analytics and an eclectic knowledge of everything related to the analytics field.

What Data Must Be Anonymized?

Healthcare and financial data are among the most sensitive types of data available. For data in these two categories to be used legally and without fear of violating a person’s right to privacy, all identifying information, or identifiers, must be removed from the data to be analyzed through a process called de-identification.

A list of these identifiers is provided by the U.S. Department of Health and Human Services’ “Guidance Regarding Methods For De-Identification Of Protected Health Information In Accordance With The Health Insurance Portability And Accountability Act (HIPAA) Privacy Rule.” While this list is designed to apply to healthcare-specific data, most data analytics organizations aim to anonymize these same identifiers:

  • Names
  • Geographical subdivisions smaller than a state
  • Dates more specific than a year
  • Telephone numbers
  • Vehicle identifiers, including license plate numbers
  • Fax numbers
  • Device identifiers and serial numbers
  • Email addresses
  • URLs
  • Social security numbers
  • IP addresses
  • Medical record numbers
  • Biometric identifiers (such as fingerprint and voice prints)
  • Health plan beneficiary numbers
  • Full-face photographs
  • Account numbers
  • Certificate or license numbers
  • Any other unique number, characteristic, or code

When faced with identifiable data elements, a business can de-identify its data in several ways, though none are an absolute guarantee of security. “Data anonymization reduces the risk of unintended disclosure when sharing data between countries, industries, and even departments within the same company,” explains Investopedia.com on its “Data Anonymization” page. “Anonymization of data is done in various ways including deletion, encryption, generalization, and a host of [other methods].”
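As a rough illustration of three of the methods named above (deletion, generalization, and a keyed hash standing in for the "encryption" option), here is a minimal Python sketch. The field names, the sample record, and the `deidentify` helper are all hypothetical and chosen only for this example; a real data set would need its own field mapping and a properly managed secret key.

```python
import hashlib
import hmac

# Hypothetical field names; a real schema will differ.
FIELDS_TO_DELETE = {"name", "email", "phone", "ssn", "city"}


def deidentify(record, secret_key):
    """Return a copy of `record` with identifiers deleted, generalized, or tokenized."""
    out = {}
    for field, value in record.items():
        if field in FIELDS_TO_DELETE:
            # Deletion: drop direct identifiers and sub-state geography entirely.
            continue
        if field == "birth_date":
            # Generalization: keep only the year of an ISO "YYYY-MM-DD" date.
            out["birth_year"] = value[:4]
        elif field == "account_number":
            # Keyed hashing: replace the number with a token that cannot be
            # recomputed without the secret key.
            digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
            out["account_token"] = digest.hexdigest()[:16]
        else:
            out[field] = value
    return out


record = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "city": "St. Louis",
    "state": "MO",
    "birth_date": "1984-03-17",
    "account_number": "12345678",
    "balance": 1520.75,
}
clean = deidentify(record, b"example-key")
```

In this sketch, `clean` keeps only the state, the birth year, an account token, and the balance. As the rest of this article explains, a record like this may still be re-identifiable when combined with other data.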

To make matters even more complicated, identifiable data can be categorized by degree. Explicitly identifiable information includes obvious identifiers such as name, address, and social security number. Potentially identifiable data can include elements such as browser cookies, MAC addresses, and other types of data that can easily be used to identify a user, client, or patient.

In an effort to improve data anonymization in the future, Certified Information Privacy Professionals (CIPPs) Jules Polonetsky and Kelsey Finch, along with privacy expert Omer Tene, propose a possible regulatory answer in their IAPP.org article “PII, Cookies And De-ID: Shades Of Gray.”

“A regulatory approach recognizing the complete spectrum of data categories will create incentives for organizations to avoid explicit identification and to deploy safeguards and controls, while at the same time allowing them to maintain the utility of data sets,” the authors claim.

Still, despite every effort to protect privacy, shared data can never be fully protected against every threat.

Is Complete Anonymization Possible?

“[The] anonymization process is an illusion,” claims deep learning authority Pete Warden in his post “Why You Can’t Really Anonymize Your Data” on O’Reilly Media’s blog. “Precisely because there are now so many different public datasets to cross-reference, any set of records with a non-trivial amount of information on someone’s actions has a good chance of matching identifiable public records.”

What Warden is saying in his post is that re-identification of de-identified data is often possible by cross-referencing different datasets. For example, if a de-identified bank account holder made a purchase at a specific restaurant last Friday, it may be possible to find out who that person was by scouring social media for the phrase “I went to [specific restaurant’s name] for dinner tonight.” Obviously, this is a simplified example, but it illustrates exactly how cross-referencing re-identification is accomplished.
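The restaurant example can be sketched in a few lines of Python. Everything here is made up for illustration: the transactions, the social media posts, the user handles, and the `link` helper. The point is only that two data sets sharing a merchant and a date can be joined even when one of them contains no names.

```python
# De-identified bank transactions: no names, but merchant and date survive.
transactions = [
    {"account_token": "a91f", "merchant": "Luigi's Trattoria", "date": "2024-05-10"},
    {"account_token": "7c2e", "merchant": "Corner Cafe", "date": "2024-05-10"},
]

# Public social media posts (hypothetical), which are not anonymized.
posts = [
    {"user": "@sam_r", "mentions": "Luigi's Trattoria", "date": "2024-05-10"},
    {"user": "@lee_k", "mentions": "Harbor Grill", "date": "2024-05-11"},
]


def link(transactions, posts):
    """Map each account token to the public users sharing its (merchant, date) pair."""
    index = {}
    for post in posts:
        index.setdefault((post["mentions"], post["date"]), []).append(post["user"])
    return {
        t["account_token"]: index.get((t["merchant"], t["date"]), [])
        for t in transactions
    }


candidates = link(transactions, posts)

# A token with exactly one candidate is a likely re-identification.
reidentified = {
    token: users[0] for token, users in candidates.items() if len(users) == 1
}
```

Here the token `a91f` links to a single public user, while `7c2e` has no match. With more transactions per account, the set of candidates for each token usually shrinks further.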

Researchers Arvind Narayanan and Vitaly Shmatikov, whose work is documented on the privacy blog 33bits.org, conducted an experiment on Netflix data that demonstrated re-identification in action. In the experiment, they cross-referenced anonymous movie ratings from Netflix with the ratings and reviews available on the Internet Movie Database (IMDb), which are not anonymized. Using just this simple technique, they revealed a number of identifying characteristics of Netflix users, including sensitive information and political beliefs.
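To give a feel for how matching ratings across two sites can work, here is a deliberately simplified Python sketch. It is not the scoring method used in the Netflix experiment; it only counts movies that two profiles rated similarly. The movie titles, profile names, threshold values, and both helper functions are hypothetical.

```python
def overlap_score(anon_ratings, public_ratings, tolerance=1):
    """Count movies both profiles rated within `tolerance` stars of each other."""
    return sum(
        1
        for movie, stars in anon_ratings.items()
        if movie in public_ratings and abs(public_ratings[movie] - stars) <= tolerance
    )


def best_match(anon_ratings, public_profiles, min_score=3):
    """Return the best-scoring public profile name, or None if weak or tied."""
    scored = sorted(
        (
            (overlap_score(anon_ratings, ratings), name)
            for name, ratings in public_profiles.items()
        ),
        reverse=True,
    )
    if not scored or scored[0][0] < min_score:
        return None  # too little overlap to claim a match
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # ambiguous: two profiles fit equally well
    return scored[0][1]


# An "anonymous" rating history and two public reviewer profiles (all made up).
anon = {"Movie A": 5, "Movie B": 2, "Movie C": 4, "Movie D": 1}
public = {
    "reviewer_1": {"Movie A": 5, "Movie B": 1, "Movie C": 4, "Movie E": 3},
    "reviewer_2": {"Movie A": 2, "Movie D": 5},
}
match = best_match(anon, public)
```

In this toy data, the anonymous history lines up with `reviewer_1` on three movies and with `reviewer_2` on none, so `match` is `"reviewer_1"`. A single shared rating would fall below the `min_score` threshold and return no match.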

As new security breaches take place and new methods of cross-identification are developed, business data analytics professionals will need to continually adapt to evolving circumstances and acknowledge an ever-present margin of error.

The presence of threats doesn’t mean that data analytics should be scrapped and never attempted again. On the contrary, data analytics services, companies, and departments should continue their work, constantly looking for new ways to protect sensitive information. Companies dealing in data should always be upfront with clients, patients, and customers that their data can never be 100 percent secure. They should also limit the details in their data to only what is absolutely necessary for analytics purposes.

Maryville University’s Master’s Degree In Business Data Analytics

The demand for business analytics experts lies at the heart of Maryville University’s online Master of Science in Business Data Analytics degree. Graduates of this online program can be fully prepared to enter the workforce as a statistician, data scientist, data analyst, or actuary.

At Maryville University, students learn how to handle datasets, orchestrate multiple infrastructures, monetize data, and make decisions based on valuable analytics insights. They receive the training they need to combine business operational data with the latest analytical tools, making them invaluable to employers.

Sources:

Guidance Regarding Methods For De-Identification Of Protected Health Information In Accordance With The Health Insurance Portability And Accountability Act (HIPAA) Privacy Rule – https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html
Data Anonymization – http://www.investopedia.com/terms/d/data-anonymization.asp
PII, Cookies And De-ID: Shades Of Gray – https://iapp.org/news/a/pii-cookies-and-de-id-shades-of-gray/
Why You Can’t Really Anonymize Your Data – https://www.oreilly.com/ideas/anonymize-data-limits
Netflix Paper Home – https://33bits.org/about/netflix-paper-home-page/