Thirty-five zettabytes. That is the equivalent of 35 trillion gigabytes, and that is how much information the Computer Sciences Corporation estimates users will generate annually by the year 2020. The phrase “big data” simply refers to sets of information that have grown too large or too complex for the standard tools we used to rely on. Thanks to this exponential growth, there has been a push to find a way to manipulate and manage all this information, and classification is the key to making it work.
Analyzing Big Data
Broken down into its basic components, big data is just a collection of the simple pieces the everyday user knows well:
- Everyday work documents
These files are created and shared, getting saved somewhere within the data storage environment in the process. This information is being generated at an unprecedented pace, causing unstructured growth. The boom, so to speak, has left many in the industry scratching their heads when trying to set policies around it or simply maintain it.
The Problem With Unstructured Growth
One of the reasons so many people in the information technology field are concerned about big data is that legal requirements now oblige a business to be able to store and retrieve certain information. Left unmanaged, these key pieces of data can become a compliance liability or land a business in legal trouble.
What’s more, improper storage of all this information is one way security breaches occur. Hackers need only find one small opening to compromise an entire data set, as the many incidents at major retail companies, for example, have shown.
The Process of Storing Data
So how can we properly and securely store all this information? The first step is to classify it correctly. In a catch-22, we find that in order to classify data, we often need to have policies in place. However, it is difficult to create policies without first having classified the information.
That is why breaking down big data to the ground level is essential: it lets us figure out what kind of unstructured data is out there. Once we identify the items at a file level, we can start to classify them, because we can determine where each one is located, who owns it and when it was last accessed.
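As a rough illustration of what a file-level inventory might look like, the Python sketch below walks a folder and records where each file is, who owns it and when it was last accessed. The function name `inventory` and the record fields are hypothetical, chosen only for this example. It also assumes the file system tracks access times, which some systems disable for performance.

```python
from datetime import datetime, timezone
from pathlib import Path


def inventory(root):
    """Return one record per file under root: location, owner id, last access."""
    records = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        info = path.stat()
        records.append({
            "location": str(path),
            # Numeric owner id as reported by the OS (typically 0 on Windows).
            "owner_uid": info.st_uid,
            # Last access time; may be stale if the volume does not update it.
            "last_accessed": datetime.fromtimestamp(info.st_atime, tz=timezone.utc),
            "size_bytes": info.st_size,
        })
    return records
```

Calling `inventory("/some/shared/folder")` would return a list of dictionaries that later classification steps could work from.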
Being Smart About Information
The classification process is exactly how we can take a step back and attack the storage problem. Gaining a deeper understanding of the files will enable us to improve the way our data works and is governed.
There are six basic classifications that a piece of data may fall under:
Regulatory requirements will automatically mean that certain pieces of information are valuable to a business in the long term. Storage companies may be able to find these files by searching for keywords, identifying who owns them or simply knowing the file type. Once located, the information can be placed into an archive to satisfy legal or other standards.
If something has been created in the last three years, it is considered active and therefore is most likely to be accessed again. It can be managed in place until either aging out of the system or moving into another classification.
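The three search approaches mentioned here (keyword, owner and file type) could be combined along the following lines. This is only a sketch: the keyword list, the owner names and the file extensions are made-up placeholders, and a real rule set would come from a company's legal or compliance staff.

```python
from dataclasses import dataclass

# Hypothetical rules, for illustration only.
KEYWORDS = ("invoice", "contract", "tax")
REGULATED_OWNERS = {"finance", "legal"}
REGULATED_EXTENSIONS = (".pst", ".xlsx")


@dataclass
class FileRecord:
    name: str
    owner: str
    text: str = ""  # extracted text content, if any is available


def needs_archiving(record):
    """Flag a file if its keywords, owner or file type match the rules."""
    name = record.name.lower()
    has_keyword = any(k in name or k in record.text.lower() for k in KEYWORDS)
    owned_by_regulated_group = record.owner.lower() in REGULATED_OWNERS
    regulated_type = name.endswith(REGULATED_EXTENSIONS)
    return has_keyword or owned_by_regulated_group or regulated_type
```

A file that matches any one of the three checks is flagged, which errs on the side of keeping too much rather than too little.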
Information often moves from being active to being aged. Items that have not been accessed for three years may represent as much as 40 percent of the data on a company’s network. Therefore, it is imperative for a business to take this information and move it either into an archive file, if it has value, or the trash can if it does not. Classification enables us to view who owns the document or search it by keyword to determine if it is something that should be saved.
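The split between active and aged information comes down to a single date comparison. The sketch below assumes the cutoff is measured from a file's most recent activity, and approximates three years as 1,095 days. Which timestamp to use (creation or last access) is a policy decision this example does not settle.

```python
from datetime import datetime, timedelta

THREE_YEARS = timedelta(days=3 * 365)  # approximation that ignores leap days


def age_class(last_activity, now):
    """Return "active" or "aged" using the three-year cutoff."""
    return "active" if now - last_activity < THREE_YEARS else "aged"
```

For example, measured on January 1, 2020, a file last touched in 2019 would come back as "active" and one last touched in 2015 as "aged".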
You know the drill: You create a version of a document and share it with a co-worker, who makes a few changes and shares it with someone else, who also makes changes. You now have three copies of the same document floating around and taking up space. Through data profiling, we can attach a signature to a document that can help us determine if it is an exact copy of something else and can be deleted.
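One common way to build such a signature is a cryptographic hash of the file's contents. The sketch below uses SHA-256 from Python's standard library, which is our choice for illustration and not necessarily what any particular profiling product uses. Note that a hash only matches byte-for-byte copies, so a version with even one edit will get a different signature.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def signature(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_copies(root):
    """Group files under root that share the same content signature."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            groups[signature(path)].append(path)
    # Only groups with more than one file contain duplicates.
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Each group returned holds files with identical contents, so all but one copy in a group is a candidate for deletion.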
Even companies with strict policies that restrict the use of machines for personal items will find that employees often store pictures or to-do lists on the system. Someone has a new baby and wants to share a picture with co-workers, and that image has now been saved somewhere in your data storage. While one photo may not be problematic, several photos from thousands of employees can be. Businesses can utilize data classification to identify personal information and ask employees to remove it from the network.
This last group is likely the easiest to identify and manage, as abandoned information typically does not have value. Usually, this is data that former employees owned, and it has not been accessed in the three-year timeframe. It is still a good idea to ensure that the files do not contain important information that should be stored, however, just to cover any liabilities.
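The two conditions for abandoned data (a former employee as owner, and no access in three years) can be expressed as a simple check. The sketch assumes a list of current employees is available from somewhere such as an HR system. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta

THREE_YEARS = timedelta(days=3 * 365)  # approximation that ignores leap days


def is_abandoned(owner, last_accessed, current_employees, now):
    """True if the owner has left and the file is past the three-year cutoff."""
    owner_has_left = owner not in current_employees
    untouched = now - last_accessed >= THREE_YEARS
    return owner_has_left and untouched
```

Files flagged this way would still go through the content review described above before anything is deleted.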
Once the classification process is complete, businesses can easily manage the information by archiving it, deleting it or moving it to a less expensive data center. Policies are easily created once a business knows what kind of information it has, and these policies can be used as a legal defense for deleting a document.
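In its simplest form, a policy can be written down as a lookup from classification to action. The labels and actions below are our own shorthand for the groups and options described in this article, not an official scheme, and every business would fill in its own table.

```python
# Hypothetical labels and actions, for illustration only.
POLICY = {
    "regulated": "archive",
    "active": "manage in place",
    "aged": "review: archive if valuable, otherwise delete",
    "duplicate": "delete",
    "personal": "ask owner to remove",
    "abandoned": "review, then delete",
}


def action_for(classification):
    """Look up the action for a classification label."""
    # Unknown labels fall back to manual review rather than deletion.
    return POLICY.get(classification, "manual review")
```

Keeping the table in one place also gives a business a written record of why a given document was archived or deleted.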
At Titan Power, our goal is to help you run as efficiently as possible. Let’s combat the problems associated with big data by simply making our data smarter. Identify it, classify it and create a policy to give it structure.