Neat, organized data sets become challenging as they grow. Some data sets don’t even start out nice and orderly. So-called “unstructured” data is the first of many kinds of uncooperative data. Taming this beast can require extensive mathematical analysis and machine-learning experiments, all of which take time. But when the hidden patterns and valuable inferences start materializing, you’ll be glad you invested that time.
Other data is simply born broken. Hardware errors, noise, and software bugs can all wreck perfectly good records. Conventional data management excludes these records, but big data approaches can’t afford to exclude anything.
These damaged records are among the most important, because they describe problems and faults in an information environment. Leveraging damaged records to extract maximum value and make that value available quickly is a key challenge for a sustainable big data system.
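One common way to keep damaged records instead of discarding them is a quarantine (or “dead-letter”) path in the ingest step. Below is a minimal sketch, assuming newline-delimited JSON records; the `route_record` helper and the quarantine structure are illustrative names, not part of any specific system described here.

```python
import json

def route_record(raw, good, quarantine):
    """Parse one raw record, preserving it whether or not parsing succeeds.

    Rather than silently dropping records that fail to parse, route them
    to a quarantine list for later analysis.
    """
    try:
        good.append(json.loads(raw))
    except json.JSONDecodeError as err:
        # Keep the broken record together with its error context:
        # the fault itself carries diagnostic value.
        quarantine.append({"raw": raw, "error": str(err)})

good, quarantine = [], []
for line in ['{"id": 1}', '{"id": 2', "GARBAGE\x00DATA"]:
    route_record(line, good, quarantine)

# good holds the one well-formed record; quarantine holds both damaged ones
```

The point of the design is that the quarantine side is a first-class output, not a log of discards: it can be mined later for hardware faults, noisy sources, or the deliberately malformed records discussed below.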
Not to be overlooked, some malformed records are deliberate attacks on your organization. These records may have been handcrafted in an attempt to exploit or damage your infrastructure, reveal its weaknesses, or distract its operators. Once again, excluding these records is not a viable option, because they contain crucial information about attacks and potential weaknesses.
When you are charged with managing disparate data, containing noise and potentially even malicious attacks, in a big data system, you’re grappling with Extreme Confusion.