Dark Data – Assets that organizations collect but hardly ever use

January 13, 2016

Between our planetary system and the next such system lie light years of nearly nothing. Yet the space between the outer bounds of two solar systems is actually filled with something: dark matter.

Unobservable, it sits there between us and ET. Most of our galaxy is made up of such dark matter. It is a bit like air to my six-year-old daughter: all around us and yet not really visible.

For most companies, data has a similar texture. Evidence points to 90% of all data never being touched again after its creation. This data sits there, occupying terabytes of disk space, barely visible and difficult to access and traverse. Often it is stored for the sole purpose of meeting shifting compliance requirements.

Gartner calls this type of data ‘Dark Data’. In their words: “The information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes.”

Specific examples of such data could be some of the following:

  • Call notes & meeting minutes
  • Presentations, research & reports
  • Email conversations
  • Customer and account information
  • Service tickets & customer complaints
  • Knowledge base articles
  • Essentially any kind of document

This data is difficult to deal with for a number of reasons: it is tied up in legacy systems, most of it is unstructured and therefore hard to put to use in traditional business intelligence (BI) systems, the authors of the documents have since left the company, and so on. Consequently, most of this data is put to use neither for analytics nor for customer insights.

As other bloggers point out, this dark data also represents a number of risks related to storage, organization, legal and regulatory matters, and company reputation.

Is there a better way?

We believe there is. The dark data challenge may be broken up into four steps:

  • Step 1: Access to data & aggregation of data
  • Step 2: Enrich & relate this data
  • Step 3: Concept Search and insight generation
  • Step 4: Operationalize the insights in everyday business processes

Step 1 is the typical territory of the many ETL solutions available. Step 2 is critical: for any subsequent analysis, you first need to inject some format and structure into the unstructured data imported in step 1. We developed a versatile enrich & relate pipeline for this purpose, combining a number of methods. Imagine an assembly line with robots executing repetitive steps on each component that comes along the line.
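
To make the assembly-line picture a little more concrete, here is a minimal Python sketch. It is not the Squirro pipeline: the `Item` type, the two enrichment functions, and the keyword list are hypothetical and chosen only for illustration. The idea it shows is that every imported item passes through the same ordered sequence of small steps, each adding some structure.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator


@dataclass
class Item:
    """One imported document plus the structure added by enrichment steps."""
    text: str
    metadata: dict = field(default_factory=dict)


# An enrichment step takes an item and returns it with added metadata.
Enricher = Callable[[Item], Item]


def add_word_count(item: Item) -> Item:
    # Hypothetical step: record a simple size measure.
    item.metadata["word_count"] = len(item.text.split())
    return item


def tag_keywords(item: Item) -> Item:
    # Hypothetical step: tag the item with any known keywords it contains.
    keywords = {"complaint", "invoice", "renewal"}
    words = {w.strip(".,;:!?").lower() for w in item.text.split()}
    item.metadata["tags"] = sorted(keywords & words)
    return item


def run_pipeline(items: Iterable[Item], steps: list[Enricher]) -> Iterator[Item]:
    # The "assembly line": every item visits every step, in order.
    for item in items:
        for step in steps:
            item = step(item)
        yield item


docs = [
    Item("Customer complaint about the latest invoice."),
    Item("Meeting minutes: renewal discussed."),
]
enriched = list(run_pipeline(docs, [add_word_count, tag_keywords]))
for item in enriched:
    print(item.metadata)
```

Because each step has the same shape, new enrichments can be added to the list without touching the others.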

The Squirro pipeline provides a catalogue of such enrichments and relationship building methods (extract):

  • Search tagging
  • Classic (3rd party) text analytics
  • Language and sentiment detection
  • Deduplication and near-duplicate detection
  • Similar story detection
  • Known and unknown entity extraction
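
Two of the methods listed above, deduplication and near-duplicate detection, can be illustrated with a short Python sketch. This is one common textbook approach, not a description of how Squirro implements it: exact duplicates are found by hashing normalized text, and near-duplicates by comparing sets of three-word shingles with Jaccard similarity. The function names, the shingle size, and the 0.5 threshold are assumptions made for the example.

```python
import hashlib


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial differences are ignored.
    return " ".join(text.lower().split())


def fingerprint(text: str) -> str:
    # Identical normalized texts produce identical hashes.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    # Set of overlapping k-word sequences; short texts yield one shingle.
    words = normalize(text).split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    # Size of the intersection divided by size of the union.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_duplicates(docs: list, threshold: float = 0.5):
    # Pairwise comparison; fine for a sketch, too slow for large corpora.
    exact, near = [], []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if fingerprint(docs[i]) == fingerprint(docs[j]):
                exact.append((i, j))
            elif jaccard(shingles(docs[i]), shingles(docs[j])) >= threshold:
                near.append((i, j))
    return exact, near


docs = [
    "The customer reported a login error on Monday",
    "the customer  reported a login error on monday",
    "The customer reported a login error on Tuesday",
    "Quarterly revenue presentation for the board",
]
exact_pairs, near_pairs = find_duplicates(docs)
print("exact:", exact_pairs)
print("near:", near_pairs)
```

The pairwise loop compares every document with every other one, so a production system would typically replace it with an indexing technique such as MinHash; that choice is outside the scope of this sketch.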

The third step is core to the Squirro engine: the Concept Search serves to identify insights in the processed data. These insights can be of various kinds: previously hidden relationships, customer interactions, escalation issues, product discussions, financials, and more.

We believe the fourth and final step is crucial. To get the most out of dark data, you need to operationalize the use of the insights. Today this is mostly done by preparing a (complex) dashboard that requires an in-depth understanding of the data analyst’s intention. We think it is worth a lot more to bring these insights, in condensed form, into the daily work routine instead: for example, by integrating a dashboard driven by customer insights into your CRM instance (e.g. Salesforce, Microsoft Dynamics) or a dashboard driven by service insights into your service management system (e.g. ServiceNow, BMC Remedy, Salesforce ServiceCloud).

Through this approach, we have helped a number of companies turn their dark data into meaningful, actionable insights.