Features

April 2008

March 2008

February 2008

January 2008

December 2007

November 2007

October 2007

September 2007

August 2007

July 2007

June 2007

May 2007

Feeds

Rss-icon RSS Atom-icon ATOM

What is RSS & ATOM?

The problem of dirty data

Tagged with: , ,

View Comments and Questions (4)

What's Your Question – Ask Our Experts..

Dirty Data Problem

Corporate IT systems are hiding a dirty secret: they are awash with data that is inaccurate, incomplete, out-of-date, duplicated, redundant or just plain meaningless.

And once databases are full of ‘dirty data', getting meaningful information about purchasing trends or customer demographics back out of them, is nigh-on impossible.

In short, dirty data is the enemy of business intelligence (BI) and data warehousing initiatives - but it is also a problem that many companies fail to take seriously, says Rob Bailey, a business intelligence specialist at Logicalis.

That happens for two reasons, he says: "First, companies tend to underestimate the extent of the problem, so they don't have in place adequate data quality controls. Second, they underestimate how hard it is to fix bad data once it's in the system. They simply don't understand the time and effort it takes to put it right."

The results of that are clear: ambitious BI implementations that ultimately fail to produce meaningful analysis of business conditions. Or, to put it more simply: "If you put rubbish in, you get rubbish out."

Fortunately, there are tools available that can help companies get a better handle on data quality control, he says. These broadly fall into two categories.

First, there are ETL (extraction, transformation and loading) tools. Originally designed simply to move data between databases, these now incorporate sophisticated data profiling functions that can be tuned to spot invalid data structures, incorrect or missing values, duplications, misplaced fields and other problems within data batches. What's more, they can make automatic changes to data, removing much of the manual effort that might otherwise be involved in putting it right.

Second, there are data hubs that sit at the centre of the IT infrastructure and synchronise the data (most typically, customer data) captured by multiple different applications - a company's customer relationship management system, for example, or its financial accounting software. These hubs can be tuned to standardise addresses or check customer names as they are entered in any of these applications, Bailey explains.

The cost of such tools may be an issue for some organisations - but it should be viewed in relation to the costs arising from inadequate data quality checks: in a 2006 survey by independent industry body The Data Warehousing Institute, for example, over half of respondents (53%) said that their companies had suffered unnecessary losses and business problems as a direct result of dirty data.

"It's a big mistake to embark on a BI project if you're working on the assumption that your data is OK," says Bailey. "In most cases, it's not."

.

Your Comments and Questions

Rob Bailey, 22 days ago

First of all, apologies for the delay in responding - I hope you're still looking! ETL tools can certainly play an important part in protecting your data warehouse against poor quality data. Indeed, you must protect your data warehouse, otherwise you will seriously compromise its usefulness. Most ETL tools, such as Oracle Warehouse Builder (OWB), have data quality features. In the case of OWB you use these first of all to carry out an audit on the quality of the data in your source system. Having identified problem areas you then build rules to deal with them, either by cleaning the data as it is processed and allowing it to be loaded, or if this is not possible simply loading it to an exceptions table rather than the main warehouse tables. In the latter case you should then build processes (automated or manual) to see if the data can be corrected at source and re-extracted. Business Intelligence is a very wide-ranging term, but in general it refers to applications that present business data, from operational systems or data warehouses, in ways which make it readily comprehensible and allow management at all levels to make business decisions to improve performance within their area of responsibility. Dashboards, ad-hoc query and reporting tools and enterprise reporting packages all provide business intelligence, but aimed at different types and levels of user.

Vaibhav Asthana, 2 months ago

What are the best practices to avoid junk data or invalid data from our database to make data more relevant. How do ETL help in this regard. Please explain Term 'BI' (Business Intelligence).

Rob Bailey, 8 months ago

Interesting question. I don't think this is what data hubs are really intended for, though it is certainly conceivable that they could be used in this way. For example, a customer of ours has a data warehouse in which their main concern is with their customers' postal addresses. They operate in the publishing industry and like to entice new business by offering free 3-month trial subscriptions. Naturally, they would prefer it if the same person did not continually keep re-registering for free tials, thereby getting a permanent free subscription. The fact that they carefully track postal addresses has worked well for them so far, but now they are moving to digital its the email address that's important, not the postal address. And, yes, looking in their data warehouse they can see cases where one email address has registered for numerous free trials, using subtly different names and addresses. They do not use a data hub to look for this sort of thing at the moment, but it is certainly possible that the presence of a data hub (or even just feeding data in batch through a suitably enabled ETL tool) could help them identify and stop these activities, which are essentially fraudulent.

Victoria Furness, 8 months ago

Do you know of any organisations that have used the data hubs to look for potential cases of fraud as well? If they work by looking for anomalies, I wonder whether they could be used to spot any unusual activities, perhaps, that might indicate fraudulent activity?

What's Your Question or Comment?





583206379899cca7ceca18d9f67fa94051f8c641

type the text from the image

Digg_icon digg it!   Delicious Add to del.icio.us
.

Wiki

Data Warehouse

A data warehouse is a repository of an organisation's electronically stored data. Data warehouses are... Read More

Blog

Keep data safe – or face jail

Red faces and humiliating public apologies in the wake of serious data breaches are nothing... Read More

.