|Dealing with dirty data|
|Written by Nelson KingHits : 2303|
|Friday, 28 February 2003 19:00|
What is the cost of bad information, meaning data that is in error, duplicated, missing, or misleading? If you said "hard to know" or "incalculable," that's not only right, but also a big part of the problem. We may feel that bad data exists, but we often don't really know exactly what it is or how much it affects us. Sometimes it's obvious: Five traveling salespeople return to the office each afternoon and enter their sales information for the day (sales items, customer info, notes, etc.) into a computer database. They do their own data entry and are allowed to add, edit, or delete anything. They can see each other's data. There are no restrictions. How long do you think it will be before this database is unreliable?
I've conducted a test situation like this myself and there are plenty of similar studies. The data degrades very rapidly, starting with customer information such as spelling of names, numeric accuracy in addresses, and missing information. Errors accumulate within days. The last to go are sales figures, but even there it takes only a week or so before serious errors appear. An unprotected database like this can become unreliable--unusable, really--in less than two months.
This example is extreme, since no company with solid management would allow such an open database. (Would it?) But even in a professionally designed and operated system, where the who, what, when, and where of data is strictly controlled, errors exist. To know this is not an intuitive leap. Nothing is perfect, so the real big-bucks question is: When does dirty data become a problem? Or put another way, how much bad data can be tolerated? Part of the job description of most full-time database administrators is the maintenance of the data, which includes data cleaning. Those companies that have database administrators typically have procedures for training users on data entry, rules for data validation, and software to support data cleaning. Unfortunately, smaller businesses without this level of data management may not even know they have a problem, which is one of the reasons for this column.
A plague of bad data
Spam is a good example of the cost of bad data because we have first-hand experience--we know how much time it takes to deal with it, and that time is a cost. It has other costs too, like plugging up networks and servers, and crowding out legitimate e-mail. However, I hadn't made the link between spam and dirty data until I realized that it's one of the things spam, business data, knowledge management, and XML have in common. Spam contains data--words, information--that are unwanted and very often erroneous. It just so happens that the whole message is more or less bad data. In other forms, like business data, the bad data is usually scattered or fragmented.
The important point, however, is that the basic steps for dealing with bad data are the same: detecting it, determining how much to tolerate, and removing it. The first two steps are the hardest and require the most sophistication. In fact, the techniques involved in the process of data cleaning are very similar, whether it's e-mail spam, a business database, or an XML document.
Not long ago I encountered the question of tolerating dirty data (or not) while I was testing antispam software. As I'm sure you know from personal experience, the tide of spam is rising, and controlling it is becoming increasingly difficult. The old do-it-yourself "find and delete" routine is withering under the pressure of dozens (if not scores) of spam messages a day. Of course, there are spam killers. I've looked at nearly 20 of these programs. Not one of them detects all of the spam. Why not?
The main reason is that a certain percentage of spam doesn't look like spam. Most antispam systems scan messages for key words, symbols, or other detectable characteristics. If, for example, the words "FREE MONEY" appear in the subject or body of a message, most programs will kick the message out as spam. But what about "best mortgage rates"? I get four to 10 mortgage offers a day (as you probably do). Point 1: I would bet that not one of these is a fully legitimate offer. Point 2: On principle, I am totally uninterested in a mortgage via e-mail. So, I consider these messages spam, just as obtuse and irritating as Viagra spam. But most antispam software has trouble with mortgage rate.
That's because the words mortgage and rate are legitimate, and in fact are vital everyday words in financial news of all kinds.
You can train some antispam software to recognize mortgage and rate as spam words, but that will kick out every message that contains them. The question becomes, am I willing to tolerate a certain number of mortgage rate spam messages in order to make sure I don't miss news or other correspondence? This brings me back to the question concerning toleration of bad data, because it is very much the same issue.
Please parse the analysis
Identification of dirty data is the starting point. For obvious reasons, you can't deal with the problem until you know that it exists and what its scope is. The most straightforward approach is to scan the message, document, or database for letters, words, and numbers that either don't belong or are missing. This process involves parsing (taking apart) the content and comparing it against something that indicates what should be there.
With spam, the subject line of the message and often the body of the message are parsed to look for words that are on an exclusion list. With databases, the individual items (records and fields) are parsed and compared against either lists of allowed data or against rules that govern the type or format of data. XML documents can be checked for content and are almost always checked for structure. The software that does this sort of checking is often specifically written for the target (e.g., e-mail, XML document, database tables) but the search and identification techniques are quite similar. The techniques can also be extremely sophisticated, using approaches such as pattern recognition, artificial intelligence routines, and statistical analysis. The result is usually some kind of report on the dirty data--a listing--with information on frequency, location, and type.
The cost of clean
Once you know your information contains dirty data, and how much, you're in position to evaluate the problem. As with the example of pulling spam out of e-mail, decisions need to be made about the severity of the problem, and the thoroughness of the counter measures.
Database administrators know there is some dirty data in every database. The question is, what is it worth to spend the time and money to find and remove it? If the bad data exists in critical areas, such as accounting, then almost any effort is justified. On the other hand, dirty data in a salesman's notes--if it can be identified at all--may not be of particular concern and not worth the money to clean it.
Weeding spam is relatively simple compared to cleaning vast databases. When a company decides to do some data mining, one of the first things it needs to do is an analysis of data quality. Large amounts of aggregated data like pulling together 20 years of sales figures can be significantly affected by relatively small amounts of dirty data. There have been plenty of cases where at least initially the data was so bad that data mining and other business analysis couldn't be done.
There is plenty of help for cleaning data. Dozens of companies specialize in it. Many also provide software that can do the job. Products such as The Computing Group Ultra Address Management, Evolutionary Technologies ETI*Data Cleanse, and Knowledge Integrity Inc. and DQGuard attack various aspects of dirty data. Choices need to be guided by the target for cleaning (different approaches for e-mail, databases, etc.). Also, the jargon and rules for valid data are different within a specific industry or discipline (a domain). If you're in the petrochemical industry, then data cleaning needs to be tailored for that industry.
It's still somebody's call
Having described what professionals can do for data cleaning, I come back to the spam example to reiterate: The final issue in cleaning data is, how much should be done and at what cost? There is a kind of law of diminishing returns. Just as you may not want to remove all forms of spam, trying to remove all dirty data may be unrealistic or too expensive. Ultimately, however, the point is to never think, "Well, the data are probably OK." It's important to know it's OK. Within the context of how the data will be used, a company should have confidence that data quality is at a particular level. Somebody has to take the responsibility and decide where to draw the lines. That's the right way to balance the cost of dirty data with the cost of removing it.