Large organisations are increasingly challenged by the volumes of data they have to manage and analyse. As technologies such as RFID and clickstream analysis of web traffic increase the sheer quantity of data captured, these organisations are having to deal with much larger data sets than were common in data warehouses even a few years ago.
Yet while new product areas such as data warehouse appliances spring up to help with these issues, it is easy to overlook a key fact: there is little point in employing ever more powerful hardware and software to analyse our data if the quality of that data is suspect. In a survey that my firm did some months ago, just four per cent of respondents rated the quality of their data as "excellent", while a recent IBM study found that one in three business leaders frequently make business decisions based on data they either don't trust or don't have.
Practical experience mirrors these broad figures. One UK bank I worked with discovered that, according to its systems, 8,000 of its customers were aged over 150, which made things interesting when the marketing department decided that selling life insurance to existing customers was just the thing to try.
When I spoke recently to a global pharmaceutical firm, it transpired that the data quality of its spare parts catalogue was so poor that depots kept far more spares for their factories than they needed. After a major data clean-up the firm found that in one case it held enough spare parts to last 90 years. This unnecessary inventory tied up working capital as well as warehouse space, and the savings made in the first year alone were over £2m for just five sites.
Yet in our survey a remarkable 63 per cent of large firms had no idea what poor quality data was costing them, while 42 per cent did not try to measure or monitor the quality of their data at all, never mind figure out what this may be costing them. Unsurprisingly, given that so few companies know the scale of the problem, the biggest barriers to implementing data quality were "it's difficult to present a business case" and "management does not see this as a priority", two sides of the same coin in my view.
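Measuring need not be elaborate to make a start. As a minimal sketch, and not a description of any particular tool, the few lines of Python below report what share of customer records fail a basic plausibility rule on date of birth. The records, the field names and the 120-year threshold are all invented for illustration.

```python
from datetime import date

# Hypothetical customer records; field names and values are illustrative only.
CUSTOMERS = [
    {"id": 1, "date_of_birth": date(1975, 6, 1)},
    {"id": 2, "date_of_birth": date(1850, 1, 1)},   # implausibly old
    {"id": 3, "date_of_birth": None},               # missing value
    {"id": 4, "date_of_birth": date(1990, 11, 23)},
]

MAX_PLAUSIBLE_AGE = 120  # assumed threshold, not an industry standard


def age_in_years(born, today):
    """Whole years between a date of birth and a reference date."""
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


def failing_ids(records, today):
    """Return ids of records whose date of birth is missing or implausible."""
    bad = []
    for rec in records:
        born = rec["date_of_birth"]
        if born is None or not 0 <= age_in_years(born, today) <= MAX_PLAUSIBLE_AGE:
            bad.append(rec["id"])
    return bad


def failure_rate(records, today):
    """Share of records failing the rule, between 0.0 and 1.0."""
    return len(failing_ids(records, today)) / len(records) if records else 0.0
```

Run regularly against a source system, a figure like this gives a trend that can be put in front of management, which is a first step towards the business case that respondents found so hard to present.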
The software industry has done a poor job of helping to articulate the benefits of good data quality, and has largely focused on dealing with customer name and address verification. This is understandable since every firm has customer data, and names and addresses are well structured so they are a relatively easy problem to attack. There are plenty of published algorithms to deal with common errors and misspellings in name data; put a user interface on top of one of these and you have a data quality tool.
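To illustrate how approachable this kind of name matching is, here is a minimal sketch using the general-purpose string comparison in Python's standard difflib module. It is not the method of any vendor mentioned here, and the sample names and the 0.85 threshold are assumptions chosen purely for illustration; commercial tools typically use more specialised techniques.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample data, invented for this example.
NAMES = ["Jonathan Smith", "Jonathon Smith", "Mary O'Brien", "Peter Jones"]

THRESHOLD = 0.85  # assumed cut-off; a real project would tune this


def normalise(name):
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def similarity(a, b):
    """Similarity score between 0.0 (nothing shared) and 1.0 (identical)."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def likely_duplicates(names, threshold=THRESHOLD):
    """Return pairs of names whose similarity meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(names, 2)
        if similarity(a, b) >= threshold
    ]


if __name__ == "__main__":
    for a, b in likely_duplicates(NAMES):
        print(f"Possible duplicate: {a!r} / {b!r}")
```

Comparing every pair of names is fine for a handful of records but does not scale to millions of customers, which is one reason real tools add considerable engineering on top of the basic comparison.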
My firm tracks several dozen data-quality vendors and almost all of their products are aimed at this area, yet in the survey 81 per cent of organisations said that their data-quality issues went beyond name and address. Indeed, name and address ranked just third in the data domains regarded as high priority by respondents (product data and financial data came higher). This would seem a major opportunity for vendors, since very few specialise in other data domains like product data. One reason is that product data presents more complex technical challenges than customer data, since product data tends to have dozens or even hundreds of attributes, and is frequently stored in unstructured text files.
Yet whatever the data domain, it's clear that many large organisations have yet to grasp the nettle of data quality. This may be because it is one of those dirty little secrets that no-one wants to own up to. After all, if you had a choice of working on a sexy new analytics system to analyse customer behaviour, or a project ploughing through vast files of data looking for errors, which would you choose? Many data quality problems occur because the people inputting the data get no direct benefit from it: a telesales person will definitely care about making the sale since this is how their commission is paid, but why should they record pesky details like delivery address and demographic data for other people's benefit? If these details are incorrect, they don't see the direct consequences.
Even when data quality software is employed to detect problems, it is often done in a retrospective manner. Errors such as duplicate records are pointed out by the software, and names and addresses are standardised and corrected by the software and written out to a shiny new file, yet what actually happens to all those problems in the source systems? In many cases the multiple, overlapping systems are left intact, complete with errors, since this is what is technically known as an SEP - someone else's problem. Companies need to take a holistic view of their data, and their data quality, and start measuring what it is costing them.
Andy Hayler is founder of research company The Information Difference. Previously, he founded data management firm Kalido after commercialising an in-house project at Shell.