Deduplicating Data for Quality Information - IRM Connects, by IRM UK

By Dr. Barry Devlin

Register for Essentials of Data Warehouses, Lakes and BI in Digital Business – Live Streaming only, 15-16 March 2022

What is your biggest information problem? If there was one single behaviour within business or IT that you could change overnight, which one could bring the greatest data management/governance benefits? How could you immediately reduce ongoing information delivery and maintenance costs?

In my experience, business often has a somewhat dysfunctional approach to information quality. First is its use of the term data when what’s really meant is information. A focus on data—best described as information largely stripped of context—is simply too narrow to adequately address information quality challenges. And because data consists of raw numbers and bare text, creating copies is easy without considering how that data has been originally created or how it will be maintained. So, let’s focus on data here…

At its most pervasive, we see this dysfunction in the wonderful world of spreadsheets. Perfectly adequate data in the company’s business intelligence (BI) system is copied into a spreadsheet, manipulated and mangled, pivoted and prodded until new insights emerge. Of course, this is valid and often valuable and innovative business behaviour. The problem is what happens next. The spreadsheet data and calculations are saved for future use. The copy of the data has become hardened, in terms of structure and often content as well. Future changes in the BI system, especially in structure and meaning, can instantly invalidate this spreadsheet, downstream copies built upon it and the entire decision-making edifice constructed around them. And let’s not even mention the effect of an invisible calculation error in a spreadsheet…

Let’s move up a level. Marketing wants to do the latest gee-whiz analysis of every click-through pattern on the company’s website since 2010. Vendor X has the solution: a new cloud data warehouse app offering faster query speeds and financed via operational expenditure. It’s a no-brainer. Marketing is happy with its innovative campaigns, and even finance signs off on the clear return on investment delivered by the new approach. Except, of course, that this bright, shiny app requires all the existing clickstream data to be copied to the new database and maintained there on an ongoing basis. Who’s counting the cost of managing this additional, extensive effort?

And did I hear mention of the data lake? How much of the data here has been copied from elsewhere? And let’s not even ask how many (near-)copies or corrupted copies of the same data exist within the data lake… or the three data lakes that have emerged across the organisation.

It’s easy to blame businesspeople who, driven by passion for business results and unaware of data management implications, simply want to have the information they need in the most useful form possible… now. IT, of course, would never be guilty of such short-sighted behaviour. Really?

The truth is that IT departments behave in exactly the same way. New applications are built with their own independent databases (to reduce inter-project dependencies, shorten delivery times, etc.), irrespective of the existence of the information elsewhere in the IT environment.

Even the widely accepted data warehouse architecture explicitly sanctions data duplication between the enterprise data warehouse (EDW) and dependent data marts; and implicitly assumes that copying (and transforming) data from the operational to the informational environment is the only way to provide decision support. However, allowing duplication is not the same as demanding it. Technology advances since the last millennium may remove the need to make a copy in many cases.

In most businesses and IT departments, it doesn’t take much analysis to get a rough estimate of the costs of creating and maintaining these copies of data. The hardware and software costs, especially in the cloud, may be relatively small these days. But cloud or not, the staff costs of finding and analysing data duplicates, tracking down inconsistencies and firefighting when a discrepancy becomes evident to business management grow exponentially as more copies of more data are built. On the business side are similar ongoing costs of trying to govern copies of data, but by far the most telling are the costs of lost opportunities or mistaken decisions when the duplicated data has diverged from the truth of the properly managed and controlled centralised data warehouse.

So, if you’d like to reduce some of these costs, here are five behavioural changes you could implement to improve data management/governance and reduce data duplication in your organisation:

1. Instigate a “lean data” policy across the organisation and educate both business users and IT personnel in its benefits. Although some data duplication is unavoidable, this policy ensures that the starting point of every solution is the existing data resource.
2. Revisit your existing data marts with a view to either combining marts with similar content or absorbing marts back into the EDW. Improvements in database performance since the marts were originally defined may enable the same solutions without duplicate data.
3. Define and implement a new policy regarding ongoing use or reuse of spreadsheets. When the same spreadsheet has been used in a management meeting three times in succession, for example, it should be evaluated by IT for possible incorporation of its function into the standard BI system.
4. Evaluate new database technologies to see if the additional power they offer could allow a significant decrease in the level of data duplication in the data warehouse environment.
5. Apply formal governance and management techniques to your data lake, on-premises and/or in the cloud, to discover how the savings from cheap data storage are being more than consumed in subsequent analysis and correction of avoidable data consistency problems.
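
As one illustration of what finding copies can look like in practice, here is a minimal Python sketch that flags datasets with identical content even when rows and columns appear in a different order. It is not a method described in this article. The function names (`fingerprint`, `find_exact_copies`) and the dataset names are hypothetical, and the sketch assumes each dataset fits in memory as a list of dictionaries.

```python
import hashlib
from collections import defaultdict


def fingerprint(rows):
    """Return an order-insensitive SHA-256 fingerprint of a table.

    `rows` is a list of dicts (column name -> value). Values are compared
    as trimmed strings, which is an assumption of this sketch.
    """
    # Canonicalise each row: columns sorted by name, values as trimmed strings.
    canonical = sorted(
        "\x1f".join(f"{col}={str(row[col]).strip()}" for col in sorted(row))
        for row in rows
    )
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\x1e")  # record separator between rows
    return digest.hexdigest()


def find_exact_copies(datasets):
    """Group the names of datasets whose contents share a fingerprint."""
    groups = defaultdict(list)
    for name, rows in datasets.items():
        groups[fingerprint(rows)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical datasets: a warehouse table, a reordered copy, a diverged copy.
datasets = {
    "edw.sales": [{"id": 1, "amount": "10.00"}, {"id": 2, "amount": "25.50"}],
    "mart.sales_copy": [{"amount": "25.50", "id": 2}, {"id": 1, "amount": "10.00 "}],
    "lake.sales_v2": [{"id": 1, "amount": "10.00"}, {"id": 2, "amount": "99.99"}],
}

print(find_exact_copies(datasets))  # [['edw.sales', 'mart.sales_copy']]
```

A fingerprint of this kind only detects exact copies, allowing for row order, column order and surrounding whitespace. The diverged copy (`lake.sales_v2` here) is not grouped with the others, so near-copies and corrupted copies would still need record-level comparison.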

Deduplicating data is a first and necessary step toward quality information with clear benefits for the organisation. It’s surprising how reluctant many businesses are to take this first step while embarking on a digital transformation strategy that demands information of the highest quality.

Comments

04/02/2022

Vladimir Stavrov

Hi,
having worked for more than 10 years on SAP MDM/MDG implementation projects,
I have met another side of the data deduplication problem.
If we talk, let's say, about material master data, the biggest problem is to find trusted sources for items.
Partial/short descriptions of stock items in material master records in many cases do not allow the items to be identified clearly enough to remove duplicate records before the MDG system is ready for the go-live phase.
Some items are exactly identified by brand name and catalogue number, but this is the best case.
If an item record is connected to an ISO 22745 identifier and the appropriate properties, that also solves the problem in general.
But some items, produced according to ASME/ASTM/DIN/ISO/etc. standards, could be correctly identified using the properties defined by those standards. The problem is that standards are provided as PDF texts rather than in a form suitable for computer processing.
In my opinion, it would make sense to review this practice and make a more computer-friendly form of standards available for people who implement and use master data management systems.
Standards could be represented, let's say, as OWL ontologies, even if there were a paid subscription for such access instead of the need to buy and read a PDF.
