Junk Metadata And Data Catalogs - IRM Connects, by IRM UK

Here are a few fun facts about each of us:

• The human genome consists of about 3 billion nucleotides (the basic information units of DNA).
• The Human Genome Project found that within this there are about 20,000 functional genes – genes that encode the design for proteins.
• But 20,000 genes represent only about 2% of nucleotides in the entire genome (on these figures, that works out to roughly 3,000 protein-coding nucleotides per gene).

So about 98% of our DNA seems to have no functional role in our bodies. Scientists like to call it “Non-coding DNA”, but the more popular name is “Junk DNA”.

Malcolm Chisholm, President, Data Millennium

Having 98% useless DNA would seem like the mother of all data quality problems. But this DNA is not being used to create proteins, so it does not really matter that we all have so much of it. That said, scientists are not happy to simply dismiss Junk DNA as having no role, and there is considerable controversy and speculation about why we have so much of it and what it might be doing.

Junk Metadata parallels Junk DNA

What does all this have to do with metadata and data catalogs? Well, data catalogs are a collection of information, just as the human genome is, and they are filled with metadata. Now, there are many different kinds of metadata, but the type I want to focus on is technical metadata. Technical metadata is metadata that comes from something other than direct data entry by human beings. It includes database structures, data profiles, ETL metadata, inferred foreign keys, report structures, APIs, and so on. Increasingly, this technical metadata is being collected and integrated automatically at vast scale in data catalogs.

At first this might seem like a great and beneficial achievement. All that technical metadata in one place should be tremendously useful for use cases like data discovery, understanding the provenance of data, and finding the best source of data. And it is undeniable that the development of the capabilities to collect and integrate all this metadata has been a significant technical achievement.

However, there is an assumption here: we are assuming that all metadata is equally useful. This is similar to how all of the human genome was thought to be functional DNA before the Human Genome Project found that 98% was in fact Junk DNA. Luckily, we know our bodies work and are useful despite having so much Junk DNA, but what about data catalogs that are enormous reservoirs of technical metadata?

How Metadata can be Junk

This problem first came home to me when a client explained that he was afraid to allow business users access to a data catalog because they might do something like type “CUST” into the search bar and get back tens of thousands of results from a variety of technical components and services. He rightly feared that the users would be horrified and give up, unable even to comprehend the types of technical objects the metadata had been harvested from.
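To make the client's worry concrete, here is a minimal sketch of a naive catalog search. The entries, object types, and the `search` function are all hypothetical and are not taken from any real catalog product; the point is only that a plain substring match on “CUST” pulls back hits from many kinds of technical objects at once.

```python
from collections import Counter

# Hypothetical harvested entries: (object_type, technical_name).
CATALOG = [
    ("table", "CUST_MSTR"),
    ("column", "CUST_ID"),
    ("column", "CUST_NM"),
    ("etl_job", "LOAD_CUST_DIM"),
    ("api", "/v1/customers"),
    ("report", "Customer Churn Dashboard"),
    ("column", "ORDER_ID"),
]


def search(catalog, term):
    """Case-insensitive substring match on the technical name."""
    needle = term.lower()
    return [(kind, name) for kind, name in catalog if needle in name.lower()]


hits = search(CATALOG, "CUST")
# Count the hits per object type to show how mixed the result list is.
by_type = Counter(kind for kind, _ in hits)

for kind, count in sorted(by_type.items()):
    print(f"{kind}: {count}")
```

Even in this toy inventory of seven entries, six match, spread across five object types. Scaled up to an enterprise catalog, the same behaviour would produce the tens of thousands of results the client was afraid of.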

So here we have a paradox. The more technical metadata that data catalogs contain, the more accurately and completely they hold a picture of the enterprise's data assets – but at the same time the more unusable they are by business users, who are meant to be the principal beneficiaries of data catalogs. It seems that we have created Junk Metadata – metadata that cannot usefully be consumed by business users.

“Junk” is a Business Viewpoint

Is this a fair conclusion? Going back to our Junk DNA parallel, we should remember that many scientists think that there must be a role for it, and in the future it may be proven to have a use we currently do not understand. Perhaps the same is true of Junk Metadata, and in the future AI or ML may be used to derive business insights from it.

We can clarify this by defining Junk Metadata and its properties. Junk Metadata is:

Metadata that cannot be understood in business terms by business users

That is, an item of Junk Metadata either:

• Has no business understandable content; or
• Is not related to sufficient other metadata objects that do have enough business understandable content for the user to infer a business understanding of the item
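
The definition above can be sketched as a simple test. This is only one possible reading of it: an item is treated as junk when it has no business content of its own and also lacks enough related items that do. The `MetadataItem` class, its field names, and the `min_related` threshold are hypothetical, introduced purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MetadataItem:
    """A hypothetical catalog entry; field names are illustrative only."""
    name: str
    business_description: str = ""  # empty means no business content
    related: list[MetadataItem] = field(default_factory=list)


def has_business_content(item: MetadataItem) -> bool:
    """True if the item carries any non-blank business description."""
    return bool(item.business_description.strip())


def is_junk(item: MetadataItem, min_related: int = 1) -> bool:
    """Junk = no business content of its own AND fewer than `min_related`
    related items that have business content (an assumed threshold)."""
    if has_business_content(item):
        return False
    described = sum(has_business_content(r) for r in item.related)
    return described < min_related


glossary_term = MetadataItem("Customer", "A party that has bought from us")
cust_table = MetadataItem("CUST_MSTR", related=[glossary_term])
temp_column = MetadataItem("TMP_X17_FLG")

print(is_junk(glossary_term))  # False: has its own business description
print(is_junk(cust_table))     # False: linked to a described glossary term
print(is_junk(temp_column))    # True: no description and no described links
```

How many described neighbours count as "sufficient" is a judgment call that the definition leaves open; raising `min_related` makes the test stricter, so `is_junk(cust_table, min_related=2)` would classify the table as junk.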

A major point here is that it is the business user’s viewpoint that is being considered. What we are calling Junk Metadata may be very useful for IT users. However, data catalogs have promised us that they are going to be enterprise-wide, and that they are going to democratize data for all users in the enterprise. Otherwise, they would just be another IT technical tool like a DBA workbench.

Is Junk Metadata real? I think it is to some extent. All metadata in a data catalog must be understandable in business terms to even be considered by business users. Even then it may have no business use. But I certainly do not want to imply that all technical metadata is Junk Metadata – just that some of it is. And, like Junk DNA, we cannot dismiss Junk Metadata completely, as there may be a way to extract business value from it in the future.

Malcolm Chisholm has over 25 years' experience in data management, and has worked in a variety of sectors, including finance, insurance, manufacturing, government, defense and intelligence, pharmaceuticals, and retail. He is a consultant specialising in data governance, master/reference data management, metadata engineering, business rules management/execution, data architecture and design, and the organisation of Enterprise Information Management. Malcolm is a well-known presenter at conferences in the US and Europe, writes columns in trade journals, and has authored the books Managing Reference Data in Enterprise Databases; How to Build a Business Rules Engine; and Definitions in Information Management. In 2011, Malcolm was presented with the prestigious DAMA International Professional Achievement Award for contributions to Master Data Management. He holds an M.A. from the University of Oxford and a Ph.D. from the University of Bristol, and can be contacted at mchisholm@datamillennium.com. Malcolm can be followed on Twitter via @MDChisholm.
