Choosing An Enterprise Information Catalog - IRM Connects, by IRM UK

Over recent years, the demand to analyse new types of data has caused significant changes to the analytical landscape. Many companies today are way beyond just having a data warehouse.

Mike Ferguson, Managing Director, Intelligent Business Strategies
Mike will be presenting the course, ‘Unified Data Delivery: From Data Lake to Enterprise Data Marketplace’ via live streaming, 8-9 March 2021

The demand now is also to capture, process and analyse new structured, semi-structured and unstructured data from internal and external sources that is not held in a traditional data warehouse.

As a result, new types of analytical workloads are needed to derive insight from these new types of data, and it is this that has resulted in new data stores and analytical platforms being used in addition to the data warehouse. These include cloud storage, NoSQL column family databases and key value stores capable of rapidly ingesting data, NoSQL graph DBMSs for graph analysis, Hadoop, and streaming data analytics platforms. All of these are now in play. Companies are now creating analytical ecosystems consisting of multiple data stores, including the traditional data warehouse.

The problem with multiple analytical data stores on-premises and in the cloud, however, is that complexity has increased. Different types of data are also being ingested into all of these data stores. As a result, many companies are facing the reality that they do not have a centralised data lake of all data in one data store, but instead have a distributed data lake with multiple data stores that may include multiple Hadoop systems, relational DBMSs, NoSQL data stores and cloud storage.

In this kind of setup, it is hard to know what data is located where. Data relationships across multiple data stores are often unknown. We also often don’t know what kind of data preparation is going on where, or what analytical models exist to analyse prepared data. The emergence of self-service data preparation tools has made this even more challenging, with both IT and business users now preparing and integrating data.

The problem is that there is no common place to tell you what data is available, what data preparation jobs exist and what analytical models exist that you can potentially reuse rather than reinvent. Business users have nowhere to go to find out whether trusted, prepared and integrated data already exists that could satisfy their needs and save them time.

The answer to managing all these issues is to establish an enterprise information catalog. Information catalog technology enables you to see what data and artefacts exist across multiple data stores both on-premises and in the cloud. It is now central to data governance as well as analytics.

If you are in this situation and looking to buy an information catalog, you probably need to know what to look for. Below is a list of capabilities that may help you evaluate information catalog products against your organisation’s needs.

Data ingestion

It should be possible to:
• Allow authorised users to nominate / bookmark and register data sources containing data of interest in the catalog so that data can be brought into a data store associated with a centralised or distributed data lake

Data Discovery

It should be possible to:
• Automatically discover ingested data to understand what data exists in all data stores that contain raw ingested data and trusted data already cleaned and integrated in data warehouses, data marts and master data management systems. This would include RDBMSs, Hadoop, cloud storage and NoSQL databases. During automatic discovery it should be possible to:

o Use built-in machine learning to automatically tag/label (name) and annotate individual data fields to indicate the meaning of the data
o Use built-in machine learning to automatically recognise data that matches out-of-the-box or user-defined patterns to instantly determine what data means
o Automatically discover the same, similar and related data across multiple data stores, regardless of whether the names given to that data are different
o Automatically profile data to understand the quality of every item
o Automatically derive data lineage to understand where data came from
o Automatically discover personally identifiable information (PII)
o Automatically detect change (a critical requirement)

• Allow users to manually tag data to introduce it into the catalog
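
To make two of the discovery capabilities above more concrete, here is a minimal sketch of automatic profiling and of spotting the same data held under different names in different data stores. It is an illustration only, not how any particular catalog product works: the function names, the sample columns and the use of simple value overlap (Jaccard similarity) are all assumptions made for this example.

```python
def non_empty(values):
    """Drop nulls and empty strings before measuring anything."""
    return [v for v in values if v not in (None, "")]


def profile_column(values):
    """Return simple quality metrics for one column of raw values."""
    total = len(values)
    present = non_empty(values)
    return {
        "row_count": total,
        "completeness": len(present) / total if total else 0.0,
        "distinct_count": len(set(present)),
    }


def value_overlap(left, right):
    """Jaccard similarity of the distinct values found in two columns."""
    a, b = set(non_empty(left)), set(non_empty(right))
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical sample columns from two different data stores.
# The column names differ, but the values suggest they hold the same data.
crm_cust_id = ["C001", "C002", "C003", None]
orders_customer_ref = ["C001", "C002", "C003", "C004"]

profile = profile_column(crm_cust_id)
similarity = value_overlap(crm_cust_id, orders_customer_ref)

print(profile)
print(f"value overlap: {similarity:.2f}")
```

A high overlap score between differently named columns is the kind of evidence a catalog could use to suggest that they are related, leaving a data steward to confirm or reject the suggestion.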

Collaboration

It should be possible to:
• Create roles within the catalog that users can be assigned to e.g. data owners, data experts, data curators/producers, data stewards, approvers, catalog editors, consumers
• Allow virtual communities to be created and maintained to allow people to:

o Curate, collaborate over, and manually override tags automatically generated by the software during automatic discovery
o Collaborate over other artefacts published in the catalog e.g. ETL jobs, self-service data preparation jobs, analytical models, dashboards, BI reports, etc.

• Allow nominated users to collaborate over, vote on and approve data names either created by automatic data discovery or created manually by authorised users

Data Governance

It should be possible to:

• Define a set of common business terms in a catalog business glossary and/or support the import of business glossary terms and ontology vocabularies into a catalog business glossary that can be used to tag data published in a catalog to understand what the data means
• Automatically semantically tag data at field, data set, folder, database and collection level.
• Support multiple pre-defined ‘out-of-the-box’ data governance classification (tagging) schemes that indicate levels of data confidentiality, data retention, and data trustworthiness (quality). The purpose of these schemes is to be able to label data with a specific level of confidentiality and with a specific level of retention in order to know how to govern it in terms of data protection and data retention.
• Add user-defined data governance classification schemes to allow data to be tagged/labelled in accordance with these schemes in order to know how to organise and govern it.
• Add 3rd party classifiers to the catalog to extend it to enable support for new data types and sources
• Automate data classification by making use of pre-defined patterns or user-defined patterns (e.g. regular expressions or reference lists) to automatically identify and classify specific kinds of data in a data lake e.g. to recognise a social security number, an email address, a company name, a credit card number etc.
• Automate data classification using artificial intelligence to observe, learn and predict the meaning of data in a data lake
• Allow manual tagging of data and other artefacts in the catalog to specify data meaning and to allow the data in a data lake to be correctly governed. It is critical to allow users to tag data with the terms they are used to, so they and their peers can find it again using those terms
• Automatically detect and classify or manually classify sensitive data (e.g. personally identifiable information – PII) in the catalog to enable governance of that data via tag-based data privacy and security policies
• Allow multiple governance and use tags to be placed on data including:

o A level of confidentiality tag e.g. to classify it as PII
o A level of quality tag
o A level of data retention tag
o A business use tag e.g. Customer engagement, risk management etc.
o Tagging a file to indicate its retention level or which processing zone it resides in within a data lake e.g. ingestion zone, approved raw data zone, data refinery zone, trusted data zone

• Automatically propagate tags by using machine learning to recognise similar data across multiple data stores
• Define, manage and attach policies and rules to specific tags (e.g. a Personally Identifiable Information tag) to know how to consistently govern all data in the catalog that has been labelled with that same tag
• Nominate data items in the catalog to be used in the creation of a business glossary to establish a common vocabulary for the enterprise
• Derive and generate schemas from discovered data to enable that data to be quickly and easily accessed by authorised users via 3rd party tools. An example here would be the ability to generate Hive tables on data discovered in Hadoop
• Generate virtual tables in data virtualization servers on data automatically discovered across multiple data stores in a distributed data lake
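
As a rough illustration of the pattern-based classification and tag-based policy capabilities listed above, the sketch below classifies a field from sample values using regular expressions and then looks up the policies attached to the resulting tag. The patterns, tag names, policy wording and the 80% match threshold are all assumptions chosen for this example; commercial catalogs ship far more robust classifiers.

```python
import re

# Illustrative patterns only (assumed for this sketch).
PATTERNS = {
    "email_address": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_social_security_number": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "credit_card_number": re.compile(r"\d{13,19}"),
}

# Hypothetical governance tag applied for each classification.
GOVERNANCE_TAGS = {
    "email_address": ("confidentiality", "PII"),
    "us_social_security_number": ("confidentiality", "PII"),
    "credit_card_number": ("confidentiality", "PII"),
}

# Hypothetical policies attached to a tag rather than to individual data sets.
POLICIES = {
    ("confidentiality", "PII"): ["mask on read", "restrict access to data stewards"],
}


def luhn_valid(number):
    """Luhn checksum, used to reject digit strings that only look like card numbers."""
    total = 0
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def matches(label, value):
    """True if the whole value fits the pattern (plus the checksum for card numbers)."""
    if not PATTERNS[label].fullmatch(value):
        return False
    return label != "credit_card_number" or luhn_valid(value)


def classify_field(sample_values, threshold=0.8):
    """Return the first label matched by at least `threshold` of the non-empty values."""
    values = [v for v in sample_values if v]
    if not values:
        return None
    for label in PATTERNS:
        hits = sum(1 for v in values if matches(label, v))
        if hits / len(values) >= threshold:
            return label
    return None


def policies_for(label):
    """Look up the policies attached to the governance tag for a classification."""
    return POLICIES.get(GOVERNANCE_TAGS.get(label), [])


label = classify_field(["alice@example.com", "bob@example.org", ""])
print(label, policies_for(label))
```

Because the policies hang off the tag, any newly discovered field that is classified as PII picks up the same rules without anyone having to define them again for that field.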

Governance of Artefacts

It should be possible to:

• Import the metadata from 3rd party tools to automatically discover, classify and publish the following in the catalog:

o IT developed ETL jobs
o Self-service data preparation jobs (also known as ‘recipes’)
o BI tool artefacts (queries, reports, dashboards)
o Analytical models
o Data science notebooks

to understand what is available across the distributed data lake to prepare, query, report and analyse data held within it

• Manually classify (tag) and publish IT developed ETL jobs, self-service data preparation ‘recipes’, BI queries, reports, dashboards, analytical models and data science notebooks to the catalog

Findability

It should be possible to:
• Create a ‘data marketplace’ within the catalog to offer up data and insights as a service to subscribing consumers
• Support faceted search to zoom in on and find ‘ready to use’ data and other artefacts published in the catalog that a user is authorised to see
• Allow faceted search including searching via specific tags (e.g. PII data) and properties, lineage (e.g., only data from a specific data source or derived from DW data), data set location and format
• Allow users to easily find data in the catalog, select it and then launch third party tools such as self-service data preparation tools to work on that data or self-service BI tools to visualise it (if authorised to do so)
• Allow users to easily find data and analytical assets in the catalog:

o IT developed ETL jobs and business developed self-service data preparation ‘recipes’ in the catalog and to launch or schedule them to provision data
o BI queries, reports and dashboards in the catalog and to launch or schedule them to provision insights
o Interactive data science notebooks and analytical models in the catalog and launch or schedule them to refine and analyse data to provision insights

• Allow users to search metadata to find data sets that they are not authorised to use and request access to those data sets
• Understand relationships across data and artefacts in the catalog to make recommendations on related data
• Allow consumers to collaborate over and rate data and artefacts within the catalog in terms of quality, sensitivity and business value
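
The faceted search requirements above can be sketched as a filter over catalog entries that also separates what a user may use now from what they would have to request access to. The `CatalogEntry` fields, role names and sample entries are hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """A deliberately small, hypothetical catalog record."""
    name: str
    asset_type: str      # e.g. "data set", "ETL job", "report"
    location: str        # e.g. a data lake zone or a server
    data_format: str
    tags: frozenset = field(default_factory=frozenset)
    allowed_roles: frozenset = field(default_factory=frozenset)


def faceted_search(entries, user_roles, *, tags=(), **facets):
    """Return (usable, requestable) entries that match every tag and facet."""
    usable, requestable = [], []
    for entry in entries:
        if not set(tags) <= entry.tags:
            continue  # a required tag is missing
        if any(getattr(entry, key) != value for key, value in facets.items()):
            continue  # a property facet does not match
        if entry.allowed_roles & set(user_roles):
            usable.append(entry)
        else:
            requestable.append(entry)  # visible in metadata, access must be requested
    return usable, requestable


entries = [
    CatalogEntry("customer_master", "data set", "trusted data zone", "parquet",
                 frozenset({"PII", "customer engagement"}), frozenset({"data steward"})),
    CatalogEntry("web_clicks_raw", "data set", "ingestion zone", "json",
                 frozenset({"customer engagement"}), frozenset({"data steward", "consumer"})),
    CatalogEntry("churn_dashboard", "report", "BI server", "dashboard",
                 frozenset({"customer engagement"}), frozenset({"consumer"})),
]

usable, requestable = faceted_search(
    entries, {"consumer"}, tags={"customer engagement"}, asset_type="data set"
)
print([e.name for e in usable])
print([e.name for e in requestable])
```

Returning the second list rather than hiding those entries reflects the requirement that users can find data sets they are not yet authorised to use and then ask for access.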

Trust

It should be possible to:
• Allow users to easily see lineage from end-to-end in technical and business terms and navigate relationships to explore related data
• Provide data profiles to help users decide whether data is of sufficient quality to meet their needs
• Easily distinguish between disparate technical data names and common business vocabulary data names in the catalog
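
End-to-end lineage, as described above, amounts to walking a graph of "derived from" relationships. The following sketch assumes lineage has already been captured as a simple mapping; the asset names and the `DERIVED_FROM` structure are invented for illustration.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the sources it was derived from.
DERIVED_FROM = {
    "churn_report": ["customer_360"],
    "customer_360": ["crm.customers", "dw.orders"],
    "dw.orders": ["erp.order_lines"],
}


def upstream_lineage(asset, derived_from):
    """Return every upstream source of `asset`, nearest first (breadth-first walk)."""
    seen = {asset}  # guards against cycles in the lineage graph
    ordered = []
    queue = deque(derived_from.get(asset, []))
    while queue:
        source = queue.popleft()
        if source in seen:
            continue
        seen.add(source)
        ordered.append(source)
        queue.extend(derived_from.get(source, []))
    return ordered


lineage = upstream_lineage("churn_report", DERIVED_FROM)
print(" <- ".join(["churn_report"] + lineage))
```

A catalog would typically show the same path twice, once with technical names like these and once with the business glossary terms mapped onto them.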

Scalability

It should be possible to:
• Scale automatic discovery processing and profiling (e.g. using scalable technologies like Apache Spark) to deal with the volume of data in multiple underlying data stores

Integration

It should be possible to:
• Integrate the catalog with other tools and applications via REST APIs
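
As a hedged illustration of REST-based integration, the sketch below builds (but does not send) a request that another tool might use to register a data set in a catalog. The URL, payload fields and bearer token scheme are entirely hypothetical; each catalog product defines its own API, so consult the vendor's documentation for the real endpoints.

```python
import json
import urllib.request

# Hypothetical endpoint; no real product's API is implied.
CATALOG_URL = "https://catalog.example.com/api/v1/datasets"


def build_registration_request(name, location, tags, token):
    """Build a POST request that would register a data set and its tags."""
    payload = {"name": name, "location": location, "tags": sorted(tags)}
    return urllib.request.Request(
        CATALOG_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_registration_request(
    name="web_clicks_raw",
    location="ingestion zone",
    tags={"customer engagement", "PII"},
    token="example-token",
)

# Sending is left out because the endpoint above is fictitious:
# urllib.request.urlopen(request)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```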

Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an analyst and consultant he specialises in business intelligence / analytics, data management, big data and enterprise architecture. With over 35 years of IT experience, Mike has consulted for dozens of companies on business intelligence strategy, technology selection, enterprise architecture and data management. He has spoken at events all over the world and written numerous articles. Formerly he was a principal and co-founder of Codd and Date Europe Limited – the inventors of the Relational Model, a Chief Architect at Teradata on the Teradata DBMS and European Managing Director of Database Associates. He teaches popular master classes in Big Data, Predictive and Advanced Analytics, Fast Data and Real-time Analytics, Enterprise Data Governance, Master Data Management, Data Virtualisation, Building an Enterprise Data Lake and Enterprise Architecture. Follow Mike on Twitter @mikeferguson1.