Data Catalog: Content Tips to Get You Started
Data catalogs are a powerful tool that can help companies effectively manage their data assets. Our Data Architect Sami sheds light on the tool and helps you discover how it can be the missing piece in your data management toolbox.
Written by — Sami Helin, Data Architect
This is the second part of the blog series on data catalogs. This time, we describe possible content areas in a data catalog and give some practical tips on how to get started and build further. The first part focused on understanding the purpose and main use cases of a data catalog; you can find it here.
As background to these content tips, I’ve been working with metadata-driven development and metadata management for over 10 years and formally with an enterprise data catalog for 3 years. When I started with the catalog, I saw many opportunities, but the focus could be narrowed down to a few core things. I test-drove them with one SaaS catalog, and even though its functionality was very basic, it supported that core content quite well.
In this post, I also describe some ideas on how to extend the basic content, so try to picture some evolution paths for your own content as well. Start with the most important things and keep the scope narrow while you develop your way of working, rather than trying to depict everything mentioned here.
Of course, your organization's topics and needs might differ from this content, but I hope some ideas are still useful. As a rule of thumb, it is important to understand the stakeholders of your catalog and involve them in defining the use cases. Anyhow, I encourage you to start clarifying what data understanding is valuable to your organization and how to develop it.
Content maintenance way of working
Current data catalogs read data descriptions from many of the tools that are part of modern data solutions. Such descriptions cover, for example, the databases and tables where data resides, data flows between systems, and dashboard contents for end users. These descriptions can be kept up to date automatically. This also makes it more meaningful to extend these system and technical descriptions with business terminology and other business-relevant information.
The effort needed to maintain documentation, keep it in sync with implementation, and manage the overall picture of your data environment has decreased a lot. With data catalogs, we have end-to-end visibility and automatic physical data descriptions. This means business definitions can be linked directly to tangible artifacts, instead of being maintained in separate documents that redundantly describe the physical data.
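To make the idea of automatically read descriptions more concrete, here is a minimal sketch of what a connector does in principle: it asks the system for its own metadata instead of relying on hand-written documentation. The example uses an in-memory SQLite database from the Python standard library as a stand-in for a real source system. The function name `harvest_schema` and the example tables are my own assumptions for illustration and do not come from any particular catalog product.

```python
import sqlite3


def harvest_schema(conn: sqlite3.Connection) -> dict:
    """Read table and column descriptions from the database's own metadata."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns rows of (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog[table] = [{"column": c[1], "type": c[2]} for c in columns]
    return catalog


# Hypothetical source system with two tables
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE sales_order (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

snapshot = harvest_schema(conn)
for table, columns in snapshot.items():
    print(table, columns)
```

Running the same function on a schedule would pick up new tables and columns without manual work, which is the property that makes it worthwhile to attach business descriptions to these objects.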
Data catalog content examples
What is useful to describe in the data catalog? This depends on the business, data catalog users, and use cases, but below are some typical topics to get started.
Managing system data descriptions and flow of data
One typical data catalog topic is documenting the content of your databases and other data repositories and building lineage from the origin of the data to where it is used. With this functionality, you have basic documentation about your data environment, provided that you choose a catalog that fits your technology and system environment.
With connectors for reporting and data transformation tools, you can also document dashboards and reports, trace them back to the origin of the data, and see the logic in between.
Once this basic content is available, you can enrich it with further useful information, such as understandable descriptions that serve as basic data documentation.
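The lineage idea described above can be pictured as a graph that is walked from a report back to its sources. The following is a minimal sketch under the assumption that lineage has already been collected as "object is fed by these upstream objects" pairs. All object names and the function `trace_to_origin` are hypothetical and only illustrate the principle; real catalogs store and traverse lineage in their own ways.

```python
# Hypothetical lineage edges: each key is fed by the listed upstream objects.
UPSTREAM = {
    "dashboard.sales_overview": ["mart.sales_by_region"],
    "mart.sales_by_region": ["dwh.sales_order", "dwh.region"],
    "dwh.sales_order": ["crm.orders"],
    "dwh.region": ["erp.regions"],
}


def trace_to_origin(node, upstream=UPSTREAM):
    """Return every upstream object of `node` and the subset with no recorded upstream."""
    seen = []
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    # Objects that nothing feeds into are treated as the origins of the data.
    origins = sorted(n for n in seen if n not in upstream)
    return seen, origins


all_upstream, origins = trace_to_origin("dashboard.sales_overview")
print("Upstream objects:", all_upstream)
print("Origins:", origins)
```

The same traversal in the opposite direction would answer the impact question: which reports are affected if a source table changes.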
Data assets, business glossary, and data responsibilities
Your data catalog should not be merely a representation of your databases or reports. Most importantly, it should describe the relation between your business and your data. What data does our business run upon, and what are the related definitions and responsibilities? In other words, we need to be able to describe the business and how it relates to the physical data objects. Making the business context explicit in the data catalog helps in relating system data to business operations. Using business terminology helps users find the correct content. Responsibilities for data should also follow business responsibilities. If the catalog contains only database tables, terminology and responsibilities easily end up with technical persons.
Start by identifying your information assets, that is, the data your business needs in its operations, and describe the related data ownership. These can then be developed further into more detailed data governance and data quality practices. You also need to align the terminology of the key data concepts within your assets.
Things to describe about a data asset include what it is about, key data rules, data lifecycle, data protection and compliance considerations, and data quality management and development. Of course, not all data needs to be described immediately and in that depth. Let the content evolve from prioritized focus areas and the most important items.
By data assets, we refer to the data the business runs upon. The term does not refer to any systems or databases, but to the data the business owns and manages in order to operate. Data assets need governance practices to ensure that quality data is collected and managed for the data products and processes that use it.
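As a rough illustration of how a data asset ties business context to physical objects, the sketch below models an asset with a business owner, glossary terms, and links to tables. The class names, fields, and example values are assumptions made for this example only; an actual catalog has its own metamodel, which is usually richer.

```python
from dataclasses import dataclass, field


@dataclass
class GlossaryTerm:
    name: str
    definition: str


@dataclass
class DataAsset:
    name: str
    business_owner: str  # a business role, not a database administrator
    terms: list = field(default_factory=list)
    physical_objects: list = field(default_factory=list)


def find_assets(assets, keyword):
    """Find assets by business vocabulary: asset name or glossary term name."""
    keyword = keyword.lower()
    return [
        asset.name
        for asset in assets
        if keyword in asset.name.lower()
        or any(keyword in term.name.lower() for term in asset.terms)
    ]


# Hypothetical content
assets = [
    DataAsset(
        name="Customer",
        business_owner="Head of Sales",
        terms=[GlossaryTerm("Customer churn", "Customer with no orders in 12 months")],
        physical_objects=["dwh.customer", "crm.accounts"],
    ),
    DataAsset(
        name="Product",
        business_owner="Product Manager",
        terms=[GlossaryTerm("Active product", "Product currently available for sale")],
        physical_objects=["dwh.product"],
    ),
]

print(find_assets(assets, "churn"))
```

The point of the design is that a user searches with a business term and still arrives at the physical tables, while ownership stays with a business role.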
Metrics and data products
Another important aspect of business data descriptions is what we use the data for. To end users, we can describe the metrics and data products they should access. Going further, we can link these to the reports and use cases in business.
Data catalogs typically have interfaces to many BI tools, but data products can also support other types of applications, such as AI models, APIs, or web solutions. In a simple model, application details can be described as part of the data product. Typically, however, applications and related use cases are described in their own content and data catalog structures.
Data contracts and data rules
Data contracts can define the rules that apply to the available data. For all incoming data, it is important to understand the data collection and ingestion principles. For downstream data pipelines, there should be contracts related to the different data products. In a data warehouse-based environment, producers are responsible for mastering quality data, and consumers are responsible for the end user applications in reporting, process mining, data science, or application integration.
In granular data catalog implementations, specific business and governance rules can be set at any level of detail, from specific attributes to entities and domains. In a simpler model, a contract might just be a more aggregate document for the meaningful interfaces within your data environment.
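To show what attribute-level rules in a contract could look like in practice, here is a minimal sketch that expresses a contract as a plain dictionary and checks records against it. The contract structure, the rule keys (`type`, `required`, `min`), and all names are hypothetical choices for this example, not a standard contract format.

```python
# Hypothetical contract for one data product, with rules per attribute.
CONTRACT = {
    "data_product": "sales_by_region",
    "producer": "sales-data-team",
    "rules": {
        "region_code": {"type": str, "required": True},
        "net_sales_eur": {"type": float, "required": True, "min": 0.0},
    },
}


def violations(record, contract=CONTRACT):
    """Return a list of human-readable rule violations for one record."""
    problems = []
    for attr, rule in contract["rules"].items():
        value = record.get(attr)
        if value is None:
            if rule.get("required"):
                problems.append(f"{attr}: missing")
            continue
        if not isinstance(value, rule["type"]):
            problems.append(f"{attr}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            problems.append(f"{attr}: below {rule['min']}")
    return problems


print(violations({"region_code": "FI", "net_sales_eur": 120.5}))
print(violations({"net_sales_eur": -1.0}))
```

In the simpler, document-style model mentioned above, the same contract could hold only the producer, the consumers, and a description, without executable rules.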
Closing thoughts and possible next steps
In addition to the topics above, different data catalogs offer a variety of other useful functionalities, and these may be important depending on your needs and use cases. Furthermore, as the usage of catalogs matures, new success stories emerge, functionality develops, and new things become part of the typical core set of use cases.
In general, one important direction in development is using the data catalog to enable all data work. This would mean better collaboration between different data stakeholders: enabling data discussions and questions, and evolving and correcting the content. A data catalog could also serve as a starting point for various data processes, such as report development, data quality development, and data issue indications, and enrich the understanding in the separate tools where those processes take place. It can also help business and data experts with other data-related processes.