S01E01 - The Data Catalog is dead, long live the Data Catalog!

DatActionable


It's time to move the data catalog from IT to the business, so that the total cost of ownership of such a catalog stays consistent with its value for the business.

Frederic BERNARD-PAYEN

Mar 26, 2021

Since the 90s, IT teams have aimed to know where data sits in the information system. For 30 years, they have struggled to demonstrate the value of such a catalog to the business and to secure a recurring budget for it. Like Data Governance itself, this should be a business topic, not an IT topic. Creating value with data comes with responsibilities, starting with knowing that data.

What’s wrong with the data catalog?

The first thing people compare between data catalog solutions is the type or number of connectors available. That is representative of the source of the problem: current data catalogs try to map the physical tables of the information system at attribute level.

To answer the business question “where is this data”, data catalogs map all data with precision. It is as if, in order to know in which districts the inhabitants of a city live, one had to keep precise addresses with street number, street, postal code and city. Don't put words in my mouth. I'm not saying that this level of granularity is never necessary, only that it isn't always necessary. And when it is necessary, it is because there is a use case that justifies it.

To continue with this people comparison, the business question is not always - almost never in fact - where “the” people are. It is more often where “a population” of people is. The same goes for data requests: when we want to create value with data, we don’t necessarily need all data of one type, only a subset. If we want the business to ask for data in business wording rather than with database table names, we must have a data catalog ready for this.

Let’s stop this people comparison, because data is not people. One of its superpowers is ubiquity (I’ve heard about a guy with this superpower, but that is another story). To be precise, even data doesn’t quite have this superpower: when it is duplicated, its quality - especially its timeliness - changes. By the way, this shakes the concept of looking for the source of truth: the business is in fact looking for a source that is good for the intended purpose. The definition of good data is closer to data at the right quality … and at the right cost, rather than a quest for the truth.

I’ll finish with one last point, even if it is not really the last one: four is enough to challenge the current principles of data catalogs. How do you know that your data catalog is complete? Since it was initially designed for databases, taking the shortcut that a dataset equals a table, we miss other kinds of datasets: unstructured documents, videos, images, not to mention data hidden in free-text fields. And even for structured data, it’s easy to miss some populations of data because they sit in a different information system. This typically occurs during mergers and acquisitions, which leave legacy IT alive.

In short:

The granularity of identification of location of data is not adapted for business.
The granularity of the business data expression is not adapted for business.
The link between business view and IT view is not adapted for business.
The completeness of the catalog is unknown, so it is not adapted for business.

How to enable a business-ready data catalog?

Start with a new way to describe data for the business: the “Business Object”

It all starts with the business! We need a new way to express data for the business without going into information system implementation details. The ambition is to meet the following requirements:

It should be understandable first by the business: both by the business accountable for the data and by the other businesses.
It should enable the description of populations of data, from a business perspective.
It should be manageable over time: the business expects a return on investment for the time spent describing the data.
It should be efficient: even if we don’t have a precise mapping, we need to know where to look.

Now forget what vendors put behind “Business Object”: as you may have understood, it is not an IT representation of something here. Let’s introduce a notion for the business, as the name implies.

A Business Object starts with a Name and a Description. Nothing revolutionary.

Business Object Characteristics.

The first new notion to add is Characteristics. The word itself is important: characteristic, not attribute. It is needed to curb the natural tendency to go directly to the fields of a database.

As an example, the “Price” of a product can be a characteristic, while its representation in one information system will be 2 attributes (e.g. price and currency) … 4 attributes in another (e.g. price excl. tax, tax rate, currency), or a dozen in a third one that manages discounts by quantity, for example. If we skip this abstraction step, we drown in the details and in discussions between experts.
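To make the abstraction concrete, here is a minimal Python sketch of one Characteristic and its differing physical representations. The class name, the system names (`system_a`, `system_b`) and the column names are hypothetical illustrations built from the examples in the text, not a reference model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Characteristic:
    """A business-level characteristic, independent of any database field."""
    name: str
    # Hypothetical mapping: system name -> physical attributes carrying
    # this characteristic in that system.
    representations: Dict[str, List[str]] = field(default_factory=dict)


# Illustrative values only, taken from the examples listed in the text.
price = Characteristic(
    name="Price",
    representations={
        "system_a": ["price", "currency"],
        "system_b": ["price_excl_tax", "tax_rate", "currency"],
    },
)


def systems_holding(characteristic: Characteristic) -> List[str]:
    """Return the systems in which this characteristic is represented."""
    return sorted(characteristic.representations)
```

The business only discusses `price.name`; the per-system attribute lists stay an IT concern and can be filled in later, when a use case justifies it.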

Unfortunately, it can’t be that simple, especially when data is handled by several domains. I also propose to introduce a notion of Set of Characteristics, which simply groups and names a group of Characteristics. I won’t deep dive on this topic today, but you can imagine a business object managed by two successive processes, the first one in charge of scheduling it and the other of executing it.

Business Object applicable business contexts.

The second key notion is to manage the population of data. And what makes sense for the business is the Applicable Business Contexts.

Designing these business contexts requires stepping back and looking at your company: what are its products, how are they classified, are there classifications in the market, how is the company deployed worldwide, etc. These axes are really specific to your company; you may have commonalities between business objects, but also specificities. As an example, product range can be one axis, region another, and factory sites a third … and factory sites may only make sense for the manufacturing domain, not for the engineering one.

Finally, keep in mind that we are looking for the applicable business contexts for your company. They set the borders of the data population you have to consider within your company. Coming back to the factory in my previous example, we need the list of factories applicable to the company: neither the factories of competitors nor a factory that may exist one day. We look for the currently applicable contexts.
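As a rough sketch of how applicable business contexts could be recorded, the Python snippet below keeps one closed list of values per axis, plus the axes that make sense for each domain. All axis values (`range_a`, `site_1`, and so on) and the structure itself are assumptions made for illustration.

```python
from typing import Dict, Set

# Hypothetical axes and values: only contexts that currently apply
# to the company are listed.
CONTEXT_AXES: Dict[str, Set[str]] = {
    "product_range": {"range_a", "range_b"},
    "region": {"europe", "asia"},
    "factory_site": {"site_1", "site_2"},
}

# Assumed relevance of each axis per business domain: factory sites
# only make sense for manufacturing, not for engineering.
AXES_BY_DOMAIN: Dict[str, Set[str]] = {
    "manufacturing": {"product_range", "region", "factory_site"},
    "engineering": {"product_range", "region"},
}


def is_applicable(domain: str, axis: str, value: str) -> bool:
    """True if the value is an applicable context on an axis used by the domain."""
    axis_used = axis in AXES_BY_DOMAIN.get(domain, set())
    value_known = value in CONTEXT_AXES.get(axis, set())
    return axis_used and value_known
```

A competitor's factory, or a factory that may exist one day, is simply absent from `CONTEXT_AXES`, so `is_applicable` rejects it.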

Business object business states.

As a complementary concept used to identify data populations, the Business States are really close to the nature of the business object. Of course, we could have grouped them with the applicable business contexts, but the business states are really specific to each business object and really speak to the business. Keeping them separate ensures that the business asks itself the question.

It’s key to understand that we are talking about business states, not technical states. It is neither a “CRUD” (Create Read Update Delete) concept nor the lifecycle of datasets. As an example, the business states of a “Task” could be: identified, designed, planned, executed, verified, closed.
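The “Task” example can be written down as a simple enumeration. This is only a sketch in Python: the class and function names are hypothetical, and the states are the ones listed above.

```python
from enum import Enum


class TaskState(Enum):
    """Business states of a Task (not CRUD operations, not a dataset lifecycle)."""
    IDENTIFIED = "identified"
    DESIGNED = "designed"
    PLANNED = "planned"
    EXECUTED = "executed"
    VERIFIED = "verified"
    CLOSED = "closed"


def is_business_state(value: str) -> bool:
    """Check whether a label is one of the business states of a Task."""
    return value in {state.value for state in TaskState}
```

A technical label such as "deleted" is rejected, which keeps the vocabulary on the business side.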

What do we have with the business object definition?

Nothing but the beginning of knowledge! We have laid the foundations for business to describe data in a top (business) down (IT) approach.

But we are still far from the data itself. To close the gap between the business and IT views, we now need an intermediary concept, which we will call the Business Object View and describe just below.

Continue by describing populations of data with Business Object Views.

We have defined a taxonomy to describe both the Business Object itself and its possible populations of data.

The wording I have chosen is Business Object View. Again, it has to be considered as new vocabulary for the business. It’s important not to confuse this concept with the view on a table that your database admin has in mind.

Defining Business Object views.

The idea is to be able to express a population of data. Thanks to the concepts we have seen before we can express them, starting with the whole possible population in the company which can be expressed by “A dataset representing all the Characteristics in all Business Contexts and Business states of a considered business object”.

It’s then quite easy to define sub-populations of data by playing with the 3 components of the business object: the characteristics, the business contexts and the states.

For example, I am now able to consider not all the products of the company but a population: the name and the composition (characteristics) of the products (business object) for the European market (business context) which are currently sold (state).
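One possible way to picture this is the Python sketch below, where a Business Object View is a subset of the characteristics, contexts and states of its Business Object. The class names, the `is_valid` check and every value not mentioned in the text (for instance "price", "asia", "designed", "retired") are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class BusinessObject:
    name: str
    description: str
    characteristics: Set[str]
    contexts: Dict[str, Set[str]]  # axis -> applicable values
    states: Set[str]


@dataclass
class BusinessObjectView:
    """A population of data: a subset of each component of a Business Object."""
    business_object: BusinessObject
    characteristics: Set[str]
    contexts: Dict[str, Set[str]]
    states: Set[str]

    def is_valid(self) -> bool:
        """A view may only use what its Business Object declares."""
        bo = self.business_object
        contexts_ok = all(
            values <= bo.contexts.get(axis, set())
            for axis, values in self.contexts.items()
        )
        return (
            self.characteristics <= bo.characteristics
            and self.states <= bo.states
            and contexts_ok
        )


# Hypothetical Business Object; only part of it comes from the text.
product = BusinessObject(
    name="Product",
    description="Something the company sells",
    characteristics={"name", "composition", "price"},
    contexts={"market": {"europe", "asia"}},
    states={"designed", "sold", "retired"},
)

# The view from the example: name and composition, European market, sold.
european_sold = BusinessObjectView(
    business_object=product,
    characteristics={"name", "composition"},
    contexts={"market": {"europe"}},
    states={"sold"},
)
```

The whole possible population is then just the view that reuses every characteristic, context and state of the Business Object.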

The business object view has multiple benefits and can be enhanced, but that requires a dedicated article. We will go further later. For now, let’s continue with the data catalog usage.

Redefine the way to link business and IT view.

I’m sure you have already made the connection between this concept and the IT cartography: a specific information system contains a population of data. The business object view provides the refinement needed to connect the dots. Necessary, yes, sufficient ... no. We need to refine the link itself.

Defining the dataset index.

As I mentioned before, data has - almost - the superpower of ubiquity. I say almost because of the change of quality mentioned earlier. Yes, the population of data in the authoring tool is the same as the one you loaded into your data lake - I hope - but the two don’t have the same freshness.

So, if you want to link the business object view and the dataset, you need to specify the quality on the link. I propose to call this link the dataset index, in reference to the index of a book: the index tells you where to find the information in the book.
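A dataset index entry could look like the following Python sketch, where the quality carried on the link is reduced to a single freshness figure. The field names, the dataset identifiers and the freshness values are all hypothetical; a real index would likely carry richer quality information.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetIndexEntry:
    """Link between a business object view and a dataset, with quality on the link."""
    view: str
    dataset: str
    max_age_hours: float  # assumed quality measure: how stale the data may be


# Illustrative entries: same population, different freshness.
INDEX: List[DatasetIndexEntry] = [
    DatasetIndexEntry("products_europe_sold", "authoring_tool.products", 0.0),
    DatasetIndexEntry("products_europe_sold", "data_lake.products", 24.0),
]


def datasets_for(index: List[DatasetIndexEntry], view: str,
                 freshness_needed_hours: float) -> List[str]:
    """Return the datasets holding the view that are fresh enough for the purpose."""
    return [
        entry.dataset
        for entry in index
        if entry.view == view and entry.max_age_hours <= freshness_needed_hours
    ]
```

A use case that tolerates day-old data gets both sources back, while one that needs data within the hour is pointed to the authoring tool only: the good source for the intended purpose.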

Unfortunately, this complementary specification of the index is not sufficient, because of the architecture of the information system.

Sometimes, a specific table contains a mix of business objects which are close from a conceptual point of view but really far apart from a business point of view. A simple example: a table containing “items” may mix the products of your company and … the pens you buy for the office.

In the same way that we specified the link from the business object view to the dataset, we need to specify the link from the dataset to the business object view. In this case, it is about giving the keys to filter the dataset’s population down to the business object view’s population.

You may notice that we are at the interface between the business world and the IT world. When we look at the dataset from the Business Object perspective, it is with a business expression (quality), but when the answer comes from the IT side, it is necessarily in IT terms: rules based on the technical structure, with filters on columns/attributes.
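The “items” table example can be sketched in Python as a filter expressed on columns. The rows, the `item_type` column and its values are invented for illustration; the point is only that the dataset-to-view link carries a technical rule.

```python
from typing import Any, Dict, List

# Hypothetical content of an "items" table mixing two business objects.
ITEMS: List[Dict[str, Any]] = [
    {"item_id": 1, "item_type": "product", "label": "Product A"},
    {"item_id": 2, "item_type": "office_supply", "label": "Pen"},
    {"item_id": 3, "item_type": "product", "label": "Product B"},
]

# Assumed filter stored on the link from the dataset to the
# "Product" business object view: column name -> expected value.
PRODUCT_VIEW_FILTER: Dict[str, Any] = {"item_type": "product"}


def filter_rows(rows: List[Dict[str, Any]],
                column_filter: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Keep only the rows matching every column/value pair of the filter."""
    return [
        row for row in rows
        if all(row.get(column) == value for column, value in column_filter.items())
    ]
```

Applied to `ITEMS`, the filter keeps the two products and leaves the pen out of the business object view's population.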

To conclude: start small, think minimum viable product.

As you have understood, we focus on knowledge starting from the business side and limit the IT cartography to the dataset level. The first level of knowledge targeted is - only - at dataset level: a table of an application hosted on a platform … or a folder of videos hosted on a filesystem.

Do we need more? Maybe. But we don’t need more everywhere. At this level, the model is quite resistant to change while giving the minimum level of knowledge needed to govern the company’s data. It also shows the path to more detail when necessary: we know where to start the refinement.

Starting with this level of cartography allows a more reasonable delivery timeline than starting from the thousands (millions?) of tables in your information system, an approach which, by the way, will miss your unstructured datasets.