I was looking for an introduction for this episode and found this quote:
The rules have changed. There's a fine line between right and wrong. And, somewhere in the shadows, they send us in to find it.
It made me smile, and I'm sure it will make you smile if you get the reference. Don't ask me how my search led me to this text… and let's get back to our topic.
Yes, the data rules have changed, and they will continue to change in the near future. Data-related regulations like the GDPR have already affected how data is used in business, and even in our private lives. If you keep a watch on the data ecosystem, you may have heard about upcoming changes: proposed regulations that will impact data, such as the Data Governance Act or the Digital Services Act. Of course, depending on your business sector, you may face other laws, regulations and constraints leading to requirements on data.
On top of these external requirements, your own company's rules also put your data under constraints, typically your internal data security classification.
Understanding the complexity of data compliance.
Data compliance requirements usually land in the security domain and translate into constraints on the protection of confidentiality, integrity, availability or traceability.
You now face the data openness dilemma: data in jail or people in jail? If you open all datasets, you will put people in jail, or at least you will have to pay fines. If you lock down too many datasets, putting data in jail, people won't be able to create value from data. Does it sound familiar?
Of course, datasets are not only well-structured tables in a database. They also encompass the free text inside those tables, as well as your office documents or even videos.
To succeed in this quest, you need to know the risks that come with your data. You need an effective system, given the impact at stake. You also need an efficient system, because nobody wants to pay much for this enabler, especially keeping in mind that the rules will change, so we are talking about recurring costs.
First observation: a dataset is sensitive for a reason.
Let’s illustrate this with a fictitious dataset: a database table containing the company’s products, with the following information:
Name of the product.
Recommended retail price.
List of ingredients on the package.
Colour.
Dimensions.
Production recipe.
Production costs.
Minimal acceptable price.
Person responsible for production.
Some photos of the product.
Just reading it, you can feel this dataset is sensitive. You can also feel that only some pieces of information are sensitive, not all of them:
Some information is public by nature, like the name of the product, the recommended retail price, the dimensions or the colour.
For the production costs, it’s clear: you don’t want them to reach your competitors. The same goes for the minimal acceptable price.
For the production recipe, it’s not so obvious. For your latest product, you want to protect it. For an older product whose patent has expired, not really.
With the person responsible for production, you face another sensitivity axis: privacy. You have to protect it even though, from a pure business perspective, it didn’t carry much risk before the GDPR.
With the photos of the product, you potentially open Pandora’s box. Are the photos merely illustrative, or could they reveal manufacturing secrets? I won’t deep dive into this last one today…
If you consider the dataset only as an indivisible whole, it will be marked as “confidential plus privacy”. That will block the data analyst working on ingredients. Annoying, isn’t it?
It can be even worse if we mark the dataset simply as “confidential”, hiding the privacy topic under the same axis. In that case, if privacy law changes, we don’t even know how we are impacted, and we have to reopen every dataset marked as “confidential” to check.
Second observation: data compliance requirements are expressed by the business, not by IT.
Based on the first observation, the temptation would be to tag each column directly in the dataset instead of the whole dataset. Good try, but bad idea…
The first problem, which especially occurs in companies with a long IT legacy, is the heterogeneity of your information system. Let’s come back to the minimal acceptable price from the example. In one application, this information could be represented by a single column: price without tax. In another application, it could be two columns, with and without taxes. The business states that the minimal acceptable price is confidential, whatever its representation in the information system.
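As a purely illustrative sketch (application, table and column names are all invented), the same business concept can hide behind very different physical representations, while the compliance requirement is attached to the business concept itself:

```python
# Hypothetical mapping between one business concept and its physical
# representations across applications; every name here is invented.
minimal_acceptable_price = {
    "business_term": "minimal acceptable price",
    "classification": "confidential",            # expressed once, by the business
    "physical_representations": [
        {"application": "ERP",        "table": "PRODUCT",     "columns": ["MIN_PRICE_EXCL_TAX"]},
        {"application": "PricingApp", "table": "price_floor", "columns": ["min_price_excl_tax",
                                                                          "min_price_incl_tax"]},
    ],
}
```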
The second problem you will face is that, even within a single column, not all records have the same sensitivity. Remember the example of the production recipe: depending on the patent status, the answer will differ. So you need to tag a column for only a subset of rows. Ouch.
The third problem, which should completely chill you, is the sheer size of the obstacle. Just do a simple calculation: the number of applications, multiplied by the average number of tables, multiplied by 30 minutes (let’s be ambitious about our productivity). You are a big company and you just found half a million hours? 58 years and… 14 days? If you have 200 people ready to spend half of their time on it this year, feel free to try.
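To make the order of magnitude concrete, here is a minimal back-of-the-envelope sketch in Python; the application and table counts are purely hypothetical, chosen only to land near the half-million-hour mark:

```python
# Back-of-the-envelope estimate of a manual, table-by-table tagging effort.
# All inventory figures are invented; plug in your own numbers.
applications = 2_000          # number of applications in the landscape (assumption)
avg_tables_per_app = 500      # average number of tables per application (assumption)
minutes_per_table = 30        # the "ambitious" productivity from the calculation above

total_hours = applications * avg_tables_per_app * minutes_per_table / 60
print(f"Total effort: {total_hours:,.0f} hours")                   # 500,000 hours

# The same figure expressed as uninterrupted calendar time and as person-years.
print(f"~{total_hours / (24 * 365):.0f} years of non-stop work")    # ~57 years
print(f"~{total_hours / 1_600:.0f} person-years at 1,600 h/year")   # roughly 300 person-years
```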
Some will tell me that, thanks to artificial intelligence, we can reduce that time; I’m sure of it. I will answer: why not… if you are able to provide training datasets for all your cases. I’m sure there is a more efficient usage, which I’ll describe in the last part.
Third observation: wherever the data is located, the data compliance requirements are the same.
To make matters worse, a dataset is, for good or bad reasons, replicated across your information system. The product table from the previous example could live in the customer relationship management system, in the data hub of the sales domain and in the company data lake, but could also be (partially) replicated in the product list used by your online store and, of course, in several backups.
Can you imagine the rework needed to update sensitivity tags? Typically when a patent expires, when the privacy law changes or when, don’t ask me why, a regulation starts impacting the confidentiality of ingredient lists?
Reducing the complexity of data compliance knowledge.
If you have read the first episode of this season, I guess you have already connected the dots with the Business Object View concept. If not, I invite you to read it, or re-read it, to become familiar with the concept.
By design, the notion of Business Object View is there to fulfil this requirement on data compliance knowledge. By playing with characteristics, business contexts and states, we can express which populations of data are sensitive and why.
The idea is simply to document the applicable data compliance requirements and, thus, prepare the input for a sensitivity tagging engine.
It may look something like this:
The production cost of the product is confidential.
The minimal acceptable price of the product is confidential.
The production recipe of products which are under patent is confidential.
The person responsible for the production of a product is personal data.
Any population of data matching these minimalistic Business Object Views inherits the classification. All other populations of data are not sensitive.
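As an illustration only (the structure below is a hypothetical sketch, not a reference model), such rules could be captured in a machine-readable form to feed the tagging engine:

```python
# Hypothetical, minimalistic encoding of the classification rules above.
# Each rule scopes a Business Object View (object, characteristic, optional condition)
# and attaches a sensitivity axis and level to it.
classification_rules = [
    {"object": "product", "characteristic": "production cost",
     "condition": None,            "axis": "confidentiality", "level": "confidential"},
    {"object": "product", "characteristic": "minimal acceptable price",
     "condition": None,            "axis": "confidentiality", "level": "confidential"},
    {"object": "product", "characteristic": "production recipe",
     "condition": "patent_active", "axis": "confidentiality", "level": "confidential"},
    {"object": "product", "characteristic": "person responsible for production",
     "condition": None,            "axis": "privacy",         "level": "personal data"},
]
# Any population of data matching one of these views inherits the classification;
# everything else is considered not sensitive.
```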
Enriching the data compliance knowledge.
As you may have understood from reading me, I promote Data Governance as the balance between Data Business Value and Data Responsibility. I believe it’s also the motto that should drive the enrichment of this business-centric way of consolidating data compliance knowledge.
From a value creation perspective, before a dataset is used, the Data Officer will have to ensure its sensitivity is assessed. By opening the dataset, its population will be documented as a Business Object View and a Dataset Index. Existing classification rules will be completed if necessary.
From a responsibility point of view, thanks to the list of classification rules, Data Officers can identify the most sensitive Business Object Views. Thanks to Lead Data Architects, we can then focus our effort on the datasets representing these Business Object Views: finding them and protecting them.
Accelerating the tagging thanks to artificial intelligence.
Before going further, I would like to clarify some vocabulary: data sensitivity classification versus data sensitivity tagging versus data sensitivity labelling. At least, this is the vocabulary I promote; feel free to react in the comments.
Data sensitivity classification: this concept sits at the Business Object View level; it is the rule that cascades a sensitivity requirement onto some populations of data. When I wrote in the example “The production cost of the product is confidential”, that was the classification of this Business Object View. This information changes rather slowly, at the speed of laws and regulations or, sometimes, at the speed of the political-economic context.
Data sensitivity tagging: this concept sits at the dataset level; it represents the application of the classification rule. It takes the form of metadata attached to the dataset. This information changes faster than the classification, following the lifecycle of the dataset. An example I like is the annual financial report, which is confidential until disclosure and public once published. The classification is almost set in stone, while the tagging of the current annual financial report will change in one second, on the day of its publication.
Data sensitivity labelling: this last concept is also at the dataset level. It is a visual representation of some of the tagging, when it is required. Remember the last James Bond and the big “Top Secret” stamp? Do you remember any film with a “subject to GDPR” stamp? You get the point: we need this last piece of information to process the data, but nobody has asked to put a big stamp on it… for now.
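To make the distinction concrete, here is a minimal sketch, with invented names and dates, of a slow-moving classification rule coexisting with a fast-moving tag on one concrete dataset:

```python
from datetime import date

# Slow-moving classification rule: changes at the speed of regulation or policy.
classification = {"object": "annual financial report",
                  "axis": "confidentiality",
                  "rule": "confidential until disclosure, public afterwards"}

def current_tag(publication_date: date, today: date) -> dict:
    """Fast-moving tag: metadata attached to one concrete dataset instance."""
    level = "public" if today >= publication_date else "confidential"
    return {"dataset": "annual_financial_report_2023.pdf",  # hypothetical dataset
            "axis": "confidentiality",
            "level": level}

# On the day of publication, the tag flips while the classification stays untouched.
print(current_tag(date(2024, 3, 15), date(2024, 3, 14)))  # -> confidential
print(current_tag(date(2024, 3, 15), date(2024, 3, 15)))  # -> public
```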
That being said, let’s come back to where I challenged the use of artificial intelligence. I was challenging it when the ambition was to use it at the dataset level, for both classifying and tagging. Once we have identified the business rules for classification, we have a limited set of rules, such as: “The production cost of the product is confidential”.
What we can automate is what I defined in episode one as the Dataset Index. In other words, finding where the data is, according to its business description. In our example: where are the datasets containing the production costs of products?
You now have a field of experimentation for artificial intelligence, typically automated ontology mapping or natural language processing techniques such as named entity recognition. It especially makes sense for the free text you have everywhere, which is both a mine of information and a minefield.
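As one possible, deliberately simple experiment (plain phrase matching rather than full named entity recognition, and assuming the spaCy library is installed), you could start by spotting mentions of a business term in column descriptions or free text:

```python
# Sketch: spotting mentions of a business term in free-text descriptions,
# as a first step towards building a Dataset Index.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only; no trained model required
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Hypothetical synonyms for the business term "production cost".
terms = ["production cost", "manufacturing cost", "cost of goods"]
matcher.add("PRODUCTION_COST", [nlp(term) for term in terms])

description = "This table stores the manufacturing cost and retail price per product."
doc = nlp(description)
for _, start, end in matcher(doc):
    print("Possible hit:", doc[start:end].text)   # -> "manufacturing cost"
```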
Stay tuned for the next episode, The fall of the Kingdom of Process and the birth of the Data Federation.