I recently happened across an interesting article by David Gee entitled “A guide to the little jobs that will make big data work”.
What I found particularly noteworthy, not to mention refreshing, was its acknowledgement of the importance of metadata in a big data project. Metadata is usually ignored, conveniently forgotten or at best paid lip service in most big data discussions and presentations.
The only disagreement I might have is that metadata in this context is often not a ‘little job’.
Why is metadata important? Well, I guess it might not be… if your big data repository, whether a data lake or a data warehouse, is populated solely by records from a single source with a straightforward data structure. Twitter data might fit that bill, or machine and sensor data. In my experience, however, the whole purpose of these initiatives is to give business users insights that allow them to make better and more effective decisions, whether those are strategic or more focused, say deciding whether to give a customer credit based on their history with the company.
The result is that most data warehouses will contain data from more than one system in order to reduce the number of ‘data silos’. And whilst there is some debate about what the exact make-up of a data lake should be, in terms of the number and types of sources, it is likely to contain data from more than just a single application.
The importance of metadata here cannot be overstated. If those developing the ETL, ELT or data movement routines that feed warehouses and lakes cannot be absolutely sure where the ‘raw’ data comes from, how it is structured and what it means, then there can be no certainty that business users have access to the right information.
Again, it can be argued that the underlying data model (metadata) of many source systems is simple enough to be easily understood: only a few tables, nicely documented, with simple data definitions.
However, many of the source applications used by larger organisations as their systems of record come from packaged system vendors such as SAP and Oracle, and there are three distinct challenges associated with them in terms of metadata. Failing to overcome these often leads to overspend, late delivery and an increased risk of introducing inaccurate data into the ecosystem.
Firstly, anything remotely useful or easily understandable is virtually inaccessible without specialist tools – which the vendors do not provide. This is because the RDBMS System Catalogue contains no logical information about tables, fields and so on, nor any referential integrity information, meaning the relationships between tables – essential for sensible query building – are simply not available.
The valuable base metadata can, however, be found in the Application Data Dictionary, and specialist tools exist for extracting it and making it usable.
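To make that contrast concrete, here is a minimal sketch in Python. It assumes the Application Data Dictionary and the physical catalogue have already been exported to CSV files by some extraction tool; the file names and column headings are illustrative assumptions, not any particular product’s format. The idea is simply to enrich the bare physical names the System Catalogue exposes with the business descriptions and relationships held in the dictionary extract.

```python
import csv

# Assumption: metadata has been exported to CSV by an extraction tool.
# catalogue.csv  - physical names only, as the RDBMS catalogue sees them
#                  (table_name, column_name)
# dictionary.csv - application data dictionary extract
#                  (table_name, column_name, table_description,
#                   field_description, check_table)

def load_dictionary(path):
    """Index dictionary rows by (table, column) so physical names can be enriched."""
    with open(path, newline="", encoding="utf-8") as f:
        return {(r["table_name"], r["column_name"]): r for r in csv.DictReader(f)}

def enrich_catalogue(catalogue_path, dictionary_path):
    """Attach business meaning (descriptions, relationships) to bare physical names."""
    dictionary = load_dictionary(dictionary_path)
    with open(catalogue_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            meta = dictionary.get((row["table_name"], row["column_name"]))
            yield {
                "table": row["table_name"],
                "column": row["column_name"],
                # Without the dictionary, cryptic physical names carry no meaning.
                "table_description": meta["table_description"] if meta else "UNKNOWN",
                "field_description": meta["field_description"] if meta else "UNKNOWN",
                "related_table": meta["check_table"] if meta else "",
            }

if __name__ == "__main__":
    for record in enrich_catalogue("catalogue.csv", "dictionary.csv"):
        print(record)
```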
Secondly, the sheer size of the data model underpinning these systems means that traditional methods of finding the right data will simply not work.
For example, an SAP ECC system will typically comprise at least 90,000 tables before customisation, and we have come across Salesforce applications with in excess of 3,000 tables.
Oracle’s applications are typically somewhere between those numbers.
Reverse engineering from the System Catalogue into a standard data modelling tool such as SAP PowerDesigner, ER/Studio or ERwin is impractical, both because of the sheer size of the model and because no referential or logical information is available.
A solution that simply presents lists of tables and asks the user to pick the right ones assumes they already know what they are looking for, which is often not the case.
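As a sketch of the alternative, and reusing the illustrative dictionary.csv extract assumed above, searching the dictionary’s descriptions for a business term scales to tens of thousands of tables in a way that scrolling through lists of physical names cannot.

```python
import csv

def search_dictionary(path, term):
    """Return tables and fields whose descriptions mention a business term.

    The file layout matches the illustrative dictionary.csv above.
    """
    term = term.lower()
    hits = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            text = f'{row["table_description"]} {row["field_description"]}'.lower()
            if term in text:
                hits.append((row["table_name"], row["column_name"],
                             row["field_description"]))
    return hits

if __name__ == "__main__":
    # e.g. find everything the dictionary describes as relating to "credit limit"
    for table, column, description in search_dictionary("dictionary.csv", "credit limit"):
        print(f"{table}.{column}: {description}")
```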
Thirdly, these applications are often heavily customised (for example, we know of one SAP system with 117,000 tables). This means that relying on templates based on the data definitions of standard ERP tables in a specific version, or on documentation, is likely to introduce risk, or at the very least delay the project while the data model as actually implemented is compared against them.
To compound this, if you have multiple instances of the same application, perhaps as a result of merger and acquisition activity, or to support different lines of business or regions, then the customisations made to each instance will need to be understood in the context of the other systems.
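One way to get a first view of that divergence, again assuming each instance’s data dictionary has been extracted to a CSV file with the illustrative layout used above, is simply to compare the extracts: tables that exist in only one instance, and tables whose field lists differ, are the places where customisation has to be understood before the data is moved.

```python
import csv
from collections import defaultdict

def load_fields(path):
    """Map each table to the set of fields its dictionary extract declares."""
    fields = defaultdict(set)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            fields[row["table_name"]].add(row["column_name"])
    return fields

def compare_instances(extract_a, extract_b):
    """Highlight where two instances of the 'same' application have diverged."""
    a, b = load_fields(extract_a), load_fields(extract_b)
    only_in_a = sorted(set(a) - set(b))
    only_in_b = sorted(set(b) - set(a))
    # Tables present in both but with different field lists, e.g. appended custom fields.
    diverged = {t: {"only_a": sorted(a[t] - b[t]), "only_b": sorted(b[t] - a[t])}
                for t in set(a) & set(b) if a[t] != b[t]}
    return only_in_a, only_in_b, diverged

if __name__ == "__main__":
    only_a, only_b, diverged = compare_instances("instance_a_dictionary.csv",
                                                 "instance_b_dictionary.csv")
    print("Tables only in instance A (often custom objects):", only_a[:20])
    print("Tables only in instance B:", only_b[:20])
    for table, delta in list(diverged.items())[:20]:
        print(table, delta)
```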
Some might suggest that data preparation or data wrangling tools in the hands of data scientists will solve this challenge. I do not believe that to be the case unless those tools can make sense of raw data that has virtually no terms of reference and no context, across different instances and potentially different data types.
So, my contention is that metadata is not a ‘little job’; it is critical to the success of data warehouse and data lake initiatives. It is worth remembering that in enterprise IT there are no silver-bullet solutions to complex, data-led problems, and that work still needs to be done to understand the metadata in order to provide the platforms that support the business.