I have been reading quite a lot recently about the applicability of Agile methodologies for Data Warehousing amongst other initiatives and it appears to me that often the topic of source data analysis is under-represented in the literature and blogs so I thought I would discuss the subject of application metadata in an Agile data warehouse world.
In the Agile Manifesto published in 2001, much is made of the need for delivering solutions which satisfy the business user (customer) in a timely way, incorporating change, frequent delivery, collaboration between the business and IT and much more. In an excellent piece written by Kent Graziano (Data Warrior LLC, for Oracle Scene recently, he reflects on the 12 specific principles of the Agile Manifesto in the context of making delivery of Data Warehouses faster and more effective.
However in amongst the discussion about data modelling for a data warehouse, possibly using the Data Vault method, changes to ETL coding and the warehouse itself due to requirement changes there is little attention paid to the originating applications which provide the raw data for the warehouse itself.
In an Agile world it is important to be able to deliver small and often, to be able to cater for changes due to user needs, business imperatives and regulatory requirements. Using approaches such as Data Vault and modern model driven data warehouse design solutions, and ETL tools the IT department or project team has a good chance of succeeding in delivering effective sources for BI, Data Discovery and Analytics tools.
In an Agile world I would maintain that it is also critical to have a way to understand the data models (metadata) which underpin the ‘systems of record’ from which a great deal of the data for the Data Warehouse is sourced.
Obviously profiling that data to discern its quality and suitability for the end user is important – however one needs to know where the data is first.
An Enterprise Data Model could potentially fulfil this need as long as it contains accurate up to date information about the data sources although these are rare and not easy to build and maintain, especially if the sources include large strategic COTS packages. Whether as part of an Enterprise Data Model or not, understanding the metadata of many of these packaged systems often presents a significant challenge even for seasoned application specialists.
They, the COTS packages, were simply not developed with this type of activity in mind so when a data warehouse team is faced with trying to find data in packaged systems from vendors such as SAP and Oracle they regularly experience delays and rework. This is because the data models for these systems are so big, complex and opaque that navigating around them find to the right tables is difficult and time-consuming and, in the absence of tools from ERP and IM vendors, has traditionally been a manual or at best partially automated task.
As an example, in an SAP system they would encounter a data model which typically runs to over 90,000 tables and has been customised. In itself that might not be a problem if the System Catalogue held the logical names and descriptions and also the relationships between tables – but it doesn’t so looking there or trying to reverse engineer that information is of little or no help. Trying to figure out, for example, the meaning of a table called T077D, which includes columns FAUSV and FAUSF, and how it is joined to other tables could be a bit tricky and take a long time.
I would suggest that this approach cannot possibly support an Agile method. An alternative to the manual method is to use a metadata discovery tool which automates the process of reverse engineering the metadata, including logical names, customisations, relationships, views etc. from the application’s Data Dictionary tables in a matter of minutes or hours.
This will then give the data warehouse designers, architects and modellers an interface into that information so that they can find and use the tables they need quickly and easily even after additional requirements are identified or rework needed and would enable an Agile approach for the whole project life-cycle.
SAP Table T077D showing columns and related tables (only those with data)
By the way, in case you were wondering T077D is the Customer Account Groups table in SAP. Two of its columns are FAUSV (Sales Data) and FAUSF (Company Code Data). The table has 30 Child Tables (in our system only 17 of which contain data) and 2 Parent Tables.