Recently I have been attending some vendor webinars on Big Data, Data Lake strategy and Hadoop in the context of Data Warehousing and Analytics, in an attempt to gain a clearer understanding of the technology platforms available and, more importantly, of the use cases and benefits a business might expect to accrue from them.
I was also naturally curious about what these vendors thought, if anything, about the importance of metadata in support of these projects. The webinars have been illuminating and useful, although it still seems to me that there are relatively few customer case studies which point to significantly reduced cost, improved revenue or profitability. I am sure that I will receive a few in response to this.
One point that did strike me, however, is that there are marked differences of opinion as to what should make up the contents of these ‘data receptacles’ and how they should be populated. There are also differences of opinion as to whether the data lake should take the form of a repository for analysis, or a staging area where data scientists can do their stuff to provision another repository with data for business users to analyse. For example, there are those who suggest that the receptacle should only contain the raw data from, say, social media or machine sources, whilst others suggest that it should contain anything and everything that can be loaded.
Some recommend a single enterprise-wide data lake, whilst others would propose different lakes for different regions, business areas, products or domains. Some suggest that data should not be pre-qualified or analysed in any way before being ‘landed’ into a data lake, and that the data scientists can make sense of it all.
I can understand this if the structure of the source data is relatively simple and easily understood; Call Data Records for a telecommunications company, for example, will be relatively easy to understand and analyse. Similarly, machine data from sensors, clickstreams or network appliances is fairly simple to analyse. The advantage of data science tools and big data is that, for the first time, it is economical to store and investigate this data to gain insight from it.
However, I would be concerned about the idea of pouring the entire contents of, say, an ERP or CRM system into a data lake with no initial source data discovery or analysis, and then expecting data scientists to be able to figure out what might be useful to business users downstream or to gain insights from it.
It is not the quantity of data that concerns me, as Hadoop can easily handle huge volumes of data; it is the complexity of the data model which underpins these applications.
For example, an SAP application can quite easily contain 90,000 to 100,000 or more tables, including customisations to the original implementation. One of our customers, who uses Safyr to give them “a level of understanding of data in SAP that we previously thought impossible”, has 117,000 tables in their SAP instance. Some of these tables will store data, some may not. Many of them are intermediary tables for calculations and the like.
You might think that the metadata for these tables is easy to find and understand. However, it is not held at the database level, and neither are any of the relationships between tables. Therefore, without first having a clear picture of what you intend to bring into the data lake, I think that the task of trying to make sense of it once you do will be almost insurmountable without recourse to technical specialists or external consultants, who may or may not be able to shed some light on the problem.
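To make that concrete, here is a minimal sketch of where that metadata actually lives. It assumes direct read access to the database underneath an SAP system and uses the standard SAP data dictionary tables (DD02L for the list of tables, DD02T for their descriptions, DD03L for fields and their check tables); the connection details are purely illustrative, and in practice you would go through the application layer or a tool such as Safyr rather than hand-written SQL.

```python
# Illustrative only: the business meaning of SAP tables is stored as application
# data in SAP's own data dictionary, not in the RDBMS catalog.
import pyodbc  # assumes an ODBC connection to the underlying database

conn = pyodbc.connect("DSN=sap_backend")   # hypothetical DSN for illustration
cur = conn.cursor()

# English descriptions of transparent (data-holding) tables, from DD02L/DD02T.
cur.execute("""
    SELECT t.TABNAME, d.DDTEXT
    FROM DD02L t
    JOIN DD02T d ON d.TABNAME = t.TABNAME AND d.DDLANGUAGE = 'E'
    WHERE t.TABCLASS = 'TRANSP'
""")
descriptions = cur.fetchall()
print(f"{len(descriptions)} described tables, e.g.", descriptions[:3])

# Relationships are not database foreign keys either: DD03L records which
# fields point at a check table, and that is where the joins are hiding.
cur.execute("""
    SELECT TABNAME, FIELDNAME, CHECKTABLE
    FROM DD03L
    WHERE CHECKTABLE <> ''
""")
print("Sample field-to-check-table links:", cur.fetchmany(3))
```

Even this sketch presupposes that you know which dictionary tables to read and what the codes inside them mean, which is precisely the knowledge that tends to sit with application specialists rather than with the data lake team.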
All this, of course, will add time and expense to the project. It may be that an organisation has multiple instances of a given ERP or CRM system, with different versions or customisations. If some or all of these are required in the Data Lake then additional complexity ensues.
Further, if you are planning to bring data from a variety of systems into the data lake, say Twitter and Facebook, plus data from your systems of record (SAP, JD Edwards, Siebel, Salesforce and so on) and perhaps from your online shopping cart system, then unless the individuals or businesses concerned are very accommodating and provide a common ID, name or email address, trying to gain insight into sentiment for targeted marketing to increase revenues will be tricky.
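A toy example illustrates the point; the names, handles and addresses below are invented for the sketch, and real entity resolution is considerably harder than this.

```python
# Linking social media sentiment to customers in a system of record
# when there is no shared identifier. All sample data is fictional.

crm_customers = [
    {"customer_id": "C001", "name": "Jane Smith", "email": "jane.smith@example.com"},
    {"customer_id": "C002", "name": "J. Smyth",   "email": "jsmyth@example.org"},
]

tweets = [
    {"handle": "@janes",  "email": None,                  "sentiment": "positive"},
    {"handle": "@jsmyth", "email": "jsmyth@example.org",  "sentiment": "negative"},
]

# Exact matching only works where the person has volunteered the same address.
by_email = {c["email"].lower(): c for c in crm_customers}
for t in tweets:
    match = by_email.get((t["email"] or "").lower())
    if match:
        print(f'{t["handle"]} -> {match["customer_id"]} ({t["sentiment"]})')
    else:
        print(f'{t["handle"]} -> no reliable match; fuzzy matching on names or '
              'handles is possible but error-prone without a common ID')
```

Where a common key exists the linkage is trivial; where it does not, you are left guessing from names and handles, and the value of any sentiment-driven marketing rests on how well that guesswork performs.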
I also think that loading raw data into a data lake without knowing what it means, profiling or analysing it there, and then provisioning a data warehouse or similar for business users, means that any form of comprehensive data lineage is lost.
When the business user asks, “I’m not sure whether this looks right, can you tell me where it came from before I make a decision?”, there is no certainty in the figures unless the original source can be found.
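Capturing even rudimentary lineage at the point of ingestion goes a long way here. The sketch below simply wraps each raw record in an envelope recording where and when it came from; the source names and field names are illustrative assumptions, not a prescription.

```python
# A minimal sketch of recording lineage when data is landed in the lake, so the
# question "where did this figure come from?" can still be answered downstream.
import json
from datetime import datetime, timezone

def land_record(record: dict, source_system: str, source_object: str) -> str:
    """Wrap a raw record with basic lineage metadata before writing it to the lake."""
    envelope = {
        "payload": record,
        "lineage": {
            "source_system": source_system,   # e.g. "SAP ECC", "Salesforce"
            "source_object": source_object,   # e.g. the originating table or API object
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        },
    }
    return json.dumps(envelope)

# Illustrative call: VBAK is the SAP sales order header table; the field values are made up.
print(land_record({"order_id": 4711, "net_value": 250.00}, "SAP ECC", "VBAK"))
```

None of this is sophisticated, but without something like it the trail back to the original source disappears the moment the data is transformed and handed on.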
I am sure that Data Lakes and Big Data will prove to be beneficial and deliver value to customers.
I am also sure that, just as with any other critical Information Management project, there needs to be significant governance (for data and project), realistic expectations (I’ve yet to find the silver bullet solution for data related projects!) and perhaps a reduction in the level of hyperbole and confusion as to the best architecture and platform.
However, I am still left with the question: “Is it important to know what you are pouring into your data lake to avoid it becoming a swamp, and if so, how will that be achieved without metadata?”