In a recent interview with The Wall Street Journal, Arvind Krishna, IBM’s senior vice president of cloud and cognitive software, said that “data-related challenges are a top reason IBM clients have halted or cancelled artificial-intelligence (AI) projects.”
He went on to say that “about 80% of the work with an AI project is collecting and preparing data. Some companies aren’t prepared for the cost and work associated with that going in…. And so you run out of patience along the way, because you spend your first year just collecting and cleansing the data. And you say: ‘Hey, wait a moment, where’s the AI? I’m not getting the benefit.’ And you kind of bail on it.”
He also said: “In the world of IT in general, about 50% of projects run either late, over budget or get halted. I’m going to guess that AI is not dramatically different.”
A recent report by Forrester Research Inc. would appear to back up his comments. It found that data quality is among the biggest AI project challenges. Forrester analyst Michele Goetz said companies pursuing such projects generally lack an expert understanding of what data is needed for machine-learning models and struggle with preparing data in a way that’s beneficial to those systems.
She said that “producing high-quality data involves more than just reformatting or correcting errors: Data needs to be labeled to be able to provide an explanation when questions are raised about the decisions machines make.”
We believe that to collect and label data correctly, a deep insight into the business meaning of the data is necessary. To do that, you need access to the business as well as the technical metadata associated with your data sources. Once you have that, you can employ a wide range of data quality and other tools to prepare and cleanse the data before using it in AI or machine learning projects.
So, is it straightforward to get at that business and technical metadata? The answer is yes, and no. As you might expect, it all depends.
It depends on how easily you can find the business metadata in your source systems. Normally the technical metadata is relatively straightforward, although even that may present a challenge.
In many data sources the business metadata is quite accessible and not too extensive. Typically, this applies to home-grown applications and smaller packages, systems and files. Even sensor data would probably fall into this category, and most data coming from external sources or social media has relatively simple metadata structures.
Your data scientists, armed with good tools, should be able to make sense of these sources and ensure data quality and consistency for an AI project.
Where it becomes more difficult is with larger, more complex systems that have more extensive data models. These could have been built in-house many years ago, using what are now old development environments on legacy platforms, and have been regularly updated and customised since to meet the needs of the business.
[Image: Grace Hopper, promoter of COBOL. Courtesy of the Computer History Museum.]
In situations like this it is common for documentation to be out of date and for those who developed the initial system to have left the organisation or even retired, which hampers those who need to gain access to the metadata and the data.
For example, as recently as 2017 Micro Focus estimated that there were still over 240 billion lines of COBOL code in use.
These handle 70% of all business transactions, 95% of ATM transactions and 85% of Point of Sale transactions. Not bad for a programming language which began life in 1959!
Another class of large, complex systems which do not give up their business metadata easily is the ERP and CRM enterprise packages from vendors such as SAP, Oracle and Microsoft. The challenge with these applications is that they have very large data models: a typical SAP ECC implementation, for example, will have over 90,000 tables. In itself this might not be considered a problem, and you could potentially scan the database for metadata. However, once you realise that the RDBMS system catalog contains no business names for tables and fields, and that it does not provide any information about how tables are related, you can see that the challenge in data preparation is bigger than you might have originally anticipated. Finally, it is very common for only a portion of the tables to be used, so scanning all of them would be unnecessary anyway.
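To make that concrete, here is a minimal sketch, assuming a SQLAlchemy connection to whatever RDBMS sits underneath the ERP system (the connection string, schema name and example tables are placeholder assumptions, not a recommendation), of what a raw catalog scan actually gives you:

```python
# Minimal sketch: scanning the underlying RDBMS catalog of an ERP database.
# The DSN and schema name below are illustrative placeholders only.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@erp-db-host/erpdb")  # placeholder DSN

catalog_query = text("""
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = :schema
    ORDER BY table_name, ordinal_position
""")

with engine.connect() as conn:
    rows = conn.execute(catalog_query, {"schema": "sapsr3"}).fetchall()

# The catalog yields technical identifiers only, e.g. ("MARA", "MATNR", "varchar").
# It says nothing about business names ("Material Master", "Material Number")
# and nothing about how MARA relates to MARC, MARD and the rest of the model.
for table_name, column_name, data_type in rows[:20]:
    print(table_name, column_name, data_type)
```

Even with the full column listing in hand, you are still left to work out which of the 90,000-plus tables matter and how they join together.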
These applications store their business and other useful metadata in groups of data dictionary tables. Obviously, the vendors have tools which can access these; however, they are not designed for use by data scientists, nor do they provide the sort of functionality a data scientist is likely to need to search and analyse the metadata. Then, of course, there is the challenge of provisioning that metadata into whatever tools the data scientist is using.
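By way of illustration only, and assuming direct read access to SAP’s data dictionary tables (DD02L for table definitions and DD02T for their descriptive texts; the connection details and schema are again placeholders), table-level business names could be read with something like the sketch below. In practice that access usually goes through SAP’s own interfaces or a specialist product rather than raw SQL:

```python
# Illustrative sketch only: reading table-level business names from SAP's
# data dictionary (DD02L holds table definitions, DD02T the descriptive texts).
# The DSN and schema are placeholders; real-world access is normally mediated
# by SAP interfaces or a specialist metadata tool.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@erp-db-host/erpdb")  # placeholder DSN

dictionary_query = text("""
    SELECT t.tabname, tx.ddtext AS business_name
    FROM sapsr3.dd02l AS t
    JOIN sapsr3.dd02t AS tx
      ON tx.tabname = t.tabname
     AND tx.ddlanguage = 'E'          -- English descriptions
    WHERE t.tabclass = 'TRANSP'       -- transparent (physical) tables only
    ORDER BY t.tabname
""")

with engine.connect() as conn:
    for tabname, business_name in conn.execute(dictionary_query).fetchmany(20):
        print(f"{tabname:<30} {business_name}")
```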
Finding and using the right metadata in extensive and highly customised Salesforce systems is also problematic. It is not so much that the metadata is opaque or difficult to find; it isn’t. However, the nature of the platform makes it easy to add products from the AppExchange, to apply your own customisations or even to build your own applications. The development environment also lends itself to agile development methods, so it is common for the system to change quite dramatically and often. Keeping up with how the data model is developing and evolving is then difficult without specialist tools.
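For Salesforce, at least, reaching the metadata itself is easy. Here is a minimal sketch, assuming the simple_salesforce Python library and placeholder credentials, of pulling the business-facing labels for one object via the describe API:

```python
# Minimal sketch, assuming the simple_salesforce library and placeholder
# credentials, of retrieving field-level metadata for one Salesforce object.
from simple_salesforce import Salesforce

sf = Salesforce(
    username="user@example.com",       # placeholder credentials
    password="secret",
    security_token="token",
)

account_meta = sf.Account.describe()   # field-level metadata for the Account object
for field in account_meta["fields"]:
    # 'name' is the technical/API name, 'label' is the business-facing name
    print(f"{field['name']:<40} {field['label']}")
```

The harder part is not one describe call but repeating it across every standard, custom and AppExchange object after each release, and keeping a history of how the model has changed.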
The challenge is that only a few tools aimed at these more complex data sources are designed for metadata discovery and analysis and can also provision other systems with their results. For example, other than our own product, Safyr, which works with ERP and CRM packages from SAP, Oracle, Microsoft and Salesforce, we know of only a handful of specialist tools that address this specific problem.
Of course, it is common for these larger, complex, enterprise-class applications to contain the base data which is critical to the success of AI projects. Given that, and the challenges they pose in identifying and labelling data effectively, perhaps it is not surprising that AI projects take longer to deliver.
To give yourself the best chance of avoiding costly delays or even cancellation due to data issues, it is important to fully understand the data landscape you are working with in the context of the project goals.
It is worth asking a few questions, and documenting the answers, before you start trying to extract and prepare data for your AI project.
- Have you identified which applications, systems and files contain the data you need?
- For each of those do you know how their data models are structured?
- Do you have access to their business metadata? By this I mean the business names for data objects such as tables and fields.
- Do you know how tables relate to each other or to particular business artefacts (e.g. a Bill of Materials)?
- Have you identified which tools and technologies might be able to help you find, catalog and curate the metadata you will need?
- What is the cost of any internal or external resources you might need to perform the source metadata discovery and analysis?
- How long will it take to gather the necessary metadata? This should be factored into any time and cost estimate for the project.
- Where are you going to store that metadata so that it is accessible by those who need to use it?
Of course, a metadata discovery and repository tool is only one component and will not, by itself, solve all your AI project problems. It does not, for example, deal with data quality, which is another significant challenge for any data transformation initiative, especially AI. Nor does it address issues such as scope creep, potential gaps in immature product offerings or the lack of skills in implementing these new solutions.
However, having a platform which contains business as well as technical metadata will help you to avoid some of the pitfalls and delays associated with a lack of intelligence about your data sources. It is worth considering implementing a metadata management or repository product as a prerequisite for any AI project, or indeed any other data management initiative. It will repay the investment many times over by reducing delays, improving accuracy and delivering a central source of business metadata across multiple initiatives.