In the years that we have been working with artificial intelligence (AI) in IFS Labs, I’ve talked to many of our customers and seen a lot of use cases. One of the most important topics that comes up is data. Don’t make the mistake of ignoring it: data is the single largest factor in the success of an AI implementation. Let’s look at some of the challenges around data.

Get a Data Strategy in Place, and Quickly

In recent years, big data has received so much attention that we think we have all of the data we need. In fact, in a questionnaire among 200 of our customers, 39% said they have all the data they need. I think that’s incorrect. Yes, there are many cases where we have a lot of data: think of consumer behavior or performance data of aircraft engines. And yes, the rise of IoT will give us a lot of data on many different assets and products. But in many other cases we will have two issues with data: quality and quantity.

Quantity

Take a simple example of using machine learning to predict what kind of spare parts a field service technician needs for a specific problem or job.

As a rule of thumb, solving an average classification problem with machine learning requires approximately 1,000 data points. This means you need about 1,000 logged instances of “problems” with the same spare part set before you can begin predicting spare parts for newly logged problems. If you’re not registering this data correctly, which unfortunately still happens, you can start today and have the necessary data a couple of years from now.
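A quick way to see where you stand is to count how many logged jobs you have per spare part set. The sketch below is a minimal illustration, assuming the job log is available as pairs of problem description and spare part set. The function name, the field layout, and the sample data are hypothetical, and the threshold of 1,000 is only the rule of thumb mentioned above.

```python
from collections import Counter

# Assumed threshold, taken from the rule of thumb in the text.
MIN_EXAMPLES = 1000

def parts_ready_for_training(job_log, min_examples=MIN_EXAMPLES):
    """Return the spare part sets that have enough logged jobs to train on.

    job_log is an iterable of (problem_description, spare_part_set) pairs;
    the layout is illustrative, not taken from any real system.
    """
    counts = Counter(part_set for _problem, part_set in job_log)
    return {part_set: n for part_set, n in counts.items() if n >= min_examples}

# Made-up log: one part set is well documented, the other is not.
job_log = (
    [("pump leaking", "seal kit")] * 1200
    + [("motor overheating", "fan assembly")] * 40
)
ready = parts_ready_for_training(job_log)
print(ready)
```

In this made-up log only the seal kit clears the bar. The fan assembly, with 40 logged jobs, would need far more history before a prediction is worth attempting.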

Of course, you might have the data for this one specific use case, but what about the dozens of others down the road? Interestingly enough, in the same questionnaire 64% indicated they had not started with AI at all, and only 3% think AI is core to their business processes. I wonder, how can you already know the most relevant use cases if you haven’t started yet? You most likely can’t, but perhaps you can think about which data areas make sense.

Quality

Then there is the issue of data quality. A machine learning algorithm learns based on the data provided. If you feed it a biased data set, the algorithm will provide biased results.

Often when we start out, we don’t recognize this bias in the data. Combine this with the opacity of how machine learning algorithms, specifically neural networks, arrive at their results, and you potentially have a dangerous situation on your hands.

Look at Amazon. They had a very good idea: use machine learning to pick up patterns in the thousands and thousands of resumes Amazon receives each year. Could they use a machine learning algorithm to analyze new resumes and provide lists of the candidates most likely to be hired? Great idea. However, reportedly, Amazon shut down the program after a couple of years because it appeared the algorithm was biased against women. The reason was very simple: because the historical data set was predominantly male, the algorithm concluded that men were preferable to women and subsequently filtered women out. This is an example of how biased data can get in the way of a great idea for applying machine learning.
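The mechanism is easy to reproduce in a toy setting. The sketch below is not how any real recruiting system works. It is a deliberately naive "model" that learns nothing but the historical hire rate per group, trained on made-up records, to show that a skew in the training data comes straight back out as a skew in the scores. The group names and numbers are invented.

```python
from collections import defaultdict

def train_hire_rates(history):
    """Learn the historical hire rate per group from (group, hired) records."""
    totals = defaultdict(int)
    hires = defaultdict(int)
    for group, hired in history:
        totals[group] += 1
        hires[group] += int(hired)
    return {group: hires[group] / totals[group] for group in totals}

# Made-up history: group "A" dominates the data and was hired far more
# often than group "B". Group membership is the only signal available.
history = (
    [("A", True)] * 80 + [("A", False)] * 20
    + [("B", True)] * 2 + [("B", False)] * 8
)
rates = train_hire_rates(history)

def score(candidate_group):
    # The "model" simply repeats the pattern it saw in the training data.
    return rates[candidate_group]

print(score("A"), score("B"))
```

A new candidate from group "B" gets a lower score purely because of what happened in the past, not because of anything about the candidate.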

Data Annotation

And even if we have the data, do we know what the data means? People talk about “data” being the new gold, but in the context of machine learning it should be: “labeled data” is the new gold.

Within machine learning we differentiate between supervised and unsupervised learning. Unsupervised learning is like the grand prize: the capability of a machine to predict an outcome when nobody has told it anything about the training data (unlabeled data). Unfortunately, we’re nowhere near that goal. It’s very difficult, simply because without labels it’s hard to know beforehand what the machine should be trained to find.

Supervised learning is easier, but it requires training data that comes with a labeled outcome. For example, if you want to create an algorithm that can tell (classify) whether a picture shows a cat or a dog, you first need to feed it pictures that you have manually labeled cat or dog. And while that may be relatively simple for cat and dog pictures, it can be much harder, or much more work, for more complex problems. And if you can automate the labeling based on business rules, then you probably don’t need a machine learning algorithm at all.
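To make the role of labels concrete, here is a minimal sketch of supervised classification using a 1-nearest-neighbor rule. Real image classifiers work on pixels and use far more sophisticated models. Here each animal is reduced to two made-up measurements (weight and ear length), and every training example carries a manually assigned label. All names and values are assumptions for illustration.

```python
import math

# Manually labeled training data: (weight_kg, ear_length_cm) -> label.
# The features and values are made up for illustration.
labeled_examples = [
    ((4.0, 6.5), "cat"),
    ((3.5, 6.0), "cat"),
    ((5.0, 7.0), "cat"),
    ((20.0, 11.0), "dog"),
    ((30.0, 12.5), "dog"),
    ((12.0, 10.0), "dog"),
]

def classify(features, examples=labeled_examples):
    """Return the label of the closest labeled example (1-nearest neighbor)."""
    _nearest_features, nearest_label = min(
        examples, key=lambda example: math.dist(features, example[0])
    )
    return nearest_label

print(classify((4.2, 6.8)))
print(classify((25.0, 12.0)))
```

Without the "cat" and "dog" labels in the training list, the function would have nothing to return. That is the sense in which labeled data is the valuable part.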

Oh Yes, Our Privacy!

So, what has Europe’s General Data Protection Regulation (GDPR) got to do with machine learning? Well, here’s the issue. Machine learning uses data. So far so good. But what if this data contains personal data that falls under the GDPR? After all, the GDPR says (amongst a lot of other stuff) that:

You need to be able to tell people what data you hold on them and what you do with it;
You need to get rid of personal data if people ask for it;
You need to be able to explain the logic behind how automated decisions about people are made.

And this is where the potential problem sits. Many machine learning algorithms pull all this data in, create a black-box decision model that is not rule-based, and start making decisions. So, if the model decides to deny someone a loan and it has used personal data, how can we explain the logic behind the decision to the person involved? Perhaps this applies to only a limited number of use cases, and perhaps we can find other satisfactory, explainable reasons, but until the problem of “explainable AI” has been solved, we can’t know.
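For contrast, here is what an explainable decision can look like when the model is simple enough. The sketch below uses a transparent additive score in which every input contributes a known number of points, so the reasons behind a loan decision can be listed directly. The weights, the threshold, and the field names are all hypothetical, and this is an illustration of the idea, not a statement of what the GDPR requires.

```python
# Hypothetical weights for a transparent, additive loan score.
WEIGHTS = {"income_k": 0.5, "years_employed": 2.0, "missed_payments": -15.0}
THRESHOLD = 40.0  # assumed approval cutoff

def decide(applicant):
    """Return (approved, contributions) for one applicant.

    contributions maps each input to the points it added or removed,
    sorted from most negative to most positive.
    """
    contributions = {name: WEIGHTS[name] * applicant[name] for name in WEIGHTS}
    total = sum(contributions.values())
    ordered = dict(sorted(contributions.items(), key=lambda item: item[1]))
    return total >= THRESHOLD, ordered

approved, reasons = decide(
    {"income_k": 60, "years_employed": 5, "missed_payments": 2}
)
print(approved, reasons)
```

Here the first entry in the result is the factor that counted most against the applicant. A black-box model offers no such breakdown out of the box, which is exactly the gap described above.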

Big Data Needs Big Computing Power

Computing power has grown tremendously over the past decades. Moore’s Law, the observation that the number of transistors on a chip doubles roughly every two years, has held more or less true for a long time, but that growth cannot be sustained with current technologies. If there is one thing machine learning needs, it is computing power. Cloud computing and parallel processing systems have provided the answer so far. Perhaps technologies like GPUs will take us a step further. But either way, the combination of growing data volumes and machine learning will lead to some sort of bottleneck down the road.
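To get a feel for what doubling every two years means, the arithmetic can be sketched in a few lines. The function name and the default doubling period are assumptions for illustration. The calculation is simply 2 raised to the number of doubling periods.

```python
def growth_factor(years, doubling_period_years=2.0):
    """Return how many times larger a quantity is after `years`,
    assuming it doubles once per doubling period."""
    return 2 ** (years / doubling_period_years)

print(growth_factor(10))
print(growth_factor(20))
```

Ten years of doubling gives a factor of 32 and twenty years a factor of 1,024, which shows how much is lost once that pace can no longer be sustained.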

It’s likely not until the next generation of computer infrastructure, such as quantum computing, that we can make the next jump in AI.

Light on the Horizon

So, what can we do about all of this? My advice: rethink your data strategy. It’s difficult to predict what the most valuable use cases will be for you. But it’s worthwhile to learn the possibilities of machine learning and to think about the areas where it is going to be important for your business in the coming years. After that, it may sound simple: reinforce your data strategy, capture what you can, label your data in the process, and keep in mind that data hygiene is one of the most important things there is. Because it may sound silly, but without data there is no machine learning!