
The Importance of Data in Machine Learning

Serengeti
09.09.2020.

Well, the short answer: data is everything, with a little bit of modelling on the side. Now that we have this out of the way, let's focus on a few key points: what kind of data is important, how much data you need, and how you can handle all this data.

The Kind of Data

The data you need varies widely depending on the task you set out to accomplish. Be it a collection of video footage, medical images, sound, electronic signals or financial data – you always need your data to be good. Good means that it is correctly labelled where needed (within reason), properly structured and readable by a machine. Then again, based on the methods you would like to use, you will need different amounts of data. And once you have gathered enough data, you need to make sure that it represents the real-world issues/samples/information that you are trying to tackle/identify/extract.

So, for example, you might be doing a financial analysis of issued loans, trying to determine whether a loan is going to default. Of all the loans issued – let us say a very conservative lending policy is in place – only 1% end up in default. With such a large disparity between categories, your data might not represent the diversity of situations you hope to predict. Even if you have tons of data – abstract tons 🙂 – you might decide to just roll with it and say, “If I say none of the loans will default, I will end up with 99% of issued loans predicted correctly, sweet.” Of course, this example is exaggerated to prove a point. But the point is that you need to put some effort into understanding real-world problems and how they are represented in your data. Data is a tool to help you form a hypothesis, and at the same time to test it and confirm or reject it.
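To make that trap concrete, here is a minimal sketch of the “always predict no default” baseline. The dataset is synthetic, and the 1% default rate, the placeholder features and the use of scikit-learn's DummyClassifier are illustrative assumptions, not part of any real analysis:

# Minimal sketch of the "99% accuracy" trap on heavily imbalanced data.
# Synthetic data only; assumes NumPy and scikit-learn are installed.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(seed=42)

n_loans = 100_000
y = (rng.random(n_loans) < 0.01).astype(int)   # 1% of loans default (label 1)
X = rng.normal(size=(n_loans, 5))              # placeholder features

# A "model" that always predicts "no default".
baseline = DummyClassifier(strategy="constant", constant=0)
baseline.fit(X, y)
y_pred = baseline.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2%}")           # ~99% – looks great
print(f"Recall on defaults: {recall_score(y, y_pred):.2%}")   # 0% – catches no defaults

Looking at metrics that respect the minority class (recall, precision, or the area under the precision–recall curve) exposes the problem that raw accuracy hides.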

How Much Data, and How Much Storage?

One of the presumptions that we have almost always seen proven wrong in practice is that you need just enough storage to contain your datasets. Storage is expensive, and fast storage even more so. Let us start by designing our storage to contain only the dataset we plan to process, and then iterate through several scenarios:

  1. If the dataset was hard or costly to obtain, you will need a backup, perhaps multiple backups, of the data.
  2. The storage medium is too slow to load huge chunks of data – one solution is to store the dataset in multiple locations from which you can read simultaneously.
  3. The dataset needs to undergo a heavy data augmentation process which might double, triple or n-ple it in size.
  4. One of the researchers wants to try out something which might put pressure on the current storage system, so they ask for a copy of the data so they can play around a bit.
  5. Processing the data produces so many intermediate results or so much extracted metadata that you need more storage to handle them.

These scenarios show that each dataset, and probably each processing pipeline, will have its own data byproducts which will consume your resources. Therefore, it is not enough to take only the datasets into account – the whole data lifecycle needs to be considered.
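As a rough illustration of how quickly those scenarios add up, here is a small back-of-the-envelope sketch. The function name and every factor in it are assumptions chosen for the example; replace them with the numbers from your own lifecycle:

# Back-of-the-envelope storage estimate across the data lifecycle.
# All factors below are illustrative assumptions, not recommendations.
def estimate_storage_tb(raw_dataset_tb: float,
                        backups: int = 2,
                        augmentation_factor: float = 3.0,
                        working_copies: int = 1,
                        intermediates_factor: float = 0.5) -> float:
    """Estimate the total storage (in TB) a dataset implies over its lifecycle."""
    backup_tb = raw_dataset_tb * backups                       # scenario 1: backups
    augmented_tb = raw_dataset_tb * augmentation_factor        # scenario 3: augmentation
    copies_tb = raw_dataset_tb * working_copies                # scenarios 2 & 4: replicas and scratch copies
    intermediates_tb = raw_dataset_tb * intermediates_factor   # scenario 5: intermediate results and metadata
    return raw_dataset_tb + backup_tb + augmented_tb + copies_tb + intermediates_tb

print(f"{estimate_storage_tb(10.0):.0f} TB")   # a 10 TB dataset becomes 75 TB under these assumptions

Even with modest factors, the dataset itself is only a fraction of the storage you end up provisioning.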

How to Handle the Data

To make the most of the data you have, the whole data lifecycle needs to be coordinated so that the data stays usable. For example, petabytes of data will do you no good if there is not enough computing power or network throughput to work with them effectively.

From the perspective of data utilization, you need to be able to produce and label, store, transfer and process the data in a meaningful way within a reasonable time. Too many times have we seen data lifecycle planning that was hurried and ended up with a lot of moving parts that did not play well with each other: cheap network equipment, slow disks, old processors. Everything counts in large amounts. It is very important to devote time to planning the data lifecycle. Recovering from bad planning is hard, since in many cases it requires capital expenditure and equipment that takes valuable time to procure. Here again, a cloud strategy comes to the rescue: many of the harder aspects of planning can be remedied by having a cloud or hybrid strategy in place and leveraging its agility and subscription-based pricing.
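To see why network throughput alone can wreck a timeline, here is a rough transfer-time calculation. The link speeds and the 70% efficiency factor are illustrative assumptions, not benchmarks:

# Rough transfer-time estimate: how long does it take just to move the data?
def transfer_time_hours(dataset_tb: float, link_gbit_s: float, efficiency: float = 0.7) -> float:
    """Hours needed to move dataset_tb terabytes over a link_gbit_s network link."""
    dataset_bits = dataset_tb * 8e12                        # TB -> bits
    effective_bits_per_s = link_gbit_s * 1e9 * efficiency   # usable share of the link
    return dataset_bits / effective_bits_per_s / 3600

print(f"{transfer_time_hours(100, 1):.0f} h")    # 100 TB over 1 Gbit/s: ~317 h, roughly 13 days
print(f"{transfer_time_hours(100, 10):.0f} h")   # 100 TB over 10 Gbit/s: ~32 h, a bit over a day

If a single pass over the data takes two weeks just to move, no amount of modelling skill will keep the project on schedule.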

Conclusion

Data science and AI advisor Monica Rogati cautions companies keen to implement ML/AI by asking them to imagine AI as sitting at the very top of a ‘hierarchy of needs’ pyramid. According to her, AI is amazing, but in all cases you first need to establish the basics: data literacy, data collection, and infrastructure.

In light of the fast progress made in digital technologies, developers and businesses are in a constant rush to jump on the ML/AI bandwagon. In that hurry, time is rarely taken to diligently follow the steps required for success. Some steps are omitted by accident and some are skipped on purpose; both cause a resounding ML/AI strategy failure. Our suggestion is therefore to keep the AI hierarchy of needs in mind at all times.

As a bonus, here is a nice list of publicly available datasets.

