By Tejas V
There is something very satisfying about building a machine learning classifier on a toy dataset. We achieve high accuracy and feel good doing it. But this doesn’t really prepare us for real-world datasets and the issues they pose.
In this article, we will look at three issues real-world datasets pose to production systems.
Models don’t like imbalanced datasets. If you have ever trained a machine learning classification model, you may have run into this issue under one of its several names: ‘imbalanced dataset’, ‘skewed classes’, and so on.
Let’s say we are training a model to detect spam emails. This dataset will not be balanced: you will have many more regular emails (ham) than spam, with regular emails accounting for perhaps 95% of the dataset. This is clearly imbalanced. Not all machine learning models deal with this imbalance well; a naive model can score high accuracy by predicting ‘ham’ almost every time while missing most of the actual spam.
One way to deal with this is to balance the dataset: 50% regular emails and 50% spam. But undersampling the majority class usually shrinks the dataset, which can hurt the model’s accuracy.
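As a minimal sketch of what balancing by undersampling looks like (pure Python; the `emails` data below is made up for illustration):

```python
import random

def undersample(samples, labels, seed=0):
    """Balance a dataset by randomly dropping samples from larger classes."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    # Keep only as many samples per class as the rarest class has.
    n = min(len(items) for items in by_class.values())
    balanced = []
    for y, items in by_class.items():
        for s in rng.sample(items, n):
            balanced.append((s, y))
    rng.shuffle(balanced)
    return balanced

# 95 ham, 5 spam -> 5 of each after undersampling: we threw away 90 emails.
emails = [f"mail{i}" for i in range(100)]
labels = ["ham"] * 95 + ["spam"] * 5
balanced = undersample(emails, labels)
print(len(balanced))  # 10
```

Note how much data gets discarded: a 100-email dataset drops to 10 samples, which is exactly the accuracy trade-off described above.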
Ah, it’s like Sophie’s Choice.
As a counterpoint, balancing the dataset also means we lose the original distribution of the data. The fact that only 5% of emails are spam is itself useful information for a model during training.
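One middle ground, sketched here, is to keep the full dataset but weight each class by its inverse frequency in the loss, so the model still sees the true 95/5 distribution while each rare spam example counts for more. This uses the common n_samples / (n_classes × class_count) heuristic (the same idea behind scikit-learn’s `class_weight="balanced"`):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count):
    rarer classes get proportionally larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

labels = ["ham"] * 95 + ["spam"] * 5
weights = inverse_frequency_weights(labels)
print(weights)  # spam gets weight 10.0, ham roughly 0.53
```

Every training sample is kept, so nothing about the original distribution is thrown away; the imbalance is compensated for in the loss instead.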
So what’s the solution? It depends on your dataset. It’s trial and error until you find what works for you.
Toy datasets are usually pristine: no missing data, and every ground-truth label is accurate. Real-world data is rarely so kind.
The software systems that originally generated the data were rarely designed with machine learning in mind. Important information may be missing, and labels may be wrong. Even human annotators introduce labeling errors, and verifying every sample by hand is extremely tedious.
Training machine learning classification models on such datasets can be challenging. How much noise does the data have? 1%? 2%? Generally, this number should be low; if 30 to 40% of the labels are noisy, it’s probably not a good idea to train on the data at all.
When training neural network classifiers on noisy data, one knob that matters is the learning rate. With a high learning rate, each mislabeled sample can yank the model’s parameters around; with a lower learning rate, updates average over many examples and individual bad labels do less damage. Even then, the overall noise level still needs to be fairly low.
Once you create a machine learning classification model and deploy it to production, you never need to worry about it again, right? Wrong.
As time passes, the distribution of the classes in real-world data can shift gradually. This slowly degrades the accuracy of the model’s predictions.
New data, and sometimes entirely new classes, may also need to be incorporated to keep the classifier useful.
So we need to monitor the predictions to gauge their quality, keep updating the dataset, and retrain the machine learning models as needed.
The process by which a model’s predictions slowly degrade over time due to changing real-world data is referred to as ‘model drift’.
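The monitoring described above can be sketched very simply (names and thresholds are illustrative, not a production recipe): log recent predictions, compare the positive-class rate in a sliding window against the rate seen at training time, and raise an alert when it moves too far:

```python
def drift_alert(recent_preds, baseline_rate, threshold=0.05):
    """Flag drift when the positive-class rate among recent predictions
    moves more than `threshold` away from the training-time baseline."""
    rate = sum(1 for p in recent_preds if p == "spam") / len(recent_preds)
    return abs(rate - baseline_rate) > threshold, rate

# Trained when 5% of mail was spam; lately 12% of predictions are spam.
window = ["spam"] * 12 + ["ham"] * 88
alert, rate = drift_alert(window, baseline_rate=0.05)
print(alert, rate)  # True 0.12
```

An alert like this doesn’t fix anything by itself; it tells you it’s time to inspect the data, refresh the labels, and retrain.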
So, we have taken a small peek behind the curtain at production machine learning models and the problems they deal with. Real-world datasets are complicated. It feels like a constant wrestling match as you clean the data into something that remotely resembles a usable dataset. As we scour through complex data generated by systems and engineers who had no idea at the time what it would eventually be used for, we begin to understand why applying machine learning is not easy.