
Practical considerations for Machine Learning Classification

By Tejas V

There is something very satisfying about building a machine learning classifier on a toy dataset. We can achieve high accuracy and feel good about it. But this doesn’t really prepare us for real-world datasets and the issues they pose.
In this article, we will look at three issues that real-world datasets pose to production systems.

“Class imbalance” in machine learning datasets

Models don’t like imbalanced datasets. If you have ever trained a machine learning classification model, you have probably come across this issue. People use different words for it: “imbalanced dataset”, “the model is skewed”, and so on.
Let’s say we are training a model to detect spam emails. This dataset will not be balanced: there will be many more regular emails (ham) than spam, with regular emails accounting for perhaps 95% of the dataset. Not all machine learning models deal with this imbalance well, and the final model may end up predicting “ham” for almost everything, missing most of the spam.
One way to deal with this is to balance the dataset, for example by undersampling the majority class to a 50/50 split. But this usually shrinks the dataset, which can hurt the model’s accuracy.
Ah, it’s like Sophie’s Choice.
As a counterpoint, balancing the dataset also means that we lose the original distribution of the data. The fact that only 5% of emails are spam can itself be useful signal for the model during training.
So what’s the solution? It depends on your dataset. It’s trial and error until you find what works for you.

How to deal with class imbalance?

  1. Balance the dataset. The dataset gets smaller, but the model might still have good prediction quality
  2. Use class weights to tell the model about the imbalance, so that mistakes on the minority class are penalized more heavily. Whether this works well depends on your data (see the sketch after this list)
  3. Keep the imbalanced dataset and use algorithms that handle imbalanced datasets well
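
As a minimal sketch of option 2, here is what class weighting looks like with scikit-learn. The synthetic 95/5 “ham vs. spam” data, the variable names, and the choice of logistic regression are illustrative assumptions, not part of any real pipeline.

```python
# Minimal sketch of class weighting (option 2), assuming scikit-learn.
# The synthetic 95/5 "ham vs. spam" data below is illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 20))
y = (rng.random(n) < 0.05).astype(int)  # ~5% minority class ("spam")
X[y == 1] += 0.75                       # give the minority class some signal

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# class_weight="balanced" re-weights the loss so minority-class errors count more.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test), digits=3))
```

Compared with an unweighted model, the weighted one typically trades a little precision on the majority class for better recall on the minority class; whether that trade-off is worth it depends on the application.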

“Noise” in machine learning datasets

Toy datasets are usually perfect: they don’t have any missing data, and all the ground-truth labels are accurate. This is not always the case with real-world data.
The software systems that originally generated the data were usually not designed with machine learning in mind. Important information may be missing, and labels may be wrong. Human annotators can also introduce labelling errors, and verifying each sample by hand is a very tedious task.
Training machine learning classification models on such datasets can be challenging. How much noise does the data have? 1%? 2%?
Generally, this number should be low. If the data has 30 to 40% noise, it’s probably not a good idea to train on it.

How to deal with dataset noise?

  1. Try to quantify how much noise is present in the data and where it comes from. This will determine whether the dataset needs to be cleaned up before moving forward (a rough way to estimate label noise is sketched after this list)
  2. If the noise level is low, many machine learning algorithms, including neural networks, tolerate it reasonably well, and the trained models will still give decent results
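
One rough, purely illustrative way to estimate label noise is to compare the given labels against confident out-of-fold predictions from a simple model. The random forest, the 0.9 confidence threshold, and the assumption that labels are integer-encoded are all choices made for this sketch, not a standard procedure; strong disagreement is a hint of label problems, not proof.

```python
# Rough, illustrative proxy for label noise: flag samples whose given label
# disagrees with a confident out-of-fold prediction. Assumes integer class
# labels 0..k-1; the model and the 0.9 threshold are arbitrary choices.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def estimate_label_noise(X, y, confidence=0.9):
    proba = cross_val_predict(
        RandomForestClassifier(n_estimators=200, random_state=0),
        X, y, cv=5, method="predict_proba",
    )
    predicted = proba.argmax(axis=1)
    confident = proba.max(axis=1) >= confidence
    suspect = confident & (predicted != np.asarray(y))
    return suspect.mean(), np.flatnonzero(suspect)

# noise_rate, suspect_idx = estimate_label_noise(X, y)
# A noise rate of a few percent is usually workable; 30-40% is a red flag.
```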

With neural network classifiers, the learning rate also interacts with noise. A high learning rate on noisy data tends to make the network chase the noisy labels, and the results will not be good; noisy data generally calls for a lower learning rate. Even then, the overall noise level needs to stay fairly low.
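
As a minimal illustration of that knob, assuming scikit-learn’s MLPClassifier: the two learning-rate values below are purely illustrative, not recommendations.

```python
# Same network, two initial learning rates; compare them on noisy data.
# Assumes scikit-learn's MLPClassifier; the exact values are illustrative only.
from sklearn.neural_network import MLPClassifier

fast = MLPClassifier(hidden_layer_sizes=(64,), learning_rate_init=1e-3,
                     max_iter=500, random_state=0)
cautious = MLPClassifier(hidden_layer_sizes=(64,), learning_rate_init=1e-4,
                         max_iter=500, random_state=0)

# Fit both on the same (noisy) training split and compare held-out scores, e.g.:
# cautious.fit(X_train, y_train).score(X_test, y_test)
```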

Model Drift

Once you build a machine learning classification model and deploy it to production, you never need to worry about it again, right? Wrong…
As time passes, the distribution of the classes in the real-world data can change, slightly and gradually. This slowly degrades the accuracy of the predictions.
It might also become necessary to bring in new data, or even new classes, to maintain the classifier.
So we need to keep an eye on the predictions to get a sense of their quality, keep updating the dataset, and retrain the models as needed (a simple monitoring sketch follows below).
The process by which a model’s predictions slowly degrade over time due to changing real-world data is referred to as ‘model drift’.
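
As a minimal monitoring sketch, assuming we log the model’s predicted classes over time: it compares the predicted-class distribution of a recent window against a baseline window with a chi-square test. The window choice and the 0.01 significance threshold are assumptions of this sketch, and a significant shift is a cue to investigate, not proof of drift.

```python
# Minimal drift-monitoring sketch: compare the predicted-class distribution of a
# recent window against a baseline window. The 0.01 threshold is an assumption.
import numpy as np
from scipy.stats import chi2_contingency

def drift_alert(baseline_preds, recent_preds, alpha=0.01):
    classes = np.union1d(baseline_preds, recent_preds)
    counts = np.array([
        [(np.asarray(baseline_preds) == c).sum() for c in classes],
        [(np.asarray(recent_preds) == c).sum() for c in classes],
    ])
    _, p_value, _, _ = chi2_contingency(counts)
    return p_value < alpha, p_value

# alert, p = drift_alert(preds_last_quarter, preds_this_week)
# An alert means "inspect the data and consider retraining", not "retrain blindly".
```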

Conclusion

So, we have taken a small peek behind the curtain at production machine learning models and the problems they deal with. Real-world datasets are complicated. It can feel like a constant wrestling match as you clean the data until it remotely resembles a usable dataset. As we scour through complex data generated by systems and engineers who had no idea at the time what it would eventually be used for, we begin to understand why applying machine learning in practice is not easy.
