ML Journey

A red and green apple, round in shape: that’s what we call a data point, a specific piece of data about one apple.

Now, if we add more data about other fruits (say, bananas that are green and yellow with a long, curved shape, and so on), this becomes a fruit dataset. In other words, a dataset is a collection of similar data with the same structure or category.

If we combine a fruit dataset with a vegetable dataset, we end up with two or more datasets. This collection is known as a data collection, multi-dataset, or combined dataset.
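Here’s a rough sketch of these ideas in code. It’s a minimal example: the fruit and vegetable values are made up, and pandas is just one convenient way to hold a dataset.

```python
import pandas as pd

# One data point: specific data about one apple.
data_point = {"name": "apple", "color": "red/green", "shape": "round"}

# A dataset: a collection of similar data points with the same structure.
fruits = pd.DataFrame([
    data_point,
    {"name": "banana", "color": "green/yellow", "shape": "long-curved"},
])

vegetables = pd.DataFrame([
    {"name": "carrot", "color": "orange", "shape": "long"},
])

# Two or more datasets combined: a combined dataset.
combined = pd.concat([fruits, vegetables], ignore_index=True)
print(combined)
```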

🧠 The Core of Learning: Data

In the world of technology, data is everything. Just like humans need letters to learn how to read, machines need data to learn anything. Those letters are the raw material, or input: our version of data.

If the data is accurate and clean, then the understanding (or prediction) will also be accurate and meaningful.

So yup, Machine Learning (ML) learns from data to understand things, just like humans do, which is why I jokingly call it “Mammal Learning”. If a child wants to learn English, they first need to understand the Latin alphabet (A-Z). But imagine that in the middle of learning, they suddenly get exposed to Japanese scripts or Russian Cyrillic letters.

They’ll be confused. And when they try to write “I’m hungry” in English, it might come out as: “Я’м はんгрý.”

A strange mix of Russian and Japanese letters! 😵

That’s exactly why consistent and context-appropriate data is critical for humans and machines alike.

🔧 Why Garbage In = Garbage Out

If a model is trained with inconsistent, mixed, or messy data, the output will also be flawed or invalid. That’s why preprocessing and data cleaning are so crucial.

When you’re dealing with thousands or millions of data points, it’s impossible to manually check each one for typos or missing values; that’s why automation is needed to clean and validate the data.
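As a minimal sketch of what that automation can look like (assuming a hypothetical fruit table with made-up columns and values), a few lines of pandas can normalize text, drop unusable rows, and fill in missing numbers:

```python
import pandas as pd

# Hypothetical messy data: a typo-prone label column and a missing weight.
df = pd.DataFrame({
    "fruit": ["apple", " Apple ", "banana", None],
    "color": ["red", "green", "yellow", "green"],
    "weight_g": [180, 175, None, 120],
})

df["fruit"] = df["fruit"].str.strip().str.lower()                 # normalize text
df = df.dropna(subset=["fruit"])                                  # drop rows with no label
df["weight_g"] = df["weight_g"].fillna(df["weight_g"].median())   # fill missing values

print(df)
```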

❓ So… How Can a Machine Learning Model Mislead Us?

Actually, it’s not the model that tricks us. We teach it wrong.

Imagine a child learning English and right before mastering the alphabet, their parent suddenly introduces Japanese scripts. Of course the child gets confused. Should we blame the child?

Nope.
Just like that child, machines are innocent. They only learn what we give them.

🐋 Real-Life Example: Mislabeling Destroys Meaning

Suppose you’re building a model to classify land vs sea animals. But in your training data, some marine animals like dolphins and seals are mistakenly labeled as land animals.

When the model sees a whale, it might classify it as a land animal. Funny, but tragic.

Why?
Because the model saw dolphins labeled as land animals, and it learned a false pattern.
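A tiny toy sketch shows how this happens. The animals and the single “lives in water” feature are made up for illustration; the point is that a few mislabeled rows teach the model the wrong rule.

```python
from sklearn.tree import DecisionTreeClassifier

# Feature: does the animal live in water? (1 = yes, 0 = no)
X_train = [[0], [0], [1], [1], [1]]                 # dog, cat, fish, dolphin, seal
y_train = ["land", "land", "sea", "land", "land"]   # dolphin and seal mislabeled!

model = DecisionTreeClassifier().fit(X_train, y_train)

# A whale lives in water, but most "water" rows were labeled "land",
# so the model repeats the labeling mistake.
print(model.predict([[1]]))  # -> ['land']
```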

🔍 Ways Machine Learning Can Mislead Us

1. ⚖️ Biased or Incomplete Data

Imagine building a face recognition model, but your training data only contains images of white-skinned individuals. Then when the model sees a black-skinned face, it misidentifies it.

Why? Because the model never really learned to recognize a wide range of faces.

2. 💔 False Correlations

Kind of like a relationship that looks strong but lacks meaning. Oops.

For example: a pneumonia detection model is trained on X-ray images. But it ends up learning from the word “pneumonia” written on some X-rays, not the image itself. Result? The model classifies healthy lungs as sick if it sees that word.

Yikes. The model learned from the label, not the lungs.

3. 🧠 Overfitting (Too Much Memorization)

A good ML model should learn patterns, not memorize data.

If it memorizes the training data too well, it performs great during training but fails on real-world data.

This happens when the model is too complex or trained too long.
High training accuracy, but low test accuracy.
It knows the answers, but doesn’t understand the questions.
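Here’s a minimal sketch of that gap, assuming scikit-learn’s built-in digits dataset and a deliberately unconstrained decision tree:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An unconstrained tree can memorize the training set almost perfectly...
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# ...so training accuracy is near 1.0 while test accuracy is noticeably lower.
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy :", model.score(X_test, y_test))
```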

4. 🎯 Misleading Metrics

We often rely only on accuracy or F1-score, but those don’t tell the whole story.

Just because a model scores 90% accuracy doesn’t mean it’s good. Accuracy ≠ fairness, robustness, or trustworthiness.
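A quick made-up example: on an imbalanced dataset (90% healthy, 10% sick), a “model” that always predicts “healthy” hits 90% accuracy while catching zero sick patients.

```python
from sklearn.metrics import accuracy_score, recall_score

y_true = ["healthy"] * 90 + ["sick"] * 10
y_pred = ["healthy"] * 100   # always predict the majority class

print(accuracy_score(y_true, y_pred))                   # 0.9 -> looks great
print(recall_score(y_true, y_pred, pos_label="sick"))   # 0.0 -> misses every sick patient
```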

📌 What Makes a Good Model?

Beyond metrics, a truly good model must be:

1. Fair
Check performance (like F1-score) for each subgroup: men vs women, urban vs rural, etc. If the model performs well for one group but poorly for another, it’s not fair (see the sketch after this list).

2. Robust
Test the model on slightly altered data. If performance drops significantly, it’s not robust.

3. Representative
The model should reflect real-world data. If it only sees one scenario (e.g., only old data, or data from one city), it won’t generalize well.

4. Evaluated by humans
Let humans evaluate the model’s decisions. This is especially important in healthcare, finance, and law.

5. Tested across domains and time
The world changes. A model trained before COVID might fail after it, so try cross-domain or cross-time testing.

6. Explainable
Many modern models (like deep learning) are black boxes: we know the input and output, but not why the model made a decision. For example, a heart disease model might learn from patient names, not medical data. That’s why we must understand which features the model relies on: what it thinks is important.
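Here’s a minimal sketch of the per-subgroup check from point 1. The labels, predictions, and “urban”/“rural” groups are made up for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([1, 0, 1, 1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 0, 1, 1])
group  = np.array(["urban"] * 4 + ["rural"] * 4)

# Compute F1 separately for each subgroup; a large gap signals unfairness.
for g in np.unique(group):
    mask = group == g
    print(g, f1_score(y_true[mask], y_pred[mask]))
```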

🧪 So What’s the Solution?

Use interpretability tools such as SHAP or LIME.

These tools help us answer:

“Why did the model make this prediction?”
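As a minimal sketch (using the small diabetes dataset that ships with scikit-learn, just so the example runs end to end), SHAP can show which features a tree model actually relies on:

```python
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# A small built-in dataset about disease progression, used here only as a stand-in.
data = load_diabetes(as_frame=True)
X, y = data.data, data.target

model = RandomForestRegressor(random_state=0).fit(X, y)

# SHAP assigns each feature a contribution to every individual prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# The summary plot shows which features the model leans on overall.
shap.summary_plot(shap_values, X)
```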

🙌 Final Thought

Machine Learning doesn’t lie.
But if we teach it the wrong lessons, it will confidently make the wrong decisions.

So let’s not just aim for high accuracy.

Let’s build models that are fair, robust, representative, and explainable.

✨ Because understanding how a model thinks is just as important as what it predicts.

🚀 Explore My Project

If you’re curious to see what I’ve been working on, feel free to explore the following project: