When Dee talked about the “human black box” with pre-trained patterns, I couldn’t help but think about how closely that parallels the machine learning process. Just as humans have multiple interconnected factors influencing their decisions, ML models have their version of this complexity.
So, what is Machine Learning?
It is a subset of AI that allows machines to learn from past data (or historical data) and then make predictions or decisions on new data records without being explicitly programmed for every possible scenario.
With this said, some of the more common ML “scenarios” are:
- Forecasting or Regression (e.g., predicting house prices)
- Classification (e.g., labelling images of cats and dogs)
- Clustering (e.g., finding groups of customers by analyzing their shopping habits)
- Anomaly Detection (e.g., finding outliers in your transactions for fraud analysis)
Or, to exemplify these scenarios with our human cognitive daily tasks, we also predict (e.g., will it rain today?), classify (e.g., is that a friend or stranger?), and detect anomalies (e.g., the cheese that went bad in our fridge). The difference lies in how we process these tasks and which inputs or data we have (e.g., the presence of clouds vs. a bright, clear sky).
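For a concrete taste of one of these scenarios, here is a minimal anomaly-detection sketch in plain Python; the transaction amounts and the two-standard-deviation threshold are invented purely for illustration:

```python
import statistics

# Toy anomaly detection: flag transactions far from the mean.
# A threshold of 2 standard deviations is chosen for this tiny
# example; real fraud systems use far more robust methods.
def find_anomalies(amounts, z_threshold=2.0):
    mean = statistics.mean(amounts)
    stdev = statistics.stdev(amounts)
    return [a for a in amounts if abs(a - mean) / stdev > z_threshold]

transactions = [12.5, 9.9, 11.0, 10.4, 950.0, 10.8, 12.1, 9.5]
print(find_anomalies(transactions))  # [950.0]
```

The same shape of problem, at a vastly different scale, is what a production anomaly-detection model solves.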
So, data (and its quality) is always at the core of producing quality model outcomes from the above scenarios.
Data: The Core “Input”
Similar to humans, who gather multimodal sensory inputs from various sources (e.g., videos from YouTube, music from the radio, blog posts from Medium, financial records from Excel sheets, etc.), ML models rely on data that can be:
- Structured (like rows in a spreadsheet)
- Semi-structured (JSON, XML files)
- Unstructured (images, PDF documents, free-form text, audio, etc.)
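To make the first two categories tangible, here is a small sketch using only Python’s standard library; the sample records are made up:

```python
import csv
import io
import json

# Structured: fixed rows and columns, like a spreadsheet export.
csv_text = "name,age\nAna,34\nBen,29\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: nested fields with no rigid schema.
json_text = '{"user": "Ana", "tags": ["ml", "data"], "meta": {"active": true}}'
record = json.loads(json_text)

print(rows[0]["age"])            # "34" (CSV values arrive as strings)
print(record["meta"]["active"])  # True
```

Unstructured data (images, audio, free text) needs heavier tooling to turn into model-ready features, which is part of why it dominates preparation effort.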
Because data fuels every insight an ML model produces, we (data professionals) spend a substantial amount of time preparing it — often cited as 50–70% of the overall ML project effort.
This preparation phase gives ML models a taste of the “filtering and pre-processing” that humans do naturally.
We look for outliers, handle missing values and duplicates, remove unnecessary inputs (features), or create new ones.
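As a rough illustration of those preparation steps, here is a plain-Python sketch on an invented housing dataset; the field names and values are hypothetical:

```python
# Minimal data-cleaning sketch on a list of dicts.
raw = [
    {"id": 1, "price": 250_000, "rooms": 3, "noise_col": "x"},
    {"id": 1, "price": 250_000, "rooms": 3, "noise_col": "x"},  # duplicate
    {"id": 2, "price": None,    "rooms": 2, "noise_col": "y"},  # missing price
    {"id": 3, "price": 310_000, "rooms": 4, "noise_col": "z"},
]

# 1. Drop duplicates (keyed by id here, for illustration).
seen, deduped = set(), []
for row in raw:
    if row["id"] not in seen:
        seen.add(row["id"])
        deduped.append(row)

# 2. Impute missing prices with the mean of the known prices.
known = [r["price"] for r in deduped if r["price"] is not None]
mean_price = sum(known) / len(known)
for r in deduped:
    if r["price"] is None:
        r["price"] = mean_price

# 3. Drop an unnecessary feature and create a new one (price per room).
clean = [
    {"id": r["id"], "price": r["price"], "rooms": r["rooms"],
     "price_per_room": r["price"] / r["rooms"]}
    for r in deduped
]
```

In practice, libraries like pandas do this in a few vectorized calls, but the logic is the same.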
Beyond the tasks listed above, we can also “tune” the data inputs. Remember how Dee mentioned factors being “thicker” or “thinner”? In ML, we achieve something similar through feature engineering and weight assignments, though in a fully mathematical way.
In summary, we are “organizing” the data inputs so the model can “learn” from clean, high-quality data, yielding more reliable model outputs.
Modelling: Training and Testing
While humans can learn and adapt their “factor weights” through deliberate practices, as Dee described, ML models have a similarly structured learning process.
Once our data is in good shape, we feed it into ML algorithms (like neural networks, decision trees, or ensemble methods).
In a typical supervised learning setup, the algorithm sees examples labelled with the correct answers (like a thousand images labelled “cat” or “dog”).
It then adjusts its internal weights — its version of “importance factors” — to match (predict) those labels as accurately as possible. In other words, the trained model might assign a probability score indicating how likely each new image is to be a “cat” or a “dog”, based on the learned patterns.
This is where ML is more “straightforward” than the human mind: the model’s outputs come from a defined process of summing up weighted inputs, while humans shuffle around multiple factors — like hormones, subconscious biases, or immediate physical needs — making our internal process far less transparent.
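To show how “summing up weighted inputs” can produce such a probability score, here is a toy scoring function; the feature names, weights, and bias are invented, not from any real trained model:

```python
import math

# Hypothetical weights a trained model might have learned
# for two made-up image features.
weights = {"whisker_length": 1.8, "ear_pointiness": 1.1}
bias = -2.0

def cat_probability(features):
    # Weighted sum of inputs, squashed into [0, 1] with a sigmoid.
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-z))

p = cat_probability({"whisker_length": 1.5, "ear_pointiness": 1.0})
print(round(p, 3))  # 0.858
```

Every input’s contribution is explicit in that one sum, which is exactly the transparency the human “black box” lacks.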
So, the two core phases in model building are:
- Training: The model is shown the labelled data. It “learns” patterns linking inputs (image features, for example) to outputs (the correct pet label).
- Testing: We evaluate the model on new, unseen data (new images of cats and dogs) to gauge how well it generalizes. If it consistently mislabels certain images, we might tweak parameters or gather more training examples to improve the accuracy of generated outputs.
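The two phases above start with one mechanical step: holding out unseen data. A minimal sketch, where the 80/20 ratio is a common convention and the labelled examples are placeholders:

```python
import random

# Sketch of a train/test split: hold out unseen data to
# gauge how well the model generalizes.
random.seed(42)  # reproducible shuffle for this example
labelled = [(f"img_{i}", "cat" if i % 2 else "dog") for i in range(100)]

random.shuffle(labelled)
split = int(len(labelled) * 0.8)  # common 80/20 split
train, test = labelled[:split], labelled[split:]

print(len(train), len(test))  # 80 20
```

The key property is that no example appears in both sets — otherwise the “test” score just measures memorization.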
As it all comes back to the data, it’s relevant to mention that there can be more to the modelling part, especially if we have “imbalanced data.”
For example: if the training set has 5,000 dog images but only 1,000 cat images, the model might lean toward predicting dogs more often — unless we apply special techniques to address the “imbalance”. But this is a story that would call for a fully new post.
The point is that the number of examples for each possible outcome (“cat” or “dog”) in the input dataset influences both the complexity of the model’s training process and the accuracy of its outputs.
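One common remedy for such imbalance is to weight each class inversely to its frequency, so rare classes count more during training. Using the dog/cat counts from the example above (the “balanced” scheme shown here mirrors scikit-learn’s class_weight="balanced"):

```python
# Weight classes inversely to frequency: total / (n_classes * count).
counts = {"dog": 5000, "cat": 1000}
total = sum(counts.values())
n_classes = len(counts)

class_weights = {label: total / (n_classes * c) for label, c in counts.items()}
print(class_weights)  # {'dog': 0.6, 'cat': 3.0}
```

Each cat example now carries five times the weight of a dog example, offsetting the five-to-one imbalance in the data.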
Ongoing Adjustments and the Human Factor
However, despite its apparent straightforwardness, an ML pipeline isn’t “fire-and-forget”.
When the model’s predictions start drifting off track (maybe because new data has changed the scenario), we retrain and fine-tune the system.
Again, the data professionals behind the scenes need to decide how to clean or enrich data and re-tune the model parameters to improve model performance metrics.
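A deliberately crude drift check might look like the sketch below; the price values and the 20% tolerance are illustrative assumptions, not a standard:

```python
import statistics

# Crude drift check: compare a feature's recent mean to its
# training-time mean. The 20% tolerance is arbitrary, chosen
# only to make the example concrete.
def mean_drifted(training_values, recent_values, tolerance=0.20):
    base = statistics.mean(training_values)
    recent = statistics.mean(recent_values)
    return abs(recent - base) / abs(base) > tolerance

training_prices = [250_000, 260_000, 255_000, 248_000]
recent_prices = [330_000, 345_000, 338_000]  # the market has shifted

print(mean_drifted(training_prices, recent_prices))  # True
```

When a check like this fires, it is a human who decides whether to retrain, gather new data, or re-engineer features.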
That’s the “re-learning” in machine learning.
This is important because bias and errors in data or models can ripple through to flawed outputs and have real-life consequences. For instance, a credit-scoring model trained on biased historical data might systematically lower scores for certain demographic groups, leading to unfair denial of loans or financial opportunities.
In essence, humans still drive the feedback loop that improves a machine’s training, shaping how the ML/AI model evolves and “behaves”.