Socratic argument against using Validation Sets
Validation datasets for model selection
There is no real utility in flatly refusing to do model selection via validation splits, but know that there is a meta-purpose to taking a blowtorch to the concept…
Socrates, a father of Western philosophy and a major influence on Stoicism, left a legacy in his approach to gaining knowledge. Rather than bringing new ideas, he challenged conventional ones, much as the scientific method does. The Socratic Method was questioning that pushed to the corners of someone’s knowledge until they walked away embarrassed, second-guessing whether they were ever an expert.
A disciplined structure of questioning can:
- Acknowledge contradictions.
- Follow out the logical consequences of our thoughts, the implications of what we think.
- Distinguish what we know from what we don’t.
- Probe the extent of our knowledge.
Different questions have different focuses:
- To clarify thinking and explore origins:
- “Why, but why, but why?”
- To challenge assumptions:
- “Is this always the case?”
- To probe the evidence that forms the basis of an argument:
- “Is there any reason to doubt this argument?”
- To discover alternative viewpoints and perspectives, and conflicts between contentions:
- “What is the counter-argument? Did anyone see this another way?”
- To explore implications and consequences:
- “How does that affect X? But what else would result?”
So, stay and listen, and discover the extent of the merit in the prescription “Always use validation data”.
Socrates questions his student on why they use validation splits for model selection
Pupil: Data should be separated out and used as validation for model selection.
Socrates: Why does that help model selection?
Pupil: A model that was optimised on the training set, but still does well on the validation set, is likely to do better on the population.
Socrates: Why is the population important?
Pupil: We are trying to understand the population the sample is generated from, because it will generate new sample data again in an identical process.
Socrates: Identically? What about the stock market? Is the population that created your sample the same one generating data points in 1990?
Pupil: That’s different.
Socrates: OK. So by how much is it different, for your ‘model selection’ challenge?
Pupil: I don’t know precisely.
Socrates: Precisely?
Pupil: The 1990 stock market has foundational principles that are the same as today’s, but some relationships have changed since then. The split between those, I can’t say.
A caveat to the rule: validation splitting should be used for model selection to the extent that the population which generated our sample resembles the one we will be predicting on. We’re not sure how different tomorrow’s sample is from today’s population.
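Before probing the rule further, it helps to pin down what it looks like in practice. Below is a minimal sketch of model selection via a validation split, assuming a scikit-learn style workflow; the synthetic data, candidate models and hyperparameters are illustrative only.

```python
# Minimal sketch of "use a validation split to select a model".
# Assumes scikit-learn; the data and candidates are purely illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=500)

# Carve out 20% of the sample as validation data.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit each candidate on the training split, score it on the validation split.
val_mse = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_mse[alpha] = mean_squared_error(y_val, model.predict(X_val))

# Select the candidate with the lowest validation error.
best_alpha = min(val_mse, key=val_mse.get)
print(f"selected alpha={best_alpha}, validation MSE={val_mse[best_alpha]:.3f}")
```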
Socrates: If we strictly stick to this selection method, what other situations could arise?
Pupil: The technique can reject a model even when it would perform best on tomorrow’s sample.
Socrates: Why so?
Pupil: The validation data might contain patterns that also occur in tomorrow’s sample. The model won’t learn them if that data is hidden from training.
Socrates: What else is true, if validation data has unique patterns in it?
Pupil: The training data would have low entropy; it would be lacking information.
Validation reducing performance: if the data sample has a lot of noise, the random split could, by chance, remove patterns from training that are instructive for the prediction task.
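A small simulation can illustrate that caveat; it is only a sketch under strong assumptions (a tiny, noisy, linear sample), not a general claim. It compares the population error of a model fit on the whole sample with one fit on 80% of it, the remainder having been held back for validation.

```python
# Hedged sketch: with a small, noisy sample, carving 20% out for validation
# can cost more (lost training signal) than the selection step buys back.
# The linear data-generating process and sizes are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
TRUE_COEF = np.arange(1, 6)  # the "population" relationship

def population_mse(model, n=10_000, noise=2.0):
    X = rng.normal(size=(n, 5))
    y = X @ TRUE_COEF + rng.normal(scale=noise, size=n)
    return mean_squared_error(y, model.predict(X))

full_errors, split_errors = [], []
for _ in range(200):
    X = rng.normal(size=(40, 5))                       # tiny, noisy sample
    y = X @ TRUE_COEF + rng.normal(scale=2.0, size=40)

    full = LinearRegression().fit(X, y)                # train on everything
    split = LinearRegression().fit(X[:32], y[:32])     # hold 8 rows back for validation

    full_errors.append(population_mse(full))
    split_errors.append(population_mse(split))

print(f"mean population MSE, all 40 rows:  {np.mean(full_errors):.3f}")
print(f"mean population MSE, 32-row train: {np.mean(split_errors):.3f}")
```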
Socrates: How does the selection process affect model outcomes?
Pupil: It affects predictions.
Socrates: All predictions? Are there some predictions that are affected differently?
Pupil: Not all predictions, only those made before the next model fitting or model selection step.
There is a tradeoff between the cost of data removal and the value of additional data: when the dataset’s entropy drops, or the selected model only predicts for a short period, the performance uplift from model selection via a validation dataset is diluted.
Socrates: If the population does remain the same for tomorrow’s prediction task, are there situations where splitting validation data out could have no impact on population prediction?
Pupil: Yes, there is one. When the validation data has nothing useful in it, compared to the training data, for helping to select a model that predicts the population better.
Socrates: If this were not the case, are there situations which might still give this perception of the two datasets?
Pupil: Perhaps models that have a low ability to express nuances in the data.
If a model cannot perform any worse on the validation set than it does on the training set, the process is not helpful. A model whose assumptions give it little ability to express nuances found only in the training set may limit the impact the validation selection process has on choosing which model goes on to make population predictions.
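A hedged sketch of that situation: a heavily constrained model (here an intercept-only DummyRegressor) can barely do worse on validation than on training, so the split adds little to the choice; a flexible model is where the split earns its keep. The models and data are illustrative assumptions.

```python
# Hedged sketch: a model with strong assumptions scores almost the same on
# training and validation, so a validation split tells us little extra about it.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=300)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=7)

for name, model in [("intercept only", DummyRegressor()),
                    ("deep tree", DecisionTreeRegressor(random_state=7))]:
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"{name:14s} train MSE={train_mse:.2f}  validation MSE={val_mse:.2f}")
```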
Questioning assumptions, inverting an argument, hypothesising other conclusions which must also hold, can help to discover more about a topic. This conversational style of questioning can be powerful. If you can’t find a team who embraces this, you can always talk with Socrates.
No two ML tasks are the same; you must think from first principles
I recently had a machine learning task where thinking from first principles led me to question the rule of “Always use validation data to select models”. Through this, I learnt much more deeply about optimising the efficiency of a dataset.
As Socrates helped his pupil discover, there were trade-offs at play:
- Information lost from the training set, traded for prediction accuracy
- The model bias-variance tradeoff
- Automation via models, to add variation to the data history for the sake of prototyping and learning, versus reverting to rules-based automation due to poor validation performance caused by a small number of observations in the dataset.
The task used datasets from partially Markovian causal structures; they changed over time, so the population that had been sampled from was, in theory, different from the validation dataset. Because thousands of models needed to be rebuilt daily and algorithmically, edge cases caused by hard rules would come up.
Model validation has its reasons deeply rooted in elegant theory, but machine learning practice isn’t always so elegant. A model rejected because its only remaining feature, on a very new data-generating process, had a p-value of 0.0500001 could push the automation back to a rules-based model.
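That edge case reads roughly like the hypothetical rebuild step below. The statsmodels fit, the hard 0.05 rule and the rules-based fallback are assumptions for illustration, not the actual pipeline.

```python
# Hedged sketch of an automated rebuild step that applies a hard p-value rule
# to a single remaining feature and falls back to a rules-based default when
# the p-value misses the cut. Threshold and fallback are illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=25)                       # a very new, very short series
y = 0.4 * x + rng.normal(scale=1.0, size=25)

fit = sm.OLS(y, sm.add_constant(x)).fit()
p_value = fit.pvalues[1]                      # p-value of the only feature

P_THRESHOLD = 0.05                            # hard rule baked into the pipeline
if p_value <= P_THRESHOLD:
    chosen = "fitted model"
else:
    chosen = "rules-based fallback"           # e.g. last known value, or a fixed rate

print(f"p-value = {p_value:.7f} -> {chosen}")
```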
This might all seem like an argument against validation sets - it is… kind of. While it’s certainly my stance to use validation, the Socratic method has helped to probe the extent of the idea.
Even if you come to a situation where model selection via a validation dataset has a weaker-than-normal rationale, remember:
An argument against something isn’t an argument for the alternative.
Deciding not to use validation data, even with all the arguments in the world, doesn’t by itself permit deploying the model. More conditions are necessary to ensure the model does the best it can on the data it hasn’t seen yet:
- Develop models in line with the theoretical understanding of the DGP / causal structures of the ML environment.
- Write rules that check these sensibilities in the model, e.g. “an increasing customer age shouldn’t predict a decreasing salary level” (a sketch of such checks follows this list).
- Should some coefficients have a large influence relative to the other variables?
- Are there data sources with an anomalous amount of missing data?
- If the data is an aggregate, should each observation be weighted in proportion to the number of samples in that aggregate?
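As flagged in the list above, here is a hedged sketch of what such post-fit sanity checks might look like; the feature names, expected signs and thresholds are illustrative assumptions rather than a prescribed set.

```python
# Hedged sketch of post-fit sanity checks; feature names, expected signs and
# thresholds are illustrative assumptions only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
feature_names = ["customer_age", "tenure_months", "region_code"]
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

model = LinearRegression().fit(X, y)
coefs = dict(zip(feature_names, model.coef_))

problems = []

# 1. Coefficient signs should match domain knowledge:
#    an increasing customer age shouldn't predict a decreasing salary level.
expected_signs = {"customer_age": +1}
for feature, sign in expected_signs.items():
    if np.sign(coefs[feature]) != sign:
        problems.append(f"{feature}: coefficient sign contradicts domain knowledge")

# 2. No single coefficient should dwarf the others on comparably scaled inputs.
magnitudes = np.abs(model.coef_)
if magnitudes.max() > 10 * np.median(magnitudes):
    problems.append("one coefficient dominates all others")

# 3. No feature should arrive with an anomalous amount of missing data.
missing_fraction = np.mean(np.isnan(X), axis=0)
for feature, frac in zip(feature_names, missing_fraction):
    if frac > 0.2:
        problems.append(f"{feature}: {frac:.0%} of values are missing")

print(problems if problems else "model passed the sanity checks")
```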
One more mental model I’ve found useful here is to think of “use a validation dataset to select the model” as a map to guide your actions:
The map is not the territory. A map’s value falls when the world changes. A map can’t represent every detail of the section of the world you are in, or it would be too large to carry with you. Question the cartographer: they had a purpose when drawing the map. https://fs.blog/2015/11/map-and-territory/