Leakage Loophole
Emily's model had the best performance in the entire class, but there was a problem: the results were too good to be correct. The professor decided to review her process step by step. Emily stood in front of the class and wrote down her process:
- Loaded the entire dataset in memory.
- Replaced one column's missing values with the mean of the column and scaled another column using Min-Max Scaling.
- Split the dataset into a train and a test set.
- And finally, trained the model.
As soon as she finished, the professor knew what the issue was.
Which of the following is the reason for the model's inflated performance?