As far as I can tell, this is "let's train a huge number of models and then cherry-pick a few that work well on a test set", i.e. overfitted junk. What have I missed?
The training set is not labeled, so that is not possible; and they do not mention that the labeled set was split or used for validation -- it is just called "test".
"We followed the experimental protocols specified by (Deng et al., 2010; Sanchez & Perronnin, 2011), in which, the datasets are randomly split into two halves for training and validation. We report the performance on the validation set and compare against state-of-the-art baselines in Table 2. Note that the splits are not identical to previous work but validation set performances vary slightly across different splits."
As I understand it, this applies only to the side experiment with ImageNet data, which fits logistic regression on those neurons' activations in a somewhat cryptic way; I was trying to comprehend the core work (faces) before getting to that.
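For what it's worth, the protocol the quote describes is a fairly standard linear-probe evaluation: freeze the learned features, randomly split the labeled set in half, train a logistic regression classifier on one half, and report accuracy on the other. A minimal sketch, using synthetic stand-in features rather than the paper's actual activations (all data and dimensions here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 1000, 64, 10

# Synthetic "neuron activations" with some class-dependent structure,
# standing in for the frozen features of the unsupervised network.
centers = rng.normal(size=(n_classes, n_features))
y = rng.integers(0, n_classes, size=n_samples)
X = centers[y] + rng.normal(scale=2.0, size=(n_samples, n_features))

# Random 50/50 split of the labeled set into training and validation halves,
# as in the quoted protocol.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Train a linear classifier (logistic regression) on the frozen features
# and report performance on the held-out validation half.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"validation accuracy: {clf.score(X_val, y_val):.3f}")
```

Since the split is random, rerunning with a different `random_state` gives slightly different validation numbers, which is presumably what the quote's last sentence is acknowledging.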