Final testing
"The top 10 competitors according to the provisional scores will be invited to the final testing round. The details of the final testing are described in a separate document."

It's the second MM with that scoring mechanism. I think it's bad. For instance, in the Spectrum Predictor one there is a significant chance of overfitting. However, you cannot choose the solution you consider best (in terms of stability) if it places outside the top 10 on the provisional data. It effectively forces you to overfit, which is bad.
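[Editor's note: a minimal Monte Carlo sketch of the concern above, under an assumed toy model (uniform true skill, Gaussian scoring noise); the parameters and model are invented for illustration, not from the thread. A small provisional test set means a noisy provisional score, which can push the truly best solution out of the provisional top 10.]

```python
# Toy model (assumed, not from the contest): each competitor has a true skill,
# and the provisional score is that skill plus noise whose size reflects how
# small/noisy the provisional test set is.
import random

random.seed(1)

def simulate(n_competitors=50, noise=0.2, trials=2000):
    """Fraction of trials where the truly best competitor misses the provisional top 10."""
    misses = 0
    for _ in range(trials):
        true_skill = [random.random() for _ in range(n_competitors)]
        provisional = [s + random.gauss(0, noise) for s in true_skill]
        best = max(range(n_competitors), key=lambda i: true_skill[i])
        top10 = sorted(range(n_competitors), key=lambda i: provisional[i],
                       reverse=True)[:10]
        if best not in top10:
            misses += 1
    return misses / trials

# Larger `noise` (i.e. a smaller provisional set) makes the miss rate climb.
print(simulate(noise=0.05), simulate(noise=0.4))
```

With low noise the best solution essentially never misses the top 10; with high noise it frequently does, which is the "forced to overfit" situation described above.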

What's the rationale behind the mechanism?
Re: Final testing (response to post by Mloody2000)
I had the same thoughts. But what are the other options? I mean, given these conditions:

- you don't want to release the test data to the contestants,
- you don't want to invite all the contestants to the final testing round,

you can hardly come up with a better mechanism.

Maybe one way to improve it would be to allow you, in the final tests, to select not just your last submission but any submission you made. You would still be forced to overfit (to reach the top 10 in provisional), but at least for the final tests you would be able to select a better solution (still one prepared before the deadline).

Another option would be to split the provisional data, so that the provisional ranking would be based on one part of the data while the invitation to the final tests would be based on the other part. But that would imply three different rankings: Provisional 1 (with overfit), Provisional 2 (without overfit) and Final (based on unseen data). The Marathon platform currently does not support that, so something would have to be done manually. Another drawback is that when the provisional data set is not very big, it is not a good idea to split it.
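[Editor's note: a hedged sketch of the split idea above; the function names, data layout, and the even/odd split rule are invented for illustration. Each submission's per-test-case scores are divided into two disjoint halves, giving a public leaderboard score and a separate finalist-selection score.]

```python
# Assumed representation: per-test-case scores for each submission.
def split_scores(case_scores):
    """Return (public_score, selection_score) from even/odd halves of the cases."""
    public = case_scores[0::2]      # half shown on the leaderboard ("Provisional 1")
    selection = case_scores[1::2]   # hidden half used to pick finalists ("Provisional 2")
    return sum(public) / len(public), sum(selection) / len(selection)

def rankings(all_scores):
    """all_scores: {handle: per-case scores}. Return the two rankings."""
    pub = {h: split_scores(s)[0] for h, s in all_scores.items()}
    sel = {h: split_scores(s)[1] for h, s in all_scores.items()}
    by_score = lambda d: sorted(d, key=d.get, reverse=True)
    return by_score(pub), by_score(sel)

# A submission tuned hard to the public half can lose on the hidden half.
demo = {"alice": [0.9, 0.5, 0.9, 0.5], "bob": [0.6, 0.8, 0.6, 0.8]}
public_rank, selection_rank = rankings(demo)
```

The demo shows the drawback-free case the split is meant to catch: "alice" tops the public leaderboard while "bob" would be the one invited to the final round.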
Re: Final testing (response to post by Mloody2000)
Thanks for your concern; these are valid points. The rationale behind this mechanism is exactly what nofto said: we can't invite everyone to the final testing round. Running the tests for only 10 contestants is already a significant effort in terms of time and money. Fortunately, in most competitions there is a good correlation between the provisional and final ranks; this is almost always the case when plenty of training and testing data is available. In those cases, overfitting is exactly what you should do in order to perform well. I think (I can't promise, take it as a guess based on experience) that this will also be the case in fMoW.

"It's the second MM with that scoring mechanism." I know of at least 8 recent contests where this process was used, and I never had the feeling that we missed a winner because they were not invited to the final round.

However, there are contests where we see a significant shake-up of the leaderboard from provisional to final results, and I agree with you that SpectrumPredictor can easily be one of them. We can't change the rules of that contest now, but we are happy to hear your ideas on how the final testing process could be improved in the future to avoid the overfitting scenario you described. One simple idea would be to let you choose two submissions instead of one, with the one that performs better on the final tests counting for the final ranking. But even this doubles the effort needed for final testing. Nofto's idea that you could simply select a different past submission is also a good one.
Re: Final testing (response to post by walrus71)
That would indeed be a good solution: the ability to select one submission for provisional scoring and another for final testing.

I don't actually have any specific idea for now, beyond noting that the mechanism doesn't seem optimal in small-data competitions like SpectrumPredictor. From what I'm reading, in this one we will have plenty of data :)