
The possibility cannot be ruled out. A few weeks back I posted on this same blog a discussion of recent work we did analyzing the impact of natural noise on user ratings. That is, when users give us feedback through ratings, they are adding background noise. There are many possible reasons for this. In some cases the user does not really see a difference between rating an item a 3 or a 4. Other times, the user is not being careful enough when giving the feedback and is letting other factors affect the result. Ratings are affected by things like how long ago the item was consumed, which item was rated just before, or even the mood the user is in.
But, whatever the reason, the result is that we have data with noise and/or errors. If we take a random rating and ask the user "What was your rating for item X?" we will inevitably get errors. Uhm... so even the user makes errors when recalling her own ratings? Yes! As a matter of fact, we can easily measure this error by asking her to rate the same items several times (see, again, my previous post on this).
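To make this concrete, here is a minimal Python sketch of that measurement, with made-up repeated ratings standing in for a real user trial; the RMSE between a user's two rating sessions is an empirical estimate of the noise floor:

```python
import numpy as np

# Made-up test-retest data: each position is one (user, item) pair rated in two
# separate sessions, a few weeks apart. Real numbers would come from a user trial.
first_pass  = np.array([4, 3, 5, 2, 4, 1, 3, 5, 2, 4], dtype=float)
second_pass = np.array([3, 3, 5, 3, 4, 2, 3, 4, 2, 5], dtype=float)

# RMSE of the users against their own earlier ratings: an empirical estimate
# of the natural-noise floor, i.e. the "magic barrier".
natural_noise_rmse = np.sqrt(np.mean((first_pass - second_pass) ** 2))
print(f"test-retest RMSE: {natural_noise_rmse:.2f}")   # ~0.71 with these toy numbers
```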
Now, here is a rule of thumb: we cannot predict a user's ratings any better than that same user can reproduce her own. This natural noise threshold therefore sets a "magic barrier": it makes no sense to try to push our prediction error below it. What difference does it make that our system is very good at predicting that some item should "be" a 3 if the user does not really see the difference between (or is not sure about) a 2 and a 3 for that item? Or, the other way around: how can we predict a 3 with no error if the user "randomly" moved between 2, 3, and 4 when giving us feedback?
So, returning to the initial issue, the question now is: is the Netflix Prize threshold below this "magic barrier"? Do they even know? Well, a member of the leading team KorBell and I had an informal conversation with Netflix's VP of Personalization. Of course, they cannot say much about the prize, lest they give away information vital to winning it. However, when we asked him whether he had any idea of what this "magic barrier" is on the Netflix Prize dataset, he answered that they did a small study to estimate something similar. Their estimate was "around 0.5". That is surely non-negligible, but it is safely below the winning threshold of 0.83. Remember, however, that this was only a small study that gave them a rough estimate. Our measurements on a similar dataset yielded RMSE values between 0.57 and 0.82. Although these values depend on several variables, such as the time between ratings or even how items are presented to the user, we have reasons to believe the Netflix dataset should be on the higher end of this range (if not higher!). Read more in our UMAP article.
As a final appendix, let me throw in two important conclusions that point to future directions. First, it is clear that the RMSE measure should be reconsidered. If, on average, the user does not know the difference between a 2 and a 3, we should not penalize that difference in our success measure. Top-N measures seem much more suitable as a measure of success in Recommender Systems: the user might not care about or see a difference between a 2 and a 3, but she will surely be disappointed if we recommend something she would rate with a 1.
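As a rough illustration of the kind of Top-N measure I have in mind (the function, threshold, and numbers below are invented for the example), precision-at-N only asks whether the items we put at the top of the list are items the user actually likes:

```python
import numpy as np

def precision_at_n(predicted_scores, true_ratings, n=5, relevant=4):
    """Fraction of the top-n recommended items the user actually rates >= `relevant`."""
    top_n = np.argsort(predicted_scores)[::-1][:n]
    return float(np.mean(true_ratings[top_n] >= relevant))

# Invented example: ten candidate items for one user.
true_ratings = np.array([5, 1, 4, 2, 3, 5, 2, 4, 1, 3])
predicted    = np.array([4.6, 2.1, 4.2, 2.5, 3.0, 4.8, 1.9, 3.9, 1.5, 3.2])

# Predicting 2.1 for an item the user rates 1 costs us RMSE, but costs nothing
# here as long as that item never makes it into the recommended list.
print(precision_at_n(predicted, true_ratings, n=3))   # 1.0: the top 3 are all loved
```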
Another strategy is to select only those users who are more consistent and use them to generate recommendations for the target user. If the target user is noisy herself, we will still get lousy recommendations for her, but we will be minimizing errors for everyone else. This is the approach we took in our Wisdom of the Few work. Finally, although we cannot aim at getting results below the "magic barrier", there is something we can do: lower the barrier itself. In work we have under submission, we devised a "denoising" algorithm that improves accuracy by almost 15% by lowering this noise threshold. But I will leave this for a future post, once the paper is hopefully accepted.
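For concreteness, here is a minimal sketch of the consistency-filtering idea, with invented repeated ratings and an arbitrary threshold; the actual approach is described in the Wisdom of the Few paper:

```python
import numpy as np

def consistency_rmse(session_a, session_b):
    """RMSE between two rating sessions of the same user; lower means more consistent."""
    a, b = np.asarray(session_a, dtype=float), np.asarray(session_b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Hypothetical repeated ratings over the same four items for three users.
users = {
    "alice": ([4, 3, 5, 2], [4, 3, 5, 2]),   # perfectly consistent
    "bob":   ([4, 3, 5, 2], [3, 3, 4, 2]),   # mildly noisy
    "carol": ([4, 3, 5, 2], [1, 5, 2, 4]),   # very noisy
}

threshold = 0.75                              # arbitrary cut-off for illustration
consistent = [name for name, sessions in users.items()
              if consistency_rmse(*sessions) <= threshold]
print(consistent)   # ['alice', 'bob']: only these users would feed the recommender
```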
Interesting point, Xavier. Perhaps Netflix should also score (informally) against fresh test data for each submission and release the RMSE along with the tally that counts for the contest.
But would this be bad press for the prize?
Yes - I agree that rating scales leave a lot to be desired. A much better approach would be to use a paired-comparison method; that way you get a built-in error score together with the actual ranking of the various films (or whatever).
Nevertheless - the Netflix target was inspired. It's too close to tell whether it can ever be achieved - my hunch is that someone will get there eventually.
Just a guy in a garage
I am sure somebody will get there eventually... there is no doubt about that! The question is whether that will be achieved by over-fitting or whether the solution will be somewhat generalizable.
Xavier, the scores we get back from the netflixprize are from the Quiz set, which can be overfit. But the Grand Prize is only won if 10% is reached on the Test set. There is no way to overfit on the Test set because it is a true hold-out set; participants get no information about it. So the Progress Prizes and the Grand Prize (should it be won) represent real progress that is generalizable to similar data.
@CTV
Even if it is a true hold-out set, participants can overfit to it... if it is static (i.e. always the same). Given enough submissions and results from the test set, you can do a regression and adapt your parameters to fit that curve. Is that right, or am I missing something?
I think CTV is wrong - the contest ends when someone hits the 10% on the quiz set and the winner is whoever then has the best result on the test set.
My experience with running SVD-type algorithms on the Netflix prize suggests that, leaving out adjustments for time factors, the limit on the RMSE for prediction is in the range 0.7 to 0.75 for those users who have over 100 ratings. The challenge in the Netflix prize is the users who have fewer than 100 ratings.
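For readers who have not played with these models, here is a minimal sketch of the kind of SVD-style latent-factor predictor the comment above refers to, trained with plain SGD on a toy matrix; the quoted 0.7-0.75 range obviously requires the real Netflix data and users with many ratings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ratings matrix (0 = unobserved); nothing like the Netflix data, just enough
# to show the shape of an SVD-style latent-factor model.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

n_users, n_items = R.shape
k, lr, reg, epochs = 2, 0.01, 0.05, 2000      # latent factors and hyperparameters (arbitrary)

P = 0.1 * rng.standard_normal((n_users, k))   # user factors
Q = 0.1 * rng.standard_normal((n_items, k))   # item factors
observed = [(u, i) for u in range(n_users) for i in range(n_items) if R[u, i] > 0]

for _ in range(epochs):
    for u, i in observed:
        err = R[u, i] - P[u] @ Q[i]
        pu = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * pu - reg * Q[i])

errors = [R[u, i] - P[u] @ Q[i] for u, i in observed]
print(f"RMSE on observed ratings: {np.sqrt(np.mean(np.square(errors))):.3f}")
```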
A crucial insight is that noise is the deviation from the model. Thus, if effects of item aging, changes in user pickiness, changes in the types of movies a user likes, etc., are incorporated, a user changing her opinion from 2 to 3 to 4 might involve less noise than if the same user rated the item 3 three times in a row. It all depends on the model.
What I'm saying is that one modeller's noise is another's signal. The main hypothesis of the blog entry is therefore simplistic in its arguments.
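A tiny sketch of the point being made here, with invented numbers: the same rating sequence looks like noise under a static-taste model and like signal under a drift model.

```python
import numpy as np

# The same user rates the same item three times: 2, then 3, then 4.
ratings = np.array([2.0, 3.0, 4.0])
t = np.array([0.0, 1.0, 2.0])                       # times of the three ratings

# Model A: static taste -> predict the user's mean rating.
rmse_static = np.sqrt(np.mean((ratings - ratings.mean()) ** 2))

# Model B: taste drifts linearly over time (least-squares fit).
slope, intercept = np.polyfit(t, ratings, deg=1)
rmse_drift = np.sqrt(np.mean((ratings - (slope * t + intercept)) ** 2))

print(f"static model RMSE: {rmse_static:.2f}")      # ~0.82: looks like pure noise
print(f"drift  model RMSE: {rmse_drift:.2f}")       # ~0.00: the 'noise' was signal
```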
@anderssjogren Sorry if you think the blog post seems simplistic but... have you read the "I like it... I like it not" paper? What you mention is the difference between stability and reliability, and I am taking that into account in my discussion. The blog post is a "popular" adaptation of what we found in our research. If you want the hard science, please read the paper.
@Xavier: You cannot overfit to the Test set, because you get no information about it. The RMSE returned by Netflix is only on the Quiz set, which is a separate data set (although from the same distribution) from the Test set. If you overfit to the Quiz, you will not do better on the Test.
@absalon: The contest is won when a team gets 10% on the Test set, not the Quiz set. The rules are clear on this (trust me, I've read through them quite a bit :)
Nonetheless, it is clear that the winning methods developed for this competition, albeit good ones, are really only "best" for this extremely idealized case of the Netflix prize. In real world cases with messier data and cold starts, all bets are off!
@CTV I guess I am not making myself clear. I am also participating in the challenge, so I know the rules and I know the difference between the training and the quiz set; it's not about that.
Here is my point: you don't need to be able to "inspect" a dataset to overfit to it; you just need to know what results you get when applying your algorithm. Given enough (algorithm settings, results) pairs you can overfit to it, without ever actually *looking* at the data!
The only way Netflix avoids overfitting to the Quiz dataset is by limiting the number of submissions to "one a day" per team. If you could send one submission a second, for instance, you would have enough results to overfit to the data.
Just to add to my previous comment: the quiz set has 2.8 million ratings you have to predict. Given that there are 5 possible values for each one of them, this gives you roughly 1E2000000 possible combinations. That is the maximum number of submissions you would need to make to win the prize.
However, we are all much smarter than that. And given that for each entry you receive feedback on how well you are doing, you can reduce the number of possible submissions by many, many orders of magnitude!
In any case, given that this particular topic seems to deserve more explanation I will try to clarify in a future post.
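For the record, here is the back-of-the-envelope arithmetic behind the number quoted in the comment above; only the 2.8 million figure and the five star values come from the thread, the rest is log arithmetic:

```python
import math

# Number of ways to label 2.8 million quiz ratings with five possible star values.
n_ratings = 2_800_000
digits = n_ratings * math.log10(5)     # log10(5 ** 2_800_000)
print(f"about 10^{digits:,.0f} possible labelings")   # roughly 10^1,957,000, i.e. ~1E2000000
```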
Xavier - you need to be more careful about the definitions of the qualifying, quiz and test data sets.
CTV - you are right and I was wrong about the Rules.
@Absalon You are absolutely right. I used the terms a bit carelessly and without paying much attention to the Netflix wording (see here). The basic issue is that the "qualifying" set is split into "test" and "quiz" subsets. The RMSE you get on a submission is computed on the "quiz" subset, while the "test" subset remains unknown. Also, the assignment of ratings to test and quiz is unknown.
So I guess this does prevent the simple regression I described in my previous example. However, if both quiz and test have "similar" statistical properties, overfitting could still happen just the same, right?
No Xavier, I don't think that "overfitting" would occur.
I do believe that there is already some implicit training against the quiz results, and there may be some overfitting going on there. The test of this will be whether there is a divergence between the quiz and test results. We may get evidence of that if the leaderboard shows someone with a better-than-10% improvement on the quiz set while Netflix has not announced that anyone has beaten the 10% threshold on the test set.
Xavier, indeed, Netflix thought very hard about these issues and this is why they came up with the very smart decision to have a Test set. If you can overfit to the Test set, I do not know how.
One interesting point: the 30-day period is triggered by someone getting 10% on the QUIZ set (as shown on the leaderboard), but the contest is over only when someone gets 10% on the TEST set. Soooo, it is possible for the 30-day period to trigger but, at the end, Netflix declares no winner because no one has gotten 10% on the TEST set.
In some sense, this has a non-trivial probability since undoubtedly all teams are overfitting to the Quiz set to some extent....
@CTV So I guess you are right: even if you can figure out which ratings are part of the test set (by seeing which ones do not make a difference in the RMSE feedback), it would be hard to overfit to it.
Now I wonder: is the RMSE on the test set checked for the Progress Prizes too, or only for the final prize?
All Prizes are given out based on Test set performance.
You can still overfit to the test set; trying different models on a test set IS, by definition, overfitting. If you want no overfitting, you have to do variable selection, model selection and tuning on a training set, test on the test set, and accept the results once. By submitting 100 times, you test 100 models on the test set -> overfitting.
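A small simulation of this selection effect, with entirely synthetic data: pick whichever of many random model tweaks scores best on a fixed quiz split, and its quiz RMSE will typically flatter what the same model does on a disjoint test split.

```python
import numpy as np

rng = np.random.default_rng(42)

def rmse(pred, truth):
    return np.sqrt(np.mean((pred - truth) ** 2))

# Hypothetical setup: a fixed "quiz" split and a fixed "test" split, plus one base
# predictor. Each "submission" is the base predictor with a fresh random tweak; we
# keep whichever tweak scores best on the quiz feedback, mimicking the leaderboard.
n = 1_000
true_quiz = rng.integers(1, 6, n).astype(float)
true_test = rng.integers(1, 6, n).astype(float)
base_quiz = true_quiz + rng.normal(0, 1.0, n)    # an imperfect base model
base_test = true_test + rng.normal(0, 1.0, n)

best_quiz, test_of_best = np.inf, None
for _ in range(500):                             # 500 "submissions"
    quiz_score = rmse(base_quiz + rng.normal(0, 0.3, n), true_quiz)
    if quiz_score < best_quiz:
        best_quiz = quiz_score
        test_of_best = rmse(base_test + rng.normal(0, 0.3, n), true_test)

print(f"best quiz RMSE:          {best_quiz:.3f}")
print(f"same model on test set:  {test_of_best:.3f}")   # typically worse: selection bias
```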
Nice conversation. Even ignoring whether the data can be overfit or whether the ratings are objective, does the question Netflix is asking even get to the core of the "recommendations problem"? Will the improved algorithm actually translate into a better user experience on the Netflix site and further their business goals? At MediaUnbound we are doing a "Countdown to 10%" article series focusing on the underlying assumptions and the conclusions we can draw from the exercise. You can find them here.