If you are participating in the Netflix Prize, don't worry... This post is not about the economic crisis, Netflix filing for bankruptcy, and the prize not being paid. It is in fact about a much "scarier" prospect: what if there was no way to get below the error threshold set in the competition? Or, more precisely, what if the only way to get below that threshold was actually overfitting to the existing training and test datasets?


The possibility cannot be ruled out. A few weeks back I posted on this same blog about recent work we did analyzing the impact of natural noise on user ratings. That is, when users give us feedback through ratings, they are adding background noise. There are many possible reasons for this. In some cases the user does not really see a difference between rating an item with a 3 or a 4. Other times, the user is not being careful enough when giving the feedback and is letting other factors affect the result. Ratings are affected by things like how long ago the item was consumed, which item was rated just before, or even the mood the user is in.

But, whatever the reason, the result is that we have data with noise and/or errors. If we took a random rating and asked the user "What was your rating for item X?", we would inevitably get errors. Uhm... so even the user makes errors when recalling her own ratings? Yes! As a matter of fact, we can easily measure this error by asking her to rate the same items several times (see, again, my previous post on this).

Now, here is a rule of thumb: we cannot predict a user's ratings any better than the user herself can reproduce them. This natural noise level therefore sets a "magic barrier": trying to go below this error in our predictions makes no sense. What difference does it make that our system is very good at predicting that some item should "be" a 3 if the user does not really see the difference between (or is not sure about) a 2 and a 3 for that item? Or, the other way around: how can we predict a 3 with no error if the user "randomly" moved between 2, 3, and 4 when giving us feedback?
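
To make the idea concrete, here is a minimal sketch of how such a natural-noise floor could be estimated: ask users to re-rate items some time later and compute the RMSE between the two passes. The data and function names below are made up for illustration; this is not the exact protocol from our study.

```python
import math

# Hypothetical toy data: (user, item) -> rating, collected in two separate
# rating passes some time apart.
first_pass = {("u1", "i1"): 3, ("u1", "i2"): 4, ("u2", "i1"): 2, ("u2", "i3"): 5}
second_pass = {("u1", "i1"): 4, ("u1", "i2"): 4, ("u2", "i1"): 3, ("u2", "i3"): 5}

def natural_noise_rmse(pass_a, pass_b):
    """RMSE between two rating passes over the (user, item) pairs present in both.

    This is the error users make against themselves, so no algorithm should be
    expected to predict them more accurately than this on average.
    """
    common = pass_a.keys() & pass_b.keys()
    squared_errors = [(pass_a[k] - pass_b[k]) ** 2 for k in common]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(natural_noise_rmse(first_pass, second_pass))  # about 0.71 on this toy data
```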

So, returning to the initial issue, the question now is: is the Netflix Prize threshold below this "magic barrier"? Do they even know? Well, a member of the leading team (Korbell) and I had an informal conversation with Netflix's VP for Personalization. Of course, they cannot say much about the prize, in case they gave away information vital to winning it. However, when we asked him whether he had any estimate of what this "magic barrier" might be on the Netflix Prize dataset, he answered that they had done a small study to estimate something similar. Their estimate was "around 0.5". That is surely non-negligible, but it sits safely below the winning threshold of 0.83. Remember, though, that this was only a small study that gave them a rough estimate. Our measurements on a similar dataset yielded RMSE values between 0.57 and 0.82. Although these values depend on several variables, such as the time between ratings or even how items are presented to the user, we have reasons to believe the Netflix dataset should be on the higher end of this range (if not higher!). Read more in our UMAP article.

As a final appendix, let me throw in two important conclusions that point to future directions. First, it is clear that the RMSE measure should be reconsidered. If, on average, the user does not know the difference between a 2 and a 3, our success measure should not penalize that difference. Top-N measures seem much more suitable as a measure of success in Recommender Systems: the user might not care about or see a difference between a 2 and a 3, but she will surely be disappointed if we recommend something she would rate a 1.
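
As a toy illustration of what a top-N success measure looks like (a generic precision@N sketch, not the specific metric from any of the papers mentioned here):

```python
def precision_at_n(recommended, relevant, n=10):
    """Fraction of the top-n recommended items that the user actually liked.

    `recommended` is a ranked list of item ids; `relevant` is the set of items
    the user rated highly (say, 4 or 5). Both are hypothetical inputs.
    """
    top_n = recommended[:n]
    if not top_n:
        return 0.0
    hits = sum(1 for item in top_n if item in relevant)
    return hits / len(top_n)

# Example: 3 of the top 5 recommendations are items the user rated 4 or 5.
print(precision_at_n(["a", "b", "c", "d", "e"], {"a", "c", "e", "z"}, n=5))  # 0.6
```
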
Another strategy is to select only the users who are most consistent and use them to generate recommendations for the target user. If the target user is noisy herself, we will still get lousy recommendations, but we will be minimizing errors for everyone else. This is the approach we took in our Wisdom of the Few work. Finally, although we cannot aim at getting results below the "magic barrier", there is something we can do: lower that barrier. In work we have under submission, we devised a "denoising" algorithm that improves accuracy by up to almost 15% by lowering this noise threshold. But I will leave this for a future post, once the paper is hopefully accepted.
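
To give a flavor of the "consistent users" idea, here is a rough sketch (with invented data and an arbitrary threshold, not the actual algorithm from the Wisdom of the Few paper or from the denoising work under submission): score each user's self-consistency from repeated ratings and keep only the most consistent ones as the source of recommendations.

```python
import math

# Hypothetical re-rating data: user -> list of (first_rating, second_rating)
# pairs for items that user rated twice.
re_ratings = {
    "u1": [(3, 3), (4, 4), (2, 2)],          # very consistent
    "u2": [(2, 4), (5, 3), (3, 1)],          # very noisy
    "u3": [(4, 5), (3, 3), (2, 2), (5, 5)],  # mostly consistent
}

def self_rmse(pairs):
    """Per-user RMSE between first and second ratings (lower = more consistent)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))

def consistent_users(data, max_rmse=0.5):
    """Keep only users whose self-RMSE is below an (arbitrary) threshold."""
    return [user for user, pairs in data.items() if self_rmse(pairs) <= max_rmse]

print(consistent_users(re_ratings))  # ['u1', 'u3'] on this toy data
```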

19 comments:

  1. Interesting point, Xavier. Perhaps Netflix should also score (informally) against fresh test data for each submission and release the RMSE along with the tally that counts for the contest.

    But would this be bad press for the prize?

  2. Just a guy in a garage (7:34 AM)

    Yes - I agree that rating scales leave a lot to be desired. A much better approach would be to use a paired-comparison method (that way you get a built-in error score together with the actual ranking of the various films, or whatever).

    Nevertheless - the Netflix target was inspired. It's too close to tell whether it can ever be achieved - my hunch is that someone will get there eventually.

    Just a guy in a garage

  3. I am sure somebody will get there eventually... there is no doubt about that! The question is whether that will be achieved by overfitting or whether the solution will be somewhat generalizable.

  4. Xavier, the scores we get back from the Netflix Prize are from the Quiz set, which can be overfit. But the Grand Prize is only won if 10% is reached on the Test set. There is no way to overfit on the Test set because it is a true hold-out set; participants get no information about it. So the Progress Prizes and the Grand Prize (should it be won) represent real progress that is generalizable to similar data.

  5. @CTV

    Even if it is a true hold-out set, participants can overfit to it... if it is static (i.e. always the same). Given enough submissions and results from the test set, you can do a regression and adapt your parameters to fit that curve. Is that right, or am I missing something?

  6. I think CTV is wrong - the contest ends when someone hits the 10% on the quiz set and the winner is whoever then has the best result on the test set.

    My experience with running SVD-type algorithms on the Netflix Prize data suggests that, leaving out adjustments for time factors, the limit on the RMSE for prediction is in the range 0.7 to 0.75 for those users who have over 100 ratings. The challenge in the Netflix Prize is the users who have fewer than 100 ratings.

  7. A crucial insight is that noise is the deviation from the model. Thus, if effects of item aging, changes in user pickiness, changes in the types of movies a user likes, etc., are incorporated, a user changing her opinion from 2 to 3 to 4 might involve less noise than if the same user rated the item 3 three times in a row. It all depends on the model.


    What I'm saying is that one modeller's noise is another's signal. The main hypothesis of the blog entry is therefore simplistic in its arguments.

  8. @anderssjogren Sorry if you think the blog post seems simplistic, but... have you actually read the "I like it... I like it not" paper? What you mention is the difference between stability and reliability, and I do take that into account here. The blog post is a "popular" adaptation of what we found in our research. If you want the hard science, please read the paper.

  9. @Xavier: You cannot overfit to the Test set, because you get no information about it. The RMSE returned by netflix is only on the Quiz Set, which is a separate data set (although from the same distribution) from the Test Set. If you overfit to the Quiz, you will not do better on the Test.

    @absalon: The contest is won when a team gets 10% on the Test set, not the Quiz set. The rules are clear on this (trust me, I've read through them quite a bit :)

    Nonetheless, it is clear that the winning methods developed for this competition, albeit good ones, are really only "best" for this extremely idealized case of the Netflix prize. In real world cases with messier data and cold starts, all bets are off!

  10. @CTV I guess I am not making myself clear. I am also participating in the challenge, so I know the rules and I know the difference between the training and the quiz sets; it's not about that.

    Here is my point: You don't need to be able to "inspect" a dataset to overfit to it, you just need to know what results you get when applying your algorithm. Given enough pairs of (algorithm settings, results) you can overfit to it, you don't need to actually *look* at the data!

    The only way Netflix avoids overfitting to the Quiz dataset is by limiting the number of submissions to "one a day" per team. If you could send one submission a second, for instance, you would have enough results to overfit to the data.

  11. Just to add to my previous comment. The quiz set has 2.8 million ratings you have to predict. Given that there are 5 possible values for each one of them, this gives you 5^2,800,000 (roughly 10^2,000,000) possible combinations. That is the maximum number of submissions you would need to make to win the prize.

    However, we are all much smarter than that. And given that for each entry you receive feedback on how well you are doing, you can reduce the number of required submissions by many, many orders of magnitude!

    In any case, given that this particular topic seems to deserve more explanation I will try to clarify in a future post.

  12. Xavier - you need to be more careful about the definitions of the qualifying, quiz and test data sets.

    CTV - you are right and I was wrong about the Rules.

  13. @Absalon You are absolutely right. I used the terms a bit carelessly and without paying much attention to the Netflix wording (see here). The basic issue is that the "qualifying" set is split into "test" and "quiz". The RMSE you get back for a submission is computed on the "quiz" part, while the "test" part remains unknown. Also, which ratings belong to test and which to quiz is not disclosed.

    So I guess this does avoid simple regression as I was saying in my previous example. However, if both quiz and test have "similar" statistical properties, overfitting could still happen just the same, right?

  14. No Xavier, I don't think that "overfitting" would occur.

    I do believe that there is already some implicit training against the quiz result and there may be some overfitting going on there. The test of this will be if there is a divergence between the quiz and test results. We may get evidence of that if the leaderboard shows someone having a better than 10% improvement on the quiz set but Netflix has not announced that anyone has beaten the 10% threshold on the test set.

  15. Xavier, indeed, Netflix thought very hard about these issues and this is why they came up with the very smart decision to have a Test set. If you can overfit to the Test set, I do not know how.

    One interesting point: The 30-day period is triggered by someone getting 10% on the QUIZ set (as shown on the leaderboard), but the contest is over only when one gets 10% on the TEST set. Soooo, it will be possible for the 30 day period to trigger, but at the end, Netflix declares no winner because no one has gotten 10% on the TEST set.

    In some sense, this has a non-trivial probability since undoubtedly all teams are overfitting to the Quiz set to some extent....

  16. @CTV So I guess you are right, even if you can figure out which ratings are part of the test set (by seeing which ones do not make a difference on the RMSE feedback) it would be hard to overfit to it.

    Now I wonder, is the RMSE on the test set checked on the progress prize also or only on the final prize?

  17. All Prizes are given out based on Test Set performance.

  18. You can still overfit to the test set; trying different models on a test set IS by definition overfitting. If you want no overfitting, you have to do variable selection, model selection, and tuning on a training set, test once on the test set, and accept the results. By submitting 100 times, you test 100 models on the test set -> overfitting.

  19. Nice conversation. Even ignoring whether the data can be overfit or the ratings are objective, does the question Netflix is asking even get to the core of the "recommendations problem"? Will the improved algorithm actually translate to a better user experience on the Netflix site and further their business goals? At MediaUnbound we are doing a Countdown to 10% article series focusing on the underlying assumptions and conclusions we can draw from the exercise. You can find them here.

