
Oct 26, 2009

Recsys 09


Last week I attended the 2009 ACM Conference on Recommender Systems, Recsys09 for short. The conference took place at New York University's Stern School of Business and was organized by Alex Tuzhilin. This was the 3rd edition of a conference that is very special to me, for several reasons: it is the main conference in the area on which I am focusing my research, and I am co-chairing next year's edition in Barcelona. The area of recommender systems also has a special attraction because it brings together people with backgrounds as different as HCI, Marketing, Data Mining, Information Retrieval, and Mathematics. Add an extremely strong representation from industry, including many companies you won't easily see at other conferences, from Netflix to Autodesk and a great number of start-ups, and you have an explosive cocktail. People in the audience who rave when they see a formula that cannot fit into one slide mix with senior committee members who propose to automatically reject papers that use the Greek alphabet.

The conference has been growing steadily over the past few years. It started out of a workshop organized in Bilbao by Strands. The first edition was then held in Minneapolis, home to the Movielens group, which could also be considered the birthplace of the area as a whole. Then it went off to EPFL and finally, this year, to NY. The numbers are astonishing for a conference as young (and presumably focused) as this one: more than 280 attendees and an acceptance rate of 19% make it look almost like a first-tier conference.

If you want to get a good idea of what went on during the conference, I recommend you take a look at the tweets hashed with #recsys09. And if you want a really quick idea of what the core topics were, look at the beautiful tag cloud below, generated from the tweets by Barry Smyth. In the next paragraphs I will briefly highlight what I think were the most important ideas discussed during the conference.


The first day, we had 3 very interesting tutorials. These tutorials had the great virtue of already setting what would be 3 of the most important topics during the conference: Social Recommendations and Trust, Algorithms, and the Netflix Prize.

In the first tutorial, Jennifer Golbeck did an awesome job of introducing the field of Trust-based Recommendations and explaining its challenges. The tutorial was extremely interactive, with many questions and comments from the audience. It is true that the idea of trust is also one that very easily leads to passionate debates and opinions. The area of trust and social-based recommendations appeared again and again during the conference. There was a whole session devoted to it in the main track (or 2 if we include the one on tags and Social Networks) and a workshop on the last day. Interestingly enough, though, I did hear relevant people from the industry say that they did not believe social recommendations to be of any practical use. I don't really know what to make of that, though.

The second tutorial was more of a traditional, classical lecture on Bayesian Methods. Bayesian Methods are the most popular (but not the only) approach to model-based recommendations. They have two main advantages: they allow for the use of nice probabilistic formalisms, and they make it possible to infer knowledge from the resulting model. However, latent models based on Matrix Factorization have proved to be more reliable and, in principle, they also make it possible to infer knowledge from the latent variables. During the conference there were 2 different sessions on algorithms, which were dominated by different approaches to hybridizing recommendations and by improvements over pre-existing collaborative filtering methods. Among the latter, I should mention the Best Paper winner, Benjamin Marlin. His paper proves that missing data (i.e. items that have not been rated) cannot be considered random, and he introduces a way of taking some non-random effects into account. I found the conclusions of the paper not very striking, but the approach and scope of the idea are. And Marlin deserves the award for being the first to point to this issue, and also for all his great work in the area in general.

The last tutorial on day 1, which started a thread of its own, was a discussion on the lessons learned from the Netflix Prize. It was a very, very interesting discussion in which some of the issues I mentioned in my previous blog post were brought up. For instance, I asked about the goodness of RMSE as a success measure. Everybody agrees that the only way to really evaluate a recommender is to do A/B tests on a real system, but you cannot do this in an unsupervised way such as the contest. However, I insisted on the possibility of using other measures such as top-N related ones (e.g. nDCG). The (not very convincing) answer came from the participants: it would be much harder to optimize algorithms for top-N measures than for the much simpler RMSE. The Netflix Prize came up now and again during the conference, especially since it was finally awarded recently. For instance, there was a very provocative paper by one of the participating teams proving that metadata is useless. This has stirred a heated discussion on whether that means that content-based approaches are useless altogether. The simple answer: NO. They are useless in the very specific case of the Netflix competition and dataset, with RMSE as the success measure. Content-based approaches (and hybrids) are here to stay and need much more research.

The last thread that also started on the very first day was the industrial one. As I mentioned before, company presence in Recsys is very relevant. This year it was kicked off by a panel where Netflix and Yahoo discussed the 8 challenges of the Recommender Systems field. The panel was extremely interesting because John Riedl did a great job of conducting it and of getting the two industry participants to prepare it for weeks. To summarize, the challenges were: transparency, exploration, navigation, time value, user action interpretation, evaluation, scalability, and the relation between academia and industry. The next industrial activity in the program was Francisco Marin's keynote where, instead of the challenges, he talked about the 10 lessons learned during his years of experience. It was a brilliant keynote that impacted many people (especially some students who were then deciding to change the orientation of their PhD). In Francisco's vision the algorithm is only 5% of the Recommender, while the most important part is the User Interface, which should take around 50% of the resources. But if you want an excellent summary of this keynote, take a look at Neal Lathia's reconstruction from tweets. The last activity worth mentioning from this industrial thread was the Industry Workshop on the last day. It was organized by Marc Torrens (the other co-chair of next year's conference) and it attracted more than 45 people from industry.

A final thread that did not start on the first day was the application-related one. There was an applications session that was something of a miscellany, but in which Jill Freyne presented a very interesting and well-delivered paper on the effect of people recommendation on social networks. In this application thread I should include some of the very interesting posters in the poster session, with applications that went all the way from a source code recommender by Karatzoglou and Weimer to IPTV or mobile tourist recommender systems.

Another very interesting thing left out of these 5 threads was the Workshop on Context-aware Recommender Systems, where I presented some of our preliminary work on time-dependent music recommendation.

As a final personal promotion note, I should say that my paper was probably an interesting oddball in the conference. It was the only paper that addressed the issue of data quality and user feedback and the impact it has on the recommendations. It made it really tough for the organizers to decide what session it should belong to, so I ended up presenting in the Trust session. But my impression was that it was very well received and that it opens up a whole new avenue of future research in the field. Here you can check the slides I used during the presentation.

Overall, a great conference. And although the bar was set very, very high, we hope to exceed expectations in our 2010 Recsys in Barcelona. Hope to see everyone there!

(Btw, this is a very personal overview. Feel free to leave yours in the form of comments, and let me know if there is any mistake or misinterpretation.)

Sep 29, 2009

The Netflix Prize: Lessons Learned

Some time ago I published a post with the title "What if there is no Million $" in which I discussed the possibility that the Netflix prize had no solution. A few weeks later, two teams beat the 10% threshold that entitled them to the grand prize. Bellkor's Pragmatic Chaos beat The Ensemble in a photo finish, only because they sent their solution 20 minutes earlier.


If you want more details on how it all happened, I recommend you start by reading the web pages of the two top teams. The Ensemble (runner-up) has an awesome website with lots of information on the prize. And the Bellkor's Pragmatic Chaos website also gives inside information on the winners' road to the million.
And if you want the nasty technical details, the three teams that merged into the winning Bellkor's Pragmatic Chaos have published their solution. Here you will find a description of Pragmatic Theory's solution. Here is Big Chaos'. And here is the already well-known Bellkor approach.

After the competition ended, there have been countless reactions and discussions about it. See, for example, what Gavin Potter (a.k.a. Guy in the Garage), one of the Netflix Prize stars, has to say in his blog.

So, I will add my voice to this choir of reactions to the prize. In this post I will try to summarize what, from my humble personal perspective, have been the biggest lessons learned from the prize. Take this as a warm-up for the panel entitled "What did we learn from the Netflix Prize?" that we will attend in NY next week.

1. RMSE is not a valid success measure

Whether RMSE was a valid success measure for a recommender system was discussed very soon after the prize started. As a matter of fact, this discussion is even older in academic circles.

The fact of the matter is that RMSE is not a valid success measure for several reasons. The ultimate one is that there is no direct correlation between this measure and end-user satisfaction with the recommendations. However, using more HCI-ish measures related to user satisfaction is out of the question in the context of a prize such as Netflix's. It would be nice to see, though, some post-mortem in which at least the winning approach was used in a user study and compared to the original system. Hopefully, we should see a significant increase in user satisfaction with the recommendations... but, to be honest, I am not all that sure.

Once we have ruled out user-study related measures, is RMSE the best we can do? Well, I (and many others) think that there are much better mathematical measures that correlate with the actual goal of a Recommender System. See, the main problem with RMSE is that it gives the same weight to the error you make by predicting a 2 where it should have been a 1 as to the one you make when predicting a 4 instead of a 5. But you would never recommend an item with a 2!
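To make the point concrete, here is a minimal sketch in Python. The ratings are made up for illustration and the `rmse` helper is my own, not anything from the contest tooling. It shows that a one-star miss at the bottom of the scale costs exactly as much as a one-star miss at the top.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings: the same one-star miss at both ends of the scale.
low_end = rmse([2.0], [1.0])   # predicted a 2, true rating was a 1
high_end = rmse([4.0], [5.0])  # predicted a 4, true rating was a 5

print(low_end, high_end)
```

Both errors come out identical, even though only the second one affects an item that a recommender would ever show to the user.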

I particularly like Top-N measures such as Precision and Recall of "recommendable" items (either fixing N or, better, defining a recommendable threshold). An even more precise measure is the so-called Normalized Discounted Cumulative Gain (NDCG) where item order in the recommendation list is taken into account.
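As a rough illustration of these top-N measures, the sketch below computes precision at N with a "recommendable" threshold and a common formulation of NDCG that uses the raw rating as the gain. The function names, the threshold of 4 stars and the toy ratings are assumptions of mine; other gain and discount choices exist.

```python
import math

def precision_at_n(ranked_items, true_ratings, n, threshold=4.0):
    """Fraction of the top-n recommended items whose true rating is recommendable."""
    top = ranked_items[:n]
    hits = sum(1 for item in top if true_ratings.get(item, 0.0) >= threshold)
    return hits / n

def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg_at_n(ranked_items, true_ratings, n):
    """DCG of the recommended order, normalized by the best achievable DCG."""
    gains = [true_ratings.get(item, 0.0) for item in ranked_items[:n]]
    ideal = sorted(true_ratings.values(), reverse=True)[:n]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical true ratings for one user and two candidate rankings.
true_ratings = {"a": 5.0, "b": 4.0, "c": 2.0, "d": 1.0}
good_ranking = ["a", "b", "c", "d"]
bad_ranking = ["c", "d", "a", "b"]

print(precision_at_n(good_ranking, true_ratings, 2), ndcg_at_n(good_ranking, true_ratings, 3))
print(precision_at_n(bad_ranking, true_ratings, 2), ndcg_at_n(bad_ranking, true_ratings, 3))
```

Unlike RMSE, both measures only look at what ends up near the top of the list, and NDCG additionally rewards putting the best items first.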

2. Time matters

Another interesting finding is the importance of modeling the temporal evolution of user preferences. The fact that I liked Meatballs in 1979 does not mean that I will like Meatballs 4 now. As a matter of fact, it does not even mean that I still like the original one now. This is what we call stability in rating theory. Yehuda Koren, of the winning team, has a very interesting publication on the topic. Neal Lathia's latest publications also have interesting insights on the temporal evolution of collaborative filtering systems.

Now, it turns out that, just as you can model the importance of time, you can also take into account many other different factors. As a matter of fact, this is what the Bellkor team calls the "factor model". Again, let me point you to a publication by Koren to learn more about this (actually, now that we are at it, you might want to take a look at all of Koren's latest publications, most of which are very relevant for our discussion).

3. Matrix Factorization methods work best

Probably, the single event that marked a turning point in the Netflix competition was Simon Funk's publication of the SVD solution. Since then, many teams turned to SVD-like solutions. Matrix Factorization is the family of methods, so to say, that includes particular implementations such as SVD but also many others like non-negative Matrix Factorization, Maximum Margin Matrix Factorization, and so on. Again, the latest publication from Koren gives a gentle introduction to Factor models in this context. The papers from the 2007 KDD Cup are also a good source of information on Factor Models in the context of the Prize, since it was then that these approaches were probably thought of as the ultimate solution.
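For readers who have never seen one of these models, here is a minimal sketch of a plain matrix factorization trained with stochastic gradient descent, in the spirit of those SVD-like solutions. It is not Simon Funk's exact procedure nor any team's code: the learning rate, regularization, number of factors and the toy ratings are all assumptions chosen for illustration.

```python
import math
import random

def factorize(ratings, n_users, n_items, n_factors=2, lr=0.01, reg=0.02,
              epochs=500, seed=0):
    """Learn user factors P and item factors Q so that dot(P[u], Q[i]) ~ rating."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.01, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(0.01, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on the squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Predicted rating: dot product of user and item factor vectors."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))

def train_rmse(P, Q, ratings):
    """RMSE of the model on the ratings it was trained on."""
    return math.sqrt(sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings))

# Hypothetical (user, item, rating) triples on a 1-5 scale.
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P, Q = factorize(toy_ratings, n_users=3, n_items=3)
print(train_rmse(P, Q, toy_ratings))
```

Only the observed ratings are visited during training, which is what makes this family of methods cheap on very sparse data such as the Netflix matrix.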

Factor models are great since they can accomplish slightly better results than standard neighbor-based methods, they offer some sort of insight on the problem and, above all, they can be implemented in a much more efficient way. However, I am still to be convinced that, in isolation, they are the best method in a general case.

4. One method is not enough (nor 100)

Or, in other words, given any prediction method, it usually pays off more to add a new one than to improve the existing one (if I remember correctly, some member of the winning team said something similar in an interview).

So, yes, as sad as it may seem, there is no magical solution to the Netflix Prize. Factor models sort of work but, alone, they wouldn't get you the million $. As a matter of fact, you need many, many methods combined to reach that number. The problem with this approach is that the resulting algorithm is as close to a black box as it can get. No more insights, no more knowledge learned from it: millions of parameters that self-adjust to fit into the solution.
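The simplest possible version of "combining methods" can be sketched as a weighted average of two predictors, with the weight picked on held-out ratings. The real Netflix blends combined hundreds of predictors with far more elaborate schemes, so treat this only as an illustration; the predictions and ratings below are made up.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def blend(preds_a, preds_b, w):
    """Weighted average of two predictors' outputs; w is the weight of predictor A."""
    return [w * a + (1.0 - w) * b for a, b in zip(preds_a, preds_b)]

def best_blend_weight(preds_a, preds_b, actual, steps=100):
    """Grid-search the weight in [0, 1] that minimizes RMSE on held-out ratings."""
    candidates = [k / steps for k in range(steps + 1)]
    return min(candidates, key=lambda w: rmse(blend(preds_a, preds_b, w), actual))

# Hypothetical held-out ratings and the outputs of two different predictors.
actual = [5.0, 3.0, 4.0, 1.0]
model_a = [4.5, 3.5, 3.0, 2.0]
model_b = [5.5, 2.0, 4.5, 1.0]

w = best_blend_weight(model_a, model_b, actual)
blended_error = rmse(blend(model_a, model_b, w), actual)
print(w, blended_error, rmse(model_a, actual), rmse(model_b, actual))
```

Because the grid includes the weights 0 and 1, the blend can never do worse on the held-out set than the better of the two predictors, which is why adding one more method is such a tempting move.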

Don't get me wrong, this is an outstanding accomplishment from an engineering perspective. But it limits the scientific insight on the problem. It also raises the question of how portable the solution is to other domains and even datasets. I would like Netflix to pick 500K different users and 17K new movies and report the error that the system makes on them.

5. The importance of data and noise

So, to finish this list, let me bring the discussion home. As I have discussed in a talk at Boston University (and in a previous post), given a problem such as the one posed by Netflix you have two options: limit yourself to the existing data and bang your head trying to improve the algorithm, or try to improve the data itself. Of course, in the context of the prize, improving the data was not easy (although some tried to add content information to the movie titles without success). But in a realistic setting there are many ways this could be feasible and much, much more efficient in terms of resources and results.

Take the approach to removing noise we propose in our upcoming paper, for instance. As we present in that paper there are improvements above 10% that can be accomplished by simply asking some users to re-rate some items. In another paper we also found that simply re-ordering the items to rate reduced inconsistent ratings and therefore helped in predicting recommendations.

----

As a final note, I think that the Netflix Prize has left more questions than answers while putting the spotlight on Recommender Systems research. This is of course great news for us researchers in the area. We can only hope that the 2nd edition of the prize, already announced, will bring more glory to the field :-)

Let me also congratulate the winners, runners-up, organizers and other participants. In case it was not clear: you did an awesome job!

Please post your other lessons learned as comments to this post.

Sep 21, 2009

Towards Context-aware Recommendations

The goal of a Recommender System is to model the users' preferences in order to recommend new items that the user is likely to find of interest. However, we know that user preferences are influenced by contextual conditions, such as the time of the day, mood, or current activity, but this type of information is not exploited by standard models. Context-aware recommender systems (CARS) aim at improving user satisfaction with recommendations by tailoring these to each particular context.

Context-aware recommendation is a hot research topic since it bridges the gap between recommender systems and other areas of research such as ubiquitous computing. However, research on this topic is still in its first stages and there is a lot to be done. If you want some background reading, I would recommend starting with the work of Gedas Adomavicius and Alex Tuzhilin.

I have been meaning to work on contextual recommendations for some time and this summer I had the perfect opportunity since Linas Baltrunas, a student of Francesco Ricci working on this topic for his PhD, has been collaborating with us in the lab.

In this first approach to contextual recommendations we have tried to tackle the issue of time-dependent music recommendation. That is, designing a recommendation algorithm that can recommend music that is not only personalized but also fits better the current time of the day, day of the week, or season of the year. Our initial assumption is, of course, that music taste depends on those variables (i.e. you don't listen to the same kind of music on Saturday evening as on Monday morning).

There are a couple of things about this use case that make it especially interesting (and difficult) when compared to previous work. First, music preference modeling is done through implicit feedback. That is, users don't tell you explicitly what they like or don't like; they simply listen more to some music than to other music. Converting that to a user preference model has some issues of its own, especially if you need to take into account contextual variables such as time. Second, time is a continuous context variable. All previous work on contextual user modeling and recommendation is done using discrete variables such as who you are with or whether you are at work or at home. This raises another issue, since there is a need to segment the data before building the context-aware preference model.
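To give an idea of what segmenting a continuous time variable looks like, here is a naive sketch that buckets listening events into four fixed day parts and counts plays per artist in each one. The segment boundaries, the function names and the listening log are all hypothetical; this is not the segmentation used in our paper, where choosing the segments is part of the problem.

```python
from collections import Counter, defaultdict
from datetime import datetime

def time_segment(ts):
    """Map a timestamp to a coarse, hand-picked time-of-day segment."""
    if 6 <= ts.hour < 12:
        return "morning"
    if 12 <= ts.hour < 18:
        return "afternoon"
    if 18 <= ts.hour < 24:
        return "evening"
    return "night"

def play_counts_by_segment(listens):
    """Turn implicit feedback (timestamp, artist) into per-segment play counts."""
    counts = defaultdict(Counter)
    for ts, artist in listens:
        counts[time_segment(ts)][artist] += 1
    return counts

# Hypothetical listening log for a single user.
listens = [
    (datetime(2009, 9, 19, 21, 0), "Artist A"),
    (datetime(2009, 9, 19, 22, 30), "Artist A"),
    (datetime(2009, 9, 21, 8, 15), "Artist B"),
    (datetime(2009, 9, 21, 8, 45), "Artist B"),
    (datetime(2009, 9, 21, 9, 5), "Artist A"),
]

counts = play_counts_by_segment(listens)
print(counts["evening"].most_common(1), counts["morning"].most_common(1))
```

Each segment then gets its own implicit preference profile, which a recommender could use instead of one global profile per user.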


We are publishing our very first results in the CARS workshop during the 2009 Recsys Conference, in a couple of weeks in NY. Here you have the paper. Comments and suggestions are welcome!

Aug 5, 2009

Rate it Again

A few weeks back, I described our work on trying to measure Natural Noise in user feedback. This was motivated by a study we recently published in the UMAP conference. One of the possible solutions that I mentioned then was to address the noise issue directly by trying to apply denoising algorithms to the user data.

Well, that is exactly what we have done in a paper that has been accepted for publication in Recsys09 (NY). The paper is entitled "Rate It Again" and you can access a preprint copy here. The basic idea in our approach is to ask users to re-rate items that they already rated in the past. We can then denoise ratings that prove to be inconsistent by minimizing their contribution to the recommendation process.
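The basic idea can be sketched in a few lines of Python. Note that this is a deliberately crude version: it simply drops ratings whose re-rating differs by more than an arbitrary gap, whereas the paper minimizes the contribution of inconsistent ratings in its own way. The `max_gap` parameter, the function name and the toy data are assumptions for illustration only.

```python
def denoise(original, rerated, max_gap=1.0):
    """Keep a rating unless its re-rating disagrees by more than max_gap.

    Both arguments map (user, item) pairs to ratings. Ratings that were never
    re-rated are kept untouched, since there is no evidence against them.
    """
    cleaned = {}
    for key, rating in original.items():
        second = rerated.get(key)
        if second is None or abs(rating - second) <= max_gap:
            cleaned[key] = rating
    return cleaned

# Hypothetical data: u1 re-rates two movies, u2 re-rates nothing.
original = {("u1", "m1"): 5.0, ("u1", "m2"): 1.0, ("u2", "m1"): 4.0}
rerated = {("u1", "m1"): 4.0, ("u1", "m2"): 4.0}

cleaned = denoise(original, rerated)
print(sorted(cleaned))
```

The cleaned ratings can then be fed to any recommendation algorithm in place of the original ones.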

The biggest practical issue with the approach is that we don't want all users to have to re-rate all items in order to identify which ones to denoise. That is why in that same paper we propose ways to decide which items and users are most likely to introduce noise, so that only those have to bear the burden of re-rating items.

We measured relative improvement in terms of RMSE up to 14% and we verified that this is consistent regardless of the particular recommendation algorithm (item and user-based CF, SVD, etc...).

This is another example of how to improve recommender systems using a data-driven approach. Denoising user feedback is a promising avenue, and there is still a lot of room for improvement!

Update: I forgot to mention the impressive numbers for this year's Recsys. There were 203 submissions (almost double last year's number), 140 of them for long papers. The acceptance rate was down to 17%, making it more competitive than other first-tier conferences. And, as a matter of fact, having been on the Program Committee, and looking at the list of accepted papers, I can say that the quality of the papers is comparable to (if not higher than) the quality of recommender papers accepted to related first-tier conferences.

Aug 4, 2009

UMAP and SIGIR 09

I usually do a short report after I attend a conference. However, because of multiple commitments, deadlines and important things, I have failed to do so in the last two: UMAP 09 in Trento, Italy and SIGIR 09 in Boston, MA. So, I thought I'd give it a go and write a short report on both.

I will start by saying that lately I am not very fond of conferences in general. Don't get me wrong, I love socializing with other researchers in the area, meeting new people, and getting a chance to present my research to a larger audience while getting expert feedback. The problem is that in most conferences this ends up being secondary. There is a very interesting and recent article in Communications of the ACM by Lance Fortnow that does a very good job of analyzing the issue (although I should warn you that I do not subscribe to his solution of going back to journals!). In any case, given this context, take my review of these two conferences with a grain of salt. I shall return to the broader issue of conferences sometime soon.


Ok, so I will start by commenting on UMAP09, which was organized in Trento (Italy) in late June. There, I presented a long paper, "I like it... I like it not", that I already commented on in this same blog. UMAP was organized this year for the first time by joining two pre-existing biennial conferences: UM (User Modeling) and AH (Adaptive Hypermedia) (see information about past conferences here). Both of these conferences were highly regarded, so the resulting union was anticipated to be a success. Besides, the area of User Modeling has been gaining a lot of momentum recently, and a conference such as UMAP09 was expected to ride the wave. However, attendance was around 200 people, which was the same as either of the two conferences had in isolation. The acceptance rate for long papers was 26%.

For people like me coming from outside the community, UMAP looks like a weird conference, and there are many things about the organization that are hard to understand. First, there is the fact that the conference is not sponsored by any well-known organization like ACM, IEEE, etc... but rather by a non-profit organization called User Modeling Inc. I am sure there are (or were) good reasons for this, but to an outsider this sounds weird, you'll give me that. Then, and possibly related to the previous point, there is the issue with the proceedings: they are published with Springer (a for-profit publisher) in the infamous LNCS series... and the proceedings are not available for download in electronic format even at this stage! If you add the fact that the choice of location in beautiful but hard-to-reach Trento was questionable, I am not surprised the conference turned out to be less than what the organizers expected. I really think some of these issues should be addressed soon: there are many other conferences that are more than happy to accept research related to User Modeling, and UMAP will have to do its best to attract people. However, if well managed, UMAP should be a very attractive and relevant conference. Next year it will be organized in Hawaii and there are already talks of co-locating it with a larger event in 2011.

Moving on to SIGIR, this is a completely different beast, since it is a well-established first-tier conference and it was organized in easy-to-reach Boston. I have no complaints about the organization (except maybe for the lack of lunches during the main conference days). And as a matter of fact I have to congratulate them for the excellent social events: both the banquet at the JFK museum and the Harbor tour were great. I presented our paper "The Wisdom of the Few" in a conference that had its lowest acceptance rate since 1997, 16%... quite an honour to be accepted.

SIGIR -- the most important conference on Information Retrieval, for those that are not in the field -- is one of those large conferences where you have several parallel tracks. This is great since you are always likely to find something you are interested in. However, it has the downside that most people might be attracted to the same track, leaving others almost empty. Curiously enough, this is what happened during the Industry track: most people were attracted to it, leaving the research tracks with much less attendance than expected. A suggestion for next year would be not to host the industry track in parallel but in a dedicated time slot. In any case, this brings me to my important question: why were researchers more attracted to the industry track than to the research ones? The answer is simple: while the average presenter in the industry track is a well-known professional with above-average public speaking skills, the average presenter in the research tracks is a PhD student who can barely hope to hold the audience's attention by not putting too many formulas on the slides. I will leave this analysis here but will try to come back to it in a dedicated post soon. If you are interested in reading more, there is a great series of posts on the excellent Industry Track at SIGIR 2009 by the organizer Daniel Tunkelang, Endeca's Chief Scientist. Start here, and follow to similar or later posts. You can also find other great posts summarizing SIGIR; see Jeff Dalton's summaries here, for instance.

On the last SIGIR day I attended the Search in Social Media Workshop. This turned out to be one of the highlights of the conference. The setup of the workshop was great. It was divided in different topic blocks. For each block there was a keynote. Then other presenters had a short time for presentation and then they all joined for a discussion panel including participation from the audience and from a twitter feed projected on the side... Brilliant! I particularly liked the keynotes by Joseph Konstan and Abdur Chowdhury, Twitter's Chief Scientist.

Overall going to conferences is a great experience and I got to meet many interesting people, have interesting conversations and I presented 2 papers getting a lot of feedback. Conferences are essential in the work of a researcher. However, there is a lot of room for improvement in order to make the best of them. And definitely CS conferences should take the lead because technology will be key in this transformation.

Jul 21, 2009

It's all about the Data...

(or Data-driven approaches to the Recommendation problem)

This is the title of the talk I will be giving on Thursday, July 23, at 3:30 pm at Boston University (Math & CS Building, room 135).

Here is the abstract of the talk, hope to see you around:

The Netflix Prize put Recommender Systems (RS) research in the spotlight. Given 100M ratings from 500K users to 17K movies, researchers from all over the world have been racing for almost 3 years to improve accuracy by 10% in order to win the 1M$ prize. A couple of weeks ago, it was announced that a merge between several teams that used hundreds of predictors might have won the prize. However, there are doubts about the generalization properties of this winning approach.

Our approach to the Recommender problem has been different: instead of taking the data as is and investing in fine-tuning a large number of machine learning algorithms to model it, we have focused on understanding the data and improving it.

In a recent UMAP 2009 paper named "I Like it, I Like it Not..", we show that the natural noise due to the inconsistencies in user feedback sets a lower bound on the so-called "magic barrier" in RS and could in fact be very close to the Netflix Prize threshold. Once the inconsistencies of users when providing explicit feedback have been characterized, we can devise ways to minimize them.

In "The Wisdom of the Few" (SIGIR 2009), we propose a different approach to Collaborative Filtering by using feedback from experts instead of regular users. In "Adaptive Data Sources" (ITPW @IJCAI 2009) we propose to use ensembles of data sources instead of ensembles of predictors. Finally, in "Rate it Again" (RECSYS 2009), we present an algorithm for denoising user feedback based on a re-rating approach.

In this talk, I will give an overview of the issue of noise in user feedback for Recommender Systems and will briefly describe the work that we have done (as previously described) to overcome it.

Jul 13, 2009

Why the crowds are not (always) wise

More often than not I come across situations in which the now famous "Wisdom of the Crowds" is applied in the wrong context or situation. Let me try to explain it with an example of one such situation:

Let's imagine a well-known online newspaper decides to give a prize to the best... Linux application, for instance. In order to do so, it decides that it will let the crowds decide. First, developers who have an application and are interested in the $100K of the prize will submit their application. Then users will vote and the prize will be awarded to the application with most votes.

Ok, so, do you think the winner of the contest will be the "best Linux application"? Of course not. And there are many possible reasons why that won't be the case, right?

Let's start by the two most important conditions that we need to guarantee if we want to trust the Wisdom of the Crowds:
  1. There has to be... well, a crowd
  2. Whatever we are measuring needs to be close to the notion of "popularity"
Ok, the first condition seems fair and obvious enough. However, it is often forgotten. If you want the crowds to decide on something, you need to have enough opinions to avoid possible bias. Ideally, you would even worry about things such as demographics, etc... Given that this is seldom possible, you need to guarantee that malicious bias or shilling is not possible, or at least hard.

For instance, in our example, imagine the winning Linux application got 120 votes. "Uhm", I hear you say, "for $100K, I could get 120 people to vote for my app". Well, there you are. And there are many cases in which this might not be so explicit, but in which the number of opinions does not make a crowd.

Ok, let's go to the second condition: whatever we are measuring needs to be somewhat correlated with popularity. In our example, it might be that what we mean by "best application" is precisely the "most popular" one. If so, that's ok. If what we are trying to measure is something else, such as "most innovative", "most secure, stable, ....", chances are that the crowds will not be evaluating these features. Again, in many cases, what "best" means is not critical (think of TV programs such as American Idol where the crowds get to pick the best performer, for instance). However, in some others, we might be getting an answer to the wrong question.

Although I summarized/simplified the two most important conditions for the Wisdom of the Crowds to take place, the original book by Surowiecki talked about 4 conditions:
  • Diversity of opinion
  • Independence
  • Decentralization
  • Aggregation
(The Wikipedia entry does a good job of summarizing these.)

Wisdom in crowds should be taken with care and only where appropriate.
Actually, in many situations, instead of crowds we might prefer to have a number of experts give us their opinion (would you post your symptoms on a web page and decide what pills to take based on the wisdom of the crowds?).
For this reason, we propose a Collaborative Filtering approach based on experts in our forthcoming SIGIR paper (see my previous blog entry on this).

And for that same reason, I am starting to like the concept of the Alchemy of the Crowds (as opposed to Wisdom) described by the authors of the same name.