Review of “Mining of Massive Datasets”

Finally, I have finished the review of this excellent book about mining massive datasets. The sheer mass of data on the web is continuously growing, and a lot of new methods, algorithms and tools are emerging to deal with this big amount of data, in some cases without providing a formal model to process the information. In this book, the authors present a compilation of the most widely used algorithms (and their formal definitions) for building recommendation systems based on data mining techniques.

I strongly recommend reading this book because it focuses on data mining of very large amounts of data that do not fit in main memory. Currently this situation applies to the management of digital libraries, the analysis of social networks, bioinformatics, etc., in which the processing of large datasets is necessary. According to the authors, you will learn the following concepts:

  • Distributed file systems and map/reduce approaches as a tool for creating parallel algorithms (see the sketch after this list)
  • Methods to estimate and compute similarity (similarity search)
  • Processing of data streams with specialized algorithms
  • Technology behind existing search engines: PageRank, link-spam detection, etc.
  • Frequent-itemset mining
  • Algorithms for clustering very large and high-dimensional datasets
  • Two main applications of these techniques: advertising and recommendation systems
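
As a quick illustration of the map/reduce idea from the first bullet, the following is a minimal, single-machine sketch in Python (no Hadoop or cluster involved; the function names are mine): the map phase emits (key, value) pairs and the reduce phase aggregates them per key, which is exactly the contract a real framework distributes across machines.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Group values by key and aggregate them, as the framework would do
    # after shuffling the mappers' output to the reducers.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

documents = ["mining massive datasets", "massive data streams"]
counts = reduce_phase(chain.from_iterable(map_phase(d) for d in documents))
print(counts)  # {'mining': 1, 'massive': 2, 'datasets': 1, 'data': 1, 'streams': 1}
```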

Nevertheless, I missed a section about real-time processing of large amounts of data, beyond streaming techniques.

Hulu’s Recommendation System

Continuing the review of existing recommender systems on multimedia sites, I have found, through Marcos Merino, the recommendation engine provided by Hulu (an online video service that offers a selection of hit shows, clips, movies and more).

It brings together a large selection of videos from over 350 content companies, including FOX, NBCUniversal, ABC, The CW, Univision, Criterion, A&E Networks, Lionsgate, Endemol, MGM, MTV Networks, Comedy Central, National Geographic, Digital Rights Group, Paramount, Sony Pictures, Warner Bros., TED and more. (Hulu, About)

But what is the underlying technology behind Hulu?

Checking their technology blog, it is clear they have spent a lot of effort on building a great recommendation engine, and that they decided to recommend shows to users instead of individual videos. Content can thus be organized naturally, since videos from the same show are usually closely related. As with Netflix, one of the main drivers of the recommendation is user behavior data (implicit and explicit feedback). The algorithm implemented at Hulu is based on a collaborative filtering approach (user- or item-based), but the most important part lies in Hulu’s architecture, which comprises the following components:

  1. User profile builder
  2. Recommendation core
  3. Filtering
  4. Ranking
  5. Explanation

Besides, they have an off-line system for data processing that supports the aforementioned processes; it is based on a data center, a topic model, a related-table generator, a feedback analyzer and a report generator. On top of these components and processes, they have applied an item-based collaborative filtering algorithm to make recommendations (a toy sketch follows the quote below). One of the key points they use to evaluate recommendations is “novelty”:
Just because a recommendation system can accurately predict user behavior does not mean it produces a show that you want to recommend to an active user. (Hulu, Tech Blog)
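
To make the item-based collaborative filtering idea concrete, here is a toy sketch (my own minimal version, not Hulu’s actual code): it derives item-item cosine similarities from a user-show feedback matrix and then scores unseen shows for a user, including the filtering step that removes already-watched shows.

```python
import numpy as np

# Toy user x show feedback matrix (0 = no interaction); purely illustrative.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 4.0],
])

# Item-item cosine similarity matrix.
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)
np.fill_diagonal(sim, 0.0)  # a show should not recommend itself

def recommend(user, k=2):
    # Score every show by the similarity-weighted feedback of the user.
    scores = sim @ R[user]
    scores[R[user] > 0] = -np.inf  # filtering: never re-recommend seen shows
    return np.argsort(scores)[::-1][:k]

print(recommend(0))  # indices of the top-2 shows to recommend to user 0
```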

Other key points of their approach lie in explanation-based diversity and temporal diversity. This shows that the problems of recommending information resources in different domains are always similar. Nevertheless, depending on the domain (user behavior, type of content, etc.), new metrics such as novelty can emerge. On the other hand, real-time capabilities, off-line processing and performance are, apart from accuracy, again key enablers of a “good” recommendation engine. Some interesting lessons from Hulu’s experience are highlighted below:

  • Explicit feedback data is more important than implicit feedback data
  • Recent behaviors are much more important than old behaviors
  • Novelty, diversity, and offline accuracy are all important factors
  • Most researchers focus on improving offline accuracy metrics such as RMSE or precision/recall. However, a recommendation system that accurately predicts user behavior may not, by itself, be good enough for practical use. A good recommendation system should consider multiple factors together. In our system, after considering novelty and diversity, the CTR improved by more than 10%. Please check out this document: “Automatic Generation of Recommendations from Data: A Multifaceted Survey” (a technical report from the School of Information Technology at Deakin University, Australia)
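
As a hedged illustration of how a novelty metric might be computed (Hulu’s exact definition is not given in their post), a common choice in the literature is the mean self-information of the recommended items, which rewards recommending shows that few users have already seen:

```python
import math

def novelty(recommended, views, num_users):
    # Mean self-information: -log2(p(item)), where p(item) is the share of
    # users who have interacted with the item. Rarer items score higher.
    return sum(-math.log2(views[i] / num_users) for i in recommended) / len(recommended)

views = {"show_a": 900, "show_b": 50, "show_c": 5}  # hypothetical view counts
print(novelty(["show_b", "show_c"], views, num_users=1000))  # ~5.98 bits
```
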
But in which components or processes can semantic technologies help recommenders?
Taking into account the main drivers of the Semantic Web, semantics can play a part in several processes (see the topics of the Mining Data Semantics workshop, MDS2012) such as:
  • Classification and prediction in heterogeneous networks
  • Pattern-analysis methods
  • Link mining and link prediction
  • Semantic search over heterogeneous networks
  • Mining with user interactions
  • Semantic mining with light-weight reasoning
  • Extending LOD and quality of LOD: disambiguation, identity, provenance, integration
  • Personalized mining in heterogeneous networks
  • Domain specific mining (e.g., Life Science and Health Care)
  • Collective intelligence mining

Finally, I will continue reviewing the main recommendation services of large social networks (sorted by name) such as Amazon, Facebook, Foursquare, LinkedIn, Mendeley or Twitter, in order to make a complete comparison according to different variables and features of the algorithms: feedback, real time, domain, user behavior, etc. After that, my main objective will be to implement a real use case in a distributed environment, merging semantic technologies and recommendation algorithms, to test whether semantics can improve the results (accuracy, etc.) of existing approaches.

BellKor Solution to the Netflix Prize

Currently, users are inundated with information and data coming from products and services. Recommender systems have emerged as a research area in recent years, but one with huge importance for any commercial application. A simple classification of these techniques distinguishes push-based recommendations, neighborhood models (user-user or item-based), and latent factor models such as matrix factorization. The main challenge in improving these techniques is to obtain more accurate models in which information about item biases, user biases and user preferences is taken into account.
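
For instance, the usual starting point in the Netflix Prize literature is a baseline predictor that models exactly those biases: the global mean plus a learned user offset and item offset. A minimal sketch (toy data and my own fitting loop, using simple coordinate descent):

```python
import numpy as np

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]  # (user, item, rating)
n_users, n_items, lam = 3, 2, 0.1

mu = np.mean([r for _, _, r in ratings])  # global mean
b_u, b_i = np.zeros(n_users), np.zeros(n_items)  # user and item biases

# A few rounds of coordinate descent on the regularized squared error.
for _ in range(10):
    for u in range(n_users):
        rs = [(i, r) for uu, i, r in ratings if uu == u]
        b_u[u] = sum(r - mu - b_i[i] for i, r in rs) / (lam + len(rs))
    for i in range(n_items):
        rs = [(u, r) for u, ii, r in ratings if ii == i]
        b_i[i] = sum(r - mu - b_u[u] for u, r in rs) / (lam + len(rs))

predict = lambda u, i: mu + b_u[u] + b_i[i]  # r(u, i) = mu + b_u + b_i
print(predict(1, 1))  # prediction for a pair never observed in training
```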

Collaborative filtering is a prime component for recommending products and services. Basically, the neighborhood approach and latent factor models (such as Singular Value Decomposition, SVD) are the two main approaches. The former focuses on computing relationships between items or users, while the latter translates all items to the same latent factor space, thus making them directly comparable.
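
The point about latent factors making items directly comparable can be seen with a truncated SVD on a toy rating matrix: once every item is reduced to a short factor vector, any two items can be compared with a dot product or cosine, whether or not any single user has rated both.

```python
import numpy as np

R = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 5., 4.],
              [0., 1., 4., 4.]])

# Rank-2 truncated SVD: users and items land in the same latent space.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
item_factors = (np.diag(s[:2]) @ Vt[:2]).T  # one 2-d vector per item

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Items 2 and 3 were liked by the same users, so their vectors align.
print(cosine(item_factors[2], item_factors[3]))
```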

After this short review of the main approaches to collaborative filtering, I am going to focus on the subject of this post, “The BellKor Solution to the Netflix Prize” [1]. The Netflix Prize was a contest to improve the accuracy of the Cinematch algorithm, measured by RMSE, with a prize of up to $1M. The authors of this algorithm (Bob Bell and Chris Volinsky, from the Statistics Research group at AT&T Labs, and Yehuda Koren) won the prize with the first approach that merges both models (neighborhood and SVD), obtaining a more accurate model. The main features of this approach are:

  • a new model for neighborhood recommenders based on optimizing a global cost function, keeping advantages such as explainability and the handling of new users while improving accuracy
  • a set of extensions to existing SVD models to integrate implicit feedback and time features (a sketch of the resulting SVD++ prediction rule follows this list)
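
For reference, the SVD++ prediction rule from that line of work, which folds a user’s implicit feedback into the factor model, can be sketched as follows (parameter values here are placeholders; in practice all of them are learned, typically by stochastic gradient descent):

```python
import numpy as np

def predict_svdpp(mu, b_u, b_i, q_i, p_u, y, N_u):
    # r(u,i) = mu + b_u + b_i + q_i . (p_u + |N(u)|^(-1/2) * sum_{j in N(u)} y_j)
    # N_u is the set of items on which user u gave any implicit feedback,
    # and y[j] is the implicit-feedback factor vector of item j.
    implicit = sum(y[j] for j in N_u) / np.sqrt(len(N_u))
    return mu + b_u + b_i + q_i @ (p_u + implicit)

f = 2  # number of latent factors
q_i, p_u = np.full(f, 0.1), np.full(f, 0.2)
y = {7: np.full(f, 0.05), 9: np.full(f, 0.05)}
print(predict_svdpp(mu=3.6, b_u=0.1, b_i=-0.2, q_i=q_i, p_u=p_u, y=y, N_u={7, 9}))
```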

Thus, a new approach to recommender systems was presented in 2008-2009 (a complete description of the algorithm is available in the article “Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model”) that won the Netflix contest, but some issues remain open:

  • Scalability (millions of users and items) and real time (map/reduce techniques to continuously process new data)
  • Explainability
  • Implicit and explicit feedback
  • Factorization techniques (please read this article from the same authors)
  • Quality: including more data with regard to dates, attributes of users, etc.
  • …in general, recommender systems are a young area in which a lot of improvements can be implemented

Finally, it is worth checking the latest publications of Yehuda Koren: