Hulu’s Recommendation System

Continuing the review of existing recommendation systems in multimedia sites, I came across, thanks to Marcos Merino, the recommendation engine provided by Hulu (an online video service that offers a selection of hit shows, clips, movies and more).

It brings together a large selection of videos from over 350 content companies, including FOX, NBCUniversal, ABC, The CW, Univision, Criterion, A&E Networks, Lionsgate, Endemol, MGM, MTV Networks, Comedy Central, National Geographic, Digital Rights Group, Paramount, Sony Pictures, Warner Bros., TED and more. (Hulu, About)

But what is the underlying technology behind Hulu?

Checking their technology blog, it is clear that they have spent a lot of effort building a great recommendation engine, and that they decided to recommend shows to users instead of individual videos. This way content can be organized naturally, since videos belonging to the same show are usually closely related. As in Netflix, one of the main drivers of the recommendations is user behavior data (implicit and explicit feedback). The algorithm implemented at Hulu is based on a collaborative filtering approach (user- or item-based), but the most important part lies in Hulu's architecture, which comprises the following components:

  1. User profile builder
  2. Recommendation core
  3. Filtering
  4. Ranking
  5. Explanation
Besides, they have an off-line data-processing system that supports the aforementioned processes; it is based on a data center, a topic model, a related-table generator, a feedback analyzer and a report generator. On top of these components and processes they apply an item-based collaborative filtering algorithm to make recommendations (a toy sketch of the idea is given right after the quote below). One of the key points when evaluating recommendations is "novelty":
Just because a recommendation system can accurately predict user behavior does not mean it produces a show that you want to recommend to an active user. (Hulu, Tech Blog)
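
For readers unfamiliar with item-based collaborative filtering, here is a minimal Java sketch of the general idea; it is my own toy code with hypothetical names, not Hulu's implementation. Shows the user has not watched are scored by their similarity to the shows the user has already watched, where similarity is the cosine of co-occurrence across users.

import java.util.*;
import java.util.stream.Collectors;

/** Toy item-based collaborative filtering over implicit "watched" data (illustrative only). */
public class ItemBasedCF {

    // userId -> set of showIds the user has watched (implicit feedback)
    private final Map<String, Set<String>> watched;

    public ItemBasedCF(Map<String, Set<String>> watched) {
        this.watched = watched;
    }

    /** Cosine similarity between two shows, based on how often users watch both of them. */
    double similarity(String showA, String showB) {
        int a = 0, b = 0, both = 0;
        for (Set<String> shows : watched.values()) {
            boolean hasA = shows.contains(showA), hasB = shows.contains(showB);
            if (hasA) a++;
            if (hasB) b++;
            if (hasA && hasB) both++;
        }
        return (a == 0 || b == 0) ? 0.0 : both / Math.sqrt((double) a * b);
    }

    /** Recommend the top-n shows the user has not watched, scored by similarity to watched shows. */
    public List<String> recommend(String userId, int n) {
        Set<String> seen = watched.getOrDefault(userId, Collections.emptySet());
        Set<String> candidates = watched.values().stream()
                .flatMap(Set::stream)
                .filter(show -> !seen.contains(show))
                .collect(Collectors.toSet());
        Map<String, Double> scores = new HashMap<>();
        for (String candidate : candidates) {
            double score = 0.0;
            for (String liked : seen) score += similarity(candidate, liked);
            scores.put(candidate, score);
        }
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}

In a real system the item-item similarities would of course be precomputed off-line, which is precisely the role of the off-line components described above.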

Other key points of their approach lie in explanation-based diversity and temporal diversity. This shows that the problems of recommending information resources are similar across domains; nevertheless, depending on the domain (user behavior, type of content, etc.), new metrics such as novelty can emerge. On the other hand, real-time capabilities, off-line processing and performance are, besides accuracy, again key enablers of a "good" recommendation engine. Some interesting lessons from Hulu's experience are highlighted below:

  • Explicit feedback data is more important than implicit feedback data
  • Recent behaviors are much more important than old behaviors
  • Novelty, diversity and offline accuracy are all important factors (a toy illustration of these metrics is sketched right after this list)
  • Most researchers focus on improving offline accuracy, such as RMSE or precision/recall. However, recommendation systems that can accurately predict user behavior alone may not be good enough for practical use. A good recommendation system should consider multiple factors together. In our system, after considering novelty and diversity, the CTR has improved by more than 10%. Please check out this document: "Automatic Generation of Recommendations from Data: A Multifaceted Survey" (a technical report from the School of Information Technology at Deakin University, Australia)
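
To make those metrics concrete, the following toy Java sketch (my own generic definitions, nothing Hulu-specific) computes offline accuracy as RMSE and intra-list diversity as the average pairwise dissimilarity of the items in a recommendation list:

import java.util.*;
import java.util.function.BiFunction;

/** Toy illustrations of offline accuracy (RMSE) and intra-list diversity. */
public class RecMetrics {

    /** Root mean squared error between predicted and actual ratings. */
    public static double rmse(double[] predicted, double[] actual) {
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double err = predicted[i] - actual[i];
            sum += err * err;
        }
        return Math.sqrt(sum / predicted.length);
    }

    /** Average pairwise dissimilarity (1 - similarity) of the items in a recommendation list. */
    public static double intraListDiversity(List<String> items, BiFunction<String, String, Double> sim) {
        double total = 0.0;
        int pairs = 0;
        for (int i = 0; i < items.size(); i++) {
            for (int j = i + 1; j < items.size(); j++) {
                total += 1.0 - sim.apply(items.get(i), items.get(j));
                pairs++;
            }
        }
        return pairs == 0 ? 0.0 : total / pairs;
    }
}

Novelty is often approximated in a similar spirit, for example as the average popularity-based self-information of the recommended items, although the exact definition varies from system to system.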
But in which components or processes can semantic technologies help recommenders?
Taking into account the main drivers of the Semantic Web, semantics can become part of several processes (see Mining Data Semantics, MDS2012) such as:
  • Classification and prediction in heterogeneous networks
  • Pattern-analysis methods
  • Link mining and link prediction
  • Semantic search over heterogeneous networks
  • Mining with user interactions
  • Semantic mining with light-weight reasoning
  • Extending LOD and quality of LOD: disambiguation, identity, provenance, integration
  • Personalized mining in heterogeneous networks
  • Domain specific mining (e.g., Life Science and Health Care)
  • Collective intelligence mining
Finally, I will continue reviewing the main recommendation services of large social networks (sorted by name), such as Amazon, Facebook, Foursquare, LinkedIn, Mendeley or Twitter, in order to build a complete comparison according to different variables and features of the algorithms: feedback, real time, domain, user behavior, etc. After that, my main objective will be to implement a real use case in a distributed environment, merging semantic technologies and recommendation algorithms, to test whether semantics can improve the results (accuracy, etc.) of existing approaches.

BellKor Solution to the Netflix Prize

Currently users are inundated with information and data coming from products and services. Recommender systems have been an emerging research area in recent years, but they are already of huge importance in any commercial application. A simple classification of these techniques distinguishes push-based recommendations, user-user or item-user neighborhood models, and latent models such as simple matrix factorization. The main challenge in improving these techniques is to obtain more accurate models in which information about resource (item) biases, user biases and user-preference biases is taken into account.

Collaborative filtering is a prime component for recommending products and services. Basically, the neighborhood approach and latent factor models (such as Singular Value Decomposition, SVD) are the two main families. The former focuses on computing relationships between items or between users, while the latter translates users and items to the same latent factor space, thus making them directly comparable; a minimal predictor of this kind is sketched below.
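
As an illustration of the latent factor idea, here is a toy Java sketch of the standard biased prediction rule popularized by the Netflix Prize literature (my own illustrative code with hypothetical names, not the BellKor implementation): the predicted rating is the global mean plus a user bias, plus an item bias, plus the inner product of the user and item factor vectors.

import java.util.Map;

/** Toy biased matrix factorization predictor: r_hat(u,i) = mu + b_u + b_i + q_i . p_u */
public class BiasedMFPredictor {

    private final double mu;                          // global average rating
    private final Map<String, Double> userBias;       // b_u
    private final Map<String, Double> itemBias;       // b_i
    private final Map<String, double[]> userFactors;  // p_u
    private final Map<String, double[]> itemFactors;  // q_i

    public BiasedMFPredictor(double mu,
                             Map<String, Double> userBias, Map<String, Double> itemBias,
                             Map<String, double[]> userFactors, Map<String, double[]> itemFactors) {
        this.mu = mu;
        this.userBias = userBias;
        this.itemBias = itemBias;
        this.userFactors = userFactors;
        this.itemFactors = itemFactors;
    }

    /** Predicted rating for (user, item); unknown users or items fall back to the bias terms only. */
    public double predict(String user, String item) {
        double r = mu + userBias.getOrDefault(user, 0.0) + itemBias.getOrDefault(item, 0.0);
        double[] p = userFactors.get(user);
        double[] q = itemFactors.get(item);
        if (p != null && q != null) {
            for (int f = 0; f < p.length; f++) {
                r += p[f] * q[f];
            }
        }
        return r;
    }
}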

After this short review of the main approaches to collaborative filtering, I am going to focus on the subject of this post, "The BellKor Solution to the Netflix Prize" [1]. The Netflix Prize was a contest to improve the accuracy of the Cinematch algorithm, measured with the quality metric RMSE, with a prize of up to $1M. The authors of this algorithm (Bob Bell and Chris Volinsky, from the Statistics Research group at AT&T Labs, and Yehuda Koren) won the prize with the first approach that merges both models (neighborhood and SVD), obtaining a more accurate model. Some of the main features of this approach are:

  • a new model for neighborhood recommenders based on optimizing a global cost function, which keeps advantages such as explainability and the handling of new users while improving accuracy
  • a set of extensions to existing SVD models to integrate implicit feedback and time features (the resulting SVD++ prediction rule is reproduced just below this list)
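
For reference, the implicit-feedback extension mentioned above corresponds to the SVD++ model introduced by Koren; if I recall the notation correctly, its prediction rule is roughly

\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top}\Big(p_u + |N(u)|^{-1/2}\sum_{j \in N(u)} y_j\Big)

where \mu is the global mean, b_u and b_i are the user and item biases, p_u and q_i are the user and item factor vectors, N(u) is the set of items for which user u has provided implicit feedback, and the y_j are additional item factors capturing that implicit signal.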

Thus, a new approach for recommender systems was presented in 2008-2009 (a complete description of the algorithm is available in the article "Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model") to win the Netflix contest, but some issues remain open:

  • Scalability (millions of users and items) and real time (map/reduce techniques to continuously process new data)
  • Explainability
  • Implicit and explicit feedback
  • Factorization techniques (please read this article from the same authors)
  • Quality including more data with regards to dates, attributes of users, etc.
  • …in general recommender systems are a young area in which a lot of improvements can be implemented

Finally, it is also worth checking Yehuda Koren's latest publications:

Technologies Cloud

After seven years I have gained expertise in several research domains and technologies… I believe a tag cloud can explain it properly!

Serializing an OWL ontology in DL-syntax

One month ago I was finishing my contribution to a paper and I had to include some of the axioms defined in the ontology we designed. I knew the DL syntax, but I did not want to spend time rewriting the entire ontology (in RDF/XML format) from scratch, so I decided to look for a method or conversion tool (preferably on-line) to serialize the ontology using this syntax. I was surprised that most of the tools do not support this feature (to the best of my knowledge), but I dug into the OWL-API (version 3.2.4) documentation and source code and finally found a class, coded by Matthew Horridge, in which the loading and serialization of ontologies were explained. After checking some more documentation about the serialization packages, I could customize the previous code to get a DL-syntax serialization of the ontology.

...
// Load the ontology ("file" is the RDF/XML source document)
OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
OWLOntology localOntology = manager.loadOntologyFromOntologyDocument(file);
// Format in which the ontology was originally serialized (not needed for the DL output)
OWLOntologyFormat format = manager.getOntologyFormat(localOntology);
// Target format and storer for the DL syntax
DLSyntaxOntologyFormat dlFormat = new DLSyntaxOntologyFormat();
DLSyntaxOntologyStorer storer = new DLSyntaxOntologyStorer();
// IRI documentIRI: location of the file in which the DL version will be saved
if (storer.canStoreOntology(dlFormat)) {
  storer.storeOntology(manager, localOntology, documentIRI, dlFormat);
}
...

It is just a code snippet… but I spent quite some time getting a representation of an OWL ontology in DL syntax, when I thought it would be trivial!

Finally, these are the tools I checked:

R & Big Data Intro

I am very committed to enhancing my know-how on delivering solutions that deal with Big Data in a high-performance fashion. I am continuously looking for tools, algorithms, recipes (e.g. the Data Science e-book), papers and technologies that enable this kind of processing, because it is not only considered to be relevant for the next years: it is already a reality!

Last week I went back to using R and the Rcmdr package to analyze and extract statistics from my PhD experiments using the Wilcoxon test. I started with R three years ago, when I developed a simple graphical interface in Visual Studio to input data and send operations to the R interpreter; the motivation for that work was to help a colleague with his final degree project, and the experience was very rewarding.

What is the relation between Big Data and R?

It has a simple explanation: a key enabler for providing added-value services is to manage and learn from historical logs, so by putting together an excellent statistics suite and the Big Data realm it is possible to meet the requirements of a great variety of services from domains like NLP, recommendation, business intelligence, etc. For instance, there are approaches that mix R with Hadoop, such as RICARDO or Parallel R, and new companies are emerging that offer services based on R to process Big Data, such as Revolution Analytics.

This post was a short introduction to R as a tool to exploit Big Data. If you are interested in this kind of approach, please take a look at the following presentation by Lee Edfelsen:

Keep on researching and learning!

Compiling Related Work about Linked Data Quality

One of the cornerstones to boost the use of Linked Data is to ensure the quality of data according to different dimensions like timeliness, correctness, etc. The intrinsic features of this initiative provide a framework for the distributed publication of data and resources (linking datasources together on the web). Due to this open approach, some mechanisms should be added to check whether data is well linked or is just an attempt to link together some parts of the web. In most cases, linking data relies on an automatic way of discovering and creating links between resources (e.g. the Silk Framework); this implies that the process is, to some extent, ambiguous, so human decisions are required. As for the data itself, quality may vary because information providers have different levels of knowledge, objectives, etc. Information and data are thus released to accomplish a specific task, and their quality should be assessed against different criteria according to a specific domain.

For instance, if a data provider is releasing information about payments, is it possible to check which decimal separator is used, 10,000 or 10.000? Is this information homogeneous across all resources in the dataset? If a literal value should be "Oviedo", what happens if the actual value is "Obiedo"? How can we detect and fix these situations?
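
As a toy illustration of how such checks could look (my own sketch, not tied to any particular Linked Data tool), one can test which number-formatting convention a literal fits with regular expressions and flag suspiciously similar string literals with an edit distance:

import java.util.regex.Pattern;

/** Toy checks for the literal-quality issues mentioned above (illustrative sketch only). */
public class LiteralChecks {

    // "10,000.50" style (dot as decimal separator) vs. "10.000,50" style (comma as decimal separator)
    private static final Pattern DOT_DECIMAL   = Pattern.compile("\\d{1,3}(,\\d{3})*(\\.\\d+)?");
    private static final Pattern COMMA_DECIMAL = Pattern.compile("\\d{1,3}(\\.\\d{3})*(,\\d+)?");

    /** Can the literal be read under the dot-decimal convention? */
    public static boolean fitsDotDecimal(String literal) {
        return DOT_DECIMAL.matcher(literal).matches();
    }

    /** Can the literal be read under the comma-decimal convention? */
    public static boolean fitsCommaDecimal(String literal) {
        return COMMA_DECIMAL.matcher(literal).matches();
    }

    /** Classic Levenshtein edit distance; a distance of 1 between "Oviedo" and "Obiedo" hints at a typo. */
    public static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }
}

A value such as "10.000" happens to fit both conventions, so a homogeneity check would look at the whole column: if most payment values only fit one convention, the ambiguous or deviating literals can be flagged for human review. Likewise, editDistance("Oviedo", "Obiedo") == 1 is a strong hint that two near-identical literals denote the same entity.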

These cases have motivated some related work:

  • The PhD thesis of Christian Bizer, which proposes a template language and a framework (WIQA) to decide whether a triple fulfills the requirements to be accepted into a dataset. (2007)
  • The LODQ vocabulary, an RDF model to express criteria for about 15 kinds of metrics formulated by Glenn McDonald on a mailing list. A processor for this vocabulary is still missing. (2011)
  • A paper entitled "Linked Data Quality Assessment through Network Analysis" by Christian Gueret, in which some metrics are provided to check the quality of links. This work is part of the LATC project. (2011)
  • The COLD workshop (Consuming Linked Data), which is also a good starting point to check the problems and approaches involved in implementing Linked Data applications.
  • …and other approaches that are collected in the aforementioned works.
In some sense we might think that this problem is new, but the truth is that it is inherited from traditional databases. One of the questions that arises is whether existing approaches can be applied to quality assessment in the Linked Data realm… but this will be evaluated in future posts.
This first post is just a short introduction to Linked Data quality research and approaches. In the coming weeks we will try to review these works in depth and propose a solution (LODQAM).
Thank you very much!
Best regards,


This is the first post in a blog about research. The next updates will cover different topics:

  • Linked (Open) Data
  • Real time systems
  • Complex Event Processing
  • Distributed Reasoning
  • Decision Support Systems
  • Visualization of large datasets
  • … applied to e-Health, e-Government, etc.
I look forward to seeing you in the next posts…