I have finally finished reviewing this excellent book about mining massive datasets. The sheer mass of data on the web is continuously growing, and many new methods, algorithms and tools are emerging to deal with it, in some cases without providing a formal model to process the information. In this book, the authors present a compilation of the most widely used algorithms (with their formal definitions) to build recommendation systems based on data mining techniques.
I strongly recommend reading this book because it focuses on mining very large amounts of data that do not fit in main memory. This situation currently arises in the management of digital libraries, the analysis of social networks, bioinformatics and other areas in which large datasets must be processed. The main topics are shown in the following figures; according to the authors, you will learn the following concepts:
- Distributed file systems and map/reduce approaches as a tool for creating parallel algorithms
- Methods to estimate and compute similarity between items (similarity search)
- Processing of data streams with specialized algorithms
- Technology of existing search engines: PageRank, link-spam detection, etc.
- Frequent-itemset mining
- Algorithms for clustering very large and high-dimensional datasets
- Two main applications of these techniques: advertising and recommendation systems
Nevertheless, I missed a section on real-time processing of large amounts of data, as opposed to streaming techniques.
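
To give a flavour of the first topic in the list above, here is a minimal, single-machine sketch of the map/reduce idea applied to word counting. It is only an illustration under my own assumptions, not code from the book: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real framework would run the map and reduce steps in parallel over a distributed file system instead of in one Python process.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce sequentially over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big datasets"]))
```

Because each call to `map_phase` looks at one document only and each call to `reduce_phase` looks at one key only, both steps could in principle be spread across many machines, which is what makes the approach attractive for data that does not fit in main memory.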