WebIndex Launch

Today is the official launch of the Web Index. Over the last few months I have collaborated, through my activities in the WESO Research Group, with the Web Foundation to publish its statistical data following the Linked Data principles. I think we have published an appropriate version of this data, and I hope to continue this fruitful collaboration with my new colleagues in the coming months.

You can find more information about the Web Index as Linked Data at http://data.webfoundation.org/.

If you have any comments or suggestions, please feel free to contact me at any time.

Best,

PhD Presentation

View more presentations from Jose María Alvarez

CFP: Data Mining on Linked Data workshop with Linked Data Mining Challenge

To be held during the 20th Int. Symposium on Methodologies of Intelligent Systems, ISMIS 2012, 4-7 December 2012, Macao (http://www.fst.umac.mo/wic2012/ISMIS/)

Official CFP

Workshop Scope

Over the past 3 years, the Semantic Web activity has gained momentum with the widespread publishing of structured data as RDF. The Linked Data paradigm has therefore evolved from a practical research idea into a very promising candidate for addressing one of the biggest challenges in the area of intelligent information management: the exploitation of the Web as a platform for data and information integration in addition to document search. Although numerous workshops have already emerged at the intersection of Data Mining and Linked Data (e.g. Know@LOD at ESWC) and even Challenges have been organized (USEWODs at WWW, http://data.semanticweb.org/usewod/2012/challenge.html), the particular setting chosen (with a highly topical Government-related dataset) will allow participants to explore new methods of exploiting Linked Data with state-of-the-art mining tools.

Workshop Topic

The workshop consists of an Open Track and of a Challenge Track.
The Open Track invites submissions of regular research papers describing novel approaches to applying Data Mining techniques to Linked Data sources.

Participation in the Challenge Track requires participants to download a real-world RDF dataset from the domain of Public Contract Procurement and accomplish at least one of the four pre-defined tasks on it using their own or a publicly available data mining tool. To get access to the data, participants have to register for the Challenge Track at http://keg.vse.cz/ismis2012. A partial mapping to external datasets will also be available, which will allow the extraction of further potential features from the Linked Open Data cloud. Task 1 will amount to unrestricted discovery of interesting nuggets in the (augmented) dataset. Task 2 will be similar, but the category of interesting hypotheses will be partially specified. Task 3 will concern prediction of one of the features natively present in the training data (but only added to the evaluation dataset after the result submission). Task 4 will concern prediction of a feature manually added to a sample of the data by a team of domain experts.
Participants will submit textual reports (Challenge Track papers) and, for Tasks 3 and 4, also the classification results.

Submissions

Both the research papers (submitted to the Open Track) and the Challenge Track papers should follow the Springer formatting style. The templates for Word and LaTeX are available on the workshop website, http://keg.vse.cz/ismis2012, and can also be found at http://www.springer.com/authors/book+authors?SGWID=0-154102-12-417900-0. Submissions should not exceed 10 pages. All papers will be made available on the workshop web pages, and post-conference proceedings will be published for selected workshop papers.

Papers (and results for Tasks 3 and 4) should be submitted via EasyChair: http://www.easychair.org/conferences/?conf=ismis2012dmold

Important Dates

Data ready for download: June 20, 2012
Workshop paper and result data submissions: August 10, 2012
Notification of Workshop paper acceptance: August 25, 2012
Workshop: December 4, 2012

Review of “Mining of Massive Datasets”

Finally, I have finished reviewing this excellent book about mining massive datasets. The sheer mass of data on the web is continuously growing, and many new methods, algorithms and tools are emerging to deal with it, in some cases without providing a formal model for processing the information. In this book, the authors present a compilation of the most widely used algorithms (with their formal definitions) for building recommendation systems based on data mining techniques.

I strongly recommend this book because it focuses on mining very large amounts of data that do not fit in main memory. This situation currently arises in the management of digital libraries, the analysis of social networks, bioinformatics, etc., where processing large datasets is necessary. The main topics are shown in the following figures; according to the authors, you will learn the following concepts:

Distributed file systems and map/reduce approaches as a tool for creating parallel algorithms
Methods for similarity search, including how to estimate and compute similarity
Processing of data streams with specialized algorithms
Technology of existing search engines: PageRank, link-spam detection, etc.
Frequent-itemset mining
Algorithms for clustering very large and high-dimensional datasets
Two main applications of these techniques: advertising and recommendation systems

Nevertheless, I miss a section about real-time processing of large amounts of data, as opposed to streaming techniques.

Product Scheme Classifications

Continuing the work on publishing the CPV as a linked dataset, we have finished the first beta release of new product scheme classifications (PSCs) as linked data in the context of e-procurement. The following diagram shows the ongoing work on the transformation of PSCs (the gray ones have not yet been transformed):

The process of promoting all these PSCs to linked data (more information can be found in pscs-catalogue at thedatahub.org) has been carried out stepwise (similar to http://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook):

Select the PSCs to be transformed and download the data source (MS Excel in most cases)
Model the information about a PSC item using existing vocabularies, defining new concepts and relations where required (as in the CPV case), and design the URIs
Transform the data using Google Refine
Create the mappings between a PSC and the Product Ontology (custom Java-based reconciler adapted to the descriptions of PSC items)
Create the mappings between a PSC and the CPV 2008 (custom Java-based reconciler between a source PSC and a target PSC)
Validate mappings and links
Add dataset descriptions using VoID vocabulary
Store in Virtuoso and publish data with Pubby
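
The two mapping steps above were done with a custom Java-based tool that is not shown in this post. Purely as an illustration of the general idea, the following Python sketch links items of a source PSC to items of a target PSC by comparing their descriptions. The token-overlap (Jaccard) measure, the threshold and the sample identifiers are all assumptions made for this example, not the method actually used.

```python
import re


def tokens(text):
    """Lowercase a description and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-overlap similarity between two descriptions, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def reconcile(source, target, threshold=0.5):
    """Link each source item to its most similar target item.

    source and target map an item id to its description (hypothetical
    input shape). Only links scoring at least `threshold` are kept.
    Returns a list of (source_id, target_id, score) tuples.
    """
    links = []
    for sid, sdesc in source.items():
        best_id, best_score = None, 0.0
        for tid, tdesc in target.items():
            score = jaccard(sdesc, tdesc)
            if score > best_score:
                best_id, best_score = tid, score
        if best_id is not None and best_score >= threshold:
            links.append((sid, best_id, best_score))
    return links


if __name__ == "__main__":
    # Made-up ids and descriptions, for illustration only.
    source = {"s1": "Construction work for buildings"}
    target = {"t1": "Building construction work", "t2": "Soya beans"}
    for link in reconcile(source, target, threshold=0.3):
        print(link)
```

The score kept with each link is what a later validation step could inspect before the link is published.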

The definition of a PSC item (?product) comprises the following URI patterns and properties:

URI for datasets: http://purl.org/weso/pscs/{psc}/{year|version}/resource/ds
URI for resources: http://purl.org/weso/pscs/{psc}/{year|version}/resource/{id}
URI for classes and properties: http://purl.org/weso/pscs/{psc}/{year|version}/ontology/
rdf:type <pscs:PSCConcept> (rdf:type skos:Concept)
dcterms:identifier “id” (the id that is part of the URI)
skos:notation “raw id” (the real id that appears in the data source)
skos:prefLabel, gr:description and rdfs:label “description”
skos:inScheme <void:Dataset>, <skos:ConceptScheme>
skos:broaderTransitive/skos:narrowerTransitive <PSCConcept> (in some cases the broader concept of an item cannot be inferred from the codes; for those cases we have defined a custom property called “pscs:level”)
pscs:relatedMatch (mapping between ?product and items of ProductOntology). The next release will include a “confidence” value to establish the weight of the matches.
skos:exactMatch <PSCConcept> (some PSCs already define mappings among themselves; we reuse this information)
skos:closeMatch <PSCConcept> (mapping between ?product and items of CPV 2008). The next release will include a “confidence” value to establish the weight of the matches.
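
As a small illustration of the URI patterns and some of the properties listed above, the following Python sketch builds the URIs for a PSC and a few statements about one item. The function names are made up for this example, the prefixed names are left unexpanded, and how the "id" is derived from the "raw id" is not specified here, so both are passed in as-is.

```python
BASE = "http://purl.org/weso/pscs"


def dataset_uri(psc, version):
    """URI of the void:Dataset for a PSC, e.g. psc='cpv', version='2008'."""
    return f"{BASE}/{psc}/{version}/resource/ds"


def resource_uri(psc, version, item_id):
    """URI of a single PSC item."""
    return f"{BASE}/{psc}/{version}/resource/{item_id}"


def ontology_uri(psc, version):
    """Namespace URI for the classes and properties of a PSC."""
    return f"{BASE}/{psc}/{version}/ontology/"


def describe_item(psc, version, item_id, raw_id, description, lang="en"):
    """Return (subject, predicate, object) tuples for one PSC item.

    Only a subset of the listed properties is emitted; objects are
    written in a Turtle-like form with prefixed names left as-is.
    """
    subject = f"<{resource_uri(psc, version, item_id)}>"
    escaped = description.replace("\\", "\\\\").replace('"', '\\"')
    return [
        (subject, "rdf:type", "pscs:PSCConcept"),
        (subject, "dcterms:identifier", f'"{item_id}"'),
        (subject, "skos:notation", f'"{raw_id}"'),
        (subject, "skos:prefLabel", f'"{escaped}"@{lang}'),
        (subject, "skos:inScheme", f"<{dataset_uri(psc, version)}>"),
    ]


if __name__ == "__main__":
    # The id, raw id and description below are hypothetical sample values.
    for s, p, o in describe_item("cpv", "2008", "12345", "123.45", "Sample item"):
        print(s, p, o, ".")
```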

The whole linkset of PSCs can be found at http://purl.org/weso/pscs/, and we have also extracted some statistics (PSC void:Dataset, graph IRI and number of triples):

http://purl.org/weso/pscs/cn/2012/resource/ds, http://purl.org/weso/pscs/cn/2012, 137,484
http://purl.org/weso/pscs/cpa/2008/resource/ds, http://purl.org/weso/pscs/cpa/2008, 92,749
http://purl.org/weso/pscs/cpc/2008/resource/ds, http://purl.org/weso/pscs/cpc/2008, 100,819
http://purl.org/weso/pscs/cpv/2003/resource/ds, http://purl.org/weso/pscs/cpv/2003, 546,135
http://purl.org/weso/pscs/cpv/2008/resource/ds, http://purl.org/weso/pscs/cpv/2008, 803,311
http://purl.org/weso/pscs/isic/v4/resource/ds, http://purl.org/weso/pscs/isic/v4, 18,986
http://purl.org/weso/pscs/naics/2007/resource/ds, http://purl.org/weso/pscs/naics/2007, 36,292
http://purl.org/weso/pscs/naics/2012/resource/ds, http://purl.org/weso/pscs/naics/2012, 35,390
http://purl.org/weso/pscs/sitc/v4/resource/ds, http://purl.org/weso/sitc/v4, 70,887

Try this query: “Give me 100 products or services related to ‘construction’ in any PSC that have a mapping with products or services in CPV 2008 (descriptions in English)”
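
One possible way to phrase that question in SPARQL is sketched below as a small Python helper that only builds the query text; it does not contact any endpoint. It assumes the properties described above (skos:prefLabel for descriptions, skos:closeMatch for the CPV 2008 mappings), assumes that CPV 2008 items share the resource URI prefix given by the URI pattern, and uses SPARQL 1.1 string functions that the endpoint may or may not support.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
# Assumption: CPV 2008 items follow the resource URI pattern shown above.
CPV_2008_PREFIX = "http://purl.org/weso/pscs/cpv/2008/resource/"


def build_query(keyword, lang="en", limit=100):
    """Build a SPARQL query for PSC items whose label contains `keyword`
    and that have a skos:closeMatch link to a CPV 2008 item."""
    if '"' in keyword or "\\" in keyword:
        raise ValueError("keyword must not contain quotes or backslashes")
    return f"""PREFIX skos: <{SKOS}>
SELECT DISTINCT ?product ?label ?cpv
WHERE {{
  ?product skos:prefLabel ?label ;
           skos:closeMatch ?cpv .
  FILTER (langMatches(lang(?label), "{lang}"))
  FILTER (CONTAINS(LCASE(STR(?label)), "{keyword.lower()}"))
  FILTER (STRSTARTS(STR(?cpv), "{CPV_2008_PREFIX}"))
}}
LIMIT {limit}"""


if __name__ == "__main__":
    print(build_query("construction"))
```

The printed text can be pasted into the query form of a SPARQL endpoint that holds the PSC graphs.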

Old-Fashioned Common Procurement Vocabulary 2008 and 2003

The Common Procurement Vocabulary (CPV) establishes a single classification system for public procurement aimed at standardising the references used by contracting authorities and entities to describe the subject of procurement contracts.

The CPV consists of a main vocabulary for defining the subject of a contract, and a supplementary vocabulary for adding further qualitative information. The main vocabulary is based on a tree structure comprising codes of up to 9 digits (an 8 digit code plus a check digit) associated with a wording that describes the type of supplies, works or services forming the subject of the contract.

The digits of a code identify its position in the tree:

The first two digits identify the divisions (XX000000-Y);
The first three digits identify the groups (XXX00000-Y);
The first four digits identify the classes (XXXX0000-Y);
The first five digits identify the categories (XXXXX000-Y);

Each of the last three digits gives a greater degree of precision within each category. A ninth digit serves to verify the previous digits.
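
The structure described above can be sketched in a few lines of Python: given a main-vocabulary code, the function below derives the codes of its division, group, class and category by keeping the leading digits and padding with zeros. The function name is made up for this example, and the check digit is only parsed, not verified, since the verification rule is not described here.

```python
import re

# Eight digits, optionally followed by "-" and one check digit.
_CPV_CODE = re.compile(r"^(\d{8})(?:-(\d))?$")


def cpv_levels(code):
    """Split a CPV main-vocabulary code into its hierarchy levels.

    Returns a dict with the 8-digit codes of the division, group,
    class and category the code belongs to, plus the code itself and
    the check digit (None if absent; it is not validated).
    """
    match = _CPV_CODE.match(code.strip())
    if match is None:
        raise ValueError(f"not a CPV code: {code!r}")
    digits, check = match.groups()

    def pad(n):
        # Keep the first n digits and fill the rest with zeros.
        return digits[:n] + "0" * (8 - n)

    return {
        "division": pad(2),
        "group": pad(3),
        "class": pad(4),
        "category": pad(5),
        "code": digits,
        "check_digit": check,
    }


if __name__ == "__main__":
    for level, value in cpv_levels("03111100").items():
        print(level, value)
```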

The supplementary vocabulary may be used to expand the description of the subject of a contract. The items are made up of an alphanumeric code with a corresponding wording allowing further details to be added regarding the specific nature or destination of the goods to be purchased.

The alphanumeric code is made up of:

a first level comprising a letter corresponding to a section;
a second level comprising four digits, the first three of which denote a subdivision and the last one being for verification purposes

(Information available at: http://simap.europa.eu/codes-and-nomenclatures/codes-cpv/codes-cpv_en.htm)

The dataset comprises the CPV 2008 and CPV 2003 codes and the mappings between them. All this information is publicly available via the WESO SPARQL endpoint (5-star linked data) and a Pubby frontend. The data and definitions are structured as follows:

CPV 2008. Graph IRI: http://purl.org/weso/cpv/2008. Total: 556,335 triples.
- Scheme: http://purl.org/weso/cpv/2008/scheme
- Dump file (Turtle) (25 MB)
- Division: http://purl.org/weso/cpv/2008/03000000
- Group: http://purl.org/weso/cpv/2008/03100000
- Class: http://purl.org/weso/cpv/2008/03110000
- Category: http://purl.org/weso/cpv/2008/03111000 | http://purl.org/weso/cpv/2008/03111100
- Mapping example (subject, predicate, object): http://purl.org/weso/cpv/2008/03111100 http://purl.org/weso/cpv/definitions/codeIn2003 http://purl.org/weso/cpv/2003/01113100

CPV 2003. Graph IRI: http://purl.org/weso/cpv/2003. Total: 191,430 triples.
- Example: http://purl.org/weso/cpv/2003/01113100
- Scheme: http://purl.org/weso/cpv/2003/scheme
- Dump file (Turtle) (7.8 MB)

CPV Definitions. Graph IRI: http://purl.org/weso/cpv/definitions. Triples: 43
- Dump file (Turtle) (7.4 KB)

The definitions have been made using the vocabularies:

The whole dataset uses links to other datasets (28,839):

GoodRelations and Product Ontology products and descriptions

In order to create all this data we have used different tools:

Google Refine and the RDF extension (to produce data)
Pubby (to publish data)
OpenLink Virtuoso (to store data)

Collaborators:

José Emilio Labra (Main Researcher of WESO Research Group at the University of Oviedo)
The first version of the CPV was developed in 2007 in conjunction with my colleagues at CTIC, Luis Polo and Emilio Rubiera.

Acknowledgements:

This work is part of the MOLDEAS system developed by the WESO Research Group within the partnership project 10ders Information Services, led by Gateway Strategic Consultancy Services and developed in cooperation with Exis-TI. The project is partially funded by the Spanish Ministry of Industry, Tourism and Trade (code TSI-020100-2010-919) and the European Regional Development Fund (ERDF) under the National Plan of Scientific Research, Development and Technological Innovation 2008-2011.

TO DO List

Check broken links
Review the design of URIs
Create Named graphs to group different divisions/groups/classes/categories
Link to other datasets
Reconcile all products and services with the DBpedia resources
Develop a GUI based on Exhibit, SNORQL, etc.
Send this dataset and statistics to the Linked Data Cloud
Update public procurement notices with the new URIs