Some leisure time hacking…

After some time working on a new public spending effort with my colleagues from the National Technical University of Athens, a Linked Open Data portal of public spending has been released. I contributed to the parts on unifying company names and linking product scheme classifications; nevertheless, they are the ones who have done a very good job bringing a large amount of public contracts metadata into the Linked Open Data initiative. Watch the following video to get an idea of the work:

Publicspending.net from John Fidias on Vimeo.

Hope to continue this fruitful collaboration!

CFP: Data Mining on Linked Data workshop with Linked Data Mining Challenge

To be held during the 20th Int. Symposium on Methodologies of Intelligent Systems, ISMIS 2012, 4-7 December 2012 Macao (http://www.fst.umac.mo/wic2012/ISMIS/)

Official CFP

Workshop Scope

Over the past 3 years, the Semantic Web activity has gained momentum with the widespread publishing of structured data as RDF. The Linked Data paradigm has therefore evolved from a practical research idea into a very promising candidate for addressing one of the biggest challenges in the area of intelligent information management: the exploitation of the Web as a platform for data and information integration in addition to document search. Although numerous workshops have already emerged at the intersection of Data Mining and Linked Data (e.g. Know@LOD at ESWC) and even Challenges have been organized (USEWODs at WWW, http://data.semanticweb.org/usewod/2012/challenge.html), the particular setting chosen (with a highly topical Government-related dataset) will make it possible to explore new methods of exploiting Linked Data with state-of-the-art mining tools.

Workshop Topic

The workshop consists of an Open Track and of a Challenge Track.
The Open Track expects submissions of regular research papers describing novel approaches to applying Data Mining techniques to Linked Data sources.

Participation in the Challenge Track will require the participants to download a real-world RDF dataset from the domain of Public Contract Procurement, and accomplish at least one of the four pre-defined tasks on it using their own or a publicly available data mining tool. To get access to the data, participants have to register for the Challenge Track at http://keg.vse.cz/ismis2012. A partial mapping to external datasets will also be available, which will allow for extraction of further potential features from the Linked Open Data cloud. Task 1 will amount to unrestricted discovery of interesting nuggets in the (augmented) dataset. Task 2 will be similar, but the category of interesting hypotheses will be partially specified. Task 3 will concern prediction of one of the features natively present in the training data (but only added to the evaluation dataset after the result submission). Task 4 will concern prediction of a feature manually added to a sample of the data by a team of domain experts.
Participants will submit textual reports (Challenge Track papers) and, for Tasks 3 and 4, also the classification results.

Submissions

Both the research papers (submitted to the Open Track) and the Challenge Track papers should follow the Springer formatting style. The templates for Word and LaTeX are available at the workshop website http://keg.vse.cz/ismis2012 and can also be found at http://www.springer.com/authors/book+authors?SGWID=0-154102-12-417900-0. The length of a submission should not exceed 10 pages. All papers will be made available on the workshop web pages, and there will be post-conference proceedings for selected workshop papers.

Papers (and results for Tasks 3 and 4) should be submitted via EasyChair: http://www.easychair.org/conferences/?conf=ismis2012dmold

Important Dates

  • Data ready for download: June 20, 2012
  • Workshop paper and result data submissions: August 10, 2012
  • Notification of Workshop paper acceptance: August 25, 2012
  • Workshop: December 4, 2012

 

Compiling Related Work about Linked Data Quality

One of the cornerstones for boosting the use of Linked Data is ensuring the quality of the data along different dimensions such as timeliness, correctness, etc. The intrinsic features of this initiative provide a framework for the distributed publication of data and resources (linking together data sources on the web). Because of this open approach, some mechanisms are needed to check whether data is well linked or is just an attempt to link together some part of the web. In most cases links between resources are discovered and created automatically (e.g. with the Silk Framework), which means that the process is to some degree ambiguous and a human decision is required. As for the data itself, its quality may vary because information providers have different levels of knowledge, different objectives, etc. Information and data are therefore released to accomplish a specific task, and their quality should be assessed against criteria that depend on the specific domain.

For instance, if a data provider is releasing information about payments, is it possible to check which decimal separator is used, 10,000 or 10.000? Is this information homogeneous across all resources in the dataset? If a literal value should be “Oviedo”, what happens if the real value is “Obiedo”? How can we detect and fix these situations?
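To make these two questions more concrete, here is a minimal Python sketch of how such checks might look. It is only an illustration and not part of any of the works listed below: the function names `separator_styles` and `suggest_correction`, the two regular expressions, the reference list of city names and the 0.8 similarity cutoff are all assumptions chosen for the example. The separator check is a simple heuristic (a value such as 10.000 stays ambiguous on its own), and the typo check relies on the standard library's `difflib`.

```python
import re
from difflib import get_close_matches

# Heuristic patterns (assumptions for this sketch): thousands grouped with dots
# (10.000 or 10.000,50) versus thousands grouped with commas (10,000 or 10,000.50).
DOT_GROUPED = re.compile(r"^\d{1,3}(\.\d{3})+(,\d+)?$")
COMMA_GROUPED = re.compile(r"^\d{1,3}(,\d{3})+(\.\d+)?$")


def separator_styles(amounts):
    """Return the set of grouping styles found in a list of amount literals.

    More than one style in the result suggests the dataset is not homogeneous.
    """
    styles = set()
    for amount in amounts:
        if DOT_GROUPED.match(amount):
            styles.add("dot-grouped")
        elif COMMA_GROUPED.match(amount):
            styles.add("comma-grouped")
    return styles


def suggest_correction(literal, vocabulary, cutoff=0.8):
    """Suggest the closest known value for a literal that is not in the vocabulary.

    Returns None when the literal is already known or nothing is similar enough.
    """
    if literal in vocabulary:
        return None
    matches = get_close_matches(literal, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    # Hypothetical payment literals and reference vocabulary.
    print(separator_styles(["10,000", "10.000", "1,250,000"]))
    print(suggest_correction("Obiedo", ["Oviedo", "Madrid", "Athens"]))
```

A check like this only flags suspicious values; deciding whether “Obiedo” really should be “Oviedo” still needs a trusted reference list or a human decision.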

These cases have motivated some related work:

  • The PhD thesis of Christian Bizer, which proposes a template language and a framework (WIQA) to detect whether a triple fulfills the requirements to be accepted in a dataset. (2007)
  • The LODQ vocabulary, an RDF model for expressing criteria about 15 kinds of metrics that were formulated by Glenn McDonald on a mailing list. A processor for this vocabulary is still missing. (2011)
  • A paper entitled “Linked Data Quality Assessment through Network Analysis” by Christian Gueret, in which some metrics are provided to check the quality of links. This work is part of the LATC project. (2011)
  • The COLD workshop (Consuming Linked Data) is also a good starting point for reviewing the problems of, and approaches to, implementing linked data applications.
  • …that are collected in the aforementioned works.
In some sense one might think that this problem is new, but the truth is that it is inherited from traditional databases. One of the questions that arises is whether existing approaches can be applied to quality assessment in the linked data realm… but this will be evaluated in upcoming posts.
This first post is just a short introduction to linked data quality research and approaches. In the coming weeks, we will try to review these works in depth and propose a solution (LODQAM).
Thank you very much!
Best regards,