Week #3 in Thessaloniki

I will keep updating this post during the week because I am reading a lot of papers and I need a way to track them. Following the same structure as in previous weeks, I leave some links to the activities I am carrying out:

Reading

I have focused on some interesting subjects: Statistics (Bayesian networks), Data Streams, Feedback Control Loops, Autonomous Computing and e-Learning systems (the last one just out of personal interest). I have started, and finished, the following papers and books:

Writing and reviewing

I would like to leave the link to an article about “How to review a paper”, an excellent guide for evaluating your own reviews and keeping in mind your responsibilities as a reviewer.

Coding and Tools

I have not made any relevant progress on development tasks, but I have been refreshing my know-how on R.

Teaching

I have finished the evaluation of the students in Health Information Systems and I am very proud of the marks and of the work they have carried out during the last months. I have some links to their work building mashups, but I prefer not to leave them here due to privacy issues.

WebIndex Launch

Today is the official launch of the Web Index. Over the last months I have collaborated, through my activities in the WESO Research Group, with the Web Foundation to publish its statistical data following the Linked Data principles. I think we have published an appropriate version of this data, and I hope to continue this fruitful collaboration with my new colleagues in the coming months.

You can find more information about the Web Index as Linked Data at http://data.webfoundation.org/.

If you have any comments or suggestions, please feel free to contact me at any time.

Best,

R & Big Data Intro

I am very committed to improving my know-how on delivering solutions that deal with Big Data in a high-performance fashion. I am continuously looking for tools, algorithms, recipes (e.g. the Data Science e-book), papers and technology that enable this kind of processing, because it was expected to become relevant in the coming years but it is already a reality!

Last week I went back to using R and Rcmdr to analyze and extract statistics from my PhD experiments using the Wilcoxon test. I started with R three years ago, when I developed a simple graphical interface in Visual Studio to input data and send operations to the R interpreter. The motivation for this work was to help a colleague with his final degree project, and the experience was very rewarding.
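To give an idea of what the Wilcoxon test computes, here is a minimal sketch in Python of the signed-rank statistic for paired samples. It is only an illustration under assumptions: the post does not say which variant of the test was used (signed-rank or rank-sum), the function name and the sample values are hypothetical, and the p-value is left out. In practice a statistics package such as R's `wilcox.test` does all of this for you.

```python
def signed_rank_statistic(before, after):
    """Return (w_plus, w_minus), the rank sums of the positive and negative
    paired differences used by the Wilcoxon signed-rank test."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")

    # Differences per pair; zero differences carry no sign and are dropped.
    diffs = [a - b for b, a in zip(before, after) if a != b]

    # Order by absolute difference so that ranks can be assigned.
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)

    i = 0
    while i < len(ordered):
        # Extend j over the run of tied absolute differences.
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        # Tied values share the average of their 1-based ranks.
        average_rank = (i + 1 + j + 1) / 2
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Hypothetical measurements for two configurations of the same experiment.
baseline = [10, 12, 9, 14, 11]
improved = [12, 11, 13, 14, 15]
print(signed_rank_statistic(baseline, improved))
```

The smaller of the two rank sums is the value usually compared against the critical value; a very unbalanced pair suggests that one configuration systematically beats the other.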

What is the relation between Big Data and R?

The explanation is simple: a key enabler for providing added-value services is managing and learning from historical logs. By putting together an excellent statistics suite and the Big Data realm, it is possible to meet the requirements of a great variety of services in domains like NLP, recommendation, business intelligence, etc. For instance, there are approaches that combine R with Hadoop, such as RICARDO or Parallel R, and new companies, like Revolution Analytics, are emerging to offer services based on R to process Big Data.

This post was a short introduction to R as a tool to exploit Big Data. If you are interested in this kind of approach, please take a look at the following presentation by Lee Edlefsen:

Keep on researching and learning!

Nomenclátor Asturias 2010

DEPRECATED: NEEDS TO BE UPDATED. See:

This dataset, created by SADEI, contains information about the populated places of my region, Asturias, including:

  • Codes to identify the type of a populated place: CC/PP/EE (CC: code of the first-level division, called “Concejo”; PP: code of the second-level division, called “Parroquia Rural”; EE: code of the third-level division, the actual place)
  • Name in Spanish and Asturian
  • Statistics about: altitude, distance, area, men, women and number of dwellings (main and non-main)

The places are structured as a hierarchy of 3 levels: Concejo (Municipality), Parroquia Rural and others such as city, town, suburb, etc. Depending on the type of place some statistics are missing, and this is indicated with a value of “-1”. For instance, “Concejo” and “Parroquia Rural” do not have “altitude” and “distance”, and third-level places do not have “area”.

In any case, all this information is publicly available via the WESO SPARQL endpoint (5-star linked data) and a Pubby frontend (more information about the dataset can be found in the nomenclator-asturias dataset at thedatahub.org). The structure of the data and definitions is as follows:
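Because “-1” is a marker and not a real measurement, it has to be filtered out before computing anything, otherwise averages and totals are silently wrong. A minimal sketch in Python of how this could be handled follows; the field names and the sample records are hypothetical and do not come from the dataset.

```python
MISSING = -1  # sentinel described above for statistics that do not apply


def clean_record(record):
    """Return a copy of the record with the sentinel replaced by None."""
    return {key: (None if value == MISSING else value)
            for key, value in record.items()}


def mean_of_known(records, field):
    """Average a statistic over the records where it is actually present."""
    values = [r[field] for r in records if r.get(field) is not None]
    return sum(values) / len(values) if values else None


# Hypothetical records: one first-level place without altitude
# and two third-level places without area.
raw = [
    {"name": "Concejo A", "altitude": -1, "area": 25.5},
    {"name": "Place B", "altitude": 200, "area": -1},
    {"name": "Place C", "altitude": 400, "area": -1},
]
places = [clean_record(r) for r in raw]
print(mean_of_known(places, "altitude"))
```

Replacing the sentinel with `None` once, at load time, means later code cannot mistake it for an altitude of one metre below sea level.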

  • Nomenclátor statistics definitions. Graph IRI: http://purl.org/weso/nomenclator/stats/ontology. Total: 68 triples. Example at: http://purl.org/weso/nomenclator/stats/ontology/physicaldata
  • Nomenclátor statistics dataset. Graph IRI: http://purl.org/weso/nomenclator/asturias/2010/stats. Total: 370,160 triples.
  • Example query: “Give me all municipalities that have more women than men”
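The logic of that example query can be sketched in plain Python over records already retrieved from the endpoint. This is only an illustration of the filter, not the SPARQL query itself: the field names, the `level` values and the sample figures are hypothetical.

```python
def municipalities_with_more_women(records):
    """Names of first-level places ("Concejo") where women outnumber men."""
    return [
        r["name"]
        for r in records
        if r["level"] == "Concejo" and r["women"] > r["men"]
    ]


# Hypothetical observations, one per populated place.
observations = [
    {"name": "Concejo A", "level": "Concejo", "men": 510, "women": 490},
    {"name": "Concejo B", "level": "Concejo", "men": 1200, "women": 1350},
    {"name": "Parroquia C", "level": "Parroquia Rural", "men": 40, "women": 55},
]
print(municipalities_with_more_women(observations))
```

The restriction to the first level matters because the dataset holds observations for all three levels of the hierarchy, and the question only asks about municipalities.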

    The definitions have been made using the vocabularies:

    The whole dataset uses links to other datasets (126,127):

    • 1 link to NUTS
    • 78 links to DBpedia, one per “Concejo”
    • 78,859 links to DBpedia, one per populated place and observation
    • 55,146 links to Reference Data Gov UK, one per populated place and observation
    • 70,904 links to SDMX attributes (sex-m and sex-f)
    • 29 links to GeoLinkedData.es