As you may well be aware, Richard Cyganiak and Anja Jentzsch have been working on an updated version of the now-famous Linked Data cloud. Quite some work has gone into streamlining the process this time, using CKAN to list all existing repositories along with additional details such as their links and the vocabularies they use. This listing will be handy for many other endeavours, so we have to thank them not just for taking the time to produce the diagram we see so often in presentations, but also for getting people to produce a pretty detailed list of datasets.
The updated cloud shows that this initiative keeps gaining uptake and that its growth has now accelerated significantly. Creating the new diagram has highlighted the large number of quite diverse datasets (around 200) and the amount of data (around 25 billion RDF triples) now estimated to be part of the Web of Data. Although you may well not find the data you’d like to use in your application today, it is becoming clearer that if it is not there yet, it may be there soon. Personally, I am happy to see in the cloud the first bubble capturing Semantic Web Services, published by our own dataset iServe. We hope this will be the first of many steps towards better integrating the data and services worlds.
While they were working on it, I had a short discussion with Richard about the principles and decisions they adopted. Part of this discussion was, in a sense, the “old discussion” on schemas vs. data. Clearly, some pragmatic decisions have to be made to generate a sensible and usable diagram, and the ones they have chosen are, in my opinion, perfectly reasonable. As a result of these decisions, the diagram focuses on datasets that have enough links to other datasets and leaves out more scattered sources of data, such as FOAF files or GoodRelations descriptions in product pages. My only concern is that this is a diagram of datasets rather than of the Web of Data. Not that Richard and Anja claim otherwise, but this is often how people present it, and the diagram now has a considerable influence on people. It is therefore quite possible that people will focus on meeting the criteria for appearing in the diagram rather than on generating novel vocabularies, data, or even added-value services on top of these.
While thinking about this and the schema vs. data issue, I thought that a good way to deal with it might be a two-layered diagram. At one level one would have the instances, described according to certain schemas and published by certain datasets or even by Web sites that just serve files (e.g., FOAF files). At the other level one would have the set of schemas used, interlinked both with other vocabularies and with the instances layer. Generating this diagram would obviously require considerable analysis, and it would probably need an interactive interface to provide reasonably clear visualisation and navigation. I would be willing to create this myself, although I have to admit that I can hardly find the time to do so right now. I would be happy if anybody could pick this up or team up to create this kind of visualisation, since I’m certain it would be pretty useful for many endeavours.
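To make the two-layer idea a bit more concrete, here is a minimal sketch in Python of the data structure such a diagram might sit on. Everything in it is an assumption made for illustration: the class `TwoLayerDiagram`, its method names, and the sample entries (`foaf-file`, `product-page`, `dataset-a`, `dataset-b`, `vocab-x`) are hypothetical placeholders, not real datasets or an existing tool. It only records the three kinds of edges described above and does no analysis or drawing.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TwoLayerDiagram:
    """Hypothetical container for the two layers and the links between them."""

    # Schema layer: vocabulary -> vocabularies it links to.
    schema_links: dict = field(default_factory=lambda: defaultdict(set))
    # Instance layer: data source -> data sources it links to.
    instance_links: dict = field(default_factory=lambda: defaultdict(set))
    # Cross-layer edges: data source -> vocabularies its instances use.
    uses: dict = field(default_factory=lambda: defaultdict(set))

    def link_schemas(self, source_vocab, target_vocab):
        self.schema_links[source_vocab].add(target_vocab)

    def link_sources(self, source, target):
        self.instance_links[source].add(target)

    def add_usage(self, source, vocab):
        self.uses[source].add(vocab)

    def sources_using(self, vocab):
        """Return, in sorted order, every data source that uses the given vocabulary."""
        return sorted(s for s, vocabs in self.uses.items() if vocab in vocabs)


# Made-up sample entries, for illustration only.
diagram = TwoLayerDiagram()
diagram.add_usage("foaf-file", "FOAF")              # a Web site just serving a FOAF file
diagram.add_usage("product-page", "GoodRelations")  # a product page with a GoodRelations description
diagram.add_usage("dataset-a", "FOAF")              # a placeholder dataset that also uses FOAF
diagram.link_sources("dataset-a", "dataset-b")      # instance-layer link between placeholder datasets
diagram.link_schemas("vocab-x", "FOAF")             # schema-layer link from a placeholder vocabulary
```

With this in place, `diagram.sources_using("FOAF")` lists both the scattered FOAF file and the dataset, which is the kind of view the current dataset-only diagram does not offer.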