prefix.cc, MkII

January 11th, 2010

prefix.cc is a website I’ve made last February to ease a very common task in the life of RDF developers and SPARQL users: looking up namespace URIs. A short summary of what the site can do for you is available here.

The site was developed during a few weekends, and I haven’t touched the code since I first deployed it. Today I’m publishing the first serious update to the site. This post describes what’s new.

Reverse lookup. One of the most requested features is reverse lookup. You can now enter a URI of an RDF term into the query box on the start page, and the site will respond with the best prefix for contracting that URI into a QName. This functionality is also available as an API.

Negative votes. The site has received a moderate amount of spam, mostly from pranksters who think it would be funny to propose their own homepage as a better expansion for the foaf prefix. I’ve mostly cleaned this up manually, but I think it would be better to equip the user community with tools to handle this.

The site has always had a voting mechanism, which I intended as a tiebreaker in cases where people have submitted different URIs for the same prefix, for example in the case of the dc prefix. Starting today, you can submit both positive and negative votes. If a URI receives a certain amount of negative votes, it will be no longer shown.

New export formats. One of my favourite features is the ability to directly get output in various machine-readible syntaxes by composing an appropriate URI, such as http://prefix.cc/foaf.file.n3, which produces a declaration of the FOAF prefix in N3 format. I find this handy for copy-pasting into a text editor, but also for automating things.

A few formats have been added: vann produces an RDF/XML version of the namespace mapping in the VANN vocabulary (example). xmlns produces raw XML prefix declarations (example). go redirects to the namespace URI, so you can type http://prefix.cc/foaf.go into your browser bar as a shortcut for opening the FOAF specification. I’ve also added a table of all supported formats.

A side effect of the introduction of VANN support is that there is now a single VANN representation of all mappings known to the site.

Tweaks and fixes. Regular users will note a number of further small changes and bugfixes throughout the site. One notable fix is to the way namespace lookups are calculated for the list of popular prefixes. Ironically, most of the lookups actually are from web crawlers that followed the links in the list itself, making the list self-perpetuating. Also, the list featured the non-existing robots prefix, because many crawlers are looking for http://prefix.cc/robots.txt. These issues should now be fixed.

Internal changes. The site is developed in PHP, and started out as a quick weekend hack, so the initial code was a horrible mess that was hardly maintainable. I spent quite some time cleaning this up and refactoring the code into a much nicer structure that should be able to grow along with some of the additional features I’ve planned for the future. The codebase now totals some 1600 lines of PHP, CSS and Javascript.

Hidden goodies: RDFa markup and feed of latest additions. Finally, I want to highlight some features that have existed all along, but are easily missed: First, many pages contain RDFa markup, so if you want to re-use any prefix.cc data in your own site or application, you most likely can. Second, there is an RSS feed of the latest additions to the prefix database, and it is a neat way of learning about new vocabularies and ontologies that show up around the Web.

Bugs, comments, suggestions? Any feedback is appreciated. I did a lot of refactoring without a test harness, so it’s quite likely that a few new bugs have crept in. If you notice anything, please let me know. Also, if there is anything that you would like to see in prefix.cc Mk III, please share!

What’s in a name? And the Linked Data Police

November 21st, 2009

So I wrote a rather angry private email to Erik Wilde a few days ago, complaining about his use of the term “linked data” for a site that doesn’t follow the linked data practices. Erik decided to publish my email on his blog, along with a long defense of his use of the term, in a post called “The Linked Data™ Police”. Since it’s in public now, we can just as well see if we can get a useful discussion out of this.

First, I realize that Erik probably responded more to the tone of my email than to the content. It was an angry rant, and my tone misses the mark, so his response is fair enough, and I have apologised to him. I also have to say that I’m speaking only for myself and nobody else—before others become scared of the “Linked Data™ Police”, I can assure them that to the best of my knowledge, it has a staff of one, and my peers in the community are in general a friendly and civil lot.

The site in question. So, what is this about? Erik and team have built recovery.berkeley.edu, a site that publishes structured data about Recovery Act spending. The site was built with a grant from the Sunlight Foundation. The technologies of choice are Atom and various other XML formats. As far as I can see, it’s excellent in its adherence to REST principles, including good URI design and “hypermedia as the engine of application state”. These two together are labelled as “linked data” in the site’s technical documentation.

This is a discussion about names, and not about substance. At the very core is the following question: Should we understand “linked data” to mean “the idea of somehow connecting pieces of data with links”, or should we take it to mean “RDF published according to the rules outlined by Tim Berners-Lee in the design note that coined the term”?

Obviously, I’m of the latter opinion. In this post, I want to do two things: First, I want to respond to some specific points from Erik’s post. I will do this by paraphrasing each point, and then responding to it. Second, I want to explain why I care about the matter and why I think that “linked data” should continue to be associated with Tim’s rules, and why advocates of different sets of rules should use different terms.

Erik: “The attitude is scary: Instead of figuring out the most effective way of adding more semantics to the web, it starts with a set of technologies and claims that whatever you want to do, you have to use those.” Linked data didn’t start with a set of technologies; a lot of deliberation by a lot of people went into the choice. Also, I have no quarrel with Erik’s choice of technologies, and I didn’t even suggest that he should or shouldn’t use any technology. There can be good reasons against using RDF, and it’s a good thing that innovation continues in other areas of web data technology.

But if Erik and colleagues don’t buy into the set of technology choices commonly called “linked data”, then why would they insist on using that name? What’s wrong with the established technical terms, REST and Resource-oriented Architecture? Is it just because those are already way past the peak of the hype cycle?

Erik: “Using generic problem names to refer to specific technologies only confuses people.” Erik mentions Linked Data, XML Schema, Semantic Web and Web Services as specific technologies that he considers to be badly labelled. But the peculiar thing about those four is not their names; they are controversial for other reasons. The IT world is full of specific technologies that use generic names: World Wide Web. Structured Query Language. Extensible Markup Language. Hypertext Transfer Protocol. Portable Document Format. Resource-Oriented Architecture. Scalable Vector Graphics. Open Document Format. Erik may not like it, but it’s a common practice.

Erik: “Choosing such names is usually an attempt to make competition harder.” Usually? I doubt that. Technologies are named in their very infancy, when their future and success is far from certain, and when competition is usually not an issue. The naming is usually an attempt to communicate as clearly as possible what the proposed technology is supposed to achieve, which is not a bad thing at all. Some fail at achieving the goal, but everyone designs (and names) assuming that it can be eventually achieved.

Erik: “RDF is just a stylesheet away.” Erik points out that it would be trivial to create a GRDDL transform that translates from the service’s output to RDF. Personally I wouldn’t call it trivial, and being just one transformation away from being compatible is not the same as being compatible. If there were GRDDL transforms in place, I would have no reason at all to complain, although just a few linked data clients support GRDDL at this time.

Back to the roots. So where did the term “linked data” come from? To the best of my knowledge, Tim Berners-Lee coined it in his 2006 Design Note that is titled “Linked Data”. The document introduced the four rules that are now known as the “Linked Data Principles.” Erik’s service is following all of them except the one that demands RDF or SPARQL.

It’s worth pointing out that the four rules did not mention RDF when Tim originally published them, but it is clear from the rest of the document that the use of RDF was implied. The document was aimed at the semantic web community. His later change was a clarification, not a change of intention.

I don’t know why Tim wrote this piece back in 2006, but my interpretation was that he wanted more people to publish data that can be browsed with his Tabulator RDF browser, and most RDF out there at that time couldn’t be browsed because of problems with one of the four rules. So I read it as a call for better interoperability among RDF publishers.

Broadening the term? There have been a number of calls for broadening the meaning of the term, most eloquently from Paul Miller, so Erik is certainly not alone in his view. Their intention is to get linked data quicker into the mainstream, which is a goal that I share. The problem is that broadening a term makes it less meaningful. There is a danger that the term gets extended to the point where it’s equally meaningless to other buzzwords such as Web 3.0 or the venerable Semantic Web. If you can use other formats instead of RDF, then why not also use SOAP instead of HTTP? Why not do away with the URIs? Why not YQL instead of SPARQL? Where does “linked data” stop? Everything is somehow “data” and somehow “linked.”

Interoperability requires choices to be made. In my eyes, the great thing about the term “linked data” is that it has a reasonably precise technical definition, rooted in Tim’s Design Note and the early work of the Linking Open Data project. That work has turned the Semantic Web’s compelling but vague promises of a side-by-side “web for humans” and “web for machines” into concrete guidelines that people can actually implement, and the result is an ecosystem of interoperable tools, clients and datasets that continues to grow around these guidelines.

These guidelines will continue to evolve with the emergence of new technologies (e.g., RDFa) and increasing experience and maturity (e.g., importance of licensing and provenance handling).

But at the core, it has to be about a set of concrete technology choices and deployment practices that foster an interoperable ecosystem of data sources and clients. “Linked data” is the best name we have for that particular set of technology choices and practices. There is nothing magic about the name “linked data”, to the best of my knowledge it didn’t exist at all in the web community before 2006. The term has gained popularity because it has associated rules that tell you how to do it, not because of the “words”. Without the rules, the term would be meaningless fluff. Everything is somehow “linked” and somehow “data.”

If you think that a different set of rules would work better (which is entirely possible), then it would be prudent to write them down, coin a new term for them, and start the legwork of advertising them, just as Tim did since 2006.

Linked data at the New York Times: Exciting, but buggy

October 30th, 2009

Update: Evan Sandhaus reports that all the issues mentioned below will be fixed. Great!

Yesterday at the International Semantic Web Conference, Evan Sandhaus of the New York Times unveiled data.nytimes.com, a site that publishes linked data for some parts of the Times’ index. To me, this was one of the most exciting announcements at the conference, and it caused quite a tweetstorm during and after Evan’s talk.

A bit of background: Every article published in the newspaper or on the website is tagged, classified and categorized in many ways by skilled editors. This metadata allows the creation of topic pages that automatically collect relevant articles for notable people, organisations, and events. Examples include Michelle Obama, Swine Flu (H1N1 Virus) and Wrestling.

What’s in the data? The dataset published yesterday contains information on each of the concepts that have a topic page. For now, it is limited to topic pages about people. The concepts are modelled in SKOS. The information attached to each concept consists mostly of links: to DBpedia, to Freebase, into the Times API (which is not available as RDF at this point), and of course to the corresponding topic page. This means that if you have a DBpedia URI for an especially notable entity, a high-quality New York Times topic page with the latest news about the topic is only two RDF links away. A notable feature of the links is that every single one has been manually reviewed, making this perhaps the highest-quality linkset in the LOD cloud.

How to get the data? This being linked data, every concept has a dereferenceable URI. Examples:

The site’s URI scheme follows one of the Cool URIs recipes: The identifiers above are resolvable, and by using content negotiation, web browsers are redirected to

http://data.nytimes.com/N13941567618952269073.html

which has a nicely formatted summary of the data available about Michelle Obama. Data browsers and other RDF-enabled clients, on the other hand, are redirected to

http://data.nytimes.com/N13941567618952269073.rdf

which has all the data goodness in RDF/XML.

There is also a dump: people.rdf. You can browse the data starting from the data.nytimes.com page. Everything is available under a CC-BY license.

Bugs and problems

This being a new dataset and the Times’ first foray into linked data, it turns out that the Beta label on the site is quite warranted. I will highlight four issues.

Data and metadata are mixed together. Let’s look at the data about Michelle Obama, available at the N13941567618952269073.rdf URI above. I’m reformatting the data into Turtle for legibility.

<http://data.nytimes.com/N13941567618952269073>
    a skos:Concept;
    skos:prefLabel "Obama, Michelle";
    skos:definition "Michelle Obama is the first …";
    skos:inScheme nyt:nytd_per;
    nyt:topicPage <http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html>;
    owl:sameAs <http://rdf.freebase.com/rdf/en.michelle_obama>;
    owl:sameAs <http://data.nytimes.com/obama_michelle_per>;
    owl:sameAs <http://dbpedia.org/resource/Michelle_Obama>;

This makes perfect sense, it’s data about a person, modelled as a SKOS concept. But then it goes on:

<http://data.nytimes.com/N13941567618952269073>
    dc:creator "The New York Times Company";
    time:start "2007-05-18"^^xsd:date;
    time:end "2009-10-08"^^xsd:date;
    dcterms:rightsHolder "The New York Times Company"^^xsd:string;
    cc:license "http://creativecommons.org/licenses/by/3.0/us/";
    .

This is not data about Michelle Obama the person, it’s metadata about the data published by the NYT. It’s certainly not true that Michelle Obama was created by the New York Times, or that she “started” in 2007 (whatever that’s supposed to mean), and don’t even get me started on asserting a rights or a license over a person.

Note that the NYT team actually went through the effort of setting up separate URIs for Michelle the person (http://data.nytimes.com/N13941567618952269073), and for the HTML and RDF documents describing the concepts (http://data.nytimes.com/N13941567618952269073.html and http://data.nytimes.com/N13941567618952269073.rdf). The reason why linked data experts advocate this practice of having separate URIs is exactly because it enables separation of data and metadata: It lets you state some facts about the concepts, and other things about the documents that describe the concepts. This is what should be done in the data above: The metadata should not be asserted about the URI identifying Michelle, but about the URI identifying the document published by the NYT: N13941567618952269073.rdf. So we would get:

<http://data.nytimes.com/N13941567618952269073>
    a skos:Concept;
    skos:prefLabel "Obama, Michelle";
    skos:definition "Michelle Obama is the first …";
    skos:inScheme nyt:nytd_per;
    nyt:topicPage <http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html>;
    owl:sameAs <http://rdf.freebase.com/rdf/en.michelle_obama>;
    owl:sameAs <http://data.nytimes.com/obama_michelle_per>;
    owl:sameAs <http://dbpedia.org/resource/Michelle_Obama>;

<http://data.nytimes.com/N13941567618952269073.rdf>
    dc:creator "The New York Times Company";
    time:start "2007-05-18"^^xsd:date;
    time:end "2009-10-08"^^xsd:date;
    dcterms:rightsHolder "The New York Times Company"^^xsd:string;
    cc:license "http://creativecommons.org/licenses/by/3.0/us/";
    .

Eric Hellman has a post about this issue, calling it “a potential legal disaster” because a license is attached to a resource that’s said to be the same as a resource on a different site (DBpedia and Freebase). He’s a bit alarmist, but this example highlights why the separation of data and metadata, of concept URIs and document URIs, is critically important in a general-purpose data model.

Distinguishing URIs and literals. Here’s some selected snippets from the RDF/XML output:

    <nyt:topicPage>http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html</nyt:topicPage>
    <cc:License>http://creativecommons.org/licenses/by/3.0/us/</cc:License>
    <cc:Attribution>http://data.nytimes.com/N13941567618952269073</cc:Attribution>

The value of all three properties are URIs. In the RDF data model, URIs are of such central importance that they are treated differently from any other kind of value (strings, integers, dates). But not so in the code example above. There, the three URIs are encoded as simple strings. This should be:

    <nyt:topicPage rdf:resource="http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html" />
    <cc:License rdf:resource="http://creativecommons.org/licenses/by/3.0/us/" />
    <cc:Attribution rdf:resource="http://data.nytimes.com/N13941567618952269073" />

Why does this matter? It’s basically like making links “clickable” in HTML by putting them into a <a href=”…”> tag: RDF clients will not recognize URIs if they are encoded as literals, and will not know that they can treat them as links that can be followed.

Content negotiation for hybrid clients. As usual for linked data emitting sites, there is content negotiation on the concept URIs: They redirect either to RDF or HTML, based on the Accept header sent by the client when resolving the URI via the HTTP protocol. Also as usual for first-time linked data producers, the content negotiation is a bit broken.

Here is what happens when I ask for HTML (using cURL, which is a handy tool for debugging the HTTP behaviour of linked data sites):

$ curl -I -H "Accept: text/html" http://data.nytimes.com/N13941567618952269073

Response:

HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.html

Next I will ask for RDF:

$ curl -I -H "Accept: application/rdf+xml" http://data.nytimes.com/N13941567618952269073

Response:

HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.rdf

So far, so good. But many clients are “hybrid”, they can consume both RDF and HTML. This includes many tools that can consume RDFa (RDF embedded in HTML pages). So it’s not uncommon to find tools that combine multiple media types in the accept header. The Times server should also redirect those tools to the RDF, because any RDF-consuming client can probably handle the raw RDF data better than the (not overly useful) HTML pages. But let’s see what happens:

$ curl -I -H "Accept: text/html,application/rdf+xml" http://data.nytimes.com/N13941567618952269073

Response:

HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.rdf.html

The server redirects to a file that doesn’t exist, ending in .rdf.html. This is pretty funny to me as a programmer, because the bug gives me a glimpse into the Times codebase, where obviously a programmer didn’t consider that the two alternatives—sending HTML or sending RDF—are exclusive.

Update: Someone at the Times seems to be working on the server as I’m writing this; the latest behaviour is even worse; it redirects to .rdf.html even if I request only RDF, and uses 301 redirects instead of 303.

Using the Creative Commons schema. The NYT data uses the Creative Commons schema to license the data under CC-BY. Here’s the relevant RDF, in Turtle (I fixed the subject URI and turned literals into URIs where appropriate):

<http://data.nytimes.com/N13941567618952269073.rdf>
    cc:License <http://creativecommons.org/licenses/by/3.0/us/>;
    cc:Attribution >http://data.nytimes.com/N13941567618952269073<;
    cc:attributionName "The New York Times Company";
    .

This uses three properties: cc:License, cc:Attribution and cc:attributionName. But according to the schema, cc:License and cc:Attribution are classes, not properties. This should be:

<http://data.nytimes.com/N13941567618952269073.rdf>
    cc:license <http://creativecommons.org/licenses/by/3.0/us/>;
    cc:attributionURL <http://data.nytimes.com/N13941567618952269073>;
    cc:attributionName "The New York Times Company";
    .

Summary. The Times’ foray into linked data is an exciting new development, but it also shows how hard it is to get linked data right. This is a weakness of the linked data approach.

Can we do anything about this? Better tutorials and education can probably help. Another activity that is trying to address the issue is the Pedantic Web Group, a loose collection of people like me who obsess about the technical details of publishing data on the web and work with data publishers to get issues like the above fixed. We might even give you a hand with reviewing your stuff before you go live with it.

Multiple Java versions on OS X, and their paths

August 18th, 2009

Java version management on OS X is a wee bit complicated. Here’s what I understand.

All different versions of Java are installed into the directory:
/System/Library/Frameworks/JavaVM.framework/Versions/

For example, the JDK home directory for 1.4.2 would be Versions/1.4.2/Home/.

There are three ways to access specific versions.

  1. The hardcoded default. There is the hardcoded default version that comes with the current version of the OS. For OS X Leopard, this is Java 1.5. This version can always be accessed through the path /Library/Java/Home.
  2. The Java Preferences application. recent versions of Java on OS X install an application “Java Preferences” into /Applications/Utilities. It allows you to select the preferred version by dragging it to the top of the list. There is one list for applets, and one list for applications and the command line. The terminal commands (e.g., java) use that version from the second list. The path of this version, in case you need it, can be obtained by running the command /usr/libexec/java_home.
  3. Setting JAVA_HOME. By doing so one can override the choice that is made in the Java Preferences application. If the JAVA_HOME is set, the command line applications will use that version. However, /usr/libexec/java_home will still return the path of the version that was selected in Java Preferences.

“Magic” versions. In addition to the different Java versions, the Versions directory also contains several “magic” directories:

Versions/CurrentJDK/ is the hardcoded default version of the OS, so on Leopard it will always be an alias pointing to /Versions/1.5. Note that this is not affected by whatever version is selected in Java Preferences or via JAVA_HOME. Messing manually with the symlink to make it point to a different version is probably not a good idea. /Library/Java/Home points here.

Versions/Current/ is an alias that points to Versions/A/. This, in turn, is not a proper Java version like the other directories in /Versions/. It contains internal parts of the OS X Java machinery, e.g., in Versions/Current/Commands there are “fake” binaries such as java, javac and javadoc that internally use /usr/libexec/java_home (and JAVA_HOME, if set) to find the “real” binary. When you call Java commands from the command line, you actually invoke these “fake” binaries (via symlinks in /usr/bin). These are system internals, and poking around in there too much is probably not a good idea.

Finding the current version. So, what is the correct way to determine the location of the current Java version on OS X, say from a shell script?

  1. If JAVA_HOME is set, use that.
  2. Otherwise, invoke /usr/libexec/java_home to find the path.
  3. If that fails, fall back to /Library/Java/Home.

Should you set JAVA_HOME? If you don’t need it, don’t set it at all. If you need it, setting it via

export JAVA_HOME=`/usr/libexec/java_home`

is probably not a bad idea, because this will reflect future changes to the selected version from the Java Preferences application.

Back from ESWC 2008

June 9th, 2008

The European Semantic Web Conference is over and I’m back in Galway. It was a lot of fun, two hundred smart and interesting people in a sea-side tourist resort, and the thought “I can’t believe I’m getting paid for this” crossed my mind a few times.

It was good to see that some of the topics close to my interest — SPARQL, Linked Data, the LOD project, Semantic Search — were well-represented in the conference program and in the attention of the crowd.

I gave four talks:

  1. A five-minute lightning talk on Sindice.com in Sunday’s spontaneous and surprisingly well-attended LOD gathering. Slides (PDF)

  2. Neologism: Easy Vocabulary Publishing, together with Sergio Fernández, in the SFSW workshop, about our soon-to-be-released RDFS vocabulary editor and publishing system. Slides (PDF)

  3. Semantic Sitemaps about DERI’s proposed Semantic Sitemaps standard for improved discovery of RDF data published on the Web. Slides (PDF)

  4. There’s more than US-ASCII, a two-minute lightning talk in which I complain bitterly about the inability of Semantic Web developers to properly handle characters outside of the usual US-ASCII charset in their apps. The single slide is reproduced below.

There's more than US-ASCII (presentation slide)

The best part was meeting in person for the first time some of the people I’m working with on a regular basis, including Michael Hausenblas, Keith Alexander, Hugh Glaser, Andreas Langegger, and Olaf Hartig, and making some exciting new connections. Email is great, but sitting down face to face still beats it by an order of magnitude in effectivity, and also in fun if drinks are involved.

What is your RDF browser’s Accept header?

March 17th, 2008

I was debugging some content negotiation related issue the other day and made a little tool that allows me to find out what Accept header different RDF-aware HTTP clients send. If you ever need to know the Accept header of a particular RDF-aware HTTP client, just make it show the RDF loaded from this URI:

http://richard.cyganiak.de/2008/03/rdfbugs/accept.php

The RDF contains a dc:description with the browser’s Accept header. If you have the Tabulator Firefox extension installed, you can simply click the link and see the output.

I tried this with a couple of tools and here are the results:

Tabulator Firefox Extension 0.8.2:

application/rdf+xml, application/xhtml+xml;q=0.3, text/xml;q=0.2,
application/xml;q=0.2, text/html;q=0.3, text/plain;q=0.1, text/n3,
text/rdf+n3;q=0.5, application/x-turtle;q=0.2, text/turtle;q=1

Jena’s Model.read(…) method:

application/rdf+xml, application/xml; q=0.8, text/xml; q=0.7,
application/rss+xml; q=0.3, */*; q=0.2

Disco Hyperdata Browser:

application/rdf+xml;q=1,text/xml;q=0.6,text/rdf+n3;q=0.9,
application/octet-stream;q=0.5,application/xml q=0.5,
application/rss+xml;q=0.5,text/plain; q=0.5,application/x-turtle;q=0.5,
application/x-trig;q=0.5,text/html;q=0.5

OpenLink RDF Browser:

application/rdf+xml, text/rdf+n3, application/rdf+turtle,
application/x-turtle, application/turtle, application/xml, */*

SindiceBot:

application/rdf+xml, application/xml;q=0.6, text/xml;q=0.6

Some of these are pretty funny actually, but that’s a post for another day.

Tabulator does N3

March 17th, 2008

In my podcasted chat with Danny Ayers the other day I said that Tabulator doesn’t support N3, the highly readable RDF serialization syntax developed by Tim Berners-Lee.

It turns out I was wrong. I thought Tabulator supported RDF/XML only. But it turns out Tabulator has excellent support for N3 as well. I’m not sure how I managed to miss this. Seems like the Tab’r team sneaked that feature in while no one (or at least I) was not looking!

To make Tabulator eat your N3, you just have to make sure it’s served with the right content type: text/rdf+n3. If you publish N3, you can test this by installing cURL (see my earlier cURL for SemWebbers tutorial) and running:

curl -I "http://example.com/myfile.n3"

If your server is set up correctly, then the output contains a line like this:

Content-Type: text/rdf+n3; charset=utf-8

If you see e.g. text/plain instead, then your server is misconfigured. If you serve static files with Apache, you can fix this by adding this line to your Apache’s httpd.conf file or to a file called .htaccess in your DocumentRoot:

AddType text/rdf+n3;charset=utf-8 .n3

And then make sure that your filenames end in .n3.

Actually, this is really good news. As far as I’m concerned, the Tabulator Firefox extension, being the most readily accessible Semantic Web client currently out there, defines what is on the Semantic Web and what isn’t. If you can’t browse it in Tabulator, then it isn’t. (Insert caveat about RDFa and SPARQL here, which Tabulator probably will support in the future.)

I really like N3, because it is a much friendlier format than the alternative RDF/XML. It is very easy to read, generate, and even hand-write. More so than, say, JSON. A Semantic Web built on N3 would be a much nicer place than a Semantic Web built on RDF/XML. Tool support for N3 is still not quite as good as for RDF/XML, but we are getting there.

So, what does this mean in practice?

  1. Do you develop or maintain an application that consumes RDF from the Web? If you do, then you should make sure that it understands both RDF/XML and N3.
  2. Do you develop or maintain a library, framework or API that can load and parse RDF from the Web, from a URI? If you do, then you should make sure that it invokes the right parser, depending on the Content-Type header of the response. This should happen completely transparently from your users’ point of view.
  3. Do you author educational material, tutorials or slides that use RDF in examples? Do your audience a favor and do them in N3, not RDF/XML!
  4. If you already produce RDF/XML, as an information publisher, you shouldn’t worry. RDF-aware clients won’t stop supporting RDF/XML.

Making these things happen in the tools and documents that I myself maintain will be quite a bit of work. But I think it’s worth it. N3 is a bit like the old days of HTML, you can actually view source and understand what’s going on. N3 is the human-friendly way of writing RDF.

QOTD: timbl is a document

January 24th, 2008

Simon Spero on the SKOS list:

The meaning of “documet” in this context is extremely broad; if we follow Otlet’s definition of a document as anything which can convey information to an observer, the term would seem to cover anything which can have a subject.

By this standard, timbl is a document, but only when someone’s looking.

Ah, the Semantic Web community! Please leave your common sense at the door …

I named 61 HTML elements in 5 minutes.

November 26th, 2007

61

(via Dominik Wagner)

A challenge for semantic query wizards

November 13th, 2007

Bernard Vatant:

A challenge for semantic query wizzards: Find the smallest connected graph having at least a node in each LOD data set.

Interesting. The hardest part might be finding paths through the FOAF bubble.

Any takers?

Finding Rahoon

November 13th, 2007

Rahoon is a nondescript area of Galway, the third-largest city in the Republic of Ireland. Unsurprisingly, I had never heard about Rahoon before last week, when I moved there, to join Giovanni Tummarello’s team at DERI Galway.

Moving is stressful. But being an RDF geek, I not only have to drag physical stuff around, but I also have to update a few triples. For example, by joining DERI I got yet another URI:

http://www.deri.ie/fileadmin/scripts/foaf.php?id=313#me

That’s a new owl:sameAs for my FOAF file. I also have to update my foaf:based_near triple. Until now, my FOAF file stated:

<#cygri> foaf:based_near >http://dbpedia.org/resource/Berlin< .

The obvious new value for this property would be http://dbpedia.org/resource/Galway. But I want to be a bit more specific. That’s where this post’s title comes in. What is the URI of Rahoon?

The big datasets: Unfortunately, Wikipedia doesn’t have an article for Rahoon, hence it isn’t in DBpedia.

Then there’s Geonames, the most comprehensive source of geographical information on the Semantic Web. It has an entry for a nearby structure called Rahoon House, but not for the area itself.

The search engines: Next I tried all the Semantic Web search engines from the Linking Open Data project’s list.

Of the seven available services, only three produced any results. I was quite disappointed that the venerable Swoogle, one of the first large-scale Semantic Web indices, didn’t return any results at all.

SWSE (a DERI project) returned two hits, but neither of them was semantic (a web page mentioning Rahoon and a Bollywood-related RSS item).

Sindice (disclosure — it’s another DERI project, and I am now a team member) did much better, it found Geonames’ Rahoon House, and a bunch of loosely Rahoon-related things from DBpedia, but it didn’t turn up any URI for the Rahoon area itself.

Finally there is Falcons, a recently announced project developed at Nanjing’s Southeast University. Its results are similar to Sindice’s, except that it missed the Geonames entry.

In summary, only Falcons and Sindice found anything of interest, but neither struck gold.

Mint your own? At this point, I perhaps have to accept that the only existing relationship between RDF and Rahoon is the fact that Giovanni has been living here for a while, and that I’m not going to find a URI for the place.

The usual advice at this point is to mint a new URI for the thing in question. But I don’t want to go down this road, because I simply do not feel sufficiently competent in matters of Irish geography. What ”is” Rahoon, really? An administrative area? A geographical region? A postal code? A bus stop? I don’t know.

Based near a blank node: My solution is to ignore the widely accepted wisdom that RDF blank nodes are considered harmful. I will state that I’m living near something, and describe that something as good as I can:

foaf:based_near [
    a pos:Point;
    pos:lat "53.27702";
    pos:long "-9.09019";
    rdfs:label "125 Rosan Glas, Rahoon, Galway, Ireland";
    geo:parentFeature <http://sws.geonames.org/2964180/>
];

The skeleton of this N3 fragment was easily created using my FOAF geolocator (which I recently extended with address search and N3 output — have a look at it if your FOAF file still lacks geolocation!). I added a label and a link to the next-largest Geonames feature (Galway), which I easily found with Sindice.

Rahoon is still without a URI, but I guess I should let it be for now and rather worry about applying for a social security number, registering for taxation, and so forth.

DERI! At any rate, I’m happy to be here and look forward to working with a great team on some very exciting projects.

LazyWeb request

April 11th, 2007

a) A Sudoku solving web service.
b) A Sudoku generating web service.
c) Hook them up to each other.

Objectviewer: Yet another linked data browser

April 3rd, 2007

Via Troy Self’s introduction to the Linking Open Data list I came across the Objectviewer, yet another Semantic Web browser based on the linked data principles. This increases the number of available Semantic Web browser prototypes to four: Tabulator, Disco, the OpenLink Ajax Toolkit browser, and now ObjectViewer.

ObjectViewer has quite a nice visualization of the browsed resource as a simple graph, which isn’t really all that useful in practice, but always makes for stunning demos. A live example on some dbpedia data is here (the browser chrome is missing, I couldn’t figure out how to make a proper direct link), and a screenshot is below.

Objectviewer screenshot

Webby data everywhere!

Neil Bartlett: “StatSVN helps startups get funded”

March 15th, 2007

Neil Batlett has an interesting take on StatSVN and StatCVS:

One problem that startup companies often have is demonstrating to investors that they’re actually doing something productive rather than just pouring away money on office plants, Herman Miller chairs, and playing foosball all day. … One thing you can do is show the evolution of your code over a period of time using a tool like StatSVN.

Lines of code are certainly not the most meaningful numbers, but they are a nice and simple way of demonstrating activity. Sometimes that’s all you need.

Less code: eRDF templates for RDF-driven web sites

March 15th, 2007

Keith Alexander experiments with using eRDF markup to populate HTML templates:

I was writing a php template, marking it up with eRDF, and I realised that what I was doing was describing variables with triples - which is essentially what I would be doing to write a SPARQL query to retrieve data for the template.

So the core of the idea is: using semantic markup in a template to generate queries, retrieve data and populate the template.

I have started to implement the idea, using eRDF for the semantic markup, SPARQL as the query language I generate to, and Smarty as the templating language. (I use the ARC RDF PHP classes for parsing the eRDF into triples, and for running the SPARQL queries).

Keith has blogged this in much more detail here, including code and template samples.

This is quite a clever idea. Let’s assume you have a web application driven by data from an RDF triple store. You generate HTML pages by querying the triple store and inserting the bits and pieces into an HTML template. Now if you add eRDF or RDFa annotations to the HTML template, in a way that reflects the original RDF data, then by definition the annotations completely specify what data you need to populate the page. And the template itself therefore must be sufficient to extract all the required triples from the store. No coding needed!

So, generalising the approach and glossing over many details: Just take a big ball of RDF data (dump or behind a SPARQL endpoint), and throw a bunch of HTML templates with embedded annotations at it, and you get a dynamic web site without writing any code. And the web site will have complete semantic annotations.

That’s an example of what becomes possible after you’ve payed the RDF tax.

Keith points out that this is similar to what Fresnel is designed to do, but I have to say that I find this template-based approach more appealing.

(Via [simile-general])

Trilingual word mashup

March 13th, 2007

The German readers will appreciate my mix of surprise and horror when I realized I had just typed this word in an email:

folksonomymäßig

A word that is certain to hurt the sensibilities of every lover of either the English, Latin, or German language. Now if I manage to sprinkle a little bit of french into the mix …

SPARUL—SPARQL Update Language

March 9th, 2007

Andy Seaborne announces a first draft of SPARUL, the SPARQL/Update Language:

This document describes SPARQL/Update (nicknamed “SPARUL”), an update language for RDF graphs. It uses a syntax derived form SPARQL. Update operations are performed on a collection of graphs in a Graph Store. Operations are provided to change existing RDF graphs as well as create and remove graphs with the Graph Store. A binding of SPARQL/Update using HTTP POST is described.

Max Völkel and I did a very rough proposal for a similar language back last year. We received some criticism over this: Tunneling application protocols over HTTP is not an optimal use of the web. Case in point: the WS-* stack. I tried to work out the issues by asking how RESTful SQL would look like, a potentially illuminating analogy. I found the results inconclusive—I understand the concerns raised by REST proponents, but haven’t seen a better alternative.

The main question, I think, is one of scope: Is SPARQL Update intended as an SQL-like language that applications use to communicate with their local or nearby data store? Or is it intended as public web infrastructure, similar to Web 2.0 APIs and HTTP PUT?

The SPARUL proposal doesn’t really take a position here, although this might be interpreted as a nod towards the former:

An update service that is separate from the query service has the advantages that different security mechanisms can be applied and that the query interface remains a legal, SPARQL service.

So, public query service and local update service?

An answer to all (well, some) of your URI questions

March 5th, 2007
  • Aren’t URNs much more elegant than those brittle HTTP URIs?
  • Why is everyone yapping about 303 redirects?
  • Hash vs. slash?
  • What’s the deal with content negotiation and the Semantic Web?
  • Shouldn’t we use blank nodes anyway?

There’s a lot of confusion around URIs on the Semantic Web. You have to do quite a bit of reading and trial-and-error to arrive at effective solutions. Leo, Max and I wrote Cool URIs for the Semantic Web (Leo’s announcement) to take some of the pain out of this process.

A couple of random companion posts from my archives:

And, always worth a link:

URIs for exceptions?

March 4th, 2007

Over in the comments to Henry Story’s bug ontology post, I wrote:

There should be RDF representations of program error reports, such as Java exceptions. Then I could SPARQL for “NullPointerException in class so-and-so of project foobar”, and possibly a solution has been filed, or at least I will find a related bug.

Drew Perttula adds:

As to Richard’s “RDF representations of program error reports”, see http://themongoose.sourceforge.net for one of several projects that hash up the stack trace into an error id. Those seem like they could lead to excellent automatic URLs which can be later associated with the tracking of the bug that makes that stack trace. I’d love to get an error and paste its url directly in my browser to see “this error has [n] frequency in the last few weeks; [these] other users have been experiencing it; [this] developer is working on the bug fix, and the details for that bug are [here]“.

This would be very useful and is entirely doable. Exceptions should have URIs that resolve to the project’s issue tracker or web-based support forum.

Dr. Chris Bizer

February 15th, 2007

Congrats, Chris!