The semantic web and languages that are not English

I hadn’t come across this recent addition to HP Labs’ list of technical reports before: An Introduction to the Semantic Web, Considerations for building multilingual Semantic Web sites and applications by Jeremy Carroll.

If you read my blog, then you probably want to skip the “Introduction to the SW” part. The rest of the report is a highly focused look at the issues involved in building semantic web applications in any language that is not English: Unicode, language tags on literals and embedded XML, URIs vs. rdfs:labels, and IRIs.

The paper also mentions a feature introduced to RDQL in Jena 2.2: the langeq operator. It is used to filter literals based on their language tags, which is useful if your RDF data contains literals in multiple languages.

langeq can deal with subtags: asking for labels in German (tag de) will also give you labels in German as used in Switzerland and written according to the 1996 spelling reform (tag de-CH-1996).

Example usage:

SELECT ?resource, ?label
WHERE (?resource, rdfs:label, ?label)
AND (?label langeq 'it')
USING rdfs FOR <http://www.w3.org/2000/01/rdf-schema#>
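
To make the subtag behaviour concrete, here is a rough Python sketch of the kind of matching described above. It is not Jena's implementation; the helper lang_matches and the sample labels are made up for illustration. The idea is simply to compare whole subtags, case-insensitively, from the left.

```python
def lang_matches(tag: str, wanted: str) -> bool:
    """Return True if `tag` equals `wanted` or is a more specific variant of it.

    Hypothetical helper, not part of Jena or RDQL.
    """
    tag_parts = tag.lower().split("-")
    wanted_parts = wanted.lower().split("-")
    # Compare whole subtags, so 'de' matches 'de-CH-1996' but not 'den'.
    return tag_parts[:len(wanted_parts)] == wanted_parts


# Made-up (label, language tag) pairs standing in for RDF literals.
labels = [
    ("Straße", "de"),
    ("Strasse", "de-CH-1996"),
    ("street", "en"),
    ("strada", "it"),
]

# Keep only the labels whose tag is German or a German subtag.
german = [text for text, tag in labels if lang_matches(tag, "de")]
print(german)
```

Asking for de keeps both the plain German label and the Swiss one, while the English and Italian labels are dropped.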

The current SPARQL working draft doesn’t have a facility like this, but thanks to SPARQL’s powerful expression system, you can emulate the same thing:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?resource ?label
WHERE {
    ?resource rdfs:label ?label
    FILTER REGEX(LANG(?label), '^it(-|$)', 'i')
}

What’s going on here? LANG(?label) gives the label’s language tag. It is matched against the regular expression ^it(-|$), which matches the string it and any string that starts with it-. The ‘i’ modifier to the REGEX function makes the match case-insensitive, as required by RFC 3066 and its replacement-in-progress, draft-phillips-langtags.
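
If you want to convince yourself of what the expression accepts, you can try it outside of a query engine. The following is a small sketch using Python's re module, on the assumption that it treats this particular pattern the same way a SPARQL engine's REGEX would; the function name is_italian and the sample tags are my own.

```python
import re

# Same pattern as in the FILTER; IGNORECASE plays the role of the 'i' flag.
pattern = re.compile(r"^it(-|$)", re.IGNORECASE)


def is_italian(tag: str) -> bool:
    """Return True if the language tag is 'it' or starts with 'it-'."""
    return pattern.search(tag) is not None


# Sample tags: plain, upper case, with a subtag, a lookalike, unrelated, empty.
for tag in ["it", "IT", "it-CH", "ita", "fr", ""]:
    print(repr(tag), is_italian(tag))
```

The tags it, IT and it-CH are accepted; ita, fr and the empty tag of an untagged literal are rejected.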

Jeremy also points out yet another ugly wart of RDF(/XML): Language tagging is inconsistent for XML literals. To tag a plain literal, you put an xml:lang attribute on its property element. If you do the same for an XML literal, the language tag will be ignored. Instead, you have to put the attribute onto some element within the XML literal.

IRIs (Internationalized Resource Identifiers) are yet another interesting addition to the semantic web acronym soup. The next time someone you know tries to understand the differences between URLs, URIs and URNs, just mention that “they soon will all be replaced by IRIs anyway.” This is a great way to keep sane people away from our line of work.

Back to Jeremy’s report. I enjoyed this quote:

If you are asked to help with production of a multilingual Semantic Web application you will be asking tool developers for new features, you will be pushing at the boundaries, and finding problems in the specifications - budget accordingly …

Very true. But the same applies to unilingual semantic web applications.
