Pérez et al.: Semantics and Complexity of SPARQL

Jorge Pérez, Marcelo Arenas, Claudio Gutierrez: Semantics and Complexity of SPARQL (preprint)

SPARQL has a dirty little secret: The spec doesn’t define the formal semantics. There is a lot of language to explain how everything is supposed to work, and there are a lot of examples, but no rigorous mathematical definitions that allow one to settle weird corner cases.

That HP Labs tech report: When I was working on SPARQL-to-SQL translation at HP last year, I struggled a lot with this lack of formal semantics. I ended up creating my own – A relational algebra for SPARQL, published as an HP tech report. Unfortunately, it clearly disagreed with what Andy and the other DAWG folks had in mind in a couple of cases. And I lack the mathematical skills to write up a thorough formal definition of anything. In summary, I’m not particularily proud of the report.

Pérez et al. to the rescue: Yesterday, Claudio Gutierrez (of Temporal RDF fame) and his co-authors announced their paper at the DAWG mailing list. It’s a great piece of work. Their contributions:

  • a formal “compositional” semantics for SPARQL (which matches my relational definition, but with real maths),
  • a formalization of the “computational” semantics implicit in the SPARQL spec (and implemented in ARQ),
  • a proof that both are equivalent except in a few edge cases*,

  • a convincing argument that none of the edge cases do occur in real-world queries,
  • a clear and simple syntactic condition that avoids the problematic edge cases,
  • a bunch of useful transformation rules for normalizing SPARQL queries.
  • Some caveats:

  • The paper looks only at the default graph; SPARQL’s dataset features are not discussed,
  • only the boolean operators for FILTER conditions are discussed,
  • FILTER conditions are restricted to the variables occuring in the containing pattern – a restriction that SPARQL doesn’t make,
  • no formal account of literals and datatypes is given (e.g. for FILTER comparisions and ORDER BY).
  • Why this is good news: I feel that this paper is a great formal foundation for SPARQL. It presents clear and reasonable answers to all of the weird edge cases we had to pussyfoot around until now. It makes me confident that SPARQL can be formally analyzed and there are answers to the hard optimization problems I and other people face when implementing SPARQL (e.g. in D2RQ).

    On a personal level, I also feel good about it because it validates some of the inconclusive results I got last year, like the SQL translation presented in the report. I knew there were edge cases where my approach didn’t match SPARQL, and I suspected that they wouldn’t occur much in real queries – but now I know for sure, and I know that there are no further edge cases lurking.

    That’s the progress of science, I suppose. I wonder how my time at HP would have gone if I had known the results of this paper a year ago. It would have saved me many weeks of unproductive headaches. But no one knew these things back then. That’s what it feels like to live on the bleeding edge of research … And I like it.

    This entry was posted in General, Semantic Web. Bookmark the permalink.