Preserve search order in GROUP_CONCAT #43

chiarcos · 2020-10-27T21:27:29Z

Requested enhancement: Aggregate over paths in the graph by preserving search order.

In SPARQL 1.1, GROUP_CONCAT aggregates do not preserve any order of elements. In particular, the result of an embedded subquery that provides elements in the correct order will be reordered when the outer query applies a GROUP_CONCAT.

For asserting an order, see w3c/sparql-dev#9.

A related requirement is to preserve search order. This could be emulated once an order can be reliably imposed on aggregates (e.g., by using an additional COUNT aggregate to approximate the length of the search path and then ordering by the counts), but that would be very inefficient compared to simply preserving search order. It would also give incorrect results if multiple search paths exist.
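To make the count-based workaround concrete, here is a minimal Python sketch (not SPARQL, and not any engine's actual behavior). The function names `reachable` and `order_by_counts` and the toy graphs are assumptions for illustration only. Each reachable node is ranked by the number of nodes lying between the start node and itself, which mimics the extra COUNT aggregate.

```python
from collections import defaultdict


def reachable(edges, start):
    """Return all nodes reachable from start via zero or more edges."""
    succ = defaultdict(list)
    for source, target in edges:
        succ[source].append(target)
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in succ[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def order_by_counts(edges, start):
    """Order reachable nodes by an approximate path length from start."""
    nodes = reachable(edges, start)
    # rank(n) = number of nodes m with start ->* m ->* n.
    # This stands in for the additional COUNT aggregate; note the nested
    # reachability computation, which makes it quadratic in the node count.
    rank = {n: sum(1 for m in nodes if n in reachable(edges, m)) for n in nodes}
    # Ties are broken by name only to make the output deterministic.
    return sorted(nodes, key=lambda n: (rank[n], n))


# Hypothetical data: one linear path, and one graph with two search paths.
chain = [("s0", "s1"), ("s1", "s2"), ("s2", "s3")]
branch = [("s0", "a"), ("s0", "b"), ("a", "end"), ("b", "end")]

chain_order = order_by_counts(chain, "s0")
branch_order = order_by_counts(branch, "s0")
```

On the linear chain the counts reproduce the search order. On the branching graph, `a` and `b` receive the same count and both end up in a single merged sequence, although no search path contains both of them.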

From the user perspective, maintaining search order would be fully backward-compatible with the current behavior. From the implementation perspective, certain optimizations may no longer be applicable. If that is the case, I suggest introducing a new aggregate ORDERED_GROUP_CONCAT that otherwise behaves like GROUP_CONCAT.

Example 1: Assume you have an RDF description of states and state transitions (workflows, finite state automata, ...), with every state associated with a particular value. For a given initial state, return all sequences of values associated with the subsequent states in their order of occurrence.

(See here for a [partially successful, but implementation-specific] attempt to model that in SPARQL 1.1.)
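The intended result of Example 1 can be sketched outside SPARQL as follows. This is only an illustration of the desired semantics under assumptions: the function name `value_sequences`, the dictionary-based graph encoding, and the sample states are all hypothetical, and cycles are handled by not revisiting a state that is already on the current path.

```python
def value_sequences(transitions, values, start):
    """Yield the values along every maximal path from start, in search order."""

    def walk(state, path):
        path = path + [state]
        # Assumption: a state already on the current path is not revisited,
        # so cyclic workflows still produce finite sequences.
        successors = [t for t in transitions.get(state, []) if t not in path]
        if not successors:
            yield [values[s] for s in path]
        for nxt in successors:
            yield from walk(nxt, path)

    yield from walk(start, [])


# Hypothetical automaton: q0 -> q1, then q1 branches to q2 and q3.
transitions = {"q0": ["q1"], "q1": ["q2", "q3"], "q2": [], "q3": []}
values = {"q0": "a", "q1": "b", "q2": "c", "q3": "d"}

sequences = [" ".join(seq) for seq in value_sequences(transitions, values, "q0")]
```

Each search path yields its own sequence, with values in the order in which the states were visited. This is the behavior an order-preserving GROUP_CONCAT would need to provide per group.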

Example 2: Search and aggregate over linguistic annotations, e.g., those provided by the NLP Interchange Format.

It is often necessary to return the concatenated string value for a span of words. An approximate solution (that works in many [but not all] cases in Apache Jena) is to iterate over the span using a property path that starts with the first word:

SELECT ?w ?myspan
WHERE {
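  { SELECT ?w (GROUP_CONCAT(?word; separator=" ") AS ?myspan)
    WHERE {
      ?w a nif:Word .
      # ?first is ?w itself or a word in the subtree headed by ?w
      ?first conll:HEAD* ?w .
      # drop every candidate that is preceded by another word of the subtree
      MINUS { [ conll:HEAD* ?w ] nif:nextWord+ ?first }
      # walk forward from the first word, keeping only words of the subtree
      ?first nif:nextWord* [ conll:HEAD* ?w ; conll:WORD ?word ]
    } GROUP BY ?w
  }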
# some stuff in the outer query
}

The SPARQL 1.1 spec does not guarantee that this works; it does not work reliably in Apache Jena, and it is very slow.

Suggestion:

In the inner query, return a concatenated list of strings and a concatenated list of integers (say, from conll:ID).
Provide a custom function that takes the string concatenation and the integer concatenation as arguments and returns a modified string ordered by the integers (with duplicates removed).

Example:

SELECT ?w ?orderedSpan
WHERE {
  { SELECT ?w (GROUP_CONCAT(?word; separator=" ") AS ?myspan)
              (GROUP_CONCAT(?id; separator=" ") AS ?mykeys)
    WHERE {
      ?w a nif:Word .
      [ conll:HEAD* ?w ; conll:WORD ?word ; conll:ID ?id ]
    } GROUP BY ?w
  }
BIND(conll_fn:get-ordered-span(?myspan, ?mykeys) AS ?orderedSpan)
# some stuff in the outer query
}
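The custom function in this example could work roughly as in the following Python sketch. It is not an existing Jena or CoNLL-RDF function; the name `get_ordered_span` merely mirrors the hypothetical `conll_fn:get-ordered-span` above. It assumes that both GROUP_CONCAT results list their elements in the same order (which the SPARQL 1.1 spec does not guarantee either), that words do not contain the separator, and that the keys are integers. Duplicates are removed per key rather than per word, so a word that legitimately occurs twice in a span is kept.

```python
def get_ordered_span(span, keys, separator=" "):
    """Reorder the words in span by their integer keys, one word per key."""
    if not span:
        return ""
    words = span.split(separator)
    ids = [int(key) for key in keys.split(separator)]
    if len(words) != len(ids):
        raise ValueError("span and keys must contain the same number of items")
    # Assumption: the i-th word belongs to the i-th key. Building a dict
    # keeps a single word per key, which removes duplicate bindings.
    word_by_id = dict(zip(ids, words))
    return separator.join(word_by_id[i] for i in sorted(word_by_id))


# Hypothetical aggregate output: unordered, with one duplicated binding.
ordered = get_ordered_span("dog the the barks", "2 1 1 3")
```

With the sample input, `ordered` is `"the dog barks"`. Sorting numerically matters here: comparing the keys as strings would place `10` before `2`.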

chiarcos added the enhancement label Oct 27, 2020

chiarcos changed the title ~~order-sensitive replacement for GROUP_CONCAT~~ Preserve search order in GROUP_CONCAT Nov 7, 2021
