Preserve search order in GROUP_CONCAT #43

Open
chiarcos opened this issue Oct 27, 2020 · 0 comments

chiarcos commented Oct 27, 2020

Requested enhancement: Aggregate over paths in the graph by preserving search order.

In SPARQL 1.1, GROUP_CONCAT aggregates do not preserve any order of elements. In particular, the result of an embedded subquery that provides elements in the correct order will be reordered when the outer query applies a GROUP_CONCAT.

For asserting an order, see w3c/sparql-dev#9.

A related requirement is to preserve search order. This could be solved once an order can be reliably asserted for aggregates (e.g., by using yet another COUNT aggregate to approximate the length of the search path, and then ordering by the counts), but that would be very inefficient compared to simply preserving the search order. It would also give incorrect results when several different search paths exist.

From the user perspective, maintaining search order would be fully backward-compatible with the current behavior. From the implementation perspective, certain optimizations may no longer be applicable. If that is the case, I suggest introducing a new aggregate ORDERED_GROUP_CONCAT that otherwise behaves like GROUP_CONCAT.

Example 1: Assume you have an RDF description of states and state transitions (workflows, finite state automata, ...), with every state associated with a particular value. For a given initial state, return all sequences of the values associated with the subsequent states, in their order of occurrence.

(See here for a [partially successful, but implementation-specific] attempt to model that in SPARQL 1.1.)
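To make the expected result of Example 1 concrete, here is a minimal sketch outside of SPARQL. It assumes the transitions and values have already been loaded into plain Python dictionaries; the names value_sequences, transitions and values are hypothetical and only serve as illustration. It also assumes that a path stops when it would revisit a state, which the example above does not specify.

```python
from typing import Dict, List, Set


def value_sequences(start: str,
                    transitions: Dict[str, List[str]],
                    values: Dict[str, str],
                    separator: str = " ") -> List[str]:
    """Return one concatenated value string per maximal cycle-free path from start."""
    results: List[str] = []

    def walk(state: str, seen: Set[str], acc: List[str]) -> None:
        # Append the value of the current state in order of occurrence.
        acc = acc + [values[state]]
        # Only follow transitions to states not yet visited on this path
        # (assumption: revisiting a state ends the path).
        successors = [s for s in transitions.get(state, []) if s not in seen]
        if not successors:
            results.append(separator.join(acc))
            return
        for nxt in successors:
            walk(nxt, seen | {nxt}, acc)

    walk(start, {start}, [])
    return results


# Hypothetical toy workflow: s0 branches into s1 and s2, both lead to s3.
transitions = {"s0": ["s1", "s2"], "s1": ["s3"], "s2": ["s3"]}
values = {"s0": "a", "s1": "b", "s2": "c", "s3": "d"}
sequences = value_sequences("s0", transitions, values)
```

An order-preserving GROUP_CONCAT would be expected to yield one such string per path, with the values in traversal order rather than in an arbitrary order.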

Example 2: Search and aggregate over linguistic annotations, e.g., those provided by the NLP Interchange Format.

It is often necessary to return the concatenated string value for a span of words. An approximate solution (that works in many [but not all] cases in Apache Jena) is to iterate over the span using a property path that starts with the first word:

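SELECT ?w ?myspan
WHERE {
  { SELECT ?w (GROUP_CONCAT(?word; separator=" ") AS ?myspan)
    WHERE {
      ?w a nif:Word .
      # ?first is any word whose chain of heads leads to ?w
      ?first conll:HEAD* ?w .
      # keep only the leftmost such word: drop ?first if another word under ?w precedes it
      MINUS { [conll:HEAD* ?w] nif:nextWord+ ?first }
      # walk rightwards from ?first and collect the words that are headed by ?w
      ?first nif:nextWord* [ conll:HEAD* ?w; conll:WORD ?word ]
    } GROUP BY ?w
  }
  # some stuff in the outer query
}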

This is not guaranteed to work by the SPARQL 1.1 spec, it does not work 100% in Apache Jena, and it is very slow.

Suggestion:

  • In the inner query, return a concatenated list of strings and a concatenated list of integers (say, from conll:ID).
  • Provide a custom function that takes the string concatenation and the integer concatenation as arguments and returns the string reordered by the integers (with duplicates removed).

Example:

SELECT ?w ?orderedSpan
WHERE {
  { SELECT ?w (GROUP_CONCAT(?word; separator=" ") AS ?myspan) (GROUP_CONCAT(?id; separator=" ") AS ?mykeys)
    WHERE {
      ?w a nif:Word .
      [ conll:HEAD* ?w; conll:WORD ?word; conll:ID ?id ]
    } GROUP BY ?w
  }
  BIND(conll_fn:get-ordered-span(?myspan, ?mykeys) AS ?orderedSpan)
  # some stuff in the outer query
}
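
The logic of the suggested custom function could look roughly like the following sketch. It is written in Python only to illustrate the intended behavior, not as an actual Jena extension; the name get_ordered_span mirrors conll_fn:get-ordered-span from the query above, and the signature is an assumption. It further assumes that both GROUP_CONCAT results list their elements in the same (arbitrary) order, that the keys are integers, and that no word contains the separator.

```python
def get_ordered_span(span: str, keys: str, separator: str = " ") -> str:
    """Reorder the words in span by the integer keys at the same positions.

    Duplicate (key, word) pairs are removed, since the same word may be
    reached through more than one path.
    """
    words = span.split(separator) if span else []
    ids = [int(k) for k in keys.split(separator)] if keys else []
    if len(words) != len(ids):
        # The two concatenations are assumed to be aligned element by element.
        raise ValueError("span and keys must contain the same number of elements")
    # Pair each word with its key, drop duplicates, then sort by key.
    unique_pairs = set(zip(ids, words))
    return separator.join(word for _, word in sorted(unique_pairs))


# Hypothetical input: words arrive in arbitrary order, "the" was matched twice.
ordered = get_ordered_span("dog the big the", "3 1 2 1")
```

Sorting happens outside the aggregate here, so the result no longer depends on the order in which the engine produced the solutions. The price is that words containing the separator would break the alignment, which is one more reason an order-preserving aggregate would be preferable.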

chiarcos changed the title from "order-sensitive replacement for GROUP_CONCAT" to "Preserve search order in GROUP_CONCAT" on Nov 7, 2021