CoNLLStreamExtractor / CoNLL2RDF produces unreadable ttl with SRL-Args #45

Open
glaserL opened this issue Nov 21, 2020 · 0 comments
glaserL commented Nov 21, 2020

Resolving trailing SRL annotations in CoNLL sometimes produces exceptions due to unreadable TMP nodes.

The CoNLL2RDF.java class (in the conll2ttl function) produces temporary nodes of the form "_TMP_SRL_{ID}" that are replaced by the actual predicate nodes later on. However, if they can't be replaced, unusable ttl is written out, because "_TMP_SRL_2", for example, isn't a valid node name.

A potential (hacky?) fix I found is to change "_TMP_SRL" to ":_TMP_SRL" (i.e. "_TMP_" to ":_TMP_" in the code) at the following two places:

// replace each temporary placeholder with the i-th predicate
if(col2field.get(col2field.size()-1).toLowerCase().matches(".*args")) {
    for(int i = 0; i<predicates.size(); i++) {
        sentence=sentence.replaceAll("_TMP_"+col2field.get(col2field.size()-1).replaceFirst("[\\-_]*[Aa][rR][gG][sS]$","_"+i),predicates.get(i));
    }
}

// emit an argument triple whose subject is the placeholder for this argument column
argTriples.add(
    "_TMP_"+col2field.get(col2field.size()-1).replaceFirst("[\\-_]*[Aa][rR][gG][sS]$","_"+(i+1-col2field.size()))+
    " conll:"+field[i].trim()+
    " "+URI);

This makes them acceptable node identifiers, so one could fix them later with SPARQL queries?
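To illustrate the idea, here is a minimal standalone sketch, not the actual CoNLL2RDF code. The class `TmpNodeSketch` and its methods are hypothetical names; only the regex is taken from the snippets above. It builds the placeholder with the proposed ":" prefix and replaces every placeholder for which a predicate is known. Anything left over stays as ":_TMP_SRL_n", which is a syntactically valid prefixed name in Turtle.

```java
import java.util.List;

final class TmpNodeSketch {
    // Hypothetical helper: placeholder for argument column `index`,
    // using the same regex as in conll2ttl plus the proposed ':' prefix.
    static String placeholder(String argsField, int index) {
        return ":_TMP_" + argsField.replaceFirst("[\\-_]*[Aa][rR][gG][sS]$", "_" + index);
    }

    // Replaces each placeholder that has a known predicate; others are left untouched.
    // Uses a literal replace including the trailing space, so that e.g.
    // ":_TMP_SRL_1" does not accidentally match the start of ":_TMP_SRL_10".
    static String resolve(String ttl, String argsField, List<String> predicates) {
        for (int i = 0; i < predicates.size(); i++) {
            ttl = ttl.replace(placeholder(argsField, i) + " ", predicates.get(i) + " ");
        }
        return ttl;
    }

    public static void main(String[] args) {
        String ttl = ":_TMP_SRL_0 conll:ARG0 :s1_7.\n:_TMP_SRL_2 conll:V :s1_24.\n";
        System.out.print(resolve(ttl, "SRL-ARGs", List.of(":s1_9", ":s1_24")));
    }
}
```

With two predicates, the placeholder with index 2 has nothing to be replaced with and stays in the output, but the result should still be parseable, so it can be cleaned up afterwards.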

Here is a (somewhat) minimal example where this happens. I couldn't find a fix for the annotation, so I can't sensibly minimize it further. The call was:
cat broken.conll | ./run.sh CoNLLStreamExtractor http://example.com ID WORD FEATS HEAD EDGE SRL SRL-ARGs

1	Although	_	6	mark	_	_	_	ARGM-ADV
2	data	_	6	nsubj	_	_	_	ARGM-ADV
3	in	_	2	prep	_	_	_	ARGM-ADV
4	this	_	5	det	_	_	_	ARGM-ADV
5	project	_	3	pobj	_	_	_	ARGM-ADV
6	are	_	24	advcl	_	_	_	ARGM-ADV
7	de	_	8	dep	_	ARG0	ARGM-MNR	ARGM-ADV
8	-	_	9	dep	-	V	_	ARGM-ADV
9	identified	_	12	amod	identified	_	V	ARGM-ADV
10	,	_	12	punct	_	_	_	_
11	certain	_	12	amod	_	_	_	_
12	information	_	24	nsubjpass	_	_	_	_
13	such	_	14	amod	_	_	_	_
14	as	_	12	prep	_	_	_	_
15	the	_	16	det	_	_	_	_
16	number	_	14	pobj	_	_	_	_
17	of	_	16	prep	_	_	_	_
18	ED	_	19	compound	_	_	_	_
19	visits	_	17	pobj	_	_	_	_
20	by	_	19	prep	_	_	_	_
21	zip	_	22	compound	_	_	_	_
22	code	_	20	pobj	_	_	_	_
23	were	_	24	auxpass	_	_	_	_
24	considered	_	0	ROOT	considered	_	_	V
25	proprietary	_	26	amod	_	_	_	ARG1
26	information	_	24	oprd	_	_	_	ARG1
27	by	_	26	prep	_	_	_	ARG0
28	some	_	30	det	_	_	_	ARG0
29	health	_	30	compound	_	_	_	ARG0
30	systems	_	27	pobj	_	_	_	ARG0
31	.	_	24	punct	_	_	_	_
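One way to see which placeholder will be left over is to compare the number of predicates in the SRL column with the number of trailing argument columns. The sketch below is a hypothetical check, not part of CoNLL-RDF. It assumes that both "_" and "-" count as empty values, which is what the generated Turtle in the log below suggests (token 8 ends up without conll:WORD and conll:SRL).

```java
import java.util.ArrayList;
import java.util.List;

final class SrlColumnCheck {
    // Assumption: "_" and "-" are both treated as empty annotations.
    static boolean isEmpty(String value) {
        return value.equals("_") || value.equals("-");
    }

    // Returns the 0-based indices of argument columns that have no matching predicate.
    // `srlCol` is the 0-based index of the SRL column; every column after it
    // is taken to be an argument column.
    static List<Integer> unresolvedArgColumns(List<String[]> rows, int srlCol) {
        int predicates = 0;
        int maxCols = 0;
        for (String[] row : rows) {
            if (row.length > srlCol && !isEmpty(row[srlCol])) {
                predicates++;
            }
            maxCols = Math.max(maxCols, row.length);
        }
        int argCols = Math.max(0, maxCols - (srlCol + 1));
        List<Integer> unresolved = new ArrayList<>();
        for (int i = predicates; i < argCols; i++) {
            unresolved.add(i);
        }
        return unresolved;
    }
}
```

For the sentence above (SRL at column index 5), this counts two predicates ("identified" and "considered") but three argument columns, so index 2 is reported, which matches the _TMP_SRL_2 in the error output.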

And the full error:

18:41:58 INFO  CoNLLStreamExtractor :: synopsis: CoNLLStreamExtractor baseURI FIELD1[.. FIELDn] [-u SPARQL_UPDATE1..m] [-s SPARQL_SELECT]
	baseURI       CoNLL base URI, cf. CoNLL2RDF
	FIELDi        CoNLL field label, cf. CoNLL2RDF
	SPARQL_UPDATE SPARQL UPDATE (DELETE/INSERT) query, either literally or its location (file/uri)
	              can be followed by an optional integer in {}-parentheses = number of repetitions
	              The SPARQL_UPDATE parameter is DEPRECATED - please use CoNLLRDFUpdater instead!
	SPARQL_SELECT SPARQL SELECT statement to produce TSV output
	reads CoNLL from stdin, splits sentences, creates CoNLL RDF, applies SPARQL queries
18:41:58 INFO  CoNLLStreamExtractor :: running CoNLLStreamExtractor
18:41:58 INFO  CoNLLStreamExtractor :: 	baseURI:       http://example.com
18:41:58 INFO  CoNLLStreamExtractor :: 	CoNLL columns: [ID, WORD, FEATS, HEAD, EDGE, SRL, SRL-ARGs]
18:41:58 INFO  CoNLLStreamExtractor :: 	SPARQL update: []
18:41:58 INFO  CoNLLStreamExtractor :: 	SPARQL select: null
18:41:58 INFO  CoNLLStreamExtractor :: read SPARQL ..
18:41:58 INFO  CoNLLStreamExtractor :: .. ok
18:41:58 INFO  CoNLLStreamExtractor :: process input ..
18:41:59 ERROR riot                 :: [line: 46, col: 1 ] Out of place: [UNDERSCORE]
org.apache.jena.riot.RiotException: [line: 46, col: 1 ] Out of place: [UNDERSCORE]
	at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:147)
	at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:148)
	at org.apache.jena.riot.lang.LangEngine.exceptionDirect(LangEngine.java:143)
	at org.apache.jena.riot.lang.LangEngine.exception(LangEngine.java:137)
	at org.apache.jena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:239)
	at org.apache.jena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:46)
	at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:91)
	at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:41)
	at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:206)
	at org.apache.jena.riot.RDFParser.read(RDFParser.java:338)
	at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:324)
	at org.apache.jena.riot.RDFParser.parse(RDFParser.java:273)
	at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:498)
	at org.apache.jena.riot.RDFDataMgr.parseFromReader(RDFDataMgr.java:880)
	at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:298)
	at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:283)
	at org.apache.jena.riot.adapters.RDFReaderRIOT.read(RDFReaderRIOT.java:62)
	at org.apache.jena.rdf.model.impl.ModelCom.read(ModelCom.java:298)
	at org.acoli.conll.rdf.Format2RDF.conll2model(Format2RDF.java:235)
	at org.acoli.conll.rdf.CoNLL2RDF.conll2model(CoNLL2RDF.java:39)
	at org.acoli.conll.rdf.CoNLLStreamExtractor.processSentenceStream(CoNLLStreamExtractor.java:139)
	at org.acoli.conll.rdf.CoNLLStreamExtractor.main(CoNLLStreamExtractor.java:405)
18:41:59 INFO  Format2RDF           :: while processing the following input:
<code>
PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>
PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#>
PREFIX x: <http://purl.org/acoli/conll-rdf/xml#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX terms: <http://purl.org/acoli/open-ie/>
PREFIX powla: <http://purl.org/powla/powla.owl#>
PREFIX : <http://example.com>
:s1_0 a nif:Sentence.
:s1_1 a nif:Word; conll:ID "1"; conll:WORD "Although"; conll:HEAD :s1_6; conll:EDGE "mark"; nif:nextWord :s1_2.
:s1_2 a nif:Word; conll:ID "2"; conll:WORD "data"; conll:HEAD :s1_6; conll:EDGE "nsubj"; nif:nextWord :s1_3.
:s1_3 a nif:Word; conll:ID "3"; conll:WORD "in"; conll:HEAD :s1_2; conll:EDGE "prep"; nif:nextWord :s1_4.
:s1_4 a nif:Word; conll:ID "4"; conll:WORD "this"; conll:HEAD :s1_5; conll:EDGE "det"; nif:nextWord :s1_5.
:s1_5 a nif:Word; conll:ID "5"; conll:WORD "project"; conll:HEAD :s1_3; conll:EDGE "pobj"; nif:nextWord :s1_6.
:s1_6 a nif:Word; conll:ID "6"; conll:WORD "are"; conll:HEAD :s1_24; conll:EDGE "advcl"; nif:nextWord :s1_7.
:s1_7 a nif:Word; conll:ID "7"; conll:WORD "de"; conll:HEAD :s1_8; conll:EDGE "dep"; nif:nextWord :s1_8.
:s1_8 a nif:Word; conll:ID "8"; conll:HEAD :s1_9; conll:EDGE "dep"; nif:nextWord :s1_9.
:s1_9 a nif:Word; conll:ID "9"; conll:WORD "identified"; conll:HEAD :s1_12; conll:EDGE "amod"; conll:SRL "identified"; nif:nextWord :s1_10.
:s1_10 a nif:Word; conll:ID "10"; conll:WORD ","; conll:HEAD :s1_12; conll:EDGE "punct"; nif:nextWord :s1_11.
:s1_11 a nif:Word; conll:ID "11"; conll:WORD "certain"; conll:HEAD :s1_12; conll:EDGE "amod"; nif:nextWord :s1_12.
:s1_12 a nif:Word; conll:ID "12"; conll:WORD "information"; conll:HEAD :s1_24; conll:EDGE "nsubjpass"; nif:nextWord :s1_13.
:s1_13 a nif:Word; conll:ID "13"; conll:WORD "such"; conll:HEAD :s1_14; conll:EDGE "amod"; nif:nextWord :s1_14.
:s1_14 a nif:Word; conll:ID "14"; conll:WORD "as"; conll:HEAD :s1_12; conll:EDGE "prep"; nif:nextWord :s1_15.
:s1_15 a nif:Word; conll:ID "15"; conll:WORD "the"; conll:HEAD :s1_16; conll:EDGE "det"; nif:nextWord :s1_16.
:s1_16 a nif:Word; conll:ID "16"; conll:WORD "number"; conll:HEAD :s1_14; conll:EDGE "pobj"; nif:nextWord :s1_17.
:s1_17 a nif:Word; conll:ID "17"; conll:WORD "of"; conll:HEAD :s1_16; conll:EDGE "prep"; nif:nextWord :s1_18.
:s1_18 a nif:Word; conll:ID "18"; conll:WORD "ED"; conll:HEAD :s1_19; conll:EDGE "compound"; nif:nextWord :s1_19.
:s1_19 a nif:Word; conll:ID "19"; conll:WORD "visits"; conll:HEAD :s1_17; conll:EDGE "pobj"; nif:nextWord :s1_20.
:s1_20 a nif:Word; conll:ID "20"; conll:WORD "by"; conll:HEAD :s1_19; conll:EDGE "prep"; nif:nextWord :s1_21.
:s1_21 a nif:Word; conll:ID "21"; conll:WORD "zip"; conll:HEAD :s1_22; conll:EDGE "compound"; nif:nextWord :s1_22.
:s1_22 a nif:Word; conll:ID "22"; conll:WORD "code"; conll:HEAD :s1_20; conll:EDGE "pobj"; nif:nextWord :s1_23.
:s1_23 a nif:Word; conll:ID "23"; conll:WORD "were"; conll:HEAD :s1_24; conll:EDGE "auxpass"; nif:nextWord :s1_24.
:s1_24 a nif:Word; conll:ID "24"; conll:WORD "considered"; conll:HEAD :s1_0; conll:EDGE "ROOT"; conll:SRL "considered"; nif:nextWord :s1_25.
:s1_25 a nif:Word; conll:ID "25"; conll:WORD "proprietary"; conll:HEAD :s1_26; conll:EDGE "amod"; nif:nextWord :s1_26.
:s1_26 a nif:Word; conll:ID "26"; conll:WORD "information"; conll:HEAD :s1_24; conll:EDGE "oprd"; nif:nextWord :s1_27.
:s1_27 a nif:Word; conll:ID "27"; conll:WORD "by"; conll:HEAD :s1_26; conll:EDGE "prep"; nif:nextWord :s1_28.
:s1_28 a nif:Word; conll:ID "28"; conll:WORD "some"; conll:HEAD :s1_30; conll:EDGE "det"; nif:nextWord :s1_29.
:s1_29 a nif:Word; conll:ID "29"; conll:WORD "health"; conll:HEAD :s1_30; conll:EDGE "compound"; nif:nextWord :s1_30.
:s1_30 a nif:Word; conll:ID "30"; conll:WORD "systems"; conll:HEAD :s1_27; conll:EDGE "pobj"; nif:nextWord :s1_31.
:s1_31 a nif:Word; conll:ID "31"; conll:WORD "."; conll:HEAD :s1_24; conll:EDGE "punct".
:s1_9 conll:ARG0 :s1_7.
:s1_9 conll:V :s1_8.
:s1_24 conll:ARGM-MNR :s1_7.
:s1_24 conll:V :s1_9.
_TMP_SRL_2 conll:ARG0 :s1_27.
_TMP_SRL_2 conll:ARG0 :s1_28.
_TMP_SRL_2 conll:ARG0 :s1_29.
_TMP_SRL_2 conll:ARG0 :s1_30.
_TMP_SRL_2 conll:ARG1 :s1_25.
_TMP_SRL_2 conll:ARG1 :s1_26.
_TMP_SRL_2 conll:ARGM-ADV :s1_1.
_TMP_SRL_2 conll:ARGM-ADV :s1_2.
_TMP_SRL_2 conll:ARGM-ADV :s1_3.
_TMP_SRL_2 conll:ARGM-ADV :s1_4.
_TMP_SRL_2 conll:ARGM-ADV :s1_5.
_TMP_SRL_2 conll:ARGM-ADV :s1_6.
_TMP_SRL_2 conll:ARGM-ADV :s1_7.
_TMP_SRL_2 conll:ARGM-ADV :s1_8.
_TMP_SRL_2 conll:ARGM-ADV :s1_9.
_TMP_SRL_2 conll:V :s1_24.

</code>
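As a stopgap, the generated Turtle could be scanned for leftover placeholders before it is handed to the parser, so that a clearer message can be reported than "Out of place: [UNDERSCORE]". This is only a sketch with hypothetical names (`LeftoverPlaceholders`, `find`); it looks for lines whose subject starts with "_TMP_", with or without the ":" prefix discussed above.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LeftoverPlaceholders {
    // Matches a line that starts with an optional ':' followed by a _TMP_ name.
    private static final Pattern TMP_SUBJECT = Pattern.compile("(?m)^:?(_TMP_\\S+)\\s");

    // Returns the distinct placeholder names found in subject position, in order of appearance.
    static Set<String> find(String ttl) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = TMP_SUBJECT.matcher(ttl);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String ttl = ":s1_24 conll:V :s1_9.\n_TMP_SRL_2 conll:ARG0 :s1_27.\n_TMP_SRL_2 conll:V :s1_24.\n";
        System.out.println(find(ttl));
    }
}
```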
glaserL added the bug label Nov 21, 2020