Convert .ttl to Conll-2012 #65

Open
AlanQuille opened this issue Aug 4, 2021 · 7 comments

AlanQuille commented Aug 4, 2021

Hi @leogott, thanks again for all your help. As stated in the other issue, I successfully got conll-rdf working and converted my CoNLL-U file to a .ttl file.

I have one final question: What is the command to convert my .ttl file to Conll-2012? I checked the example shown here and it only shows how to convert back to Conll-U.

The command I used is this:

cat irishtimes.conllu | ./run.sh CoNLLStreamExtractor https://github.com/UniversalDependencies/UD_English# ID WORD LEMMA UPOS POS FEAT HEAD EDGE DEPS MISC | ./run.sh CoNLLRDFFormatter -rdf > irishtimes.ttl

AlanQuille changed the title from "Convert Conll-U to Conll-2012" to "Convert .ttl to Conll-2012" on Aug 4, 2021
leogott (Contributor) commented Aug 4, 2021

ok, I had a look and I don't know yet how to achieve that. But I can give you a few pointers and maybe one of my colleagues will know the full answer already.

You can skip using the CoNLLRDFFormatter here, as the output of the Stream Extractor is already in rdf.
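A minimal sketch of the shortened pipeline this suggests, reusing the base URI and column names from the original command. The wrapper name `conllu_to_ttl` and the `CONLL_RDF_RUN` override are illustrative assumptions, not part of conll-rdf; only `./run.sh` and `CoNLLStreamExtractor` come from the command above.

```shell
# Hypothetical wrapper (name and CONLL_RDF_RUN are assumptions for illustration).
# CONLL_RDF_RUN points at conll-rdf's run.sh; it defaults to ./run.sh.
conllu_to_ttl() {
  local in_file="$1"
  local out_file="$2"
  local run="${CONLL_RDF_RUN:-./run.sh}"
  # Same base URI and column names as in the original command,
  # but without the trailing CoNLLRDFFormatter step.
  "$run" CoNLLStreamExtractor 'https://github.com/UniversalDependencies/UD_English#' \
    ID WORD LEMMA UPOS POS FEAT HEAD EDGE DEPS MISC < "$in_file" > "$out_file"
}
```

Usage would then be `conllu_to_ttl irishtimes.conllu irishtimes.ttl`, run from the conll-rdf directory.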

Looking at https://cemantix.org/conll/2012/data.htm under the relevant section, the columns in CoNLL-2012 are as in the table below. Some of this information is present in the file you started out with; some you may need to convert or substitute.

| nr | name | description |
|---|---|---|
| 1 | Document ID | probably insert using sparql |
| 2 | Part number | Insert as 0 using sparql |
| 3 | Word number | ID? |
| 4 | Word itself | WORD? |
| 5 | Part-of-Speech | POS, probably. I'm not sure if conll-u and conll-2012 use the exact same POS tags or if there are differences |
| 6 | Parse bit | I don't know how to do this one: "This is the bracketed structure broken before the first open parenthesis in the parse, and the word/part-of-speech leaf replaced with a *. The full parse can be created by substituting the asterix with the "([pos] [word])" string (or leaf) and concatenating the items in the rows of that column." |
| 7 | Predicate lemma | "The predicate lemma is mentioned for the rows for which we have semantic role information. All other rows are marked with a '-'" |
| 8 | Predicate Frameset ID | "This is the PropBank frameset ID of the predicate in Column 7." |
| 9 | Word sense | "This is the word sense of the word in Column 3." |
| 10 | Speaker/Author | Usually "speaker1" |
| 11 | Named Entities | "These columns identifies the spans representing various named entities." |
| 12:N | Predicate Arguments | "There is one column each of predicate argument structure information for the predicate mentioned in Column 7." |
| N | Coreference | "Coreference chain information encoded in a parenthesis structure." |
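
To make the column 6 description concrete, here is a small sketch of the reverse direction: rebuilding the full bracketed parse from existing parse bits. It is plain awk, not a conll-rdf feature. The function name `rebuild_parse` is hypothetical, and it assumes whitespace-separated CoNLL-2012 rows with the word in column 4, the POS tag in column 5 and the parse bit in column 6, with a blank line between sentences.

```shell
# Hypothetical helper: reads CoNLL-2012 rows on stdin, prints one bracketed parse per sentence.
rebuild_parse() {
  awk '
    NF == 0 { if (tree != "") print tree; tree = ""; next }   # blank line ends a sentence
    /^#/    { next }                                           # skip #begin/#end document lines
    {
      bit = $6
      star = index(bit, "*")
      # Substitute the asterisk with the "(pos word)" leaf, as the format description says.
      if (star > 0)
        bit = substr(bit, 1, star - 1) "(" $5 " " $4 ")" substr(bit, star + 1)
      tree = tree bit                                          # concatenate the bits row by row
    }
    END { if (tree != "") print tree }
  '
}
```

For the four rows `The DT (TOP(S(NP*`, `cat NN *)`, `sleeps VBZ (VP*)` and `. . *))` (word, POS, parse bit), this prints `(TOP(S(NP(DT The)(NN cat))(VP(VBZ sleeps))(. .)))`. Going the other way, from CoNLL-U to parse bits, needs a constituency parse that the dependency annotation does not provide.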

AlanQuille (Author) commented

That's very helpful, I will try to use this to create the Conll-2012 file. If I am successful I will post here how to do it.

AlanQuille (Author) commented

Perhaps the specific command is in one of the files of the conll-transform repository (https://github.com/acoli-repo/conll-transform)? My understanding is that conll-transform is more or less conll-rdf, but makes it easier to convert between formats. (I tried to use it, but I hit an error and raised an issue. I didn't get a response, so I tried conll-rdf instead.) I will look through conll-transform myself and see if I can find anything.

cfaeth (Collaborator) commented Aug 4, 2021

Dear Alan,

as Leo already pointed out, a simple "conversion" to CoNLL-2012 might not be trivial, since the information required for some of the columns is unavailable in CoNLL-U. I will just add a few comments regarding the table above.

Columns 1 & 2 are document-specific. Such info is not encoded in CoNLL-U, except sometimes in comments.
3 = ID
4 = WORD
5 = UPOS. Should you need a different tagset / annotation scheme, you could use a CoNLLRDFUpdater workflow to transform it via OLiA. There is an example in this repo: examples/link-ud.sh. In case you need it, I could elaborate on how this works.
6 = Parse bit. Here it gets extremely tricky. Since CoNLL-U does not contain a parse tree but instead relies on dependency relations, this information is missing from your input data. Perhaps you have a bracketing notation at hand, or can create one, e.g. with the Stanford Parser, which supports both schemes. In that case, you could look into the following paper on how these can be processed with CoNLL-RDF: "A Tree Extension for CoNLL-RDF".
The rest of the columns go further into semantics. We do also support predicate arguments 12:N during the conversion from/to CoNLL files, but we cannot infer this information directly from syntactic annotations such as CoNLL-U.
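
As an illustration of this mapping only, the following sketch fills the columns that CoNLL-U can supply and marks everything else as unknown. It is a plain awk pass over tab-separated CoNLL-U, not the conll-rdf route. The function name `conllu_to_conll2012_skeleton`, the caller-supplied document ID, the fixed part number 0 and the `-` placeholders are all assumptions made for the example.

```shell
# Hypothetical helper: CoNLL-U on stdin, CoNLL-2012-shaped skeleton on stdout.
# $1 is a document ID chosen by the caller (CoNLL-U does not encode one).
conllu_to_conll2012_skeleton() {
  local doc_id="${1:-unknown_doc}"
  awk -F '\t' -v OFS='\t' -v doc="$doc_id" '
    /^#/        { next }             # CoNLL-U comment lines
    NF == 0     { print ""; next }   # keep blank lines as sentence breaks
    $1 ~ /[-.]/ { next }             # skip multiword-token ranges and empty nodes
    {
      # 1 document ID, 2 part number, 3 ID, 4 WORD, 5 UPOS;
      # columns 6-11 and the final coreference column are "-" placeholders,
      # and no predicate-argument columns are emitted.
      print doc, 0, $1, $2, $4, "-", "-", "-", "-", "-", "-", "-"
    }
  '
}
```

A call such as `conllu_to_conll2012_skeleton irishtimes < irishtimes.conllu` yields rows with the right number of columns, but the placeholder columns still have to be produced by other tools and merged in.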

Since you will probably need to generate and merge a lot of additional information, you could also look at the following paper on the CoNLL libraries, specifically conll-merge: "The ACoLi CoNLL Libraries: Beyond Tab-Separated Values". There we illustrate how to combine annotated corpora with heterogeneous annotations and tokenizations.

AlanQuille (Author) commented

@cfaeth Thanks for your prompt reply. Is conll-transform capable of the transformation? I know it relies on conll-rdf.

cfaeth (Collaborator) commented Aug 4, 2021

@AlanQuille conll-transform relies on the CoNLL-RDF ontology to determine how information is rendered in various CoNLL dialects. You can try to run it, and it will probably write a basic conversion script for you, but it will probably also tell you about the missing info we just described here.

But it is always a good starting point.

chiarcos (Contributor) commented Oct 9, 2021

@AlanQuille I looked into your issue and it seems to be a Windows 10 issue with the Java path, see there.

The response of ./transform.sh from CoNLL-Transform is

building bash script

1. configure preprocessing

2. configure extraction

3. configure update
approximative specialization: FORM => WORD
generalization: XPOS => POS
warning: no mapping for target format property DOCUMENT_ID
warning: no mapping for target format property PART_NUMBER
warning: no mapping for target format property PARSE
warning: no mapping for target format property PRED_LEMMA
warning: no mapping for target format property PRED_FRAMESET
warning: no mapping for target format property WORD_SENSE
warning: no mapping for target format property SPEAKER
warning: no mapping for target format property NER
warning: no mapping for target format property arguments
warning: no mapping for target format property COREF
warning: no mapping for source format property LEMMA
warning: no mapping for source format property UPOS
warning: no mapping for source format property FEATS
warning: no mapping for source format property HEAD
warning: no mapping for source format property EDGE
warning: no mapping for source format property DEPS
warning: no mapping for source format property MISC
{6=bracketEncoding}

4. configure formatter

5. configure postprocessing

6. writing script
./run.sh CoNLLStreamExtractor # ID FORM LEMMA UPOS XPOS FEATS HEAD EDGE DEPS MISC | \
./run.sh CoNLLRDFUpdater -custom -updates PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#>
INSERT { ?a conll:WORD ?b } WHERE { ?a conll:FORM ?b};
INSERT { ?a conll:POS ?b } WHERE { ?a conll:XPOS ?b}; | \
./run.sh CoNLLRDFFormatter -conll DOCUMENT_ID PART_NUMBER ID WORD POS PARSE PRED_LEMMA PRED_FRAMESET WORD_SENSE SPEAKER NER arguments COREF

You can use the script from point 6 to convert your data (note that you might need to escape certain characters), and it will produce the correct format, but columns that cannot be reliably predicted (see warnings above) will just be empty.
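
As a sketch of what that escaping might look like in a POSIX-style shell: an unquoted `#` starts a comment, and `<`, `>`, `{`, `}`, `?` and `;` in the SPARQL text are all shell-special, so both need quoting. The wrapper name `conllu_to_conll2012` and the `CONLL_RDF_RUN` override are hypothetical. Passing the whole update as one literal string argument is an assumption based on the generated script above, so check it against your conll-rdf version.

```shell
# Hypothetical wrapper around the generated script; reads CoNLL-U on stdin.
conllu_to_conll2012() {
  local run="${CONLL_RDF_RUN:-./run.sh}"
  # Single quotes keep the shell from interpreting <, >, {, }, ? and ; in the SPARQL text.
  local updates='PREFIX conll: <http://ufal.mff.cuni.cz/conll2009-st/task-description.html#>
INSERT { ?a conll:WORD ?b } WHERE { ?a conll:FORM ?b };
INSERT { ?a conll:POS ?b } WHERE { ?a conll:XPOS ?b }'
  # The base URI "#" is quoted; unquoted, it would comment out the column list.
  "$run" CoNLLStreamExtractor '#' ID FORM LEMMA UPOS XPOS FEATS HEAD EDGE DEPS MISC |
    "$run" CoNLLRDFUpdater -custom -updates "$updates" |
    "$run" CoNLLRDFFormatter -conll DOCUMENT_ID PART_NUMBER ID WORD POS PARSE \
      PRED_LEMMA PRED_FRAMESET WORD_SENSE SPEAKER NER arguments COREF
}
```

It would be called as `conllu_to_conll2012 < irishtimes.conllu > irishtimes.conll` from the conll-rdf directory.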
