Convert .ttl to Conll-2012 #65
Comments
OK, I had a look and I don't know yet how to achieve that. But I can give you a few pointers, and maybe one of my colleagues will know the full answer already. You can skip using the CoNLLRDFFormatter here, as the output of the Stream Extractor is already in RDF. Looking at https://cemantix.org/conll/2012/data.htm under the relevant section, the columns in CoNLL-2012 are as in the table below. Some of this information is present in the file you started out with; some you may need to convert or substitute.
|
That's very helpful, I will try to use this to create the Conll-2012 file. If I am successful I will post here how to do it. |
Perhaps the specific command is in one of the files in the conll-transform repository (https://github.com/acoli-repo/conll-transform)? My understanding is that conll-transform is more or less conll-rdf but makes it easier to convert between formats (I tried to use it, but I got an error and raised an issue. I didn't get a response, so I tried conll-rdf instead). I will look through conll-transform myself and see if I can find anything. |
Dear Alan, as Leo already pointed out, a simple "conversion" to CoNLL-2012 might not be trivial, since the information required for some of the columns is unavailable in CoNLL-U. I will just add a few comments regarding the table above. Columns 1 & 2 are document-specific. Such info is not encoded in CoNLL-U, except sometimes in comments. Since you will probably need to generate and merge a lot of additional information, you could also look at the following paper on the CoNLL libraries, specifically conll-merge: The ACoLi CoNLL Libraries: Beyond Tab-Separated Values. There we illustrate how to combine annotated corpora with heterogeneous annotations and tokenizations. |
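To make the point about missing columns concrete, here is a minimal Python sketch that bypasses conll-rdf entirely and writes CoNLL-2012-style rows straight from CoNLL-U input. It is an illustration under stated assumptions, not the conll-rdf or conll-transform way of doing it: the function name `conllu_to_conll2012`, the `doc_id` and `part` parameters, and the choice of tab separators are all hypothetical. The column order follows the data description at cemantix.org; every column that CoNLL-U cannot supply (parse bit, predicate lemma, frameset ID, word sense, speaker, named entities, coreference) is filled with its empty placeholder (`*` or `-`), no predicate-argument columns are produced, and the document begin/end marker lines are not emitted.

```python
def conllu_to_conll2012(lines, doc_id="doc", part=0):
    """Hypothetical helper: map CoNLL-U token lines to CoNLL-2012-style rows.

    doc_id and part are supplied by the caller because CoNLL-U does not
    encode them (except sometimes in comments).
    """
    out = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends a sentence; keep a single blank separator.
            if out and out[-1] != "":
                out.append("")
            continue
        if line.startswith("#"):
            # CoNLL-U comment lines are dropped in this sketch.
            continue
        fields = line.split("\t")
        tok_id = fields[0]
        if not tok_id.isdigit():
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        # Prefer the language-specific tag (XPOS); fall back to UPOS.
        pos = fields[4] if fields[4] != "_" else fields[3]
        row = [
            doc_id,                 # 1: document ID (assumed, from caller)
            str(part),              # 2: part number (assumed, from caller)
            str(int(tok_id) - 1),   # 3: word number, counted from 0
            fields[1],              # 4: word form
            pos,                    # 5: part of speech
            "*",                    # 6: parse bit (unavailable)
            "-",                    # 7: predicate lemma (unavailable)
            "-",                    # 8: predicate frameset ID (unavailable)
            "-",                    # 9: word sense (unavailable)
            "-",                    # 10: speaker/author (unavailable)
            "*",                    # 11: named entities (unavailable)
            "-",                    # last: coreference (unavailable)
        ]
        out.append("\t".join(row))
    return out
```

Called as `conllu_to_conll2012(open("irishtimes.conllu", encoding="utf-8"), doc_id="irishtimes")`, this yields one row per token. The result is only structurally CoNLL-2012-like: a tool that actually needs parse, named-entity, or coreference annotations would still require those to be generated and merged in separately.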
@cfaeth Thanks for your prompt reply. Is conll-transform capable of the transformation? I know it relies on conll-rdf. |
@AlanQuille conll-transform relies on the CoNLL-RDF ontology to determine how information is rendered in various CoNLL dialects. You can try to run it, and it will probably write a basic conversion script for you, but it will probably also tell you about the missing info we just described here. Either way, it is a good starting point. |
@AlanQuille I looked into your issue and it seems to be a Windows 10 issue with the Java path, see there. The response of
You can use the script from point 6 to convert your data (note that you might need to escape certain characters), and it will produce the correct format, but columns that cannot be reliably predicted (see warnings above) will just be empty. |
Hi @leogott, thanks again for all your help. As stated in the other issue, I successfully got conll-rdf working and converted my conllu file to a ttl file.
I have one final question: What is the command to convert my .ttl file to Conll-2012? I checked the example shown here and it only shows how to convert back to Conll-U.
The command I used is this:
cat irishtimes.conllu | ./run.sh CoNLLStreamExtractor https://github.com/UniversalDependencies/UD_English# ID WORD LEMMA UPOS POS FEAT HEAD EDGE DEPS MISC | ./run.sh CoNLLRDFFormatter -rdf > irishtimes.ttl