Annotated Reference Corpus of Scottish Gaelic - Simplified (ARCOSG-S)
ARCOSG-S is a representative, part-of-speech tagged corpus of Scottish Gaelic, divided into 8 registers (4 spoken, 4 written) of approximately 10,000 words each. The corpus is distributed as individual txt files. It differs from the Annotated Reference Corpus of Scottish Gaelic (ARCOSG) in that it uses less complex tags. For instance, common nouns are tagged in ARCOSG-S simply as 'Nc', whereas ARCOSG also encodes number, gender and case (e.g. 'Ncsmn'). The ARCOSG-S tags were converted automatically from the ARCOSG tags by a Python script driven by a mapping file. While the ARCOSG tagset has 246 tags, the ARCOSG-S tagset has 41.
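The tag conversion described above can be pictured with a minimal Python sketch. This is not the script used to build ARCOSG-S: the dictionary `TAG_MAP`, the function `simplify_tag`, and the choice to pass unmapped tags through unchanged are all assumptions made for illustration. Only the 'Ncsmn' to 'Nc' pair is taken from the description.

```python
# Hypothetical excerpt of a detailed-to-simplified tag mapping.
# Only the 'Ncsmn' -> 'Nc' pair comes from the corpus description;
# the real mapping file covers the full 246-tag ARCOSG tagset.
TAG_MAP = {"Ncsmn": "Nc"}


def simplify_tag(tag, tag_map=TAG_MAP):
    """Return the simplified tag for a detailed ARCOSG tag.

    Assumption: tags missing from the mapping are returned unchanged.
    """
    return tag_map.get(tag, tag)


print(simplify_tag("Ncsmn"))
```

A stricter variant could raise an error for unmapped tags instead, which would make gaps in the mapping visible during conversion.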
ARCOSG was hand-tagged by Lamb, Arbuthnot and Naismith and separately verified by them. It uses the Brown format tag separators ('/': e.g. 'agus/Cc') and an annotation scheme derived from the Irish PAROLE tagset (Uí Dhonnchadha, E. and van Genabith, J. 2006. A Part-of-Speech tagger for Irish using finite state morphology and constraint grammar disambiguation. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), 2241-2244.).
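As a rough illustration of the word/tag format, the following Python sketch splits a line into (word, tag) pairs and counts the tags. It assumes tokens are separated by whitespace and that the tag follows the last '/' in each token; neither detail is confirmed here, so check them against the annotation guidelines. The function name `parse_tagged_line` and the sample line are hypothetical (only 'agus/Cc' appears in the description).

```python
from collections import Counter


def parse_tagged_line(line):
    """Split a line of word/tag tokens into (word, tag) pairs.

    Assumptions: tokens are whitespace-separated, and the tag is
    whatever follows the last '/' in the token.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if sep:  # skip tokens that carry no tag separator
            pairs.append((word, tag))
    return pairs


# Illustrative sample line, not taken from the corpus files.
sample = "agus/Cc cù/Nc"
pairs = parse_tagged_line(sample)
tag_counts = Counter(tag for _, tag in pairs)
print(pairs)
print(tag_counts)
```

Splitting on the last '/' rather than the first keeps a word intact if it happens to contain a slash itself.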
The annotation scheme is described in a PDF included with the data: Lamb, W. and Naismith, S. (2020). Scottish Gaelic Part-of-Speech Annotation Guidelines.
Work towards ARCOSG was funded by Bòrd na Gàidhlig and the Carnegie Trust for the Universities of Scotland.
CITATION: Lamb, William; Arbuthnot, Sharon; Naismith, Susanna; Danso, Samuel (2020). Annotated Reference Corpus of Scottish Gaelic - Simplified (ARCOSG-S), 1997-2020 [dataset]. University of Edinburgh. School of Literatures, Languages and Cultures. Celtic and Scottish Studies.