An improvement for CORGA with applications to other corpora and languages: the tagging of scientific binomial nomenclature
The treatment of multiword units remains an unfinished task in natural language processing. In this context, we isolate binomial scientific nomenclature terms, whose main traits – they are Latin or Latinized multiword expressions with international recognition – distinguish them from the Galician ‘popular’ lexicon and make their treatment applicable to other languages. After reviewing how they are characterized in CORGA and other Peninsular corpora, we propose analysing scientific names as a particular subtype of noun, namely scientific nomenclature, with no values specified for gender and number. We then describe the changes made to the kernel and the training corpus to incorporate the new tag into the XIADA system, and we assess two strategies for detecting candidates: a dedicated tool for extracting scientific names, and online inventories. Finally, in light of the data provided by CORGA, we verify a significant presence of binomial scientific terms and show the relevance of the new tag for identifying them and describing their distribution.
BNC: British National Corpus (XML edition) [Accessed: 9/2/2022]
CB: Corpus Brasileiro [Accessed: 9/2/2022]
CdE: Corpus del español (Género/Histórico) [Accessed: 9/2/2022]
CdP: Corpus do português (Género/Histórico) [Accessed: 9/2/2022]
CORGA: Corpus de Referencia do Galego Actual (CORGA) [Accessed: 1-17/2/2022]
CORPES: Corpus del Español del Siglo XXI [Accessed: 9/2/2022]
CRPC: Corpus de Referencia do Português Contemporâneo [Accessed: 9/2/2022]
CT: Corpus Tècnic [Accessed: 9/2/2022]
CTAG: Corpus Técnico Anotado do Galego [Accessed: 9/2/2022]
CTILC: Corpus textual informatitzat de la llengua catalana [Accessed: 9/2/2022]
TILG: Tesouro informatizado da lingua galega [Accessed: 9/2/2022]
XIADA: Etiquetador/Lematizador do Galego Actual (XIADA) [2.8]
Bunge, Mario. 1972. La investigación científica. Barcelona: Ariel.
Calzolari, Nicoletta, Charles J. Fillmore, Ralph Grishman, Nancy Ide, Alessandro Lenci, Catherine MacLeod & Antonio Zampolli. 2002. Towards Best Practice for Multiword Expressions in Computational Lexicons. In Manuel González Rodríguez & Carmen Paz Suarez Araujo (eds.), Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02). 1934-1940. Las Palmas: European Language Resources Association (ELRA).
Caseli, Helena, Aline Villavicencio, André Machado & Maria José Finatto. 2009. Statistically-Driven Alignment-Based Multiword Expression Identification for Technical Domains. In Dimitra Anastasiou, Chikara Hashimoto, Preslav Nakov & Su Nam Kim (eds.), Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009). 1-8. Singapore: Association for Computational Linguistics.
Darriba, Víctor, Yerai Doval & Elmurod Kuriyozov. 2021. Procesamiento de expresiones multipalabra en gallego mediante Aprendizaje Profundo. Procesamiento del Lenguaje Natural, 67, 45-57.
Domínguez Noya, Eva María. 2013. Etiquetaxe e desambiguación automáticas en galego: o sistema XIADA. Santiago de Compostela: Universidade de Santiago de Compostela. [Unpublished doctoral thesis].
Domínguez Noya, Eva María. 2016. O etiquetador probabilístico de XIADA e o seu teito de acerto: a elaboración de regras lingüísticas. In Manuel González González (ed.), Lingua, pobo e terra. Estudos en homenaxe a Xesús Ferro Ruibal. 213-232. Santiago de Compostela: Xunta de Galicia / Centro Ramón Piñeiro para a Investigación en Humanidades.
Ernout, Alfred & Antoine Meillet. 2001. Dictionnaire étymologique de la langue latine. Histoire des mots. Paris: Klincksieck. [Originally published in 1932].
Graña Gil, Jorge. 2000. Técnicas de análisis sintáctico robusto para la etiquetación del lenguaje natural. A Coruña: Universidade da Coruña. [Unpublished doctoral thesis].
Manning, Christopher D. 2011. Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In Alexander F. Gelbukh (ed.), Computational linguistics and intelligent text processing, 12th International Conference, CICLing 2011, Proceedings. Part I: Lecture notes in computer science 6608. 171-189. Berlin: Springer.
Nguyen, Nhung T. H., Roselyn S. Gabud & Sophia Ananiadou. 2019. COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature. Biodiversity Data Journal 7, e29626.
Pafilis, Evangelos, Sune P. Frankild, Lucia Fanini, Sarah Faulwetter, Christina Pavloudi, Aikaterini Vasileiadou, Christos Arvanitidis & Lars Juhl Jensen. 2013. The SPECIES and ORGANISMS resources for fast and accurate identification of taxonomic names in text. PLoS ONE 8(6), e65390.
Pavlinov, Igor Ya. 2021. Taxonomic nomenclature: What’s in a name – theory and history. Boca Raton: CRC Press.
Pyle, Richard L. 2016. Towards a Global Names Architecture: The future of indexing scientific names. ZooKeys 550, 261-281.
Resolución de 24 de mayo de 2019, de la Secretaría General de Pesca, por la que se publica el listado de denominaciones comerciales de especies pesqueras y de acuicultura admitidas en España, Boletín Oficial del Estado, 143, de 15/06/2019.
Rivers, Malin. 2019. European Red List of trees. Cambridge / Brussels: IUCN.
Rojo, Guillermo. 2017. Sobre la configuración estadística de los corpus textuales. Lingüística 33(1), 121‑134.
Rouco, Miguel, José Luis Copete, Eduardo de Juana, Marcel Gil-Velasco, Juan Antonio Lorenzo, Marce Martín, Borja Milá, Blas Molina & David M. Santos. 2019. Lista de las aves de España. Madrid: SEO/BirdLife.
Seideh, Mohamed Aly Fall, Hela Fehri & Kais Haddar. 2017. Recognition and extraction of Latin names of plants for matching common plant named entities. In Linda Barone, Mario Monteleone & Max Silberztein (eds.), Automatic processing of natural-language electronic texts with NooJ. 10th International Conference, NooJ 2016, České Budějovice, Czech Republic, June 9-11, 2016, Revised Selected Papers. 132-144. Berlin: Springer.
Villavicencio, Aline, Valia Kordoni, Yi Zhang, Marco Idiart & Carlos Ramisch. 2007. Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering. In Jason Eisner (ed.), Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. 1034-1043. Prague: Association for Computational Linguistics.