Marcos Garcia
Universidade de Santiago de Compostela
Spain
Iria Gayo
Universidade de Santiago de Compostela
Spain
Isaac González López
Cilenis Language Technology
Spain
Vol 4 (2012), Pescuda
Submitted: 14-09-2012 Accepted: 14-09-2012

Abstract

Automatic named entity recognition and classification are important tasks for many natural language processing applications, such as machine translation, information extraction and question answering. This paper describes the adaptation and implementation of several open-source systems for identifying and classifying the following named entities in Galician: (i) dates, (ii) numerals, (iii) quantities and (iv) proper nouns. The first three types of named entities are analysed with FreeLing, using finite-state automata. For the proper noun recognition task, two methods are compared: (i) finite-state automata and (ii) machine learning models. Finally, the semantic classification of proper nouns is carried out with a rule-based system that takes advantage of automatically obtained resources. The paper reports evaluations of each tool, all of which are available under free licenses.
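
As an illustration of the finite-state approach mentioned in the abstract, the following minimal sketch recognizes simple Galician date expressions of the form "14 de setembro de 2012". It is not FreeLing's implementation: the transition table, the `token_class` and `find_dates` names, and the restriction to a single date pattern are assumptions made only for this example.

```python
import re

# Galician month names; everything else in this sketch is illustrative.
MONTHS = {"xaneiro", "febreiro", "marzo", "abril", "maio", "xuño",
          "xullo", "agosto", "setembro", "outubro", "novembro", "decembro"}

def token_class(token):
    """Map a token to the input symbol read by the automaton (hypothetical helper)."""
    t = token.lower()
    if t in MONTHS:
        return "MONTH"
    if t == "de":
        return "DE"
    if re.fullmatch(r"\d{4}", t):
        return "YEAR"
    if re.fullmatch(r"\d{1,2}", t) and 1 <= int(t) <= 31:
        return "DAY"
    return "OTHER"

# Transition table: state -> {input symbol: next state}.
TRANSITIONS = {
    "START": {"DAY": "DAY"},
    "DAY": {"DE": "DE1"},
    "DE1": {"MONTH": "MONTH"},
    "MONTH": {"DE": "DE2"},
    "DE2": {"YEAR": "YEAR"},
}
# Accepting states: "3 de maio" and "3 de maio de 2012" are both dates.
FINAL = {"MONTH", "YEAR"}

def find_dates(tokens):
    """Return (start, end) token spans of the longest date matches, left to right."""
    spans = []
    i = 0
    while i < len(tokens):
        state, last_final = "START", None
        j = i
        while j < len(tokens):
            nxt = TRANSITIONS.get(state, {}).get(token_class(tokens[j]))
            if nxt is None:
                break
            state = nxt
            j += 1
            if state in FINAL:
                last_final = j  # remember the last accepting position
        if last_final is not None:
            spans.append((i, last_final))
            i = last_final
        else:
            i += 1
    return spans
```

Calling `find_dates` on a tokenized sentence returns half-open token spans, so `tokens[start:end]` yields the matched expression. Remembering the last accepting position lets the automaton back off when a trailing "de" is not followed by a year.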
