Métodos para medir la riqueza léxica de los textos. Revisión y propuesta (Methods for measuring the lexical richness of texts: a review and a proposal)

Joan Torruella Casañas; Ramon Capsada Blanch

doi:10.15304/verba.44.3155

Published: 27-09-2017

Joan Torruella Casañas

ICREA - Universitat Autònoma de Barcelona

Spain

Ramon Capsada Blanch

Institut de Sabadell ICREA - Universitat Autònoma de Barcelona

Spain

Vol 44 (2017), Articles, pages 347-408

DOI: https://doi.org/10.15304/verba.44.3155

Submitted: 18-02-2016 Accepted: 18-10-2016 Published: 27-09-2017

Abstract

This paper provides a comprehensive review of the methods used to measure the lexical richness of texts and proposes a way of applying them to text corpora. First, it surveys the main existing metrics for quantifying lexical richness, explaining how each is defined and evaluating its strengths and weaknesses experimentally. Second, it proposes a methodology for measuring lexical richness across a complete text corpus, so that texts can be compared with one another and each text can be assigned a standardized rating of its lexical richness relative to the whole corpus.

Keywords:

Lexical richness, Corpus linguistics, Quantitative linguistics, Lexical statistics, Word frequency distributions, Stylometry

