Methods for measuring the lexical richness of texts. Review and proposal

Joan Torruella Casañas; Ramon Capsada Blanch

doi:10.15304/verba.44.3155

Published: 27-09-2017

Joan Torruella Casañas

ICREA - Universidad Autónoma de Barcelona

Spain

Ramon Capsada Blanch

IES Sabadell

Spain

Vol. 44 (2017), Articles, Pages 347-408

Received: 18-02-2016 Accepted: 18-10-2016 Published: 27-09-2017

Abstract

This paper offers an extensive review of the existing methods for measuring the lexical richness of texts and proposes a way of applying them to a text corpus. First, it surveys the main indices used to quantify lexical richness, explaining how each is defined and assessing its strengths and weaknesses through experimental tests. Second, it proposes a methodology for measuring lexical richness across an entire corpus, so that texts can be compared with one another and each can be assigned a systematic ranking of lexical richness within the corpus as a whole.

Keywords:

lexical richness, corpus linguistics, quantitative linguistics, lexical statistics, word frequency distributions, stylometry
