Automated Linguistic Profiling of Disputed Texts in the Uzbek Language: A Model Based on Hybrid Vectorization and Support Vector Machine (SVM)
DOI: https://doi.org/10.5281/zenodo.21153968
Abstract
This article proposes a model for the automated linguistic profiling of disputed texts in the Uzbek language. The aim is to develop a methodology for forensic-linguistic examination that estimates, in a quantitative, reproducible and interpretable manner, the probable socio-demographic characteristics of an author (gender, age and region) and assigns the text to a legal category (insult, defamation or neutral). The methodology integrates Biber’s register analysis, Lakoff’s language-and-gender theory and Nini’s theory of linguistic individuality, adapting them to the agglutinative nature of Uzbek. Features are vectorized with a hybrid of TF-IDF and FastText, and a separate Support Vector Machine classifier is trained for each profiling task. The results are demonstrated through worked examples of TF-IDF weighting, character n-gram extraction and a confusion matrix. The proposed model remains interpretable and accurate under mixed Latin-Cyrillic writing and rich morphology. The study thus offers an implementable, interpretable and ethically constrained model for Uzbek forensic linguistics.
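The three worked examples named in the abstract (character n-gram extraction, TF-IDF weighting and a confusion matrix) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the article's implementation: the function names, the space padding, the weighting variant (relative term frequency times ln(N/df)), the toy words and the toy label sequences are all assumptions made here, and the FastText and SVM stages are left out.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf(docs, n=3):
    """Weight each n-gram as (relative frequency) * ln(N / document frequency).

    Assumed variant; the article's exact weighting formula may differ.
    """
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    doc_freq = Counter(gram for c in counts for gram in c)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append(
            {g: (k / total) * math.log(len(docs) / doc_freq[g]) for g, k in c.items()}
        )
    return weights


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Toy input: a stem and its plural form (illustrative only, not corpus data).
docs = ["kitob", "kitoblar"]
print(char_ngrams(docs[0]))  # [' ki', 'kit', 'ito', 'tob', 'ob ']
print(tfidf(docs)[1])

# Invented label sequences, used only to show the matrix layout.
labels = ["insult", "defamation", "neutral"]
y_true = ["insult", "insult", "defamation", "neutral", "neutral"]
y_pred = ["insult", "neutral", "defamation", "neutral", "insult"]
print(confusion_matrix(y_true, y_pred, labels))
```

In this toy pair, n-grams of the shared stem (such as "kit") occur in both documents and receive zero weight, while n-grams that cover the suffix (such as "lar") keep a positive weight. That is one way character n-grams can expose suffix-level variation in an agglutinative language.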
Keywords:
linguistic profiling, disputed text, forensic linguistics, support vector machine, hybrid vectorization, character n-grams, Uzbek language, idiolect
References
Argamon, S., Koppel, M., Pennebaker, J. W., & Schler, J. (2009). Automatically profiling the author of an anonymous text. Communications of the ACM, 52(2), 119–123.
Biber, D. (1988). Variation across speech and writing. Cambridge University Press.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
Burrows, J. (2002). ‘Delta’: A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3), 267–287.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Elov, B., & Alayev, R. (2023). O‘zbek tili korpusi va uning imkoniyatlari [The Uzbek language corpus and its capabilities]. O‘zbekiston informatika va energetika muammolari jurnali, (2).
Elov, B., Hamroyeva, Sh., Alayev, R., Xusainova, Z., & Yodgorov, U. (2023). O‘zbek tili korpusi matnlarini qayta ishlash usullari [Methods for processing Uzbek language corpus texts]. Raqamli transformatsiya va sun’iy intellekt, 1(3), 117–130.
Grant, T. (2007). Quantifying evidence in forensic authorship analysis. International Journal of Speech, Language and the Law, 14(1), 1–25.
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of ECML-98 (LNCS 1398, pp. 137–142). Springer.
Koppel, M., Argamon, S., & Shimoni, A. R. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4), 401–412.
Lakoff, R. (1975). Language and woman’s place. Harper & Row.
Mosteller, F., & Wallace, D. L. (1964). Inference and disputed authorship: The Federalist. Addison-Wesley.
Nini, A. (2023). A theory of linguistic individuality for authorship analysis. Cambridge University Press.
O‘zbekiston Respublikasi Jinoyat kodeksi [Criminal Code of the Republic of Uzbekistan]. (1994/2020). Articles 139–140. Tashkent. https://lex.uz/docs/-111453
Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. McGraw-Hill.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Spärck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11–21.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3), 538–556.
License
Copyright (c) 2026 Mekhroj Raupov

This work is licensed under a Creative Commons Attribution 4.0 International License.
