A STUDY OF SHORT-TEXT CLASSIFICATION ALGORITHMS USING PRE-TRAINED MODELS: COMPARATIVE ANALYSIS AND A HYBRID APPROACH

Keywords

short text classification, pre-trained models, BERT, RoBERTa, XLM-RoBERTa, mBERT, DistilBERT, hybrid architecture, transfer learning, natural language processing, Uzbek language, low-resource languages.

Abstract

This paper investigates how effectively modern pre-trained language models classify short texts. Short texts (social media posts, product reviews, user queries, and SMS notifications) are characterized by limited vocabulary, lack of context, and feature sparsity. The study compares BERT, RoBERTa, DistilBERT, mBERT (multilingual BERT), and XLM-RoBERTa on five public datasets (AG News, TREC, SST-2, Twitter Sentiment140, and a local Uzbek-language corpus). The main methodological contribution is STC-Hybrid (Short Text Classification Hybrid), a new hybrid architecture with an attention mechanism that combines contextual embeddings from BERT/XLM-R with local features from a Convolutional Neural Network (CNN) and sequential dependencies from a Bidirectional GRU (BiGRU). Experimental results show that the proposed hybrid model raises the average F1 score by 2.4 to 4.1 percentage points over standard fine-tuning, with a particularly notable gain on low-resource Uzbek (F1 = 0.887). These results confirm that the hybrid architecture is effective at mitigating lexical sparsity, and they have practical value for building classification systems for low-resource languages.
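To make the combination described in the abstract concrete, here is a minimal PyTorch sketch of a classification head that takes contextual token embeddings and fuses CNN features with an attention-pooled BiGRU representation. It is not the authors' STC-Hybrid implementation: the class name `HybridHead`, the layer sizes, the kernel widths, and fusion by concatenation are all assumptions made for illustration, and random tensors stand in for the BERT/XLM-R encoder output.

```python
import torch
import torch.nn as nn


class HybridHead(nn.Module):
    """Hypothetical CNN + BiGRU + attention head over token embeddings.

    Every hyperparameter here is an illustrative assumption, not a value
    reported in the paper.
    """

    def __init__(self, hidden=768, n_filters=64, kernel_sizes=(2, 3, 4),
                 gru_hidden=128, n_classes=4):
        super().__init__()
        # Local n-gram features: one 1-D convolution per kernel width.
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, n_filters, k, padding=k // 2) for k in kernel_sizes
        )
        # Sequential dependencies: bidirectional GRU over the token sequence.
        self.gru = nn.GRU(hidden, gru_hidden, batch_first=True, bidirectional=True)
        # Attention scorer: one scalar score per token.
        self.attn = nn.Linear(2 * gru_hidden, 1)
        fused_dim = n_filters * len(kernel_sizes) + 2 * gru_hidden
        self.out = nn.Linear(fused_dim, n_classes)

    def forward(self, emb, mask):
        # emb: (batch, seq, hidden); mask: (batch, seq) with 1 for real tokens.
        x = emb * mask.unsqueeze(-1)                 # zero out padded positions
        c = x.transpose(1, 2)                        # (batch, hidden, seq) for Conv1d
        # Max-over-time pooling of each convolution's activations.
        cnn_feats = [torch.relu(conv(c)).max(dim=2).values for conv in self.convs]
        h, _ = self.gru(x)                           # (batch, seq, 2 * gru_hidden)
        scores = self.attn(h).squeeze(-1)            # (batch, seq)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=1)       # padded tokens get weight 0
        gru_feat = (weights.unsqueeze(-1) * h).sum(dim=1)
        fused = torch.cat(cnn_feats + [gru_feat], dim=1)
        return self.out(fused), weights


torch.manual_seed(0)
emb = torch.randn(2, 6, 768)                         # stand-in for encoder output
mask = torch.tensor([[1, 1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1, 1]])
head = HybridHead()
logits, weights = head(emb, mask)
```

In a full pipeline, `emb` would be the last hidden state of a pre-trained encoder and `mask` its attention mask. The sketch feeds padded sequences to the GRU without packing, which keeps the code short but means the backward direction reads padding first.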

