Recently, various areas of natural language processing, such as search engines, machine translation, and speech technologies, have been developing actively on the basis of machine learning and neural networks. Implementing and advancing these areas depends first of all on electronic linguistic resources such as corpora, dictionaries, and rule sets, and the need for them is acute. These resources must be both very large and of good quality. This article considers the shortage of corpora for low-resource languages, which include the languages of the Turkic group. The problem affects languages such as Kazakh, for which very few corpora are available. The article presents an approach to creating synthetic corpora by identifying a candidate word and replacing it with entries from a dictionary of Kazakh synonyms. Test experiments were conducted; as a result, the corpus in question was enlarged 3.37 times.
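The synonym-replacement idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy English dictionary, and the whitespace tokenization are all assumptions made for illustration. A real system for Kazakh, an agglutinative language, would need to match stems or lemmas rather than surface forms.

```python
from typing import Dict, List


def generate_variants(sentence: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return new sentences, each made by swapping one word for one synonym."""
    tokens = sentence.split()  # assumption: naive whitespace tokenization
    variants = []
    for i, token in enumerate(tokens):
        # Every word found in the dictionary is a replacement candidate.
        for synonym in synonyms.get(token.lower(), []):
            variants.append(" ".join(tokens[:i] + [synonym] + tokens[i + 1:]))
    return variants


def augment_corpus(corpus: List[str], synonyms: Dict[str, List[str]]) -> List[str]:
    """Keep the original sentences and add their unique synthetic variants."""
    seen = set()
    augmented = []
    for sentence in corpus:
        for candidate in [sentence] + generate_variants(sentence, synonyms):
            if candidate not in seen:
                seen.add(candidate)
                augmented.append(candidate)
    return augmented


# Hypothetical toy data, standing in for a real synonym dictionary and corpus.
toy_synonyms = {"big": ["large", "huge"], "fast": ["quick"]}
toy_corpus = ["the big dog runs fast", "a small cat sleeps"]

augmented = augment_corpus(toy_corpus, toy_synonyms)
growth = len(augmented) / len(toy_corpus)  # enlargement factor of the corpus
print(len(augmented), growth)
```

The enlargement factor computed here is the same kind of ratio as the 3.37 reported in the abstract: the size of the augmented corpus divided by the size of the original one.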
A TASK OF SYNTHETIC CORPORA GENERATION FOR THE LOW-RESOURCE LANGUAGE
Published December 2022
Abstract
Language
English
How to Cite
[1]
Rakhimova, D., Adali, E., Shormakova, A., Turarbek, A. and Suleimenov, Y. 2022. A TASK OF SYNTHETIC CORPORA GENERATION FOR THE LOW-RESOURCE LANGUAGE. Bulletin of Abai KazNPU. Series of Physical and mathematical sciences. 80, 4 (Dec. 2022), 169–179. DOI:https://doi.org/10.51889/2938.2022.14.84.020.