Bulletin of the Abai KazNPU, the series of "Physical and Mathematical Sciences"

A TASK OF SYNTHETIC CORPORA GENERATION FOR THE LOW-RESOURCE LANGUAGE

Published December 2022
al-Farabi Kazakh National University, Almaty
Istanbul Technical University, Istanbul
al-Farabi Kazakh National University, Almaty
al-Farabi Kazakh National University, Almaty
Institute of Information and Computational Technologies, Almaty
Abstract

Recently, various areas of natural language processing, such as search engines, machine translation, and speech technologies, have been actively developing on the basis of machine learning and neural networks. Implementing and advancing these areas requires, first of all, electronic linguistic resources such as corpora, dictionaries, and rule sets, and these resources must be both very large and of good quality. This article considers the shortage of corpora for low-resource languages, which include the Turkic languages; for Kazakh in particular, very few corpora are available. The article presents an approach to creating synthetic corpora by identifying a candidate word and replacing it with an entry from a dictionary of Kazakh synonyms. Test experiments were conducted, and as a result the given corpus was enlarged by a factor of 3.37.
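
The abstract does not spell out how candidate words are chosen, so the following is only a minimal sketch of the general idea of synonym-replacement corpus expansion, not the authors' implementation. It assumes that any whitespace-separated token found in the dictionary counts as a candidate, and that one word is replaced at a time. The dictionary `TOY_SYNONYMS`, the functions `generate_variants` and `augment_corpus`, and the sample sentence are all hypothetical and chosen for illustration.

```python
from typing import Dict, List

# Hypothetical toy synonym dictionary; the entries are illustrative only
# and are not taken from the dictionary used in the article.
TOY_SYNONYMS: Dict[str, List[str]] = {
    "үлкен": ["зор", "ірі"],
    "әдемі": ["сұлу", "көркем"],
}


def generate_variants(sentence: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Build new sentences by replacing one candidate word at a time.

    Assumption: a candidate is any token that appears as a key in the
    synonym dictionary (exact match, no morphological analysis).
    """
    tokens = sentence.split()
    variants: List[str] = []
    for i, token in enumerate(tokens):
        for synonym in synonyms.get(token, []):
            variants.append(" ".join(tokens[:i] + [synonym] + tokens[i + 1:]))
    return variants


def augment_corpus(corpus: List[str], synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the original sentences followed by their synthetic variants."""
    augmented: List[str] = []
    for sentence in corpus:
        augmented.append(sentence)
        augmented.extend(generate_variants(sentence, synonyms))
    return augmented


corpus = ["үлкен әдемі үй"]
augmented = augment_corpus(corpus, TOY_SYNONYMS)
# Growth factor of this toy corpus only; unrelated to the 3.37 reported above.
growth = len(augmented) / len(corpus)
```

Exact token matching is a simplification. Kazakh is agglutinative, so a practical system would probably need to match on the word stem and carry the suffixes over to the replacement, which this sketch does not attempt.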

Language

Eng

How to Cite

[1]
Rakhimova, D., Adali, E., Shormakova, A., Turarbek, A. and Suleimenov, Y. 2022. A TASK OF SYNTHETIC CORPORA GENERATION FOR THE LOW-RESOURCE LANGUAGE. Bulletin of the Abai KazNPU, the series of "Physical and Mathematical Sciences". 80, 4 (Dec. 2022), 169–179. DOI:https://doi.org/10.51889/2938.2022.14.84.020.