Bulletin of Abai KazNPU. Series of Physical and Mathematical sciences

PREPARATION OF DATA WITH THE HELP OF OCR FOR LLM IN KAZAKH LANGUAGE

Published March 2026

N. Toiganbayeva
Al-Farabi Kazakh National University, Almaty, Kazakhstan
https://orcid.org/0000-0003-2661-8661
G. Abdimanap
KazMunayGas Engineering LLP, Astana, Kazakhstan
https://orcid.org/0000-0003-1676-4075
A. Musa
Al-Farabi Kazakh National University, Almaty, Kazakhstan
https://orcid.org/0009-0001-9972-7677
N. Abdurakhmonova
National University of Uzbekistan named after Mirzo Ulugbek, Toshkent, Uzbekistan
https://orcid.org/0000-0001-9195-5723
Abstract

In recent years, artificial intelligence and large language models (LLMs) have developed rapidly. The effectiveness of these models depends largely on the quality of their training data, and the scarcity of structured text resources in Kazakh is a significant obstacle to LLM development for the language. This paper examines the digitization of Kazakh-language texts with OCR technology and the creation of a high-quality dataset in JSON format. The main objective is to process Kazakh texts automatically and prepare structured data suitable for training LLMs. To this end, scanned documents were collected, processed with Tesseract OCR, and converted into a structured JSON format. In total, 37,062 documents were processed and used to train the LLaMA 3.2 3B model on Kazakh text. The trained model demonstrated an understanding of national linguistic style and was able to generate poetic texts. The train/loss graph indicated stable training.
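The scan-to-JSON step described in the abstract might look roughly like the Python sketch below. It is only an illustration under stated assumptions, not the authors' pipeline: it assumes the `pytesseract` wrapper and Pillow are installed alongside Tesseract's Kazakh language data (`kaz`), and the helper names (`ocr_image`, `clean_text`, `make_record`, `to_json`) and the record fields (`id`, `source`, `text`) are hypothetical, since the abstract does not give the actual schema.

```python
import json
import re


def ocr_image(image_path, lang="kaz"):
    """Run Tesseract on one scanned page and return the recognized text.

    Assumes pytesseract, Pillow, and the 'kaz' traineddata are installed.
    """
    # Imported lazily so the text helpers below work without OCR installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def clean_text(raw):
    """Rejoin words split by a hyphen at a line break, then collapse whitespace.

    Simplification: a genuine hyphen that happens to end a line is also dropped.
    """
    joined = re.sub(r"-\n(\w)", r"\1", raw)
    return " ".join(joined.split())


def make_record(doc_id, source, text):
    """Build one dataset entry; the field names are assumed, not the paper's."""
    return {"id": doc_id, "source": source, "text": text}


def to_json(records):
    """Serialize records; ensure_ascii=False keeps Kazakh letters readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

A page would then be processed with something like `make_record("doc-1", "scan_001.png", clean_text(ocr_image("scan_001.png")))`, where the file name is a placeholder. Passing `ensure_ascii=False` matters for this use case: without it, Kazakh-specific letters such as қ and ә are written as `\uXXXX` escapes, which is still valid JSON but harder to inspect by eye.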

PDF (Kazakh)

How to Cite

[1]
Toiganbayeva N., Abdimanap G., Musa A. and Abdurakhmonova N. 2026. PREPARATION OF DATA WITH THE HELP OF OCR FOR LLM IN KAZAKH LANGUAGE. Bulletin of Abai KazNPU. Series of Physical and Mathematical sciences. 93, 1 (Mar. 2026), 251–260. DOI:https://doi.org/10.51889/2959-5894.2026.93.1.022.