In recent years, artificial intelligence and large language models (LLMs) have developed rapidly. The effectiveness of these models depends largely on the quality of their training data, and the scarcity of structured text resources in the Kazakh language poses a significant challenge for LLM development. This paper explores the digitization of Kazakh-language texts using OCR technology and the creation of a high-quality dataset in JSON format. The main objective of the study is to process Kazakh texts automatically and prepare structured data suitable for training LLMs. To this end, scanned documents were collected, processed with Tesseract OCR, and converted into a structured JSON format. In total, 37,062 documents were processed and used to train the LLaMA 3.2 3B model on Kazakh-language text. The trained model demonstrated an understanding of national linguistic style and was able to generate poetic texts. The training loss (train/loss) curve indicated stable training.
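The OCR-to-JSON step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the record fields (`id`, `source`, `text`), and the cleanup rules are assumptions made for illustration. The OCR engine is passed in as a callable so the sketch runs without Tesseract installed; in practice one might pass a wrapper around `pytesseract.image_to_string(image, lang="kaz")`, assuming the Kazakh language data is available.

```python
import json
import re
import unicodedata
from typing import Callable, Iterable, List


def clean_ocr_text(raw: str) -> str:
    """Apply light, illustrative cleanup to raw OCR output."""
    # Normalize Unicode so composed and decomposed letters compare equal.
    text = unicodedata.normalize("NFC", raw)
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse blank lines and whitespace around newlines into one newline.
    text = re.sub(r"\s*\n\s*", "\n", text)
    return text.strip()


def to_jsonl(paths: Iterable[str], ocr: Callable[[str], str]) -> List[str]:
    """Run OCR on each path and return one JSON line per non-empty document.

    `ocr` is a hypothetical callable mapping a file path to recognized text.
    """
    lines = []
    for index, path in enumerate(paths):
        text = clean_ocr_text(ocr(path))
        if not text:
            # Skip pages where OCR produced nothing usable.
            continue
        record = {"id": index, "source": path, "text": text}
        # ensure_ascii=False keeps Kazakh Cyrillic readable in the output.
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines
```

Writing one JSON object per line keeps each document independent, which makes a large corpus easy to stream, filter, or deduplicate before training; whether the study used this layout or a single JSON file is not stated in the abstract.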
PREPARATION OF DATA WITH THE HELP OF OCR FOR LLM IN KAZAKH LANGUAGE
Published March 2026
Abstract
Language
Kazakh
How to Cite
[1]
Toiganbayeva Н., Abdimanap Ғ., Musa А. and Abdurakhmonova Н. 2026. PREPARATION OF DATA WITH THE HELP OF OCR FOR LLM IN KAZAKH LANGUAGE. Bulletin of Abai KazNPU. Series of Physical and Mathematical sciences. 93, 1 (Mar. 2026), 251–260. DOI:https://doi.org/10.51889/2959-5894.2026.93.1.022.
https://orcid.org/0000-0003-2661-8661