Bulletin of the Abai KazNPU, the series of "Physical and Mathematical Sciences"

IDENTIFYING AND ANALYZING FEATURES FOR THE CLASSIFICATION OF NEWS

Published 03-2023
al-Farabi Kazakh National University, Almaty
The Institute of Information and Computational Technologies, Almaty
Abstract

The number of documents, including online news, that require deeper understanding and analysis grows every year. Machine learning algorithms help classify texts accurately. However, finding suitable structures and techniques for text, including feature extraction, remains difficult for researchers. This paper addresses the task of identifying and analyzing features that distinguish different genres of text. We studied the main characteristics of each genre of news text, such as news, articles, interviews, and blogs, to obtain more informative features. We built our data set by collecting texts from open-access official information portals. Our analysis of the data set indicates that features capturing structural complexity, detail, and imaginative content in a text help distinguish the genres in it. In particular, we use complexity features (lexical diversity, lexical density, punctuation, average sentence length, number of personal pronouns, readability index), detail features (number of proper nouns in the text, numbers, month-related words), and imaginative features (PoS tags, quantifier words, plural nouns). Our results suggest that these features provide an effective representation for distinguishing news texts from articles, blogs/opinions, and interviews with high accuracy.
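
As a rough illustration of the complexity features named in the abstract, the following Python sketch computes lexical diversity (as a type-token ratio), average sentence length, a punctuation count, and a personal-pronoun count for a single text. It is not the authors' implementation: the function name `complexity_features`, the regex-based tokenization, the punctuation set, and the pronoun list are all assumptions made for this example.

```python
import re

# Hypothetical, deliberately small pronoun list; the abstract does not specify one.
PERSONAL_PRONOUNS = {
    "i", "me", "we", "us", "you", "he", "him", "she", "her", "it", "they", "them",
}

# Assumed set of punctuation marks to count.
PUNCTUATION = set(".,;:!?\"'()-")


def complexity_features(text):
    """Return a few surface-level complexity features for one document."""
    # Rough sentence split: break on whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Word tokens: runs of ASCII letters, optionally with an internal apostrophe.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text.lower())
    n_words = len(words)
    return {
        # Type-token ratio: distinct words divided by total words.
        "lexical_diversity": len(set(words)) / n_words if n_words else 0.0,
        # Mean number of word tokens per sentence.
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        # Number of characters that belong to the assumed punctuation set.
        "punctuation_count": sum(ch in PUNCTUATION for ch in text),
        # Number of tokens found in the assumed pronoun list.
        "personal_pronouns": sum(w in PERSONAL_PRONOUNS for w in words),
    }
```

For instance, `complexity_features("We met the minister. She said we will win!")` yields an average sentence length of 4.5 and three personal pronouns. Features such as lexical density, proper-noun counts, and PoS-tag distributions would additionally need a part-of-speech tagger, which this sketch leaves out.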

Language

Eng

How to Cite

[1]
Ualiyeva, I. and Mussabayev, R. 2023. IDENTIFYING AND ANALYZING FEATURES FOR THE CLASSIFICATION OF NEWS. Bulletin of the Abai KazNPU, the series of "Physical and Mathematical Sciences". 81, 1 (Mar. 2023), 178–185. DOI:https://doi.org/10.51889/2959-5894.2023.81.1.020.