The number of documents, including online news, that require deeper understanding and analysis grows every year. Machine learning algorithms help us classify texts accurately. However, finding suitable structures and techniques for text, including feature extraction, remains difficult for researchers. This paper addresses the task of identifying and analyzing features to distinguish different genres of texts. We studied the main characteristics of each genre of news text, such as news, articles, interviews, and blogs, to obtain more informative features. We built our data set by collecting texts from open-access official information portals. Our analysis of the data set shows that features capturing structural complexity, detail, and imaginative content in a text help distinguish the genres in our data set. In particular, we use complexity features (lexical diversity, lexical density, punctuation, average sentence length, number of personal pronouns, readability index), detail features (number of proper nouns in the text, numbers, month-related words), and imaginative features (PoS tags, quantifier words, plural nouns). Our results suggest that these features provide an effective representation for distinguishing news texts from articles, blogs/opinions, and interviews with high accuracy.
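As an illustration of the complexity features named above, the following minimal Python sketch computes four of them from raw text. The definitions used here are assumptions made for illustration and are not taken from the paper: lexical diversity is taken as the type-token ratio, sentences are split on terminal punctuation, and the pronoun list is a small hypothetical English set. Lexical density and the PoS-based features would additionally require a part-of-speech tagger and are left out.

```python
import re

# Hypothetical, deliberately small set of English personal pronouns (assumption).
PERSONAL_PRONOUNS = {
    "i", "me", "we", "us", "you", "he", "him",
    "she", "her", "it", "they", "them",
}


def complexity_features(text):
    """Return a few simple complexity features for one document."""
    # Naive sentence split on runs of '.', '!' or '?'; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased alphabetic tokens (apostrophes kept inside words).
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    n_tokens = len(tokens)
    return {
        # Type-token ratio: distinct tokens divided by all tokens.
        "lexical_diversity": len(set(tokens)) / n_tokens if n_tokens else 0.0,
        # Average number of tokens per sentence.
        "avg_sentence_length": n_tokens / len(sentences) if sentences else 0.0,
        # Every character that is neither a word character nor whitespace.
        "punctuation_count": len(re.findall(r"[^\w\s]", text)),
        # Number of tokens that appear in the pronoun set.
        "personal_pronouns": sum(t in PERSONAL_PRONOUNS for t in tokens),
    }


if __name__ == "__main__":
    print(complexity_features("I saw her. She saw me!"))
```

A feature dictionary of this kind, computed per document, could then be passed to any standard classifier. The choice of classifier is not specified in the abstract.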
IDENTIFYING AND ANALYZING FEATURES FOR THE CLASSIFICATION OF NEWS
Published March 2023
Language
English
How to Cite
[1]
Ualiyeva, I. and Mussabayev, R. 2023. IDENTIFYING AND ANALYZING FEATURES FOR THE CLASSIFICATION OF NEWS. Bulletin of Abai KazNPU. Series of Physical and mathematical sciences. 81, 1 (Mar. 2023), 178–185. DOI: https://doi.org/10.51889/2959-5894.2023.81.1.020.