Text Analytics with Multi-Class and Imbalanced Learning

This project is part of the Advanced Topics in Machine Learning course. A more detailed description is given in the project documentation.

Problem: Genre Identification on (a sub-set of) Gutenberg Corpus

Consider this set of books from 19th-century English fiction¹.

The data set was created from Project Gutenberg². It consists of about 1000 books and roughly 10 genres. The task is to detect the genre³ of a book, i.e. multi-class classification. Each data point in this classification task is a fiction book with a genre label. The project addresses the following three main challenges:

Extraction of features that are relevant to fiction books, such as sentiment and setting⁴, using appropriate libraries.
An outline of all the models used, including why they were chosen and how model selection was performed.
An explanation of how the models are evaluated and how the data set is partitioned, taking into account potential challenges such as class imbalance.
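
The partitioning and evaluation concerns in the last point can be illustrated with a minimal sketch. It is not the project's actual code: the genre names, the label counts, and the helper functions `stratified_split` and `macro_f1` are all hypothetical, and it uses only the Python standard library. It shows a split that keeps each genre's share roughly equal in the train and test parts, and a macro-averaged F1 score that weights rare genres as much as common ones.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each genre's share roughly preserved."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Keep at least one book per genre in the test part when the genre has 2+ books.
        n_test = max(1, round(len(indices) * test_fraction)) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, heavily imbalanced labels (not the real corpus).
labels = ["Literary"] * 80 + ["Detective"] * 15 + ["Western"] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=0)

# Baseline that always predicts the majority genre.
y_true = [labels[i] for i in test_idx]
y_pred = ["Literary"] * len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, macro-F1={score:.2f}")
```

On these made-up labels the majority-class baseline looks strong by accuracy but weak by macro-F1, which is why an imbalance-aware metric matters for a task like this. In practice a library such as scikit-learn offers equivalents (`train_test_split` with the `stratify` argument, and `f1_score` with `average="macro"`).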
