An End-to-End Pipeline for Text Modelling
This repository implements all the steps required for building a Text Classifier in Python. Each step is explained in the Jupyter Notebook (fit_nlp.ipynb).
Building a Text Classifier involves the following steps:
- Data Analysis: This step involves exploring the raw data, cleaning it, and transforming it to bring out valuable insights (see the cleaning sketch after this list).
- Feature Engineering: Feature engineering involves creating a representation of the input data from which the model can learn more effectively. In this repository, I have used the Bag of Words approach to create feature vectors for the input data. Bag of Words is a representation used in Natural Language Processing in which each training example is represented as a multiset of its words (see the Bag of Words sketch after this list).
- Model Selection, Training and Evaluation: In this step, a model is first selected based on the properties of the data and then trained on the training sample. To guard against underfitting and overfitting, the model is evaluated on held-out data using evaluation metrics (see the training and evaluation sketch after this list).
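As a rough illustration of the cleaning step, the snippet below lowercases text, strips digits and punctuation, and drops stopwords. The tiny `STOPWORDS` set and the `clean_text` helper are illustrative assumptions; the actual preprocessing in fit_nlp.ipynb may differ.

```python
import re
import string

# Illustrative stopword list; a fuller list (e.g. from NLTK) would normally be used.
STOPWORDS = {"a", "an", "the", "is", "are", "and", "or", "of", "to", "in", "was"}

def clean_text(text: str) -> str:
    """Lowercase, strip digits and punctuation, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)  # drop digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

print(clean_text("The 2 movies were AMAZING, and the acting was great!"))
# movies were amazing acting great
```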
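For the Bag of Words representation, here is a minimal sketch using scikit-learn's `CountVectorizer`; the notebook may build the count vectors differently, so treat this only as an illustration of the idea.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the movie was great",
    "the movie was terrible",
    "great acting and a great plot",
]

# Each document becomes a vector of word counts over the shared vocabulary,
# i.e. a multiset ("bag") of its words.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(X.toarray())                         # one count vector per document
```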
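Finally, a hedged sketch of the selection, training and evaluation step: it assumes a logistic regression classifier, a simple train/test split, and accuracy plus a classification report as metrics, none of which are prescribed by the repository; the toy texts and labels are placeholders for a real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy labelled data; the notebook works on a real dataset.
texts = [
    "great movie loved it", "fantastic plot and acting",
    "terrible movie waste of time", "boring plot bad acting",
    "really enjoyable film", "awful would not recommend",
]
labels = [1, 1, 0, 0, 1, 0]

# Hold out part of the data so evaluation reflects generalisation, not memorisation.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42, stratify=labels
)

# Bag of Words features followed by a simple linear classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

preds = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, preds))
print(classification_report(y_test, preds))
```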