Write a script under src/papers/features that preprocesses the NIPS dataset #6
To improve the efficiency of our models, we often preprocess the documents before running the modeling step itself.
This includes methods such as segmenting and tokenizing the text - breaking the document into sentences and words; removing 'stop-words', which are frequent words in the language that don't contribute to the meaning of the text; stemming and/or lemmatizing words, and more.
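To make these steps concrete, here is a rough sketch of each one as a small function, using only the standard library. The function names, the tiny stop-word list, and the suffix-stripping "stemmer" are illustrative assumptions, not part of the project; a real script would most likely rely on NLTK or spaCy for these steps.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real script would
# probably use the list shipped with NLTK or spaCy).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and", "that"}


def split_sentences(text):
    """Naively split text into sentences on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


if __name__ == "__main__":
    text = "Neural networks are trained. They learn features!"
    for sentence in split_sentences(text):
        tokens = remove_stop_words(tokenize(sentence))
        print([crude_stem(t) for t in tokens])
```

Keeping each step in its own function makes it easy to test them separately and to swap the toy versions for library-backed ones later.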
For more info about the process, we encourage you to visit chapters 3 & 5 in http://www.nltk.org/book/
Besides NLTK, another possible package is spaCy: https://spacy.io/usage/spacy-101#section-lightning-tour
To sum up: the script, which should be divided into testable functions, should take a dataset with a documents column as input and return the dataset with an additional column, preprocessed_docs, which contains the preprocessing result.
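One possible shape for that interface is sketched below. It assumes the dataset is a list of dict rows (for example, as read by csv.DictReader); the function names and the minimal preprocess_document step are hypothetical placeholders for whatever preprocessing the script ends up using. With a pandas DataFrame, the same idea would amount to applying the per-document function to the documents column.

```python
import re

# Illustrative stop-word list (an assumption, not the project's actual list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and"}


def preprocess_document(text):
    """Placeholder preprocessing: lowercase, tokenize, drop stop-words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def add_preprocessed_column(rows, source="documents", target="preprocessed_docs"):
    """Return new rows, each with an extra `target` field built from `source`.

    The input rows are left unmodified.
    """
    return [{**row, target: preprocess_document(row[source])} for row in rows]


if __name__ == "__main__":
    dataset = [{"id": 1, "documents": "The model is trained on the NIPS papers."}]
    print(add_preprocessed_column(dataset))
```

Because the per-document step and the dataset-level step are separate functions, each can be unit-tested on its own with a small in-memory input.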