Write a script under src/papers/features that preprocesses the NIPS dataset #6

liadmagen · 2018-10-13T19:22:03Z

To improve the efficiency of our models, we often preprocess the documents before running the modeling step itself.

This includes methods such as segmenting and tokenizing the text - breaking the document into sentences and words; removing 'stop-words', which are frequent words in the language that don't contribute to the meaning of the text; stemming and/or lemmatizing words, and more.
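The steps listed above can be sketched in plain Python, without any third-party package. This is only an illustration: the function names, the tiny stop-word list, and the toy suffix stripper are all assumptions made for the example, and a real script would more likely rely on the tokenizers, stop-word lists, and stemmers or lemmatizers shipped with NLTK or spaCy.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch);
# NLTK and spaCy ship much more complete lists.
STOP_WORDS = {"a", "an", "and", "are", "is", "in", "of", "the", "to"}


def split_sentences(text):
    """Naively split text into sentences after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def crude_stem(token):
    """Toy suffix stripper; a stand-in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess_text(text):
    """Run all steps and return one list of stemmed tokens per sentence."""
    return [
        [crude_stem(token) for token in remove_stop_words(tokenize(sentence))]
        for sentence in split_sentences(text)
    ]
```

Keeping each step in its own small function makes it easy to unit-test the steps separately and to swap any of them for a library-backed version later.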

For more info about the process, we encourage you to visit chapters 3 & 5 in http://www.nltk.org/book/

Besides NLTK, another possible package is spaCy: https://spacy.io/usage/spacy-101#section-lightning-tour

To sum up: the script, which should be divided into testable functions, should accept a dataset with a documents column as input and return the dataset with an additional column, preprocessed_docs, containing the preprocessing result.
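One possible shape for that interface is sketched below. To stay dependency-free, it assumes the dataset is a list of dict records; the issue does not specify the container type, and the names preprocess_document and add_preprocessed_column are hypothetical. The per-document step here is a deliberately simple placeholder.

```python
import re


def preprocess_document(text):
    """Placeholder step: lowercase the text and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def add_preprocessed_column(rows, preprocess=preprocess_document):
    """Return new rows, each with an added 'preprocessed_docs' field.

    `rows` is assumed to be a list of dicts with a 'documents' key.
    The input rows are not modified.
    """
    return [
        {**row, "preprocessed_docs": preprocess(row["documents"])}
        for row in rows
    ]


if __name__ == "__main__":
    dataset = [{"id": 1, "documents": "Deep Learning for Vision."}]
    for row in add_preprocessed_column(dataset):
        print(row)
```

If the dataset is loaded as a pandas DataFrame instead, the same idea could be expressed as df.assign(preprocessed_docs=df["documents"].map(preprocess_document)). Passing the per-document function as a parameter keeps the wrapper easy to test with a stub.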

liadmagen added the labels good first issue, hacktoberfest (https://hacktoberfest.digitalocean.com/), and help wanted on Oct 13, 2018
