Write a script under src/papers/features that preprocesses the NIPS dataset #6

Open
liadmagen opened this issue Oct 13, 2018 · 0 comments
Labels
good first issue Good for newcomers hacktoberfest 🍁 https://hacktoberfest.digitalocean.com/ help wanted Extra attention is needed

Comments

liadmagen commented Oct 13, 2018

To improve the efficiency of our models, we often preprocess the documents before the main processing starts.

This includes methods such as segmenting and tokenizing the text (breaking the document into sentences and words); removing 'stop-words', which are frequent words in the language that contribute little to the meaning of the text; stemming and/or lemmatizing words; and more.

For more info about the process, we encourage you to visit chapters 3 & 5 in http://www.nltk.org/book/

Besides NLTK, another possible package is spaCy: https://spacy.io/usage/spacy-101#section-lightning-tour
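To make the steps above concrete (sentence segmentation, tokenization, stop-word removal, stemming), here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the function names, the tiny `STOP_WORDS` set, and the crude suffix-stripping `naive_stem` are hypothetical stand-ins, and a real script would more likely rely on the tokenizers, stop-word lists, and stemmers or lemmatizers that NLTK or spaCy provide.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch);
# NLTK and spaCy ship much fuller lists.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def split_sentences(text):
    """Naively split text into sentences after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


def naive_stem(token):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run all steps on one document and return a flat list of stems."""
    tokens = []
    for sentence in split_sentences(text):
        words = remove_stop_words(tokenize(sentence))
        tokens.extend(naive_stem(w) for w in words)
    return tokens


print(preprocess("The models are training. Tokens are counted!"))
```

Keeping each step in its own small function means each one can be unit-tested separately and later swapped for a library-backed version without touching the rest.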

To sum up: the script, which should be divided into testable functions, should receive a dataset with a documents column as input and return the dataset with an additional column, preprocessed_docs, which contains the preprocessing result.
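One possible shape for that interface is sketched below. It assumes the dataset is a list of dict rows (as `csv.DictReader` would produce); if the dataset is a pandas DataFrame instead, the equivalent would be along the lines of `df["documents"].apply(preprocess)`. The names `add_preprocessed_column` and `simple_preprocess` are hypothetical, and `simple_preprocess` is only a placeholder for the real preprocessing pipeline.

```python
def add_preprocessed_column(rows, preprocess,
                            source="documents", target="preprocessed_docs"):
    """Return new rows where `target` holds preprocess(row[source])."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input dataset is left untouched
        new_row[target] = preprocess(row[source])
        result.append(new_row)
    return result


def simple_preprocess(text):
    """Placeholder preprocessing: lowercase and split on whitespace."""
    return text.lower().split()


# Made-up rows, purely for illustration.
dataset = [
    {"title": "Paper A", "documents": "Deep Learning works"},
    {"title": "Paper B", "documents": "Kernel methods"},
]
processed = add_preprocessed_column(dataset, simple_preprocess)
print(processed[0]["preprocessed_docs"])
```

Passing the preprocessing function in as an argument keeps the column-handling logic testable on its own, independent of whichever NLP library ends up doing the actual work.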
