Useful NLP functions / pipelines / transformers for text mining, in a package format.
TODO: add when ready
This package is compatible with Linux/OSX systems
See requirements.txt
Most prerequisite packages will be installed automatically.
However, the spellcheck submodule depends on the enchant system library. If this library is not present, pyenchant will not work.
- On an Ubuntu / Debian system, run in your bash shell: sudo apt-get install enchant
- On a RedHat / CentOS / Cloudera CDH system, run in your bash shell: sudo yum install enchant
- On OSX, run: brew install enchant
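Once enchant is installed, a quick way to confirm that pyenchant can see it is a check like the following (a minimal sketch; the en_US dictionary tag is just an example):

```python
# Sanity check: pyenchant should find the system enchant library.
import enchant

d = enchant.Dict("en_US")    # load an English dictionary
print(d.check("hello"))      # True for a correctly spelled word
print(d.suggest("helo"))     # spelling suggestions for a misspelling
```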
After that, the fastest way to make sure you have everything is to go to the main directory and run:
pip install -r requirements.txt
python -m nltk.downloader vader_lexicon stopwords wordnet brown_tei gutenberg punkt popular
After downloading these packages and their associated data (in the case of NLTK), everything should run smoothly. If not, please open an issue here on this repo.
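If you prefer, the same NLTK data can be fetched from inside Python using NLTK's own downloader (equivalent to the shell command above):

```python
# Download the NLTK corpora and models this package relies on.
import nltk

for resource in ["vader_lexicon", "stopwords", "wordnet",
                 "brown_tei", "gutenberg", "punkt", "popular"]:
    nltk.download(resource)
```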
TODO: add when ready. This is still in dev testing / alpha stage. The plan is, after code review, to
upload to PyPI. (So at some point it's going to be pip install nlpbumblebee
or something similar...)
For the time being, git clone the repo and enter:
python setup.py install
(You will need pytest installed. If you don't have it, just run: pip install pytest pytest-cov)
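To verify the install worked, a quick sanity check (assuming the package imports under the nlpfunctions name used by the coverage command below):

```python
# Confirm the package is importable after installation.
import nlpfunctions
print(nlpfunctions.__name__)
```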
To run all the automated tests after you have cloned the repo into your system, just do:
cd tests
pytest -v ## run all tests
cd ..
pytest -v --cov=nlpfunctions tests/ ## run tests and calculate testing coverage
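For contributors adding new functions, tests follow the usual pytest conventions: files named test_*.py containing test_* functions with plain assert statements. A hypothetical example of what a test in tests/ might look like (the tokeniser under test here is illustrative, not part of the package):

```python
# tests/test_example.py -- a hypothetical test in pytest style.

def simple_tokenise(text):
    """Illustrative whitespace tokeniser standing in for a package function."""
    return text.split()

def test_simple_tokenise():
    assert simple_tokenise("text mining is fun") == ["text", "mining", "is", "fun"]
```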
Continuous Integration is a software development practice where members of a team integrate their work on a main repo frequently. Usually each person integrates their work at least daily leading to multiple integrations per day. Each integration is verified by an automated build (that includes running an automated test harness) to detect integration errors as quickly as possible. Many teams find that this approach leads to significantly reduced integration problems and allows a team to develop cohesive software more rapidly.
We are using Travis CI for this process.
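A minimal .travis.yml along these lines would drive that process, installing the dependencies and running the test suite on every push (a sketch, not necessarily the exact configuration used in this repo):

```yaml
language: python
python:
  - "3.6"
install:
  - pip install -r requirements.txt
  - pip install pytest pytest-cov
  - python -m nltk.downloader vader_lexicon stopwords wordnet brown_tei gutenberg punkt popular
script:
  - pytest -v --cov=nlpfunctions tests/
```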
Currently there is automatically generated documentation supported by sphinx-doc. This documentation is also available. Additionally, there is an examples/ folder where simple tasks using functions from this package are demonstrated.
- NLTK - Natural Language ToolKit
- scikit-learn - machine learning framework
- pytest - unit testing framework
- black - code formatter
- sphinx-doc - documentation framework
- Travis CI - Continuous Integration framework
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
- Theodore Manassis - mamonu
- Alessia Tosi - exfalsoquodlibet
See also the list of contributors who participated in this project.
This project is licensed under the MIT License - see the LICENSE.md file for details
In opensource everyone is standing on the shoulders of giants...
or possibly a really tall stack of ordinary-height people
The authors would like to thank, in no particular order:
- the ONS Big Data team (check their repos here)
- the NLTK maintainers
- the scikit-learn maintainers
- Benjamin Bengfort, Tony Ojeda, and Rebecca Bilbro, the authors of one of the most useful NLP books out there: Applied Text Analysis with Python
- Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. (January 2003). Lafferty, John, ed. "Latent Dirichlet Allocation". Journal of Machine Learning Research. 3 (4–5): 993–1022.
- Blei, David (April 2012). "Probabilistic Topic Models". Communications of the ACM. 55 (4): 77–84.
- Lee, Daniel D.; Seung, H. Sebastian (1999). "Learning the parts of objects by non-negative matrix factorization". Nature. 401 (6755): 788.
- Bird, S.; Klein, E.; Loper, E. (2009). Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.
- Mihalcea, R.; Tarau, P. (2004). "TextRank: Bringing Order into Text". In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.
- Li, W. (1992). "Random texts exhibit Zipf's-law-like word frequency distribution". IEEE Transactions on Information Theory. 38 (6): 1842–1845.
- Knuth, D. E.; Morris, Jr., J. H.; Pratt, V. R. (1977). "Fast pattern matching in strings". SIAM Journal on Computing. 6 (2): 323–350.
- Hutto, C. J.; Gilbert, E. (2014). "VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text". In Eighth International Conference on Weblogs and Social Media (ICWSM-14). Available (as of 08/10/18) at http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf