If you do not want to download the .ipynb files, or if the graphs do not display correctly, please refer to the html folder.
Given a set of financial reports issued by publicly traded companies, apply machine learning techniques to find the topics of the paragraphs in the documents.
- Use the programming language `python3.4`;
- Use the libraries `spacy` and `sklearn`;
- Apply the methods `TF/IDF` and `LDA` to analyze the given texts.
- html contains the HTML versions of the .ipynb files;
- img contains the charts produced by the two methods, `TF/IDF` and `LDA`;
- presentation contains the slides of our presentation;
- src contains all the configuration we needed;
- the file LDA applies the `LDA` method;
- the file TFIDF applies the `TF/IDF` method.
- Remove all the unnecessary symbols;
- Remove all the stop words;
- Remove all the numbers;
- Replace each word with its lemma.
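
The four preprocessing steps above can be sketched in a few lines. This is a minimal, standard-library-only illustration and not the project's actual code: `STOP_WORDS` and `LEMMAS` are tiny hypothetical stand-ins for what spaCy would normally supply (for example through `token.is_stop` and `token.lemma_`), and the function name `preprocess` is an assumption.

```python
import re

# Hypothetical toy resources. In the project these would come from spaCy
# (token.is_stop for stop words, token.lemma_ for lemmas).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "by", "were", "was", "is"}
LEMMAS = {
    "revenues": "revenue",
    "increased": "increase",
    "costs": "cost",
    "reported": "report",
}


def preprocess(text):
    """Return the list of lemmas that remain after cleaning one paragraph."""
    text = text.lower()
    # 1. Remove unnecessary symbols (anything that is not a letter, digit or space).
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    # 2. Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3. Remove numbers (any token that contains a digit).
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]
    # 4. Replace each word with its lemma (unknown words are kept as they are).
    return [LEMMAS.get(t, t) for t in tokens]


sample = "Revenues increased by 12% in 2016, and the costs were reported."
print(preprocess(sample))
```

With a real spaCy pipeline the lookup tables would disappear, but the order of the steps would likely stay the same.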
- TF/IDF stands for term frequency-inverse document frequency.
- Our goals:
- Find the most important words in a given text;
- Track the trend of these words over several years.
- Apply the latent semantic indexing (LSI) technique to the TF/IDF matrix to implement an information retrieval (query) program.
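
As a rough illustration of the TF/IDF goals above, the sketch below builds a TF/IDF matrix with sklearn, reads off the highest-weighted words of one document, and answers a query in a reduced LSI space. It assumes that LSI is computed as a truncated SVD of the TF/IDF matrix (sklearn's `TruncatedSVD`), which may differ from what the notebooks do. The corpus `docs` and the helper `query` are hypothetical, and the texts are assumed to be preprocessed already.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus: one already-preprocessed paragraph per entry.
docs = [
    "revenue increase strong sale growth",
    "loan interest rate bank credit",
    "sale growth revenue market share",
    "credit risk loan default bank",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)

# Column index -> term, rebuilt from the fitted vocabulary.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Most important words of document 0: the largest TF/IDF weights in its row.
row = tfidf[0].toarray().ravel()
top_terms = [terms[i] for i in row.argsort()[::-1][:2]]
print("top terms of document 0:", top_terms)

# LSI (assumption: truncated SVD of the TF/IDF matrix).
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = svd.fit_transform(tfidf)  # documents in the 2-dimensional LSI space


def query(text):
    """Return document indices ordered from most to least similar to `text`."""
    q_vec = svd.transform(vectorizer.transform([text]))
    sims = cosine_similarity(q_vec, doc_vecs).ravel()
    return sims.argsort()[::-1]


ranking = query("bank loan")
print("documents ranked for 'bank loan':", list(ranking))
```

To follow a word over several years, one option is to fit the vectorizer on the reports of all years and compare the weight of that word in each year's documents.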
- LDA stands for Latent Dirichlet Allocation.
- Our goals:
- Find several topics in a given text;
- Find the paragraphs in the text that are most related to each topic;
- Track the trend of the topics over several years.
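
The LDA goals above can be sketched with sklearn's `LatentDirichletAllocation`. This is only an illustration under stated assumptions: the four `paragraphs` are a hypothetical, already-preprocessed toy corpus, the choice of two topics is arbitrary, and the variable names are not taken from the project's notebooks.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: one already-preprocessed paragraph per entry.
paragraphs = [
    "revenue increase strong sale growth",
    "loan interest rate bank credit",
    "sale growth revenue market share",
    "credit risk loan default bank",
]

# LDA works on raw term counts rather than TF/IDF weights.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(paragraphs)
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

n_topics = 2  # arbitrary choice for this sketch
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
doc_topic = lda.fit_transform(counts)  # shape (n_paragraphs, n_topics), rows sum to 1

# Topics: the highest-weighted words of each topic.
top_words = [
    [terms[i] for i in lda.components_[k].argsort()[::-1][:3]]
    for k in range(n_topics)
]

# Paragraph most related to each topic: the largest entry in that topic's column.
best_paragraph = [int(doc_topic[:, k].argmax()) for k in range(n_topics)]

for k in range(n_topics):
    print("topic", k, "words:", top_words[k], "best paragraph:", best_paragraph[k])
```

Averaging the rows of `doc_topic` over all paragraphs of one year gives a per-year topic share, which is one possible way to follow a topic over several years.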
- library spacy: https://spacy.io/
- library sklearn: http://scikit-learn.org/stable/
- Wikipedia for TF-IDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
- Wikipedia for LSI: https://en.wikipedia.org/wiki/Latent_semantic_analysis
- Blei, D.M., Ng, A.Y. and Jordan, M.I., 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), pp.993-1022.