Stopwords on KeyBERT #203
Comments
Could you share your full code and error message? Without that, it's not possible for me to understand what exactly is happening here.
file_path = os.path.join(app.root_path, 'static', 'stopwords.txt')
Suppose stopwords.txt has words like
I am getting this warning:
UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['climate', 'integrated', 'services'] not in stop_words.
Also, the phrase climate integrated service is still showing.
Here is another reference for the code, with KeyLLM:
import openai
# Create your LLM
openai.api_key = "sk-...."
# Load it in KeyLLM
kw_model = KeyBERT(llm=llm)
# Extract keywords
keywords = kw_model.extract_keywords(documents, top_n=20, stop_words=stopwords)
As you can see, the stop words are clearly not working, and it's the same for top_n.
It is difficult to read your code without any formatting, so I am not sure what exactly you are running, nor did you share your full error log. However, if you are using KeyLLM, then it merely uses the keywords generated by KeyBERT as the candidate keywords. Then, the LLM can decide freely which keywords to create. If you want to change things like
Sorry, let me make it clearer. In the code below I am only using KeyBERT.
### **Code**
from keybert import KeyBERT
from PyPDF2 import PdfReader

file_path = my_file_path  # placeholder: path to the PDF file

# pdf extraction: concatenate the text of every page
text = ""
pdf_reader = PdfReader(file_path)
for page in pdf_reader.pages:
    text += page.extract_text()

Final_stopwords = ['clarity climate services', 'climate change', 'infrastructure']

# keybert function
kw_model = KeyBERT()
keyphrases = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 3), top_n=7, stop_words=Final_stopwords)
print(keyphrases)
So, as you can see, I have the terms 'clarity climate services' and 'climate change' in the stop words, but they still clearly show up in the results. I am facing this problem only with key phrases and not with single keywords.
I updated your comment to have better formatting, please take a look for future requests. With respect to the stop words, it might be an issue/feature of the CountVectorizer. If you take a look at the official documentation here concerning the
Am getting this weird error
UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['climate', 'integrated', 'services'] not in stop_words.
Originally posted by @ChettakattuA in #200 (comment)
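One possible workaround, sketched here under assumptions rather than offered as KeyBERT's own mechanism, is to leave `stop_words` for single words only and filter unwanted phrases out of the results afterwards. The helper names `drop_stop_phrases` and `_contains_sequence` are hypothetical, and the scores in the sample list are made-up placeholders standing in for the `(phrase, score)` tuples that `extract_keywords` returns.

```python
def _contains_sequence(tokens, seq):
    """Return True if `seq` occurs as a contiguous run inside `tokens`."""
    n = len(seq)
    return any(tokens[i:i + n] == seq for i in range(len(tokens) - n + 1))


def drop_stop_phrases(keyphrases, stop_phrases):
    """Drop (phrase, score) pairs whose phrase contains any stop phrase.

    Hypothetical helper: matching is case-insensitive and on whole words.
    """
    # Skip empty entries so an empty stop phrase cannot match everything.
    stop_seqs = [p.lower().split() for p in stop_phrases if p.strip()]
    kept = []
    for phrase, score in keyphrases:
        tokens = phrase.lower().split()
        if not any(_contains_sequence(tokens, seq) for seq in stop_seqs):
            kept.append((phrase, score))
    return kept


# Placeholder results in the shape returned by extract_keywords.
sample_keyphrases = [
    ("clarity climate services", 0.6),
    ("climate change adaptation", 0.5),
    ("urban planning", 0.4),
]
stop_phrases = ["clarity climate services", "climate change", "infrastructure"]

filtered = drop_stop_phrases(sample_keyphrases, stop_phrases)
print(filtered)
```

Because the filtering happens after extraction, fewer than `top_n` phrases may remain; requesting a larger `top_n` and truncating the filtered list is one way to compensate.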