Stopwords on KeyBERT #203

ChettakattuA · 2024-01-25T23:55:12Z

I also have a very serious issue with keyphrases in KeyBERT. For example, if I add "climate integrated services" to the stopword list, then because it consists of three words it is treated as a phrase and my stopword entry is ignored.

I am getting this odd warning:
UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['climate', 'integrated', 'services'] not in stop_words.

Originally posted by @ChettakattuA in #200 (comment)


MaartenGr · 2024-01-26T15:06:41Z

Could you share your full code and error message? Without that, it's not possible for me to understand what exactly is happening here.

ChettakattuA · 2024-01-30T09:05:15Z

import os

# app, kw_model, pdfText, topN and the initial Final_stopwords list are defined elsewhere
file_path = os.path.join(app.root_path, 'static', 'stopwords.txt')
if os.path.exists(file_path) and os.path.getsize(file_path) > 0:
    with open(file_path, 'r') as file:
        for line in file:
            Final_stopwords.append(line.strip())  # Remove newline characters

# Deduplicate and sort the stopword list
Final_stopwords = sorted(set(Final_stopwords))
keyphrases = kw_model.extract_keywords(pdfText, keyphrase_ngram_range=(1, 3), top_n=topN, stop_words=Final_stopwords)

Suppose stopwords.txt contains entries like:
climate
climate integrated services
infrastructure

I am getting this warning:
UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['climate', 'integrated', 'services'] not in stop_words.

Also, the phrase "climate integrated services" still shows up in the results.

ChettakattuA · 2024-01-30T09:55:43Z

Here is another example, this time using KeyLLM:

import openai
from keybert.llm import OpenAI
from keybert import KeyLLM, KeyBERT

# Create your LLM
openai.api_key = "sk-...."
llm = OpenAI()

# Load it in KeyLLM
kw_model = KeyBERT(llm=llm)
stopwords = ["deforestation", "Deforestation"]

# Extract keywords
keywords = kw_model.extract_keywords(documents, top_n=20, stop_words=stopwords)
keywords

[['Deforestation',
  'Document',
  'Environmental destruction',
  'Logging',
  'Clearing',
  'Forests']]

As you can see, the stopwords are clearly not being applied, and the same goes for top_n.

MaartenGr · 2024-01-30T10:38:04Z

It is difficult to read your code without any formatting so I am not sure what you are exactly running, nor did you share your full error log. However, if you are using KeyLLM, then it merely uses the keywords generated by KeyBERT as the candidate keywords. Then, the LLM can decide freely which keywords to create. If you want to change things like top_n and stopwords inside KeyLLM, you will have to do that within the prompt.
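To make "do that within the prompt" concrete, here is a minimal sketch of a prompt that carries the limit and the banned terms. It only builds a string. The helper name `build_keyword_prompt` is hypothetical, and it assumes the LLM wrapper accepts a custom prompt in which a `[DOCUMENT]` placeholder is replaced by the input text. Check the KeyLLM documentation for the exact parameter before relying on it.

```python
def build_keyword_prompt(top_n, stop_words):
    """Build a prompt that asks the LLM to respect a keyword limit and a ban list.

    Assumption: the LLM wrapper substitutes the document for "[DOCUMENT]".
    """
    # Normalise casing and drop duplicates so "Deforestation" and
    # "deforestation" appear only once in the instruction.
    banned = ", ".join(sorted({word.lower() for word in stop_words}))
    return (
        "I have the following document:\n"
        "[DOCUMENT]\n\n"
        f"Extract at most {top_n} keywords from this document.\n"
        f"Never return any of these terms, in any casing: {banned}.\n"
        "Return the keywords as a single comma-separated list."
    )


prompt = build_keyword_prompt(top_n=20, stop_words=["deforestation", "Deforestation"])
print(prompt)
```

An LLM may still ignore such instructions, so filtering the returned keywords afterwards is a reasonable safety net.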

ChettakattuA · 2024-01-30T10:50:52Z

Sorry, let me make it clearer. In the code below I am only using KeyBERT.

### **Code**

from keybert import KeyBERT
from PyPDF2 import PdfReader

file_path = my_file_path

# PDF extraction
text=""
pdf_reader = PdfReader(file_path)
for page in range(len(pdf_reader.pages)):
    text += pdf_reader.pages[page].extract_text()

Final_stopwords = ['clarity climate services', 'climate change', 'infrastructure']

# KeyBERT function
kw_model = KeyBERT()
keyphrases = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 3), top_n=7, stop_words=Final_stopwords)
print(keyphrases)

**Result**
C:\Users\ChettakattuA\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\feature_extraction\text.py:409: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['change', 'clarity', 'climate', 'services'] not in stop_words.

integrated climate services : 0.6006
clarity climate services : 0.6003
existing climate intelligence : 0.5546
climate intelligence can : 0.5317
climate intelligence : 0.5298
climate adaptation service : 0.5271
climate services information : 0.5254

So, as you can see, I have the terms 'clarity climate services' and 'climate change' in the stop words, but 'clarity climate services' still appears in the results. I am facing this problem only with keyphrases, not with single keywords.
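The token list in the warning can be reproduced without KeyBERT. The sketch below assumes the vectorizer lowercases input and tokenizes with scikit-learn's documented default `token_pattern`. It splits each stop word entry the same way and reports every resulting token that is not itself in the stop word list. The function name `inconsistent_tokens` is made up for this illustration.

```python
import re

# scikit-learn's documented default token_pattern: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def inconsistent_tokens(stop_words):
    """Return tokens produced by tokenizing the stop words that are not
    themselves stop words (the list shown in the UserWarning)."""
    stop_set = set(stop_words)
    leftover = set()
    for entry in stop_words:
        for token in TOKEN_PATTERN.findall(entry.lower()):
            if token not in stop_set:
                leftover.add(token)
    return sorted(leftover)


final_stopwords = ["clarity climate services", "climate change", "infrastructure"]
print(inconsistent_tokens(final_stopwords))
# ['change', 'clarity', 'climate', 'services']
```

The multi-word entries never match a single token, which is consistent with them having no effect on the extracted keyphrases.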

MaartenGr · 2024-01-30T11:58:39Z

I updated your comment to have better formatting; please take a look for future requests. With respect to the stop words, it might be an issue/feature of the CountVectorizer. If you take a look at the official scikit-learn documentation for its stop_words parameter, it might just be that it does not support n-grams. I am not sure though, so you would have to check.
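If the vectorizer does only match stop words against single tokens, one possible workaround is to filter the returned keyphrases afterwards. The sketch below is an assumption-laden illustration, not KeyBERT functionality: `drop_stop_phrases` is a hypothetical helper, it matches whole phrases only (case-insensitively), and the sample data is the result list posted above.

```python
def drop_stop_phrases(keyphrases, stop_phrases, top_n):
    """Remove (phrase, score) pairs whose phrase is an exact stop phrase,
    then keep the first top_n survivors in their original order."""
    blocked = {phrase.lower().strip() for phrase in stop_phrases}
    kept = [
        (phrase, score)
        for phrase, score in keyphrases
        if phrase.lower().strip() not in blocked
    ]
    return kept[:top_n]


# Results as posted in this thread (already sorted by score).
results = [
    ("integrated climate services", 0.6006),
    ("clarity climate services", 0.6003),
    ("existing climate intelligence", 0.5546),
    ("climate intelligence can", 0.5317),
    ("climate intelligence", 0.5298),
    ("climate adaptation service", 0.5271),
    ("climate services information", 0.5254),
]
stop_phrases = ["clarity climate services", "climate change", "infrastructure"]

filtered = drop_stop_phrases(results, stop_phrases, top_n=5)
for phrase, score in filtered:
    print(f"{phrase} : {score}")
```

Because filtering shrinks the list, it helps to request more keyphrases than needed from `extract_keywords` and truncate afterwards. Single-word stop words can still be passed through `stop_words` as before.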
