Saving a trained model using pytorch and safetensor and then redownloading causes topics to be off #2198
Comments
You are using an older version of BERTopic, and I remember that there have been some fixes since then. Could you try it with the latest version (0.16.4) instead?
Got it, trying that now!
Just tried upgrading BERTopic to 0.16.4 and still see the same issue. [Screenshots not preserved: Initial Training; Inference/Transform without saving; Inference/Training after saving and redownloading using safetensors.] All topics (except outliers) come out one higher than in the original run or in the original model without saving it.
@MaartenGr I also just tried saving with pytorch and got the same issue.
Hmmm, this is quite unexpected. I'm a bit baffled here, considering these probabilities are extremely high. My guess would be that something is going wrong with reducing outliers before updating and then saving the model. What would happen if you didn't reduce outliers?
@MaartenGr Removing reduce_outliers fixes the issue, and I now get the same results between the initial training and the inference run after downloading. Is there a way to keep reduce_outliers, or is this a bug that would need to be fixed first?
@SkylarOconnell I'm not actually sure why this is happening. It could be that by reducing outliers so much, it distorts the newly created topic embeddings (
@MaartenGr Could you provide an example for this? I'm not really sure how to do that.
@SkylarOconnell Sure!

    # Track topic embeddings before reducing outliers
    topic_embeddings = topic_model.topic_embeddings_

    # Reduce outliers and update topics
    new_topics = self.model.reduce_outliers(
        self.docs,
        self.topics,
        probabilities=self.probabilities,
        strategy='probabilities'
    )
    self.model.update_topics(self.docs, topics=new_topics)

    # Reassign old topic embeddings
    topic_model.topic_embeddings_ = topic_embeddings

When doing this, check that the old topic embeddings are correctly assigned, as I'm not sure whether this creates a shallow or a deep copy.
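The shallow-versus-deep-copy concern above can be illustrated with a minimal sketch. `StandInModel` and `update_in_place` are hypothetical stand-ins, not BERTopic code, and the thread does not establish whether `update_topics` mutates the embeddings in place or rebinds the attribute. The sketch only shows that a plain assignment is an alias, while `copy.deepcopy` (which also works on NumPy arrays) yields an independent snapshot that is safe to reassign later.

```python
import copy


class StandInModel:
    """Hypothetical stand-in for a fitted topic model; not BERTopic itself."""

    def __init__(self):
        self.topic_embeddings_ = [[0.1, 0.2], [0.3, 0.4]]

    def update_in_place(self):
        # Assumption for illustration: the update mutates the existing object.
        self.topic_embeddings_[0][0] = 9.9


model = StandInModel()

alias = model.topic_embeddings_                    # same object, no copy made
snapshot = copy.deepcopy(model.topic_embeddings_)  # independent copy

model.update_in_place()

alias_changed = alias[0][0] == 9.9       # the alias sees the mutation
snapshot_intact = snapshot[0][0] == 0.1  # the deep copy does not

# Restore the values captured before the update.
model.topic_embeddings_ = snapshot
```

If the update rebinds the attribute instead of mutating it, the plain assignment would also preserve the old values, so taking a deep copy is simply the safer choice when the behavior is unknown.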
@MaartenGr Sorry for the delayed response. When I add the code above (changing topic_model to self.model, since we are using class variables), the original issue comes back. Could it be an issue/bug between reduce_outliers and pytorch/safetensors? reduce_outliers works and transform works until I save with those formats and redownload.
I'm not sure if I understand correctly. Just to make sure:
I will double-check the top bullet and let you know. If the topic_embeddings are the same as the old embeddings, I will run a quick count to see how many are off. I'll respond here once I am able to do so.
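A quick count like the one mentioned above could be sketched as follows. `compare_topics` is a hypothetical helper, not part of BERTopic; it assumes outliers are labelled -1 and that both lists cover the same documents in the same order. The sample values are the three topic pairs quoted in the bug description. Besides counting mismatches, it tallies the difference between the two assignments, which makes a consistent shift (such as every topic coming out one higher) easy to spot.

```python
from collections import Counter


def compare_topics(fit_topics, transform_topics, outlier=-1):
    """Count mismatched topic assignments and tally their offsets.

    Documents labelled as outliers in either run are skipped.
    """
    if len(fit_topics) != len(transform_topics):
        raise ValueError("Both runs must cover the same documents.")

    pairs = [
        (a, b)
        for a, b in zip(fit_topics, transform_topics)
        if a != outlier and b != outlier
    ]
    mismatches = sum(1 for a, b in pairs if a != b)
    offsets = Counter(b - a for a, b in pairs)
    return mismatches, offsets


# Topic pairs quoted in the bug description (fit vs. transform after reload).
fit_topics = [2, 1, 2]
transform_topics = [3, 2, 3]

mismatches, offsets = compare_topics(fit_topics, transform_topics)
print(f"{mismatches} mismatches, offsets: {dict(offsets)}")
```

A single dominant offset suggests shifted topic IDs rather than genuinely different assignments.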
Have you searched existing issues? 🔎
Describe the bug
After training, I tried saving the model using both pytorch and safetensors. When I re-download the model, load the files into BERTopic using BERTopic.load(), and run inference using transform(), all the topics come out differently from the original fit results. Below are some examples; in each pair, the first topic and probability are from the original training/fit of the model and the second are from running transform():
Topic: 2 Probability: 0.9999999985560923 vs. Topic: 3 Probability: 0.9999477863311768
Topic: 1 Probability: 0.9993163446248252 vs. Topic: 2 Probability: 0.04614641437377926
Topic: 2 Probability: 1.0 vs. Topic: 3 Probability: 0.9591490626335144
One thing to note is that running transform repeatedly gives the same results every time, but those results differ from the original training output. Also, when I run transform on the original model without saving it anywhere else, I get the same results as the original run. I was wondering whether I am missing something needed to save the model correctly. Below is the code I use to train, save, and run transform on the model. We also run reduce_outliers() before saving the model.
Reproduction
BERTopic Version
0.16.0