Duplicated messages in mam_message table #4316
Comments
I have not been able to find a reliable reproducing scenario, but I was able to reproduce the problem a few times by abruptly killing connections from the client side and creating new ones immediately after. This does not trigger the problem consistently, though. During these attempts I gathered as many traces and logs as I could to get clues about what's going on. Here is what I found that might be relevant:
Here is an example:
The added stanza id is not the same for all duplicated messages. For instance, the message of my above example got duplicated 98 times, but there are 34 different stanza ids in the duplicate messages, each being present from 2 to 9 times. From what I can see the same stanza id is NOT used for the duplicates of two different original messages.
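The tally described above (total copies, number of distinct stanza ids, and how often each one appears) can be reproduced with a short script. This is only a sketch: the (origin_id, stanza_id) pairs below are made up, and how they are extracted from a dump of the table is left out because the column layout is not shown in this thread. The function name `summarize_duplicates` is hypothetical.

```python
from collections import Counter, defaultdict


def summarize_duplicates(rows):
    """Summarize archived copies per original message.

    rows: iterable of (origin_id, stanza_id) pairs, one pair per archived copy.
    Returns (summary, shared), where summary maps each origin_id to its counts
    and shared is the set of stanza ids seen under more than one origin_id.
    """
    per_origin = defaultdict(Counter)
    for origin_id, stanza_id in rows:
        per_origin[origin_id][stanza_id] += 1

    summary = {}
    for origin_id, counts in per_origin.items():
        summary[origin_id] = {
            "copies": sum(counts.values()),
            "distinct_stanza_ids": len(counts),
            "min_per_stanza_id": min(counts.values()),
            "max_per_stanza_id": max(counts.values()),
        }

    # Check the observation that a stanza id is never reused across
    # the duplicates of two different original messages.
    owners = defaultdict(set)
    for origin_id, counts in per_origin.items():
        for stanza_id in counts:
            owners[stanza_id].add(origin_id)
    shared = {sid for sid, origins in owners.items() if len(origins) > 1}
    return summary, shared


# Made-up sample data, for illustration only.
rows = [
    ("o1", "s1"), ("o1", "s1"),
    ("o1", "s2"), ("o1", "s2"), ("o1", "s2"),
    ("o2", "s3"), ("o2", "s3"),
]
summary, shared = summarize_duplicates(rows)
print(summary["o1"])
print(shared)
```

An empty `shared` set corresponds to the observation that the same stanza id is not used for the duplicates of two different original messages.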
All of the above makes me think that MongooseIM keeps, somewhere in memory or in some transient storage, a queue of stanzas even after they were sent and acknowledged (as mentioned in my original post, this seems to occur only when stream management is enabled), and that these stanzas get routed as carbon copies to new clients of the same user. If they don't get acked by the new client, they get stored to the mam table again. This is of course only a rough idea of what I understand, in the hope that it may help someone with better knowledge of MongooseIM internals to understand the problem. Let me know if there is more information that I can provide or experiments that I can do (bear in mind that I still don't know how to reproduce consistently). In my latest example, I have server logs (debug level), CSV dump of the |
Hi, thank you for bringing this issue to our attention. We are currently investigating it and will keep you updated on our progress. |
Thanks for looking at it. Something happened to me yesterday that might help the investigation: I wanted to switch the storage format of the mam table to XML for a test instance, so I truncated the tables I wanted to share this because it gives more evidence for some of the guesses I made above: the problem happens when a client abruptly disappears, and the messages are copied from a transient source (they could not have been copied from the DB, since the table was empty). Hope this helps. |
Hi @rthouvenin , I'm looking into this issue now. I have some comments, and a few questions which may help me find the reason, as I'm trying to reproduce the scenario.
|
Thank you @gustawlippa for looking into this problem. I spent some more time trying to understand the reproducing scenario, below are my answers:
To reproduce and look at the ACK exchanges, here is what I did:
The web client I am using has the following behavior: when a user connects, it opens a session with the server and then:
The fact that the conversation has several messages and the fact that it has unread messages means that the client initialization takes longer, and I thought it would give me enough time to kill the client before the initialization is complete. I did notice that I was not able to reproduce when the conversation is read, but interestingly when I did manage to reproduce, the client initialization actually completed... So I'm not sure what to think... Anyway, I am attaching the output of Some things worth mentioning that I found:
Let me know if there is any more info I can provide! |
Thank you for such a detailed response @rthouvenin, it really helps. Your effort is greatly appreciated! I have to ask just to clarify, because I see that you use Stream Management, Chat Markers and Inbox. When you say that you mark messages as received, I assume you mean in the Chat Markers sense, that is sending a stanza with a In the testing that you've performed, does the client respond to the Stream Management request I've found an issue which can occur when reconnecting, and before a client managed to |
That's right, more exactly: sending a stanza with a Just in case you are not familiar with HAR files, this is notably the format used by browsers to export network logs. You can open it by going to the network tab of Chrome (I believe it works with other browsers) and find the button "Import HAR file". You can also open it with a text editor of your choice, this is just a plain JSON file, and look at the Thank you for the possible workaround, I will try it later this week to confirm that this is indeed the same issue. |
Hello again, something interesting happened to me just now. After reproducing the problem this morning as described above, some messages got duplicated a number of times, and the total number of messages in Unfortunately I had not foreseen this and did not capture any trace or log, but I thought this might give some more clues about the issue. |
Hi @rthouvenin, thanks for the additional info, and tips for the HAR files! |
@gustawlippa Thank you for working on this and pushing a fix. With the Could it be that another image was pushed with tag |
Hello @rthouvenin, you have used the correct Docker image. My fix didn't work, but we have a task to revisit this bug in our internal backlog. I believe the bug wasn't fixed because of how websocket connections are handled in MIM. I focused on the XMPP connection logic, and I thought this would be enough. I hope we will find time to fix it by the end of the year. |
Thank you for the update |
Hello @rthouvenin, |
Hi @gustawlippa, sure thing: mongooseim.toml.txt (I renamed it to .txt because GitHub would refuse it otherwise). This is the file I use for my local tests (in a Docker container, using the official image); the only changes I've made before uploading it here are to the JWT secret and the RDBMS connection settings. For transport, I use only websockets initiated over HTTP (port 5280). I hope this helps |
Hi, for now I'm not able to reproduce the issue by trying to emulate your client's behaviour, using the same set of Mongoose modules enabled. If you are still able to reproduce the issue, could you please provide debug logs collected while it happens? This would be of much help to us. To enable debug logs, please set the As an additional debugging step, which could be a bit more involved, but also very useful, we could use the tracing application included in MongooseIM. If you were willing, the trace can be captured by entering the Erlang shell by running As for fixes to the underlying issue, I can't yet see a reason why an unacked stanza would be resent by stream management, but maybe with more info we'll figure it out. So if you'd be able to provide either the logs, the trace, or both to us, we would be grateful. |
Hi, I meant to come back to you earlier but I've been quite busy with other tasks, sorry for that. I played the scenario that most often allowed me to reproduce, and on the side I watched the count of rows in mam_message, refreshed every second. When I saw the count increase abnormally, I stopped and dumped the trace. Here is the trace dump, which I uploaded to my personal server (N.B.: it is 2.5GB, I hope this is usable!): https://next.thouvenin.cloud/s/dKNMYYaR9kEbQRS Here are some approximate timestamps of when I abruptly interrupted the client connection and saw the count of mam_message rows increase (I am not sure about the first two):
Let me know if I can provide more info! |
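The monitoring approach described in the comment above (sampling the mam_message row count once per second and noting when it increases abnormally) can be sketched as follows. This is an illustration under stated assumptions, not the method actually used: the function name `find_jumps`, the `max_step` threshold, and the sample values are all hypothetical, and the samples would in practice come from running a row-count query against the database at a fixed interval.

```python
def find_jumps(samples, max_step):
    """Return (timestamp, delta) for each abnormal increase in row count.

    samples: list of (timestamp, row_count) pairs in chronological order.
    max_step: largest increase between two consecutive samples that is
              still considered normal traffic (an assumed threshold).
    """
    jumps = []
    for (_, prev_count), (cur_time, cur_count) in zip(samples, samples[1:]):
        delta = cur_count - prev_count
        if delta > max_step:
            jumps.append((cur_time, delta))
    return jumps


# Made-up samples: (seconds since start, row count).
samples = [(0, 100), (1, 101), (2, 150), (3, 151)]
print(find_jumps(samples, max_step=5))
```

Recording the timestamps of such jumps makes it easier to line them up with the moments when the client connection was killed and with the corresponding section of a large trace dump.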
MongooseIM version: 6.2.1
Installed from: Docker image
Erlang/OTP version: provided by the docker image
In case this is relevant: I am using JWT authentication
I am essentially facing the problem described in #4128.
On some occasions, all the messages of a conversation received by a user get duplicated in the database, in the mam_message table. This does not happen consistently (I have not been able to describe a reproducing scenario yet), and it happens only when stream management is enabled. I think it happens when one of the users requests the conversation history, which makes me think there is a bug triggered by a problem in delivering the history response that causes the history to be duplicated in the DB. When this happens for a user, it is likely to happen again soon, often resulting in dozens of copies of all received messages in the conversation. All the copies of a message (same origin_id) are identical except for the timestamp, which seems to be the time when the duplication occurred.
So for example, if a user received a single message in a conversation, and the bug occurs 3 times for that conversation, you would see something like the below in the mam_message table: