[Bug]: Loading collection stuck at 0% after server's sudden electricity lost #30501
Comments
@ThinkThinkSyn please export the log for investigation; see the export log script.
milvus.zip
After the Milvus server restarts, it will try to reload the collections. I want to know why the loading progress is stuck, so I need the logs since the Milvus service restarted.
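If it helps, here is a minimal sketch of grabbing only the post-restart logs, assuming a docker-compose standalone deployment whose container is named milvus-standalone (adjust the name and timestamps to your setup):

```bash
# Capture only the logs written since the service restarted
docker logs --since "2024-02-02T00:00:00" milvus-standalone > milvus-since-restart.log 2>&1

# Or grab the last few hours and compress them before attaching to the issue
docker logs --since 6h milvus-standalone 2>&1 | gzip > milvus.log.gz
```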
I have tried reloading all collections several times, and even waited for the load for over 10 hours, but it is still stuck at 0%. For the logs, I am not sure how to output the latest log; the log above was produced by simply calling
Update: I have output the latest log by calling
@ThinkThinkSyn I saw that Milvus fails to find the index for the collection law_original_content, and I did not find any logs about the load task. Could you please a) try to load the collection and b) then export the logs again?
@ThinkThinkSyn also, could you please attach the etcd backup for investigation? Check https://github.com/milvus-io/birdwatcher for details about how to back up etcd with birdwatcher.
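Roughly, the backup flow with birdwatcher looks like the sketch below; treat the address, rootPath, and commands here as assumptions for a default standalone deployment and follow the README linked above for the authoritative steps:

```bash
# Run the birdwatcher binary and back up etcd from its interactive shell.
# 127.0.0.1:2379 and by-dev are the usual defaults for a standalone deployment.
./birdwatcher
# inside the interactive shell:
#   connect --etcd 127.0.0.1:2379 --rootPath by-dev
#   backup
# the backup archive is written to the current working directory; attach it here.
```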
Yes, here is that collection with the latest log of the stuck load:
Update: the collection did eventually load, but after I release it, it gets stuck at 0% in loading again.
For 2, we have just restarted the container and now it loads successfully, but a new error came out. For 1, I am sorry, I am not familiar with the tools you mentioned; I will check it later.
@ThinkThinkSyn there are only 2 minutes of logs in the newly attached files, and the only message I can see about 447474321364366830 is a warning, as below. I agree that you should check the MQ service first and, if possible, attach the MQ logs; more logs would be a great help.
Sorry for the late reply. May I have some documentation references on how to check the message queue you mentioned? I can't seem to find any useful information about it.
Based on the logs above, it seems that you deployed a Milvus standalone instance, which uses the internal RocksMQ; for now, its status is maintained by Milvus itself.
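Since the embedded RocksMQ has no separate broker process to query, about the only checks available are the Milvus logs and the on-disk RocksMQ directory. A sketch, assuming a Docker standalone deployment and the default data path (verify the actual path in your milvus.yaml):

```bash
# Look for rocksmq-related warnings or errors in the Milvus logs
docker logs milvus-standalone 2>&1 | grep -iE "rocksmq|rmq" | tail -n 50

# Inspect the on-disk rocksmq data directory (hypothetical default path)
ls -lh /var/lib/milvus/rdb_data
```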
@weiliu1031 any ideas on how to get more helpful info?
+1. My Python workers exceeded the machine's RAM and milvus-standalone crashed. After restarting, loading the collection has been stuck at 0% for a long time with no progress. Logs (I've dropped the "skip msg" rows):
These logs help a lot! We found that after the Milvus instance restarts, reloading the collection gets stuck consuming data from RocksMQ while subscribing to the channel. Our guess is that after a sudden reboot of the physical machine, RocksDB may have left a file lock behind, blocking subsequent operations that read data from RocksDB. I will do some tests to figure it out.
/assign @weiliu1031
Thank you very much! I'll be waiting for any progress on this issue.
Hello! Is there any progress on this?
@smellthemoon can you think about how we can remove the lock automatically when the service starts? It shouldn't be very hard to do.
The easy fix is to clean the RDB directory next time you hit this.
What about the previous data?
May I ask if the problem of the collection load being stuck has been resolved?
I'm not sure if you are hitting the same issue, since no logs were offered and "loading stuck at 0%" might be triggered by different causes. But if you do hit the same issue, removing the RocksDB lock file should ideally solve the problem.
Okay, thank you. If I face the same issue again, I'll try this method.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Where do I remove the lock file when running the standalone version in k8s?
@xiaofan-luan Where do I remove the lock file when running the standalone version?
In the RocksMQ directory. Usually it's under /var/lib; check your milvus.yaml config.
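A sketch of the workaround discussed above, assuming the default standalone layout where the RocksMQ data lives under /var/lib/milvus; verify the actual path in your milvus.yaml before deleting anything:

```bash
# Confirm where rocksmq stores its data
grep -n -A2 "rocksmq" milvus.yaml

# Stop Milvus first so nothing is holding the database open
docker compose stop standalone        # or docker stop / kubectl scale, depending on deployment

# RocksDB keeps its file lock in a file literally named LOCK
find /var/lib/milvus -name LOCK
rm /var/lib/milvus/rdb_data/LOCK      # hypothetical default path; adjust to your install

docker compose start standalone
```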
I also have this issue, and it happened on a system restart. How can I avoid this happening in the future, and how do I resolve it?
/assign @yanliang567
This is important, and we have seen this happen multiple times.
Is there any news on this? I had to restart my server due to a power outage, and Milvus from Docker is having trouble starting (status unhealthy). I do not want to reindex the whole collection again.
We cannot reproduce this issue in-house. @m-ajer I suggest you manually stop the Milvus service before a power outage. Could you offer a log and your Milvus version if your service cannot start? We'd like to keep tracking this if there are more clues.
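For a planned shutdown, stopping the containers gracefully lets RocksMQ/RocksDB release its locks cleanly; a sketch assuming the official docker-compose standalone setup:

```bash
# Graceful stop before planned maintenance or power work
docker compose stop standalone
# or, for a plain "docker run" deployment:
docker stop milvus-standalone
```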
I retried it with the same result: milvus-standalone remains unhealthy despite increasing the wait time for the other services in the Docker configuration. Here are the logs and the docker-compose file: files.zip. Another factor that could be the cause of the error is that we are not using an SSD, but the collection indexed normally before, so we didn't think it would be an issue. The current volume size is around 110 GB.
@m-ajer the logs you attached are all INFO-level, which does not give enough information. Could you please set the Milvus log level to DEBUG and re-collect the logs?
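A sketch of switching the log level, assuming the milvus.yaml used by the container is mounted locally (the relevant key is log.level in the default config file):

```bash
# Find the current log level
grep -n -A3 "^log:" milvus.yaml
# change
#   log:
#     level: info
# to
#   log:
#     level: debug

# Restart so the new level takes effect, then re-collect the logs
docker compose restart standalone
docker logs --since 30m milvus-standalone > milvus-debug.log 2>&1
```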
Change the queryCoord.channelTaskTimeout config to 600000 and see if it works. We also recommend upgrading to the latest 2.4 release; this PR might solve the problem: #36617.
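A sketch of that tweak, again assuming a mounted milvus.yaml; 600000 is the value suggested above (the timeout is in milliseconds, to the best of my knowledge):

```bash
# Check whether the key is already present
grep -n "channelTaskTimeout" milvus.yaml
# set it under the queryCoord section:
#   queryCoord:
#     channelTaskTimeout: 600000

# Restart so the new config is picked up
docker compose restart standalone
```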
Is there an existing issue for this?
Environment
Current Behavior
After the server restarted from a sudden power loss, Milvus cannot load the collection into memory. I have tried to load it for over 12 hours, but it just stays stuck.
I am doing the load action in Attu, and it randomly shows one of two possible behaviors:
Error: 4 DEADLINE_EXCEEDED
and the loading process stops automatically. I have searched for similar issues and found #29484 facing the same problem. The difference seems to be that his stall was caused by milvus-backup, while I have never installed that.
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
I have tried to find problems in the log, but I could only find junk log lines like:
[2024/02/04 08:24:38.336 +00:00] [INFO] [msgstream/mq_msgstream.go:918] ["skip msg"] [msg="base:<msg_type:TimeTick timestamp:447290085443633154 sourceID:6 > "]
I have tried
grep "[ERROR]" <my docker container log file>
but the latest error happened on Jan 26 (while the power loss happened on Feb 2), which means there is no helpful error message.
Anything else?
No response