
[Bug]: Loading collection stuck at 0% after server's sudden power loss #30501

Closed · 1 task done
ThinkThinkSyn opened this issue Feb 4, 2024 · 43 comments

Labels: kind/bug (Issues or changes related to a bug), stale (indicates no updates for 30 days), triage/accepted (Indicates an issue or PR is ready to be actively worked on)

Comments

@ThinkThinkSyn

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.3.4
- Deployment mode(standalone or cluster): standalone(GPU)
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2): Attu 2.3.6
- OS(Ubuntu or CentOS): Ubuntu 22.04.3 LTS
- CPU/Memory: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz 64GB
- GPU: RTX 1060 6G

Current Behavior

After the server restarted following a sudden power loss, Milvus cannot load the collection into memory. I have tried to load it for over 12 hours but it just stays stuck.

I am doing the load action in Attu, and it randomly shows one of two behaviors:

  1. stuck at 0% forever
  2. fails with Error: 4 DEADLINE_EXCEEDED and the loading process stops automatically.

I have searched for similar issues and found #29484, which describes the same problem. The difference is that his stall seems to be caused by milvus-backup, which I have never installed.
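(Editor's note: for anyone debugging the same symptom, the load progress can also be polled from pymilvus instead of Attu. The following is a minimal sketch, assuming a default standalone deployment at localhost:19530; the collection name is only a placeholder.)

```python
# Hedged sketch: poll the loading progress of a collection via pymilvus.
# Connection parameters and collection name are assumptions, not from this issue.
import time
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")

collection_name = "law_original_content"   # placeholder; use your own collection
for _ in range(60):                        # poll for up to ~10 minutes
    progress = utility.loading_progress(collection_name)
    print(progress)                        # e.g. {'loading_progress': '0%'}
    time.sleep(10)
```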

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

I have tried to find the problem in the logs, but I can only see noise entries such as: [2024/02/04 08:24:38.336 +00:00] [INFO] [msgstream/mq_msgstream.go:918] ["skip msg"] [msg="base:<msg_type:TimeTick timestamp:447290085443633154 sourceID:6 > "]

I have tried grep "[ERROR]" <my docker container log file>, but the latest error happened on Jan 26 (while the power loss happened on Feb 2), which means there is no helpful error message.
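(Editor's note: to make log attachments easier to review, the "skip msg" noise can be stripped before sharing. A minimal filter sketch, assuming the input is a plain-text docker log export; the file names are placeholders.)

```python
# Hedged sketch: keep only ERROR/WARN lines and drop the "skip msg" noise
# before attaching a log. File names are placeholders.
INPUT = "milvus.log"          # e.g. produced by `docker-compose logs > milvus.log`
OUTPUT = "error_warn.log"

with open(INPUT, encoding="utf-8", errors="replace") as src, \
        open(OUTPUT, "w", encoding="utf-8") as dst:
    for line in src:
        if "skip msg" in line:
            continue                      # drop the TimeTick chatter
        if "[ERROR]" in line or "[WARN]" in line:
            dst.write(line)
```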

Anything else?

No response

@ThinkThinkSyn added the kind/bug and needs-triage labels on Feb 4, 2024
@weiliu1031 (Contributor)

@ThinkThinkSyn please export the logs for investigation: export log script

@ThinkThinkSyn (Author)

> @ThinkThinkSyn please export the logs for investigation: export log script

milvus.zip
Sorry for the late reply. This is the docker log.
Please note that since the log is too large, this is only the portion from Feb 1 until it died.

@weiliu1031 (Contributor)

> @ThinkThinkSyn please export the logs for investigation: export log script
>
> milvus.zip Sorry for the late reply. This is the docker log. Please note that since the log is too large, this is only the portion from Feb 1 until it died.

After the milvus server restarts, it will try to reload the collections. I want to know why the loading progress gets stuck, so I need the logs from after the milvus service restarted.

@ThinkThinkSyn (Author)

> @ThinkThinkSyn please export the logs for investigation: export log script
>
> milvus.zip Sorry for the late reply. This is the docker log. Please note that since the log is too large, this is only the portion from Feb 1 until it died.
>
> After the milvus server restarts, it will try to reload the collections. I want to know why the loading progress gets stuck, so I need the logs from after the milvus service restarted.

I have tried reloading all collections several times, and even waited over 10 hours, but loading is still stuck at 0%.

As for the logs, I am not sure how to export the latest ones. The log above was produced by simply running docker-compose logs > milvus.log. As mentioned at the beginning, I have looked into the latest rolling log and it is filled with plenty of useless "skip msg" entries like this:
[screenshot of repeated "skip msg" log entries]

@ThinkThinkSyn (Author)

Update: I have exported the latest log by passing --tail=....
Since it is too big (almost entirely filled with the useless "skip msg" entries), I have grepped out the "ERROR"/"WARN" lines. Please have a check.
error_logs.txt

@yanliang567 (Contributor)

@ThinkThinkSyn I see that milvus fails to find the index for the collection law_original_content, and I did not find any logs about a load task. Could you please a) try to load the collection and b) then export the logs again.
milvus-standalone | [2024/02/05 08:44:02.278 +00:00] [WARN] [proxy/task_scheduler.go:458] ["Failed to execute task: "] [traceID=45c066b7f89ff71ae95accff4aff2ddc] [error="index not found[collection=law_original_content]"]
/assign @ThinkThinkSyn
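(Editor's note: a minimal pymilvus sketch of "check the index, recreate it if missing, then load" could look like the following; the connection parameters, the vector field name, and the index params are assumptions, not taken from this thread.)

```python
# Hedged sketch: inspect and (re)create an index on the collection that logged
# "index not found", then trigger a load. Field name and index params are assumptions.
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")   # assumed standalone defaults
coll = Collection("law_original_content")

print(coll.indexes)              # list whatever indexes currently exist
if not coll.has_index():
    coll.create_index(
        field_name="embedding",  # hypothetical vector field name
        index_params={"index_type": "IVF_FLAT", "metric_type": "L2",
                      "params": {"nlist": 1024}},
    )
coll.load()                      # reload once the index exists
```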

@yanliang567 (Contributor)

@ThinkThinkSyn also, could you please attach the etcd backup for investigation? See https://github.com/milvus-io/birdwatcher for details about how to back up etcd with birdwatcher.

@ThinkThinkSyn (Author)

> @ThinkThinkSyn I see that milvus fails to find the index for the collection law_original_content, and I did not find any logs about a load task. Could you please a) try to load the collection and b) then export the logs again. milvus-standalone | [2024/02/05 08:44:02.278 +00:00] [WARN] [proxy/task_scheduler.go:458] ["Failed to execute task: "] [traceID=45c066b7f89ff71ae95accff4aff2ddc] [error="index not found[collection=law_original_content]"] /assign @ThinkThinkSyn

Yes, that collection law_original_content's index was deleted yesterday because I wanted to test whether Milvus would become usable again after the index was recreated. The other collections' indexes were not deleted, but they still cannot be loaded. The below shows an example:

[screenshot of a collection stuck while loading]

with the latest log of this stuck attempt:
latest-50000-erros.log

@ThinkThinkSyn (Author)

Update: the collection law_full_content_summarized_embedding above has just loaded successfully once, for an unknown reason.

But after I released it, it got stuck at 0% while loading again.

@ThinkThinkSyn (Author)

And by the way, is this flood of messages normal?

[screenshot of repeated "skip msg" log entries]

They are printed out constantly, which makes the log history really big and hard to review.

@xiaofan-luan (Collaborator)

  1. Did you check your message queue state? Use kafkacli or pulsarctl to check whether Kafka is working.
  2. Did you try to restart the cluster?

@ThinkThinkSyn (Author)

> 1. Did you check your message queue state? Use kafkacli or pulsarctl to check whether Kafka is working.
> 2. Did you try to restart the cluster?

For 2, we just restarted the container and now it loads successfully, but a new error comes up when viewing the data: failed to search/query delegator 11 for channel by-dev-rootcoord-dml_4_447474321364366830v0: fail to Search, QueryNode ID=11, reason=Timestamp lag too large.

For 1, sorry, I am not familiar with the tools you mentioned. I will check them later.
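(Editor's note: one hedged way to probe the "Timestamp lag too large" error is to search with a relaxed consistency level so the request does not wait on the lagging guarantee timestamp. This is a diagnostic sketch under assumptions, not a fix confirmed in this thread; the field name and vector dimension are placeholders.)

```python
# Hedged diagnostic sketch: run a search with a relaxed consistency level so the
# request does not block on the lagging timestamp. Names/dimensions are placeholders.
import random
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
coll = Collection("law_full_content_summarized_embedding")   # example name from this thread

query_vec = [[random.random() for _ in range(768)]]          # assumed 768-dim embeddings
results = coll.search(
    data=query_vec,
    anns_field="embedding",                  # hypothetical vector field name
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=5,
    consistency_level="Eventually",          # avoid waiting on the timestamp lag
)
print(results)
```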

@yanliang567 (Contributor)

@ThinkThinkSyn there are only 2 minutes of logs in the newly attached file, and the only message I can see about 447474321364366830 is the warning below. I agree that you should check the MQ service first and, if possible, attach the MQ logs; more logs would be a great help.
milvus-standalone | [2024/02/05 13:02:37.678 +00:00] [WARN] [task/executor.go:397] ["failed to subscribe channel"] [taskID=1706979178640] [collectionID=447059354952351200] [replicaID=447474321491492875] [channel=by-dev-rootcoord-dml_11_447059354952351200v0] [node=9] [source=channel_checker] [error="the other same channel is unsubscribing: channel reduplicates[channel=by-dev-rootcoord-dml_11_447059354952351200v0]"]

@yanliang567 added the triage/needs-information label and removed needs-triage on Feb 6, 2024
@ThinkThinkSyn (Author)

> @ThinkThinkSyn there are only 2 minutes of logs in the newly attached file, and the only message I can see about 447474321364366830 is the warning below. I agree that you should check the MQ service first and, if possible, attach the MQ logs; more logs would be a great help. milvus-standalone | [2024/02/05 13:02:37.678 +00:00] [WARN] [task/executor.go:397] ["failed to subscribe channel"] [taskID=1706979178640] [collectionID=447059354952351200] [replicaID=447474321491492875] [channel=by-dev-rootcoord-dml_11_447059354952351200v0] [node=9] [source=channel_checker] [error="the other same channel is unsubscribing: channel reduplicates[channel=by-dev-rootcoord-dml_11_447059354952351200v0]"]

Sorry for the late reply. May I have some documentation references on how to check the message queue you mentioned? I can't seem to find any useful information about it.

@weiliu1031 (Contributor)

> @ThinkThinkSyn there are only 2 minutes of logs in the newly attached file, and the only message I can see about 447474321364366830 is the warning below. I agree that you should check the MQ service first and, if possible, attach the MQ logs; more logs would be a great help. milvus-standalone | [2024/02/05 13:02:37.678 +00:00] [WARN] [task/executor.go:397] ["failed to subscribe channel"] [taskID=1706979178640] [collectionID=447059354952351200] [replicaID=447474321491492875] [channel=by-dev-rootcoord-dml_11_447059354952351200v0] [node=9] [source=channel_checker] [error="the other same channel is unsubscribing: channel reduplicates[channel=by-dev-rootcoord-dml_11_447059354952351200v0]"]
>
> Sorry for the late reply. May I have some documentation references on how to check the message queue you mentioned? I can't seem to find any useful information about it.

Based on the logs above, it seems you are running a Milvus standalone instance, which uses the internal rocksmq; for now, its status is maintained by Milvus itself.

@yanliang567 (Contributor)

@weiliu1031 any ideas on how to get more helpful info?

@Borntowarn

+1. My Python workers exceeded the machine's RAM and Milvus standalone crashed. After restarting, loading the collection has been stuck at 0% for a long time with no progress. Logs (I've dropped the "skip msg" rows):
milvus.log

@weiliu1031 (Contributor)

> +1. My Python workers exceeded the machine's RAM and Milvus standalone crashed. After restarting, loading the collection has been stuck at 0% for a long time with no progress. Logs (I've dropped the "skip msg" rows): milvus.log

These logs help a lot! We found that after the Milvus instance restarts, reloading the collection gets stuck consuming data from rocksmq when subscribing to the channel.

Our guess is that after a sudden reboot of the physical machine, RocksDB may have left a file lock behind, blocking subsequent operations from reading data from RocksDB. I will do some tests to figure it out.

@yanliang567 (Contributor)

/assign @weiliu1031
/unassign

@yanliang567 added the triage/accepted label and removed triage/needs-information on Feb 28, 2024
@Borntowarn

> +1. My Python workers exceeded the machine's RAM and Milvus standalone crashed. After restarting, loading the collection has been stuck at 0% for a long time with no progress. Logs (I've dropped the "skip msg" rows): milvus.log
>
> These logs help a lot! We found that after the Milvus instance restarts, reloading the collection gets stuck consuming data from rocksmq when subscribing to the channel.
>
> Our guess is that after a sudden reboot of the physical machine, RocksDB may have left a file lock behind, blocking subsequent operations from reading data from RocksDB. I will do some tests to figure it out.

Thank you very much! I'll be waiting for any progress on this issue.

@Borntowarn

Hello! Is there any progress on this?

@xiaofan-luan (Collaborator)

> Hello! Is there any progress on this?

  1. Do you have enough memory now?
  2. We need a detailed log to judge whether this is the same issue. If it is, removing the file lock on the RocksDB directory should solve it.

@xiaofan-luan (Collaborator)

@smellthemoon can you think about how we can remove the lock automatically when the service starts? It shouldn't be very hard to do.

@Borntowarn

> Hello! Is there any progress on this?
>
> 1. Do you have enough memory now?
> 2. We need a detailed log to judge whether this is the same issue. If it is, removing the file lock on the RocksDB directory should solve it.

  1. Yes, I do. But I wouldn't want to hit this problem again.
  2. I can't provide a new log right now because this problem occurs rarely; I only have the previous one.

@xiaofan-luan (Collaborator)

The easy fix is to clean the RDB directory the next time you hit this.
We will fix it soon.

@AI-Candy-Yang

> The easy fix is to clean the RDB directory the next time you hit this. We will fix it soon.

What about the previous data?

@AI-Candy-Yang

> Hello! Is there any progress on this?
>
> 1. Do you have enough memory now?
> 2. We need a detailed log to judge whether this is the same issue. If it is, removing the file lock on the RocksDB directory should solve it.
>
> 1. Yes, I do. But I wouldn't want to hit this problem again.
> 2. I can't provide a new log right now because this problem occurs rarely; I only have the previous one.

May I ask whether the problem of the collection load being stuck has been resolved?

@xiaofan-luan (Collaborator)

> Hello! Is there any progress on this?
>
> 1. Do you have enough memory now?
> 2. We need a detailed log to judge whether this is the same issue. If it is, removing the file lock on the RocksDB directory should solve it.
>
> 1. Yes, I do. But I wouldn't want to hit this problem again.
> 2. I can't provide a new log right now because this problem occurs rarely; I only have the previous one.
>
> May I ask whether the problem of the collection load being stuck has been resolved?

I'm not sure you are hitting the same issue, since no logs were provided and "loading stuck at 0%" can be triggered for different reasons.

But if you hit the same issue, removing the RocksDB lock file should ideally solve the problem.

@Borntowarn

> But if you hit the same issue, removing the RocksDB lock file should ideally solve the problem.

Okay, thank you. If I face the same issue again I'll try this method.


stale bot commented May 3, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

The stale bot added the stale label on May 3, 2024
The stale bot closed this as completed on May 12, 2024
@junjie-landing

> But if you hit the same issue, removing the RocksDB lock file should ideally solve the problem.

Where do I remove the lock file when running the standalone version in k8s?

@zrg-team

@xiaofan-luan Where do I remove the lock file when running the standalone version?

@xiaofan-luan (Collaborator)

The rocksmq dir; usually it's under /var/lib. Check your milvus.yaml config.
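(Editor's note: if a stale RocksDB LOCK file is indeed the culprit, a cleanup along the lines of the sketch below could be run while Milvus is stopped. The rocksmq path shown is an assumption, a common default for standalone deployments; verify it against your own milvus.yaml before deleting anything.)

```python
# Hedged cleanup sketch: remove stale RocksDB LOCK files under the rocksmq data
# directory. Run only while the Milvus container/service is stopped; back up first.
from pathlib import Path

ROCKSMQ_DIR = Path("/var/lib/milvus/rdb_data")   # assumed default path; verify in milvus.yaml

for lock_file in ROCKSMQ_DIR.rglob("LOCK"):
    print(f"Removing stale lock file: {lock_file}")
    lock_file.unlink()
```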

@pavluka6

pavluka6 commented Sep 2, 2024

I also have this issue, and it happened on a system restart. How can I avoid this happening in the future, and how do I resolve it?

@xiaofan-luan (Collaborator)

/assign @yanliang567
please help reproduce this issue

@xiaofan-luan (Collaborator)

This is important, and we have seen it happen multiple times.

@m-ajer

m-ajer commented Dec 19, 2024

Is there any news on this? I had to restart my server due to a power outage, and Milvus from Docker is having trouble starting (status unhealthy). I do not want to reindex the whole collection again.

@yanliang567 (Contributor)

We cannot reproduce this issue in-house. @m-ajer I suggest you manually stop the Milvus service before the power outage.

@xiaofan-luan (Collaborator)

> Is there any news on this? I had to restart my server due to a power outage, and Milvus from Docker is having trouble starting (status unhealthy). I do not want to reindex the whole collection again.

Could you offer a log and your Milvus version if your service cannot start?
Upgrading to the latest Milvus might help.

@xiaofan-luan (Collaborator)

We'd like to keep tracking this in case there are more clues.

@m-ajer

m-ajer commented Jan 3, 2025

I retried it with the same result: milvus-standalone remains unhealthy despite increasing the wait time for the other containers in the Docker configuration. Here are the logs and the docker-compose file: files.zip. Another factor that could be at play is that we are not using an SSD, but the collection indexed normally before, so we didn't think it would be an issue. The current volume size is around 110 GB.

@yanliang567 (Contributor)

@m-ajer the logs you attached are all INFO logs, which do not contain enough information. Could you please set the Milvus log level to DEBUG and recollect the logs?

@xiaofan-luan (Collaborator)

Change the queryCoord.channelTaskTimeout config to 600000 and see if it works.

We recommend you upgrade to 2.4-latest; this PR might solve the problem: #36617
