
[Bug]: Loading collection stuck at 0% after server's sudden power loss #30501

Closed · 1 task done
ThinkThinkSyn opened this issue Feb 4, 2024 · 43 comments

Labels: kind/bug (Issues or changes related to a bug), stale (indicates no updates for 30 days), triage/accepted (Indicates an issue or PR is ready to be actively worked on)

Comments

@ThinkThinkSyn

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.3.4
- Deployment mode(standalone or cluster): standalone(GPU)
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2): Attu 2.3.6
- OS(Ubuntu or CentOS): Ubuntu 22.04.3 LTS
- CPU/Memory: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz 64GB
- GPU: RTX 1060 6G

Current Behavior

After the server restarted following a sudden power loss, Milvus cannot load the collection into memory. I have tried to load it for over 12 hours but it just stays stuck.

I am doing the load action in Attu, and it randomly shows one of two behaviors:

  1. stuck at 0% forever
  2. fails with Error: 4 DEADLINE_EXCEEDED and the loading process stops automatically.

I have searched for similar issues and found #29484, which describes the same problem. The difference is that his stall seems to be caused by milvus-backup, which I have never installed.
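(Editor's note: for anyone debugging the same symptom, the load progress can also be polled from pymilvus instead of Attu. The following is a minimal sketch, assuming a default standalone deployment at localhost:19530; the collection name is only a placeholder.)

```python
# Hedged sketch: poll the loading progress of a collection via pymilvus.
# Connection parameters and collection name are assumptions, not from this issue.
import time
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")

collection_name = "law_original_content"   # placeholder; use your own collection
for _ in range(60):                        # poll for up to ~10 minutes
    progress = utility.loading_progress(collection_name)
    print(progress)                        # e.g. {'loading_progress': '0%'}
    time.sleep(10)
```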

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

I have tried to find the problem in the logs, but I can only see noise entries such as: [2024/02/04 08:24:38.336 +00:00] [INFO] [msgstream/mq_msgstream.go:918] ["skip msg"] [msg="base:<msg_type:TimeTick timestamp:447290085443633154 sourceID:6 > "]

I have tried grep "[ERROR]" <my docker container log file>, but the latest error happened on Jan 26 (while the power loss happened on Feb 2), which means there is no helpful error message.
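(Editor's note: to make log attachments easier to review, the "skip msg" noise can be stripped before sharing. A minimal filter sketch, assuming the input is a plain-text docker log export; the file names are placeholders.)

```python
# Hedged sketch: keep only ERROR/WARN lines and drop the "skip msg" noise
# before attaching a log. File names are placeholders.
INPUT = "milvus.log"          # e.g. produced by `docker-compose logs > milvus.log`
OUTPUT = "error_warn.log"

with open(INPUT, encoding="utf-8", errors="replace") as src, \
        open(OUTPUT, "w", encoding="utf-8") as dst:
    for line in src:
        if "skip msg" in line:
            continue                      # drop the TimeTick chatter
        if "[ERROR]" in line or "[WARN]" in line:
            dst.write(line)
```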

Anything else?

No response

@ThinkThinkSyn added the kind/bug and needs-triage labels on Feb 4, 2024
@weiliu1031 (Contributor)

@ThinkThinkSyn please export the logs for investigation: export log script

@ThinkThinkSyn (Author)

> @ThinkThinkSyn please export the logs for investigation: export log script

milvus.zip
Sorry for the late reply. This is the docker log.
Please note that since the log is too large, this is only the portion from Feb 1 until it died.

@weiliu1031 (Contributor)

> @ThinkThinkSyn please export the logs for investigation: export log script
>
> milvus.zip Sorry for the late reply. This is the docker log. Please note that since the log is too large, this is only the portion from Feb 1 until it died.

After the milvus server restarts, it will try to reload the collections. I want to know why the loading progress gets stuck, so I need the logs from after the milvus service restarted.

@ThinkThinkSyn (Author)

> @ThinkThinkSyn please export the logs for investigation: export log script
>
> milvus.zip Sorry for the late reply. This is the docker log. Please note that since the log is too large, this is only the portion from Feb 1 until it died.
>
> After the milvus server restarts, it will try to reload the collections. I want to know why the loading progress gets stuck, so I need the logs from after the milvus service restarted.

I have tried reloading all collections several times, and even waited over 10 hours, but loading is still stuck at 0%.

As for the logs, I am not sure how to export the latest ones. The log above was produced by simply running docker-compose logs > milvus.log. As mentioned at the beginning, I have looked into the latest rolling log and it is filled with plenty of useless "skip msg" entries like this:
[screenshot of repeated "skip msg" log entries]

@ThinkThinkSyn (Author)

Update: I have exported the latest log by passing --tail=....
Since it is too big (almost entirely filled with the useless "skip msg" entries), I have grepped out the "ERROR"/"WARN" lines. Please have a check.
error_logs.txt

@yanliang567 (Contributor)

@ThinkThinkSyn I see that milvus fails to find the index for the collection law_original_content, and I did not find any logs about a load task. Could you please a) try to load the collection and b) then export the logs again.
milvus-standalone | [2024/02/05 08:44:02.278 +00:00] [WARN] [proxy/task_scheduler.go:458] ["Failed to execute task: "] [traceID=45c066b7f89ff71ae95accff4aff2ddc] [error="index not found[collection=law_original_content]"]
/assign @ThinkThinkSyn
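(Editor's note: a minimal pymilvus sketch of "check the index, recreate it if missing, then load" could look like the following; the connection parameters, the vector field name, and the index params are assumptions, not taken from this thread.)

```python
# Hedged sketch: inspect and (re)create an index on the collection that logged
# "index not found", then trigger a load. Field name and index params are assumptions.
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")   # assumed standalone defaults
coll = Collection("law_original_content")

print(coll.indexes)              # list whatever indexes currently exist
if not coll.has_index():
    coll.create_index(
        field_name="embedding",  # hypothetical vector field name
        index_params={"index_type": "IVF_FLAT", "metric_type": "L2",
                      "params": {"nlist": 1024}},
    )
coll.load()                      # reload once the index exists
```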

@yanliang567 (Contributor)

@ThinkThinkSyn also, could you please attach the etcd backup for investigation? See https://github.com/milvus-io/birdwatcher for details about how to back up etcd with birdwatcher.

@ThinkThinkSyn (Author)

> @ThinkThinkSyn I see that milvus fails to find the index for the collection law_original_content, and I did not find any logs about a load task. Could you please a) try to load the collection and b) then export the logs again. milvus-standalone | [2024/02/05 08:44:02.278 +00:00] [WARN] [proxy/task_scheduler.go:458] ["Failed to execute task: "] [traceID=45c066b7f89ff71ae95accff4aff2ddc] [error="index not found[collection=law_original_content]"] /assign @ThinkThinkSyn

Yes, that collection law_original_content's index was deleted yesterday because I wanted to test whether Milvus would become usable again after the index was recreated. The other collections' indexes were not deleted, but they still cannot be loaded. The below shows an example:

[screenshot of a collection stuck while loading]

with the latest log of this stuck attempt:
latest-50000-erros.log

@ThinkThinkSyn (Author)

Update: the collection law_full_content_summarized_embedding above has just loaded successfully once, for an unknown reason.

But after I released it, it got stuck at 0% while loading again.

@ThinkThinkSyn (Author)

And by the way, is this flood of messages normal?

[screenshot of repeated "skip msg" log entries]

They are printed out constantly, which makes the log history really big and hard to review.

@xiaofan-luan (Collaborator)

  1. Did you check your message queue state? Use kafkacli or pulsarctl to check whether Kafka is working.
  2. Did you try to restart the cluster?

@ThinkThinkSyn (Author)

> 1. Did you check your message queue state? Use kafkacli or pulsarctl to check whether Kafka is working.
> 2. Did you try to restart the cluster?

For 2, we just restarted the container and now it loads successfully, but a new error comes up when viewing the data: failed to search/query delegator 11 for channel by-dev-rootcoord-dml_4_447474321364366830v0: fail to Search, QueryNode ID=11, reason=Timestamp lag too large.

For 1, sorry, I am not familiar with the tools you mentioned. I will check them later.
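(Editor's note: one hedged way to probe the "Timestamp lag too large" error is to search with a relaxed consistency level so the request does not wait on the lagging guarantee timestamp. This is a diagnostic sketch under assumptions, not a fix confirmed in this thread; the field name and vector dimension are placeholders.)

```python
# Hedged diagnostic sketch: run a search with a relaxed consistency level so the
# request does not block on the lagging timestamp. Names/dimensions are placeholders.
import random
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
coll = Collection("law_full_content_summarized_embedding")   # example name from this thread

query_vec = [[random.random() for _ in range(768)]]          # assumed 768-dim embeddings
results = coll.search(
    data=query_vec,
    anns_field="embedding",                  # hypothetical vector field name
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=5,
    consistency_level="Eventually",          # avoid waiting on the timestamp lag
)
print(results)
```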

@yanliang567 (Contributor)

@ThinkThinkSyn there are only 2 minutes of logs in the newly attached file, and the only message I can see about 447474321364366830 is the warning below. I agree that you should check the MQ service first and, if possible, attach the MQ logs; more logs would be a great help.
milvus-standalone | [2024/02/05 13:02:37.678 +00:00] [WARN] [task/executor.go:397] ["failed to subscribe channel"] [taskID=1706979178640] [collectionID=447059354952351200] [replicaID=447474321491492875] [channel=by-dev-rootcoord-dml_11_447059354952351200v0] [node=9] [source=channel_checker] [error="the other same channel is unsubscribing: channel reduplicates[channel=by-dev-rootcoord-dml_11_447059354952351200v0]"]

@yanliang567 added the triage/needs-information label and removed needs-triage on Feb 6, 2024
@ThinkThinkSyn (Author)

> @ThinkThinkSyn there are only 2 minutes of logs in the newly attached file, and the only message I can see about 447474321364366830 is the warning below. I agree that you should check the MQ service first and, if possible, attach the MQ logs; more logs would be a great help. milvus-standalone | [2024/02/05 13:02:37.678 +00:00] [WARN] [task/executor.go:397] ["failed to subscribe channel"] [taskID=1706979178640] [collectionID=447059354952351200] [replicaID=447474321491492875] [channel=by-dev-rootcoord-dml_11_447059354952351200v0] [node=9] [source=channel_checker] [error="the other same channel is unsubscribing: channel reduplicates[channel=by-dev-rootcoord-dml_11_447059354952351200v0]"]

Sorry for the late reply. May I have some documentation references on how to check the message queue you mentioned? I can't seem to find any useful information about it.

@weiliu1031 (Contributor)

> @ThinkThinkSyn there are only 2 minutes of logs in the newly attached file, and the only message I can see about 447474321364366830 is the warning below. I agree that you should check the MQ service first and, if possible, attach the MQ logs; more logs would be a great help. milvus-standalone | [2024/02/05 13:02:37.678 +00:00] [WARN] [task/executor.go:397] ["failed to subscribe channel"] [taskID=1706979178640] [collectionID=447059354952351200] [replicaID=447474321491492875] [channel=by-dev-rootcoord-dml_11_447059354952351200v0] [node=9] [source=channel_checker] [error="the other same channel is unsubscribing: channel reduplicates[channel=by-dev-rootcoord-dml_11_447059354952351200v0]"]
>
> Sorry for the late reply. May I have some documentation references on how to check the message queue you mentioned? I can't seem to find any useful information about it.

Based on the logs above, it seems you are running a Milvus standalone instance, which uses the internal rocksmq; for now, its status is maintained by Milvus itself.

@yanliang567 (Contributor)

@weiliu1031 any ideas on how to get more helpful info?

@Borntowarn

+1. My Python workers exceeded the machine's RAM and Milvus standalone crashed. After restarting, loading the collection has been stuck at 0% for a long time with no progress. Logs (I've dropped the "skip msg" rows):
milvus.log

@weiliu1031 (Contributor)

> +1. My Python workers exceeded the machine's RAM and Milvus standalone crashed. After restarting, loading the collection has been stuck at 0% for a long time with no progress. Logs (I've dropped the "skip msg" rows): milvus.log

These logs help a lot! We found that after the Milvus instance restarts, reloading the collection gets stuck consuming data from rocksmq when subscribing to the channel.

Our guess is that after a sudden reboot of the physical machine, RocksDB may have left a file lock behind, blocking subsequent operations from reading data from RocksDB. I will do some tests to figure it out.

@yanliang567 (Contributor)

/assign @weiliu1031
/unassign

@yanliang567 added the triage/accepted label and removed triage/needs-information on Feb 28, 2024
@Borntowarn

> +1. My Python workers exceeded the machine's RAM and Milvus standalone crashed. After restarting, loading the collection has been stuck at 0% for a long time with no progress. Logs (I've dropped the "skip msg" rows): milvus.log
>
> These logs help a lot! We found that after the Milvus instance restarts, reloading the collection gets stuck consuming data from rocksmq when subscribing to the channel.
>
> Our guess is that after a sudden reboot of the physical machine, RocksDB may have left a file lock behind, blocking subsequent operations from reading data from RocksDB. I will do some tests to figure it out.

Thank you very much! I'll be waiting for any progress on this issue.

@Borntowarn

Hello! Is there any progress on this?

@xiaofan-luan (Collaborator)

> Hello! Is there any progress on this?

  1. Do you have enough memory now?
  2. We need a detailed log to judge whether this is the same issue. If it is, removing the file lock on the RocksDB directory should solve it.

@xiaofan-luan (Collaborator)

@smellthemoon can you think about how we can remove the lock automatically when the service starts? It shouldn't be very hard to do.

@Borntowarn

> Hello! Is there any progress on this?
>
> 1. Do you have enough memory now?
> 2. We need a detailed log to judge whether this is the same issue. If it is, removing the file lock on the RocksDB directory should solve it.

  1. Yes, I do. But I wouldn't want to hit this problem again.
  2. I can't provide a new log right now because this problem occurs rarely; I only have the previous one.

@xiaofan-luan (Collaborator)

The easy fix is to clean the RDB directory the next time you hit this.
We will fix it soon.

@AI-Candy-Yang

> The easy fix is to clean the RDB directory the next time you hit this. We will fix it soon.

What about the previous data?

@AI-Candy-Yang

> Hello! Is there any progress on this?
>
> 1. Do you have enough memory now?
> 2. We need a detailed log to judge whether this is the same issue. If it is, removing the file lock on the RocksDB directory should solve it.
>
> 1. Yes, I do. But I wouldn't want to hit this problem again.
> 2. I can't provide a new log right now because this problem occurs rarely; I only have the previous one.

May I ask whether the problem of the collection load being stuck has been resolved?

@xiaofan-luan (Collaborator)

> Hello! Is there any progress on this?
>
> 1. Do you have enough memory now?
> 2. We need a detailed log to judge whether this is the same issue. If it is, removing the file lock on the RocksDB directory should solve it.
>
> 1. Yes, I do. But I wouldn't want to hit this problem again.
> 2. I can't provide a new log right now because this problem occurs rarely; I only have the previous one.
>
> May I ask whether the problem of the collection load being stuck has been resolved?

I'm not sure you are hitting the same issue, since no logs were provided and "loading stuck at 0%" can be triggered for different reasons.

But if you hit the same issue, removing the RocksDB lock file should ideally solve the problem.

@Borntowarn

> But if you hit the same issue, removing the RocksDB lock file should ideally solve the problem.

Okay, thank you. If I face the same issue again I'll try this method.


stale bot commented May 3, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

The stale bot added the stale label on May 3, 2024
The stale bot closed this as completed on May 12, 2024
@junjie-landing

> But if you hit the same issue, removing the RocksDB lock file should ideally solve the problem.

Where do I remove the lock file when running the standalone version in k8s?

@zrg-team

@xiaofan-luan Where do I remove the lock file when running the standalone version?

@xiaofan-luan (Collaborator)

The rocksmq dir; usually it's under /var/lib. Check your milvus.yaml config.
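(Editor's note: if a stale RocksDB LOCK file is indeed the culprit, a cleanup along the lines of the sketch below could be run while Milvus is stopped. The rocksmq path shown is an assumption, a common default for standalone deployments; verify it against your own milvus.yaml before deleting anything.)

```python
# Hedged cleanup sketch: remove stale RocksDB LOCK files under the rocksmq data
# directory. Run only while the Milvus container/service is stopped; back up first.
from pathlib import Path

ROCKSMQ_DIR = Path("/var/lib/milvus/rdb_data")   # assumed default path; verify in milvus.yaml

for lock_file in ROCKSMQ_DIR.rglob("LOCK"):
    print(f"Removing stale lock file: {lock_file}")
    lock_file.unlink()
```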

@pavluka6

pavluka6 commented Sep 2, 2024

I also have this issue, and it happened on a system restart. How can I avoid this happening in the future, and how do I resolve it?

@xiaofan-luan (Collaborator)

/assign @yanliang567
please help reproduce this issue

@xiaofan-luan (Collaborator)

This is important, and we have seen it happen multiple times.

@m-ajer

m-ajer commented Dec 19, 2024

Is there any news on this? I had to restart my server due to a power outage, and Milvus from Docker is having trouble starting (status unhealthy). I do not want to reindex the whole collection again.

@yanliang567 (Contributor)

We cannot reproduce this issue in-house. @m-ajer I suggest you manually stop the Milvus service before the power outage.

@xiaofan-luan (Collaborator)

> Is there any news on this? I had to restart my server due to a power outage, and Milvus from Docker is having trouble starting (status unhealthy). I do not want to reindex the whole collection again.

Could you offer a log and your Milvus version if your service cannot start?
Upgrading to the latest Milvus might help.

@xiaofan-luan (Collaborator)

We'd like to keep tracking this in case there are more clues.

@m-ajer

m-ajer commented Jan 3, 2025

I retried it with the same result: milvus-standalone remains unhealthy despite increasing the wait time for the other containers in the Docker configuration. Here are the logs and the docker-compose file: files.zip. Another factor that could be at play is that we are not using an SSD, but the collection indexed normally before, so we didn't think it would be an issue. The current volume size is around 110 GB.

@yanliang567 (Contributor)

@m-ajer the logs you attached are all INFO logs, which do not contain enough information. Could you please set the Milvus log level to DEBUG and recollect the logs?

@xiaofan-luan (Collaborator)

Change the queryCoord.channelTaskTimeout config to 600000 and see if it works.

We recommend you upgrade to 2.4-latest; this PR might solve the problem: #36617
