[Bug]: mixcoord error #38597

Open
1 task done
553589912 opened this issue Dec 19, 2024 · 4 comments
Assignees
Labels
kind/bug (issues or changes related to a bug), triage/accepted (indicates an issue or PR is ready to be actively worked on)

Comments

@553589912

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:2.4.11
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar    
- SDK version(e.g. pymilvus v2.0.0rc2): 
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

An error is reported during use:
stack trace:
/workspace/source/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/workspace/source/internal/util/grpcclient/client.go:555 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
/workspace/source/internal/util/grpcclient/client.go:569 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/workspace/source/internal/distributed/querynode/client/client.go:90 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]
/workspace/source/internal/distributed/querynode/client/client.go:215 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Query
/workspace/source/internal/proxy/task_query.go:553 github.com/milvus-io/milvus/internal/proxy.(*queryTask).queryShard
/workspace/source/internal/proxy/lb_policy.go:180 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1
/workspace/source/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do
/workspace/source/internal/proxy/lb_policy.go:154 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry
/workspace/source/internal/proxy/lb_policy.go:218 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute.func2: node not found
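The trace above is a flat sequence of `file.go:line` and function-name pairs, ending in the error text `node not found`. As a reading aid, here is a minimal Python sketch that splits such a flattened trace into frames. The helper name `split_frames` and the regular expression are illustrative assumptions, not part of Milvus; the sample uses two frames copied from the trace.

```python
import re

# One frame = a "path/file.go:LINE" token followed by a function-name token.
# (Assumed layout, inferred from the trace quoted in this issue.)
FRAME_RE = re.compile(r"(\S+\.go):(\d+)\s+(\S+)")

def split_frames(trace: str):
    """Return (file, line, function) tuples from a flattened Go stack trace."""
    # rstrip(":") drops the colon that precedes the trailing error message.
    return [(f, int(n), fn.rstrip(":")) for f, n, fn in FRAME_RE.findall(trace)]

sample = (
    "/workspace/source/internal/proxy/task_query.go:553 "
    "github.com/milvus-io/milvus/internal/proxy.(*queryTask).queryShard "
    "/workspace/source/internal/proxy/lb_policy.go:180 "
    "github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1"
)

for file, line, func in split_frames(sample):
    print(f"{file}:{line}  {func}")
```

Read this way, the trace shows the proxy's `queryShard` calling a querynode through the gRPC client, with the load-balancing policy retrying the call.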

milvus-error

Expected Behavior

The service reports an error during indexing.

Steps To Reproduce

1. In this environment there are 2 mixcoord pods, and one of them does not report errors.
ak-b2c-cloudbase-01-milvus-attu-69dc86cb46-m6rkn        1/1     Running   0          3d10h
ak-b2c-cloudbase-01-milvus-datanode-55b766cdb9-zmj66    1/1     Running   0          3d10h
ak-b2c-cloudbase-01-milvus-indexnode-7757c6696b-2qm4g   1/1     Running   0          3d10h
ak-b2c-cloudbase-01-milvus-indexnode-7757c6696b-4xnv6   1/1     Running   0          25h
ak-b2c-cloudbase-01-milvus-indexnode-7757c6696b-9fpsd   1/1     Running   0          30d
ak-b2c-cloudbase-01-milvus-indexnode-7757c6696b-9jchv   1/1     Running   0          21h
ak-b2c-cloudbase-01-milvus-mixcoord-65598fc465-jvzrg    1/1     Running   0          10d
ak-b2c-cloudbase-01-milvus-mixcoord-65598fc465-z55gr    1/1     Running   1          3d10h
ak-b2c-cloudbase-01-milvus-proxy-8649b6d4fc-8g829       1/1     Running   0          25h
ak-b2c-cloudbase-01-milvus-proxy-8649b6d4fc-bh5cx       1/1     Running   0          3d10h
ak-b2c-cloudbase-01-milvus-proxy-8649b6d4fc-r4k4v       1/1     Running   0          10d
ak-b2c-cloudbase-01-milvus-querynode-66f8ff75b8-dg9mr   1/1     Running   0          5h
ak-b2c-cloudbase-01-milvus-querynode-66f8ff75b8-h9rf4   1/1     Running   0          5h
ak-b2c-cloudbase-01-milvus-querynode-66f8ff75b8-jd9dq   1/1     Running   0          5h
ak-b2c-cloudbase-01-milvus-querynode-66f8ff75b8-wzbjl   1/1     Running   0          4h57m
ak-b2c-cloudbase-01-pulsar-bookie-0                     1/1     Running   0          3d10h
ak-b2c-cloudbase-01-pulsar-bookie-1                     1/1     Running   0          3d9h
ak-b2c-cloudbase-01-pulsar-bookie-2                     1/1     Running   0          24h
ak-b2c-cloudbase-01-pulsar-broker-0                     1/1     Running   0          3d10h
ak-b2c-cloudbase-01-pulsar-broker-1                     1/1     Running   0          25h
ak-b2c-cloudbase-01-pulsar-proxy-0                      1/1     Running   0          3d10h
ak-b2c-cloudbase-01-pulsar-proxy-1                      1/1     Running   0          24h
ak-b2c-cloudbase-01-pulsar-recovery-0                   1/1     Running   0          42m
ak-b2c-cloudbase-01-pulsar-zookeeper-0                  1/1     Running   0          3d10h
ak-b2c-cloudbase-01-pulsar-zookeeper-1                  1/1     Running   0          30d
ak-b2c-cloudbase-01-pulsar-zookeeper-2                  1/1     Running   0          68m
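In the listing above, all four querynode pods are about 5 hours old, while the two mixcoord pods are 10d and 3d10h old (one with a restart). Whether this is related to the `node not found` error is not established in this issue. The sketch below only shows one way to pick recently started pods out of such a listing. It assumes the five-column `NAME READY STATUS RESTARTS AGE` layout shown here with a plain integer in the restarts column; `parse_age` and `recent_pods` are hypothetical helper names.

```python
import re
from datetime import timedelta

UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}

def parse_age(age: str) -> timedelta:
    """Convert an AGE string such as '3d10h' or '4h57m' to a timedelta."""
    parts = re.findall(r"(\d+)([dhms])", age)
    return timedelta(seconds=sum(int(n) * UNIT_SECONDS[u] for n, u in parts))

def recent_pods(listing: str, max_age: timedelta):
    """Yield (name, restarts, age) for pods younger than max_age."""
    for row in listing.strip().splitlines():
        # Assumes exactly the column order seen in this issue's listing.
        name, _ready, _status, restarts, age = row.split()[:5]
        if parse_age(age) < max_age:
            yield name, int(restarts), age

# Four rows copied from the listing above.
listing = """
ak-b2c-cloudbase-01-milvus-mixcoord-65598fc465-jvzrg    1/1     Running   0          10d
ak-b2c-cloudbase-01-milvus-mixcoord-65598fc465-z55gr    1/1     Running   1          3d10h
ak-b2c-cloudbase-01-milvus-querynode-66f8ff75b8-dg9mr   1/1     Running   0          5h
ak-b2c-cloudbase-01-milvus-querynode-66f8ff75b8-wzbjl   1/1     Running   0          4h57m
"""

for name, restarts, age in recent_pods(listing, timedelta(hours=6)):
    print(name, restarts, age)
```

With a 6-hour cutoff, only the two querynode rows in the sample are selected.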

Milvus Log

mixcoord.log.tar.gz

Anything else?

No response

553589912 added the kind/bug and needs-triage labels on Dec 19, 2024
@553589912
Author

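I just tried deleting the problematic mixcoord pod. After the pod was automatically recreated, the service recovered for now. The cause is still a bit strange, though.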

@yanliang567
Contributor

I think etcd is not performing well, which causes the Milvus pods to lose their connection to it. Please make sure etcd is running on SSD volumes and has enough CPU resources.

[2024/12/15 11:57:04.487 +00:00] [ERROR] [proxy/proxy.go:177] ["Proxy disconnected from etcd, process will exit"] ["Server Id"=39] [stack="github.com/milvus-io/milvus/internal/proxy.(*Proxy).Register.func1\n\t/go/src/github.com/milvus-io/milvus/internal/proxy/proxy.go:177"]
[2024/12/15 11:35:34.616 +00:00] [INFO] [sessionutil/session_util.go:538] ["keepAlive channel close caused by etcd, try to KeepAliveOnce"] [serverName=indexcoord]
[2024/12/15 11:35:34.616 +00:00] [WARN] [sessionutil/session_util.go:530] ["session keepalive channel closed"]
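The three excerpts above are quoted newest first: the keepalive messages at 11:35:34 precede the proxy exit at 11:57:04. A small Python sketch for putting such lines in chronological order follows. The regular expression is an assumption based only on the line layout visible in these excerpts, and the lines may well come from different pods, so the ordering is a reading aid rather than proof of cause and effect.

```python
import re
from datetime import datetime

# Layout assumed from the excerpts: [timestamp] [LEVEL] [file:line] ["message"] ...
LOG_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] \[(?P<src>[^\]]+)\] \["(?P<msg>[^"]*)"\]'
)

def parse_line(line: str):
    """Return (timestamp, level, source, message), or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m["ts"], "%Y/%m/%d %H:%M:%S.%f %z")
    return ts, m["level"], m["src"], m["msg"]

lines = [
    '[2024/12/15 11:57:04.487 +00:00] [ERROR] [proxy/proxy.go:177] ["Proxy disconnected from etcd, process will exit"] ["Server Id"=39]',
    '[2024/12/15 11:35:34.616 +00:00] [INFO] [sessionutil/session_util.go:538] ["keepAlive channel close caused by etcd, try to KeepAliveOnce"] [serverName=indexcoord]',
    '[2024/12/15 11:35:34.616 +00:00] [WARN] [sessionutil/session_util.go:530] ["session keepalive channel closed"]',
]

# sorted() is stable, so lines with equal timestamps keep their input order.
events = sorted(filter(None, map(parse_line, lines)), key=lambda e: e[0])
for ts, level, src, msg in events:
    print(ts.isoformat(), level, src, msg)

gap = events[-1][0] - events[0][0]
print("first to last event:", gap)
```

For these three lines the gap between the first keepalive message and the proxy exit is a little under 21.5 minutes.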

/assign @553589912
/unassign

yanliang567 added the help wanted label and removed the kind/bug and needs-triage labels on Dec 20, 2024
@553589912
Author

etcd runs on SSD volumes, and the CPU is sufficient. However, the three etcd nodes are located in three availability zones, and the network latency is around 1 ms. Even if etcd has performance jitter, Milvus should not remain unavailable for a long time. Currently, it cannot rediscover the topology automatically unless the mixcoord pod is rebuilt.

@yanliang567
Contributor

/assign @congqixia
Please help take a look.

/unassign

yanliang567 added the kind/bug and triage/accepted labels and removed the help wanted label on Dec 20, 2024

3 participants