NAS-134183 / 25.10 / nvme-of: wait for timeout to pass if shelf was empty #15707

Open
wants to merge 1 commit into
base: master


@ixhamza ixhamza commented Feb 14, 2025

Removing power from the ES24N shelf triggers no events on the discovery controller other than an Interface Link Down, and the shelf drives keep retrying until their 10-minute timeout expires. After power is restored, the connection comes back in about 10 seconds, but the remote discovery log still shows no entries. A discovery change event then removes all drives because the ES24N shelf reports none. About one minute after Link Up, a Link Down occurs, followed a few seconds later by another Link Up that triggers a change event in which some entries, notably the CM7 drives, are missing.
To address this, when a discovery change event arrives on power restore and the shelf was previously empty, wait for the timeout to pass before acting. This avoids acting on an incomplete discovery log and lets the drives reconnect properly. It also adds waiting time when the first disk is connected to the enclosure. The extra delay is minimal, since disks already take a few seconds to appear in the discovery log after a change event.
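
The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `drives_after_change_event`, the `read_discovery_log` callback, and the `settle_seconds` parameter are all hypothetical, and the actual wait duration is whatever timeout the change uses.

```python
import time
from typing import Callable, Iterable, Set


def drives_after_change_event(
    previously_known: Set[str],
    read_discovery_log: Callable[[], Iterable[str]],
    settle_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Set[str]:
    """Return the set of drives to act on after a discovery change event.

    All names here are hypothetical; this only illustrates the idea of
    deferring the discovery log read when the shelf was previously empty.
    """
    if not previously_known:
        # The shelf reported no drives before this event, which is what a
        # power restore looks like. The discovery log may still be
        # incomplete, so wait before trusting it.
        sleep(settle_seconds)
    # Otherwise the shelf already had drives, so read the log right away.
    return set(read_discovery_log())
```

Passing `sleep` as a parameter keeps the sketch testable without real delays. The same wait applies when the first disk is connected to an empty enclosure, which matches the trade-off noted in the description.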

Jira Ticket: https://ixsystems.atlassian.net/browse/NAS-134183
Validated by Jeff Ervin on f100-152

@ixhamza ixhamza requested review from amotin and yocalebo February 14, 2025 17:31
@bugclerk bugclerk changed the title nvme-of: wait for timeout to pass if shelf was empty NAS-134183 / 25.10 / nvme-of: wait for timeout to pass if shelf was empty Feb 14, 2025