Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sled-agent] Use zone-network-setup service after underlay is up #7260

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

karencfv
Copy link
Contributor

@karencfv karencfv commented Dec 17, 2024

This commit removes the need to zlogin into the switch zone to set the underlay IP address and new static address after the underlay is up. Instead it adds the new values to the SMF properties and refreshes the zone-network-setup service. This service is safe to rerun as all it's commands ensure things are there, but does not throw an error if they are.

Closes: #6157

@karencfv
Copy link
Contributor Author

Tried this on London but it didn't work. What's weird is that the sled agent never reached the part where it detects a different address https://github.com/oxidecomputer/omicron/blob/main/sled-agent/src/services.rs#L4261-L4272 . This means it technically never even touched the new code? Unsure about what's going on here. Will look later

@karencfv
Copy link
Contributor Author

karencfv commented Dec 19, 2024

I've tested on a4x2 and everything came up successfully.

root@g3:~# cat $(svcs -L sled-agent) | grep refreshed | looker
04:36:11.926Z INFO SledAgent (ServiceManager): refreshed zone-network-setup service with new configuration
    file = sled-agent/src/services.rs:4328
04:36:12.102Z INFO SledAgent (ServiceManager): refreshed dendrite service with new configuration
    file = sled-agent/src/services.rs:4452
04:36:12.152Z INFO SledAgent (ServiceManager): refreshed lldpd service with new configuration
    file = sled-agent/src/services.rs:4492
04:36:12.208Z INFO SledAgent (ServiceManager): refreshed MGS service with new configuration
    file = sled-agent/src/services.rs:4380
04:36:12.331Z INFO SledAgent (ServiceManager): refreshed wicketd service with new configuration
    file = sled-agent/src/services.rs:4470
04:36:12.577Z INFO SledAgent (ServiceManager): refreshed mgd service with new configuration
    file = sled-agent/src/services.rs:4544
04:36:12.768Z INFO SledAgent (ServiceManager): refreshed mg-ddm service with new configuration
    file = sled-agent/src/services.rs:4591

root@g3:~# zlogin oxz_switch

root@oxz_switch:~# svccfg -s svc:/oxide/zone-network-setup listprop config
config              application
config/datalink     astring  oxControlService34
config/gateway      astring  fd00:1122:3344:104::1
config/static_addr  astring  fd00:1122:3344:104::2

root@oxz_switch:~# cat /var/svc/log/oxide-zone-network-setup:default.log
[ Dec 19 04:36:00 Enabled. ]
[ Dec 19 04:36:00 Rereading configuration. ]
[ Dec 19 04:36:00 Rereading configuration. ]
[ Dec 19 04:36:11 Executing start method ("/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d oxControlService34 -s ::1 -g "). ]
note: configured to log to "/dev/stderr"
{"msg":"Ensuring IP interface exists on datalink","v":0,"name":"zone-setup","level":30,"time":"2024-12-19T04:36:11.4686168Z","hostname":"oxz_switch","pid":3090,"datalink":"oxControlService34"}
{"msg":"Setting MTU to 9000 for IPv6 and IPv4","v":0,"name":"zone-setup","level":30,"time":"2024-12-19T04:36:11.534105902Z","hostname":"oxz_switch","pid":3090,"datalink":"oxControlService34"}
{"msg":"Setting TCP recv_buf size to 1 MB","v":0,"name":"zone-setup","level":30,"time":"2024-12-19T04:36:11.542986021Z","hostname":"oxz_switch","pid":3090}
{"msg":"Setting TCP congestion control algorithm to cubic","v":0,"name":"zone-setup","level":30,"time":"2024-12-19T04:36:11.546753526Z","hostname":"oxz_switch","pid":3090}
{"msg":"Static address is localhost; will not ensure it's set on the IP interface","v":0,"name":"zone-setup","level":30,"time":"2024-12-19T04:36:11.55043676Z","hostname":"oxz_switch","pid":3090}
{"msg":"Underlay is not available yet; will not ensure default route","v":0,"name":"zone-setup","level":30,"time":"2024-12-19T04:36:11.550488207Z","hostname":"oxz_switch","pid":3090}
{"msg":"Populating hosts file for zone","v":0,"name":"zone-setup","level":30,"time":"2024-12-19T04:36:11.550496408Z","hostname":"oxz_switch","pid":3090,"zonename":"\"oxz_switch\""}
[ Dec 19 04:36:11 Method "start" exited with status 0. ]
[ Dec 19 04:36:11 Rereading configuration. ]
[ Dec 19 04:36:11 Executing refresh method ("/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d oxControlService34 -s fd00:1122:3344:104::2 -g fd00:1122:3344:104::1"). ]
note: configured to log to "/dev/stderr"
{"msg":"Ensuring IP interface exists on datalink","v":0,"name":"zone-setup","level":30,"time":"2024-12-19T04:36:11.990693716Z","hostname":"oxz_switch","pid":3121,"datalink":"oxControlService34"}
{"msg":"Setting MTU to 9000 for IPv6 and IPv4","v":0,"name":"zone-setup","level":30,"time":"2024-12-19T04:36:11.995634708Z","hostname":"oxz_switch","pid":3121,"datalink":"oxControlService34"}
{"msg":"Setting TCP recv_buf size to 1 MB","v":0,"name":"zone-setup","level":30,"time":"2024-12-19T04:36:12.00612112Z","hostname":"oxz_switch","pid":3121}
{"msg":"Setting TCP congestion control algorithm to cubic","v":0,"name":"zone-setup","level":30,"time":"2024-12-19T04:36:12.010800129Z","hostname":"oxz_switch","pid":3121}
{"msg":"Ensuring static and auto-configured addresses are set on the IP interface","v":0,"name":"zone-setup","level":30,"time":"2024-12-19T04:36:12.016213931Z","hostname":"oxz_switch","pid":3121,"static address":"fd00:1122:3344:104::2","data link":"\"oxControlService34\""}
{"msg":"Ensuring there is a default route","v":0,"name":"zone-setup","level":30,"time":"2024-12-19T04:36:12.500717653Z","hostname":"oxz_switch","pid":3121,"gateway":"Ipv6(fd00:1122:3344:104::1)"}
{"msg":"Populating hosts file for zone","v":0,"name":"zone-setup","level":30,"time":"2024-12-19T04:36:12.540005264Z","hostname":"oxz_switch","pid":3121,"zonename":"\"oxz_switch\""}
[ Dec 19 04:36:12 Method "refresh" exited with status 0. ]

root@oxz_switch:~# cat /etc/inet/hosts 

::1 localhost loghost
127.0.0.1 localhost loghost
fd00:1122:3344:104::2 oxz_switch.local oxz_switch

root@oxz_switch:~# netstat -rcnv

IRE Table: IPv4
  Destination             Mask           Gateway          Device  MTU  Ref Flg  Out   In/Fwd 
-------------------- --------------- -------------------- ------ ----- --- --- ------ ------ 
127.0.0.1            255.255.255.255 127.0.0.1            lo0     8232   2 UH     146    146 
198.51.101.4/30      255.255.255.252 198.51.101.6         tfportqsfp0_0  1500   3 U     4952      0 
198.51.101.12/30     255.255.255.252 198.51.101.14        tfportqsfp1_0  1500   3 U     4953      0 

IRE Table: IPv6
  Destination/Mask            Gateway                    If    MTU  Ref Flags  Out   In/Fwd 
--------------------------- --------------------------- ----- ----- --- ----- ------ ------ 
::1                         ::1                         lo0    8252  36 UH    307206 307272 
fdb0:a840:2500:7::/64       fdb0:a840:2500:7::2         oxBootstrap34  1500   3 U        981      0 
fd00:1122:3344:104::/64     fd00:1122:3344:104::2       oxControlService34  9000   4 U       3791      0 
fdb0::/16                   fe80::8:20ff:feb6:f3cb      oxBootstrap34  1500   8 UG      2838      0 
fe80::/10                   fe80::aa40:25ff:fe10:2680   tfportrear3_0  1500   3 U       3297      0 
fe80::/10                   fe80::aa40:25ff:fe10:267f   tfportrear2_0  1500   3 U       3296      0 
fe80::/10                   fe80::aa40:25ff:fe10:267e   tfportrear1_0  1500   3 U       3212      0 
fe80::/10                   fe80::aa40:25ff:fe10:267d   tfportrear0_0  1500   3 U       3283      0 
fe80::/10                   fe80::aa40:25ff:fe10:267c   tfportint0_0  1500   2 U          0      0 
fe80::/10                   fe80::8:20ff:fe77:5014      oxControlService34  9000   2 U          0      0 
fe80::/10                   fe80::8:20ff:fe99:833d      oxBootstrap34  1500   3 U          2      0 
default                     fd00:1122:3344:104::1                 0   6 UG     65795      0 

root@oxz_switch:~# ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
tfportqsfp0_0/uplink1 static ok         198.51.101.6/30
tfportqsfp1_0/uplink1 static ok         198.51.101.14/30
lo0/v6            static   ok           ::1/128
oxBootstrap34/ll  addrconf ok           fe80::8:20ff:fe99:833d%oxBootstrap34/10
oxBootstrap34/bootstrap6 static ok      fdb0:a840:2500:7::2/64
oxControlService34/ll addrconf ok       fe80::8:20ff:fe77:5014%oxControlService34/10
oxControlService34/omicron6 static ok   fd00:1122:3344:104::2/64
tfportint0_0/ll   addrconf ok           fe80::aa40:25ff:fe10:267c%tfportint0_0/10
tfportrear0_0/ll  addrconf ok           fe80::aa40:25ff:fe10:267d%tfportrear0_0/10
tfportrear1_0/ll  addrconf ok           fe80::aa40:25ff:fe10:267e%tfportrear1_0/10
tfportrear2_0/ll  addrconf ok           fe80::aa40:25ff:fe10:267f%tfportrear2_0/10
tfportrear3_0/ll  addrconf ok           fe80::aa40:25ff:fe10:2680%tfportrear3_0/10

root@oxz_switch:~# dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
vioif0      phys      1500   up       ?          --
oxBootstrap34 vnic    1500   up       --         ?
oxControlService34 vnic 9000 up       --         ?
tfport0     tfport    1500   up       --         vioif0
tfportint0_0 tfport   1500   up       --         vioif0
tfportrear0_0 tfport  1500   up       --         vioif0
tfportrear1_0 tfport  1500   up       --         vioif0
tfportrear2_0 tfport  1500   up       --         vioif0
tfportrear3_0 tfport  1500   up       --         vioif0
tfportqsfp0_0 tfport  1500   up       --         vioif0
tfportqsfp1_0 tfport  1500   up       --         vioif0

I'm pretty confident this will work just fine, but as a precaution I'd like to run it on madrid or london before merging. Will do so once either are available to use

@karencfv karencfv marked this pull request as ready for review December 19, 2024 06:05
@karencfv karencfv requested review from smklein and bnaecker December 19, 2024 06:05
Copy link
Collaborator

@bnaecker bnaecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just a few small questions, but mostly pretty minimal. Thanks!

Err(e) => {
// If a property already doesn't exist we don't need to
// return an error
if !e.to_string().contains("No such property") {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd hit this if one tried to delete a property that doesn't exist, right? As opposed to deleting a value that doesn't exist from a property that does. Is that what we want? I could see an argument for failing if we try to deleting a non-existent property.

Copy link
Contributor Author

@karencfv karencfv Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, yeah I see your point.

In this case, after deleting we run the svccfg addpropvalue command with the value type, which means the property is created if it doesn't exist. So, it seems like it's not a big deal if we don't return an error if the property didn't exist. If the property wasn't created previously for whatever reason, it's OK because we're creating it anyway.

That said, returning the error like you propose, would guarantee that we are not adding a property where it's not meant to be.

I'm a little torn on this one. WDYT?

info!(
nsmfh.addpropvalue_type(
"config/gateway",
&format!("{}", info.underlay_address),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method takes any type that impls ToString, so we don't need to format here. We can pass &info.underlay_address I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha! yes. good catch

sled-agent/src/services.rs Outdated Show resolved Hide resolved
@@ -165,6 +165,38 @@ impl<'t> SmfHelper<'t> {
Ok(())
}

pub fn delpropvalue<P, V>(&self, prop: P, val: V) -> Result<(), Error>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this differ from the existing delpropvalue_default_instance method below? There's only one (default) instance of this service in a zone AFAIU, so they seem like they should be equivalent. I could be missing something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're almost the same, yes. You can set properties on the instance itself and/or on the service.

Example:

coatlicue@centzon:~$ svccfg -s svc:/system/fmd:default listprop
general                            framework
general/comment                    astring  
general/enabled                    boolean  true
restarter                          framework    NONPERSISTENT
restarter/logfile                  astring  /var/svc/log/system-fmd:default.log
restarter/contract                 count    89
restarter/start_pid                count    1516
restarter/start_method_timestamp   time     1734493550.186287000
restarter/start_method_waitstatus  integer  0
restarter/auxiliary_state          astring  dependencies_satisfied
restarter/next_state               astring  none
restarter/state                    astring  online
restarter/state_timestamp          time     1734493550.187733000
coatlicue@centzon:~$ svccfg -s svc:/system/fmd listprop
dumpadm-fmd                                    dependency
dumpadm-fmd/entities                           fmri     svc:/system/dumpadm
dumpadm-fmd/external                           boolean  true
dumpadm-fmd/grouping                           astring  require_all
dumpadm-fmd/restart_on                         astring  none
dumpadm-fmd/type                               astring  service
SUNWfmd                                        dependency
SUNWfmd/entities                               fmri     file://localhost/usr/lib/fm/fmd/fmd
SUNWfmd/grouping                               astring  require_all
SUNWfmd/restart_on                             astring  none
SUNWfmd/type                                   astring  path
<more properties>

The zone-network-service has it's properties set on the service directly so the other command would not delete the correct property values

Comment on lines +34 to +37
<!-- We use the same command as the start method as it is safe to rerun. -->
<exec_method type='method' name='refresh'
exec='/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d %{config/datalink} -s %{config/static_addr} -g %{config/gateway}'
timeout_seconds='0' />
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the service is currently running when it's refreshed?

Copy link
Contributor Author

@karencfv karencfv Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing really. All of the commands on that service are idempotent (I also manually refreshed this service multiple times to see what happened).

Additionally, it's highly unlikely this would happen. Several other services on the switch zone depend on this service, which means we wouldn't even get to the point where the service is refreshed, if the zone-network-setup service hadn't exited with 0 (as far as I understand how the switch zone startup works, I could be wrong though).

Copy link
Contributor Author

@karencfv karencfv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to review! I've addressed your comments and still on the fence about one. Let me know what you think :)

info!(
nsmfh.addpropvalue_type(
"config/gateway",
&format!("{}", info.underlay_address),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha! yes. good catch

@@ -165,6 +165,38 @@ impl<'t> SmfHelper<'t> {
Ok(())
}

pub fn delpropvalue<P, V>(&self, prop: P, val: V) -> Result<(), Error>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're almost the same, yes. You can set properties on the instance itself and/or on the service.

Example:

coatlicue@centzon:~$ svccfg -s svc:/system/fmd:default listprop
general                            framework
general/comment                    astring  
general/enabled                    boolean  true
restarter                          framework    NONPERSISTENT
restarter/logfile                  astring  /var/svc/log/system-fmd:default.log
restarter/contract                 count    89
restarter/start_pid                count    1516
restarter/start_method_timestamp   time     1734493550.186287000
restarter/start_method_waitstatus  integer  0
restarter/auxiliary_state          astring  dependencies_satisfied
restarter/next_state               astring  none
restarter/state                    astring  online
restarter/state_timestamp          time     1734493550.187733000
coatlicue@centzon:~$ svccfg -s svc:/system/fmd listprop
dumpadm-fmd                                    dependency
dumpadm-fmd/entities                           fmri     svc:/system/dumpadm
dumpadm-fmd/external                           boolean  true
dumpadm-fmd/grouping                           astring  require_all
dumpadm-fmd/restart_on                         astring  none
dumpadm-fmd/type                               astring  service
SUNWfmd                                        dependency
SUNWfmd/entities                               fmri     file://localhost/usr/lib/fm/fmd/fmd
SUNWfmd/grouping                               astring  require_all
SUNWfmd/restart_on                             astring  none
SUNWfmd/type                                   astring  path
<more properties>

The zone-network-service has it's properties set on the service directly so the other command would not delete the correct property values

Err(e) => {
// If a property already doesn't exist we don't need to
// return an error
if !e.to_string().contains("No such property") {
Copy link
Contributor Author

@karencfv karencfv Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, yeah I see your point.

In this case, after deleting we run the svccfg addpropvalue command with the value type, which means the property is created if it doesn't exist. So, it seems like it's not a big deal if we don't return an error if the property didn't exist. If the property wasn't created previously for whatever reason, it's OK because we're creating it anyway.

That said, returning the error like you propose, would guarantee that we are not adding a property where it's not meant to be.

I'm a little torn on this one. WDYT?

Comment on lines +34 to +37
<!-- We use the same command as the start method as it is safe to rerun. -->
<exec_method type='method' name='refresh'
exec='/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d %{config/datalink} -s %{config/static_addr} -g %{config/gateway}'
timeout_seconds='0' />
Copy link
Contributor Author

@karencfv karencfv Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing really. All of the commands on that service are idempotent (I also manually refreshed this service multiple times to see what happened).

Additionally, it's highly unlikely this would happen. Several other services on the switch zone depend on this service, which means we wouldn't even get to the point where the service is refreshed, if the zone-network-setup service hadn't exited with 0 (as far as I understand how the switch zone startup works, I could be wrong though).

@karencfv karencfv requested a review from bnaecker December 19, 2024 22:11
@karencfv
Copy link
Contributor Author

Thanks for the review! I'll merge once I've ran this on London or Madrid

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

zone-setup service could be more helpful in the switch zone
2 participants