Migrate from ntpd to Chrony #1852
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,271 @@ | ||
# SONiC Migration to Chrony | ||
|
||
## Table of Contents | ||
|
||
### Revision | ||
|
||
| Rev | Date | Author | Description | | ||
|:---:|:----------:|:-----------------:|:----------------| | ||
| 1.0 | 10/07/2024 | Saikrishna Arcot | Initial version | | ||
|
||
### Scope | ||
|
||
This high-level design document is to document the move from ntpd to Chrony as | ||
the NTP daemon in SONiC, along with the reasons why this was done and the | ||
changes in behavior. | ||
|
||
## Definitions | ||
|
||
* NTP: Network Time Protocol | ||
* RTC: Real Time Clock | ||
* System time: the time that is used by userspace applications when they want to get the current time | ||
|
||
## Overview | ||
|
||
Today, in SONiC, the NTP daemon that is used is ntpd, the reference | ||
implementation from the NTP Project. (Starting in the 202405 branch, NTPsec is | ||
used instead, which is a fork of ntpd that has been security-hardened. The rest | ||
of this document will still refer to it as ntpd, but all of the issues listed | ||
below still apply.) This daemon is responsible for keeping the time on SONiC | ||
devices synchronized to the actual time. In general, this daemon is doing a | ||
good job of keeping the time correct. However, there are some critical | ||
shortcomings with regard to how this daemon works: | ||
|
||
1. SONiC intentionally disables long jumps in the ntpd configuration (ntpd | ||
calls these steps), because not all applications may be able to handle large | ||
changes in the system time, and instead expects the time to be slewed (i.e. | ||
gradually adjusted). Specifically, if applications are using the current | ||
system time to determine when to do something, and there's a large time jump | ||
either backwards or forwards, then that application may no longer behave | ||
correctly. On a network device, this, in the worst case, could mean | ||
dataplane impact. However, ntpd doesn't support _fully_ disabling long | ||
jumps. This is evident from the fact that the | ||
`ntp/test_ntp.py::test_ntp_long_jump_disabled` test case passes: when the | ||
system time is one hour off, ntpd synchronizes it within 12 minutes. That is | ||
only possible because ntpd steps the clock to correct the time, even though | ||
it was configured to slew it. | ||
2. When slewing the time, ntpd disables the kernel time discipline. One of the | ||
effects of this is that the kernel will never know that the system time has | ||
been synchronized to the actual time, and thus will not update the | ||
hardware clock/RTC on the board with the correct time. When the kernel knows | ||
that the system time is synchronized, every 11 minutes, it will write the | ||
current system time to the hardware clock/RTC. Not syncing the system time | ||
to the hardware clock/RTC means that if the system were to be rebooted, then | ||
it would come back up with whatever time was recorded in the hardware clock, | ||
which might not be the actual time. It is possible to manually sync the | ||
system time to the hardware clock, which is what is done in SONiC when the | ||
device is being rebooted (either cold, soft, fast, or warm). However, in the | ||
case of an unexpected device reload (power loss, kernel panic, etc.), this | ||
sync will not happen. | ||
3. For ntpd to send and receive NTP packets from the upstream servers, it must | ||
be listening to port 123 of an interface or an IP address. This may be | ||
needed for symmetric associations, but for typical client-server | ||
associations, generally speaking, clients shouldn't need to be listening on | ||
port 123. This is because the packets coming back from the NTP server will | ||
be sent to the UDP port that the client sent out the packet on. The | ||
constraint that ntpd needs to be listening on each interface/address also | ||
complicates new addresses being added to an interface, or an interface being | ||
removed and/or added (due to runtime configuration changes). In these cases, | ||
ntpd needs to listen for new IP addresses being added (which it does), or | ||
the configuration needs to be updated. | ||
4. There have been a few cases of ntpd no longer sending out NTP packets. It's | ||
unclear why this happens (upstream servers have been unreachable for too | ||
long, interfaces have been removed and re-added, or something else), but | ||
this causes issues with the system time drifting. | ||
|
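To see why correcting a one-hour offset within 12 minutes implies a step rather than a slew, a rough calculation helps. The sketch below assumes ntpd's documented maximum slew rate of 500 ppm; the function name is illustrative and not part of SONiC or ntpd.

```python
def slew_duration_seconds(offset_s: float, max_slew_ppm: float) -> float:
    """Time needed to remove `offset_s` seconds of error by slewing only."""
    # A slew rate of N ppm corrects N microseconds of offset per second.
    return offset_s * 1_000_000 / max_slew_ppm


# Assumption: 500 ppm is the maximum slew rate that ntpd applies.
duration = slew_duration_seconds(offset_s=3600, max_slew_ppm=500)
print(f"{duration / 86400:.1f} days")  # 83.3 days
```

Under that assumption, a pure slew of one hour would take months, so a 12-minute recovery can only come from a step.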
||
For a long time, ntpd was the NTP daemon used by default on most Linux | ||
distributions. Two other implementations are now in common use: | ||
|
||
* chrony: This is another implementation designed for systems which might not | ||
always be running or connected to the Internet (especially virtual machines). | ||
It's able to synchronize the time faster than ntpd. | ||
* systemd-timesyncd: This is an SNTP client-only implementation built into | ||
systemd, and is enabled by default in Ubuntu and in Debian (starting with | ||
Bookworm). | ||
|
||
Systemd-timesyncd has limited configuration options, and while it might be | ||
sufficient as a simple NTP client, it only implements SNTP (which is now | ||
generally discouraged since it provides reduced accuracy), and will step the | ||
clock for large changes. Therefore, chrony is the better option here. | ||
|
||
### Requirements | ||
|
||
For SONiC, the NTP daemon needs to support the following: | ||
|
||
* Connect to one or more NTP servers, via interfaces that may be added or | ||
removed (such as front-panel ports or port channels) | ||
* If the NTP server(s) are not reachable, then it should keep retrying | ||
* Keep the system time close to the actual time | ||
* Only slew the clock, and never step the clock except upon request | ||
* Keep the hardware clock in sync with the system clock (or allow the kernel to | ||
synchronize the hardware clock) | ||
* Optionally act as an NTP server, for other devices that want to use this | ||
device as their upstream server | ||
* Optionally have NTP servers configured via DHCP | ||
|
||
## Overview of chrony | ||
|
||
Chrony is an NTP daemon, first released in the late 1990s, that is smaller than | ||
ntpd and claims to synchronize the time faster than ntpd. It supports most | ||
features of ntpd and can likely replace ntpd in most environments. | ||
|
||
### Advantages of replacing ntpd with chrony | ||
|
||
For SONiC's purposes, there are specific advantages that chrony has over ntpd: | ||
|
||
* It will only slew the system clock, and not step the system clock unless | ||
explicitly requested in the config file or `chronyc` (the client application | ||
to control `chronyd`). | ||
* Unless specified via a config option, chrony will use the system's routing | ||
rules to determine what interface to send NTP packets to for each source, and | ||
will listen for a response on the socket that it opens. In other words, the | ||
list of interfaces to listen on doesn't need to be specified, and a permanent | ||
socket doesn't need to be kept open (unlike ntpd). If it is desired that NTP | ||
packets are sent via a specific interface, then the config option | ||
`bindacqdevice` can be used to specify this interface. Similarly, | ||
`bindacqaddress` can be used to specify an IPv4 or IPv6 address. | ||
* If `rtcsync` is enabled in the configuration, then the kernel will get a | ||
notification that the time is synchronized, which will allow it to sync the | ||
hardware clock/RTC. Otherwise, chrony can manage the hardware clock/RTC. | ||
* There's a separate communication method (Unix socket and UDP port 323) for | ||
talking to and configuring `chronyd` itself. `ntpd` uses the same port for | ||
daemon configuration/information as NTP packets. This can help with | ||
security/firewalls. | ||
|
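The point about sockets can be illustrated with a minimal sketch: a client-mode NTP exchange only needs a UDP socket on an OS-chosen source port, because the server replies to whatever port the request came from. This is not how `chronyd` is implemented; the function names are hypothetical and the packet is the bare 48-byte NTP header with only the first byte filled in.

```python
import socket


def build_client_request(version: int = 4) -> bytes:
    """Build a minimal 48-byte NTP request in client mode (mode 3)."""
    leap, mode = 0, 3
    # First byte layout: 2 bits leap indicator, 3 bits version, 3 bits mode.
    first_byte = (leap << 6) | (version << 3) | mode
    return bytes([first_byte]) + bytes(47)


def open_client_socket(bind_addr: str = "0.0.0.0") -> socket.socket:
    """Open a UDP socket on an ephemeral source port chosen by the kernel."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, 0))  # port 0 lets the kernel pick the port
    return sock


if __name__ == "__main__":
    with open_client_socket("127.0.0.1") as sock:
        print("source port:", sock.getsockname()[1])
        print("request length:", len(build_client_request()))
```

Nothing here binds to port 123; that port only matters when the daemon also serves time to other clients.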
||
### Disadvantages of replacing ntpd with chrony | ||
|
||
There are also a couple of minor disadvantages: | ||
|
||
* When chrony is acting as an NTP server (not just as a client), chrony can | ||
listen on only one interface or on one IPv4 and one IPv6 address. This means | ||
that unlike ntpd, where there may be multiple sockets (one per interface or | ||
per IP address) listening for NTP packets from clients, chrony will have only | ||
two sockets (one for IPv4, one for IPv6) open. That being said, chrony can be | ||
told which IP addresses/subnets to allow/deny packets from. This means that | ||
chrony can be told to listen on all addresses (i.e. not be bound to a single | ||
interface, and listen on 0.0.0.0 and ::), and specify which subnets are | ||
allowed to talk to chrony through the use of `allow` and `deny` config | ||
options (e.g. `allow 10.2.0.0/16` and `deny 10.2.3.0/24`). Alternatively, a | ||
firewall (such as iptables) can be used to allow/block UDP port 123 packets | ||
from selected interfaces/IP subnets. | ||
* Tools that work with `ntpd` such as `ntpq` and `ntpstat` will not work with | ||
chrony, as they use different protocols for communication. Fortunately, | ||
`chronyc` can serve as a replacement to all necessary functions of `ntpq` and | ||
`ntpstat`, but with possibly different output formats. | ||
|
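The effect of the `allow`/`deny` example above can be modeled in a few lines. The sketch below is a simplified model, assuming that the most specific matching subnet decides and that clients matching no rule are refused; it is not chrony's actual implementation, and the function and variable names are illustrative.

```python
import ipaddress

# Rules mirror the example in the text: allow 10.2.0.0/16, deny 10.2.3.0/24.
RULES = [
    ("allow", ipaddress.ip_network("10.2.0.0/16")),
    ("deny", ipaddress.ip_network("10.2.3.0/24")),
]


def is_client_allowed(client: str, rules=RULES) -> bool:
    """Return True if the most specific matching rule is an `allow`."""
    addr = ipaddress.ip_address(client)
    matches = [
        (net.prefixlen, action)
        for action, net in rules
        if addr.version == net.version and addr in net
    ]
    if not matches:
        return False  # assumption: no matching rule means no access
    # Longest prefix wins; on a tie, "deny" sorts after "allow" and wins.
    _, action = max(matches)
    return action == "allow"


print(is_client_allowed("10.2.4.5"))   # True: only the /16 allow matches
print(is_client_allowed("10.2.3.7"))   # False: the /24 deny is more specific
print(is_client_allowed("192.0.2.1"))  # False: no rule matches
```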
||
### Conclusion | ||
|
||
Given the issues that the usage of ntpd has revealed, chrony's differences in | ||
behavior (always slewing the time, optionally updating the hardware clock/RTC, | ||
and reduced scope of permanently open sockets) are a major improvement over | ||
ntpd. The disadvantages listed above all have workarounds, which are | ||
described alongside them. | ||
|
||
For this reason, it makes sense to migrate to chrony. | ||
|
||
## Migrating from ntpd to chrony in SONiC | ||
|
||
### Configuration | ||
|
||
In terms of SONiC configuration changes, there are no configuration changes | ||
required for migrating from ntpd to chrony. All of the configuration | ||
information passed in can be translated to chrony's syntax. | ||
|
||
For chrony's configuration file, there are differences in the configuration | ||
options that are available. The major ones of note are: | ||
* Chrony doesn't require listening on an interface to be able to send and | ||
receive NTP packets on it. | ||
* When acting as an NTP server, chrony can only listen on one interface (or one | ||
IPv4 and one IPv6 address), whereas ntpd can open any number of sockets | ||
listening on port 123. | ||
* Chrony doesn't have a default panic threshold, whereas ntpd does. A panic | ||
threshold means that if the time received via NTP differs from the system | ||
time by more than the threshold, then the NTP daemon exits immediately | ||
instead of correcting the clock, and expects the system administrator to | ||
first set the system time close to the actual time. For SONiC's purposes, we | ||
do not want a panic threshold to be set. Chrony doesn't set one, whereas ntpd | ||
sets a threshold of 1000 seconds by default (which can be overridden). | ||
* ntpd's configuration specified what each subnet was allowed to do, whereas | ||
chrony doesn't quite have that. This is partly because the configuration | ||
control for chrony is on a different port entirely (UDP port 323 instead of | ||
UDP port 123), thus making it easier to firewall off and/or configure | ||
separately. In addition, chrony will default to using a client-server | ||
relationship instead of a symmetric relationship (where both sides will sync | ||
time with each other), unless the `peer` keyword is used instead of `server`. | ||
* Chrony also allows storing the NTP servers in a separate file, making it | ||
possible to reload the daemon and have it reread the servers instead of | ||
restarting the whole daemon. At this time, this is not used in SONiC. | ||
* Chrony's configuration file can specify `rtcsync` to tell the kernel that the | ||
system time is now synchronized, and the kernel can then synchronize the | ||
hardware clock/RTC with the system time. However, this would mean that if the | ||
system time is significantly different from the actual time, then the | ||
hardware clock/RTC will not get updated until the system time is synchronized | ||
to the actual time, which may take months. As an alternative, chrony can | ||
manage the hardware clock/RTC. With this, it will immediately update the | ||
hardware clock/RTC with the actual time, while the system time is gradually | ||
slewed. This will be the configuration chosen for SONiC. | ||
|
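As a rough illustration of how SONiC-style NTP settings could map onto the chrony directives mentioned above, the sketch below renders a configuration fragment. The input dictionary keys (`address`, `iburst`, `version`, `association_type`) and the function name are hypothetical and loosely follow the CLI option names; this is not the template that SONiC actually uses.

```python
def render_chrony_conf(servers, allow=(), deny=(), acq_device=None, rtcsync=False):
    """Render chrony.conf lines from a simplified, hypothetical NTP config."""
    lines = []
    for srv in servers:
        # `peer` requests a symmetric association; `server` is client-server.
        keyword = "peer" if srv.get("association_type") == "peer" else "server"
        parts = [keyword, srv["address"]]
        if srv.get("iburst"):
            parts.append("iburst")
        if "version" in srv:
            parts += ["version", str(srv["version"])]
        lines.append(" ".join(parts))
    if acq_device:
        # Send client requests via a specific interface instead of routing rules.
        lines.append(f"bindacqdevice {acq_device}")
    lines += [f"allow {subnet}" for subnet in allow]
    lines += [f"deny {subnet}" for subnet in deny]
    if rtcsync:
        # Let the kernel copy the system time to the RTC once synchronized.
        lines.append("rtcsync")
    return "\n".join(lines) + "\n"


print(render_chrony_conf(
    [{"address": "10.250.0.1", "iburst": True, "version": 4}],
    allow=["10.2.0.0/16"],
    deny=["10.2.3.0/24"],
    acq_device="eth0",
))
```

`rtcsync` is left off by default here because the text above chooses to let chrony manage the hardware clock/RTC instead.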
||
### Monitoring | ||
|
||
For the purpose of making time synchronization issues more visible, a Monit | ||
check will be added to verify that the time is currently synchronized to one or | ||
more NTP servers. If Monit sees that the time has not been synchronized for 3 | ||
minutes, then a message will be logged every 5 minutes saying that the time is | ||
not synchronized. | ||
|
||
Sample message: | ||
|
||
``` | ||
2024 Nov 7 01:36:00.154986 vlab-01 ERR monit[735]: 'ntp' status failed (1) -- NTP is not synchronized with servers | ||
``` | ||
|
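One way such a check could decide whether the time is synchronized is to look at the `Leap status` field of `chronyc tracking`. The sketch below is an assumption about how a check might work, not the script that the Monit check actually runs; the sample outputs are abridged and illustrative.

```python
def is_synchronized(tracking_output: str) -> bool:
    """Return True unless `chronyc tracking` reports an unsynchronised clock."""
    for line in tracking_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Leap status":
            return value.strip() != "Not synchronised"
    return False  # no leap status line: treat as not synchronized


# Abridged, illustrative samples of `chronyc tracking` output.
SYNCED = (
    "Reference ID    : 0AFA0001 (10.250.0.1)\n"
    "Stratum         : 3\n"
    "Leap status     : Normal\n"
)
UNSYNCED = (
    "Reference ID    : 00000000 ()\n"
    "Stratum         : 0\n"
    "Leap status     : Not synchronised\n"
)

print(is_synchronized(SYNCED))    # True
print(is_synchronized(UNSYNCED))  # False
```

In a real check, the text would come from running `chronyc tracking` on the device rather than from hard-coded samples.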
||
## SAI API | ||
|
||
There are no changes needed in the SAI API or in the implementation by vendors. | ||
|
||
## Configuration and management | ||
|
||
### Config DB | ||
|
||
There are no changes to the config DB schema. | ||
|
||
### CLI | ||
|
||
The output of the `show ntp` CLI will change as the output format of `chronyc` | ||
is different. There will be no other changes specifically related to this. | ||
|
||
However, `config ntp` will have additional options added. Specifically, it will | ||
accept `--iburst`, `--version`, and `--association-type` arguments when adding | ||
an NTP server, to enable iburst, specify the NTP version for the association, | ||
or specify the association type, respectively. This addresses the gap that | ||
while these options could be configured via `config_db.json`, there was no CLI | ||
option to set them. | ||
|
||
Examples: | ||
|
||
``` | ||
sudo config ntp add --iburst 10.250.0.1 | ||
``` | ||
Review comment: Can you make sure Chrony works with mgmt VRF as well? Please add examples with mgmt VRF.
Reply: Chrony running in mgmt vrf:
With the following config blocks:
Reply: Thanks!
|
||
## Restrictions/Limitations | ||
|
||
There are expected to be no new restrictions or limitations with this change. | ||
|
||
## Testing Requirements/Design | ||
|
||
The existing NTP test cases will be updated to support chrony. In addition, the | ||
long jump disabled test case will be expected to fail for chrony; that is, the | ||
time should *not* be synchronized after 12 minutes. | ||
|
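The expected failure can be sanity-checked with a short calculation of how much of a one-hour offset a pure slew can remove in 12 minutes. The sketch assumes chrony's documented default `maxslewrate` of 83333.333 ppm; the SONiC configuration may set a different value, and the function name is illustrative.

```python
def remaining_offset_seconds(offset_s: float, elapsed_s: float, max_slew_ppm: float) -> float:
    """Smallest offset that can remain after slewing for `elapsed_s` seconds."""
    # At N ppm, at most N microseconds of offset are corrected per second.
    corrected = elapsed_s * max_slew_ppm / 1_000_000
    return max(offset_s - corrected, 0.0)


# Assumption: chrony's default `maxslewrate` of 83333.333 ppm is in effect.
left = remaining_offset_seconds(offset_s=3600, elapsed_s=12 * 60, max_slew_ppm=83333.333)
print(f"{left:.0f} s still to correct")  # 3540 s still to correct
```

Under that assumption, at most about one minute of the offset is corrected in the test window, so the clock should still be far from synchronized.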
||
# Pull requests | ||
|
||
* [sonic-net/sonic-utilities: Switch to using chrony instead of ntpd](https://github.com/sonic-net/sonic-utilities/pull/3574) | ||
* [sonic-net/sonic-host-services: Update hostcfgd to start chrony instead of ntp-config or ntpd](https://github.com/sonic-net/sonic-host-services/pull/170) | ||
* [sonic-net/sonic-buildimage: Switch from ntpd to chrony](https://github.com/sonic-net/sonic-buildimage/pull/20497) | ||
* [sonic-net/sonic-mgmt: Add support for testing chrony](https://github.com/sonic-net/sonic-mgmt/pull/15008) | ||
|
||
# References | ||
|
||
* [chrony FAQ](https://chrony-project.org/faq.html) | ||
* [chrony.conf man page](https://chrony-project.org/doc/4.6.1/chrony.conf.html) |
Review comment:
It would be good if the minimum number of active servers could be defined so that the synchronization is considered correct. For high requirements, 3 servers should be the minimum requirement.
It would also be good if the details of the time synchronization status (e.g. offset, jitter, delay, count) could be monitored from the network. As far as I know there is currently no Prometheus exporter included in SONiC, SNMP would probably be the only option here.
Reply:
For the first part, while specifying 3 servers is required for a high level of time accuracy, I'd rather not make it a minimum requirement within SONiC, only because if it is used in an environment that either doesn't need a high level of time accuracy or doesn't have 3 servers available, then that requirement won't be met.
For the second part, I agree on exposing the metrics somehow. I think the current standard we have is to publish the data into STATE_DB, which should make it easier to get exported elsewhere. However, this would involve having (at minimum) another daemon polling chrony frequently, and publishing the data into STATE_DB, and the scope of this effort is already large enough. Can you open an enhancement request for this?
Reply:
About the minimum servers required: I fully agree that there are environments where having accurate time is not that important. My recommendation was to make the minimum number of servers "definable", then it's up to the operator what is needed.
Regarding metrics topic:
Yes of course, see #1857