Add support for raid10
This removes the wait block for raid resync for two reasons:
1) raid0 does not have redundancy and therefore no initial resync[1]
2) with raid10, the initial resync of 4x 1.9TB disks takes anywhere from tens of
   minutes to multiple hours, depending on the sysctl params `dev.raid.speed_limit_min`
   and `dev.raid.speed_limit_max` and the speed of the disks. The initial resync for
   raid10 is not strictly needed[1]
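The resync window claimed above can be sanity-checked with simple arithmetic: the md resync is throttled between `dev.raid.speed_limit_min` and `dev.raid.speed_limit_max` (KiB/s per device, with Linux defaults of 1000 and 200000). A rough sketch, assuming the resync runs pinned at the limit; the helper name is illustrative, not from this change:

```shell
# Approximate per-disk resync duration at a given throttle, in whole hours.
# Assumes the resync runs at exactly the limit for the full disk.
resync_hours() {
  local bytes_per_disk=$1 kib_per_sec=$2
  echo $(( bytes_per_disk / (kib_per_sec * 1024) / 3600 ))
}

resync_hours 1900000000000 1000   # prints 515: 1.9 TB at the 1000 KiB/s default floor is ~21 days
resync_hours 1900000000000 200000 # prints 2: ~2.5 hours at the 200000 KiB/s default ceiling
```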

Filesystem creation: by default, `mkfs.xfs` attempts to TRIM the drive. This can
also take tens of minutes or hours, depending on the size of the drives. TRIM can
be skipped, as instances are delivered with disks fully trimmed[2].

[1] https://raid.wiki.kernel.org/index.php/Initial_Array_Creation
[2] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html#InstanceStoreTrimSupport
lassizci committed Oct 15, 2024
1 parent 813af95 commit ec5369d
Showing 7 changed files with 41 additions and 16 deletions.
6 changes: 6 additions & 0 deletions doc/usage/al2.md
@@ -171,6 +171,12 @@ A RAID-0 array is setup that includes all ephemeral NVMe instance storage disks.

Another way of utilizing the ephemeral disks is to format and mount the individual disks. Mounting individual disks allows the [local-static-provisioner](https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner) DaemonSet to create Persistent Volume Claims that pods can utilize.

### Experimental: RAID-10 Kubelet and Containerd (raid10)

Similar to the RAID-0 array, it is possible to use a RAID-10 array on instance types with four or more ephemeral NVMe instance storage disks. RAID-10 tolerates the failure of at most two disks, provided they are in different mirrored pairs. However, individual ephemeral disks cannot be replaced, so the purpose of the redundancy is to make graceful decommissioning of a node possible.
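As a sizing aid: RAID-10 mirrors pairs of disks, so usable capacity is half the raw total. A minimal sketch, with an illustrative helper name (not part of this change):

```shell
# Usable RAID-10 capacity for n equally sized disks: n/2 striped mirrored pairs.
raid10_usable_gb() {
  local disks=$1 gb_per_disk=$2
  echo $(( disks / 2 * gb_per_disk ))
}

raid10_usable_gb 4 1900   # prints 3800: 4x 1.9 TB disks yield ~3.8 TB usable
```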

RAID-10 can be enabled by passing `--local-disks raid10` flag to the bootstrap script.
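Under the hood, the setup path simply validates the flag value before acting on it. A standalone sketch of that check (function name is illustrative; the message mirrors the one in `setup-local-disks`):

```shell
# Accept only the disk setup modes the scripts understand.
validate_local_disks() {
  case "$1" in
    raid0 | raid10 | mount | none) return 0 ;;
    *)
      echo "Valid disk setup options are: raid0, raid10, mount, or none" >&2
      return 1
      ;;
  esac
}

validate_local_disks raid10 && echo "raid10 accepted"   # prints "raid10 accepted"
```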

---

## Version-locked packages
5 changes: 4 additions & 1 deletion nodeadm/api/v1alpha1/nodeconfig_types.go
@@ -94,13 +94,16 @@ type LocalStorageOptions struct {
}

// LocalStorageStrategy specifies how to handle an instance's local storage devices.
// +kubebuilder:validation:Enum={RAID0, Mount}
// +kubebuilder:validation:Enum={RAID0, RAID10, Mount}
type LocalStorageStrategy string

const (
// LocalStorageRAID0 will create a single raid0 volume from any local disks
LocalStorageRAID0 LocalStorageStrategy = "RAID0"

// LocalStorageRAID10 will create a single raid10 volume from any local disks. Requires a minimum of 4 disks.
LocalStorageRAID10 LocalStorageStrategy = "RAID10"

// LocalStorageMount will mount each local disk individually
LocalStorageMount LocalStorageStrategy = "Mount"
)
1 change: 1 addition & 0 deletions nodeadm/crds/node.eks.aws_nodeconfigs.yaml
@@ -107,6 +107,7 @@ spec:
an instance's local storage devices.
enum:
- RAID0
- RAID10
- Mount
type: string
type: object
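For nodeadm users, the new enum value would be selected from a NodeConfig such as the following (a hypothetical sketch; the field path follows the `node.eks.aws/v1alpha1` schema shown in this diff):

```yaml
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  instance:
    localStorage:
      strategy: RAID10
```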
2 changes: 1 addition & 1 deletion nodeadm/doc/api.md
@@ -92,7 +92,7 @@ _Appears in:_
- [LocalStorageOptions](#localstorageoptions)

.Validation:
- Enum: [RAID0 Mount]
- Enum: [RAID0 RAID10 Mount]

#### NodeConfig

1 change: 1 addition & 0 deletions nodeadm/internal/api/types.go
@@ -101,6 +101,7 @@ type LocalStorageStrategy string

const (
LocalStorageRAID0 LocalStorageStrategy = "RAID0"
LocalStorageRAID10 LocalStorageStrategy = "RAID10"
LocalStorageMount LocalStorageStrategy = "Mount"
)

2 changes: 1 addition & 1 deletion templates/al2/runtime/bootstrap.sh
@@ -32,7 +32,7 @@ function print_help {
echo "--enable-local-outpost Enable support for worker nodes to communicate with the local control plane when running on a disconnected Outpost. (true or false)"
echo "--ip-family Specify ip family of the cluster"
echo "--kubelet-extra-args Extra arguments to add to the kubelet. Useful for adding labels or taints."
echo "--local-disks Setup instance storage NVMe disks in raid0 or mount the individual disks for use by pods [mount | raid0]"
echo "--local-disks Setup instance storage NVMe disks in a raid0 or raid10 array, or mount the individual disks for use by pods <mount | raid0 | raid10>"
echo "--mount-bpf-fs Mount a bpffs at /sys/fs/bpf (default: true)"
echo "--pause-container-account The AWS account (number) to pull the pause container from"
echo "--pause-container-version The tag of the pause container"
40 changes: 27 additions & 13 deletions templates/shared/runtime/bin/setup-local-disks
@@ -15,7 +15,7 @@ err_report() {
trap 'err_report $LINENO' ERR

print_help() {
echo "usage: $0 <raid0 | mount | none>"
echo "usage: $0 <raid0 | raid10 | mount | none>"
echo "Sets up Amazon EC2 Instance Store NVMe disks"
echo ""
echo "-d, --dir directory to mount the filesystem(s) (default: /mnt/k8s-disks/)"
@@ -26,11 +26,18 @@ print_help() {
echo "-h, --help print this help"
}

# Sets up a RAID-0 of NVMe instance storage disks, moves
# the contents of /var/lib/kubelet and /var/lib/containerd
# Sets up a RAID-0 or RAID-10 of NVMe instance storage disks,
# moves the contents of /var/lib/kubelet and /var/lib/containerd
# to the new mounted RAID, and bind mounts the kubelet and
# containerd state directories.
maybe_raid0() {
#
# Do not wait for the initial resync: raid0 has no redundancy, so there
# is no initial resync. Raid10 does not strictly need a resync, and
# the time taken for a 4x 1.9TB disk raid10 would be in the range of
# 20 minutes to 20 days, depending on the dev.raid.speed_limit_min and
# dev.raid.speed_limit_max sysctl parameters.
maybe_raid() {
local raid_level="$1"
local md_name="kubernetes"
local md_device="/dev/md/${md_name}"
local md_config="/.aws/mdadm.conf"
@@ -40,14 +47,10 @@
if [[ ! -s "${md_config}" ]]; then
mdadm --create --force --verbose \
"${md_device}" \
--level=0 \
--level="${raid_level}" \
--name="${md_name}" \
--raid-devices="${#EPHEMERAL_DISKS[@]}" \
"${EPHEMERAL_DISKS[@]}"
while [ -n "$(mdadm --detail "${md_device}" | grep -ioE 'State :.*resyncing')" ]; do
echo "Raid is resyncing..."
sleep 1
done
mdadm --detail --scan > "${md_config}"
fi

@@ -63,7 +66,8 @@
## for the log stripe unit, but the max log stripe unit is 256k.
## So instead, we use 32k (8 blocks) to avoid a warning of breaching the max.
## mkfs.xfs defaults to 32k after logging the warning since the default log buffer size is 32k.
mkfs.xfs -l su=8b "${md_device}"
## Instances are delivered with disks fully trimmed, so TRIM is skipped at creation time.
mkfs.xfs -K -l su=8b "${md_device}"
fi

## Create the mount directory
@@ -231,8 +235,8 @@ set -- "${POSITIONAL[@]}" # restore positional parameters
DISK_SETUP="$1"
set -u

if [[ "${DISK_SETUP}" != "raid0" && "${DISK_SETUP}" != "mount" && "${DISK_SETUP}" != "none" ]]; then
echo "Valid disk setup options are: raid0, mount, or none"
if [[ "${DISK_SETUP}" != "raid0" && "${DISK_SETUP}" != "raid10" && "${DISK_SETUP}" != "mount" && "${DISK_SETUP}" != "none" ]]; then
echo "Valid disk setup options are: raid0, raid10, mount, or none"
exit 1
fi

@@ -256,11 +260,21 @@ fi
## Get devices of NVMe instance storage ephemeral disks
EPHEMERAL_DISKS=($(realpath "${disks[@]}" | sort -u))

## Also bail early if there are not enough disks for raid10
if [[ "${DISK_SETUP}" == "raid10" && "${#EPHEMERAL_DISKS[@]}" -lt 4 ]]; then
echo "raid10 requires at least 4 disks, but only ${#EPHEMERAL_DISKS[@]} found, skipping disk setup"
exit 0
fi

case "${DISK_SETUP}" in
"raid0")
maybe_raid0
maybe_raid 0
echo "Successfully setup RAID-0 consisting of ${EPHEMERAL_DISKS[@]}"
;;
"raid10")
maybe_raid 10
echo "Successfully setup RAID-10 consisting of ${EPHEMERAL_DISKS[@]}"
;;
"mount")
maybe_mount
echo "Successfully setup disk mounts consisting of ${EPHEMERAL_DISKS[@]}"
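The four-disk minimum enforced in the script can be exercised in isolation. A standalone sketch (the function name is illustrative; the message mirrors the script's):

```shell
# Bail out of raid10 setup when fewer than four ephemeral disks are present.
require_raid10_disks() {
  if [ "$1" -lt 4 ]; then
    echo "raid10 requires at least 4 disks, but only $1 found, skipping disk setup"
    return 1
  fi
}

require_raid10_disks 3 || true   # prints the skip message and signals the caller
require_raid10_disks 4           # four or more disks: no output, success
```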
