Bootc and Rhel Image Mode
Table of Contents
TL;DR: RedHat announced this at summit, I thought its finally ready for prod, started using it everywhere and realized how underdeveloped the ecosystem is, and decided to dump my experiences here.
IMHO some if not most of the stuff here, such as the transient etc woes should be handled by the base images themselves, rather than leaving them out for the user with minimal warning.
These were originally written as notes/brain dump. The context is that I am trying to make a Slurm/Jupyter notebook cluster for my student org.
Use transient everything, especially transient /etc
- Files created manually stay in /etc forever, can't edir or delete via container upgrade.
- Don't want to lose control over system, container should be source of truth.
- Transient etc (and everything else) is the only way to ensure changes to the
container ACTUALLY propogate, are installed and functional.
- Harder to reason about change propogation in non-transient systems.
- No point using this technology without transient/composefs root ie
/
Transient etc gotchas
- Note that SELinux will have a stroke because of all this /etc and /var business. I turn it off completely.
- Note also that while etc is transient at runtime, it is mutable and persistent and container build time, ie in the Containerfile
- SSH Host keys will re-gen on reboot!
- Use HostKey directive in config, store keys in /var
- I made a service override for
sshd-keygen@.service
to call my script, which is a copy of the original but makes the keys in /var- Note that atleast on centos-stream9-bootc a systemd target
sshd-keygen.target
exists, whichwants
the unitssshd-keygen@ed25519
etc. An alternative is to kill this altogether target and make keys manually with a shell script.
- Note that atleast on centos-stream9-bootc a systemd target
- Machine ID is lost
- Apparently this is important for systemd
- Use tmpfiles.d to store initial machine id in var, make symlink.
- Repeat for /etc/hostname. I'd recommend injecting the hostname to /var from a installer shell script as the kickstart/anaconda settings will be lost (more on this later)
- Use
systemd-firstboot
to setup locale and keymap and etc.- Note that this can't be used for hostname as that is different per-machine.
- Install dnf package
langpacks-en
to get localeen_US.UTF-8
- Use /var for machine specific files
- Systemd Tricks:
- Use
/etc/systemd/service/foo.service.d/something.conf
to override existing services. While overriding, need to set list-able value to blank to overwrite. Eg.
ExecStart= ExecStart=new-script.sh
- Run systemctl enable in Containerfile to create symlinks for activating on
boot
- Alternative is to manually make symlinks under
multi-user.target.wants
dir etc.
- Alternative is to manually make symlinks under
- Manually make .mount files as kickstart is kinda ass.
- Mount file names MUST be the same as the
Where=
field! Usesystemd-escape -p /some/path
to name units!- Eg. mount for
Where=/boot/efi
must be namedboot-efi.mount
- Failing to do this will cause systemd to "refuse" the mount.
- Eg. mount for
- Use
ConditionFileExists
and co in Unit section to make one-time files like ssh host keys
- Use
Kickstart gotchas
- Transient etc will make stuff from kickstart NOT persist! Eg root password, hostname, fstab
- Use kickstart to make partitions, install and run a shell script to inject secrets and state like hostname to /var, then use tmpfiles.d to make symlinks!
- Do NOT refer to PyKickstart manual, use Rhel9 docs Appendix!
- Kickstart with autopart will happily make you a BIOS MBR system. MUST do manual partitioning!
- Hack: Use file system labels while making partitions in kickstart, then use
What=LABEL=foo
in systemd mount file. - Ostree/bootc commands will NOT work at all if
/boot/loader
is missing.- You can have a efi booted system, but that does not mean /boot and/or /boot/efi is mounted at runtime.
- Do NOT use
bootloader
verb! It either does the wrong thing, or I am holding it wrong… - Manually specify file system type unless cool with default xfs.
OSTree observations
- Whole system is a web of (bind) mounts. Bind mounts in bootc = symlinks in NixOS.
- Systemd service
ostree-finalize-staged.service
runs on shutdown, which "finalizes" the state of /etc and doing the 3-way merge stuff.- Stop this service if you want to forget changes to etc (when in non transient etc)
- This runs the currently undocumented command
ostree admin finalize-staged
- /var is populated once at first boot and never again. Do NOT rely on OSTree to fix your mistakes in /var!
- That said, a pristine /var does exist under /sysroot or /ostree for manual rsync'ing.
- For non transient etc,
ostree admin config-diff
will show how etc has diverged from the container image. - Current OSTree tooling is too poor to reliably rollback commits and/or restore files IMO. In the future ideally it should let you deal with the rootfs like a git repo.
Containerfile and general tips
- What's a Dockerfile?
- Build layer caching and invalidation works top down, ie if something a layer
at the TOP of the file has changed, all layers below are rebuilt and not used
from cache!
- Make layers in Decreasing order of likelihood of change - with stuff that you wont need to change towards the top
- Work in progress - Use multi-stage builds, maybe docker buildkit.
- There are no background daemons running in the container build.
- Make systemd oneshot units to run commands that need a daemon or running system.
- Use drop-in
conf.d
directories for all software that supports them, which is mostly all relevant software. Makes it easier to manage, and in case of non transient etc probably helps with the 3-way going in your favour. - MAKE SURE to match or be mindful of file permissions (mode and ownership)
created via the container. Software may REQUIRE particular permissions!
- Use chown/chmod
--reference=
to copy mode and perms of existing file.
- Use chown/chmod
- Add Kernel cmdline arguments by creating toml files with key "kargs" and value
string array in
/usr/lib/bootc/kargs.d/
A good one is:
# /usr/lib/bootc/kargs.d/10-selinux.conf kargs = ["selinux=0"]
- Remember to
chmod +x
stuff you want to execute! I do it in the Containerfile. - Use
usermod --password
to set the password hash for the root user. DKMS will not work at runtime due to immutability. Use this in Containerfile:
kver=$(cd /usr/lib/modules; echo *); /usr/sbin/dkms autoinstall --verbose --kernelver "$kver"
This will ensure dkms modules are compiled and installed at build time.
- dkms.service is disabled by default!
- Work in progress: Do not use
realmd
as it wants to edit a bunch of files in immutable/transient directories.- Configure sssd manually
- Use
adcli
to do the domain joining, set location for krb5.keytab in /var - Figure a way to run
authselect
in Containerfile, or perhaps on every boot. - Investigate realmd install root cli option.
Nvidia Misery
- Containerfile from fedora bootc examples doesn't work (for me atleast).
- Doesn't install the actual kernel module
- Enable cuda-rhel9 repo, and enable module
nvidia-driver:latest-dkms
- Non-DKMS did not work for me, gave ugly dependency hell errors.
- Install packages "nvidia-driver-cuda" and "kmod-nvidia-latest-dkms"
- These are the proprietary ones, haven't looked into the open ones.
- Build DKMS module using aforementioned trick
- Add kernel cmdline argument to blacklist nouveau
- Probably rebuild initramfs.
- TODO: Investigate ublue akmods repo.