Bootc and RHEL Image Mode

TL;DR: Red Hat announced this at Summit, I thought it was finally ready for prod, started using it everywhere, realized how underdeveloped the ecosystem is, and decided to dump my experiences here.

IMHO some, if not most, of the stuff here, such as the transient /etc woes, should be handled by the base images themselves rather than left to the user with minimal warning.

These were originally written as notes/brain dump. The context is that I am trying to make a Slurm/Jupyter notebook cluster for my student org.

Use transient everything, especially transient /etc

  • Files created manually stay in /etc forever; a container upgrade can't edit or delete them.
  • Don't want to lose control over system, container should be source of truth.
  • Transient /etc (and everything else) is the only way to ensure changes to the container ACTUALLY propagate, get installed, and work.
    • Harder to reason about change propagation in non-transient systems.
  • No point using this technology without a transient/composefs root, i.e. /
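
A minimal sketch of switching /etc to transient from the Containerfile, assuming the etc.transient key described in ostree-prepare-root(1). The ROOT variable is a hypothetical convenience for staging into a scratch directory, and the snippet appends rather than overwrites so that whatever the base image already ships in the file is kept.

```shell
# Stage into a scratch directory by default; set ROOT=/ inside a Containerfile RUN step.
# ROOT is a hypothetical helper variable, not something bootc or ostree reads.
root="${ROOT:-$(mktemp -d)}"
conf="$root/usr/lib/ostree/prepare-root.conf"
mkdir -p "$(dirname "$conf")"

# Append an [etc] section; key name taken from ostree-prepare-root(1).
cat >> "$conf" <<'EOF'

[etc]
transient = true
EOF

echo "updated $conf"
```

Since this file is read from the initramfs, the initramfs most likely has to be regenerated in the same build for the setting to take effect.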

Transient etc gotchas

  • Note that SELinux will have a stroke because of all this /etc and /var business. I turn it off completely.
  • Note also that while /etc is transient at runtime, it is mutable and persistent at container build time, i.e. in the Containerfile.
  • SSH Host keys will re-gen on reboot!
    • Use HostKey directive in config, store keys in /var
    • I made a service override for sshd-keygen@.service to call my script, which is a copy of the original but makes the keys in /var
      • Note that at least on centos-stream9-bootc a systemd target sshd-keygen.target exists, which wants the units sshd-keygen@ed25519 etc. An alternative is to kill this target altogether and make the keys manually with a shell script.
  • Machine ID is lost
    • Apparently this is important for systemd
    • Use tmpfiles.d to store initial machine id in var, make symlink.
  • Repeat for /etc/hostname. I'd recommend injecting the hostname into /var from an installer shell script, as the kickstart/anaconda settings will be lost (more on this later).
  • Use systemd-firstboot to set up locale, keymap, etc.
    • Note that this can't be used for hostname as that is different per-machine.
    • Install dnf package langpacks-en to get locale en_US.UTF-8
  • Use /var for machine specific files
  • Systemd Tricks:
    • Use /etc/systemd/system/foo.service.d/something.conf to override existing services.
    • While overriding, list-type settings must first be set to blank, otherwise the new value is appended instead of replacing the old one. Eg.

      ExecStart=
      ExecStart=new-script.sh
      
    • Run systemctl enable in Containerfile to create symlinks for activating on boot
      • Alternative is to manually make symlinks under multi-user.target.wants dir etc.
    • Manually make .mount files as kickstart is kinda ass.
    • Mount unit file names MUST match the escaped form of the Where= path! Use systemd-escape -p /some/path to name units!
      • Eg. mount for Where=/boot/efi must be named boot-efi.mount
      • Failing to do this will cause systemd to "refuse" the mount.
    • Use ConditionPathExists and co in the Unit section to make one-time files like ssh host keys
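
The host key and hostname points above can be sketched roughly as follows. The state directory /var/lib/host-state, the drop-in file names, and the keygen script /usr/libexec/sshd-keygen-var are all hypothetical names chosen for illustration; ROOT is a staging helper so the snippet can be tried outside an image build.

```shell
# Stage into a scratch directory by default; set ROOT=/ inside a Containerfile RUN step.
root="${ROOT:-$(mktemp -d)}"
state=/var/lib/host-state   # hypothetical directory for per-machine state
mkdir -p "$root/usr/lib/tmpfiles.d" \
         "$root/etc/ssh/sshd_config.d" \
         "$root/etc/systemd/system/sshd-keygen@.service.d"

# Recreate the /etc/hostname symlink on every boot, pointing into persistent /var.
# The same L+ line shape would apply to /etc/machine-id.
cat > "$root/usr/lib/tmpfiles.d/host-state.conf" <<EOF
d  $state        0755 root root -
d  $state/ssh    0700 root root -
L+ /etc/hostname -    -    -    - $state/hostname
EOF

# Tell sshd to read its host key from /var instead of /etc/ssh.
cat > "$root/etc/ssh/sshd_config.d/10-hostkeys.conf" <<EOF
HostKey $state/ssh/ssh_host_ed25519_key
EOF

# Drop-in for the keygen unit: skip when the key already exists, else run the custom script.
cat > "$root/etc/systemd/system/sshd-keygen@.service.d/10-var.conf" <<EOF
[Unit]
# An empty assignment resets the conditions inherited from the stock unit.
ConditionFileNotEmpty=
ConditionPathExists=!$state/ssh/ssh_host_%i_key

[Service]
# List-type settings are blanked first, then replaced.
ExecStart=
ExecStart=/usr/libexec/sshd-keygen-var %i
EOF
```

Populating the hostname file itself is left to the installer script, since it differs per machine.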

Kickstart gotchas

  • Transient etc will make stuff from kickstart NOT persist! Eg root password, hostname, fstab
  • Use kickstart to make partitions, install and run a shell script to inject secrets and state like hostname to /var, then use tmpfiles.d to make symlinks!
  • Do NOT refer to the PyKickstart manual, use the RHEL 9 docs Appendix!
  • Kickstart with autopart will happily make you a BIOS MBR system. MUST do manual partitioning!
  • Hack: Use file system labels while making partitions in kickstart, then use What=LABEL=foo in systemd mount file.
  • Ostree/bootc commands will NOT work at all if /boot/loader is missing.
    • You can have an EFI-booted system, but that does not mean /boot and/or /boot/efi is mounted at runtime.
  • Do NOT use bootloader verb! It either does the wrong thing, or I am holding it wrong…
  • Manually specify file system type unless cool with default xfs.
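
As an illustration of the label hack, here is a sketch that writes a mount unit for /boot/efi and enables it by hand. It assumes the partition was given the label EFI in kickstart (a hypothetical label), and ROOT is a staging helper variable rather than anything systemd knows about.

```shell
# Stage into a scratch directory by default; set ROOT=/ inside a Containerfile RUN step.
root="${ROOT:-$(mktemp -d)}"
unit_dir="$root/etc/systemd/system"
mkdir -p "$unit_dir/local-fs.target.wants"

# The file name is the escaped form of Where=, i.e. what
#   systemd-escape -p --suffix=mount /boot/efi
# produces: boot-efi.mount
cat > "$unit_dir/boot-efi.mount" <<'EOF'
[Unit]
Description=EFI system partition (located by filesystem label)

[Mount]
What=LABEL=EFI
Where=/boot/efi
Type=vfat

[Install]
WantedBy=local-fs.target
EOF

# Equivalent of `systemctl enable boot-efi.mount`, done with a manual symlink.
ln -sf ../boot-efi.mount "$unit_dir/local-fs.target.wants/boot-efi.mount"
```

Going through the label keeps the unit identical across machines, since nothing in it depends on device names or UUIDs.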

OSTree observations

  • Whole system is a web of (bind) mounts. Bind mounts in bootc = symlinks in NixOS.
  • Systemd service ostree-finalize-staged.service runs on shutdown; it "finalizes" the state of /etc and does the 3-way merge stuff.
    • Stop this service if you want to forget changes to etc (when in non transient etc)
    • This runs the currently undocumented command ostree admin finalize-staged
  • /var is populated once at first boot and never again. Do NOT rely on OSTree to fix your mistakes in /var!
  • That said, a pristine /var does exist under /sysroot or /ostree for manual rsync'ing.
  • For non transient etc, ostree admin config-diff will show how etc has diverged from the container image.
  • Current OSTree tooling is too poor to reliably rollback commits and/or restore files IMO. In the future ideally it should let you deal with the rootfs like a git repo.

Containerfile and general tips

  • What's a Dockerfile?
  • Build layer caching and invalidation works top down, ie if a layer at the TOP of the file has changed, all layers below it are rebuilt and not taken from cache!
    • Order layers by increasing likelihood of change, with stuff that you won't need to change towards the top
    • Work in progress - Use multi-stage builds, maybe docker buildkit.
  • There are no background daemons running in the container build.
    • Make systemd oneshot units to run commands that need a daemon or running system.
  • Use drop-in conf.d directories for all software that supports them, which is mostly all relevant software. Makes it easier to manage, and in case of non transient etc probably helps with the 3-way going in your favour.
  • MAKE SURE to match or be mindful of file permissions (mode and ownership) created via the container. Software may REQUIRE particular permissions!
    • Use chown/chmod --reference= to copy the ownership and mode of an existing file.
  • Add Kernel cmdline arguments by creating toml files with key "kargs" and value string array in /usr/lib/bootc/kargs.d/
    • A good one is:

      # /usr/lib/bootc/kargs.d/10-selinux.toml
      kargs = ["selinux=0"]
      
  • Remember to chmod +x stuff you want to execute! I do it in the Containerfile.
  • Use usermod --password to set the password hash for the root user.
  • DKMS will not work at runtime due to immutability. Use this in Containerfile:

    kver=$(cd /usr/lib/modules; echo *)
    /usr/sbin/dkms autoinstall --verbose --kernelver "$kver"
    

    This will ensure dkms modules are compiled and installed at build time.

    • dkms.service is disabled by default!
  • Work in progress: Do not use realmd as it wants to edit a bunch of files in immutable/transient directories.
    • Configure sssd manually
    • Use adcli to do the domain joining, set location for krb5.keytab in /var
    • Figure a way to run authselect in Containerfile, or perhaps on every boot.
    • Investigate realmd install root cli option.
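
The chown/chmod --reference tip can be tried in isolation with a sketch like this one. Both file names are hypothetical stand-ins created in a scratch directory; in an image build the reference would be a file that still carries the permissions the package shipped.

```shell
work="$(mktemp -d)"
ref="$work/existing.conf"      # stands in for a file shipped by a package
new="$work/replacement.conf"   # stands in for a file copied in by the Containerfile

printf 'old\n' > "$ref"; chmod 0600 "$ref"
printf 'new\n' > "$new"; chmod 0644 "$new"

# Copy ownership first, then mode, from the reference file (GNU coreutils options).
chown --reference="$ref" "$new"
chmod --reference="$ref" "$new"

# Show both files side by side for a quick visual check.
stat -c '%a %U:%G %n' "$ref" "$new"
```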

Nvidia Misery

  • Containerfile from fedora bootc examples doesn't work (for me at least).
    • Doesn't install the actual kernel module
  • Enable cuda-rhel9 repo, and enable module nvidia-driver:latest-dkms
    • Non-DKMS did not work for me, gave ugly dependency hell errors.
  • Install packages "nvidia-driver-cuda" and "kmod-nvidia-latest-dkms"
    • These are the proprietary ones, haven't looked into the open ones.
  • Build DKMS module using aforementioned trick
  • Add kernel cmdline argument to blacklist nouveau
  • Probably rebuild initramfs.
  • TODO: Investigate ublue akmods repo.
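
Put together, the package steps above might look like the sketch below. It only prints the commands unless APPLY=1 is set, so it can be read as a checklist. The repo URL is an assumption (NVIDIA's public CUDA repo for RHEL 9 on x86_64), and run, nvidia_steps, APPLY and KVER are hypothetical names.

```shell
# Dry run by default: print each step. Set APPLY=1 in the Containerfile RUN step to execute.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi
}

nvidia_steps() {
  # Assumed repo URL; `dnf config-manager` comes from dnf-plugins-core.
  run dnf config-manager --add-repo \
    https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
  run dnf module enable -y nvidia-driver:latest-dkms
  run dnf install -y nvidia-driver-cuda kmod-nvidia-latest-dkms

  # Build for the kernel installed in the image, not the one running on the build host.
  kver="${KVER:-$(cd /usr/lib/modules 2>/dev/null && echo *)}"
  run dkms autoinstall --verbose --kernelver "$kver"
}

nvidia_steps
```

The nouveau blacklist and any initramfs rebuild are left out here and would still need to be handled separately.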