fromthecodefront: systemd pitfalls

This is a loose collection of various issues you might encounter when using/deploying systemd.

Disclaimer

Systemd is very much a moving target, with a relatively spotty history of communicating both existing semantics and breaking changes. Even critical security issues have in the past been fixed without notice and without requesting CVEs, and people who suggested doing so were attacked.

Thus whatever you read here might already be out of date.

I try to keep this list --sometimes through repeated editing-- at least somewhat neutral. I might not always manage. Please accept my apologies.

Because there are so many of them: Missing or incomplete documentation is indicated by 📚. Note that this does not mean that an issue can be fixed by only updating the documentation.

The Journal

journald completely disregards RFC 5424, section 6.3: it neither picks up structured log data nor forwards its own structured data. It does forward existing structured syslog data by virtue of leaving log messages unaltered.
journald makes it impossible for syslog implementations to pick up trusted metadata via the kernel. Since it imposes itself between the syslog daemon and the logging service, all kernel-obtainable metadata describes the journal daemon rather than the logging service. If you need that metadata, you must interface with journald. (workaround module for rsyslog, which has trouble with corrupted journal files).
The journal's query API is essentially a reverse polish datalog query builder with fixed ~~three~~five level nesting and a fixed operator type at each level. The API mixes parsed with non-parsed operations instead of providing a query language or criteria construction engine.
When it is stopped, journald stores its file descriptors in PID1 via sd_pid_notify_with_fds(). As this is not implemented as a reliable transmission, journald restarts have a chance of losing all logging streams.
Journald automatically attempts to set nocow if /var/log/journal is on a btrfs filesystem 📚. Nocow also disables btrfs checksumming and thus potential data recovery from multiple block copies. This is not mitigated by the journal's limited checksumming. Copy-on-write is re-enabled when a journal file is put offline 📚.
📚 The journal file format description is --to this date-- still incomplete. There is no mention of --for example-- sealing and LZ4 compression.
📚 The journal still seems to strip white space from log messages before forwarding them to syslog. This means that, for example, multiline log entries that use whitespace indentation as a continuation marker are mangled.
When /var/log/journal resides on a separate filesystem, journald might create the journal in one of the parent filesystems and then mount /var{,/log{,/journal}} over that location, making the journal inaccessible during runtime. To fix this, you need to make journald wait for the mount point. Waiting for the directory using a .path unit might not work, since it is journald that creates the directory with Storage=persistent.
journalctl "-r" does not combine well with "-n" and does the wrong thing.
Journald's timestamps do not necessarily record when an event happened, but when the journal daemon processed its queue. If journald gets less CPU time for whatever reason, the time shown in the journal will differ from the time the event actually occurred (thanks to @rt2800pci1).
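To get a feeling for how large that mismatch is on a given system, the following is a minimal sketch, not an official tool. It assumes records as emitted by `journalctl -o json` (one JSON object per line, timestamps as decimal strings in microseconds) and it only works for entries that carry a `_SOURCE_REALTIME_TIMESTAMP` next to `__REALTIME_TIMESTAMP`. Many entries have no source timestamp at all. The function name and the sample record are made up for illustration.

```python
import json


def delivery_lag_us(json_line):
    """Return reception time minus source time in microseconds, or None.

    Assumes one record from `journalctl -o json`. Entries without a
    source-side timestamp cannot be compared and yield None.
    """
    entry = json.loads(json_line)
    source = entry.get("_SOURCE_REALTIME_TIMESTAMP")
    received = entry.get("__REALTIME_TIMESTAMP")
    if source is None or received is None:
        return None
    return int(received) - int(source)


# Made-up sample record, not taken from a real journal.
sample = (
    '{"__REALTIME_TIMESTAMP": "1501500000250000", '
    '"_SOURCE_REALTIME_TIMESTAMP": "1501500000000000", '
    '"MESSAGE": "hello"}'
)
print(delivery_lag_us(sample))
```

Feeding it the lines of `journalctl -o json` and looking at the largest values should show how far the journal's notion of time drifts from the sender's under load.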

Documentation and "Closed Design" Issues

📚 The journald query API is documented by completely avoiding any of the well-known jargon (conjunctive query, predicate, variable, literal, atom) for database/datalog queries and instead uses custom idioms.
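For readers who think in the usual terms, here is a rough model of how I read the match semantics in sd_journal_add_match(3): a conjunction of disjunctions of conjunctions across fields, where each field holds a disjunction of literals. This is a sketch under that assumption, not a binding to the real API. The class `MatchModel` is hypothetical, its method names merely mirror the C calls, and it simplifies entries to one value per field.

```python
class MatchModel:
    """Toy model of the journal's fixed query shape (hypothetical class).

    level 0: AND of groups       (split by add_conjunction)
    level 1: OR of terms         (split by add_disjunction)
    level 2: AND across fields   (implicit for different fields)
    level 3: OR of values        (implicit for the same field)
    level 4: the FIELD=value literals themselves
    """

    def __init__(self):
        self.groups = [[{}]]

    def add_match(self, literal):
        field, _, value = literal.partition("=")
        self.groups[-1][-1].setdefault(field, set()).add(value)

    def add_disjunction(self):
        # Start a new OR term inside the current AND group.
        self.groups[-1].append({})

    def add_conjunction(self):
        # Start a new top-level AND group.
        self.groups.append([{}])

    def matches(self, entry):
        def term_ok(term):
            return all(entry.get(f) in values for f, values in term.items())

        # Empty terms and empty groups are skipped, so a query without
        # any literal matches every entry.
        return all(
            any(term_ok(term) for term in group if term)
            for group in self.groups
            if any(group)
        )


query = MatchModel()
query.add_match("_SYSTEMD_UNIT=a.service")
query.add_match("PRIORITY=3")
query.add_match("PRIORITY=4")
query.add_disjunction()
query.add_match("_SYSTEMD_UNIT=b.service")
# Reads as: (unit = a AND priority in {3, 4}) OR unit = b
print(query.matches({"_SYSTEMD_UNIT": "a.service", "PRIORITY": "3"}))
```

Note that the caller never chooses the operator. It is determined by the nesting level and by whether two literals name the same field.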

Careless Maintainership

StartLimitInterval= was silently moved from the [Service] to the [Unit] section. Compatibility is provided for StartLimitInterval=, but not for StartLimitIntervalSec=. 📚 Documentation for such changes does not appear where it should.
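Since such a misplaced setting is easy to overlook, a small check over your own unit files can help. The following is a minimal sketch that only encodes the rule as described above. The function name is made up, and the parser is deliberately naive: it ignores line continuations and drop-ins.

```python
def check_start_limit(unit_text):
    """Report StartLimitInterval*= keys that sit in [Service].

    Returns (line_number, key, note) tuples. Naive line-based parsing:
    comments and blank lines are skipped, continuations are not handled.
    """
    findings = []
    section = None
    for number, raw in enumerate(unit_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        key = line.split("=", 1)[0].strip()
        if section != "Service":
            continue
        if key == "StartLimitIntervalSec":
            findings.append((number, key, "no compatibility here, move to [Unit]"))
        elif key == "StartLimitInterval":
            findings.append((number, key, "legacy placement, kept for compatibility"))
    return findings


# Made-up unit file for illustration.
unit = """\
[Unit]
Description=demo

[Service]
ExecStart=/bin/true
StartLimitIntervalSec=10
StartLimitInterval=10
"""
for finding in check_start_limit(unit):
    print(finding)
```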

POLA violations

$X and ${X} expand differently in unit files. The former does word splitting and quote stripping, while the latter does neither. This is different from basically everywhere else.
systemctl stop $service && systemctl start $service may not be the same as systemctl restart $service. PID1 only keeps file descriptors as long as it holds a reference to a unit (documented as "unit not fully stopped"), which is no longer the case after "stop", hence all stored file descriptors are lost.
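The difference between the two expansion forms is easiest to see side by side. The following is a toy model of the behaviour described above, not systemd's actual parser: `shlex.split` stands in for the word splitting and quote stripping, and both function names are made up.

```python
import re
import shlex

NAME = r"[A-Za-z_][A-Za-z0-9_]*"


def expand_argument(arg, env):
    """Expand one command-line word the way the text above describes."""
    bare = re.fullmatch(r"\$(" + NAME + r")", arg)
    if bare:
        # $X as a word of its own: split into words, quotes removed.
        return shlex.split(env.get(bare.group(1), ""))
    # ${X}: substituted verbatim, the word stays a single argument.
    return [re.sub(r"\$\{(" + NAME + r")\}", lambda m: env.get(m.group(1), ""), arg)]


def expand_command(argv, env):
    result = []
    for arg in argv:
        result.extend(expand_argument(arg, env))
    return result


env = {"OPTS": "-v 'two words'"}
print(expand_command(["prog", "$OPTS"], env))
print(expand_command(["prog", "${OPTS}"], env))
```

With the same variable, the first form yields three arguments and the second yields two, with the quotes still inside the last one.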

Service Management

systemd appears to do two special things to processes running under its DynamicUser facility. First, it tries to reuse the same user ID as long as the service remains unchanged, by generating a special hash. Second, if the service leaves IPC objects around by any means (shm, mq), it pins that UID in memory and never uses it again (see dynamic-user.c and search for ipc; the clean-ipc function returns -1). Under a special configuration, this leads to UID starvation in that range if you restart the service too many times (or use a transient one), if the small UID range is not already enough to cause this by itself. Also, when two services share the same StateDirectory=, the one starting later has the directory chowned to it, which breaks the former's permissions (thanks to @rt2800pci1).

Design issues

Interaction with journald via PID1 is a deadlock candidate (PID1 blocked on writing to D-Bus, dbus blocked on logging to journald, journald blocked on sd_notify), worked around by write-polling the notify socket and issuing a non-blocking write. Again, this might mean you lose all logging streams on restart if the system is busy.
📚 In general, interaction with the systemd bus by various components is a mix of blocking and non-blocking IO, including silent dropping of messages on EAGAIN. Messaging is either unreliable or contains more possible deadlock candidates.
Startup notification via sd_notify() is typically only supported from what systemd considers the main PID of a unit. The presence of the notify socket (and hence notification support) is indicated by the environment variable $NOTIFY_SOCKET. If you spawn notify-capable subservices from your systemd unit, you need to unset $NOTIFY_SOCKET to prevent systemd warnings, since the subservice will otherwise try to notify systemd of its own startup.
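For the $NOTIFY_SOCKET case, the fix on the spawning side is short. The following is a minimal sketch, assuming your service launches its subservices itself. Both helper names are made up for illustration.

```python
import os
import subprocess
import sys


def child_environment(environ=None):
    """Return a copy of the environment without NOTIFY_SOCKET."""
    env = dict(os.environ if environ is None else environ)
    env.pop("NOTIFY_SOCKET", None)
    return env


def spawn_subservice(argv):
    """Run a subservice that must not talk to systemd's notify socket."""
    return subprocess.run(
        argv,
        env=child_environment(),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Demo: the child reports whether it can still see the variable.
    probe = "import os; print('NOTIFY_SOCKET' in os.environ)"
    print(spawn_subservice([sys.executable, "-c", probe]).stdout.strip())
```

The parent keeps its own $NOTIFY_SOCKET and can still report its startup. Only the child's copy of the environment is stripped.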

Security Unconscious Design/Implementation

The initial kill signal for KillMode=process and KillMode=mixed is sent from PID1 as root, ignoring hardening that existed even in previous SysV implementations. Combined with PIDFile=, this allows an ~~unprivileged~~ service to kill an arbitrary process when the service is stopped. As of #7816 there are additional checks for PID files from unprivileged services. The kill is still performed as root.
PID1 has a single-threaded message loop with only limited QoS. For example, a unit can cause a denial of service in PID1 by repeatedly sending file descriptors to PID1. This requires fdstore to be enabled explicitly, though.


Monday, 31 July 2017
