stefan's blag and stuff

Blog – 2016-06-03 – Signals, pthreads and getaddrinfo

This morning I tried to fix a shutdown issue with systemd's systemd-timesyncd. This journey took me into the glorious world of Linux/POSIX signal handling, POSIX pthreads, the glibc library function getaddrinfo, and how all of them interact.

Slow shutdown of systemd-timesyncd

While taking the train to work each morning, I have nothing to do for 15 minutes. If I'm not reading news on my smartphone, I open my laptop and try to get some coding work done. When the train reaches Mainz station, I quickly shut down the laptop and get off the train. During the shutdown process I noticed that systemd waits a couple of seconds for the Network Time Synchronization service to stop.

Jun 03 10:34:33 my-host systemd[1]: Stopping Network Time Synchronization...
Jun 03 10:34:41 my-host systemd[1]: Stopped Network Time Synchronization.

When the laptop is parked in the docking station at home, I never see the long delay. Hmm, so what's the difference between using the laptop on the train and at home? ... Of course, network connectivity. And indeed, the problem is reproducible by unplugging the network cable. When you execute

$ systemctl stop systemd-timesyncd

without an internet connection, the shutdown of the daemon takes some time.

Ok. Now, what's the culprit? Before reading the source bottom up, I attached strace (the tcpdump for syscalls) to the running daemon. The output was

$ sudo strace -p $(pidof systemd-timesyncd)
...
send(8, "\7\0\0\0\0\0\0\0\f\0\0\0", 12, MSG_NOSIGNAL) = 12
futex(0xb749fba8, FUTEX_WAIT, 17200, NULL) = ? ERESTARTSYS (To be restarted)
--- SIGCONT (Continued) @ 0 (0) ---
futex(0xb749fba8, FUTEX_WAIT, 17200, NULL) = 0
close(7)                                = 0
close(8)                                = 0
close(9)                                = 0
close(10)                               = 0
signalfd4(5, [TERM], 8, O_NONBLOCK|O_CLOEXEC) = 5
signalfd4(5, [], 8, O_NONBLOCK|O_CLOEXEC) = 5
close(4)                                = 0
close(5)                                = 0
close(6)                                = 0
exit_group(0)                           = ?
Process 17199 detached

The calls to futex take a very long time. What is the futex syscall? Looking it up in the manpages reveals: futex - fast user-space locking. That must be used for synchronisation between multiple threads in the same process context. Adding the -f argument, which also traces the children and threads of the process, shows that there is another thread running.

Knowing that, I had a quick look at the systemd-timesyncd source code in src/timesync/. But there aren't any thread-specific functions there. So I turned to the whole systemd source tree and grepped for pthread_create.

$ git grep pthread_create
src/bus-proxyd/bus-proxyd.c:                r = pthread_create(&tid, &attr, run_client, c);
src/libsystemd/sd-bus/test-bus-chat.c:        r = pthread_create(&c1, NULL, client1, bus);
src/libsystemd/sd-bus/test-bus-chat.c:        r = pthread_create(&c2, NULL, client2, bus);
src/libsystemd/sd-bus/test-bus-objects.c:        r = pthread_create(&s, NULL, server, &c);
src/libsystemd/sd-bus/test-bus-server.c:        r = pthread_create(&s, NULL, server, &c);
src/libsystemd/sd-resolve/sd-resolve.c:                r = pthread_create(&resolve->workers[resolve->n_valid_workers], NULL, thread_worker, resolve);
src/shared/async.c:        r = pthread_create(&t, &a, func, arg);

Bingo. The library code src/libsystemd/sd-resolve/sd-resolve.c uses threads to look up domain names asynchronously. It is used in the systemd-timesyncd daemon and queries DNS, which is definitely a network operation and therefore susceptible to connectivity and timeout issues.

After putting some debugging messages into the code, you can find the following code path in the shutdown sequence of the daemon. When the daemon is stopped, the function manager_free is called, and then:

# file src/timesync/timesyncd-manager.c
void manager_free(Manager *m) {
    [...]
    sd_resolve_unref(m->resolve);

# file src/libsystemd/sd-resolve/sd-resolve.c
_public_ sd_resolve* sd_resolve_unref(sd_resolve *resolve) {
    [...]
    if (resolve->n_ref <= 0)
        resolve_free(resolve);

# file src/libsystemd/sd-resolve/sd-resolve.c
static void resolve_free(sd_resolve *resolve) {
    [...]
    if (resolve->fds[REQUEST_SEND_FD] >= 0) {

            RHeader req = {
                    .type = REQUEST_TERMINATE,
                    .length = sizeof(req)
            };

            /* Send one termination packet for each worker */
            for (i = 0; i < resolve->n_valid_workers; i++)
                    (void) send(resolve->fds[REQUEST_SEND_FD], &req, req.length, MSG_NOSIGNAL);
    }

    /* Now terminate them and wait until they are gone. */
    for (i = 0; i < resolve->n_valid_workers; i++) {
            for (;;) {
                    if (pthread_join(resolve->workers[i], NULL) != EINTR)
                            break;
            }
    }

The code is totally straightforward. On shutdown the sd-resolve library sends one REQUEST_TERMINATE message per worker thread, so they can stop gracefully. After that the main thread uses pthread_join to wait until all threads have finished. Internally this function uses the futex syscall for the inter-thread communication, which is what was visible in the strace log.
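The shutdown pattern described above can be sketched in isolation. This is a minimal illustration and not the sd-resolve code: the names Pool, WorkerArg, MSG_WORK, MSG_TERMINATE and pool_run are made up for the example, and a datagram socketpair is assumed as the message channel.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

enum { MSG_WORK = 1, MSG_TERMINATE = 2 };  /* hypothetical message types */
enum { N_WORKERS = 2 };

typedef struct {
        int fds[2];                    /* fds[0]: workers receive, fds[1]: owner sends */
        pthread_t workers[N_WORKERS];
        int handled[N_WORKERS];        /* work messages handled, one slot per worker */
} Pool;

typedef struct {
        Pool *pool;
        int index;
} WorkerArg;

static void *worker_main(void *p) {
        WorkerArg *arg = p;
        uint8_t type;

        /* One datagram is one message; leave the loop on MSG_TERMINATE or error. */
        while (recv(arg->pool->fds[0], &type, sizeof(type), 0) == (ssize_t) sizeof(type)) {
                if (type == MSG_TERMINATE)
                        break;
                arg->pool->handled[arg->index]++;
        }
        return NULL;
}

/* Starts the workers, queues n_jobs work messages and shuts down gracefully.
 * Returns the number of work messages handled, or -1 on setup failure. */
int pool_run(int n_jobs) {
        Pool pool = { .fds = { -1, -1 } };
        WorkerArg args[N_WORKERS];
        uint8_t msg;
        int i, started = 0, total = 0;

        assert(n_jobs >= 0);
        if (socketpair(AF_UNIX, SOCK_DGRAM, 0, pool.fds) < 0)
                return -1;

        for (i = 0; i < N_WORKERS; i++) {
                args[i] = (WorkerArg) { .pool = &pool, .index = i };
                if (pthread_create(&pool.workers[i], NULL, worker_main, &args[i]) != 0)
                        break;
                started++;
        }

        msg = MSG_WORK;
        for (i = 0; started > 0 && i < n_jobs; i++)
                (void) send(pool.fds[1], &msg, sizeof(msg), MSG_NOSIGNAL);

        /* Send one termination message for each worker ... */
        msg = MSG_TERMINATE;
        for (i = 0; i < started; i++)
                (void) send(pool.fds[1], &msg, sizeof(msg), MSG_NOSIGNAL);

        /* ... then wait until they are gone. */
        for (i = 0; i < started; i++) {
                (void) pthread_join(pool.workers[i], NULL);
                total += pool.handled[i];
        }

        close(pool.fds[0]);
        close(pool.fds[1]);
        return started == N_WORKERS ? total : -1;
}
```

Calling pool_run(3) from a main function queues three jobs and then one termination message per worker. The important property is that a worker only sees the termination message once it is back in recv, so a worker stuck in a long blocking call delays pthread_join by exactly that long.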

Intermission: I am always totally amazed when I'm looking at systemd code. Everything is so clean and well implemented. The developers really know what they are doing. A naive developer, or someone writing a standalone program, would use a single global variable static bool threads_stop_please to signal all worker threads to terminate. But this is not safe when the library is used multiple times in the same process context. You must use a per-instance variable or use messages like the systemd devs do.

Intermission 2: Nobody is perfect. The above code snippet contains an issue. The function pthread_join never returns EINTR; it's not documented as a valid return value. The unnecessary check has already been removed in systemd v230.

Why do the worker threads take so long to terminate? Throwing in some debugging messages again reveals the culprit, the glibc function getaddrinfo:

# file src/libsystemd/sd-resolve/sd-resolve.c
static int handle_request(int out_fd, const Packet *packet, size_t length) {
    [...]
    switch (req->type) {

    case REQUEST_ADDRINFO: {
           [...]
           ret = getaddrinfo(
                           node, service,
                           ai_req->hints_valid ? &hints : NULL,
                           &result);

           /* send_addrinfo_reply() frees result */
           return send_addrinfo_reply(out_fd, req->id, ret, result, errno, h_errno);
    }
    [...]

The function getaddrinfo is called in the worker threads to resolve a domain name to an IP address. This function parses /etc/resolv.conf, looks up the domain name in /etc/hosts, maybe uses a cache and eventually queries the configured upstream DNS server. The delayed shutdown is caused by long waits in a poll syscall until it times out. It only occurs when there is a routable connection but no real connectivity. After removing the ethernet cable the interface has NO-CARRIER but is still UP.
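To see how long a single lookup blocks, the call can be timed directly. The following is a small sketch; the helper name timed_getaddrinfo and its parameters are made up for this example and are not part of systemd or glibc.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Calls getaddrinfo() and stores the elapsed wall-clock time in *seconds.
 * Returns the getaddrinfo() result code (0 on success, EAI_* otherwise). */
int timed_getaddrinfo(const char *node, const char *service, int flags, double *seconds) {
        struct addrinfo hints, *result = NULL;
        struct timespec start, end;
        int ret;

        assert(seconds != NULL);

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_DGRAM;
        hints.ai_flags = flags;

        clock_gettime(CLOCK_MONOTONIC, &start);
        ret = getaddrinfo(node, service, &hints, &result);   /* may block in poll() */
        clock_gettime(CLOCK_MONOTONIC, &end);

        if (ret == 0)
                freeaddrinfo(result);

        *seconds = (double) (end.tv_sec - start.tv_sec)
                 + (double) (end.tv_nsec - start.tv_nsec) / 1e9;
        return ret;
}
```

With AI_NUMERICHOST and a literal address such as "127.0.0.1" no DNS query is made, so the call should return quickly. Passing a real host name and flags 0 while the cable is unplugged is the case in which the long wait described above would show up in *seconds.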

$ ip addr
2: enp2s1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
 link/ether 00:0a:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff
 inet 192.168.1.2/24 brd 192.168.1.255 scope global enp2s1
    valid_lft forever preferred_lft forever
 inet6 fe80::20a:e4ff:XXXX:XXXX/64 scope link
    valid_lft forever preferred_lft forever

When I disable the interface, the function getaddrinfo returns immediately.

It would be very cool if I found a solution for this problem and could fix the issue upstream. Maybe earning some kudos. So let's try to make the shutdown of the sd-resolve library and systemd-timesyncd fast in any case.

Attempt to interrupt a syscall

I already knew that glibc library functions may fail with the error code EINTR when they are interrupted by a signal. So I thought it would be possible to send a signal to the thread before calling pthread_join. The thread would react to the signal and abort the poll syscall immediately. The EINTR error would cause all internal library functions to fail and propagate the error up to the main loop of the worker thread. The worker thread would then receive the next message, which is the termination message, and stop at once instead of waiting for the timeout. So I added the following code before the pthread_join calls:

for (i = 0; i < resolve->n_valid_workers; i++) {
        ret = pthread_kill(resolve->workers[i], SIGTERM);
        if (ret)
                log_error("Thread %d: error = %d", i, ret);
}

What can I say? It does not work. Most of my assumptions above are plain wrong. Here is what I learned along the way:

Signal Handling and Threads

The systemd developers have disabled all signals in the worker threads:

# file src/libsystemd/sd-resolve/sd-resolve.c
static void* thread_worker(void *p) {
    sd_resolve *resolve = p;
    sigset_t fullset;

    /* No signals in this thread please */
    assert_se(sigfillset(&fullset) == 0);
    assert_se(pthread_sigmask(SIG_BLOCK, &fullset, NULL) == 0);
    [...]

Which is a sane thing to do for library code. I learned that signal handlers are not thread-specific resources. They are process-specific. That means that for a given signal, like SIGUSR1, there can be only one registered signal handler in the process context. So you cannot implement thread-specific actions on signals. Furthermore, when a signal is sent to a process via the syscall kill, it's not clear in which thread the signal is received and the signal handler is invoked. It is either the main thread or some other thread that is not blocking the signal. See Stackoverflow: Signal handling with multiple threads in Linux and Stackoverflow: different signal handler for thread and process? Is it possible.

The only thing a thread can do on its own is to block a set of signals via pthread_sigmask, so that they are never handled in the context of that thread. So it's possible to create a master thread in a program that handles all signals and forwards the information via some other inter-thread communication channel like a socket.
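A master thread of that kind could look roughly like the sketch below. It assumes that SIGUSR1 is blocked in every thread of the process, which is why the mask is set before the thread is created, and it uses a pipe as the forwarding channel. SignalThreadArg, signal_thread and forward_one_signal are names invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct {
        sigset_t set;      /* signals the dedicated thread waits for */
        int notify_fd;     /* write end of the forwarding pipe */
} SignalThreadArg;

static void *signal_thread(void *p) {
        SignalThreadArg *arg = p;
        int signo = 0;

        /* sigwait() consumes a pending, blocked signal; no handler is involved. */
        if (sigwait(&arg->set, &signo) == 0) {
                ssize_t n = write(arg->notify_fd, &signo, sizeof(signo));
                (void) n;
        }
        return NULL;
}

/* Sends SIGUSR1 to the own process and returns the signal number that the
 * dedicated thread forwarded through the pipe, or -1 on failure. */
int forward_one_signal(void) {
        SignalThreadArg arg;
        sigset_t old;
        pthread_t t;
        int fds[2], received = -1;

        if (pipe(fds) < 0)
                return -1;
        arg.notify_fd = fds[1];
        sigemptyset(&arg.set);
        sigaddset(&arg.set, SIGUSR1);

        /* Block SIGUSR1 here; threads created afterwards inherit this mask. */
        pthread_sigmask(SIG_BLOCK, &arg.set, &old);

        if (pthread_create(&t, NULL, signal_thread, &arg) == 0) {
                kill(getpid(), SIGUSR1);           /* process-directed signal */
                if (read(fds[0], &received, sizeof(received)) != (ssize_t) sizeof(received))
                        received = -1;
                pthread_join(t, NULL);
        }
        assert(received == -1 || received == SIGUSR1);

        pthread_sigmask(SIG_SETMASK, &old, NULL);
        close(fds[0]);
        close(fds[1]);
        return received;
}
```

The rest of the program never runs a signal handler at all. It only reads signal numbers from the pipe, which fits nicely into a poll-based main loop.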

From the outside of the process it is not possible to send a signal directly to a thread via kill. The kernel decides which thread receives the signal based on the signal masks. From the inside of the process context it is possible to send a signal to a specific thread via pthread_kill (It uses the syscall tgkill which is not exposed as a library function directly). But, again, the signal handler function is registered process-wide. See manpage of pthread_kill.
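A small sketch of these two facts together: the handler is registered once for the whole process, but pthread_kill decides in which thread it runs. The names on_sigusr1, waiting_worker and handler_runs_in_target are invented for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t handler_ran;
static pthread_t handler_thread;   /* thread in which the handler was invoked */

static void on_sigusr1(int signo) {
        (void) signo;
        handler_thread = pthread_self();
        handler_ran = 1;
}

static void *waiting_worker(void *p) {
        sigset_t block, wait_mask;
        (void) p;

        /* Block SIGUSR1, then atomically unblock it and wait, so the signal
         * cannot slip in between the check and the wait. */
        sigemptyset(&block);
        sigaddset(&block, SIGUSR1);
        pthread_sigmask(SIG_BLOCK, &block, &wait_mask);
        sigdelset(&wait_mask, SIGUSR1);
        while (!handler_ran)
                sigsuspend(&wait_mask);
        return NULL;
}

/* Returns 1 if the handler ran in the targeted worker thread, 0 if it ran
 * in another thread, -1 on setup failure. */
int handler_runs_in_target(void) {
        struct sigaction sa, old;
        pthread_t worker;
        int in_worker;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sigusr1;
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGUSR1, &sa, &old) < 0)      /* process-wide registration */
                return -1;

        handler_ran = 0;
        if (pthread_create(&worker, NULL, waiting_worker, NULL) != 0) {
                sigaction(SIGUSR1, &old, NULL);
                return -1;
        }
        pthread_kill(worker, SIGUSR1);              /* thread-directed delivery */
        pthread_join(worker, NULL);
        assert(handler_ran);

        in_worker = pthread_equal(handler_thread, worker) != 0;
        sigaction(SIGUSR1, &old, NULL);
        return in_worker;
}
```

Note that the sigaction call happens in the main thread, yet the handler executes in the worker. A library that wanted to use this trick would therefore have to take over the SIGUSR1 handler of the whole application.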

In general it's impossible to use signal handlers and signal masks safely as a library author while using POSIX pthreads. You would have to coordinate the usage with the application developer that is the master of the process context ;-) Without pthreads it may be possible for a blocking library function to safely use signal handlers. It has to save the current signal handlers and mask, set up new ones, do something, wait for the signal or a timeout and then restore the original mask and handlers and return. (As long as the application program doesn't use pthreads, signal mask and signal handler itself concurrently.)

What does this mean for the timeout issue in getaddrinfo? Even if the worker thread unmasked a signal like SIGUSR1 and the main program sent the signal via pthread_kill, the code would still have to set up a process-wide signal handler for SIGUSR1. Otherwise the unhandled signal SIGUSR1 would terminate the whole process, because that's the default action. See man 7 signal.

I tried the above idea, but it does not work. The signal does not abort the library function getaddrinfo. The registered signal handler is invoked in the thread context, but getaddrinfo does not fail with EINTR. I suspect the internal code of getaddrinfo that calls poll just retries on EINTR. The timeout problem still exists.
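The default action can be checked safely in a forked child. The sketch below sends a signal with its default disposition to one thread of the child and reports how the child process ended; sleeper and child_fate_after_thread_signal are made-up names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void *sleeper(void *p) {
        (void) p;
        for (;;)
                pause();           /* sleep until a signal arrives */
        return NULL;
}

/* Forks a child that sends signo, with its default disposition, to one of
 * its own threads. Returns the signal that terminated the child process,
 * 0 if the child exited normally, -1 on error. */
int child_fate_after_thread_signal(int signo) {
        int status;
        pid_t pid;

        assert(signo > 0);
        pid = fork();
        if (pid < 0)
                return -1;

        if (pid == 0) {
                sigset_t set;
                pthread_t t;

                /* Use the default action and make sure the signal is not blocked. */
                signal(signo, SIG_DFL);
                sigemptyset(&set);
                sigaddset(&set, signo);
                pthread_sigmask(SIG_UNBLOCK, &set, NULL);

                if (pthread_create(&t, NULL, sleeper, NULL) == 0) {
                        pthread_kill(t, signo);
                        pthread_join(t, NULL);   /* does not return if the process dies */
                }
                _exit(0);
        }

        if (waitpid(pid, &status, 0) != pid)
                return -1;
        return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Although the signal is directed at a single thread, the parent sees the whole child process terminated by that signal. The target thread only selects where a handler would run; the default action still applies to the process.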

While reading manpages I also found the option SA_RESTART of the signal handler register function sigaction:

SA_RESTART
       Provide  behavior compatible with BSD signal semantics by making certain
       system calls restartable across signals.  This flag is  meaningful  only
       when  establishing  a signal handler.  See signal(7) for a discussion of
       system call restarting.

My assumption that every syscall is aborted by signals is wrong. It depends on the syscall and further options. See man 7 signal section Interruption of system calls and library functions by signal handlers.
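The effect of SA_RESTART can be made visible with a blocking read on an empty pipe. In this sketch a timer delivers SIGALRM after 50 ms and the handler writes one byte into the pipe, so a restarted read can complete. The names on_alarm and read_with_flags are invented for the example, and the 50 ms delay is assumed to be long enough for the read to be blocking already.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

static int wake_fd = -1;   /* write end of the pipe, used by the handler */

static void on_alarm(int signo) {
        int saved_errno = errno;
        ssize_t n;

        (void) signo;
        /* write() is async-signal-safe; the byte lets a restarted read() finish. */
        n = write(wake_fd, "x", 1);
        (void) n;
        errno = saved_errno;
}

/* Installs the SIGALRM handler with the given sa_flags and blocks in read().
 * Returns 1 if read() was restarted and got the byte, 0 if it failed with
 * EINTR, -1 on any other outcome. */
int read_with_flags(int sa_flags) {
        struct sigaction sa, old;
        struct itimerval timer;
        int fds[2], result = -1;
        ssize_t n;
        char c;

        assert(sa_flags == 0 || sa_flags == SA_RESTART);
        if (pipe(fds) < 0)
                return -1;
        wake_fd = fds[1];

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_alarm;
        sa.sa_flags = sa_flags;
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGALRM, &sa, &old) == 0) {
                memset(&timer, 0, sizeof(timer));
                timer.it_value.tv_usec = 50000;     /* one shot, 50 ms from now */
                setitimer(ITIMER_REAL, &timer, NULL);

                n = read(fds[0], &c, 1);            /* blocks: the pipe is empty */
                if (n == 1)
                        result = 1;
                else if (n < 0 && errno == EINTR)
                        result = 0;

                sigaction(SIGALRM, &old, NULL);
        }

        close(fds[0]);
        close(fds[1]);
        return result;
}
```

Calling read_with_flags(0) and read_with_flags(SA_RESTART) one after the other shows both behaviours: without the flag the read fails with EINTR, with the flag the kernel restarts it transparently and the caller never notices the signal.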

But the general idea of sending a signal to interrupt the poll syscall (making it return early) does work. I wrote a little example C program: wait.c. NOTE: poll is not affected by SA_RESTART. It is never restarted.
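The following is not the wait.c mentioned above, only a separate minimal sketch of the same idea: a worker thread sleeps in poll, the main thread interrupts it with pthread_kill, and the handler is deliberately installed with SA_RESTART. PollResult, poll_worker and interrupt_poll are made-up names, and the 100 ms sleep is a crude assumption to let the worker reach poll first.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <time.h>

typedef struct {
        int poll_ret;      /* return value of poll() in the worker */
        int poll_errno;    /* errno right after poll(), 0 if poll() succeeded */
} PollResult;

static void on_sigusr1(int signo) {
        (void) signo;      /* nothing to do: the handler only has to exist */
}

static void *poll_worker(void *p) {
        PollResult *res = p;

        /* No fds: poll() just sleeps for up to 5 s unless a handler runs. */
        res->poll_ret = poll(NULL, 0, 5000);
        res->poll_errno = res->poll_ret < 0 ? errno : 0;
        return NULL;
}

/* Interrupts a worker blocked in poll() and stores what poll() returned.
 * Returns 0 on success, -1 on setup failure. */
int interrupt_poll(PollResult *res) {
        struct sigaction sa, old;
        struct timespec delay = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };
        pthread_t t;

        assert(res != NULL);
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sigusr1;
        sa.sa_flags = SA_RESTART;        /* has no effect on poll() */
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGUSR1, &sa, &old) < 0)
                return -1;

        if (pthread_create(&t, NULL, poll_worker, res) != 0) {
                sigaction(SIGUSR1, &old, NULL);
                return -1;
        }
        nanosleep(&delay, NULL);         /* crude: let the worker reach poll() */
        pthread_kill(t, SIGUSR1);
        pthread_join(t, NULL);

        sigaction(SIGUSR1, &old, NULL);
        return 0;
}
```

After interrupt_poll returns, poll_ret is -1 and poll_errno is EINTR long before the 5 second timeout has passed. Whether the caller of poll then gives up or simply calls poll again is its own decision, which is the part an outside thread cannot control.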

The glibc code of function getaddrinfo is not nice.

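There is a POSIX function pthread_cancel. It is loosely comparable to the signal SIGKILL for processes: with the default (deferred) cancellation type, the thread is terminated at its next cancellation point, and blocking calls such as poll are cancellation points. Unless the thread has registered cleanup handlers via pthread_cleanup_push, it has no chance to clean up its resources.

POSIX defines real-time signals (SIGRTMIN+n to SIGRTMAX) that can be used by an application for its own purposes. They have special properties like queueing and in-order delivery, and two or three of them are already used by the pthread implementation of glibc: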
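A minimal sketch of cancelling a thread that is blocked in poll, assuming the default deferred cancellation type. It also registers a cleanup handler with pthread_cleanup_push to show the one hook a thread has for releasing its own resources. CancelState, release_buffer, blocking_worker and cancel_blocked_worker are names invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct {
        void *buffer;      /* resource owned by the worker */
        int cleaned_up;    /* set by the cleanup handler */
} CancelState;

static void release_buffer(void *p) {
        CancelState *st = p;

        free(st->buffer);
        st->buffer = NULL;
        st->cleaned_up = 1;
}

static void *blocking_worker(void *p) {
        CancelState *st = p;

        st->buffer = malloc(64);
        pthread_cleanup_push(release_buffer, st);
        /* poll() is a cancellation point: a pending cancel takes effect here. */
        poll(NULL, 0, 5000);
        pthread_cleanup_pop(1);    /* not cancelled: release the buffer anyway */
        return NULL;
}

/* Returns 1 if the worker ended through cancellation, 0 if it returned
 * normally, -1 on setup failure. */
int cancel_blocked_worker(CancelState *st) {
        pthread_t t;
        void *retval = NULL;

        assert(st != NULL);
        if (pthread_create(&t, NULL, blocking_worker, st) != 0)
                return -1;
        pthread_cancel(t);                 /* request is deferred until poll() */
        pthread_join(t, &retval);
        return retval == PTHREAD_CANCELED;
}
```

The cleanup handler only covers resources that the thread function itself knows about. Memory or file descriptors held inside a library call that is cancelled halfway through are out of its reach unless the library installs its own handlers.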

The Linux kernel supports a range of 33 different real-time signals, numbered 32 to
64.  However, the glibc POSIX threads implementation internally uses two (for NPTL)
or three (for LinuxThreads) real-time signals (see pthreads(7)),  and  adjusts  the
value of SIGRTMIN suitably (to 34 or 35).

See man 7 signal section Real-time signals. Above I said that a library cannot use signals reliably... Yes, unless you are a glibc developer and can reserve some signal numbers for yourself. Plus you don't have to worry about or handle multiple library instances.

The function getaddrinfo has an async version, getaddrinfo_a, too. But I suspect that it uses pthreads internally and has the same timeout behaviour as the resolve library in systemd.
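The adjusted range is easy to inspect at runtime. This tiny sketch only prints what the C library reports; the function names are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>

/* Number of real-time signals left for the application. In glibc SIGRTMIN
 * and SIGRTMAX expand to function calls, not compile-time constants. */
int usable_rt_signal_count(void) {
        assert(SIGRTMIN <= SIGRTMAX);
        return SIGRTMAX - SIGRTMIN + 1;
}

void print_rt_signal_range(void) {
        printf("SIGRTMIN=%d SIGRTMAX=%d usable=%d\n",
               SIGRTMIN, SIGRTMAX, usable_rt_signal_count());
}
```

Because the values are only known at runtime, code that wants a real-time signal should always write SIGRTMIN+n instead of hard-coding a number.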

Workarounds

Here are two workarounds for the timeout issue. The first option is to send the signal SIGKILL to the daemon right away on shutdown. This will terminate the process and its threads immediately. Implementation:

$ cat /etc/systemd/system/systemd-timesyncd.service.d/10-hardkill.conf
[Service]
KillSignal=SIGKILL
SuccessExitStatus=SIGKILL

There should be no resource leaks, because the process is terminated anyway.

The second option is to kill only the worker threads via pthread_cancel. Implementation:

diff --git a/src/libsystemd/sd-resolve/sd-resolve.c b/src/libsystemd/sd-resolve/sd-resolve.c
index 888b372..ef2524a 100644
--- a/src/libsystemd/sd-resolve/sd-resolve.c
+++ b/src/libsystemd/sd-resolve/sd-resolve.c
@@ -579,6 +579,10 @@ static void resolve_free(sd_resolve *resolve) {
                         (void) send(resolve->fds[REQUEST_SEND_FD], &req, req.length, MSG_NOSIGNAL);
         }
 
+        /* Kill threads */
+        for (i = 0; i < resolve->n_valid_workers; i++)
+                pthread_cancel(resolve->workers[i]);
+
         /* Now terminate them and wait until they are gone. */
         for (i = 0; i < resolve->n_valid_workers; i++) {
                 for (;;) {

This introduces resource leaks, because the worker threads have no chance to free their memory allocations. For the shutdown process of a daemon that's not an issue, because the process terminates anyway. But the function resolve_free can also be called in other cases. It's only an ugly workaround.