The SIGTERM Bug That Only Broke on Linux: A Signal Handling Deep Dive
You've built a robust background task system. It works flawlessly on your MacBook. Tests pass. Code review approves. You ship to production—and suddenly, tasks stop completing on your Linux servers. No crashes. No error logs. Just silence.
This exact scenario recently surfaced in a developer's battle with Swift actors and cross-platform signal handling. After six CI runs and countless hours debugging, the culprit turned out to be one of the most fundamental differences between macOS and Linux: how they handle process termination signals.
The Setup: When Actors Meet Background Jobs
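The architecture seemed straightforward: a single Swift actor managing all background tasks in the codebase. Actors in Swift serialize access to their own state, so only one piece of actor-isolated code runs at a time. That makes them a natural fit for coordinating work queues, retries, and job state, with one caveat: an actor can interleave other calls whenever it suspends at an await.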
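The post does not show the actor itself, so the following is only a minimal sketch of what such a design might look like. The names `Job`, `BackgroundJobQueue`, `enqueue`, and `drain` are hypothetical and chosen for illustration.

```swift
// Hypothetical job type: an identifier plus the async work to run.
struct Job: Sendable {
    let id: Int
    let work: @Sendable () async -> Void
}

// Hypothetical actor that owns the queue and the record of finished jobs.
actor BackgroundJobQueue {
    private var pending: [Job] = []
    private(set) var completedIDs: [Int] = []

    // Actor isolation means concurrent callers cannot corrupt `pending`.
    func enqueue(_ job: Job) {
        pending.append(job)
    }

    // Runs queued jobs one at a time, in FIFO order, until the queue is empty.
    func drain() async {
        while !pending.isEmpty {
            let job = pending.removeFirst()
            await job.work()              // the actor may run other calls while suspended here
            completedIDs.append(job.id)   // recorded only after the work returns
        }
    }
}
```

Note the window between `removeFirst()` and `completedIDs.append`: a job that has been dequeued but not yet recorded is exactly the state that gets lost if the process dies mid-task.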
The system also used notification injection before each API call, allowing for request tracking, logging, and observability hooks. On macOS, everything hummed along perfectly. The actor queued tasks, processed them sequentially, and gracefully handled interruptions.
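The injection mechanism is not described in detail, so here is one way it could be sketched, assuming hooks are plain closures that receive the name of the request. `PreCallHook` and `InstrumentedClient` are hypothetical names, and `operation` stands in for a real API request.

```swift
// Hypothetical hook type: called with the request name before each API call.
typealias PreCallHook = @Sendable (String) -> Void

struct InstrumentedClient: Sendable {
    let hooks: [PreCallHook]

    // Runs every hook first, then performs the call and returns its result.
    func call<T>(_ name: String, operation: () async throws -> T) async rethrows -> T {
        for hook in hooks {
            hook(name)
        }
        return try await operation()
    }
}
```

A caller would wrap each request, for example `await client.call("fetchUser") { ... }`, so logging and tracking stay out of the request code itself.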
But Linux told a different story.
The Platform Divide: SIGTERM Handling Across Systems
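The root cause traces back to how the process responds to SIGTERM, the signal sent when a process is asked to terminate gracefully. This isn't a gentle suggestion; in practice it's a "clean up now, because the supervisor will follow with SIGKILL once its grace period expires" ultimatum. The deadline comes from whatever sent the signal (Kubernetes, systemd, a CI runner), not from the kernel.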
On macOS, shutdown looked more forgiving. In this setup, by the time SIGTERM arrived the Swift runtime had apparently had enough breathing room to:
- Complete in-flight async operations
- Flush pending actor messages
- Run deferred cleanup handlers
- Properly drain work queues
On Linux, there was no such slack. When SIGTERM hits a process:
- Any thread that is not blocking the signal can receive it, not necessarily the main thread
- If no custom handler exists, the default action terminates the process immediately
- Actor isolation boundaries do not protect in-progress work
- Async tasks mid-execution are abandoned without running cleanup
The result? Background tasks that appeared to start but never completed their work. No exceptions thrown. No error states recorded. Just incomplete jobs and confused developers.
What CI Revealed (Eventually)
The bug's stealth was its most dangerous feature. Local macOS testing passed consistently. The first five CI runs on Linux passed too—when tasks happened to complete before the test timeout triggered SIGTERM.
On the sixth run, timing shifted just enough for a task to be mid-execution when shutdown began. The actor had dequeued the work, started processing, but hadn't marked it complete. SIGTERM arrived, the process died, and the task state remained "in progress" forever.
The fix required two changes:
- Explicit signal handlers that coordinate with the actor system:
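import Foundation
signal(SIGTERM, SIG_IGN) // suppress default termination; the source below still sees SIGTERM
let sigtermSource = DispatchSource.makeSignalSource(signal: SIGTERM, queue: .global())
sigtermSource.setEventHandler {
    Task {
        await backgroundActor.gracefulShutdown()
        exit(0) // exit explicitly once cleanup has finished
    }
}
sigtermSource.resume()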
- State persistence checkpoints during long-running tasks, not just at completion:
func processJob(_ job: Job) async {
    await markInProgress(job.id)
    // Work happens here
    for chunk in job.chunks {
        await process(chunk)
        await checkpoint(job.id, progress: chunk.id) // <-- Critical
    }
    await markComplete(job.id)
}
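The signal handler calls `gracefulShutdown()`, but the post does not show what that method does. One plausible shape, sketched under the assumption that shutdown should stop work at the next chunk boundary rather than mid-chunk, is below. `BackgroundWorker`, `run`, and the in-memory `checkpoints` dictionary are hypothetical; a real system would write checkpoints to durable storage, since memory dies with the process.

```swift
// Hypothetical worker; the body of gracefulShutdown() is an assumption,
// not the original implementation.
actor BackgroundWorker {
    private var isShuttingDown = false
    // In-memory stand-in for durable storage: job id -> last finished chunk index.
    private(set) var checkpoints: [Int: Int] = [:]

    func gracefulShutdown() {
        isShuttingDown = true
    }

    // Processes chunks in order, resuming after the last checkpoint.
    // Returns true only if the job ran to completion.
    func run(jobID: Int, chunkCount: Int, process: @Sendable (Int) async -> Void) async -> Bool {
        let start = (checkpoints[jobID] ?? -1) + 1
        guard start < chunkCount else { return true }   // nothing left to do
        for index in start..<chunkCount {
            if isShuttingDown { return false }          // stop at a chunk boundary
            await process(index)                        // shutdown can be requested while suspended here
            checkpoints[jobID] = index                  // record progress after each chunk
        }
        return true
    }
}
```

Because the actor is free to run other calls while `process` is suspended, `gracefulShutdown()` can set the flag mid-job, and the loop notices it before starting the next chunk.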
Lessons for Production Systems
This debugging saga reveals several critical principles:
Cross-platform testing isn't optional. If your code ships to Linux servers, you must test signal handling on Linux. Kubernetes, Docker, and systemd all rely on SIGTERM for graceful shutdowns. A passing run on macOS guarantees nothing about Linux.
Actors don't protect against SIGTERM. Swift actors provide isolation from concurrent access, not from system-level interrupts. Your actor might be processing messages when the kernel says "time's up."
State machines need intermediate checkpoints. Binary states ("pending" vs "complete") fail when crashes happen mid-execution. Consider "in_progress" with progress markers, allowing jobs to resume rather than restart.
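As a rough illustration of that idea, the sketch below models an intermediate state that carries a progress marker and derives the resume point from it. `JobState` and `resumePoint` are hypothetical names, and the marker is assumed to be the index of the next unprocessed chunk.

```swift
// Hypothetical job states: the intermediate state carries a progress marker.
enum JobState: Equatable {
    case pending
    case inProgress(nextChunk: Int)
    case complete
}

// Returns the index of the chunk to resume from, or nil when nothing is left to do.
func resumePoint(for state: JobState, chunkCount: Int) -> Int? {
    switch state {
    case .pending:
        return chunkCount > 0 ? 0 : nil
    case .inProgress(let nextChunk):
        return nextChunk < chunkCount ? nextChunk : nil
    case .complete:
        return nil
    }
}
```

On restart, a job found in `.inProgress(nextChunk: 2)` picks up at chunk 2 instead of starting over or sitting in limbo.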
Observability saves debugging time. The notification injection pattern mentioned in the original post, which inserts hooks before each API call, would likely have surfaced this bug much sooner through missing "job completed" metrics.
CI timing matters. Flaky tests that pass 5/6 times aren't random failures—they're race conditions waiting to happen in production. When debugging, intentionally add delays to force timing scenarios.
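One way to force the timing, sketched here with hypothetical names, is to inject a delay into each chunk so that an interruption reliably lands mid-job. This sketch uses task cancellation as a stand-in for shutdown; it does not send a real SIGTERM.

```swift
// Hypothetical harness: runs up to `chunkCount` chunks, sleeping in each one,
// and returns how many chunks finished before the task was cancelled.
func runChunks(chunkCount: Int, delayNanoseconds: UInt64) async -> Int {
    var finished = 0
    for _ in 0..<chunkCount {
        do {
            // The injected delay widens the race window on purpose.
            try await Task.sleep(nanoseconds: delayNanoseconds)
        } catch {
            break // cancellation stands in for shutdown arriving mid-job
        }
        finished += 1
    }
    return finished
}
```

A test can start this in a `Task`, wait briefly, call `cancel()`, and then assert that the partial progress was recorded the way the job system expects.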
The Takeaway
The gap between "works on my machine" and "works in production" often hides in the space between operating systems. Signal handling, file locking, network stack behavior, and scheduler differences all create subtle platform-specific bugs.
Swift actors are powerful abstractions for concurrency, but they operate within the constraints of the host OS. When your architecture spans platforms, your testing must too—especially for background tasks where failures are silent and consequences are delayed.
Before you ship your next background job system, ask: What happens when SIGTERM arrives mid-task? If you don't know the answer on every platform you deploy to, you're one CI run away from finding out the hard way.
Building distributed systems? Start with signal-handling tests. Your future on-call self will thank you.