
Show HN: Forkrun – NUMA-aware shell parallelizer (50×–400× faster than parallel)

by jkool702 on 3/27/2026, 12:12:20 PM

forkrun is the culmination of a 10-year journey focused on "how to make shell parallelization fast". What started as a standard "fork jobs in a loop" script has turned into a lock-free, CAS-retry-loop-free, SIMD-accelerated, self-tuning, NUMA-aware shell-based stream parallelization engine that is (mostly) a drop-in replacement for xargs -P and GNU parallel.

On my 14-core/28-thread i9-7940x, forkrun achieves:

* 200,000+ batch dispatches/sec (vs ~500 for GNU Parallel)

* ~95–99% CPU utilization across all 28 logical cores, even when the workload is nonexistent (bash no-ops / `:`) (vs ~6% for GNU Parallel). These benchmarks are intentionally worst-case (near-zero work per task) because they measure the capability of the parallelization framework itself, not how much work an external tool can do.

* Typically 50×–400× faster than GNU Parallel on real high-frequency, low-latency workloads

A few of the techniques that make this possible:

* Born-local NUMA: stdin is splice()'d into a shared memfd, then pages are placed on the target NUMA node via set_mempolicy(MPOL_BIND) before any worker touches them. Each NUMA node only claims work that is *already* born-local on its node; stealing from other nodes is permitted under some conditions when no local work exists.

* SIMD scanning: per-node indexers/scanners use AVX2/NEON to find line boundaries (delimiters) at speeds approaching memory bandwidth, and publish byte offsets and line counts into per-node lock-free rings.

* Lock-free claiming: workers claim batches with a single atomic_fetch_add — no locks, no CAS retry loops; contention is reduced to a single atomic on one cache line.

* Memory management: a background thread uses fallocate(PUNCH_HOLE) to reclaim space without breaking the logical offset system.

…and that’s just the surface.
The implementation uses many additional systems-level techniques (phase-aware tail handling, adaptive batching, early-flush detection, etc.) to eliminate overhead, increase throughput, and reduce latency at every stage.

In its fastest (-b) mode (fixed-size batches, minimal processing), it can exceed 1B lines/sec.

forkrun ships as a single bash file with an embedded, self-extracting C extension — no Perl, no Python, no install, and full native support for parallelizing arbitrary shell functions. The binary is built in public GitHub Actions, so you can trace it back to CI (see the GitHub "Blame" on the line containing the base64 embeddings). Trying it is literally two commands:

    . frun.bash
    frun shell_func_or_cmd < inputs

For benchmarking scripts and results, see the BENCHMARKS dir in the GitHub repo.

For an architecture deep-dive, see the DOCS dir in the GitHub repo.

Happy to answer questions.

https://github.com/jkool702/forkrun

Comments

by: nasretdinov

Generally when I want to run something with so much parallelism I just write a small Go program instead, and let Go's runtime handle the scheduling. It works remarkably well, and there's no execve() overhead either.

3/31/2026, 5:51:38 PM


by: tombert

I guess I've never really used parallel for anything that was bound by the dispatch speed of parallel itself. I've always used parallel for running stuff like ffmpeg on a folder of 200+ videos, and the speed at which parallel decides to queue up the jobs is going to be very thoroughly eaten by the cost of ffmpeg itself.

Still, worth a shot.

I have to ask, was this vibe-coded though? I ask because I see multiple em dashes in your description here, and a lot of "no X, no Y..." notation that Codex seems to be fond of.

ETA: Not vibe coded, I see stuff from four years ago... my mistake!

3/31/2026, 6:56:08 PM


by: jkool702

Hi HN,

Have you ever run GNU Parallel on a powerful machine only to find one core pegged at 100% while the rest sit mostly idle?

I hit that wall... so I built forkrun.

forkrun is a self-tuning, drop-in replacement for GNU Parallel (and xargs -P) designed for high-frequency, low-latency shell workloads on modern hardware, including NUMA systems (e.g., log processing, text transforms, HPC data-prep pipelines).

On my 14-core/28-thread i9-7940x it achieves:

- 200,000+ batch dispatches/sec (vs ~500 for GNU Parallel)

- ~95–99% CPU utilization across all 28 logical cores (vs ~6% for GNU Parallel)

- Typically 50×–400× faster than GNU Parallel on real high-frequency, low-latency workloads

These benchmarks are intentionally worst-case (near-zero work per task), where dispatch overhead dominates. This is exactly the regime where GNU Parallel and similar tools struggle — and where forkrun is designed to perform.

A few of the techniques that make this possible:

- Born-local NUMA: stdin is splice()'d into a shared memfd, then pages are placed on the target NUMA node via set_mempolicy(MPOL_BIND) before any worker touches them.

- SIMD scanning: per-node indexers use AVX2/NEON to find line boundaries at speeds approaching memory bandwidth and publish byte offsets and line counts into per-node lock-free rings.

- Lock-free claiming: workers claim batches with a single atomic_fetch_add — no locks, no CAS retry loops; contention is reduced to a single atomic on one cache line.

- Memory management: a background thread uses fallocate(PUNCH_HOLE) to reclaim space without breaking the logical offset system.

…and that’s just the surface. The implementation uses many additional systems-level techniques (phase-aware tail handling, adaptive batching, early-flush detection, etc.) to eliminate overhead at every stage.

In its fastest (-b) mode (fixed-size batches, minimal processing), it can exceed 1B lines/sec.
In typical streaming workloads it's often 50×–400× faster than GNU Parallel.

forkrun ships as a single bash file with an embedded, self-extracting C extension — no Perl, no Python, no install, and full native support for parallelizing arbitrary shell functions. The binary is built in public GitHub Actions, so you can trace it back to CI (see the GitHub "Blame" on the line containing the base64 embeddings).

- Benchmarking scripts and raw results: https://github.com/jkool702/forkrun/blob/main/BENCHMARKS

- Architecture deep-dive: https://github.com/jkool702/forkrun/blob/main/DOCS

- Repo: https://github.com/jkool702/forkrun

Trying it is literally two commands:

    . frun.bash   # OR `. <(curl https://raw.githubusercontent.com/jkool702/forkrun/main/frun.bash)`
    frun shell_func_or_cmd < inputs

Happy to answer questions.

3/27/2026, 12:16:28 PM


by: pjoubert

[flagged]

3/31/2026, 7:14:06 PM

