# Parallelism — the CPU-not-GPU thesis, and how to keep it

## The thesis in one paragraph

The video industry assumes you need GPUs to make video. That assumption
is downstream of one architectural choice: **neural inference**. Generative
video models are billion-parameter transformers; one forward pass is one
giant matrix multiplication, and matrix multiplications are what GPUs are
built for. **But procedural composition is not a matrix multiplication.**
It is a million independent per-pixel transforms across thousands of
independent frames. That work is *embarrassingly parallel* and *trivially
vectorizable*. The right hardware for it is a multi-core CPU with numpy.
On an 8-core consumer CPU, a 60-second 24fps render that took ~2 minutes
sequentially drops to ~30 seconds parallel. On a 64-core EPYC server it
drops to ~4 seconds. **No GPU rental, no cloud bill, no $1500 card.**

## Why this changed what we're doing

Before recognizing this, we had a defensive crouch: "we can't compete with
Sora's GPU compute, so we'll compete on style fidelity." The CPU-not-GPU
recognition flipped the frame. We're not competing on the same axis at
all — we're on a parallel one with different physics:

| | Neural video (Sora etc.) | Procedural multiplane (us) |
|---|---|---|
| Per-render compute | one giant matmul on rented GPU | embarrassingly-parallel CPU work |
| Cost per 90-min feature | $540–$1620 cloud GPU rental | ~$2 in Haiku calls + zero compute |
| Wall clock per 90-min feature | hours of GPU time | 10–60 min on consumer CPU |
| Hardware requirement | A100s, H100s, supply-constrained | any modern laptop |
| Asset library | none — every render starts from weights | YAML mints — 100th render of a character is free |
| Character persistence | hard / unsolved | trivial — characters are deterministic from YAML |
| Scaling shape | linear in $$$ | linear in cores |

**This is the moat.** Not "we have better neural weights." We *don't have
neural weights.* We have a procedural authorship surface for the multiplane
primitive that runs on the laptop you already own. Each axis matters
individually; the combination is what no one else has.

The parallel pipeline is what makes the speed/cost claim load-bearing.
A serial procedural renderer would technically be cheap but would take
hours per minute of video — uncompetitive on wall clock. The parallel
pipeline collapses that to minutes. **Parallelism isn't an optimization;
it's what makes the architecture *work*.**

## The pattern (the "every server does the same thing" rule)

All new code that processes N independent items goes through
**`studio.parallel`**. One canonical helper, one set of defaults, one
opt-out. Every machine — your Windows laptop, Box A in Helsinki, future
Box B / Box C / customer hardware — runs the same parallel topology.

### The four canonical entry points

```python
from studio.parallel import (
    parallel_map,             # generic: any per-N hot loop
    parallel_per_frame_pass,  # specialized: ffmpeg + per-frame PIL
    parallel_enabled,         # bool — is parallelism active?
    worker_count,             # int — default = cpu_count - 1
)
```

### parallel_map — for ANY new hot loop

```python
# Before (serial — DON'T):
results = [process(item) for item in items]

# After (parallel — DO):
from studio.parallel import parallel_map
results = parallel_map(
    module_name="my.module",
    fn_name="process",
    args_iter=items,
    chunksize=4,
    label="item",
)
```

`process` must be a top-level function in `my.module` taking one
positional arg. Returns a list of results in input order.
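
For concreteness, a minimal sketch of a conforming worker module. The module path and contents are placeholders matching the snippet above, not an existing module:

```python
# my/module.py — hypothetical worker module matching the snippet above
from pathlib import Path

def process(item):
    """Top-level, importable as ("my.module", "process"), one positional arg.

    Both `item` and the return value must be picklable (paths, tuples,
    dicts of primitives), because they cross the process boundary under
    spawn semantics.
    """
    path = Path(item)
    data = path.read_bytes()            # stand-in for the real per-item work
    return {"path": str(path), "size": len(data)}
```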

### parallel_per_frame_pass — for ffmpeg-based style passes

Already used by every shipped post-process backend (watercolor, vector_flat,
van_gogh, cel_shaded, voxel_blocks, claymation, paper_cutout,
comic_halftone). Don't reinvent the frame-splitting plumbing for new
backends; call this.
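
For orientation, a new backend would look roughly like the sketch below. The keyword arguments here are assumptions for illustration only — check the real `parallel_per_frame_pass` signature in `studio.parallel` before copying:

```python
# backends/my_new_style.py — hypothetical backend; argument names below are
# assumptions, NOT the actual parallel_per_frame_pass signature
from studio.parallel import parallel_per_frame_pass

def stylize_frame(frame_path):
    """Top-level per-frame worker; PIL/numpy work on a single frame goes here."""
    ...

def apply_my_new_style(input_video, output_video):
    # The helper owns the ffmpeg extract → per-frame pool → ffmpeg re-encode
    # plumbing; the backend only supplies the worker by (module, function) name.
    return parallel_per_frame_pass(
        module_name="backends.my_new_style",
        fn_name="stylize_frame",
        input_path=input_video,
        output_path=output_video,
        label="frame",
    )
```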

### parallel_enabled / worker_count

For UI/diagnostic code that needs to know parallel status. The dashboard's
PIPELINE STATUS card uses these.
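
For example, a status line in the shape the dashboard prints could be assembled like this (assuming both helpers are callables, as the `worker_count()` call in rule 3 below suggests; the format string is illustrative, the real card lives in the dashboard code):

```python
import os

from studio.parallel import parallel_enabled, worker_count

def pipeline_status_line() -> str:
    # e.g. "Parallel pipeline: enabled · 7/8 cores"
    state = "enabled" if parallel_enabled() else "disabled"
    return f"Parallel pipeline: {state} · {worker_count()}/{os.cpu_count()} cores"
```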

## The five rules

1. **DEFAULT IS PARALLEL.** Serial is the special case. New per-frame /
   per-variant / per-asset code defaults to `parallel_map`. If you find
   yourself writing `for x in items: ...` in a render path, stop and
   ask: is this the hot path? If yes, use `parallel_map`.

2. **WORKERS ARE TOP-LEVEL.** Pass `(module_name, fn_name)` strings.
   `ProcessPoolExecutor` uses `spawn` on Windows and pickles callables by
   qualified name; closures and bound methods cannot cross the process
   boundary. A module-level `def my_worker(arg): ...` works on every
   server. A nested function or a lambda does not.

3. **WORKER COUNT IS `cpu_count() - 1`.** Always leave one core for the
   OS, the dashboard, the IDE. `worker_count()` enforces this. Override
   only with strong reason (e.g. memory-bound work where 8× workers
   would OOM the box).

4. **OPT-OUT IS A SINGLE ENV VAR.** `STUDIO_PARALLEL=0` falls back to
   serial. Used for debugging stack traces (parallel workers swallow
   exceptions into futures). Never disable globally; never gate on a
   different mechanism.

5. **NUMPY VECTORIZATION FIRST.** If each per-frame call is 60s of pure
   Python, no amount of multiprocessing saves you — you've split a
   slow problem into 8 still-slow pieces. Vectorize per-frame work
   with numpy first. Watercolor grain went from ~12s/frame Python to
   ~120ms/frame numpy (~100×); *then* we parallelized across frames
   for another ~7×; the combined win is ~700×. A minimal sketch of that
   vectorize-then-parallelize shape follows this list.
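
The grain math below is illustrative, not the watercolor backend's actual formula; the point is that a per-pixel Python loop becomes one broadcasted numpy expression, and only then is the per-frame function worth fanning out across cores.

```python
import numpy as np

def grain_python(frame, strength=0.08, seed=0):
    """Per-pixel Python loop: interpreter overhead on every one of H*W pixels."""
    rng = np.random.default_rng(seed)
    out = frame.astype(np.float32)
    h, w, _ = frame.shape
    for y in range(h):
        for x in range(w):
            out[y, x] *= 1.0 + strength * (rng.random() - 0.5)
    return np.clip(out, 0, 255).astype(np.uint8)

def grain_numpy(frame, strength=0.08, seed=0):
    """The same grain, expressed as one broadcasted numpy expression."""
    rng = np.random.default_rng(seed)
    noise = 1.0 + strength * (rng.random(frame.shape[:2], dtype=np.float32) - 0.5)
    out = frame.astype(np.float32) * noise[..., None]
    return np.clip(out, 0, 255).astype(np.uint8)
```

Vectorize first, then hand the (now fast) per-frame function to `parallel_map`; multiprocessing only multiplies whatever per-frame cost is left.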

## Cross-platform notes

| | Windows | Linux (Box A) |
|---|---|---|
| `multiprocessing` start method | `spawn` | `fork` (default) |
| Closures across worker boundary | NO | yes (but unsafe — don't rely) |
| Top-level worker functions | required | recommended |
| numpy GIL release in BLAS | yes | yes |
| ffmpeg subprocess from worker | works | works |

**Top-level workers everywhere.** That guarantees the same code runs
identically on both platforms with `spawn` semantics. Linux `fork`
happens to be more permissive; coding to `spawn` means you never trip
on Windows-only failures.
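
One consequence of coding to `spawn` is worth spelling out, because it bites on Windows and not on Linux: `spawn` re-imports the entry script in every worker, so any module that launches a pool must guard its entry point or each worker will try to launch its own pool. This is general Python behavior, not anything specific to `studio.parallel`; the file name below is illustrative:

```python
# render_cli.py — illustrative entry point
import os
from concurrent.futures import ProcessPoolExecutor

def square(n):
    # Top-level, so spawn can re-import the module and resolve it by name.
    return n * n

def main():
    workers = max(1, (os.cpu_count() or 2) - 1)   # mirrors the cpu_count() - 1 rule
    with ProcessPoolExecutor(max_workers=workers) as pool:
        print(list(pool.map(square, range(16), chunksize=4)))

if __name__ == "__main__":   # required under spawn; harmless under fork
    main()
```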

## Verification — what to run when in doubt

```bash
# 1. Check the canonical helper works
python -m studio.parallel
# expects: correct_results=True, workers=N

# 2. Benchmark a specific backend
python -m backends._parallel --bench --module backends.svg_cairo --fn apply_vector_flat
# expects: speedup ≈ workers

# 3. The dashboard PIPELINE STATUS panel surfaces live parallel state
# Open localhost:5000 → look for "Parallel pipeline: enabled · 7/8 cores"
```

## What's already parallel (today)

| Module | Pattern | Speedup |
|---|---|---|
| `backends/_parallel.py::parallel_per_frame_pass` | per-frame pool | 2.4–2.6× verified on 8-core |
| `backends/{watercolor,svg_cairo,painterly_van_gogh,cel_shader,voxel_blocks,claymation,paper_cutout,comic_halftone}` | uses ↑ | same |
| `tv/silent_film_parallel.py::render_script_parallel` | two-phase plan/dispatch | 4–7× |
| `tv/multi_aspect.py` | per-platform pool | 4–7× across platforms |
| `tv/sprites/mint_alphabet.py` | per-letter pool | linear in cores |
| `tv/sprites/mint_letterform.py` | per-letter pool | linear in cores |

## What still needs parallelizing (gaps)

- **Brainrot v5** (`backends/brainrot_v5/`) — currently serial. The
  scheduler builds shared state up front, then per-frame application
  is independent. ~3-4hr refactor: scheduler stays single-threaded,
  per-frame application moves to `parallel_map` (rough sketch below).
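
A rough sketch of that shape, with placeholder names (`build_schedule`, `apply_frame`, and the module path are illustrative, not the actual brainrot_v5 API):

```python
# backends/brainrot_v5/frames.py — hypothetical split; names are placeholders
from studio.parallel import parallel_map

def build_schedule(frame_paths):
    """Serial phase: build the shared state once, as a plain picklable dict."""
    ...

def apply_frame(task):
    """Top-level worker: one (frame_path, shared_state) tuple per call."""
    frame_path, shared_state = task
    ...  # per-frame application, independent of every other frame

def render(frame_paths):
    shared_state = build_schedule(frame_paths)     # scheduler stays single-threaded
    tasks = [(path, shared_state) for path in frame_paths]
    return parallel_map(
        module_name="backends.brainrot_v5.frames",
        fn_name="apply_frame",
        args_iter=tasks,
        chunksize=8,
        label="frame",
    )
```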

## When to NOT parallelize

- Work that takes <1s total. Pool spawn overhead is ~0.5s on Windows (a
  guard sketch follows this list).
- Heavily IO-bound subprocess waits where N parallel ffmpeg processes
  would saturate disk bandwidth (test first).
- Code paths where order-dependent state is unavoidable (rare in
  procedural pipelines; most "order-dependent" code is actually
  per-frame independent if you decompose it right).
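
Where a call site genuinely straddles the line, a small guard keeps tiny batches serial without inventing a second opt-out mechanism. A hedged sketch (`maybe_parallel` and the threshold are illustrative, not part of `studio.parallel`):

```python
import importlib

from studio.parallel import parallel_enabled, parallel_map

def maybe_parallel(items, *, module_name, fn_name, min_items=50):
    """Run serially when the batch is too small to amortize pool spawn overhead."""
    if len(items) < min_items or not parallel_enabled():
        fn = getattr(importlib.import_module(module_name), fn_name)
        return [fn(item) for item in items]
    return parallel_map(module_name=module_name, fn_name=fn_name,
                        args_iter=items, chunksize=4, label="item")
```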

## TL;DR for new contributors

> If you're processing N independent items and N > 50 and the per-item
> work is non-trivial, use `studio.parallel.parallel_map`. If you're
> doing a per-frame ffmpeg-based style pass, use
> `studio.parallel.parallel_per_frame_pass`. If you're doing anything
> else, ask: should this be parallel? If maybe-yes, use `parallel_map`
> and let the runtime decide via `STUDIO_PARALLEL`.
