What $0.02 per million embedding tokens actually costs to run
Unit economics of running embedding inference on idle consumer Apple Silicon.
We charge customers $0.02 per million tokens for embeddinggemma-300m. That price is undercut only by Together's $0.008/M and OpenAI's $0.013/M on text-embedding-3-small. The interesting question — the one we get asked by every prospective enterprise customer — is whether that price is sustainable or a temporary land-grab loss-leader.
This post pulls the curtain back. We're a marketplace, not a margin business: the 99% pass-through to providers means our pricing is built bottom-up from real hardware cost, not top-down from a desired margin. So the unit economics are mechanical and inspectable. Here they are.
The hardware baseline
We run embeddinggemma-300m on Apple's Neural Engine via Core ML. The dispatching logic prefers base-M2, base-M3, and base-M4 Macs (16 GB unified memory, single-cluster ANE). M-Pro and M-Max devices outperform on per-token latency but waste a lot of unified memory headroom that could be serving a fatter model in the same time window — so we route them elsewhere.
On a typical base-M3 16 GB Mac plugged in and thermally stable:
- ANE throughput: ~17,000 tokens / second sustained on the 300M-parameter Gemma embedding model. Burst throughput touches 22k tok/s but we plan around sustained.
- Active power draw during inference: ~18 watts at the wall, measured with a Kill-A-Watt across five provider Macs. Idle power on the same machines is ~7W, so the marginal power per inference second is ~11W.
- Effective utilization: ~78%. The other 22% is dispatch overhead, model load-on-first-request, and the inter-task pause window where the runner waits for its next assignment.
So the provider's marginal cost of a million tokens, at sustained throughput and effective utilization:
1,000,000 tokens / 17,000 tok/s / 0.78 utilization
= 75.4 active seconds per million tokens
= 75.4 s × 11 W = 829 watt-seconds = 0.23 watt-hours
At a US residential power rate of $0.16/kWh, that's $0.000037 per million tokens in electricity. Round generously to half a cent if you live somewhere expensive, less than that everywhere else. The marginal cost of an embedding on a Mac that's already powered on is essentially zero.
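The arithmetic above can be reproduced in a few lines. This is a minimal sketch that uses only the figures quoted in this section; the function and parameter names are illustrative and not part of any Common Compute tooling.

```python
def electricity_cost_per_million_tokens(
    tokens_per_second: float = 17_000,  # sustained ANE throughput quoted above
    utilization: float = 0.78,          # effective utilization quoted above
    marginal_watts: float = 11.0,       # active draw minus idle draw (18 W - 7 W)
    usd_per_kwh: float = 0.16,          # US residential rate used in the text
) -> dict:
    """Return seconds, watt-hours, and USD of electricity per million tokens."""
    seconds = 1_000_000 / tokens_per_second / utilization
    watt_hours = seconds * marginal_watts / 3600   # watt-seconds -> watt-hours
    usd = watt_hours / 1000 * usd_per_kwh          # watt-hours -> kWh -> USD
    return {"seconds": seconds, "watt_hours": watt_hours, "usd": usd}


result = electricity_cost_per_million_tokens()
print(f"{result['seconds']:.1f} s, {result['watt_hours']:.2f} Wh, ${result['usd']:.6f}")
# prints: 75.4 s, 0.23 Wh, $0.000037
```

Swapping in your own wattage or tariff is a one-argument change, which is the point: the electricity term stays tiny across any plausible inputs.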
Why we charge two cents, not zero
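Three reasons the price isn't the two to three orders of magnitude lower that the electricity cost alone would suggest ($0.02 is roughly 540× the $0.000037 electricity figure).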
Dispatch overhead is real. Every task carries fixed costs that don't scale with token count: the WebSocket message, the assignment signature verification, the result envelope round-trip, the ledger write at the coordinator. We measure this at roughly 280 ms of provider-side wall time plus around 4 KB of WebSocket traffic per task. At 100k tasks per day in aggregate, that's a non-trivial slice of operational load.
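To put the aggregate in concrete terms, here is a rough sketch that scales the per-task figures above to the 100k-tasks-per-day volume. The constant names are illustrative, and treating 4 KB as 4,000 bytes is an assumption made for round numbers.

```python
TASKS_PER_DAY = 100_000            # aggregate daily volume quoted above
OVERHEAD_SECONDS_PER_TASK = 0.280  # provider-side wall time per task
TRAFFIC_KB_PER_TASK = 4            # WebSocket traffic per task (assumed 1 KB = 1,000 bytes)

# Total fixed overhead across the network, independent of token count.
overhead_hours = TASKS_PER_DAY * OVERHEAD_SECONDS_PER_TASK / 3600
traffic_mb = TASKS_PER_DAY * TRAFFIC_KB_PER_TASK / 1000

print(f"{overhead_hours:.1f} provider-hours/day, {traffic_mb:.0f} MB/day of WebSocket traffic")
```

Under those assumptions the fixed overhead works out to a little under eight provider-hours and about 400 MB of traffic per day, none of which produces a single embedding.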
Hardware amortization. Providers aren't running cost-free hardware — they bought a Mac. The "what should the Mac earn per hour to make this worthwhile" question is what determines whether we have supply at any price. Our earnings estimator targets $0.10–0.20 per active hour for a base M3, depending on workload mix. At our embedding throughput, that translates to roughly $0.014 per million tokens — which is where the floor of our pricing sits.
Coordination cost. The 1% take rate has to cover: the Workers infrastructure, the D1 storage, the R2 bandwidth for inputs and outputs, the Stripe processing fees on customer charges and provider payouts, the Apple Developer Program fee, the notarization, the marketing site, the support inbox, and the salary of the people running it. At low volume that 1% doesn't cover much; the bet is that volume scales fast enough that the 1% is enough.
The bills, in order
Here's a per-million-tokens bill broken out at our current pricing:
At our target earnings of $0.10–0.20 per active hour, a provider earning $0.0189 per million tokens needs to push roughly 5–10 million tokens per active hour, which sits well inside the sustainable throughput band for base-M3 hardware.
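The required volume can be checked against the hardware baseline directly. The sketch below is illustrative only: the function name is hypothetical, and the capacity figure assumes the sustained throughput and effective utilization quoted earlier apply across a full active hour.

```python
def tokens_needed_per_hour(target_usd_per_hour: float, payout_usd_per_million: float) -> float:
    """Tokens a provider must process per active hour to hit an hourly earnings target."""
    return target_usd_per_hour / payout_usd_per_million * 1_000_000


PAYOUT_PER_MILLION = 0.0189  # provider earnings per million tokens, from the text

low = tokens_needed_per_hour(0.10, PAYOUT_PER_MILLION)
high = tokens_needed_per_hour(0.20, PAYOUT_PER_MILLION)

# Assumed hourly capacity: 17,000 tok/s sustained at 78% effective utilization.
capacity = 17_000 * 0.78 * 3600

print(f"needed: {low / 1e6:.1f}M to {high / 1e6:.1f}M tokens/hour")
print(f"capacity: {capacity / 1e6:.1f}M tokens/hour")
```

With these inputs the requirement lands at roughly 5.3 to 10.6 million tokens per hour against an assumed capacity of about 47.7 million, so the earnings target does not depend on running the Mac flat out.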
Where this breaks
A few honest caveats:
- At very low volume, the per-task overhead dominates. If you run a million-token embedding job once a day, the dispatch fixed costs aren't amortised across enough tokens. Your effective rate would be closer to $0.04/M, not $0.02/M. We don't charge for that — the quote is locked — but the provider economics get squeezed.
- At very high volume, we route differently. A million tasks a day flowing into the embeddings runner is more than any single base-M3 can sustainably absorb. We split the load across a wider pool, which is fine for throughput but introduces a different bottleneck (per-Mac thermal cycling) that we manage with longer cool-down windows.
- For non-US providers, electricity costs vary by 4–6×. Electricity in Germany runs ~30¢/kWh; in Spain ~20¢; in much of Asia ~10¢. We don't price-discriminate on geography because the task quote can't be conditional on which provider takes it. Providers in expensive electricity markets self-select out of the network or run during their grid's off-peak hours.
The comparison nobody wants to print
A typical hyperscaler running an equivalent embeddings model on a CPU instance prices in the $0.10/M range. On a GPU instance it drops to $0.04/M but you pay during the cold start and the queue. Our $0.02/M includes the cold start (because we don't have one — providers are already up) and excludes the per-second compute charge (because there isn't one — it's per-task).
The gap exists because the hyperscaler is renting hardware optimised for training, paying for a data center, paying for cooling, paying for hyperscaler margin, and amortising the cost of the GPUs they had to buy at $30k apiece. Our network is renting hardware that was already bought, in homes and offices that were already heated and cooled and powered.
That gap is the entire pitch.
What this doesn't say
This post is about one specific workload — embeddinggemma-300m — and a specific price point. The story is similar for Whisper, image upscale, and small-LLM inference; the pricing page has the per-task numbers.
For frontier chat at scale, the story is completely different. Apple Silicon is the wrong hardware for 70B-parameter transformer decoding throughput, and our network's chat pricing reflects that honestly. If you're running chat, route it to Together or Fireworks, and use Common Compute for the workloads where the unit economics line up.
If your workload is on this catalog and these numbers make sense, the $5 free credit on signup is enough to run a real benchmark. We'd rather you do that than take our word for it.