A flagship smartphone in 2026 runs AI models up to roughly 4 billion parameters in quantized form. An M4 Mac Mini handles 7 billion smoothly. Gemma 3's 1B model scores 62.8% on the GSM8K math benchmark and runs at over 2,500 tokens per second (prefill) on a mobile GPU. Two years ago, a model several times that size couldn't follow basic instructions. The performance-per-parameter curve is steepening fast, and the devices people already own are becoming meaningfully capable inference machines.
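The memory arithmetic explains why roughly 4 billion parameters is where phones top out. The sketch below assumes 4-bit quantization and a rough 10% overhead for scales, embeddings, and activations; both figures are illustrative assumptions, not vendor specs.

```python
# Back-of-envelope: memory footprint of quantized models on consumer hardware.
# Assumed figures (illustrative, not from any spec sheet): 4-bit weights plus
# a rough 10% overhead for embeddings, activations, and quantization scales.

def model_footprint_gb(params_billions: float, bits_per_weight: int = 4,
                       overhead: float = 0.10) -> float:
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9

for name, params in [("1B (Gemma-3-class)", 1),
                     ("4B (flagship phone ceiling)", 4),
                     ("7B (M4 Mac Mini)", 7)]:
    print(f"{name}: ~{model_footprint_gb(params):.1f} GB at INT4")

# 4B at INT4 is ~2.2 GB of weights: feasible on a 12 GB phone alongside the
# OS and apps. 7B at ~3.9 GB fits comfortably in a Mac Mini's unified memory.
```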
The economics look favorable for certain workloads. For always-on features like continuous transcription or ambient translation, cloud pricing is punishing. Every second is metered. On-device, the meter doesn't exist. That distinction opens up entire categories of capability that aren't viable when you're paying per token.
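A rough sketch makes the point concrete. The per-minute rate and duty cycle below are assumed illustrative figures, not any provider's actual pricing; the shape of the result holds across realistic rates.

```python
# Why per-second metering is punishing for always-on features.
# Both inputs are assumptions chosen for illustration.

CLOUD_RATE_PER_MIN = 0.02   # assumption: cloud speech-to-text, $/audio-minute
HOURS_AWAKE_PER_DAY = 16    # assumption: ambient capture during waking hours

minutes_per_year = HOURS_AWAKE_PER_DAY * 60 * 365
cloud_cost = minutes_per_year * CLOUD_RATE_PER_MIN

print(f"Minutes transcribed per year: {minutes_per_year:,}")
print(f"Cloud cost per user per year: ${cloud_cost:,.0f}")
# ~350,000 minutes -> roughly $7,000 per user per year in the cloud.
# On-device, the marginal cost of those same minutes is zero.
```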
| | Mac Mini (M4) | Equivalent GPU rig |
|---|---|---|
| Power draw under AI inference | 30–40W | 350–450W |
| Annual electricity cost (daily use) | ~$14 | ~$160–210 |
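The table's figures follow from straightforward arithmetic. The duty cycle and electricity rate below are assumptions chosen to reproduce the table's numbers, not measurements.

```python
# Reproducing the table's electricity figures. Usage pattern and power
# price are assumptions: 8 hours of inference per day at ~$0.14/kWh.

KWH_PRICE = 0.14        # assumption: roughly US-average residential rate
HOURS_PER_DAY = 8       # assumption: "daily use" duty cycle

def annual_cost(watts: float) -> float:
    kwh_per_year = watts * HOURS_PER_DAY * 365 / 1000
    return kwh_per_year * KWH_PRICE

print(f"Mac Mini (M4) at 35 W: ~${annual_cost(35):.0f}/yr")    # ~$14
print(f"GPU rig at 400 W:      ~${annual_cost(400):.0f}/yr")   # ~$164
```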
But the tasks generating the most business value tend to require what local hardware can't provide. Complex reasoning needs frontier models with hundreds of billions of parameters. Agentic workflows need large context windows. Training stays cloud-only. The on-device AI hardware market is growing at roughly 19% annually, yet cloud still dominates inference revenue.
The redistribution question is genuinely hard to call, and the reason sits in the cost curves. Inference unit costs have fallen roughly 280-fold since late 2022. Total inference spending keeps climbing because usage grows faster than prices fall. Enterprises deploy AI across more workflows, generating more tokens, and the rising aggregate cost pushes some workloads toward the edge.
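That arithmetic is worth making explicit. In the sketch below, only the roughly 280-fold unit-cost decline comes from the text; the usage multiple is an assumption picked purely to illustrate the mechanism.

```python
# Why total spend rises while unit cost collapses. Only the ~280x price
# decline is from the text; the 1,000x usage multiple is an assumption.

UNIT_COST_DECLINE = 280    # from the text: per-token cost since late 2022
USAGE_GROWTH = 1_000       # assumption: token volume over the same period

spend_multiple = USAGE_GROWTH / UNIT_COST_DECLINE
print(f"Per-token cost:  1/{UNIT_COST_DECLINE} of late-2022 levels")
print(f"Token volume:    {USAGE_GROWTH}x late-2022 levels")
print(f"Aggregate spend: {spend_multiple:.1f}x late-2022 levels")

# The break-even line is exactly the price decline: any usage growth
# beyond 280x means the total bill went up, not down.
```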
The same dynamic cuts the other way, too: as unit costs fall, frontier-scale capabilities become affordable enough to deploy widely, so more workloads exist that only the cloud can serve. Falling costs make the edge more attractive and the center more useful, simultaneously. Both are expanding at once.
Meanwhile, fragmentation compounds the difficulty of building for the edge. Qualcomm, Apple, MediaTek, and Samsung each maintain their own NPU frameworks and SDKs. On Android alone, developers have to support Qualcomm's QNN, MediaTek's NeuroPilot, Samsung's ONE, or fall back to CPU. Getting a model to run well on one device tells you little about whether it'll run on the next. Cloud APIs, whatever their costs, offer a single target. That operational overhead shapes real deployment decisions every day, and it's easy to underestimate from the outside. Apple's own Neural Engine dequantizes INT8 weights to FP16 before compute, meaning the marketed 38 TOPS is closer to 19 TFLOPS in practice. The gap between spec sheet and production behavior is wide, and it's different on every chip.
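In practice, fragmentation forces every on-device deployment into a probe-and-fall-back chain. The sketch below is entirely hypothetical: the backend names mirror the SDKs mentioned above, but none of these vendors expose this adapter interface; it only illustrates the shape of the overhead.

```python
# Hypothetical probe-and-fall-back loader. The adapter interface is
# invented for illustration; real SDKs each have their own APIs, which
# is precisely the problem described above.

from typing import Callable, Optional

# Each probe tries to compile/load the model for one NPU stack and
# returns a runnable handle, or None if that stack is unavailable.
BACKENDS: list[tuple[str, Callable[[str], Optional[object]]]] = [
    ("qualcomm-qnn",       lambda path: None),      # stand-in for QNN
    ("mediatek-neuropilot", lambda path: None),     # stand-in for NeuroPilot
    ("samsung-one",        lambda path: None),      # stand-in for ONE
    ("cpu-fallback",       lambda path: object()),  # CPU always works, slowly
]

def load_model(model_path: str):
    """Try each NPU backend in turn; fall back to CPU as a last resort."""
    for name, probe in BACKENDS:
        handle = probe(model_path)
        if handle is not None:
            print(f"loaded on {name}")
            return handle
    raise RuntimeError("no backend available")

load_model("model.bin")  # hypothetical path: lands on cpu-fallback here
```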
Cloud handles frontier reasoning and elastic training. On-premises infrastructure serves predictable high-volume inference. Edge devices take latency-critical or always-on tasks where metered compute makes no sense.
In practice, the center keeps absorbing more while the edge handles a specific, growing, but bounded category of work.
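That three-way split can be stated as a routing policy. The sketch below is a toy, with assumed task fields and thresholds rather than any real deployment policy, but it captures the decision structure.

```python
# Toy placement router for the division of labor described above.
# Fields and thresholds are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class Task:
    needs_frontier_reasoning: bool  # requires a frontier-scale model?
    latency_budget_ms: int          # how long the user will wait
    always_on: bool                 # runs continuously?
    volume_predictable: bool        # steady, forecastable load?

def place(task: Task) -> str:
    if task.needs_frontier_reasoning:
        return "cloud"      # frontier models live only in the center
    if task.always_on or task.latency_budget_ms < 100:
        return "edge"       # metered compute makes no sense here
    if task.volume_predictable:
        return "on-prem"    # steady load amortizes owned hardware
    return "cloud"          # elastic default for everything else

print(place(Task(False, 50, True, False)))    # -> edge
print(place(Task(True, 5000, False, False)))  # -> cloud
```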
The PC added a layer on top of centralized computing. NPUs and local inference may do something similar: open up categories of AI capability that weren't viable on cloud economics, while the cloud's role keeps expanding underneath. The center and the edge can both get bigger at the same time. The center just gets harder to see.

