Edge AI vs cloud: when does it make sense to run the model on the device

Cloud inference is fine for many applications. But once latency, connectivity, privacy, power consumption, or operating cost enter the picture, running the model locally changes the equation. Here's how we think through that decision on real projects.

When a client comes to us with a device that needs some form of AI inference — anomaly detection on a vibration sensor, gesture recognition on an IMU, keyword spotting on a microphone, or even a small dedicated LLM — one of the first decisions is where the model runs. On-device (Edge AI) or in the cloud?

There's no universal answer. The right choice depends on four things: latency requirements, connectivity assumptions, data sensitivity, and total cost of ownership. Let's go through each.

Latency: the clearest forcing function

Cloud inference adds a round-trip. On a good LTE connection in a city, that round-trip is typically 80–150 ms. On a cellular network in an industrial environment or a rural area, add another 50–300 ms on top of that. On WiFi in a home or office, it's usually fine.

For most alert-style use cases — "notify me when this asset has been idle for 10 minutes" — that latency is irrelevant. But for real-time control loops, human-facing feedback, or safety-critical detection, anything above 50 ms starts to hurt the user experience. Once the requirement drops below 20 ms, cloud is simply out of the picture.

If your device needs to react faster than a human can perceive the delay, run the model on the device.
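As a sanity check, the budget arithmetic is worth writing out. The millisecond figures below are illustrative assumptions drawn from the ranges above, not measurements from any particular deployment:

```python
# Rough latency budget comparison. All values are assumptions:
# pick numbers from your own network and hardware measurements.

CLOUD_RTT_MS = 120        # typical urban LTE round-trip (80-150 ms range above)
CLOUD_INFERENCE_MS = 15   # assumed server-side inference time
LOCAL_INFERENCE_MS = 8    # assumed on-device INT8 inference on a capable MCU

cloud_total = CLOUD_RTT_MS + CLOUD_INFERENCE_MS
local_total = LOCAL_INFERENCE_MS

print(f"cloud path: {cloud_total} ms, local path: {local_total} ms")

# A 50 ms real-time budget is blown by the network round-trip alone,
# before server-side inference time is even counted.
assert CLOUD_RTT_MS > 50 > local_total
```

The point of the exercise: in the cloud path, the network round-trip dominates, so no amount of server-side optimisation rescues a sub-50 ms requirement.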

Connectivity: the hidden assumption

Cloud inference requires a working connection every time you need a prediction. That sounds obvious, but it's easy to underestimate what "working connection" means in the field.

We've worked on devices deployed in underground facilities, inside metal enclosures, on moving vehicles in tunnels, and in remote agricultural settings. All of them had some connectivity on paper. None of them had reliable connectivity in practice.

If your device must keep working when the connection drops — even for 30 seconds — you either need on-device inference or a local fallback model. A device that silently fails when the cloud is unreachable is a support ticket waiting to happen.
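The fallback pattern is simple to sketch. Everything below is hypothetical — the endpoint URL is a placeholder and `LocalModel` stands in for whatever smaller on-device model you ship — but the shape is the important part: a bounded timeout, and a local answer when the cloud is unreachable.

```python
import json
import urllib.request

class LocalModel:
    """Stand-in for a small on-device fallback classifier (hypothetical)."""
    def predict(self, sample):
        return "anomaly" if max(sample) > 0.8 else "normal"

local_model = LocalModel()
CLOUD_ENDPOINT = "https://cloud.invalid/infer"  # placeholder endpoint

def classify(sample, timeout_s=2.0):
    """Cloud-first inference that degrades to a local model."""
    try:
        req = urllib.request.Request(
            CLOUD_ENDPOINT,
            data=json.dumps({"sample": sample}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return json.load(resp)["label"]
    except OSError:
        # DNS failure, dropped connection, or timeout: answer locally
        # instead of failing silently.
        return local_model.predict(sample)
```

Note the bounded timeout: without it, a "working" but degraded connection can stall the device for far longer than the outage the fallback was meant to handle.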

Data privacy: what leaves the device

Sending raw sensor data to the cloud for inference means that data leaves the device, travels over the network, and is processed on a third-party server. For many sensor types that's fine. For others it's not.

The obvious cases are audio and video: nobody wants their microphone data streamed to a cloud endpoint for keyword detection. But the same logic applies to biometric sensors, medical wearables, and industrial sensors that carry production data a client considers proprietary.

Running inference on-device means raw data never leaves. Only the result — a label, a score, an alert — is transmitted. That's a much smaller and less sensitive payload, and it dramatically simplifies your GDPR and data-handling story.
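The size difference alone is striking. Taking keyword spotting as an example — 16 kHz, 16-bit mono audio, with an event payload shaped like the label/score/timestamp described above — the arithmetic works out to roughly six orders of magnitude:

```python
import json

# Raw audio stream for cloud-side keyword spotting: 16 kHz, 16-bit mono.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2
raw_bytes_per_hour = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * 3600  # ~115 MB/hour

# On-device result: only a label, a score, and a timestamp leave the device.
event = json.dumps({"label": "keyword", "score": 0.94, "ts": 1700000000})
event_bytes = len(event.encode())  # tens of bytes

print(f"raw audio: {raw_bytes_per_hour:,} bytes/hour, event: {event_bytes} bytes")
```

Beyond privacy, that reduction also matters for cellular data budgets and battery drain on the radio.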

Cost: the calculation most people get wrong

Cloud inference has a per-call cost. At small scale — a few hundred devices, a few inferences per minute — that cost is negligible. At large scale, or with high-frequency inference, it compounds fast.

The typical mistake is to evaluate cost at prototype scale, then scale the device to tens of thousands of units without revisiting the numbers. A model that runs inference once per minute on 50,000 devices produces over 2 billion API calls per month. At $0.0001 per call, that's over $200,000/month in inference costs alone.

On-device inference has a fixed hardware cost (a slightly more capable MCU, or a part with an on-chip NPU or vector extensions such as the STM32N6 or Renesas RA8D1) and no per-call fee. At scale, the economics almost always favour running on the device.
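That back-of-envelope is worth parameterising so it can be rerun whenever the fleet size or inference rate changes. The $0.0001 per-call price is an assumption carried over from above; substitute your provider's actual pricing:

```python
def monthly_inference_cost(devices, inferences_per_minute, usd_per_call=0.0001):
    """Back-of-envelope cloud inference cost over a 30-day month."""
    calls_per_month = devices * inferences_per_minute * 60 * 24 * 30
    return calls_per_month, calls_per_month * usd_per_call

calls, cost = monthly_inference_cost(devices=50_000, inferences_per_minute=1)
print(f"{calls:,} calls/month -> ${cost:,.0f}/month")

# Break-even intuition: at this monthly spend, a $2 BOM increase for a
# more capable MCU across 50,000 units (a $100,000 one-off) pays for
# itself in well under a month.
```

Running it for a higher inference rate makes the compounding obvious: at 60 inferences per minute the same fleet costs sixty times as much.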

What Edge AI actually requires from the hardware

This is where the trade-off becomes concrete for device design. Running a model on the device means:

  • Enough compute — typically a Cortex-M55 or higher, or a dedicated NPU
  • Enough RAM to load the model and run inference (often 256 KB–2 MB for TinyML models)
  • Enough flash for the model weights (post-quantisation, most useful models fit in 100 KB–1 MB)
  • A power budget that accommodates the inference cycle — especially for battery-powered devices

The good news is that the model compression tooling has improved significantly. TensorFlow Lite for Microcontrollers, STM32Cube.AI, and ONNX Runtime for embedded can deploy quantised INT8 models on hardware that costs under $5 in volume. The trade-off is accuracy: a compressed model is less accurate than the full-size version. For most practical classification tasks on sensor data, that loss is acceptable.
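Stripped of the toolchain, post-training INT8 quantisation reduces to simple scaling arithmetic. The sketch below is not the TensorFlow Lite converter — just the core symmetric per-tensor scheme those tools apply, shown on a toy weight vector:

```python
def quantise_int8(weights):
    """Symmetric per-tensor INT8 quantisation: w ~= scale * q, q in [-127, 127].
    A minimal illustration of the arithmetic behind post-training
    quantisation tooling, not a drop-in replacement for it."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantise(q, scale):
    return [scale * v for v in q]

weights = [0.42, -1.27, 0.08, 0.9531]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)

# Each weight now takes 1 byte instead of 4 (float32): a 4x size cut.
# The price is rounding error, bounded by scale/2 per weight -- this is
# the accuracy loss the paragraph above refers to.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
assert max_err <= scale / 2 + 1e-9
```

Real toolchains add per-channel scales, zero points for asymmetric ranges, and calibration over a representative dataset, but the size/accuracy trade-off is exactly this rounding step applied to every tensor in the model.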

The hybrid approach

For many devices, the right answer isn't either/or. A common pattern we use:

  1. Run a lightweight classifier on-device continuously — low power, always-on
  2. When the on-device model detects something interesting, send the raw event to the cloud for deeper analysis or logging
  3. Periodically sync model updates from the cloud back to the device via OTA

This gives you real-time local response, cloud-level accuracy for edge cases, and a path to improve the model after deployment. The on-device model acts as a filter, keeping cloud costs proportional to meaningful events rather than raw sensor throughput.
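The filter behaviour in steps 1–2 can be sketched in a few lines. `StubModel` and `StubCloud` below are placeholders for your own on-device classifier and cloud client; the point is the control flow, where only flagged events cross the network:

```python
class StubModel:
    """Placeholder for the lightweight always-on classifier."""
    def predict(self, sample):
        score = max(sample)
        return ("anomaly", score) if score > 0.8 else ("normal", score)

class StubCloud:
    """Placeholder for the cloud client that logs interesting events."""
    def __init__(self):
        self.events = []
    def upload(self, event):
        self.events.append(event)

def run_edge_loop(samples, model, cloud, threshold=0.8):
    """On-device model as a filter: only flagged events go upstream."""
    sent = 0
    for sample in samples:
        label, score = model.predict(sample)
        if label != "normal" and score >= threshold:
            cloud.upload({"label": label, "score": score, "sample": sample})
            sent += 1
    return sent

# 1,000 sensor windows, of which only 3 look anomalous:
cloud = StubCloud()
samples = [[0.1, 0.2]] * 997 + [[0.95, 0.1]] * 3
sent = run_edge_loop(samples, StubModel(), cloud)
```

Here 1,000 inference cycles produce 3 uploads — the cloud bill tracks events, not sensor throughput, which is the economic point of the hybrid pattern.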

The short version

Run the model on the device if any of these are true:

  • Your response time requirement is under 100 ms
  • The device operates in environments with intermittent or no connectivity
  • The raw sensor data is sensitive (audio, biometric, proprietary industrial data)
  • You expect to deploy more than a few thousand units
  • The device needs to work autonomously without cloud dependency

Use the cloud if your use case is low-frequency, latency-tolerant, connectivity is guaranteed, and you're at a scale where the API cost is manageable. And consider the hybrid pattern if you need both.


If you're speccing a device and trying to figure out where inference should live, the answer usually falls out of the constraints — it's rarely a free choice. The tricky part is making sure you've identified all the constraints before you commit to a hardware platform.

Building a device that needs on-device AI?

We help hardware startups design the full stack — from sensor selection and PCB layout to model compression and deployment. Tell us what you're working on.

Tell us about your device