Deploy Needle for On-Device Tool Calling on Mobile/Wearables

Run a 26M-parameter function-calling model at 6000 tok/s prefill on phones and watches

Updated: 5/13/2026
Difficulty: medium
Time: 1-2 hours
Use Case: Enable real-time tool calling on consumer devices without cloud dependency

About this automation

Needle reaches 6000 tok/s prefill and 1200 tok/s decode on consumer devices through its Simple Attention Network architecture, which has no feed-forward (FFN) layers. Deploy it with Cactus, an inference engine built from scratch for mobile, wearables, and custom hardware.
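To get a feel for what those throughput figures mean for a single tool call, the rough estimate below splits latency into prefill time (prompt tokens) and decode time (output tokens). The token counts are illustrative assumptions, not measurements, and the estimate ignores model loading, tokenization, and scheduling overhead.

```python
# Throughput figures quoted above for Needle on consumer devices.
PREFILL_TOK_PER_S = 6000
DECODE_TOK_PER_S = 1200


def estimate_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    """Back-of-envelope latency: prefill time plus decode time, in milliseconds."""
    prefill_s = prompt_tokens / PREFILL_TOK_PER_S
    decode_s = output_tokens / DECODE_TOK_PER_S
    return (prefill_s + decode_s) * 1000.0


# Assumed workload: 300 tokens of tool schemas plus query, 60 tokens of JSON output.
# 300 / 6000 = 0.05 s prefill, 60 / 1200 = 0.05 s decode, so roughly 100 ms total.
print(f"{estimate_latency_ms(300, 60):.0f} ms")
```

Because decode is five times slower per token than prefill at these rates, keeping the emitted JSON short matters more than trimming the prompt.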

How to implement

1. Download the Needle weights from Hugging Face (Cactus-Compute/needle).

2. Integrate the Cactus inference engine into your mobile/wearable app.

3. Define tool schemas for your application (timers, messaging, navigation, smart home, etc.).

4. Implement a single-shot function-calling pipeline (match the query to a tool, extract arguments, emit JSON).

5. Test latency and throughput on the target device.

6. Optimize for battery and memory constraints using Cactus tuning.
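
Steps 3 to 5 above can be sketched on a development machine before wiring in the real engine. The example below is a minimal sketch under stated assumptions: the tool names, the `{"name": ..., "arguments": ...}` output shape, the prompt layout, and the `generate` callable are all hypothetical placeholders, and `fake_generate` stands in for the on-device model call because the Cactus API is not described here. It shows defining schemas, validating the emitted JSON, and timing one call.

```python
import json
import time
from typing import Any, Callable

# Hypothetical tool schemas: tool name -> {argument name: expected Python type}.
TOOLS: dict[str, dict[str, type]] = {
    "set_timer": {"duration_seconds": int, "label": str},
    "send_message": {"recipient": str, "body": str},
}


def build_prompt(query: str) -> str:
    """Assumed prompt layout: serialized tool list followed by the user query."""
    tool_list = {
        name: {arg: typ.__name__ for arg, typ in schema.items()}
        for name, schema in TOOLS.items()
    }
    return f"Tools: {json.dumps(tool_list)}\nQuery: {query}\nCall:"


def validate_call(raw: str) -> dict[str, Any]:
    """Parse the model's JSON output and check it against TOOLS."""
    call = json.loads(raw)  # malformed JSON raises json.JSONDecodeError (a ValueError)
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    schema = TOOLS[name]
    if set(args) != set(schema):
        raise ValueError(f"expected arguments {sorted(schema)}, got {sorted(args)}")
    for key, expected in schema.items():
        # Exact type match, so a JSON boolean is not accepted where an int is expected.
        if type(args[key]) is not expected:
            raise ValueError(f"argument {key!r} must be {expected.__name__}")
    return {"name": name, "arguments": args}


def run_single_shot(query: str, generate: Callable[[str], str]) -> dict[str, Any]:
    """One model call per query: build the prompt, generate, validate the result."""
    return validate_call(generate(build_prompt(query)))


def fake_generate(prompt: str) -> str:
    """Stand-in for the on-device model; returns a canned tool call."""
    return '{"name": "set_timer", "arguments": {"duration_seconds": 300, "label": "tea"}}'


# Time one end-to-end call; on a real device, replace fake_generate with the engine call.
start = time.perf_counter()
result = run_single_shot("set a 5 minute tea timer", fake_generate)
elapsed_ms = (time.perf_counter() - start) * 1000.0
print(result, f"{elapsed_ms:.3f} ms")
```

Validating before dispatch keeps a malformed or hallucinated call from reaching application code. On the target device, repeating the timed call many times and reporting a median gives a steadier number than a single run.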