# How to Build Hybrid Cloud+Edge Model Routing
Route LLM requests between local/edge models and cloud APIs based on complexity, with automatic fallback. Keeping simple queries on-device can substantially cut cloud spend and median latency, since most traffic never leaves the machine.
## The Problem
Running everything on cloud LLMs is expensive and slow. Running everything locally has quality limitations. The optimal strategy is hybrid routing:
- Simple queries (greetings, lookups, summarization) → local model (Ollama, WebLLM)
- Complex queries (multi-step reasoning, code generation) → cloud model (GPT-5.4, Claude)
- Fallback — if the local model fails or produces low-confidence output, fall back to cloud
Few reactive libraries provide this pattern out of the box, so developers end up hand-wiring if/else chains with no observability.
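For contrast, here is roughly what the hand-wired approach looks like (a sketch; `routeByHand` and its placeholder inference callbacks are illustrative, not part of any library):

```typescript
// Hand-wired routing: it works, but the routing policy, the fallback,
// and any metrics are tangled together, with no stream to observe.
type Complexity = "simple" | "moderate" | "complex";

interface Req {
  prompt: string;
  complexity: Complexity;
}

async function routeByHand(
  req: Req,
  localInfer: (r: Req) => Promise<string>,
  cloudInfer: (r: Req) => Promise<string>,
): Promise<{ text: string; source: "local" | "cloud" }> {
  if (req.complexity !== "complex") {
    try {
      return { text: await localInfer(req), source: "local" };
    } catch {
      // Silent fallback: no counters, no logs, no visibility.
      return { text: await cloudInfer(req), source: "cloud" };
    }
  }
  return { text: await cloudInfer(req), source: "cloud" };
}
```

Every new policy (confidence thresholds, cost caps, retries) grows this function; nothing outside it can see which branch a request took.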
## The Solution
callbag-recharge's `route()` splits the request stream by predicate. `rescue()` catches local model failures and falls back to cloud. Everything is observable via Inspector.
```typescript
/**
 * Hybrid Cloud+Edge Model Routing
 *
 * Demonstrates: Confidence-based routing between a local/edge LLM
 * and a cloud LLM, with automatic fallback via rescue().
 * Simple queries stay on-device; only complex ones pay cloud cost
 * and cloud latency.
 */
import { pipe, state } from "callbag-recharge";
import { filter, fromPromise, merge, rescue, subscribe, switchMap } from "callbag-recharge/extra";
import { route } from "callbag-recharge/orchestrate";

// ── Types ────────────────────────────────────────────────────
interface LLMRequest {
  prompt: string;
  maxTokens: number;
  complexity: "simple" | "moderate" | "complex";
}

interface LLMResponse {
  text: string;
  model: string;
  latencyMs: number;
  cost: number;
}

// ── Simulated inference functions ────────────────────────────
function localInfer(req: LLMRequest): Promise<LLMResponse> {
  return Promise.resolve({
    text: `(local answer to: ${req.prompt.slice(0, 30)})`,
    model: "llama4",
    latencyMs: 120,
    cost: 0,
  });
}

function cloudInfer(req: LLMRequest): Promise<LLMResponse> {
  return Promise.resolve({
    text: `(cloud answer to: ${req.prompt.slice(0, 30)})`,
    model: "gpt-5.4-mini",
    latencyMs: 800,
    cost: 0.003,
  });
}

// ── Request source ───────────────────────────────────────────
const request = state<LLMRequest | null>(null, { name: "request" });

// ── Route based on complexity ────────────────────────────────
// route() splits the stream: simple/moderate → local, complex → cloud
const [localRoute, cloudRoute] = route(
  request,
  (req: LLMRequest | null) => req !== null && req.complexity !== "complex",
);

// ── Local model (Ollama / WebLLM) with cloud fallback ────────
const localResponse = pipe(
  localRoute,
  filter((req): req is LLMRequest => req != null),
  // switchMap runs local inference for each routed request
  switchMap((req) => fromPromise(localInfer(req))),
  // rescue() catches local model errors and falls back to cloud
  rescue(() =>
    fromPromise(
      Promise.resolve({
        text: "(cloud fallback response)",
        model: "gpt-5.4-mini",
        latencyMs: 800,
        cost: 0.003,
      }),
    ),
  ),
);

// ── Cloud model (OpenAI / Anthropic) ─────────────────────────
const cloudResponse = pipe(
  cloudRoute,
  filter((req): req is LLMRequest => req != null),
  switchMap((req) => fromPromise(cloudInfer(req))),
);

// ── Derived metrics ──────────────────────────────────────────
const routingStats = state(
  { localCount: 0, cloudCount: 0, totalCost: 0 },
  { name: "routingStats" },
);

// ── Merge all responses ──────────────────────────────────────
const allResponses = merge(localResponse, cloudResponse);

subscribe(allResponses, (resp) => {
  if (resp) {
    const source = resp.model === "llama4" ? "LOCAL" : "CLOUD";
    console.log(`[${source}] ${resp.model}: "${resp.text}" (${resp.latencyMs}ms, $${resp.cost})`);
    routingStats.update((s) => ({
      localCount: s.localCount + (resp.model === "llama4" ? 1 : 0),
      cloudCount: s.cloudCount + (resp.model !== "llama4" ? 1 : 0),
      totalCost: s.totalCost + resp.cost,
    }));
  }
});

// Simple question → routed to local model
request.set({ prompt: "What is 2+2?", maxTokens: 50, complexity: "simple" });

// Complex question → routed to cloud
request.set({
  prompt: "Analyze the geopolitical implications of...",
  maxTokens: 2000,
  complexity: "complex",
});

console.log("Stats:", routingStats.get());
```

## Why This Works
- `route(source, predicate)` — splits the stream into `[matching, notMatching]`. Simple/moderate queries go to local; complex go to cloud. No if/else chains.
- `rescue()` — wraps the local model pipeline. If inference fails (OOM, model not loaded, timeout), it automatically switches to the cloud fallback. Zero manual error handling.
- Observable routing stats — `routingStats` is a reactive store. Dashboards, logs, or alerting can subscribe to it.
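The pipeline above assumes each request arrives pre-labeled with a `complexity`. In practice something has to assign that label; a cheap heuristic classifier can be a pure function that works directly inside `route()`'s predicate. The keyword lists and thresholds below are illustrative assumptions, not tuned values:

```typescript
// Heuristic complexity classifier. The hint lists and length cutoffs
// are illustrative; tune them against your own traffic.
const COMPLEX_HINTS = /\b(analyze|prove|refactor|implement|compare|step[- ]by[- ]step)\b/i;
const SIMPLE_HINTS = /\b(hi|hello|thanks|what is|define|translate)\b/i;

function classifyComplexity(prompt: string): "simple" | "moderate" | "complex" {
  // Long prompts or reasoning-heavy verbs → cloud-worthy
  if (COMPLEX_HINTS.test(prompt) || prompt.length > 500) return "complex";
  // Short lookups and greetings → safe for the local model
  if (SIMPLE_HINTS.test(prompt) && prompt.length < 120) return "simple";
  return "moderate";
}
```

Because the classifier is pure and synchronous, it adds no latency to the routing decision.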
## Confidence-Based Routing
For more sophisticated routing, score the local model's output confidence:
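`estimateConfidence` is not provided by the library; one possible model-free sketch scores surface signals of uncertainty (the hedging-phrase list and penalty weights are illustrative assumptions):

```typescript
// Heuristic, model-free confidence score in [0, 1]. Real systems would
// prefer token log-probabilities or a verifier model when available.
const HEDGES = /\b(i'?m not sure|i think|possibly|might be|cannot determine)\b/i;

function estimateConfidence(resp: { text: string }): number {
  let score = 1.0;
  if (HEDGES.test(resp.text)) score -= 0.4;        // hedging language
  if (resp.text.trim().length < 10) score -= 0.3;  // suspiciously short answer
  if (/\?\s*$/.test(resp.text)) score -= 0.2;      // answer that ends in a question
  return Math.max(0, score);
}
```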
```typescript
// Carry the original request through so low-confidence responses can be
// retried against the cloud with the same parameters.
const localWithConfidence = pipe(
  localRoute,
  filter((req): req is LLMRequest => req != null),
  // producer() is assumed here to build a callbag source from callbacks
  switchMap((req) =>
    producer(({ emit, complete }) => {
      localInfer(req).then((result) => {
        emit({ ...result, originalRequest: req, confidence: estimateConfidence(result) });
        complete();
      });
    }),
  ),
);

// Re-route low-confidence responses to cloud
const [confident, uncertain] = route(
  localWithConfidence,
  (resp) => resp.confidence > 0.8,
);

// Uncertain responses get re-processed by cloud
const cloudRefined = pipe(
  uncertain,
  switchMap((resp) => fromPromise(cloudInfer(resp.originalRequest))),
);

// Merge confident local + cloud-refined responses
const finalResponses = merge(confident, cloudRefined, cloudResponse);
```

## Cost Tracking
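At its core the savings figure is plain arithmetic over the routing counts. As a standalone sanity check (the per-request price is illustrative, not real pricing):

```typescript
// Per-request cloud price in USD; illustrative only.
const CLOUD_PRICE = 0.003;

interface RoutingStats {
  localCount: number;
  cloudCount: number;
  totalCost: number;
}

// What the same traffic would have cost all-cloud, minus what
// hybrid routing actually spent.
function costSavings(stats: RoutingStats): number {
  const cloudOnlyCost = (stats.localCount + stats.cloudCount) * CLOUD_PRICE;
  return cloudOnlyCost - stats.totalCost;
}
```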
Track routing economics reactively:
```typescript
// derived() and effect() are assumed exports from callbag-recharge,
// alongside state() above.
const costPerModel = { "llama4": 0, "gpt-5.4-mini": 0.003, "claude-sonnet-4.6": 0.004 };

const totalCost = derived([routingStats], () => {
  return routingStats.get().totalCost;
});

const costSavings = derived([routingStats], () => {
  const stats = routingStats.get();
  // What the same traffic would have cost if every request went to the cloud
  const cloudOnlyCost = (stats.localCount + stats.cloudCount) * costPerModel["gpt-5.4-mini"];
  return cloudOnlyCost - stats.totalCost;
});

effect([costSavings], () => {
  console.log(`Saved $${costSavings.get().toFixed(4)} via hybrid routing`);
});
```

## See Also
- On-Device LLM Streaming — manage local model token streams
- Tool Calls for Local LLMs — reactive tool call lifecycle