Neutral Inference Routing & Pricing Protocol (Draft) — exploring a common discovery layer for model endpoints, inspired by gaps in existing infra.
Comments welcome
• Not a new model format or runtime
• Not replacing provider SDKs
• Not governance for model safety
• Not handling training or fine-tuning
Agents today navigate a maze of inference options: open-weight hosts, speed-optimized providers, edge servers, specialized model hubs, on-device runtimes, and multi-provider routers, each with its own API, pricing, and latency profile.
Result: agents hardcode endpoints, can't compare options, and miss cost savings from emerging providers.
inference.md explores what a neutral discovery layer might look like—where model endpoints become queryable through common patterns.
```
GET /v0/models?capability=code&max_latency=100ms&budget=0.001
```
Find the best model for your task across all providers, filtered by capability, latency, and price.
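A minimal client sketch for this query. The endpoint and query parameters come from the example above; the response shape (`models`, `id`, `provider`, `price_per_1k_tokens`) is an assumption about the draft schema, not a fixed contract:

```python
import requests

# Query the discovery endpoint shown above.
params = {
    "capability": "code",
    "max_latency": "100ms",
    "budget": 0.001,
}
resp = requests.get("https://inference.md/v0/models", params=params, timeout=10)
resp.raise_for_status()

# Field names below are illustrative, not a defined schema.
for model in resp.json()["models"]:
    print(model["id"], model["provider"], model["price_per_1k_tokens"])
```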
{ "endpoint": "/v0/infer", "request": { "messages": [{"role": "user", "content": "Analyze this image"}], "constraints": { "max_cost": 0.01, "max_latency_ms": 500, "required_capabilities": ["vision", "reasoning"], "min_quality_score": 0.9 } }, "response": { "routed_to": "mistral-large-2407", "fallback_chain": ["qwen-2.5-72b", "mixtral-8x22b"], "cost": "$0.0032", "latency_ms": 247, "reasoning": "Best vision+reasoning within constraints" } }
{ "endpoint": "/v0/prices/stream", "type": "websocket", "sample_message": { "timestamp": "2025-09-08T15:30:00Z", "spot_prices": { "70b_class": { "provider_a": 0.0009, "provider_b": 0.0012, "provider_c": 0.0008 }, "405b_class": { "together": 0.0035, "anyscale": 0.004, "replicate": 0.0038, "local_cluster_sf": 0.0008 }, "small_models": { "phi-3-mini": 0.00001, "gemma-2b": 0.00002, "tinyllama": 0.000008 } }, "surge_pricing": { "provider_a": 1.0, "provider_b": 2.1, "provider_c": 0.8 } } }
Note: surge_pricing values are multipliers on the spot price (2.1 = 2.1× the normal price due to demand).
All figures illustrative; see provider docs for current pricing and latency.
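To make the multiplier concrete, a small sketch applying the surge values from the sample message above to the 70b_class spot prices (pairing class prices with per-provider multipliers is an assumption about the draft):

```python
# Effective price = spot price x surge multiplier (see note above).
spot = {"provider_a": 0.0009, "provider_b": 0.0012, "provider_c": 0.0008}
surge = {"provider_a": 1.0, "provider_b": 2.1, "provider_c": 0.8}

effective = {p: spot[p] * surge[p] for p in spot}
# provider_b: 0.0012 * 2.1 = 0.00252; provider_c: 0.0008 * 0.8 = 0.00064
print(min(effective, key=effective.get))  # provider_c
```

Under surge, the nominally mid-priced provider_b becomes almost 4× more expensive than provider_c, which is exactly the kind of gap a routing layer could exploit.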
| Model Class | Example Provider | Price / 1M tokens | Latency | Key Differentiator |
|---|---|---|---|---|
| Large (70B+) | Together AI | $0.90-3.50 | 100-300 ms | Open-weight hosting |
| Speed-optimized | Groq | $0.05-0.70 | 15-50 ms | LPU architecture |
| Small models (1-7B) | Edge servers | $0.001-0.05 | 5-30 ms | Local-first; Phi/Gemma |
| Specialized | Replicate | $0.10-5.00 | 100-2000 ms | 10,000+ models |
| Mobile/Embedded | ONNX Runtime | $0.0001-0.001 | 10-100 ms | On-device; TinyLlama |
| Multi-provider | OpenRouter | Market rates | Varies | Routing layer |
```python
# Agent needs code generation within budget
import requests

response = requests.post(
    "https://inference.md/v0/infer",
    json={
        "task": "generate_sql_query",
        "constraints": {
            "max_cost_per_request": 0.001,
            "max_latency_ms": 500,
            "min_quality_score": 0.7,
        },
    },
).json()

print(f"Routed to: {response['model']}")          # codegen-16b
print(f"Provider: {response['provider']}")        # local_vllm_cluster
print(f"Cost: {response['cost']}")                # $0.0008
print(f"Alternative: {response['alternative']}")  # mixtral-8x7b @ $0.0012
```
Automatic fallback if primary fails.
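One way a client might walk that chain, using the routed_to and fallback_chain fields from the routing response above. The per-model endpoint and the infer_once helper are hypothetical, sketched only to show the retry order:

```python
import requests

def infer_once(model: str, payload: dict) -> dict:
    # Hypothetical per-model endpoint; the draft only defines /v0/infer.
    r = requests.post(f"https://inference.md/v0/infer/{model}", json=payload, timeout=30)
    r.raise_for_status()
    return r.json()

def infer_with_fallback(routing: dict, payload: dict) -> dict:
    # Try the routed model first, then each fallback_chain entry in order.
    chain = [routing["routed_to"]] + routing.get("fallback_chain", [])
    last_error = None
    for model in chain:
        try:
            return infer_once(model, payload)
        except requests.RequestException as err:
            last_error = err  # provider down or over capacity; try the next one
    raise RuntimeError(f"all providers failed: {chain}") from last_error
```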
```python
# Track price changes across providers
import asyncio
import json
import websockets  # assumed client library for the wss stream

async def monitor_prices():
    async with websockets.connect("wss://inference.md/v0/prices/stream") as stream:
        async for raw in stream:
            update = json.loads(raw)
            # Compare per-provider spot prices within one class (see sample message)
            prices = update["spot_prices"]["70b_class"]
            cheapest = min(prices.items(), key=lambda x: x[1])
            priciest = max(prices.items(), key=lambda x: x[1])
            if priciest[1] > cheapest[1] * 1.5:
                print(f"Price gap: {cheapest[0]} @ ${cheapest[1]}")
                print(f"vs {priciest[0]} @ ${priciest[1]}")

asyncio.run(monitor_prices())
```
Could route traffic or alert users on significant price differences.
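As a sketch of the routing side, detected gaps could shift traffic toward cheaper providers, for example by weighting each one inversely to its price. The weighting scheme below is an assumption, not part of the draft:

```python
def route_weights(prices: dict[str, float]) -> dict[str, float]:
    # Weight each provider inversely to its price, normalized to sum to 1.
    inverse = {p: 1.0 / cost for p, cost in prices.items()}
    total = sum(inverse.values())
    return {p: w / total for p, w in inverse.items()}

# With the sample 70b_class spot prices, provider_c gets the largest share.
print(route_weights({"provider_a": 0.0009, "provider_b": 0.0012, "provider_c": 0.0008}))
```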
Today's agents hardcode endpoints or rely on single providers. As inference becomes commoditized, the real value might be in efficient routing and discovery.
Potential benefits of a neutral inference.md layer:
• Discovery: query models by capability, latency, and price instead of hardcoding endpoints
• Cost efficiency: live price comparison surfaces savings from emerging providers
• Resilience: fallback chains keep agents running when a single provider fails
© 2025 inference.md contributors · MIT License · Exploratory sketch for feedback
All provider and model names are illustrative; no affiliation or endorsement implied.