INFERENCE.md

Neutral Inference Routing & Pricing Protocol (Draft) — exploring a common discovery layer for model endpoints, inspired by gaps in existing infra.

Part of the protocols.md network
🧠 Early exploration — What would neutral inference routing look like? This is a sketch for feedback, not production code. Comments welcome

Executive Summary

  • What it is: A proposed protocol for discovering and comparing AI inference endpoints across providers with transparent pricing
  • Who it's for: Developers building agents, infrastructure teams managing inference costs, and providers wanting broader distribution
  • What it's not: Not a new model format, not replacing existing APIs, not handling model training or safety governance
  • Status: Early draft exploring feasibility. Seeking feedback from the community before any implementation

Non-Goals

  • Not a new model format or runtime
  • Not replacing provider SDKs
  • Not governance for model safety
  • Not handling training or fine-tuning

Problem

Agents today navigate a maze of inference options:

  • Major providers each have proprietary APIs, rate limits, and pricing tiers
  • Open-weight models (70B- and 405B-parameter classes) are hosted by dozens of providers with 10x price variance
  • Small language models (Phi-3, Gemma-2B, TinyLlama) run on edge but lack discovery
  • Specialized models (CodeGen-16B, BioGPT, ChemBERTa) need different endpoints
  • Local clusters running vLLM or TGI have no standard registry
  • Mobile/embedded inference (ONNX models on phones) remains siloed

Result: agents hardcode endpoints, can't compare options, and miss cost savings from emerging providers.

Approach – Inference Discovery Layer

GET https://inference.md/v0/discover

inference.md explores what a neutral discovery layer might look like—where model endpoints become queryable through common patterns.

Core APIs

Model Discovery

GET /v0/models?capability=code&max_latency=100ms&budget=0.001

Find the best model for your task across all providers, filtered by capability, latency, and price.
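A minimal client sketch for this query (the response shape, a ranked "models" list with price and latency fields, is an assumption of this sketch, not a settled schema):

# Query the proposed discovery endpoint for code-capable models.
# Response field names ("models", "price_per_1k_tokens") are
# illustrative assumptions, not part of any defined schema.
import requests

resp = requests.get(
    "https://inference.md/v0/models",
    params={"capability": "code", "max_latency": "100ms", "budget": 0.001},
    timeout=10,
)
resp.raise_for_status()

for model in resp.json().get("models", []):
    print(model["id"], model["provider"], model["price_per_1k_tokens"])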

Unified Routing (Concept)

{
  "endpoint": "/v0/infer",
  "request": {
    "messages": [{"role": "user", "content": "Analyze this image"}],
    "constraints": {
      "max_cost": 0.01,
      "max_latency_ms": 500,
      "required_capabilities": ["vision", "reasoning"],
      "min_quality_score": 0.9
    }
  },
  "response": {
    "routed_to": "mistral-large-2407",
    "fallback_chain": ["qwen-2.5-72b", "mixtral-8x22b"],
    "cost": "$0.0032",
    "latency_ms": 247,
    "reasoning": "Best vision+reasoning within constraints"
  }
}
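The "reasoning" field hints at a filter-then-rank step on the router's side. One plausible sketch of that selection logic (the candidate metadata fields and the cheapest-feasible scoring rule are assumptions of this sketch, not a defined algorithm):

# Pick the cheapest candidate satisfying all hard constraints,
# returning a primary model plus a short fallback chain.
# Candidate fields ("cost", "capabilities", ...) are illustrative.
def route(candidates, constraints):
    feasible = [
        c for c in candidates
        if c["cost"] <= constraints["max_cost"]
        and c["latency_ms"] <= constraints["max_latency_ms"]
        and set(constraints["required_capabilities"]) <= set(c["capabilities"])
        and c["quality_score"] >= constraints["min_quality_score"]
    ]
    if not feasible:
        raise ValueError("no model satisfies the constraints")
    ranked = sorted(feasible, key=lambda c: c["cost"])
    return ranked[0], ranked[1:4]  # primary, fallback_chain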

Real-Time Pricing Feed

{
  "endpoint": "/v0/prices/stream",
  "type": "websocket",
  "sample_message": {
    "timestamp": "2025-09-08T15:30:00Z",
    "spot_prices": {
      "70b_class": {
        "provider_a": 0.0009,
        "provider_b": 0.0012,
        "provider_c": 0.0008
      },
      "405b_class": {
        "together": 0.0035,
        "anyscale": 0.004,
        "replicate": 0.0038,
        "local_cluster_sf": 0.0008
      },
      "small_models": {
        "phi-3-mini": 0.00001,
        "gemma-2b": 0.00002,
        "tinyllama": 0.000008
      }
    },
    "surge_pricing": {
      "provider_a": 1.0,
      "provider_b": 2.1,
      "provider_c": 0.8
    }
  }
}

Note: Surge pricing shows multipliers (2.1 = 2.1× normal price due to demand)
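As a worked example of how a client might combine the two feeds (assuming, as this sketch does, that surge multipliers are keyed by the same provider names as the spot prices):

# Apply surge multipliers to base spot prices for the 70B class.
# Values mirror the sample message above; rounding keeps the
# float output readable.
spot_prices = {"provider_a": 0.0009, "provider_b": 0.0012, "provider_c": 0.0008}
surge = {"provider_a": 1.0, "provider_b": 2.1, "provider_c": 0.8}

effective = {p: round(base * surge[p], 6) for p, base in spot_prices.items()}
print(effective)
# {'provider_a': 0.0009, 'provider_b': 0.00252, 'provider_c': 0.00064}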

Inference Landscape

All figures illustrative; see provider docs for current pricing and latency.

| Model Class | Example Provider | Price/1M tokens | Latency | Key Differentiator |
|---|---|---|---|---|
| Large (70B+) | Together AI | $0.90-3.50 | 100-300ms | Open-weight hosting |
| Speed-optimized | Groq | $0.05-0.70 | 15-50ms | LPU architecture |
| Small Models (1-7B) | Edge servers | $0.001-0.05 | 5-30ms | Local-first, Phi/Gemma |
| Specialized | Replicate | $0.10-5.00 | 100-2000ms | 10,000+ models |
| Mobile/Embedded | ONNX Runtime | $0.0001-0.001 | 10-100ms | On-device, TinyLlama |
| Multi-provider | OpenRouter | Market rates | Varies | Routing layer |

Use Cases

Cost-Aware Routing

# Agent needs code generation within budget
import requests

response = requests.post(
    'https://inference.md/v0/infer',
    json={
        "task": "generate_sql_query",
        "constraints": {
            "max_cost_per_request": 0.001,
            "max_latency_ms": 500,
            "min_quality_score": 0.7
        }
    },
).json()

print(f"Routed to: {response['model']}")          # codegen-16b
print(f"Provider: {response['provider']}")        # local_vllm_cluster
print(f"Cost: ${response['cost']}")               # $0.0008
print(f"Alternative: {response['alternative']}")  # mixtral-8x7b @ $0.0012

Automatic fallback if primary fails.
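If the router cannot retry for you, a client could walk the `fallback_chain` returned by a previous routing response. A minimal sketch (the `model` pin parameter is an assumption of this sketch, not part of any draft schema):

# Try the primary model, then each entry in the fallback chain.
# The "model" override parameter is hypothetical.
import requests

def infer_with_fallback(payload, primary, fallback_chain):
    for model in [primary, *fallback_chain]:
        try:
            r = requests.post(
                "https://inference.md/v0/infer",
                json={**payload, "model": model},
                timeout=30,
            )
            r.raise_for_status()
            return r.json()
        except requests.RequestException:
            continue  # move on to the next model in the chain
    raise RuntimeError("all models in the fallback chain failed")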

Price Monitoring

# Track price changes across providers
import asyncio
import json
import websockets  # third-party: pip install websockets

async def monitor_prices():
    async with websockets.connect('wss://inference.md/v0/prices/stream') as stream:
        async for message in stream:
            update = json.loads(message)
            # Per-provider rates, keyed by model class as in the sample feed
            prices = update['spot_prices']['70b_class']

            cheapest = min(prices.items(), key=lambda x: x[1])
            expensive = max(prices.items(), key=lambda x: x[1])

            if expensive[1] > cheapest[1] * 1.5:
                print(f"Price gap: {cheapest[0]} @ ${cheapest[1]}")
                print(f"vs {expensive[0]} @ ${expensive[1]}")

asyncio.run(monitor_prices())

Could route traffic or alert users on significant price differences.

Why This Could Matter

Today's agents hardcode endpoints or rely on single providers. As inference becomes commoditized, the real value might be in efficient routing and discovery.

Potential benefits of a neutral inference.md layer:

  • Reduce costs by 50–90% through provider competition
  • Improve reliability with automatic fallbacks
  • Enable edge inference discovery (local models, mobile devices)
  • Simplify integration — one interface instead of dozens
  • Create price transparency across the fragmented market

spec_version: 0.1.0-exploration
published: 2025-09-08T16:45:23-07:00
status: thought_experiment
feedback: inference@protocols.md
inspired_by: OpenRouter, vLLM, SkyPilot limitations


© 2025 inference.md contributors · MIT License · Exploratory sketch for feedback

All provider and model names are illustrative; no affiliation or endorsement implied.