INFERENCE.md

Neutral Inference Routing & Pricing Protocol (Draft) — exploring a common discovery layer for model endpoints, inspired by gaps in existing infra.

Part of the protocols.md network
🧠 Early exploration — What would neutral inference routing look like? This is a sketch for feedback, not production code. Comments welcome

Executive Summary

  • What it is: A proposed protocol for discovering and comparing AI inference endpoints across providers with transparent pricing
  • Who it's for: Developers building agents, infrastructure teams managing inference costs, and providers wanting broader distribution
  • What it's not: Not a new model format, not replacing existing APIs, not handling model training or safety governance
  • Status: Early draft exploring feasibility. Seeking feedback from the community before any implementation

Non-Goals

  • Not a new model format or runtime
  • Not replacing provider SDKs
  • Not governance for model safety
  • Not handling training or fine-tuning

Problem

Agents today navigate a maze of inference options:

  • Major providers each have proprietary APIs, rate limits, and pricing tiers
  • Open-weight models (70B- and 405B-parameter classes) are hosted by dozens of providers with 10x price variance
  • Small language models (Phi-3, Gemma-2B, TinyLlama) run on edge but lack discovery
  • Specialized models (CodeGen-16B, BioGPT, ChemBERTa) need different endpoints
  • Local clusters running vLLM or TGI have no standard registry
  • Mobile/embedded inference (ONNX models on phones) remains siloed

Result: agents hardcode endpoints, can't compare options, and miss cost savings from emerging providers.

Approach – Inference Discovery Layer

GET https://inference.md/v0/discover

inference.md explores what a neutral discovery layer might look like—where model endpoints become queryable through common patterns.

Core APIs

Model Discovery

GET /v0/models?capability=code&max_latency=100ms&budget=0.001

Find the best model for your task across all providers, filtered by capability, latency, and price.
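A minimal client sketch for this query (the response shape, a ranked "models" list with price and latency fields, is an assumption of this sketch, not a settled schema):

# Query the proposed discovery endpoint for code-capable models.
# Response field names ("models", "price_per_1k_tokens") are
# illustrative assumptions, not part of any defined schema.
import requests

resp = requests.get(
    "https://inference.md/v0/models",
    params={"capability": "code", "max_latency": "100ms", "budget": 0.001},
    timeout=10,
)
resp.raise_for_status()

for model in resp.json().get("models", []):
    print(model["id"], model["provider"], model["price_per_1k_tokens"])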

Unified Routing (Concept)

{
  "endpoint": "/v0/infer",
  "request": {
    "messages": [{"role": "user", "content": "Analyze this image"}],
    "constraints": {
      "max_cost": 0.01,
      "max_latency_ms": 500,
      "required_capabilities": ["vision", "reasoning"],
      "min_quality_score": 0.9
    }
  },
  "response": {
    "routed_to": "mistral-large-2407",
    "fallback_chain": ["qwen-2.5-72b", "mixtral-8x22b"],
    "cost": "$0.0032",
    "latency_ms": 247,
    "reasoning": "Best vision+reasoning within constraints"
  }
}
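The "reasoning" field hints at a filter-then-rank step on the router's side. One plausible sketch of that selection logic (the candidate metadata fields and the cheapest-feasible scoring rule are assumptions of this sketch, not a defined algorithm):

# Pick the cheapest candidate satisfying all hard constraints,
# returning a primary model plus a short fallback chain.
# Candidate fields ("cost", "capabilities", ...) are illustrative.
def route(candidates, constraints):
    feasible = [
        c for c in candidates
        if c["cost"] <= constraints["max_cost"]
        and c["latency_ms"] <= constraints["max_latency_ms"]
        and set(constraints["required_capabilities"]) <= set(c["capabilities"])
        and c["quality_score"] >= constraints["min_quality_score"]
    ]
    if not feasible:
        raise ValueError("no model satisfies the constraints")
    ranked = sorted(feasible, key=lambda c: c["cost"])
    return ranked[0], ranked[1:4]  # primary, fallback_chain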

Real-Time Pricing Feed

{
  "endpoint": "/v0/prices/stream",
  "type": "websocket",
  "sample_message": {
    "timestamp": "2025-09-08T15:30:00Z",
    "spot_prices": {
      "70b_class": {
        "provider_a": 0.0009,
        "provider_b": 0.0012,
        "provider_c": 0.0008
      },
      "405b_class": {
        "together": 0.0035,
        "anyscale": 0.004,
        "replicate": 0.0038,
        "local_cluster_sf": 0.0008
      },
      "small_models": {
        "phi-3-mini": 0.00001,
        "gemma-2b": 0.00002,
        "tinyllama": 0.000008
      }
    },
    "surge_pricing": {
      "provider_a": 1.0,
      "provider_b": 2.1,
      "provider_c": 0.8
    }
  }
}

Note: Surge pricing shows multipliers (2.1 = 2.1× normal price due to demand)
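As a worked example of how a client might combine the two feeds (assuming, as this sketch does, that surge multipliers are keyed by the same provider names as the spot prices):

# Apply surge multipliers to base spot prices for the 70B class.
# Values mirror the sample message above; rounding keeps the
# float output readable.
spot_prices = {"provider_a": 0.0009, "provider_b": 0.0012, "provider_c": 0.0008}
surge = {"provider_a": 1.0, "provider_b": 2.1, "provider_c": 0.8}

effective = {p: round(base * surge[p], 6) for p, base in spot_prices.items()}
print(effective)
# {'provider_a': 0.0009, 'provider_b': 0.00252, 'provider_c': 0.00064}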

Inference Landscape

All figures illustrative; see provider docs for current pricing and latency.

| Model Class | Example Provider | Price/1M tokens | Latency | Key Differentiator |
|---|---|---|---|---|
| Large (70B+) | Together AI | $0.90-3.50 | 100-300ms | Open-weight hosting |
| Speed-optimized | Groq | $0.05-0.70 | 15-50ms | LPU architecture |
| Small Models (1-7B) | Edge servers | $0.001-0.05 | 5-30ms | Local-first, Phi/Gemma |
| Specialized | Replicate | $0.10-5.00 | 100-2000ms | 10,000+ models |
| Mobile/Embedded | ONNX Runtime | $0.0001-0.001 | 10-100ms | On-device, TinyLlama |
| Multi-provider | OpenRouter | Market rates | Varies | Routing layer |

Use Cases

Cost-Aware Routing

# Agent needs code generation within budget
import requests

response = requests.post(
    'https://inference.md/v0/infer',
    json={
        "task": "generate_sql_query",
        "constraints": {
            "max_cost_per_request": 0.001,
            "max_latency_ms": 500,
            "min_quality_score": 0.7
        }
    },
).json()

print(f"Routed to: {response['model']}")          # codegen-16b
print(f"Provider: {response['provider']}")        # local_vllm_cluster
print(f"Cost: ${response['cost']}")               # $0.0008
print(f"Alternative: {response['alternative']}")  # mixtral-8x7b @ $0.0012

Automatic fallback if primary fails.
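If the router cannot retry for you, a client could walk the `fallback_chain` returned by a previous routing response. A minimal sketch (the `model` pin parameter is an assumption of this sketch, not part of any draft schema):

# Try the primary model, then each entry in the fallback chain.
# The "model" override parameter is hypothetical.
import requests

def infer_with_fallback(payload, primary, fallback_chain):
    for model in [primary, *fallback_chain]:
        try:
            r = requests.post(
                "https://inference.md/v0/infer",
                json={**payload, "model": model},
                timeout=30,
            )
            r.raise_for_status()
            return r.json()
        except requests.RequestException:
            continue  # move on to the next model in the chain
    raise RuntimeError("all models in the fallback chain failed")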

Price Monitoring

# Track price changes across providers
import asyncio
import json
import websockets  # third-party: pip install websockets

async def monitor_prices():
    async with websockets.connect('wss://inference.md/v0/prices/stream') as stream:
        async for message in stream:
            update = json.loads(message)
            # Per-provider rates, keyed by model class as in the sample feed
            prices = update['spot_prices']['70b_class']

            cheapest = min(prices.items(), key=lambda x: x[1])
            expensive = max(prices.items(), key=lambda x: x[1])

            if expensive[1] > cheapest[1] * 1.5:
                print(f"Price gap: {cheapest[0]} @ ${cheapest[1]}")
                print(f"vs {expensive[0]} @ ${expensive[1]}")

asyncio.run(monitor_prices())

Could route traffic or alert users on significant price differences.

Why This Could Matter

Today's agents hardcode endpoints or rely on single providers. As inference becomes commoditized, the real value might be in efficient routing and discovery.

Potential benefits of a neutral inference.md layer:

  • Reduce costs by 50–90% through provider competition
  • Improve reliability with automatic fallbacks
  • Enable edge inference discovery (local models, mobile devices)
  • Simplify integration — one interface instead of dozens
  • Create price transparency across the fragmented market

spec_version: 0.1.0-exploration
published: 2025-09-08T16:45:23-07:00
status: thought_experiment
feedback: inference@protocols.md
inspired_by: OpenRouter, vLLM, SkyPilot limitations


© 2025 inference.md contributors · MIT License · Exploratory sketch for feedback

All provider and model names are illustrative; no affiliation or endorsement implied.