How the Inference Index (IFX) Is Calculated

Methodology v1.1 — last updated June 19, 2026

Changelog

v1.1 (2026-06-19) — Reconstruction + price-relative chain

Re-anchored to 2024-04-01 = 100 (was 2026-06-18 = 100 in v1.0). A real, sourced historical reconstruction now extends the index back ~24 months.
Construction changed from dollar-weighted Laspeyres to chained Laspeyres in PRICE RELATIVES. Each monthly (or daily, live) period link is the weighted mean of price changes on the matched sample, not the ratio of weighted dollar baskets. The 50/50 toy case where one constituent cuts price by 50% now produces a −25% index move (was −8% under dollar-weighting). Weights now govern influence, not dollar contribution.
5-era time-varying basket introduced for the reconstruction. Each era has its own representative constituent set and tier weights reflecting the leading models of that time. The index chain-links at every composition change (era boundary or mid-era launch/exit); only the matched sample contributes to the link, so entries and exits cause zero published level move.
Sourced data only, no interpolation. Each historical price point carries { tier, type, url, observedDate } provenance with a 4-tier trust system (1 = launch post, 2 = Wayback Machine, 3 = aggregator/catalog, 4 = host-cited). Models with un-sourced prices are excluded from their era's basket for the affected months.
Reconstruction is config-derived (src/config/reconstruction.ts); live append-only data (data/history.json) is unchanged. The composed series prepends reconstructed monthly points to live daily points with a one-time handover splice computed from the matched-sample link between the reconstruction tail basket and the live basket.

v1.0 — Original launch. 29-model live basket, dollar-weighted Laspeyres, 2026-06-18 base.

Methodology v1.0 archive — superseded by v1.1

What IFX measures

The Inference Index (IFX) tracks the cost of running large language models over time. It answers one question in a single number: is it getting cheaper or more expensive to do AI inference?

Each day we collect the published API prices of a fixed basket of leading language models, blend them into one figure, and express that figure relative to a starting value of 100. If IFX reads 92, inference is 8% cheaper than it was at the index's inception. It works the same way a stock index turns thousands of individual share prices into one readable market signal.

IFX is a price index, not a quality ranking. It does not tell you which model is best. It tells you what the market is charging.

The basket

IFX is computed from a curated basket of 29 language models — chosen to be representative, not exhaustive. An index reflects the shape of a market; it does not list every participant, just as the S&P 500 is not the 500 "best" companies but a representative slice of the market.

The basket is organized into four tiers, and the tiers are weighted to approximate where real spending goes:

Frontier (35%) — top general-purpose and reasoning flagships.
Balanced workhorse (35%) — the mainstream production models most teams actually deploy.
Budget (20%) — cost-optimized tiers for high-volume and latency-sensitive work.
Open-hosted (10%) — open-weight models served through commercial APIs.

Within each tier, models are weighted equally. This is deliberate: the 35/35/20/10 split across tiers is our editorial judgment about where spend concentrates, but we don't pretend to know the precise share of one model versus another inside a tier.
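The equal-weight-within-tier rule can be sketched as follows. This is a minimal illustration, not the production code: the `BasketModel` shape, the model ids, and the function name are assumptions made for the example; only the 35/35/20/10 tier split comes from the methodology.

```typescript
// Tier weights as published in the methodology (35/35/20/10).
type Tier = "frontier" | "balanced" | "budget" | "openHosted";

const TIER_WEIGHTS: Record<Tier, number> = {
  frontier: 0.35,
  balanced: 0.35,
  budget: 0.2,
  openHosted: 0.1,
};

// Hypothetical shape for a basket entry; only `tier` matters here.
interface BasketModel {
  id: string;
  tier: Tier;
}

// Per-model weight = tier weight / number of basket models in that tier.
function modelWeights(basket: BasketModel[]): Record<string, number> {
  const counts: Record<Tier, number> = { frontier: 0, balanced: 0, budget: 0, openHosted: 0 };
  for (const m of basket) counts[m.tier] += 1;

  const weights: Record<string, number> = {};
  for (const m of basket) weights[m.id] = TIER_WEIGHTS[m.tier] / counts[m.tier];
  return weights;
}

// Illustrative basket with made-up ids: 2 frontier, 1 balanced, 4 budget, 1 open-hosted.
const exampleBasket: BasketModel[] = [
  { id: "frontier-a", tier: "frontier" },
  { id: "frontier-b", tier: "frontier" },
  { id: "balanced-a", tier: "balanced" },
  { id: "budget-a", tier: "budget" },
  { id: "budget-b", tier: "budget" },
  { id: "budget-c", tier: "budget" },
  { id: "budget-d", tier: "budget" },
  { id: "open-a", tier: "openHosted" },
];
const exampleWeights = modelWeights(exampleBasket);
```

In this made-up basket each frontier model carries 17.5% while each budget model carries 5%, so a provider with several models in one tier accumulates weight through its lineup. Note that if a tier were empty, the weights in this sketch would no longer sum to 1.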

Two consequences are intentional and worth stating plainly. First, providers with more models in the basket carry more total weight — a large lineup is treated as a rough proxy for that provider's share of overall spending. Second, open-weight models are weighted by their share of paid API revenue, not by their real-world deployment volume, which is far larger; a price index follows the money, not the install base.

How the daily number is built

For each model we take its published price per million input tokens and per million output tokens. We combine them into a single blended cost per million tokens using a fixed 3:1 input-to-output ratio — a standard assumption that a typical workload reads about three times as much as it writes. Holding that ratio constant means the index moves only when prices move, never because the assumed workload changed.
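As a rough sketch, the 3:1 blend can be written as a mix-weighted mean of the two list prices. The function and constant names are illustrative, and treating the blend as a simple weighted mean is an assumption consistent with the description above rather than a confirmed implementation detail.

```typescript
// Fixed 3:1 input-to-output token mix, as stated in the methodology.
const INPUT_PARTS = 3;
const OUTPUT_PARTS = 1;

// Blended USD per 1M tokens, taken here as the mix-weighted mean of the
// input and output list prices (an assumption about the exact formula).
function blendedCostPerMTok(inputUsdPerM: number, outputUsdPerM: number): number {
  return (
    (INPUT_PARTS * inputUsdPerM + OUTPUT_PARTS * outputUsdPerM) /
    (INPUT_PARTS + OUTPUT_PARTS)
  );
}

// Hypothetical model priced at $2.00 input / $10.00 output per 1M tokens.
const exampleBlended = blendedCostPerMTok(2, 10); // (3*2 + 1*10) / 4 = 4
```

Because the mix is a constant, two days with identical list prices always produce the same blended cost, which is the property the fixed ratio is meant to guarantee.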

We then take the tier-weighted average of every model's blended cost to get the day's raw basket cost. Under v1.0, the published value was that cost divided by the basket cost on the index's inception date, times 100, with inception set to 100. From v1.1 onward, the published level is instead chained from weighted price relatives, as described under "Price-relative chained Laspeyres" below.
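The basket-ratio normalization (the dollar-weighted form that the changelog attributes to v1.0) can be sketched like this. The `WeightedPrice` shape and all prices are hypothetical, chosen only to make the arithmetic easy to follow.

```typescript
interface WeightedPrice {
  weight: number;         // per-model weight; weights are assumed to sum to 1
  blendedUsdPerM: number; // blended list price, USD per 1M tokens
}

// Weighted average blended cost of the basket, in dollars.
function basketCost(models: WeightedPrice[]): number {
  let total = 0;
  for (const m of models) total += m.weight * m.blendedUsdPerM;
  return total;
}

// Index level = today's basket cost relative to the inception basket cost, times 100.
function dollarWeightedIndex(today: WeightedPrice[], inception: WeightedPrice[]): number {
  const base = basketCost(inception);
  if (base <= 0) throw new Error("inception basket cost must be positive");
  return (basketCost(today) / base) * 100;
}

// Hypothetical 50/50 basket; the prices are invented for illustration.
const inceptionBasket: WeightedPrice[] = [
  { weight: 0.5, blendedUsdPerM: 1.6 },
  { weight: 0.5, blendedUsdPerM: 8.4 },
];
// The cheaper model halves its price; the expensive one holds.
const todayBasket: WeightedPrice[] = [
  { weight: 0.5, blendedUsdPerM: 0.8 },
  { weight: 0.5, blendedUsdPerM: 8.4 },
];
const exampleIndex = dollarWeightedIndex(todayBasket, inceptionBasket); // about 92
```

With these invented prices, a 50% cut on the cheaper model moves the index by about 8%, while the same cut on the expensive model would move it by about 42%. That sensitivity to dollar price levels is what the price-relative construction in v1.1 removes.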

What IFX deliberately does not capture

A trustworthy index is honest about its edges. IFX intentionally excludes:

Discounts. Prompt caching, batch processing, and volume commitments can cut real costs by 50–90%. IFX tracks standard list prices, so it reflects the headline rate, not what a heavy optimizer actually pays.
Reasoning token consumption. Reasoning models often have low per-token prices but burn far more tokens per task. A price index measures the price of a token, not how many tokens a model spends — so a "cheap" reasoning model can still be expensive to run.
Non-token fees. Some products bundle per-request charges (for example, web-search fees). IFX measures token pricing only; models whose true cost lives mostly outside token pricing are excluded from the basket.
Other modalities. IFX v1 covers text/LLM pricing only. Image, video, and audio generation are priced in entirely different units ($/image, $/second, $/minute) and cannot be meaningfully blended into a token index. Dedicated companion indices for those modalities are planned but separate.

Data sourcing and integrity

Prices are collected daily from each provider's official pricing pages and, for the broader tracked universe, from aggregated pricing feeds. Two safeguards protect the number:

History is append-only. Each day's value is recorded and never retroactively rewritten, so the historical series is auditable.
New entries are human-reviewed. Automated collection flags price changes and newly launched models, but a change only enters the index after a person confirms it. The index will never silently publish a figure that hasn't been verified.

The reconstruction (v1.1)

IFX is anchored to 2024-04-01 = 100. The two years before live-collection began are reconstructed from sourced public pricing.

Five eras

The reconstruction divides the period into five eras, each with its own representative basket reflecting the leading models of that time:

| Era | Window | Defining inflection |
|---|---|---|
| A · Modern frontier | 2024-04 → 2024-09 | GPT-4 Turbo, Claude 3 family, Gemini 1.5; the first era where the modern "frontier vs budget" tier structure became legible |
| B · Reasoning emerges | 2024-10 → 2025-03 | o1, Claude 3.5 Sonnet (new), DeepSeek V3 + R1 establish reasoning as a separate price tier |
| C1 · GPT-5 era opening | 2025-04 → 2025-09 | Claude Opus 4 / Sonnet 4, Gemini 2.5 Pro / Flash, o3, GPT-4.1 |
| C2 · GPT-5 era closing | 2025-10 → 2026-03 | GPT-5 family, Claude Opus 4.5 / Sonnet 4.5 / Haiku 4.5, Gemini 3 family, broad open-weight MoE generation |
| D · Current | 2026-04 → today | The live 29-model basket; daily collection begins 2026-06-18 |

Within each era, tier weights are the same 35% / 35% / 20% / 10% split (frontier / balanced workhorse / budget / open-hosted), with equal weights inside each tier.

Price sources and trust tiers

Every reconstructed price point carries a source URL and a trust tier:

1. Launch post: provider's official launch announcement (most reliable).
2. Wayback Machine: archived snapshot of the provider's pricing page on the relevant date.
3. Aggregator / catalog: Artificial Analysis model pages, third-party catalog snapshots.
4. Host-cited: Together, Fireworks, OpenRouter pricing tables (least direct).

Promotional or launch-discount pricing is treated as transient and excluded — the reconstruction uses the standard list price across each model's lifespan, not the promo rate. (E.g., DeepSeek V3 launched at a promotional $0.14/$0.28 that ended Feb 2025, reverting to the announced standard $0.27/$1.10; the standard rate is used throughout.)
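A sourced price point might be typed roughly as below. The provenance fields `tier`, `type`, `url`, and `observedDate` are the ones named in the changelog; everything else (the interface names, the other fields, the `type` label values, and the placeholder URL) is a hypothetical illustration and not the schema in src/config/reconstruction.ts.

```typescript
// Trust tiers: 1 = launch post, 2 = Wayback Machine, 3 = aggregator/catalog, 4 = host-cited.
type TrustTier = 1 | 2 | 3 | 4;

interface PriceProvenance {
  tier: TrustTier;
  type: string;         // label values are not specified in the methodology
  url: string;
  observedDate: string; // ISO date string
}

// Hypothetical shape for one reconstructed monthly price.
interface HistoricalPricePoint {
  modelId: string;
  month: string; // "YYYY-MM"
  inputUsdPerM: number;
  outputUsdPerM: number;
  provenance?: PriceProvenance; // absent when no source was found
}

// "Sourced data only": keep a point only if it carries a source URL.
function sourcedOnly(points: HistoricalPricePoint[]): HistoricalPricePoint[] {
  return points.filter((p) => p.provenance !== undefined && p.provenance.url !== "");
}

// Invented example data; the URL is a placeholder, not a real source.
const examplePoints: HistoricalPricePoint[] = [
  {
    modelId: "model-x",
    month: "2025-03",
    inputUsdPerM: 1,
    outputUsdPerM: 4,
    provenance: { tier: 1, type: "launch-post", url: "https://example.com/launch", observedDate: "2025-03-01" },
  },
  { modelId: "model-y", month: "2025-03", inputUsdPerM: 2, outputUsdPerM: 8 },
];
const usablePoints = sourcedOnly(examplePoints);
```

Here `model-y` has a price but no provenance, so it would be dropped from the basket for that month instead of being interpolated.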

Price-relative chained Laspeyres

The new construction (v1.1 onward):

For each step t-1 → t (monthly in the reconstruction, daily in live):

1. Matched sample M = constituents present and priced (not gapped, not pre-launch, not exited) in BOTH periods.
2. Renormalize the per-model weights (tier_weight / models_in_tier) over M: w'ᵢ = wᵢ / Σⱼ∈M wⱼ.
3. Period link L_t = Σᵢ∈M w'ᵢ × (Pᵢ,t / Pᵢ,t-1).
4. index_t = index_{t-1} × L_t, with index₀ = 100 at 2024-04-01.

The link is a weighted mean of price relatives on matched constituents only. Entries do not contribute to L_t for the period they enter (no prior price); exits drop out. Composition changes therefore produce zero level move by construction; only price changes on matched models move the index. In the toy case of a 50/50 basket where one model cuts its price by 50% and the other holds, L = 0.5 × 0.5 + 0.5 × 1.0 = 0.75, so the index moves exactly −25%.
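The four steps can be sketched in a few lines. The type and function names are illustrative, and returning a link of 1 when the matched sample is empty is an assumption of this sketch; the methodology does not say how that case is handled.

```typescript
type WeightMap = { [modelId: string]: number };
// A missing or undefined price means "not priced in this period".
type PriceMap = { [modelId: string]: number | undefined };

// Period link: weighted mean of price relatives over the matched sample,
// with weights renormalized over the models priced in BOTH periods.
function periodLink(weights: WeightMap, prev: PriceMap, curr: PriceMap): number {
  let matchedWeight = 0;
  let weightedRelatives = 0;
  for (const id of Object.keys(weights)) {
    const w = weights[id];
    const p0 = prev[id];
    const p1 = curr[id];
    if (typeof w !== "number" || typeof p0 !== "number" || typeof p1 !== "number" || p0 <= 0) {
      continue; // entry, exit, or gap: not part of the matched sample
    }
    matchedWeight += w;
    weightedRelatives += w * (p1 / p0);
  }
  // Assumption: an empty matched sample leaves the level unchanged.
  return matchedWeight > 0 ? weightedRelatives / matchedWeight : 1;
}

// Chain the links forward from the base level.
function chainIndex(base: number, links: number[]): number[] {
  const levels = [base];
  let level = base;
  for (const link of links) {
    level *= link;
    levels.push(level);
  }
  return levels;
}

// Toy case from the text: 50/50 basket, model "a" halves its price, "b" holds.
const toyLink = periodLink({ a: 0.5, b: 0.5 }, { a: 10, b: 10 }, { a: 5, b: 10 }); // 0.75

// A new model "c" enters with no prior price: it is ignored for this link.
const entryLink = periodLink({ a: 0.25, b: 0.25, c: 0.5 }, { a: 10, b: 10 }, { a: 10, b: 10, c: 3 }); // 1

const toyLevels = chainIndex(100, [toyLink, entryLink]); // [100, 75, 75]
```

The second link shows the zero-level-move property: model "c" arrives with half the basket weight, yet the index stays at 75 because "c" has no price relative for the period in which it enters.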

Era boundaries

At an era boundary, the per-model base weights swap to the new era's structure (tier_weight / models_in_tier). Matched models — those present in both the prior era's last priced month AND the new era's first priced month — drive the link. Because the chained construction is matched-sample, the boundary itself never creates a step; only sourced price changes between the matched models in those two periods move the level.

Handover to live data

The reconstruction tail (March 2026) splices to the live segment (June 2026 onward) via a single matched-sample price-relative link. The link is computed once at build time and the live segment chains forward day by day.

To avoid a visual break between the reconstruction tail (2026-03-31) and live inception (2026-06-18), bridge points for 2026-04-30 and 2026-05-31 are derived from the current 29-model basket with its published list prices held flat across those two months. The bridge points are flagged reconstructed: true and chained at link 1.0 (the matched sample's prices were stable across these months). The chart renders dashed continuously from the historical reconstruction into the live segment.
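One way to picture the composed series is sketched below. It rests on several assumptions that the text does not confirm: that live values are stored on a base of 100 at live inception, that the handover link is applied once between the reconstruction tail and the first bridge point, and that live values are rescaled onto the resulting level. All names and numbers are hypothetical.

```typescript
interface IndexPoint {
  date: string;
  value: number;
  reconstructed: boolean;
}

// Assumed storage format for live data: base 100 at live inception.
interface LivePoint {
  date: string;
  value: number;
}

function composeSeries(
  recon: IndexPoint[],   // reconstructed monthly points, oldest first
  bridgeDates: string[], // e.g. ["2026-04-30", "2026-05-31"]
  handoverLink: number,  // matched-sample link from tail basket to live basket
  live: LivePoint[],     // live daily points, oldest first
): IndexPoint[] {
  let tailLevel = 100; // falls back to the base if there is no reconstruction
  for (const p of recon) tailLevel = p.value;

  // One-time splice: the level the live segment starts from.
  const spliceLevel = tailLevel * handoverLink;

  // Bridge points chain at link 1.0, so they all sit at the splice level.
  const bridge: IndexPoint[] = bridgeDates.map((date) => ({
    date: date,
    value: spliceLevel,
    reconstructed: true,
  }));

  // Rescale live values so that live inception (100) lands on the splice level.
  const rebasedLive: IndexPoint[] = live.map((p) => ({
    date: p.date,
    value: (p.value / 100) * spliceLevel,
    reconstructed: false,
  }));

  return recon.concat(bridge, rebasedLive);
}

// Invented numbers for illustration only.
const composed = composeSeries(
  [
    { date: "2026-02-28", value: 82, reconstructed: true },
    { date: "2026-03-31", value: 80, reconstructed: true },
  ],
  ["2026-04-30", "2026-05-31"],
  0.975,
  [
    { date: "2026-06-18", value: 100 },
    { date: "2026-06-19", value: 99 },
  ],
);
```

In this sketch the append-only live file is never modified; the rescaling happens only when the series is composed for display.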

What the index says about deflation

IFX is a conservative fixed-basket measure. It tells you whether the list prices of representative LLMs are falling. The chained index to date shows a decline of about 19% over 24 months: list prices themselves are fairly sticky. Major moves are concentrated in documented price events: the Aug 2024 Google price cuts and the June 2025 o3 list-price cut.

The much larger deflation a typical user experiences comes from substitution — switching down to a cheaper model that now meets the capability bar. That's the story the sibling Cost-of-Intelligence index (CoI) tells. IFX and CoI are complementary:

IFX answers: "are list prices themselves falling?" — held-basket Laspeyres.
CoI answers: "is it getting cheaper to buy a given level of capability?" — substitution measure.

CoI is forward-only: its history accrues from its 2026-06-18 inception and is not back-filled, because capability scores can't be reliably reconstructed for historical models on the current Artificial Analysis (AA) v4.1 scale without chained re-evaluation. IFX's history is reconstructed; CoI's history is forward-accruing.

Limitations (v1.1)

Conservative bias is intentional. IFX captures sourced price-change events on held basket members. Minor or un-announced price tweaks that weren't sourced are missing — the index may understate real deflation by some margin. We accept that for the audit trail (every move points to a documented event).
Reconstruction granularity is monthly; live granularity is daily. The handover splice is a single one-time link, not a continuous re-chaining.
No input-only / output-only sub-indices for the reconstructed segment — the historical reconstruction emits one blended series. Input/output sub-indices are live-only from 2026-06-18 forward.

Versioning

This methodology is versioned. The basket composition, tier weights, and input-to-output ratio are fixed within a version; any change to them increments the version number and is logged here, so anyone can see exactly how the calculation has evolved. The weights in particular are an editorial estimate and are expected to be refined as real usage-share data becomes available.

Cost of Intelligence — the capability-adjusted sibling index

IFX answers "how expensive is a token?" The Cost of Intelligence (CoI) index answers a different question: how expensive is it to buy a given level of capability?

Raw token prices fall slowly. The cost to achieve a given level of capability falls fast — because when a cheaper model crosses a capability threshold, you can switch down to it for the same job. CoI measures that.

The bars

CoI is computed against three fixed capability thresholds, called bars. Each bar names a numeric capability gate and carries a public-facing reference-class label.

| Bar | Gate (AA v4.1) | Reference class |
|---|---:|---|
| Frontier | ≥ 50 | ≈ GPT-5.5 / Opus-class |
| Capable | ≥ 40 | ≈ DeepSeek V4 Pro-class |
| Budget-capable | ≥ 25 | ≈ GPT-4-class |

The label is an approximate class; the gate is the numeric threshold. A model qualifies if and only if its canonical capability score meets the gate. The reference label is for human intuition, not for math.

The champion picker

For each bar on each day:

1. Take every non-provisional model whose canonical capability score meets the bar's gate.
2. Compute each qualifier's blended USD-per-1M-token cost using the same 3:1 input-to-output mix as IFX.
3. The model with the lowest blended cost is the champion for that bar that day.
4. The bar's series value is normalized to base 100 against the champion's blended cost on the bar's inception date.

When a cheaper qualifier crosses the bar — or when the existing champion's price drops — the index falls. That fall is the cost of intelligence dropping.
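The champion-picking steps can be sketched as follows. The `CoiCandidate` shape, the model ids, scores, and prices are all invented for the example, and breaking ties in favour of the first model seen is an assumption of this sketch.

```typescript
interface CoiCandidate {
  id: string;
  capabilityScore: number; // canonical capability score
  provisional: boolean;    // placeholder scores are excluded from gating
  inputUsdPerM: number;
  outputUsdPerM: number;
}

// Same 3:1 input-to-output mix as IFX.
function blendedUsdPerM(c: CoiCandidate): number {
  return (3 * c.inputUsdPerM + c.outputUsdPerM) / 4;
}

// Cheapest non-provisional model whose score meets the gate, or null if none qualifies.
function pickChampion(candidates: CoiCandidate[], gate: number): CoiCandidate | null {
  let champion: CoiCandidate | null = null;
  let lowest = Infinity;
  for (const c of candidates) {
    if (c.provisional || c.capabilityScore < gate) continue;
    const cost = blendedUsdPerM(c);
    if (cost < lowest) {
      lowest = cost;
      champion = c;
    }
  }
  return champion;
}

// Bar value: today's champion cost relative to the inception champion cost, base 100.
function barValue(todayChampionCost: number, inceptionChampionCost: number): number {
  return (todayChampionCost / inceptionChampionCost) * 100;
}

// Invented candidates for illustration.
const candidates: CoiCandidate[] = [
  { id: "model-a", capabilityScore: 55, provisional: false, inputUsdPerM: 5, outputUsdPerM: 15 },
  { id: "model-b", capabilityScore: 42, provisional: false, inputUsdPerM: 1, outputUsdPerM: 3 },
  { id: "model-c", capabilityScore: 45, provisional: true, inputUsdPerM: 0.2, outputUsdPerM: 0.4 },
  { id: "model-d", capabilityScore: 30, provisional: false, inputUsdPerM: 0.1, outputUsdPerM: 0.3 },
];
const capableChampion = pickChampion(candidates, 40);  // model-b (model-c is provisional)
const frontierChampion = pickChampion(candidates, 50); // model-a
```

Note that `model-c` is cheaper than `model-b` and clears the gate of 40, but it stays out of the pool until its provisional score is replaced by a sourced one.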

Capability scores

Capability is a sourced editorial input, treated exactly like price: a swappable adapter populates it, behind a per-source interface so the data source can evolve without touching the index math.

Primary source: Artificial Analysis Intelligence Index v4.1 (June 2026 scale). AA's published composite is used as the bar-gate value in v1.
Validation source: Epoch AI evals, archived in parallel under their own source tag.
Canonical effort = highest-available variant. AA tracks the same model at multiple reasoning-effort levels (e.g. GPT-5.5 xhigh / high / medium / low / non-reasoning). We gate on the model's highest-effort score and archive every variant. Caveat: peak-effort capability paired with list token price understates the real cost of high-effort reasoning, which burns far more tokens per task.
Provisional scores are excluded from gating. Any score that is a placeholder rather than a real keyed value carries a provisional: true flag and is removed from the champion-picking pool until a real value is sourced.
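The canonical-effort rule could look something like the sketch below. The effort ordering is taken from the example list above; the `EffortScore` shape is hypothetical, and falling back to the next-highest effort level when the top variant is provisional is an assumption, since the methodology only says provisional scores are removed from the pool.

```typescript
// Effort levels from lowest to highest, following the example in the text.
const EFFORT_ORDER = ["non-reasoning", "low", "medium", "high", "xhigh"];

interface EffortScore {
  effort: string;
  score: number;
  provisional?: boolean;
}

// Canonical score = the non-provisional variant with the highest effort level.
// Variants with an effort label outside EFFORT_ORDER are ignored in this sketch.
function canonicalScore(variants: EffortScore[]): EffortScore | null {
  let best: EffortScore | null = null;
  let bestRank = -1;
  for (const v of variants) {
    if (v.provisional === true) continue;
    const rank = EFFORT_ORDER.indexOf(v.effort);
    if (rank > bestRank) {
      best = v;
      bestRank = rank;
    }
  }
  return best;
}

// Invented scores for illustration.
const canonical = canonicalScore([
  { effort: "low", score: 30 },
  { effort: "high", score: 48 },
  { effort: "medium", score: 41 },
]); // the "high" variant, score 48
```

Every variant would still be archived; this function only decides which one is compared against the bar's gate.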

Granular archiving

CoI archives granular per-evaluation sub-scores, not just the composite. Each day, for each (model, source) pair, a row is written to data/capability-history.jsonl in a tidy/long shape — one record per (date, modelId, source), append-only, idempotent on that triple. The row carries:

All effort-level variants the source publishes
The source's per-evaluation sub-scores (as a key → score map, captured in full)
The source's own published composite, kept alongside but never read as canonical
Versioned method metadata (sourceMethodVersion, sourceAsOf) so the archive remains interpretable across source revisions
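A row and its idempotent append might be modelled like this. Only `date`, `modelId`, `source`, `sourceMethodVersion`, and `sourceAsOf` are names taken from the text; the remaining field names, the evaluation keys, and the in-memory array standing in for data/capability-history.jsonl are assumptions made to keep the sketch self-contained.

```typescript
interface CapabilityRow {
  date: string;
  modelId: string;
  source: string;
  effortVariants: { [effort: string]: number }; // every effort level the source publishes
  subScores: { [evaluation: string]: number };  // per-evaluation sub-scores, captured in full
  sourceComposite: number;                      // kept alongside, never read as canonical
  sourceMethodVersion: string;
  sourceAsOf: string;
}

// The (date, modelId, source) triple identifies a row.
function rowKey(r: { date: string; modelId: string; source: string }): string {
  return [r.date, r.modelId, r.source].join("|");
}

// Append-only and idempotent: writing the same triple twice is a no-op.
// `lines` stands in for the JSONL file, one JSON document per line.
function appendRow(lines: string[], row: CapabilityRow): string[] {
  const key = rowKey(row);
  for (const line of lines) {
    const existing = JSON.parse(line) as CapabilityRow;
    if (rowKey(existing) === key) return lines;
  }
  return lines.concat([JSON.stringify(row)]);
}

// Invented row for illustration.
const exampleRow: CapabilityRow = {
  date: "2026-06-18",
  modelId: "model-a",
  source: "aa",
  effortVariants: { high: 48, low: 30 },
  subScores: { "eval-1": 61, "eval-2": 37 },
  sourceComposite: 48,
  sourceMethodVersion: "v4.1",
  sourceAsOf: "2026-06-18",
};
const afterFirst = appendRow([], exampleRow);
const afterSecond = appendRow(afterFirst, exampleRow); // unchanged: same triple
```

Existing lines are never rewritten in this sketch, which matches the append-only intent: a rerun of the daily job cannot alter history.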

This preserves maximum data fidelity from day one, so a proprietary composite can be derived from history later — including domain-weighted aggregates (Agents 34% / Coding 24% / Scientific Reasoning 24% / General 18% per AA v4.1) — without re-querying the source.
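As an illustration of a domain-weighted aggregate, the sketch below applies the 34/24/24/18 weights to per-domain scores. How archived sub-scores roll up into the four domain scores is not described in the text, so the function simply takes per-domain scores as given; the names and numbers are hypothetical.

```typescript
type Domain = "agents" | "coding" | "scientificReasoning" | "general";

// Domain weights quoted in the text for AA v4.1.
const DOMAIN_WEIGHTS: Record<Domain, number> = {
  agents: 0.34,
  coding: 0.24,
  scientificReasoning: 0.24,
  general: 0.18,
};

// Weighted sum of per-domain scores (assumed to be on a common scale).
function domainWeightedComposite(scores: Record<Domain, number>): number {
  return (
    DOMAIN_WEIGHTS.agents * scores.agents +
    DOMAIN_WEIGHTS.coding * scores.coding +
    DOMAIN_WEIGHTS.scientificReasoning * scores.scientificReasoning +
    DOMAIN_WEIGHTS.general * scores.general
  );
}

// Invented per-domain scores:
// 0.34*40 + 0.24*60 + 0.24*50 + 0.18*50 = 13.6 + 14.4 + 12 + 9 = 49
const exampleComposite = domainWeightedComposite({
  agents: 40,
  coding: 60,
  scientificReasoning: 50,
  general: 50,
});
```

Because the archive keeps the raw sub-scores, a different weighting could be applied to the same history later without touching the stored rows.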

Scale versioning and chaining

Bars are pinned to the underlying capability scale (currently aa-intelligence-index-v4.1). When AA bumps the scale, we never silently move a bar: scores from both the old and new versions are archived together across an overlap window, a one-time chain link is fit (score_new = a + b · score_old on overlap), and only then is the bar redefined against the new scale. Each scale event is recorded in CAPABILITY_SCALE_HISTORY so the audit trail is intact.
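The chain-link fit can be sketched with ordinary least squares on the overlap window. The text only says a link of the form score_new = a + b · score_old is fit, so the use of least squares is an assumption, as is mapping an existing gate through the fitted line to redefine the bar. Names and data are hypothetical.

```typescript
interface OverlapPair {
  oldScore: number; // score on the outgoing scale
  newScore: number; // same model's score on the incoming scale
}

interface ChainLink {
  a: number; // intercept
  b: number; // slope
}

// Least-squares fit of newScore = a + b * oldScore over the overlap window.
function fitChainLink(pairs: OverlapPair[]): ChainLink {
  const n = pairs.length;
  if (n < 2) throw new Error("need at least two overlap observations");

  let sumX = 0;
  let sumY = 0;
  for (const p of pairs) {
    sumX += p.oldScore;
    sumY += p.newScore;
  }
  const meanX = sumX / n;
  const meanY = sumY / n;

  let sxx = 0;
  let sxy = 0;
  for (const p of pairs) {
    const dx = p.oldScore - meanX;
    sxx += dx * dx;
    sxy += dx * (p.newScore - meanY);
  }
  if (sxx === 0) throw new Error("old scores must not all be identical");

  const b = sxy / sxx;
  return { a: meanY - b * meanX, b: b };
}

// Express a value from the old scale (for example a bar's gate) on the new scale.
function mapToNewScale(oldValue: number, link: ChainLink): number {
  return link.a + link.b * oldValue;
}

// Invented overlap data lying exactly on newScore = 2 + 0.9 * oldScore.
const exampleLink = fitChainLink([
  { oldScore: 20, newScore: 20 },
  { oldScore: 40, newScore: 38 },
  { oldScore: 60, newScore: 56 },
]);
const newFrontierGate = mapToNewScale(50, exampleLink); // about 47
```

Recording the fitted `a` and `b` with the scale event would let anyone reproduce how a gate moved from one scale to the next.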

CoI v1 launches at AA v4.1 inception; no chaining event is required yet.

IFX is an independent reference index. Prices are sourced from public provider pricing as of each daily update and may differ from negotiated or discounted rates. CoI capability scores are sourced from Artificial Analysis and Epoch AI; bars are an editorial choice subject to revision under the chaining protocol above.