Serving & access: APIs, local inference, and gateways.
This entry separates two things people routinely conflate: which model you use and where it runs. You will leave with a clear map of the access layer — first-party APIs, cloud-hosted model catalogs, independent inference providers, local/self-hosted serving, and gateways — and the trade-offs that decide which one fits.
Model choice and serving choice are orthogonal.
A common mistake is treating "use Llama" or "use Claude" as a single decision. In reality you pick a model and, separately, a way to run it. A given open-weight model might be available from several independent inference companies, from cloud catalogs, and on your own hardware — same weights, very different latency, price, privacy, and reliability. Decoupling these two decisions is the core skill of this entry.
The access options.
First-party provider API
Call the lab that trained the model directly. You get the newest snapshots first, the provider's full feature set (tool use, caching, reasoning controls), and their reliability engineering. Trade-off: only their models, their pricing, their availability and roadmap. The default for closed frontier models.
Cloud-hosted model catalogs
Major cloud platforms offer managed catalogs that expose many models (often including third-party and open-weight ones) behind a unified, in-cloud endpoint. Value: data stays in your existing cloud boundary, unified billing and IAM, enterprise compliance. Trade-off: model availability lags the first-party source, and you inherit the cloud's regional and quota constraints.
Independent inference providers
Companies that specialize in serving open-weight models fast and cheap, competing on latency, throughput, and price per token. Value: often the cheapest/fastest route to a popular open model, no infrastructure to run. Trade-off: you are trusting a third party with your data and uptime, and the model menu is whatever they choose to host.
Local / self-hosted serving
Run open-weight models on your own hardware with an inference engine, from a laptop-scale runtime for development up to optimized GPU serving stacks in production. Value: maximum control and privacy, fixed cost at high utilization, no per-token bill. Trade-off: you own capacity planning, scaling, batching efficiency, security patching, and uptime — a real engineering function, not a checkbox.
Gateways / routers
A proxy layer in front of one or many of the above that presents a single API and adds cross-provider concerns: key management, fallback and load-balancing across providers, spend caps and rate limits, caching, and unified logging. Value: provider independence and operational control from one place. Trade-off: another component in the hot path to operate and secure, and a small added latency.
The dimensions that decide it.
- Data boundary. Where is inference data legally and contractually allowed to go? Often the first filter, and it can force self-hosting or in-cloud serving regardless of cost.
- Cost model. Per-token opex (APIs, inference providers) vs fixed capacity (self-host). Crossover depends entirely on sustained utilization.
- Latency and throughput. Independent providers compete hard here; first-party varies; self-host is whatever you engineer it to be.
- Feature lag. First-party gets new model snapshots and capabilities first; intermediaries lag by a cycle or more.
- Reliability and concentration risk. A single provider is a single point of failure. A gateway with fallbacks trades simplicity for resilience.
A pragmatic default.
For most teams starting out: call the first-party API of whichever model passes your evaluation, behind a thin internal abstraction of your own. That abstraction is the cheap insurance — it lets you later move a model behind a gateway, add a fallback provider, or self-host the high-volume path without rewriting your application. Premature multi-provider routing is over-engineering; an un-abstracted hard dependency on one endpoint is under-engineering. The thin seam between them is the right starting posture, and you add gateways or self-hosting only when a concrete cost, privacy, or reliability requirement demands it.
Decide the model on capability and your eval. Decide the serving path on data boundary, cost at your volume, and reliability needs. Keeping these two decisions separate — and seam-able in code — is what keeps the access layer from becoming lock-in.