Your B2B product handles sensitive data. The legal team said no to sending it to a third-party API. Your CTO says self-hosting a language model is "too complex." Both of them are half-right.
The legal team is correct that sending customer data to a third-party inference API likely violates your enterprise contracts: most enterprise MSAs have explicit data residency and subprocessor requirements that third-party inference APIs often cannot meet. The CTO is correct that running your own GPU cluster is operationally complex. What neither of them has fully mapped is the space between those two positions, which is where the right answer almost always lives.
What this costs if you ignore it
Your enterprise customers have security questionnaires. Those questionnaires ask where their data goes. "We send it to an inference API run by a third party" is not an answer that passes enterprise procurement for most regulated verticals.
Not having language model capabilities in your product in 2026 is a competitive disadvantage, and it shows up in closed/lost rates. Competitors in your category have these features. Buyers notice.
The false binary — third-party API or nothing — costs you the enterprise segment. Getting the architecture right costs you engineering time once. That's the trade.
The three self-hosting options
Option 1: Full self-host on GPU infrastructure
You provision GPU or accelerator compute (typically AWS EC2 p3, g5, or inf2 instances, or equivalent on GCP/Azure), deploy the model weights, manage the inference server (vLLM, TGI, or similar), handle scaling, and own the full operational stack.
What this actually costs. For a production-grade 7B parameter model with reasonable throughput, you're looking at a minimum of two g5.48xlarge instances ($16.32/hr each as of late 2025) for redundancy, a load balancer, and the operational time to manage it. Two instances running around the clock for a roughly 730-hour month come to about $23.8k, so before you've handled autoscaling, monitoring, or model updates, you're at roughly $23k/month in base infrastructure plus engineering time.
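The monthly baseline follows directly from the hourly rate. A minimal sketch of that arithmetic, using the hourly price and instance count quoted above; the function name and the 730-hour month (8,760 hours / 12) are assumptions made for illustration:

```python
HOURS_PER_MONTH = 730  # assumption: average month, 8,760 hours / 12


def monthly_instance_cost(hourly_rate_usd: float, instance_count: int,
                          hours: float = HOURS_PER_MONTH) -> float:
    """On-demand cost of keeping `instance_count` instances running all month."""
    return hourly_rate_usd * instance_count * hours


# Figures quoted in the text: two instances at $16.32/hr each.
baseline = monthly_instance_cost(hourly_rate_usd=16.32, instance_count=2)
print(f"${baseline:,.0f}/month")  # prints "$23,827/month"
```

This covers compute only. The load balancer, storage, data transfer, and engineering time sit on top of it.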
For a 70B model that provides quality closer to frontier models, the numbers scale roughly linearly. This is a serious infrastructure investment.
When this is the right answer. You have extremely high query volume (the per-token cost on managed inference becomes prohibitive at scale). You have the strictest possible data residency requirements — air-gapped, on-premise, no cloud. You have the engineering team to run it. Almost no B2B SaaS company at Series A–B should be doing this.
Option 2: Managed inference on your own cloud account
Several cloud platforms (Amazon Bedrock, Azure AI, Google Vertex AI) let you run models from within your own cloud account, reached over private endpoints under your own VPC and IAM controls. Prompts and outputs stay inside your cloud provider's boundary rather than going to a separate inference vendor. The provider manages the infrastructure; you manage the networking and access controls.
What this actually costs. Bedrock, as a representative example, prices per token. For a Llama 3 70B class model (which is representative of what you'd use for complex enterprise use cases), expect $0.002–$0.003 per 1,000 output tokens. At moderate usage (100k tokens/day, about 3M tokens/month), that's roughly $6–9/month. At heavy enterprise usage (2M tokens/day, about 60M tokens/month), you're at roughly $120–180/month.
The operational burden is substantially lower than full self-host. The infrastructure is managed. The model is updated by the provider. You focus on the application layer.
When this is the right answer. You're in AWS, Azure, or GCP already. Your data residency requirements are satisfied by keeping data within your cloud account. You want model quality close to frontier without the operational complexity of full self-host. This is the correct answer for most B2B SaaS companies with data sensitivity requirements.
Option 3: Hybrid with data isolation
For applications where the sensitive data is a small portion of the inference context — for example, a document summarization feature where you can strip PII before sending to inference, or a code review tool where the sensitive data is the customer's code but the model needs only code structure, not business logic — a hybrid approach is viable.
The pattern: a data sanitization layer runs in your infrastructure, strips or pseudonymizes the sensitive elements, and forwards the sanitized context to a cloud inference API. The response is then mapped back against the original context.
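The pattern above can be sketched in a few lines. This is an illustrative example only: it pseudonymizes email addresses with a regular expression and maps placeholders back afterwards. The names `pseudonymize` and `restore`, the placeholder format, and the choice of email as the only sensitive element are all assumptions; a real sanitization layer would cover far more entity types and needs its own review.

```python
import re
from typing import Dict, Tuple

# Assumption: emails are the only sensitive element in this toy example.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace each distinct email with a stable placeholder.

    Returns the sanitized text and a placeholder -> original mapping,
    which stays inside your infrastructure.
    """
    mapping: Dict[str, str] = {}  # placeholder -> original value
    seen: Dict[str, str] = {}     # original value -> placeholder

    def _swap(match: re.Match) -> str:
        original = match.group(0)
        if original not in seen:
            placeholder = f"<EMAIL_{len(seen) + 1}>"
            seen[original] = placeholder
            mapping[placeholder] = original
        return seen[original]

    return EMAIL_RE.sub(_swap, text), mapping


def restore(text: str, mapping: Dict[str, str]) -> str:
    """Map placeholders in a model response back to the original values."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


sanitized, mapping = pseudonymize(
    "Contact jane.doe@example.com or bob@example.org; cc jane.doe@example.com."
)
print(sanitized)  # Contact <EMAIL_1> or <EMAIL_2>; cc <EMAIL_1>.
print(restore("Reply to <EMAIL_2> first.", mapping))  # Reply to bob@example.org first.
```

Only `sanitized` would be sent to the external API. The mapping never leaves your environment, which is what makes the round trip safe to reverse.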
What this actually costs. The sanitization layer is engineering work — typically 2–4 weeks to build correctly, longer if the sanitization requirements are complex. The ongoing inference costs are cloud API rates ($0.003–$0.015 per 1,000 tokens depending on model and provider). The tradeoff is engineering investment for operational simplicity.
When this is the right answer. Your sensitive data can be meaningfully sanitized without destroying the quality of the inference context. The legal team and customers are satisfied by "we never send raw customer data off our cloud account." The hybrid approach requires careful analysis of whether the sanitization is genuinely removing what it claims to remove.
The decision matrix
Four variables determine the right architecture: data sensitivity, query volume, latency requirements, and model quality needs.
Data sensitivity (the axis that matters most for B2B). If your enterprise customers have strict data residency requirements, full self-host or managed-in-account inference is the answer. If they have standard enterprise data protection requirements, managed-in-account inference with proper VPC controls usually suffices. If the sensitive data can be isolated from the inference context, hybrid is viable.
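Query volume. Below 500k tokens/day, managed inference pricing is almost always more economical than the infrastructure cost of self-hosting. Above 5M tokens/day, the economics of self-hosting can start to make sense for a small model on modest hardware, but the threshold moves with the fixed cost you are comparing against: at $0.002–$0.003 per 1,000 tokens, the $23k/month redundant baseline from Option 1 only breaks even at hundreds of millions of tokens per day. Most B2B SaaS features at Series A–B are well below either threshold.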
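The volume at which self-hosting pays off depends on the fixed cost it is compared against. A rough sketch of the break-even calculation, using the Option 1 baseline and the top of the Option 2 price range from this article as inputs; the function name and the 30-day month are assumptions, and engineering time is ignored:

```python
def breakeven_tokens_per_day(self_host_monthly_usd: float,
                             managed_usd_per_1k_tokens: float,
                             days_per_month: int = 30) -> float:
    """Daily token volume at which a managed per-token bill equals a fixed
    self-hosting bill. Below this volume, managed inference is cheaper."""
    if managed_usd_per_1k_tokens <= 0:
        raise ValueError("managed price must be positive")
    tokens_per_month = self_host_monthly_usd / managed_usd_per_1k_tokens * 1000
    return tokens_per_month / days_per_month


# $23k/month self-hosted baseline versus $0.003 per 1,000 tokens managed.
volume = breakeven_tokens_per_day(23_000, 0.003)
print(f"{volume / 1e6:,.0f}M tokens/day")  # prints "256M tokens/day"
```

Rerun it with your own fixed cost: a smaller self-hosted footprint pulls the break-even point down proportionally.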
Latency requirements. For user-facing features, latency matters. Managed inference on GPU infrastructure optimized for inference (Bedrock, Vertex) typically delivers the first token in 50–200ms on 7B–13B models. Full self-host on well-tuned infrastructure can achieve similar or better performance. Hybrid architectures add the latency of the sanitization layer.
Model quality needs. The open-weight model landscape in 2026 includes production-grade options from the Llama, Mistral, Qwen, and Gemma families at sizes from 3B to 70B. For many B2B use cases (document processing, structured extraction, summarization, classification), a fine-tuned 7B–13B model running on managed infrastructure can match or exceed frontier model quality on the specific task, because fine-tuning on your domain often beats general capability on a narrow problem.
What the infrastructure actually looks like
For the managed-in-account path (the right answer for most B2B teams):
A VPC with private subnets. An inference endpoint within the VPC — Bedrock, a SageMaker endpoint running a custom model, or an Azure AI deployment. An application service that constructs the inference request, applies rate limiting and caching, calls the inference endpoint over the private network, and handles the response.
The application service is the part you build. The inference endpoint is managed infrastructure. The key pieces: request logging (you need to know what was sent and received, both for debugging and for compliance), a caching layer for repeated queries (deduplication on high-volume features can reduce inference costs by 30–50%), and circuit breaker logic (if the inference endpoint is slow, your feature should degrade gracefully, not time out for the user).
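Two of these pieces, the deduplicating cache and the circuit breaker, fit in a short sketch. Everything here is hypothetical: `InferenceClient`, `CircuitOpenError`, and the `call_endpoint` callable stand in for whatever SDK call your provider exposes, and the thresholds are placeholders. The sketch assumes `call_endpoint` enforces its own timeout and raises on slow or failed calls. Request logging is left out for brevity.

```python
import hashlib
import time
from typing import Callable, Dict, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the endpoint is not being called."""


class InferenceClient:
    def __init__(self, call_endpoint: Callable[[str], str],
                 failure_threshold: int = 3, cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._call = call_endpoint          # hypothetical: wraps the real SDK call
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._cache: Dict[str, str] = {}    # prompt hash -> response
        self._failures = 0                  # consecutive failures
        self._opened_at: Optional[float] = None

    def complete(self, prompt: str) -> str:
        # Deduplicate identical prompts before spending tokens.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._cache:
            return self._cache[key]

        # While the breaker is open, fail fast instead of waiting on a slow endpoint.
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._cooldown:
                raise CircuitOpenError("inference endpoint unavailable")
            self._opened_at = None  # cooldown elapsed: allow one trial call

        try:
            result = self._call(prompt)
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise

        self._failures = 0
        self._cache[key] = result
        return result
```

The caller catches `CircuitOpenError` and shows a degraded state (for example, hiding the summary panel) rather than letting the user wait on a timeout. An in-memory dict is the simplest possible cache; a shared store with expiry would be the likely production choice.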
Model updates are handled by the provider on the managed path. If you've fine-tuned a custom model, updates require re-running the fine-tuning pipeline and deploying the new weights.
Cost modeling
The number to size is total tokens per user per day, multiplied by your user count, multiplied by your inference cost per token.
A document summarization feature used once per day, summarizing a roughly 600-word document into about 200 words, uses approximately 800 input tokens and 300 output tokens (English text runs at roughly 1.3 tokens per word with common tokenizers). At $0.002/1k input and $0.003/1k output: $0.0016 + $0.0009 = $0.0025 per use. At 1,000 active users: $2.50/day, $75/month.
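The same arithmetic as a small function, so you can plug in your own feature. The token counts, prices, and user count are the ones from the example above; the function name and the 30-day month are assumptions:

```python
def monthly_inference_cost(input_tokens: int, output_tokens: int,
                           uses_per_user_per_day: float, users: int,
                           usd_per_1k_input: float, usd_per_1k_output: float,
                           days_per_month: int = 30) -> float:
    """Monthly spend for one feature, priced per 1,000 tokens."""
    cost_per_use = (input_tokens / 1000 * usd_per_1k_input
                    + output_tokens / 1000 * usd_per_1k_output)
    return cost_per_use * uses_per_user_per_day * users * days_per_month


cost = monthly_inference_cost(
    input_tokens=800, output_tokens=300,
    uses_per_user_per_day=1, users=1000,
    usd_per_1k_input=0.002, usd_per_1k_output=0.003,
)
print(f"${cost:.2f}/month")  # prints "$75.00/month"
```

Caching is not modeled here. If a share of requests are repeats served from cache, multiply the result by the fraction that actually reaches the endpoint.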
This is not a significant cost for a B2B product with $50–500/seat pricing. The inference cost in most B2B applications is a rounding error against the value it provides, as long as you've applied basic caching and request deduplication.
The cost becomes significant at high volumes. If you're doing bulk processing (running inference against every document in a customer's corpus), you need to model the batch volume separately from the interactive volume, and you will likely want to cache aggressively and batch during off-peak hours.
The deployment story
The managed-in-account path deploys as a standard cloud service. Infrastructure as code (Terraform or CDK) provisions the VPC, subnets, endpoint, and IAM roles. The application service is containerized and deployed on your existing compute platform (ECS, EKS, Cloud Run, or equivalent). Network routing is private — the inference requests never traverse the public internet.
The full self-host path adds a GPU node group to your Kubernetes cluster, deploys vLLM or TGI as a service, and handles model weight distribution (the weights must be stored where the inference pods can read them; an S3 bucket with restricted access works). Model loading time on startup is significant for large models (10–30 minutes for 70B on cold start), which means autoscaling needs to account for warmup time.
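One way to account for warmup is to size the replica count against the load you expect once new pods are actually ready, not the load you see now. A minimal sketch under that assumption; the function name, the per-replica throughput figure, and the 20% headroom are all hypothetical, and the forecast itself has to come from your own traffic data:

```python
import math


def replicas_to_request(forecast_rps_after_warmup: float,
                        rps_per_replica: float,
                        headroom: float = 0.2) -> int:
    """Replicas to have requested now so that capacity exists when warmup ends.

    forecast_rps_after_warmup: expected requests/sec one warmup period from now.
    rps_per_replica: sustained requests/sec a single warm replica can serve.
    headroom: extra fraction of capacity held in reserve (assumed 20%).
    """
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    needed = forecast_rps_after_warmup * (1 + headroom) / rps_per_replica
    return max(1, math.ceil(needed))


# Hypothetical numbers: 9 req/s expected after warmup, 4 req/s per replica.
print(replicas_to_request(forecast_rps_after_warmup=9, rps_per_replica=4))  # prints 3
```

With a 10–30 minute cold start, a purely reactive autoscaler adds capacity after the spike has already hit, so the forecast horizon should be at least as long as the warmup time.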
This is for you if
You're building a B2B product that handles enterprise customer data, has legal or contractual requirements that prevent sending that data to third-party inference APIs, and needs language model capabilities to be competitive in your market.
This work is part of larger product build engagements ($100k+), where the inference architecture is designed and deployed as part of the product. We do not do standalone inference infrastructure consulting; we build products that include this architecture as a component.
This is not for consumer applications where data sensitivity is lower, or for companies that have not yet worked through the contractual requirements with their legal team. The architecture decision should follow the legal analysis, not precede it.