🤖 You Should Be Scared About Inferencing 💰 Bay Area Startups Collectively Secured $2.76B 🤝 AI Tech Leaders Club Recap 👨🏽💻

🤖 You Should Be Scared of Inferencing
by Jay Chiang, CTO of Pascaline Systems
Training was the easy part.
I know it doesn’t sound like it. Especially not against the backdrop of billion-dollar clusters, megawatt power draws, and an entire economy built around GPU scarcity. But training is a project. It has a beginning, a budget, and an end.
Inference only looks easy.
Yes, each query is lighter, each model smaller. But that illusion hides a harder truth: inference never stops. It’s everything, everywhere, all at once. It scales not with chips, but with society. It doesn’t just compute. It’s what turns machine intelligence into social reality.
The GPU boom created the impression that the AI race is over. It seems that hyperscalers own the stack and the rest of us merely consume it. But training was the sprint. Inference is the marathon that follows, and few are prepared for the distance.
The Infrastructure Illusion
Investors have poured hundreds of billions into AI infrastructure over the past three years. According to McKinsey’s 2024 “AI Infrastructure Outlook”, data-center CapEx is projected to reach $400–450 billion by 2025, with more than 70 percent dedicated to training-centric capacity. (Deloitte AI Infrastructure 2024 and Dell’Oro Group 2024 reports show similar magnitudes.) The playbook was simple: build ever-larger campuses where power is available or cheapest. The physics of inference breaks that logic.
“Many believe they’ve mastered the playbook. What they have is the wrong chapter.”

Training sites thrive on concentration — hundreds of megawatts, predictable loads, batched traffic. Inference sites thrive on distribution — low latency, regulatory locality, variable demand. The first are built where energy is available; the second must be built where data and users actually are.
“We are entering a phase where inference will demand smaller, geographically distributed facilities — not just larger ones.” — Synergy Research Group, AI Data Center Update 2025
“The next big constraint is not compute power but where that compute can be located.” — International Energy Agency, Electricity 2025 Report
And that’s the illusion: billions committed to megawatt campuses while the next wave of demand splinters into thousands of smaller nodes. What happens when the world realizes the training build-out can’t host the inference economy? Do investors double down on sunk assets or pivot to where usage is already racing ahead?
Our AI Inference Activity Index (AIAI) (see Issue 1) already shows usage leading capital by nearly two years. It’s a wake-up call.
The Enterprise Gap
Every enterprise fears being left behind. Generative AI is now boardroom doctrine. Executives race to show movement — hire a team, rent some GPUs, issue a press release. For many, that is the plan.
Then comes the fear of not being ready. Pilot projects stall. Integrations break. Costs mount. According to the MIT report The GenAI Divide: State of AI in Business 2025, “The outcomes are so starkly divided across both buyers… and builders… that we call it the GenAI Divide. Just 5% of integrated AI pilots are extracting millions in value, while the vast majority remain stuck with no measurable impact”.
Most companies quietly realize that adding AI to a software stack is not the same as running an intelligent system end-to-end. As soon as they start to address the technology challenges, the fear of talent gaps arrives. Everyone knows the story by now. There simply aren’t enough experts. Over half of IT leaders report a worsening AI skills shortage. (CIO Dive, 2025)
One study estimated only about 22,000 true AI specialists worldwide, with millions of unfilled positions across industry. (Keller Executive Search, 2025) NVIDIA’s Jensen Huang warned recently that China is training more than a million AI engineers each year — a scale Western economies cannot yet match. (Reuters, 2025) That warning rattled markets, helping trigger a $1 trillion pullback in tech valuations as investors questioned whether enterprises could execute fast enough.
But these are the visible fears. The most underestimated is the fear of inference itself. Ironically, the more successful you are, the more likely you will face this one.
Inference isn’t an algorithmic milestone. It’s a continuous economic function. Every chat reply, every recommendation, every automated decision is a live transaction involving data movement, model execution, retrieval, and expectation. Each consumes compute and energy. None scale linearly. Unlike training, inference never ends. Every successful deployment multiplies its own cost base.
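To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch. Every figure in it (traffic, tokens per query, throughput, price, power draw) is an illustrative assumption rather than a benchmark; the point is that the bill recurs every day the product exists.

```python
# Back-of-the-envelope inference cost model. Every constant below is an
# illustrative assumption, not a measured figure.
QUERIES_PER_DAY = 5_000_000      # assumed daily traffic for one product surface
TOKENS_PER_QUERY = 1_200         # assumed average prompt + completion length
GPU_TOKENS_PER_SEC = 2_500       # assumed serving throughput of one GPU
GPU_COST_PER_HOUR = 2.50         # assumed blended $/GPU-hour
GPU_POWER_KW = 0.7               # assumed draw per GPU under load

gpu_hours_per_day = QUERIES_PER_DAY * TOKENS_PER_QUERY / GPU_TOKENS_PER_SEC / 3600

print(f"GPU-hours/day : {gpu_hours_per_day:,.0f}")
print(f"serving $/day : {gpu_hours_per_day * GPU_COST_PER_HOUR:,.0f}")
print(f"serving $/year: {gpu_hours_per_day * GPU_COST_PER_HOUR * 365:,.0f}")
print(f"energy kWh/day: {gpu_hours_per_day * GPU_POWER_KW:,.0f}")
```

Double the traffic or the average context length and every line doubles with it; there is no training-style moment when the spend stops.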
Yet most enterprises can’t even measure that economy.
They worry about headcount when they should worry about headroom — how to serve intelligence sustainably under volatile, always-on demand.
And underneath the spreadsheets lies the social layer no one models: how decisions change workflows, how automation reshapes teams, how trust shifts between humans and algorithms. Inference is not just technical. It is social, organizational, and moral infrastructure. Almost no enterprise is prepared for that.
The Technologist’s Blind Spot
Technologists are most comfortable when they can predict what comes next. This time, they can’t. Inference doesn’t behave like the systems they’ve mastered. Its rules are still being written while the machines are already running.
Training rewards averages. Inference punishes outliers. At hyperscale, one slow shard becomes the user experience. Jeff Dean and Luiz Barroso called this out a decade ago in The Tail at Scale: “Temporary high-latency episodes may come to dominate overall service performance.”
Inference lives permanently in that tail. Every request can differ wildly in length, context depth, and memory footprint. A single cache miss or congested hop can saturate a GPU, stall a pipeline, or trigger cascading retries. Adding GPUs hides inefficiency — it doesn’t cure it.
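The fan-out arithmetic behind that observation is easy to reproduce. A minimal sketch, assuming an illustrative 1 percent per-shard tail probability:

```python
# If any single shard hits its latency tail 1% of the time, a request that
# fans out across many shards is gated by the slowest one it touches.
p_shard_slow = 0.01  # assumed per-shard tail probability

for shards in (1, 10, 50, 100):
    p_request_slow = 1 - (1 - p_shard_slow) ** shards
    print(f"{shards:>3} shards -> {p_request_slow:.0%} of requests see the tail")
```

At one shard the tail is a rounding error; at a hundred, most requests live in it.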
Serving keeps context alive. Each session carries its own state: conversation history, vector embeddings, cached retrievals, and a KV cache, fragments scattered across GPU HBM, CPU DRAM, NVMe SSD, and the network. Keeping that state coherent decides your p99 latency and cost, and adding GPUs can’t resolve it. We need memory, and I don’t mean the hardware. This is not a solved problem.
Recent experiments such as PagedAttention (UC Berkeley, 2023) and Pensieve (NYU, 2024) underscore how inference memory is still a research problem, not an engineered one. Both highlight the same fragility: state management, not raw compute, is the new bottleneck.
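To see why, here is a toy sketch of the tiering problem: per-session cache state demoted from GPU memory to host memory to disk as capacity fills, and promoted back on access. The tier sizes and the LRU policy are assumptions for illustration, not how any particular serving stack behaves.

```python
from collections import OrderedDict

# Toy tiered store for per-session KV-cache blocks: GPU HBM -> CPU DRAM -> SSD.
# Capacities are in "blocks" and purely illustrative.
TIERS = [("HBM", 4), ("DRAM", 16), ("SSD", 256)]

class TieredKVStore:
    def __init__(self):
        self.tiers = [OrderedDict() for _ in TIERS]  # LRU order per tier

    def put(self, session_id, kv_blocks):
        self._insert(0, session_id, kv_blocks)

    def get(self, session_id):
        # A hit in a cold tier is promoted back toward HBM, paying a transfer cost.
        for level, tier in enumerate(self.tiers):
            if session_id in tier:
                kv = tier.pop(session_id)
                self._insert(0, session_id, kv)
                return kv, TIERS[level][0]
        return None, "MISS"

    def _insert(self, level, session_id, kv):
        if level >= len(self.tiers):
            return  # evicted entirely; the session must recompute its cache
        _, capacity = TIERS[level]
        tier = self.tiers[level]
        tier[session_id] = kv
        tier.move_to_end(session_id)
        while len(tier) > capacity:
            victim, victim_kv = tier.popitem(last=False)  # demote the LRU victim
            self._insert(level + 1, victim, victim_kv)

store = TieredKVStore()
for i in range(30):
    store.put(f"session-{i}", kv_blocks=f"kv-{i}")
print(store.get("session-0"))   # long-idle session: found in a cold tier
print(store.get("session-29"))  # recent session: still hot in HBM
```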
2025 is being called the Year of Agentic AI, which means we have to consider When Models Talk Back. Longer context windows gave engineers false comfort; that comfort held while humans were the ones generating prompts. Then came agents. They don’t just answer. They think aloud, branch, call tools, verify, and retry. They spend tokens the way supercomputers spend FLOPs.
Recent research on test-time compute (Anthropic 2024, DeepMind 2025) shows that multi-agent reasoning can amplify token generation 10–25× per query — a complete inversion of inference economics.
This new frontier isn’t about size alone. It’s about control — how far to let agents think, how deep to recurse, how to prune search trees in flight. Every decision changes cost, latency, and sometimes the answer itself.
“We’re learning that inference is no longer passive. It plans, acts, and spends.” — MIT CSAIL Workshop 2025
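A minimal sketch of that inversion, with step counts invented to land inside the 10–25× range cited above rather than measured from any real agent:

```python
# One user question, two serving styles. All token counts are illustrative.
CHAT_TOKENS = 1_000  # a plain single-pass answer

agent_trace = [            # an assumed agent trajectory for the same question
    ("plan",          1_500),
    ("tool call 1",   2_000),
    ("tool call 2",   2_000),
    ("verify",        3_000),
    ("retry tool 2",  2_000),
    ("synthesize",    2_500),
]

agent_tokens = sum(tokens for _, tokens in agent_trace)
print(f"single-pass answer: {CHAT_TOKENS:,} tokens")
print(f"agentic answer:     {agent_tokens:,} tokens "
      f"({agent_tokens / CHAT_TOKENS:.0f}x amplification)")
```

Every extra verification pass or retry is another multiple on the same question, charged at serving time.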
That leads to a cruel fact: The Data Plane Is Breaking. Inference now generates its own gravity. Each query creates new embeddings, logs, feedback, and governance trails, data that must persist for learning and compliance. Add localization laws and retention mandates, and the old cloud pattern, stateless microservices over object stores, collapses.
Enterprises now need semantic memory planes: fast retrieval objects, compressed context stores, and policy-aware caches that bridge secure domains. Inside an organization, AI memory must be partitioned by trust boundary (teams, customers, auditors) yet remain instantly searchable. That architecture barely exists today.
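What “policy-aware” could mean in practice: a cache keyed not just by content but by the trust domains allowed to read it. A hypothetical sketch with invented domain names and a deliberately naive policy check:

```python
from dataclasses import dataclass, field

# Toy policy-aware cache: every entry carries the trust domains that may read
# it, and lookups are filtered before anything is returned. The domain names
# and policy model are invented for illustration.

@dataclass(frozen=True)
class CacheEntry:
    key: str
    value: str
    allowed_domains: frozenset

@dataclass
class PolicyAwareCache:
    entries: list = field(default_factory=list)

    def put(self, key, value, allowed_domains):
        self.entries.append(CacheEntry(key, value, frozenset(allowed_domains)))

    def get(self, key, caller_domain):
        for entry in self.entries:
            if entry.key == key and caller_domain in entry.allowed_domains:
                return entry.value
        return None  # a miss and a policy denial look identical to the caller

cache = PolicyAwareCache()
cache.put("q3-revenue-summary", "cached summary text",
          allowed_domains={"finance", "auditors"})
print(cache.get("q3-revenue-summary", caller_domain="finance"))    # served
print(cache.get("q3-revenue-summary", caller_domain="marketing"))  # None: out of scope
```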
That is not all. Real-World Inference Is Physical. When AI leaves the browser and enters cars, robots, headsets, and factories, latency stops being cosmetic. In AR/VR, motion-to-photon must stay under 20 ms. In vehicles, perception-to-actuation must be deterministic. Variance, not speed, is the killer. Expect Time-Sensitive Networking (TSN), edge colocation, and deterministic fabrics to become as critical as GPUs. Inference at the physical edge has to think — and respond — on a schedule the cloud was never built for.
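A small simulation makes the point. Both paths below have the same average latency; only the jitter differs, and only one reliably meets a 20 ms deadline. The distributions are assumptions, not measurements of any real system:

```python
import random
import statistics

# Two simulated perception-to-actuation paths with identical mean latency.
random.seed(0)
DEADLINE_MS = 20.0

def simulate(mean_ms, jitter_ms, n=100_000):
    samples = [max(0.0, random.gauss(mean_ms, jitter_ms)) for _ in range(n)]
    p99 = statistics.quantiles(samples, n=100)[98]        # 99th percentile
    miss_rate = sum(s > DEADLINE_MS for s in samples) / n
    return p99, miss_rate

for label, jitter in (("low-jitter path ", 1.0), ("high-jitter path", 6.0)):
    p99, miss_rate = simulate(mean_ms=12.0, jitter_ms=jitter)
    print(f"{label}: mean 12 ms, p99 {p99:.1f} ms, deadline misses {miss_rate:.1%}")
```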
In summary, technologists should consider:
Tail-first design — Engineer for p99 (99th percentile) from day one. Throughput can wait.
State as infrastructure — Treat conversation, KV, and embedding caches as tiered, observable systems, not black boxes.
Govern inference — Cap agent recursion depth, budget “thinking tokens,” and verify outputs (a minimal guard sketch follows this list).
Build deterministic paths — For AR, robotics, and automotive, latency jitter kills trust.
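For the “govern inference” point, a guard might look something like the following. The depth and token limits are placeholder assumptions that a real deployment would tune per workload:

```python
# Toy budget guard for an agent loop: caps recursion depth and "thinking"
# tokens. The limits are placeholder assumptions.
class InferenceBudgetExceeded(RuntimeError):
    pass

class AgentBudget:
    def __init__(self, max_depth=4, max_tokens=20_000):
        self.max_depth = max_depth
        self.max_tokens = max_tokens
        self.tokens_spent = 0

    def charge(self, depth, tokens):
        """Record one reasoning step; refuse it if it would blow the budget."""
        if depth > self.max_depth:
            raise InferenceBudgetExceeded(f"recursion depth {depth} > {self.max_depth}")
        if self.tokens_spent + tokens > self.max_tokens:
            raise InferenceBudgetExceeded("thinking-token budget exhausted")
        self.tokens_spent += tokens

budget = AgentBudget()
budget.charge(depth=1, tokens=1_500)   # plan
budget.charge(depth=2, tokens=2_000)   # tool call
budget.charge(depth=2, tokens=3_000)   # verify
# budget.charge(depth=5, tokens=500)   # would raise: recursion too deep
```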
We’ve learned how to train large models. Now we have to learn how to serve them.
The Slow Policy Machine
Policy always trails technology, but inference has turned the lag into a chasm. Training lived in data centers, under lock and audit. Inference lives everywhere — in hospitals, banks, schools, public spaces, and the phones in our pockets.
Governments are starting to wake up. In 2025, the European Union’s AI Act entered its enforcement phase, classifying real-time inference systems in critical domains as “high risk.” The U.S. Executive Order on Safe, Secure, and Trustworthy AI mandates reporting of compute clusters above a training threshold, yet says little about inference, the phase actually touching citizens. China’s Generative AI Provisions (2024) require content filtering and provenance, but again, not energy, latency, or localization.
Everyone regulates the model. No one regulates the behavior.
“We’ve built a system where AI acts faster than governments can think.” — MIT Technology Review, Policy 2025 Roundtable
That’s the real fear for policymakers: inference is not just a workload, it’s a decision factory — one that operates in microseconds and everywhere at once.
And the infrastructure beneath it is straining. The International Energy Agency (IEA, 2025) now projects global data-center electricity use will reach 950 TWh by 2030, nearly doubling from 2023, with AI the main driver. Yet permitting cycles for new capacity still take 3–5 years. Grid expansions move in fiscal decades, not milliseconds.
Sovereignty laws like the EU’s Data Act, India’s Digital Personal Data Protection Act, and Saudi Arabia’s National Data Management Regulations are forcing inference to occur within borders, turning compliance into a physical siting problem.
Every millisecond of inference now carries a passport, a carbon cost, and a moral implication. And none of those fit neatly into existing frameworks of law, labor, or liability. Policy isn’t just slow — it’s built for a world that assumed decisions were human.
The fear isn’t only for ministers and regulators. Citizens should be scared too, because inference isn’t an app you install. It’s an ambient power that shapes how you see, decide, and act, before anyone has agreed on who governs it.
Where Fear Meets Design
Every fear points to a missing design principle.
The investor’s fear: misaligned capital, which is a spatial problem.
The enterprise’s fear: runaway OpEx, which is an architectural problem.
The technologist’s fear: unpredictability, which is a systems problem.
The policymaker’s fear: loss of control, which is a temporal problem.
All of them converge in the same place: how we serve intelligence. The risk isn’t that AI becomes too powerful. It’s that we mistake its continuity for simplicity.
Inference was never meant to be handled by the old web stack. Its units aren’t requests; they’re relations — between users, agents, data, and time. Each relation carries memory, regulation, and risk. Serving it safely demands new layers of awareness in both hardware and software.
The emerging blueprint is taking shape:
Spatial locality — inference must live closer to data and users.
Temporal persistence — context must survive across sessions, not reset with every query.
Semantic control — memory must know what it stores and why.
Economic observability — systems must track joules and dollars per decision, not per GPU-hour.
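As a sketch of that last principle, per-decision accounting can start very simply. The power and price constants below are assumptions; a real system would pull them from power telemetry and billing data:

```python
# Toy per-decision meter: attribute joules and dollars to each served decision
# rather than to a GPU-hour. Constants are illustrative assumptions.
GPU_POWER_WATTS = 700
GPU_DOLLARS_PER_SECOND = 2.50 / 3600

class DecisionMeter:
    def __init__(self):
        self.ledger = []

    def record(self, decision_id, gpu_seconds):
        joules = gpu_seconds * GPU_POWER_WATTS
        dollars = gpu_seconds * GPU_DOLLARS_PER_SECOND
        self.ledger.append((decision_id, joules, dollars))

meter = DecisionMeter()
meter.record("support-reply-001", gpu_seconds=0.8)
meter.record("agent-workflow-002", gpu_seconds=12.4)
for decision_id, joules, dollars in meter.ledger:
    print(f"{decision_id}: {joules:,.0f} J, ${dollars:.5f}")
```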
This is not a feature list. It’s a manifesto for stability in an intelligent world.
“The next revolution in AI won’t be smarter models — it will be infrastructure that knows it’s intelligent.” — Pascaline AIAI Research, 2025
When design absorbs fear, fear becomes discipline. That’s how every new infrastructure era begins — when the engineers start designing for the things everyone else is afraid of.
Inference is the economic heartbeat of the AI era, but we’re still measuring it with training-era instruments.



“The Consulting Game” Workshop
AI Tech Leaders Club / Dec 16
AI Tech Leaders Club brought together a packed room of senior technical decision makers on December 16 for an evening of focused, peer-to-peer problem solving on AI integration. Hosted by Jay Chodagam and Vlad Pavlov, the session centered on shared challenges in deploying AI across complex organizations. Participants represented leading Silicon Valley teams driving AI strategy, engineering, and operations.
The workshop blended structured small-group discussions with open conversation, giving leaders space to compare real-world use cases and constraints. The room stayed engaged well into the evening as teams traded lessons on technical architecture, talent models, and responsible rollout patterns.

Linda Yang of Supermicro leads her workgroup
Why it mattered: Conversations surfaced concrete paths to move from experimentation to production-grade AI, highlighting both quick wins and longer-horizon bets. The momentum coming out of the session reinforced the value of AI Tech Leaders Club as a system for ongoing signal-sharing across the Valley.
Upcoming Events

Bay Area Startups Collectively Secured $1.57B in December Week 3
Bay Area startups closed on $1.57B in the third week of December, bringing the month-to-date total to $5.49B. 60% of the total came from six megadeals: Homebound ($300M), MoEngage ($180M), Chai Discovery ($130M), Mythic ($125M), Addition Therapeutics ($100M) and Nirvana Insurance ($100M).

IPOs - The fourteen 2025 Silicon Valley tech IPOs raised more than $7B, a significant uptick from 2024. Wealthfront's IPO earlier this month was the last, while Figma's $1.2B IPO was the largest of 2025 and the only one to raise more than a billion dollars.
For startups raising capital: Stay on top of who's raising, who's closing and who's investing with the Pulse of the Valley weekday newsletter. Click through to get more detail on investors and executives, including email addresses for both. Founders get the newsletter, database and alerts for just $7/month ($50 value). Check it out and sign up here.
Early Stage:
Link Cell Therapies closed a $60M Series A, developing technologies that allow for safe targeting of multiple antigens, reducing the risk of on-target, off-tumor toxicity.
Valerie Health closed a $30M Series A, using AI to modernize patient and provider communication interfaces for independent practices.
HEN Technologies closed a $20M Series A, building the world’s first end-to-end intelligent fire suppression ecosystem, powered by AI, IoT and advanced fluid dynamics.
Vybe closed a $10M Seed, a "Lovable for internal apps", empowering semi-technical folks to ship internal apps.
Interface closed a $3.5M Seed, providing the energy sector's AI-native management system.
Growth Stage:
Homebound closed a $300M Series D, the first tech-powered homebuilder, making building your perfect home simple, human, and transparent.
MoEngage closed a $180M Series F, providing insights into customer behavior, enabling users to engage customers across web, mobile, email, social, and messaging channels.
Chai Discovery closed a $130M Series B, building frontier AI to predict and reprogram the interactions between biochemical molecules, the fundamental building blocks of life.
Mythic closed a $125M Series D, accelerating computing with Analog Processing Units, a different approach to AI inference that collapses compute and memory into a single plane.
BluePath Finance closed a $75M Series C, a leading financier of sustainable infrastructure.

HyperAccel is a semiconductor company building purpose-designed hardware for generative AI inference. The company focuses on accelerating large language models with architectures optimized for latency, throughput, and energy efficiency.
What HyperAccel Delivers
• Specialized AI processors designed for large language model inference
• Server systems optimized for high-throughput, low-latency workloads
• Hardware architectures that balance memory bandwidth and compute efficiency
• A supporting software stack that integrates models with custom silicon
Why It Matters
As generative AI workloads scale, traditional hardware struggles with cost, power consumption, and latency. HyperAccel addresses these constraints with inference-first silicon built for real-world deployment at scale.
Who It Serves
Cloud providers, data center operators, enterprises, and AI builders seeking more efficient hardware for large-scale inference workloads.
For more information contact: Jay Kim [email protected]
Your Feedback Matters!
Your feedback is crucial in helping us refine our content and maintain the newsletter's value for you and your fellow readers. We welcome your suggestions on how we can improve our offering. [email protected]
Logan Lemery
Head of Content // Team Ignite
What investment is rudimentary for billionaires but ‘revolutionary’ for 70,571+ investors entering 2026?
Imagine this. You open your phone to an alert. It says, “you spent $236,000,000 more this month than you did last month.”
If you were the top bidder at Sotheby’s fall auctions, it could be reality.
Sounds crazy, right? But when the ultra-wealthy spend staggering amounts on blue-chip art, it’s not just for decoration.
The scarcity of these treasured artworks has helped drive their prices, in exceptional cases, to thin-air heights, without moving in lockstep with other asset classes.
The contemporary and post-war segments have even outpaced the S&P 500 overall since 1995.*
Now, over 70,000 people have invested $1.2 billion+ across 500 iconic artworks featuring Banksy, Basquiat, Picasso, and more.
How? You don’t need Medici money to invest in multimillion dollar artworks with Masterworks.
Thousands of members have gotten annualized net returns like 14.6%, 17.6%, and 17.8% from 26 sales to date.
*Based on Masterworks data. Past performance is not indicative of future returns. Important Reg A disclosures: masterworks.com/cd






