While the majority of GenAI investment and capex is focused on new datacenters, GPUs and hardware, is it possible that the long-term future of LLM inference and training is actually on local hardware we already have? Two trends worth tracking:
1. Better local stacks.
Our local desktops, laptops and mobile phones hide a surprising amount of compute capacity that often goes underused. For example, a recent paper estimated that Apple's M-series laptop chips can reach roughly 2.9 TFLOPS on an M4, and Google's Pixel 10 Android phone can in theory hit 1.5 TFLOPS (for comparison, an NVIDIA GeForce RTX 4090 GPU can reach about 82 TFLOPS and the H100 about 67 TFLOPS, at FP32).
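To get a rough sense of what those numbers mean for inference, here is a back-of-envelope sketch (mine, not from the paper above): a decoder-only transformer needs on the order of 2 × parameter-count FLOPs per generated token, so dividing a device's peak TFLOPS by that gives a compute-bound ceiling on tokens per second. In practice local inference is usually memory-bandwidth bound and quantized, so treat these as loose upper limits, not predictions; the 7B parameter count is a hypothetical example.

```python
# Back-of-envelope, compute-bound token throughput: tokens/sec ~ peak FLOPS / (2 * params).
# Figures below are the peak-FP32 numbers quoted above; real-world speeds are usually
# limited by memory bandwidth and quantization, so these are loose ceilings only.

DEVICES_TFLOPS = {            # peak FP32, in TFLOPS
    "Apple M4 (laptop)": 2.9,
    "Pixel 10 (phone)": 1.5,
    "RTX 4090": 82.0,
    "H100": 67.0,
}

MODEL_PARAMS = 7e9            # hypothetical 7B-parameter model

for name, tflops in DEVICES_TFLOPS.items():
    flops_per_token = 2 * MODEL_PARAMS                 # rough cost of one forward pass per token
    tokens_per_sec = (tflops * 1e12) / flops_per_token
    print(f"{name:20s} ~{tokens_per_sec:8.0f} tokens/sec (compute-bound ceiling)")
```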
Local inference stacks like llama.cpp, Ollama and LM Studio have been steadily improving, with underlying advances such as Apple's support for inference via MLX, support for AMD GPUs, and integration into the broader ecosystem via MCP servers, tools, local web interfaces for coding assistants, and so on. All have shown better and better performance over the past year. As an example, compare Cline's recommendations for local models between May 2025:
When you run a “local version” of a model, you’re actually running a drastically simplified copy of the original. This process, called distillation, is like trying to compress a professional chef’s knowledge into a basic cookbook – you keep the simple recipes but lose the complex techniques and intuition. … Think of it like running your development environment on a calculator instead of a computer – it might handle basic tasks, but complex operations become unreliable or impossible.
and Nov 2025:
Local models with Cline are now genuinely practical. While they won’t match top-tier cloud APIs in speed, they offer complete privacy, zero costs, and offline capability. With proper configuration and the right hardware, Qwen3 Coder 30B can handle most coding tasks effectively. The key is proper setup: adequate RAM, correct configuration, and realistic expectations. Follow this guide, and you’ll have a capable coding assistant running entirely on your hardware.
Even OpenClaw (reluctantly) supports local models.
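To illustrate how little glue code these local stacks now need, here is a minimal sketch that sends a prompt to a locally running Ollama server over its HTTP API. The model tag is just an example; substitute whatever you have pulled locally.

```python
# Minimal sketch: query a local Ollama server (default port 11434) over its HTTP API.
# Assumes Ollama is running and the model has been pulled, e.g. `ollama pull qwen3-coder:30b`.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3-coder:30b",   # example model tag; use any model available locally
        "prompt": "Write a Python function that reverses a string.",
        "stream": False,              # return one JSON object instead of a token stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])        # the generated text
```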
2. Model improvements, in both inference and training.
Because of pressure to squeeze better performance out of existing hardware, open source inference engines such as vLLM and PyTorch, and the models themselves, have been focusing on faster inference speed and throughput. For example, vLLM recently announced a 38% performance improvement for OpenAI's gpt-oss-120b model. More interesting, though, are fundamental changes in the models themselves: DeepSeek, for example, recently released a paper showing how to increase transformer performance via memory lookups. Several model providers, such as Google (Gemini) and LiquidAI, have been releasing small models intended to run on limited hardware such as phones.
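As a sketch of what serving one of these small models through an open source engine looks like, here is a minimal vLLM offline-inference example. The checkpoint name is an arbitrary small model chosen for illustration, not one of the models mentioned above.

```python
# Minimal vLLM offline-inference sketch. Requires `pip install vllm` and suitable hardware.
# The checkpoint below is an arbitrary small model picked for illustration; swap in
# whatever fits your machine.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")   # small model so it fits on modest hardware
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Explain KV caching in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```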
On the training side, Andrej Karpathy recently posted about how he managed to optimize the training process for GPT-2 from 32 TPUs down to a single 8×H100 node:
Seven years later, we can beat GPT-2’s performance in nanochat, ~1000 lines of code running on a single 8×H100 GPU node for ~3 hours. At ~$24/hour for an 8×H100 node, that’s $73, i.e. ~600× cost reduction. That is, each year the cost to train GPT-2 is falling to approximately 40% of the previous year. (I think this is an underestimate and that further improvements are still quite possible).
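The per-year figure follows from simple arithmetic: a ~600× cost reduction spread over 7 years is a factor of about 2.5 per year, i.e. each year's cost is roughly 40% of the previous year's. A quick check:

```python
# Quick check of the quoted numbers: a ~600x cost reduction over 7 years implies an
# annual reduction factor of 600**(1/7) ~= 2.5, i.e. each year costs ~40% of the last.
years = 7
total_reduction = 600

annual_factor = total_reduction ** (1 / years)     # ~2.49x cheaper per year
fraction_of_previous_year = 1 / annual_factor      # ~0.40

print(f"annual reduction factor: {annual_factor:.2f}x")
print(f"cost vs previous year:   {fraction_of_previous_year:.0%}")
```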
These improvements will trickle down to local hardware as well over time.
Implications for security and other things
If the long-term picture consists of models running locally on the same hardware the rest of the stack runs on, the security of those models starts to look very different. First, in an enterprise environment it is possible today to monitor and block network connectivity to outside model providers like OpenAI, Anthropic, etc., but if everything runs locally, network security would instead need to look for large downloads of model weights, and endpoint security would need to scan local hardware for model files or excessive GPU usage. Second, centralized controls over which models someone can use won't work anymore if those models run locally; deploying those controls instead starts to look like what we do today for locally installed software, with OS-level scanning and reporting. Third, supply chain issues with models (malicious models, updating insecure models, etc.) suddenly become very important, again requiring us to borrow the techniques we use today for local software and open source dependencies.
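As a toy illustration of what that endpoint-side scanning might look like, here is a sketch that walks a directory tree and flags large files with extensions commonly used for locally stored model weights (.gguf, .safetensors, and similar); the thresholds and extensions are assumptions, and a real control would also hash and allowlist known weights and watch GPU utilization.

```python
# Toy sketch of endpoint-side model discovery: walk a directory tree and flag large
# files with extensions commonly used for local model weights. A real control would
# also hash and allowlist known weights, monitor GPU usage, etc.
from pathlib import Path

WEIGHT_EXTENSIONS = {".gguf", ".safetensors", ".bin", ".pt"}   # common local-weight formats
MIN_SIZE_BYTES = 500 * 1024 * 1024                             # ignore anything under ~500 MB

def find_local_models(root: str):
    """Yield (path, size_gb) for files that look like locally stored model weights."""
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in WEIGHT_EXTENSIONS:
            size = path.stat().st_size
            if size >= MIN_SIZE_BYTES:
                yield path, size / 1e9

if __name__ == "__main__":
    for path, size_gb in find_local_models(str(Path.home())):
        print(f"{size_gb:6.1f} GB  {path}")
```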
With all of the new data centers being built, is there truly a need for them if existing local hardware can eventually do the job?