{"id":6950,"date":"2026-02-10T15:10:46","date_gmt":"2026-02-10T15:10:46","guid":{"rendered":"https:\/\/cybersecurityinfocus.com\/?p=6950"},"modified":"2026-02-10T15:10:46","modified_gmt":"2026-02-10T15:10:46","slug":"is-local-hardware-is-all-you-need","status":"publish","type":"post","link":"https:\/\/cybersecurityinfocus.com\/?p=6950","title":{"rendered":"Is Local Hardware is All You Need?"},"content":{"rendered":"<p class=\"wp-block-paragraph\">While the majority of GenAI investment \/ capex is focused on new datacenters, GPUs and other hardware, is it possible that the long-term future of LLM inference and training is actually on local hardware we already have? Two trends are worth tracking:<\/p>\n<h3 class=\"wp-block-heading\">1. Better local stacks.<\/h3>\n<p class=\"wp-block-paragraph\">Our local desktops, laptops and mobile phones hide a surprising amount of compute capacity that often goes underused. For example, <a href=\"https:\/\/arxiv.org\/abs\/2502.05317\">a recent paper estimated<\/a> that M-series chips on Apple laptops can go as high as 2.9 TFLOPS for an M4 and <a href=\"https:\/\/www.androidcentral.com\/phones\/google-pixel\/google-promises-gpu-improvements-pixel-10\">Google\u2019s Pixel 10 Android phone<\/a> can in theory hit 1.5 TFLOPS (for comparison, an NVIDIA GeForce RTX 4090 GPU can go as high as 82 TFLOPS and the H100 can go to 67 TFLOPS, at FP32).<\/p>\n<p class=\"wp-block-paragraph\">Local inference stacks like <a href=\"https:\/\/github.com\/ggml-org\/llama.cpp\">llama.cpp<\/a>, <a href=\"https:\/\/ollama.com\/\">Ollama<\/a> and <a href=\"https:\/\/lmstudio.ai\/\">LM Studio<\/a> have kept getting better, with underlying improvements such as Apple\u2019s support for inference via <a href=\"https:\/\/opensource.apple.com\/projects\/mlx\/\">MLX<\/a>, support for <a href=\"https:\/\/www.amd.com\/en\/developer\/resources\/technical-articles\/running-llms-locally-on-amd-gpus-with-ollama.html\">AMD GPUs<\/a>, and integration into the 
overall ecosystem via things like MCPs, tools, local web interfaces for coding assistants, etc. All have shown steadily better performance over the past year \u2013 as an example, compare Cline\u2019s recommendations for local models between <a href=\"https:\/\/github.com\/cline\/cline\/commit\/bc9eaeeff7354c081de87a1a65c674a65524b7e3\">May 2025<\/a>:<\/p>\n<p class=\"wp-block-paragraph\"><em>When you run a \u201clocal version\u201d of a model, you\u2019re actually running a drastically simplified copy of the original. This process, called distillation, is like trying to compress a professional chef\u2019s knowledge into a basic cookbook \u2013 you keep the simple recipes but lose the complex techniques and intuition. \u2026 Think of it like running your development environment on a calculator instead of a computer \u2013 it might handle basic tasks, but complex operations become unreliable or impossible.<\/em><\/p>\n<p class=\"wp-block-paragraph\">and <a href=\"https:\/\/docs.cline.bot\/running-models-locally\/overview\">Nov 2025:<\/a><\/p>\n<p class=\"wp-block-paragraph\"><em>Local models with Cline are now genuinely practical. While they won\u2019t match top-tier cloud APIs in speed, they offer complete privacy, zero costs, and offline capability. With proper configuration and the right hardware, Qwen3 Coder 30B can handle most coding tasks effectively. The key is proper setup: adequate RAM, correct configuration, and realistic expectations. Follow this guide, and you\u2019ll have a capable coding assistant running entirely on your hardware.<\/em><\/p>\n<p class=\"wp-block-paragraph\">Even OpenClaw (reluctantly) <a href=\"https:\/\/docs.openclaw.ai\/gateway\/local-models\">supports<\/a> local models.<\/p>\n<h3 class=\"wp-block-heading\">2. 
Model improvements including inference and training.<\/h3>\n<p class=\"wp-block-paragraph\">Because of pressure to squeeze better performance out of existing hardware, open source inference engines and frameworks such as vLLM and PyTorch, and the models themselves, have been focusing on faster inference and higher throughput. For example, vLLM recently <a href=\"https:\/\/blog.vllm.ai\/2026\/02\/01\/gpt-oss-optimizations.html\">announced 38% performance improvements<\/a> for OpenAI\u2019s gpt-oss-120b model. More interesting, however, are fundamental changes in the models themselves. DeepSeek, for example, <a href=\"https:\/\/www.arxiv.org\/abs\/2601.07372\">released a paper recently<\/a> showing how to increase transformer performance via memory lookups. Several model providers such as Google \/ Gemini and <a href=\"https:\/\/www.liquid.ai\/\">LiquidAI<\/a> have been releasing small models intended to run on limited hardware such as <a href=\"https:\/\/developer.android.com\/ai\/gemini-nano\">phones<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">On the training side, Andrej Karpathy <a href=\"https:\/\/github.com\/karpathy\/nanochat\/discussions\/481\">has recently posted<\/a> about how he managed to optimize the training process for GPT-2 from 32 TPUs to a single 8\u00d7H100 GPU node:<\/p>\n<p class=\"wp-block-paragraph\"><em>Seven years later, we can beat GPT-2\u2019s performance in nanochat ~1000 lines of code running on a single 8XH100 GPU node for ~3 hours. At ~$24\/hour for an 8\u00d7H100 node, that\u2019s\u00a0<strong>$73<\/strong>, i.e.\u00a0<strong>~600\u00d7 cost reduction<\/strong>. That is, each year the cost to train GPT-2 is falling to approximately 40% of the previous year. 
(I think this is an underestimate and that further improvements are still quite possible).<\/em><\/p>\n<p class=\"wp-block-paragraph\">These improvements will trickle down to local hardware as well over time.<\/p>\n<h3 class=\"wp-block-heading\">Implications for security and other things<\/h3>\n<p class=\"wp-block-paragraph\">If the long-term picture consists of models running locally on the hardware the rest of the stack runs on, the security of those models starts to look very different. First, in an enterprise environment, it is possible today to monitor and block network connectivity to outside model providers like OpenAI, Anthropic, etc., but if everything is running locally, the goal would shift to watching the network for large downloads of model weights and scanning local hardware for models or excessive GPU usage. Second, centralized controls such as which models someone can use won\u2019t work anymore if those models are running locally \u2013 instead, deploying those controls starts to look like what we do today for locally-installed software, with OS-level scanning and reporting. Third, supply chain issues with models such as malicious models, updating insecure models, etc. suddenly become very important \u2013 again requiring us to borrow the tricks we use today for local software and open source dependencies.<\/p>\n<p class=\"wp-block-paragraph\">For all of the new data centers being built \u2013 is there truly a need if existing local hardware can eventually do the job?<\/p>","protected":false},"excerpt":{"rendered":"<p>While the majority of GenAI investment \/ capex is focused on new datacenters, GPUs and other hardware, is it possible that the long-term future of LLM inference and training is actually on local hardware we already have? Two trends are worth tracking: 1. Better local stacks. 
Our local desktops, laptops and mobile phones hide a surprising [&hellip;]<\/p>\n","protected":false},"author":0,"featured_media":6951,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-6950","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog"],"_links":{"self":[{"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=\/wp\/v2\/posts\/6950"}],"collection":[{"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6950"}],"version-history":[{"count":0,"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=\/wp\/v2\/posts\/6950\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=\/wp\/v2\/media\/6951"}],"wp:attachment":[{"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6950"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6950"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cybersecurityinfocus.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6950"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}