Running AI locally became a movement: the state of the art in 2026

Open Chinese models now lead the downloads, OpenAI shipped open weights again, and for $4,000 a giant model fits on your desk. But speed at home still runs into a stubborn bottleneck. A straight map of where no-cloud AI actually stands.

For years, running AI on your own machine was a hobbyist's pursuit: an expensive graphics card and a deep well of patience. In 2026 it turned into something else. The movement stands on three legs at once: open models have closed the gap with the best closed ones, hardware finally holds a large model without a datacenter, and serious reasons to care have arrived, from privacy and data sovereignty to a regulatory nudge in Europe. Together they have redrawn the line between who still needs a paid API and who no longer does.

The term you'll see throughout is local inference: running the AI model on your own hardware (laptop, workstation, home server), with everything processed there, nothing leaving for someone else's server. The opposite is the cloud. You send your text to OpenAI, Google, or another provider, and the answer comes back.

The open ecosystem became an ocean (and changed its accent)

The clearest gauge of the movement is Hugging Face, the repository where the community publishes open models. It has passed 2 million public models, and the geopolitical twist is what stands out: over the past year, Chinese models (Qwen, DeepSeek, and their derivatives) reached 41% of the Hub's downloads, overtaking the United States in monthly downloads.¹ Alibaba's Qwen family alone now has more than 113,000 derivative models. By Hugging Face's own count, "Alibaba as an organization has more derivative models than Google and Meta combined."¹

Before that sounds like an infinite buffet of options, a fact pulls you back to earth. About half the models on the Hub have fewer than 200 downloads, and the 200 most-downloaded (0.01% of the total) account for nearly half of all downloads.¹ In other words: millions of models, but real usage piles up on a handful. The long tail is enormous and almost entirely silent.

The frontier models that fit at home

The leap across 2025 and 2026 was in quality. Four names sum up the open state of the art.

gpt-oss (OpenAI). On August 5, 2025, OpenAI released its first open weights since GPT-2: gpt-oss-120b (117 billion parameters, 5.1 billion active per token) and gpt-oss-20b (21 billion, 3.6 billion active), both under the permissive Apache 2.0 license, free for commercial use.² ³ The architecture is Mixture-of-Experts (MoE): rather than firing the whole model on every token, it activates only the relevant "experts," which slashes the cost of running without giving up size. By OpenAI's own numbers, the 120b fits on a single 80GB GPU and the 20b runs in 16GB of memory, consumer-card territory.³ The community has adopted the 20b as the reliable default for anyone with a mid-range card.
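The "fits in 80GB" and "runs in 16GB" claims can be sanity-checked with back-of-envelope arithmetic: weight memory is roughly parameter count times bits per weight. The sketch below assumes about 4.25 bits per weight (a 4-bit format plus scaling overhead). That figure is our assumption for illustration, not a number from the sources cited here, and the estimate ignores the KV cache and activations.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


# ASSUMPTION: ~4.25 bits per weight; illustrative, not taken from the article's sources.
ASSUMED_BITS_PER_WEIGHT = 4.25

# (total parameters, active parameters) in billions, as quoted in the article.
MODELS = {"gpt-oss-120b": (117.0, 5.1), "gpt-oss-20b": (21.0, 3.6)}

for name, (total_b, active_b) in MODELS.items():
    size_gb = weight_memory_gb(total_b, ASSUMED_BITS_PER_WEIGHT)
    active_share = active_b / total_b  # fraction of the model used per token
    print(f"{name}: ~{size_gb:.1f} GB of weights, {active_share:.1%} active per token")
```

Under that assumption the weights come to about 62 GB and 11 GB, which is consistent with the 80GB and 16GB figures. At 16 bits per weight the larger model would need well over 200 GB.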

DeepSeek-V3.2. The Chinese lab published the paper "DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models" on arXiv on December 2, 2025.⁴ The technical novelty is DeepSeek Sparse Attention (DSA), a way for the model to pay attention only to the parts of a long text that matter, instead of all of it at once, which cuts the cost of processing long context. The paper claims its highest-compute variant, DeepSeek-V3.2-Speciale, beats GPT-5 on some evaluations and achieved gold-medal-level performance on the 2025 IMO and IOI problems (the international math and informatics olympiads).⁴ That's the lab's own claim, so the usual skepticism applies. Still, the capability jump in an open model is real.
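Why attending to only part of the text saves so much can be shown with a simple count. Dense attention compares each token with every earlier token, so the work grows with the square of the length. A sparse scheme that keeps a fixed budget of tokens grows only linearly. This is a generic sketch, not DeepSeek's implementation: the budget `k` is a hypothetical value, and the count leaves out the cost of choosing which tokens to keep.

```python
def dense_pairs(seq_len: int) -> int:
    """Causal dense attention: token i is compared with tokens 0..i."""
    return seq_len * (seq_len + 1) // 2


def sparse_pairs(seq_len: int, k: int) -> int:
    """Each token is compared with at most k tokens (itself included)."""
    if seq_len <= k:
        return dense_pairs(seq_len)
    # The first k tokens behave densely; every later token costs exactly k.
    return dense_pairs(k) + (seq_len - k) * k


CONTEXT = 128_000  # a long-context length, for illustration
BUDGET_K = 2_048   # HYPOTHETICAL per-token budget, not a figure from the paper

ratio = dense_pairs(CONTEXT) / sparse_pairs(CONTEXT, BUDGET_K)
print(f"dense/sparse comparison count at {CONTEXT} tokens: {ratio:.1f}x")
```

The longer the context, the wider the gap, which is why the technique matters most for long documents.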

Qwen (Alibaba). It isn't the flashiest, and that's exactly why it won as the safe bet. Under Apache 2.0, multilingual, with an ocean of community-tuned variants, it became the engine behind the Hub's derivative explosion.¹ ⁵ When a company needs something open, commercial, and stable, this is the name that comes up.

Gemma 3 (Google). Released on March 12, 2025 in 1B, 4B, 12B, and 27B sizes, with a 128,000-token context window (the model's working memory within a conversation) and support for more than 35 languages.⁶ Google positioned Gemma 3 27B as "the world's best single-accelerator model," claiming it reaches an Elo of 1338 on the Chatbot Arena "while requiring only a single GPU when others need up to 32."⁶

To actually run any of this, two programs do the heavy lifting on the user's machine: Ollama and llama.cpp. They load the model, split the computation between CPU and graphics card, and (the part that matters to the movement) run fully offline, on a machine with no internet ("air-gapped"). That's what makes local AI viable outside the lab.⁷
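For a sense of what "local" looks like in practice, here is a minimal sketch that talks to an Ollama server on the same machine over its HTTP API, using only Python's standard library. It assumes Ollama is running on its default port (11434) and that a model tagged `gpt-oss:20b` has already been pulled. The tag is an assumption, so substitute whatever model you have installed.

```python
import json
import urllib.request

# Ollama's default local endpoint; the request never leaves the machine.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "gpt-oss:20b") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local(prompt: str, model: str = "gpt-oss:20b") -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=300) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask_local("Summarize this contract clause: ...")` returns the model's reply as a string. With the network cable unplugged it behaves the same way, which is the whole point.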

The hardware that unlocked it

The missing piece was memory. A large model has to fit in fast memory, and that's where 2025 and 2026 changed the game.

The NVIDIA DGX Spark is the symbol of that shift. Announced as "Project Digits" at CES 2025, it went on sale in October 2025 for $3,999.⁸ ⁹ It's a small desktop with 128GB of unified memory (shared between CPU and GPU, with none of the 32GB ceiling of a consumer card) capable of running inference on models of up to roughly 200 billion parameters.⁸ ⁹ Around February 2026, NVIDIA reportedly raised the price to $4,699, blaming a shortage of LPDDR5x memory (a secondary-source figure, still to be confirmed).⁹

Here's the part that separates marketing from reality. The Spark wins on capacity (the big model fits) and loses on speed. The bottleneck is memory bandwidth, meaning how fast data moves to and from the chip. Unofficial community benchmarks put gpt-oss-20b at around 50 tokens per second on the Spark, against more than 200 on an RTX 5090.¹⁰ The Spark holds what an ordinary card can't, but answers more slowly. If you need a large model, that's a blessing. If you need a fast answer, it may not be the right buy.
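The bandwidth argument reduces to one division. To produce each token, the chip has to read the active weights from memory once, so bandwidth divided by bytes read per token gives a hard ceiling on tokens per second. The sketch below is a rough model under stated assumptions: the bandwidth figures (273 GB/s for the Spark, 1,792 GB/s for the RTX 5090) and the 4.25 bits per weight do not come from the sources cited here and should be checked against vendor specifications.

```python
def decode_ceiling_tok_s(
    bandwidth_gb_s: float, active_params_billion: float, bits_per_weight: float
) -> float:
    """Upper bound on tokens/s if every token reads the active weights once."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# ASSUMPTIONS (not from the article's sources): memory bandwidth in GB/s.
ASSUMED_BANDWIDTH = {"DGX Spark": 273.0, "RTX 5090": 1792.0}
ASSUMED_BITS_PER_WEIGHT = 4.25  # assumption, as for any 4-bit-style format
ACTIVE_PARAMS_B = 3.6           # gpt-oss-20b active parameters, from the article

for device, bandwidth in ASSUMED_BANDWIDTH.items():
    ceiling = decode_ceiling_tok_s(bandwidth, ACTIVE_PARAMS_B, ASSUMED_BITS_PER_WEIGHT)
    print(f"{device}: ceiling of about {ceiling:.0f} tokens/s")
```

The ceilings land at roughly 140 and 940 tokens per second, well above the community measurements, because real runs also read the KV cache and pay software overhead. The ordering between the two machines is what the model gets right, not the absolute numbers.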

On the Apple side, the bet is Apple Silicon's unified memory plus MLX, Apple's own machine-learning framework. In research published on November 19, 2025, Apple showed that the Neural Accelerators in the M5 chip's GPU deliver up to a 4× speedup in time-to-first-token (the wait until the first word of a reply) compared with the M4, and a 19% to 27% gain in generating subsequent tokens.¹¹ Again, the ceiling on sustained speed is memory bandwidth, the same bottleneck as the Spark's.
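A quick latency model shows why a 4× gain on the first token does not mean a 4× faster reply. Total time is the wait for the first token plus the time to generate the rest. The baseline numbers below (an 8-second wait and 20 tokens per second on the older chip) are hypothetical, picked only to make the arithmetic visible. The 4× and 19% to 27% factors are the ones Apple reported.

```python
def reply_seconds(ttft_s: float, tokens_per_s: float, reply_tokens: int) -> float:
    """End-to-end time: time-to-first-token plus generation time."""
    return ttft_s + reply_tokens / tokens_per_s


# HYPOTHETICAL baseline for the older chip; illustrative values only.
BASE_TTFT_S = 8.0
BASE_TOK_S = 20.0
REPLY_TOKENS = 500

baseline = reply_seconds(BASE_TTFT_S, BASE_TOK_S, REPLY_TOKENS)
for decode_gain in (1.19, 1.27):  # the reported 19% to 27% generation gain
    newer = reply_seconds(BASE_TTFT_S / 4, BASE_TOK_S * decode_gain, REPLY_TOKENS)
    print(f"decode gain {decode_gain:.2f}: {baseline:.1f}s -> {newer:.1f}s "
          f"({baseline / newer:.2f}x overall)")
```

With these inputs a 33-second reply drops to about 22 to 23 seconds, an overall gain of around 1.5×. On long replies generation dominates, and generation is the part bound by memory bandwidth.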

Why "open" is now a legal argument too

There's a layer few people notice: European regulation has started to reward openness. The EU AI Act created an exemption for open-source models. General-purpose AI (GPAI) models released under a free and open license, with public weights, architecture, and usage information, and not monetized, are released from part of the heavy documentation requirements and from having to appoint a representative in the EU.¹² They still have to respect copyright and publish a summary of their training data, but the message is clear: opening the model lightens the regulatory load.

There's a ceiling, though. Models trained above 10²⁵ FLOPs (a measure of the compute spent on training) are presumed to carry "systemic risk" and lose the exemption, at which point every obligation kicks back in.¹² The GPAI rules took effect on August 2, 2025, with models already on the market given until August 2, 2027 to comply.¹²
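To get a feel for where the 10²⁵ line sits, a common rule of thumb estimates training compute as about 6 × parameters × training tokens. The sketch below applies that approximation to two hypothetical models. It is an engineering estimate, not the legal method for determining compute under the Act, and the model sizes and token counts are made up for illustration.

```python
SYSTEMIC_RISK_FLOPS = 1e25  # threshold cited in the article


def estimated_training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb estimate: ~6 FLOPs per parameter per training token."""
    return 6 * params * tokens


def presumed_systemic_risk(params: float, tokens: float) -> bool:
    """True if the estimate exceeds the 1e25 FLOPs presumption threshold."""
    return estimated_training_flops(params, tokens) > SYSTEMIC_RISK_FLOPS


# HYPOTHETICAL models: (parameters, training tokens).
EXAMPLES = {"70B on 15T tokens": (70e9, 15e12), "400B on 15T tokens": (400e9, 15e12)}

for label, (params, tokens) in EXAMPLES.items():
    flops = estimated_training_flops(params, tokens)
    print(f"{label}: ~{flops:.1e} FLOPs, over threshold: "
          f"{presumed_systemic_risk(params, tokens)}")
```

By this estimate the first example stays under the line and the second crosses it, which suggests the exemption mostly covers small and mid-sized open models rather than the largest ones.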

And there's the argument that drives much of the movement, before any law: running locally means the data never leaves your machine. For defense, government, healthcare, and legal, that turns compliance with GDPR, HIPAA, and the rest into something that exists by design rather than by a vendor's promise.

What the community is saying

On r/LocalLLaMA, the home of the topic, the 2026 tone is less awe and more trade-off engineering. The conversation revolves around which model fits in how much memory, which quantization, how many tokens per second. The peak of collective excitement was still the launch of DeepSeek R1 in January 2025, celebrated as "reasoning quality competing with much larger models." (What follows is aggregated community opinion, not fact verified by us.)

The consensus rests on three pillars. Unified memory changed the game ("this isn't a GPU question anymore, it's a memory-architecture question"). Qwen became the safe pick for commercial use. And "newer is almost always better" is nearly a mantra: releases move so fast that the best of six months ago is now mediocre.

Where it splits: is running locally worth it? One side defends privacy and points to API bills that only climb. The skeptic counters that "you bought a $4,000 Spark to run slower than an API that costs pennies," and that local only pencils out for high volume, sensitive workloads, or hobby use. The second rift is the Spark itself: "NVIDIA's Apple moment" to some, "a bandwidth letdown" to the forum skeptics, whose most-repeated line is that the Spark solves the wrong problem for most people, and that anyone who needed speed should have bought a 5090. There's also a quieter, unresolved unease about running Chinese-origin weights in a corporate environment.
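The skeptic's "pencils out at high volume" point is a break-even calculation. The sketch below asks how many tokens a $4,000 box must generate before it beats paying an API per token. The API price is a hypothetical placeholder, and the model ignores electricity, resale value, and the non-monetary value of privacy, so treat the result as an order of magnitude.

```python
def break_even_tokens(hardware_usd: float, api_usd_per_million_tokens: float) -> float:
    """Tokens you must generate locally before the hardware pays for itself."""
    return hardware_usd / api_usd_per_million_tokens * 1_000_000


def days_to_break_even(
    hardware_usd: float, api_usd_per_million_tokens: float, tokens_per_second: float
) -> float:
    """Days of nonstop generation needed to reach the break-even token count."""
    seconds = break_even_tokens(hardware_usd, api_usd_per_million_tokens) / tokens_per_second
    return seconds / 86_400


HARDWARE_USD = 4_000.0    # roughly the Spark's launch price, from the article
API_USD_PER_MILLION = 0.5  # HYPOTHETICAL API price per million tokens
LOCAL_TOK_S = 50.0         # community figure for gpt-oss-20b on the Spark

tokens = break_even_tokens(HARDWARE_USD, API_USD_PER_MILLION)
days = days_to_break_even(HARDWARE_USD, API_USD_PER_MILLION, LOCAL_TOK_S)
print(f"break-even at {tokens:.1e} tokens, about {days:.0f} days of nonstop generation")
```

With these inputs the box needs roughly five years of continuous generation to break even on cost alone. A pricier API or a faster local setup shortens that sharply, which is why the argument turns on volume and on how much the privacy is worth to you.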

Verdict

2026 is the year local AI stopped being a hobby and became real infrastructure, at least for people with the right use cases. If your reason is privacy, data sovereignty, offline operation, or high and predictable volume, the open state of the art (gpt-oss, Qwen, DeepSeek, Gemma) now delivers quality that a year ago was unthinkable outside the cloud, and 128GB hardware finally lets a large model fit on the desk.

But set your expectations correctly. Speed at home still runs into memory bandwidth, and no $4,000 box changes that by decree: it buys capacity, not necessarily quickness. And most of the Hub's 2 million models are noise. In practice, you'll live inside half a dozen names. The movement is real and the ceiling has risen sharply. Just don't confuse "it fits on my machine" with "it runs at cloud speed," because they aren't the same thing yet.

Sources

1. State of Open Source on Hugging Face: Spring 2026 · Hugging Face · https://huggingface.co/blog/huggingface/state-of-os-hf-spring-2026 · Mar 17, 2026
2. Introducing gpt-oss · OpenAI · https://openai.com/index/introducing-gpt-oss/ · Aug 5, 2025
3. gpt-oss (official repository) · OpenAI / GitHub · https://github.com/openai/gpt-oss · 2025
4. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models · DeepSeek-AI / arXiv · https://arxiv.org/abs/2512.02556 · Dec 2, 2025 · arXiv:2512.02556
5. Qwen (overview / model cards) · Alibaba Qwen Team / GitHub · https://github.com/QwenLM · 2025–2026
6. Gemma 3: Google's new open model based on Gemini 2.0 · Google · https://blog.google/innovation-and-ai/technology/developers-tools/gemma-3/ · Mar 12, 2025
7. Local AI Runtime Update: Ollama, vLLM, llama.cpp, MLX, LM Studio · Codersera · https://codersera.com/blog/local-ai-runtimes-may-2026-update/ · 2026 (runtime capability description; versions to verify on the official GitHub)
8. NVIDIA starts selling its $3,999 DGX Spark AI developer PC · Engadget · https://www.engadget.com/ai/nvidia-starts-selling-its-3999-dgx-spark-ai-developer-pc-120034479.html · Oct 14, 2025
9. NVIDIA DGX Spark (official product page) · NVIDIA · https://www.nvidia.com/en-us/products/workstations/dgx-spark/ · 2025/2026 (price increase to $4,699 reported by Constellation Research, secondary, to confirm)
10. (community/bench, NOT official) DGX Spark vs RTX 5090 tokens/s · NVIDIA Developer Forum (DGX Spark / GB10) · https://forums.developer.nvidia.com/c/accelerated-computing/dgx-spark-gb10/ · 2025/2026
11. Exploring LLMs with MLX and the Neural Accelerators in the M5 GPU · Apple Machine Learning Research · https://machinelearning.apple.com/research/exploring-llms-mlx-m5 · Nov 19, 2025
12. What Open-Source Developers Need to Know about the EU AI Act's Rules for GPAI Models · Hugging Face · https://huggingface.co/blog/yjernite/eu-act-os-guideai · Aug 4, 2025

Community read (opinion, not fact): r/LocalLLaMA (aggregated sentiment) · NVIDIA Developer Forum — DGX Spark / GB10. Upvote counts and bench tokens/s are community impressions, not official measurements.

By Newsroom · Acta Verum