Of monsters and large language models

Why bigger is not always better

Comparing small local LLMs against the biggest models from OpenAI and Deepseek on real tasks. The results might surprise you; the energy numbers definitely will.
language models
llms
neural networks
energy
european sovereignty
Author

Michael Green

Published

April 2, 2026

Introduction

Recently, there’s been another surge of posts on social media subscribing to the Big AI narrative of OpenAI, Anthropic, Google, etc. While of course people may say and think what they want, I feel that this talking point helps no one except these big corporations, which get to maintain the illusory moat of AI supremacy. I’m not contesting the results achieved by the massive LLMs on benchmark tasks; they are impressive and useful in their own right. Let’s put aside all the benchmark hacking there is (Zheng et al. 2025) for a moment and just be impressed with what has been achieved with these large models. They have indeed broken boundaries we were facing before 2020, and as a consequence a myriad of products and services has been developed. In the midst of this, the concept of “Agents” has resurfaced (it is not new), where several LLMs are linked together with tools, e.g., APIs and databases. The purpose is to give the LLMs access to functions that are deterministic by nature. As an example, it is utter nonsense to use an LLM to answer the question of what 3.17 + 6.912 is. They can do it today, but it’s an enormous waste that they have this information stored when they really should be calling a tool (Schick et al. 2023) instead.

You might wonder how all of this fits together: LLM, chat interface, agents, tools, APIs, etc. I’ve taken a shot at showing how these things are typically connected. We’re still in the infancy of discovering all the types of setups we can build here and what they in turn will be capable of.

An illustration of how AI Agents get access to external tools and databases.

Architecture of an AI Assistant

While there are many assistants out there, very few of them are actually able to do useful stuff. By stuff, I mean things like answering your emails for you, prioritizing your day, paying your bills, cancelling subscriptions, calling your mom, monitoring your investments, etc. The reason for this is twofold: (i) it’s pretty darn hard to do right, and (ii) the consequences of getting it wrong can be socially and financially catastrophic.

So what do we need for this to be practically achievable? Well, we need a language model that understands tool calling (Ding and Stevens 2025), i.e., a model that can emit tokens (text) signalling to whatever application is receiving that text that the model wants to call a tool. How the tool is supposed to be called can be learned either from a “Skill” or directly via an MCP server. The important thing here is that none of this requires a 5 trillion parameter model. The ability to understand when to call a tool and what arguments to pass is a capability that smaller models handle perfectly well. In fact, models like Qwen3:8b and Llama 3.1:8b have been specifically trained for tool calling and do a solid job of it.
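To make this concrete, here is a minimal sketch in Python of the dispatch loop on the receiving side. The JSON shape here is hypothetical (real setups follow the tool-calling formats of OpenAI, Ollama, or MCP), but the principle is exactly the arithmetic example from the introduction: the model emits a structured request, and a deterministic function does the actual work.

```python
import json

# Deterministic tools the model is allowed to call.
def add(a: float, b: float) -> float:
    return a + b

TOOLS = {"add": add}

# A hypothetical tool-call message, shaped like what a tool-trained
# model would emit instead of guessing at the arithmetic itself.
model_output = json.dumps({
    "tool": "add",
    "arguments": {"a": 3.17, "b": 6.912},
})

def dispatch(raw: str) -> float:
    """Parse the model's tool-call message and run the named tool."""
    call = json.loads(raw)
    return TOOLS[call["tool"]](**call["arguments"])

print(round(dispatch(model_output), 3))  # 10.082
```

In a real agent loop, the result would be fed back to the model as a tool message so it can phrase the final answer in natural language.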

And that brings me to the question I want to explore in this post: if a small model running on your laptop can do the things you actually need, why are we collectively pretending that only the biggest models matter?

A tale of David and Goliath

To put my money where my mouth is, I’m going to run a series of common tasks using both the biggest cloud models (GPT-5.2, Deepseek 2026) and a small local model (Qwen3:8b running via Ollama on my AMD Ryzen AI 7 350 with a Radeon 860M). No cherry-picking. I’ll show you the actual responses and let you judge.

The tasks I’ve chosen are deliberately mundane. Summarization, advice, and explaining a concept. These are the kinds of things people actually use LLMs for every day, not exotic benchmarks designed to stress-test reasoning at scale. My argument is simple: for everyday tasks, the small model does the job. And it does it at a fraction of the energy cost.

Summarization task

The text we’re going to summarize is the preface from Malcolm Gladwell’s “What the dog saw” (Gladwell 2009).

Preface

When I was a small child, I used to sneak into my father’s study and leaf through the papers on his desk. He is a mathematician. He wrote on graph paper, in pencil — long rows of neatly written numbers and figures. I would sit on the edge of his chair and look at each page with puzzlement and wonder. It seemed miraculous, first of all, that he got paid for what seemed, at the time, like gibberish. But more important, I couldn’t get over the fact that someone whom I loved so dearly did something every day, inside his own head, that I could not begin to understand.

This was actually a version of what I would later learn psychologists call the other minds problem. One-year-olds think that if they like Goldfish Crackers, then Mommy and Daddy must like Goldfish Crackers, too: they have not grasped the idea that what is inside their head is different from what is inside everyone else’s head. Sooner or later, though, children come to understand that Mommy and Daddy don’t necessarily like Goldfish, too, and that moment is one of the great cognitive milestones of human development. Why is a two-year-old so terrible? Because she is systematically testing the fascinating and, to her, utterly novel notion that something that gives her pleasure might not actually give someone else pleasure—and the truth is that as adults we never lose that fascination. What is the first thing that we want to know when we meet someone who is a doctor at a social occasion? It isn’t “What do you do?” We know, sort of, what a doctor does. Instead, we want to know what it means to be with sick people all day long. We want to know what it feels like to be a doctor, because we’re quite sure that it doesn’t feel at all like what it means to sit at a computer all day long, or teach school, or sell cars. Such questions are not dumb or obvious. Curiosity about the interior life of other people’s day-today work is one of the most fundamental of human impulses, and that same impulse is what led to the writing you now hold in your hands.

Prompt

Summarize the text below in 2 sentences

Locally on my computer this corresponds to running

echo "Summarize the text below in 2 sentences $(cat mgv1.md)" | ollama run qwen3:8b 

which I put in a script called summary.sh and use for the energy consumption measurement below.

Response from ChatGPT - GPT-5.2

The author recalls childhood wonder at her mathematician father’s incomprehensible work, which sparked an early awareness that other people’s minds and inner experiences differ from one’s own. She connects this realization to a fundamental human curiosity about what it feels like to live and work inside someone else’s mind, a curiosity that motivates both everyday questions and the writing of this book.

Response from Qwen3:8b

The author recalls childhood fascination with their mathematician father’s work, highlighting the mystery of understanding others’ inner lives, which ties into the “other minds problem” in psychology. This curiosity about others’ experiences, from childhood to adulthood, drives the author’s exploration of human cognition and forms the basis of their book.

I don’t know about you, but to me these answers are both perfectly fine. One might even go so far as to say that Qwen3:8b did a better job, as it didn’t assume the author was female. Now, I want you to think about the fact that GPT-5.2 is estimated to be in the vicinity of 2 to 5 trillion parameters. That’s a minimum of 2,000,000,000,000 parameters. Compare this to Qwen3:8b, which has 8 billion parameters. This means, in the 5 trillion case, that GPT-5.2 is a whopping 625 times larger.
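The size ratio is plain arithmetic, reproduced here for the skeptical:

```python
# Parameter-count estimates from the paragraph above (the GPT-5.2
# figure is the upper end of the 2-5 trillion estimate).
gpt_params = 5_000_000_000_000   # 5 trillion
qwen_params = 8_000_000_000      # 8 billion

print(gpt_params // qwen_params)  # 625
```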

A single query sent to ChatGPT amounts to around 2.1 Wh (Costa et al. 2025). That’s the equivalent of your 1000 W microwave running for about 7.6 seconds. That may not sound like much, but consider that nearly everyone is sending questions to LLMs today, even when only searching, since search engines now typically prepend an LLM-generated answer to the results. If you’re interested in more numbers, I invite you to look at the research sheet here. When I run the local LLM on my computer (AMD Ryzen AI 7 350 w/ Radeon 860M) with the same query, it consumes 0.2 Wh, which is less than 10% of what the query would have used had I sent it to ChatGPT instead.

The energy was measured using the perf program on my Arch Linux machine as follows.

sudo perf stat -e power/energy-pkg/ ./summary.sh
# Performance counter stats for 'system wide':
#         794.12 Joules power/energy-pkg/
#   39.278005283 seconds time elapsed
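The 0.2 Wh figure follows directly from that perf output, since perf reports package energy in joules and one watt-hour is 3,600 joules:

```python
# Convert the measured package energy from joules to watt-hours.
joules = 794.12          # power/energy-pkg/ reading from the perf run
wh = joules / 3600       # 1 Wh = 3600 J
print(round(wh, 2))      # 0.22
```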

There are many error margins to be aware of in my comparison here but I hope you can appreciate the magnitude of the difference.

Giving advice task

Prompt

I want to wash my car. The car wash is just 200 meters down the street. Should I walk or drive there?

Again the command on my local computer would be

echo "I want to wash my car. The car wash is just 200 meters \
down the street. Should I walk or drive there?" | ollama run qwen3:8b 

which prints the response to the console or your chat app. But as before, let’s start with the big gun: the latest Deepseek model, via their website.

Response from Deepseek - 2026

If the car wash is only 200 meters away, walking makes more sense:

  • Driving: You’d start the car, move it a short distance, likely wait, and use fuel/electricity for negligible benefit.
  • Walking: You avoid moving a dirty car onto a clean one if you go after washing, get a bit of exercise, and save the hassle of parking.

Unless you have heavy supplies to carry or mobility constraints, walking is quicker and simpler.

Response from Qwen3:8b

To determine whether you should walk or drive to the car wash, consider the following:

Key Factors:

  1. Distance: The car wash is 200 meters away, which is a very short distance.
  2. Purpose: You want to wash your car, which typically requires bringing your car to the car wash (unless it’s a self-service option at the location).
  3. Efficiency: Driving is faster for such a short distance, especially if you need to bring your car to the car wash.

Analysis:

  • Walking: While 200 meters is a short walk, walking would not allow you to bring your car to the car wash. Unless the car wash is located near your home and you can walk to it while keeping your car nearby (which is unlikely), walking is not practical for washing your car.
  • Driving: Driving 200 meters is quick and efficient. You can start the car, drive to the car wash, park, and proceed to wash your car. This is the most straightforward method.

Conclusion:

Drive to the car wash. Driving is faster and allows you to bring your car to the location, making it the more practical choice for washing your car. Walking would not be feasible unless you have an alternative way to transport your car, which is not indicated here.

I don’t know about you, but I didn’t like either of these responses. The one from Deepseek completely misses the mark: it tells you to walk to the car wash, which of course defeats the purpose of bringing the car. The local model gets the right answer but is rather confused about it and takes a long detour to get there. So again, very little benefit from a bigger model here. If anything, the small model outperformed the big one on basic common sense.

Explaining a concept

For the third task I asked both models to explain quantum mechanics in a short paragraph. This is a knowledge task where you might expect the bigger model to have an edge due to the sheer amount of training data it has absorbed.

Prompt

Create a small paragraph explaining quantum mechanics.

echo "Create a small paragraph explaining quantum mechanics." \
| ollama run qwen3:8b 

Response from ChatGPT - GPT-5.2

Quantum mechanics is a branch of physics that studies how matter and energy behave at extremely small scales, such as atoms and subatomic particles. Unlike everyday objects, particles in quantum systems can exist in multiple states at once (a concept called superposition) and can be linked across distances through entanglement, meaning the state of one instantly affects another. Instead of precise outcomes, quantum mechanics often deals with probabilities, describing what is most likely to happen rather than what definitely will happen. This framework has led to many modern technologies, including semiconductors, lasers, and quantum computing.

Response from Qwen3:8b

Quantum mechanics is a branch of physics that describes the behavior of matter and energy at the smallest scales, such as atoms and subatomic particles. Unlike classical physics, it reveals that particles can exist in multiple states simultaneously (superposition) and can be entangled, meaning the state of one particle is instantly linked to another, even across vast distances. These phenomena challenge classical intuitions, as measurements often yield probabilistic outcomes rather than definite results. It underpins modern technologies like semiconductors and lasers, and remains central to understanding the fundamental nature of reality at the quantum level.

OK, so these are essentially the same answer. Both cover superposition, entanglement, probabilistic outcomes, and applications (semiconductors, lasers, quantum computing). The structure is similar, the accuracy is similar, and as someone with a doctorate in theoretical physics I’d say both are perfectly serviceable introductions. Neither says anything wrong. Neither says anything the other doesn’t.

And that’s exactly my point. For a knowledge task like this, the 8 billion parameter model running on my laptop produced output that is functionally indistinguishable from the trillion-parameter cloud model. Same content, same structure, same level of accuracy. The 625x size difference bought you nothing.

What the numbers tell us

Let’s put the results side by side.

| Task | Big model | Qwen3:8b (local) | Winner |
|------|-----------|------------------|--------|
| Summarization | Good (assumed gender) | Good (gender neutral) | Qwen3:8b |
| Advice | Wrong (walk to car wash) | Right (drive) | Qwen3:8b |
| Explanation | Good | Equally good | Tie |

Score: Qwen3:8b wins 2, ties 1, loses 0. Against models that are 625 times larger.

Now let’s talk about what this costs. As I showed earlier, a single ChatGPT query consumes roughly 2.1 Wh while my local query consumed about 0.2 Wh. If you scale this to an organization sending, say, 1000 queries per day, the cloud approach burns 2,100 Wh (2.1 kWh) per day while the local approach uses 200 Wh (0.2 kWh). Over a year that’s 766 kWh versus 73 kWh. Multiply that across every company, every developer, every casual user, and you start to see why the energy argument matters even if you don’t care about the environmental angle (you should, but that’s a different post).
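The scaling above is easy to reproduce, using the per-query numbers from the measurements earlier in the post:

```python
# Per-query energy estimates from earlier in the post.
cloud_wh_per_query = 2.1   # ChatGPT estimate (Costa et al. 2025)
local_wh_per_query = 0.2   # measured locally with perf

queries_per_day = 1000
cloud_daily_kwh = cloud_wh_per_query * queries_per_day / 1000
local_daily_kwh = local_wh_per_query * queries_per_day / 1000

print(round(cloud_daily_kwh * 365, 1))  # 766.5 kWh per year
print(round(local_daily_kwh * 365, 1))  # 73.0 kWh per year
```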

And the energy is only part of the story. There’s also the question of where your data goes. Every query you send to OpenAI, Anthropic, or Google travels to their servers, gets logged, and feeds their systems. When you run a local model, your data stays on your machine. For anyone working with sensitive business data, customer information, or anything remotely confidential, this isn’t a nice-to-have. It’s a requirement.

Conclusion

Look, I’m not saying big models are useless. They clearly have their place in complex reasoning tasks, multilingual generation, and problems where you genuinely need the breadth of knowledge that comes from training on the entire internet. I use them myself for certain things.

What I am saying is that the narrative of “bigger is always better” is convenient for the companies selling API access at scale and inconvenient for everyone else. The truth, as these simple experiments show, is that for the vast majority of what people actually use LLMs for (summarizing text, answering questions, drafting content, explaining concepts) a small model running locally does the job just as well. Sometimes better.

This matters for Europe especially (I’ve written about this before). If we accept the premise that only trillion-parameter models from American corporations can do useful AI work, then we’ve already lost. We become permanent API consumers, sending our data and our money across the Atlantic in exchange for a service we could provide ourselves. The small model movement (and it is a movement, with Mistral, Qwen, Llama, and others leading the way) is one of the most important developments in AI precisely because it breaks this dependency.

Run your own models. Measure the actual difference on your actual tasks. I think you’ll find, as I did, that the monster under the bed isn’t nearly as scary as they want you to believe.

References

Costa, Luís, Emilija Kucinskaite, and Vaidas Ruzgas. 2025. How Much Energy Does Generative AI Use? Surfshark. https://surfshark.com/research/chart/chatbots-energy-consumption.
Ding, Peng, and Rick Stevens. 2025. Unified Tool Integration for LLMs: A Protocol-Agnostic Approach to Function Calling. arXiv:2508.02979. arXiv. https://doi.org/10.48550/arXiv.2508.02979.
Gladwell, Malcolm. 2009. What the Dog Saw: And Other Adventures. Hachette UK.
Schick, Timo, Jane Dwivedi-Yu, Roberto Dessì, et al. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761. arXiv. https://doi.org/10.48550/arXiv.2302.04761.
Zheng, Xiaosen, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, and Min Lin. 2025. Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates. https://arxiv.org/abs/2410.07137.