I've done some things

An Interesting Thing About Granite 4.1

IBM recently released a big pile of new open weight foundation models, Granite version 4.1, and there are a few interesting things going on with these models. I want to briefly talk about one thing that isn’t called out in the announcement, but that I noticed right away: Granite 4.1 is a citation machine. It cites its work extensively when the topic is something with citable research, and seemingly mostly accurately. These aren’t the hallucinated citations that models are infamous for. So far, the citations are all real (verifiable in the search API calls) and they’re mostly directly relevant to the topic. The one catch is that the way it cites things in the text is vague, making it a little hard to Google some sources without digging into the API log.

I think this is a side effect of, or at least related to, one of the features the Granite team called out in their announcement: Granite is a strong tool user. As a small model, it can’t know everything, or even much of anything, really. It’s only 33.5GB in the 8-bit quantization I’m using, far too small to even contain all of the English language Wikipedia articles (105GB uncompressed). But it does effectively use search to expand what it can speak with authority on, in a way I haven’t seen from other models (whether the web is trustworthy is another matter, but it seems to favor good sources, too). Qwen 3.6 and Gemma 4 are very good tool users for their size, and will also search the web, but they don’t cite their work as consistently or extensively as Granite, at least not without being prompted explicitly to do so.

And, it’s notable that the citations are real (and mostly relevant), rather than hallucinated. One of the biggest risks of LLMs, even the big guys, is that they lie with confidence, including making up citations. Humans are particularly gullible when they see/hear confident lies. Smaller models are more prone to lying because they have so little knowledge…they are unable to say “I don’t know”, so they make something up. A small model that is strongly inclined to look stuff up rather than dream up some truthy answers feels like a big deal, at least for some categories of problem.

For example, I asked it a question that’s been on my mind lately: “What’s the current scientific consensus on how to engage on social media in a way that changes minds, and helps people stuck in conspiratorial thinking or in cult-like groups find their own way out of it?” This is on my mind because I’ve been using Gemma 4 to analyze my social media writing over the past several years, about 5 million words, rating it on rudeness/helpfulness, among other metrics. I’m considering whether I’m wasting my time (~5 million words in five years, obviously I’m wasting my time, but how much of it was wasted?), particularly when engaging on topics that are rife with conspiracy thinking (which is pretty much all topics in the US, now).

Granite cited nine different sources, including Wikipedia. Most I would consider strong/trustworthy sources: MIT Technology Review, MIT Sloan, PLOS ONE, ScienceDirect three times, an FBI report on extremist recruitment, and Pew Research Center. And, I would consider the answer on par with the frontier models I regularly use, Opus 4.7 and Gemini 3.1 Pro. To be fair, Gemma 4 and Qwen 3.6 also gave pretty similar recommendations, but with fewer citations. So, in this case, every model was successful, but I think Granite was more successful. It gave a very good answer that could be verified, and it did it with very limited resources (64GB of VRAM on two old/cheap GPUs, in this case).

Qwen 3.6 has been mentioned repeatedly on Hacker News and Reddit as being better at coding, among other things, and that may be true. I haven’t compared that yet, and I rarely use self-hosted models for code. But, both Gemma 4 and Qwen 3.6 are quite prone to lying and/or glossing over stuff they don’t know, even when they have a search tool available, and there’s a lot of stuff they don’t know. That disqualifies them from being my choice for research. I’ll probably still go to bigger models for this kind of thing, but I’m pleasantly surprised by Granite. Its prose is excellent, as well, for a small model: concise and precise.

In short: Granite 4.1 is limited but careful, in a way I don’t think any other small model has been, perhaps because of IBM’s focus on enterprise use for these models, and that feels like an important and valuable quality to measure these tools on.

Another example: I asked Granite questions about IEC 62443-4-1 (a secure software development lifecycle standard commonly used for industrial automation and control systems, which I’ve been researching for my job at the robot factory), and it gave me the following notice at the end:

Note on Sources

Because IEC 62443‑4‑1 is a proprietary standard, public web searches via the Brave Search tools did not return freely accessible full‑text excerpts. The summary above reflects the publicly known structure and requirement categories as described in official abstracts, industry whitepapers, and secondary documentation (e.g., NIST’s cross‑walk to IEC 62443). For precise clause numbers, exact wording, and any normative annexes, consult the purchased standard document or an authorized institutional copy.

No other model, small or large, qualified its answer in this way, or made such a strong effort to direct the reader to sources. And, the answer it gave was excellent. It was concise, clear, and focused more on specific and actionable details than Qwen 3.6 (27B at 8 bits) or Gemma 4 (31B at 8 bits). It seems weaker on other metrics, and the benchmarks IBM has published mostly compare to earlier Granite models, so I’m not sure where it fits in the overall landscape. But, I think a small model that seems tuned for telling the truth and backing it up is an interesting turn.

Postscript

Out of curiosity, I ran the same questions by all three small models with the search tool disabled. Their answers were all pretty good, shockingly good for how limited their knowledge must be at roughly 30-35GB on disk. I liked the answers that used search better in all cases, but there wasn’t as much difference as I expected. Gemma 4 and Qwen 3.6 without search performed better than Granite 4.1 on both questions, and even vaguely cited (real) sources and gave more practical/actionable advice. So, I think Gemma and Qwen are “smarter”, Granite is just more careful. Which answer was “better” is somewhat subjective, of course, but the number of citations is measurable. Granite with search likes to cite sources and does a good job synthesizing data from search results, and that feels novel for a very small model.
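
Since the number of citations is the one part of this that can be counted, here is a rough sketch of how I might tally it. It assumes you have already pulled the cited URLs out of the search tool log yourself; the function name and the example URLs are made up for illustration, and none of this reflects how Granite or any search tool actually formats its output.

```python
from collections import Counter
from urllib.parse import urlparse


def count_cited_domains(urls):
    """Count how many times each domain appears in a list of cited URLs."""
    domains = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        # Treat "www.example.org" and "example.org" as the same source.
        if host.startswith("www."):
            host = host[4:]
        # Skip strings that did not parse as a URL with a host.
        if host:
            domains.append(host)
    return Counter(domains)


# Hypothetical URLs standing in for what a search tool log might contain.
example_urls = [
    "https://www.example.org/study-a",
    "https://example.org/study-b",
    "https://journal.example.com/article/1",
]

counts = count_cited_domains(example_urls)
print(len(counts), "distinct domains,", sum(counts.values()), "citations")
```

Counting by domain is a crude stand-in for counting sources (three ScienceDirect papers would collapse into one domain), so it is probably worth looking at both the domain count and the raw citation count.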