26 April 2026
More than one night in Gourock, the heating off, the UM690L breathing hard. Xandria's The Wonders Still Awaiting in the headphones throughout - Reborn on loop while qwen3.6 spent four hundred seconds explaining itself to me one bash script at a time.
The plan had a clean shape going in. Twelve hand-written bash scripts as a corpus. A local Ollama server with five models pulled. ChromaDB as the vector store, nomic-embed-text for embeddings, a single Python file gluing everything together. The output: a personal assistant that, given a one-line description in English, generates bash in my style. The kind of thing that a year ago needed an OpenAI key and looked like magic. The kind of thing that today fits in 280 lines of Python and a 5GB model file on a mini-PC.
The clean shape lasted about an hour.
corpus
The first thing I learned is what every dataset paper has been quietly saying for years: the corpus is the work. I had a ~/bash-rag/scripts/ directory from a session in December that I had never finished, with thirty-three files in it. Five were mine. Twenty-eight were leftover noise - mkall.sh, mkerrors.sh, mksysnum_plan9.sh, golint.sh, the entire bash machinery from the golang.org/x/sys source tree, dropped in at some point to pad the dataset. None of them had been written by me. None of them looked like anything I had written. The retrieval system would have spent half its time pulling Go syscall generators when I asked it for a disk monitor.
So the first action was deletion. The Go scripts went into _excluded/. Backups, a styles.yaml that defined four categories from a sample of five, an installer script that installed a previous version of the system itself, a START-HERE.txt that no longer described anything that existed: gone. What remained was an empty workspace and the realisation that this is the right starting position. A corpus you didn't curate is a corpus you don't trust.
I uploaded twelve scripts. Things I had actually written and used: thermal-sensors.sh for the System76 fan curve, tor-middle-relay.sh from the homelab, cd-ripper.sh and cd-player.sh from the FLAC archive work, install-gitea.sh from the LXC, a Conway's Game of Life implementation in pure bash that I wrote one evening because I felt like it. Twelve files, 2224 lines, 80 KB total. Small. Smaller than most "tutorials on RAG" assume - those usually open with a corpus of "10,000 documents" and explain how to scale. The interesting question is the inverse: what does a RAG look like when the corpus is small and yours?
The relevant property of these twelve scripts isn't size. It's that they share a recognisable stylistic signature. A block of ANSI colour constants at the top, often Gruvbox-derived. Helper functions named print_info, print_error, handle_error, check_dependencies. read -p with ${VAR:-default} for interactive input. Heredocs for configuration files. Section delimiters as echo "=== ... ===" or unicode box-drawing characters. A specific cadence to the comments. None of this is unusual; all of it is consistent. That consistency is what makes them a usable corpus. If I wrote bash in three different styles, the system would average across them and produce something that resembles none.
The first lesson, before any code: spend the cleaning time before the embedding time. The retrieval system can only retrieve what you fed it. Whatever junk is in the corpus is junk that ends up in the LLM's context window when you ask it a question.
the two-context-windows problem
The first run of the indexer made it through five files before it died. The traceback was unambiguous:
ollama._types.ResponseError: the input length exceeds the context length
docx_scan.sh was the file that broke it - 7770 characters, which I had passed to nomic-embed-text in one go. The embedder rejected it. Its context window is 2048 tokens. A bash script of roughly 250 lines is past that limit, and four of my twelve scripts are longer.
This was a forecast I should have made and didn't. The whole earlier conversation about context windows had been about the generation model - qwen2.5-coder with its 32K window, big enough to fit the entire corpus three times over. I had stopped thinking about context the moment that number checked out. The embedder has its own limit, an order of magnitude smaller, and I had walked into it. Two models, two context windows, two separate constraints. The literature calls this the embedder's context, and it's the thing that decides chunking strategy, but in practice it is the second thing you remember. The first thing is the LLM's context, because that's the one with the dramatic numbers.
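A ten-line pre-flight check would have found the problem before the traceback did. A sketch, using the crude four-characters-per-token heuristic rather than a real tokeniser - the constants and function name are mine, not anything in bash_rag.py:

import os

EMBED_CTX = 2048          # nomic-embed-text's window
CHARS_PER_TOKEN = 4       # rough heuristic, not a tokeniser

def files_over_limit(directory):
    # Report scripts whose estimated token count exceeds the embedder's window.
    offenders = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".sh"):
            continue
        with open(os.path.join(directory, name)) as f:
            est_tokens = len(f.read()) / CHARS_PER_TOKEN
        if est_tokens > EMBED_CTX:
            offenders.append((name, int(est_tokens)))
    return offenders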
The four files over the limit shaped the choice. Truncating them - embedding only the first 2048 tokens of each - would have meant the system never sees the back half of thermal-sensors.sh and install-gitea.sh, the parts where the actual logic lives. Switching to a longer-context embedder like jina-embeddings-v2-base-code would have pulled in a different model family and broken the comparison I wanted to set up. So I went with the option I had been avoiding all morning: chunking.
chunking, stupidly
Chunking bash is a small landmine of bad options. The conceptually right thing is to split on logical boundaries - function definitions, section comments, blocks. The previous version of bash_rag.py from December had attempted this. It looked for the first echo "===" line in each file and called everything before it the "header chunk". This works on the two scripts it was tuned against. On a corpus of twelve, half the files don't use that pattern. gameoflife.sh uses printf and unicode boxes. docx_scan.sh uses unicode dividers. cd-player.sh uses echo -e "${CYAN}---${RESET}". The chunker had the shape of the right idea, but it had been fitted to two examples and broke on ten.
I used the dumb option instead. Forty lines per chunk, ten lines of overlap, no awareness of bash structure. A function that spans 50 lines gets sliced across two chunks. A section comment that introduces 80 lines of code gets separated from half of what it describes. This is unambiguously worse than a smart chunker, but it has one property that matters: it works on every script I will ever write, because it doesn't know what a bash script is. It just treats the file as a sequence of lines and slides a window across.
def chunk_text(text, lines_per_chunk=40, overlap=10):
    """Slice a file into fixed-size line windows with overlap."""
    lines = text.split("\n")
    if len(lines) <= lines_per_chunk:
        return [text]
    step = lines_per_chunk - overlap
    chunks = []
    start = 0
    while start < len(lines):
        end = start + lines_per_chunk
        chunks.append("\n".join(lines[start:end]))
        if end >= len(lines):
            break
        start += step
    return chunks
The 40/10 numbers are not principled. They are big enough to keep most chunks below the 2048-token limit with margin, and small enough that the smallest scripts still produce a single chunk. I tested them by running the indexer once and checking that nothing failed. That is the entire methodology.
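The indexing loop itself is nothing more than embed-and-store. A minimal sketch of what it might look like, assuming a cosine-distance ChromaDB collection - the collection name and metadata keys here are illustrative, not the actual bash_rag.py internals:

import chromadb
import ollama

client = chromadb.PersistentClient(path="embeddings")
collection = client.get_or_create_collection("bash-scripts",
                                              metadata={"hnsw:space": "cosine"})

def index_script(filename, text):
    # One embedding call per chunk; nomic-embed-text never sees more
    # than ~40 lines at a time.
    for i, chunk in enumerate(chunk_text(text)):
        emb = ollama.embeddings(model="nomic-embed-text", prompt=chunk)["embedding"]
        collection.add(ids=[f"{filename}:{i}"],
                       embeddings=[emb],
                       documents=[chunk],
                       metadatas=[{"filename": filename, "chunk": i}])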
The result was 75 chunks for the twelve files. thermal-sensors.sh produced 14, install-gitea.sh 12, ascii-video.sh produced 1 (it's 31 lines and slipped under the threshold). The retrieval system now had 75 candidates to score against any query instead of 12, which is itself an interesting side effect. Top-3 chunks out of 75 is roughly six times more selective than top-3 files out of 12. The chunking solved the embedder problem and accidentally improved retrieval resolution.
The dumb solution beats the clever solution when the corpus is heterogeneous and you don't yet know which pattern dominates. The clever solution is something you write later, after you've seen what the dumb one gets wrong.
the embedder reads english, not bash
The next thing I learned came from running queries against the freshly built index. The first three were calibration:
$ python3 bash_rag.py retrieve "monitor cpu temperature"
1. thermal-sensors.sh (chunk 9) - similarity: 0.587
2. thermal-sensors.sh (chunk 6) - similarity: 0.565
3. thermal-sensors.sh (chunk 7) - similarity: 0.541
$ python3 bash_rag.py retrieve "rip an audio CD with proper metadata"
1. cd-ripper.sh (chunk 0) - similarity: 0.607
2. ogg-to-mp3--convert.sh (chunk 1) - similarity: 0.592
3. ogg-to-mp3--convert.sh (chunk 0) - similarity: 0.581
$ python3 bash_rag.py retrieve "set up a Tor relay node"
1. tor-middle-relay.sh (chunk 0) - similarity: 0.569
2. tor-middle-relay.sh (chunk 5) - similarity: 0.522
3. tor-middle-relay.sh (chunk 1) - similarity: 0.504
The right files came back for each query. Good enough. But there's a pattern worth staring at: chunk 0 takes the top slot in two of the three queries, and three of the nine results are the very first chunk of their file - the one with the shebang, the descriptive header comment, the colour constants. Only the thermal monitor query escapes the pattern. It's not random. It's the embedder telling me what it can read.
nomic-embed-text is trained on natural language. When I hand it a chunk of bash code, it treats the file as a sequence of words. Most of the words in a bash script are not English - they're identifiers, paths, command names, escape sequences. The embedder does its best, but the signal is thin. The English-language comments in the header are the part it understands clearly. So those rank highest.
The same effect, observed sideways, when I asked for a disk monitor with Gruvbox colours later in the day. The retrieval system pulled lyrics-finder.sh as one of its top three. lyrics-finder.sh has nothing to do with disk space. It has everything to do with the same block of Gruvbox colour constants I wrote in nine other scripts, and the query mentioned Gruvbox by name. The retriever matched on the wrong axis - colour scheme rather than task - but it matched.
This is not a bug in the embedder. It's the embedder doing exactly what it's specified to do. The unstated assumption I had walked in with was that semantic search on a code corpus would understand the code. It doesn't. It understands the words around the code. Which means the action that most improves retrieval quality on a personal corpus isn't tweaking chunk size or trying a different vector store. It's writing better comments.
There's a second-order consequence. A bash script with no header comment retrieves poorly across the board. ascii-video.sh in my corpus is 31 lines, no descriptive comment, mostly raw mplayer invocation. When I queried for "play a movie in ASCII art on the terminal" it came back as top-1 - but with a similarity of 0.555, only four points above the second-place result, which was dd-to-sd.sh - a script about writing disk images. The system was guessing in the dark. With one comment line at the top describing what the script does, that gap would have widened by ten or fifteen points and the second-place hit would have been correctly demoted.
I tested the failure mode explicitly. Query: "make breakfast and walk the dog". The system happily returned three results:
1. thermal-sensors.sh (chunk 1) - similarity: 0.417
2. gameoflife.sh (chunk 9) - similarity: 0.407
3. thermal-sensors.sh (chunk 13) - similarity: 0.382
Nothing in my corpus is about breakfast. The retriever doesn't know that. There's no threshold below which it returns nothing - it returns the top-K most-similar items in the database, regardless of how meaningless that similarity is. The numbers are a clear signal that the match is bad: 0.40 against the usual 0.55-0.65 range for relevant queries. But the calling code doesn't see signals, it sees a list of three documents. And the LLM downstream takes them as authoritative context. A retrieval system without a confidence floor will happily mislead its caller, professionally and silently.
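The floor itself is only a few lines. A sketch of what the retrieval call could look like with one, assuming the same cosine-distance collection as above (so similarity is one minus distance) and a cut-off value I've made up:

MIN_SIMILARITY = 0.50   # made-up floor; anything below reads as "no match"

def retrieve(collection, query_emb, k=3):
    res = collection.query(query_embeddings=[query_emb], n_results=k)
    hits = []
    for doc, meta, dist in zip(res["documents"][0],
                               res["metadatas"][0],
                               res["distances"][0]):
        similarity = 1.0 - dist
        if similarity >= MIN_SIMILARITY:
            hits.append((meta["filename"], similarity, doc))
    return hits   # an empty list is a legitimate answer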
The two takeaways together: write comments, and add a similarity threshold. I left the threshold off for now - I wanted to watch the LLM downstream try to write a disk monitor with thermal-sensors and a lyrics finder as its only references. That experiment is the next section.
model picks: dense vs MoE on cpu
The generation step needed a model. ollama list had five sitting there: granite3.3:8b, llama3.1:8b, dolphin3:latest, mistral:7b, gpt-oss:latest, plus the two I had pulled with this experiment in mind: qwen2.5-coder:7b and qwen3.6:latest. The choice for the post was the last two. Same family, same tokenizer, same training lineage, but architecturally on opposite ends. The other five would have given me a wider matrix and a duller story; what I actually wanted to know was how the two Qwens performed against each other on my hardware, my corpus, my style.
ollama show qwen3.6:latest revealed what the second one really is:
architecture qwen35moe
parameters 36.0B
context length 262144
embedding length 2048
quantization Q4_K_M
Capabilities: completion, vision, tools, thinking
Thirty-six billion parameters, MoE architecture. The community label "3.6" is misleading - this is Qwen3-VL-30B-A3B, a Mixture-of-Experts model with 36B total parameters but roughly 3B active per token. On CPU, where speed is dominated by active parameter count rather than total, it runs like a 3B-class model that has occasional access to 30B's worth of stored knowledge. Plus reasoning mode. Plus a 256K context window I would never use.
Two models, then. A dense 7B vs a sparse 30B/3B-active. A code-specialist vs a reasoning generalist. This was a cleaner comparison than anything I could have set up by mixing model families.
Before running anything, baseline. The same query, no RAG, just ollama run:
$ ollama run --verbose qwen2.5-coder:7b "Write one bash function that returns 42."
Output:
total duration: 7.564433288s
prompt eval rate: 96.74 tokens/s
eval rate: 9.40 tokens/s
A trivial query. Forty tokens in, sixty-six tokens out. 9.4 tokens per second of generation. Useful number to anchor against. Eval rate on CPU for a Q4 quantised 7B model - this is what "fast" looks like when you don't have a GPU.
The first real run, with full RAG context, gave the actual numbers:
Wall time: 233.6s
Prompt: 4096 tokens in 146.3s (28.0 tok/s)
Output: 650 tokens in 83.6s ( 7.8 tok/s)
Two things in those numbers that the tutorials don't tell you. The first: prompt evaluation is not free. With 4096 tokens of retrieved context - three of my scripts concatenated into a system prompt - the model spent two and a half minutes just reading the input before generating a single token of output. Sixty-three percent of the wall time was prompt processing. The RAG paradigm is "give the model relevant context", which sounds cheap in a tutorial. On CPU, that context costs as much to read as the answer costs to write.
The second: prompt evaluation rate degrades with prompt length. The tiny baseline query above hit 96.7 tok/s. The full RAG query, with a hundred times more input, ran prompt-eval at 28 tok/s. Three and a half times slower. Attention cost grows with sequence length - every token has to attend to everything before it, so the longer the prompt, the more work the average token costs - and CPU has no hardware acceleration to mask that. There is no flat rate.
Then qwen3.6 on the same query, with sampling normalised so the comparison was clean:
Wall time: 379.0s
Prompt: 4096 tokens in 82.7s (49.5 tok/s)
Output: 1384 tokens in 265.9s ( 5.2 tok/s)
A different shape entirely. Prompt eval almost doubled - 50 tok/s versus 28 - because the MoE only activates a fraction of its weights per token, and on CPU that's a real saving. Output rate fell - 5.2 versus 7.8 - because every output token requires expert routing, which on CPU costs more than dense matrix multiplication. And the model produced more than twice as many tokens for the same prompt: 1384 versus 650. The reasoning model is not just slower per token, it is more verbose per task.
The headline number - wall time - is almost meaningless without that decomposition. "qwen3.6 is 60% slower than qwen-coder" is what a benchmark would report. The truth is "qwen3.6 reads twice as fast and writes twice as much". For a short task they would tie. For a long task qwen-coder wins by minutes. Your tasks decide which model is your tool.
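The decomposition needs no instrumentation - Ollama reports the per-phase counts and durations in every response. A sketch of pulling them out of the local HTTP endpoint; the field names are the documented API, the durations arrive in nanoseconds:

import requests

r = requests.post("http://localhost:11434/api/generate",
                  json={"model": "qwen2.5-coder:7b",
                        "prompt": "Write one bash function that returns 42.",
                        "stream": False})
stats = r.json()

prompt_s = stats["prompt_eval_duration"] / 1e9
eval_s = stats["eval_duration"] / 1e9
print(f"Wall time: {stats['total_duration'] / 1e9:.1f}s")
print(f"Prompt: {stats['prompt_eval_count']} tokens in {prompt_s:.1f}s "
      f"({stats['prompt_eval_count'] / prompt_s:.1f} tok/s)")
print(f"Output: {stats['eval_count']} tokens in {eval_s:.1f}s "
      f"({stats['eval_count'] / eval_s:.1f} tok/s)")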
There is a second observation that the numbers don't surface, which I'll flag now and explain in the next section. I had passed think=False to qwen3.6 with the explicit intent of disabling its reasoning mode. The model ignored it. The output included two paragraphs of self-reflection on the ambiguity of my prompt, dropped right into the middle of the bash script. That's not a failure of the API. It's a model whose training so deeply integrated reasoning into its output stream that an off switch in the tooling doesn't reach far enough. Reasoning, on this model, is not a mode. It's a personality.
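For the record, the call that was supposed to flip the switch looked roughly like this. The think flag and options dict are real parameters of the Ollama Python client; the temperature value and the placeholder prompts are illustrative:

import ollama

SYSTEM_PROMPT = "..."   # the system prompt quoted in the next section
user_prompt = "..."     # the RAG-assembled user prompt

response = ollama.chat(
    model="qwen3.6:latest",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ],
    think=False,                      # politely declined, as described above
    options={"temperature": 0.2},     # illustrative sampling normalisation
)
script = response["message"]["content"]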
style is not correctness
Before the first run, the prompt. The system prompt is short and explicit:
You are a bash script generator. The user has written bash scripts in
their own personal style. Below are some of their existing scripts as
examples. Your task is to write a new bash script in the SAME STYLE as
these examples - same conventions for variable naming, error handling,
color usage, comment structure, function helpers, and overall formatting.
Output ONLY the bash script, ready to save to a .sh file. No explanations,
no markdown code fences, no commentary.
The user prompt that the script builds is mechanical:
Here are some example scripts I have written before:
--- Example 1: <filename> ---
<full script>
--- Example 2: <filename> ---
<full script>
--- Example 3: <filename> ---
<full script>
---
Now write a new bash script for this task:
<the actual query>
Three retrieved files at full length, separated, then the request. No clever framing, no chain-of-thought scaffolding, no "think step by step" prefix. Empirically, none of those help on this hardware - they cost output tokens and the LLMs are slow enough that every token has a price. The prompt template is the smallest thing that works.
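The assembly is a dozen lines. A sketch reconstructed from the template above - the function name and the (filename, text) tuple shape are illustrative, not the exact bash_rag.py code:

def build_prompt(query, examples):
    # examples: list of (filename, full script text) for the top-3 retrieved files
    parts = ["Here are some example scripts I have written before:\n"]
    for i, (filename, text) in enumerate(examples, start=1):
        parts.append(f"--- Example {i}: {filename} ---\n{text}\n")
    parts.append(f"---\nNow write a new bash script for this task:\n{query}")
    return "\n".join(parts)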
The "Output ONLY the bash script" line is the part the model ignores half the time, and we'll come back to that.
The first run with RAG. The same query as the baseline, the same model, but now with three of my scripts retrieved as context. Top-3 by similarity:
- thermal-sensors.sh (0.739)
- dd-to-sd.sh (0.682)
- lyrics-finder.sh (0.661)
thermal-sensors.sh was a sensible match - it's the script with the most monitoring-and-notification logic in the corpus. The other two were the embedder reading English again. dd-to-sd.sh because the query said "disk", lyrics-finder.sh because the query said "Gruvbox" and that script's header has the word gruvbox in a comment. Two out of three retrieved on lexical accidents rather than semantic understanding, which I had expected after the previous queries.
The output, after four minutes of generation:
#!/bin/bash
# Gruvbox-style ANSI colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
CYAN='\033[0;36m'
ORANGE='\033[0;91m'
BLUE='\033[0;34m'
RESET='\033[0m'

# Function to check free disk space and send notification
check_disk_space() {
    local threshold=10
    local partitions=$(df -h --output=source,pcent | awk '$2 ~ /^[0-9]+/ {print $1 " " $2}')
    for partition in $partitions; do
        IFS=' ' read -r -a arr <<< "$partition"
        local device="${arr[0]}"
        local usage="${arr[1]}"
        if [[ "$usage" =~ %$ ]]; then
            local percent_used=${usage%?}
            if [ "$percent_used" -lt "$threshold" ]; then
                echo -e "${RED}Disk space low on $device! ($percent_used%)${RESET}"
                mako send "Disk Space Alert" "Warning: Disk space is below ${threshold}% on $device."
            else
                echo -e "${GREEN}Disk space sufficient on $device. (${percent_used}%)${RESET}"
            fi
        fi
    done
}

# ...
Look at the colour block. The exact set of constants - RED, GREEN, YELLOW, CYAN, ORANGE, BLUE, RESET - is the block I have at the top of every script in my corpus. Same names. Same escape codes. Same casing. Same order. The model has not invented a colour palette; it has copied the colour palette I copy into every new file. This is the RAG working. The retriever surfaced examples, the LLM looked at them, and it produced output that visibly belongs in the same family. Without the corpus context, the baseline run earlier in the day had given me \x1b[38;5;160m inline - generic terminal escape sequences that the model picked from its training data. With the context, it picked mine.
Now look at the logic. if [ "$percent_used" -lt "$threshold" ] with threshold=10. The script alerts when the used percentage is less than 10. A partition that is 4% full triggers a notification. A partition that is 95% full passes silently. The model has produced a disk space monitor that fires when the disk is empty.
This is not a transcription error or a missing edge case. It is a complete inversion of the function the script claims to perform. And it appears in both runs of qwen-coder I did with this query - different sessions, different sampling seeds, same bug. The model consistently applies "below 10%" to the used percentage rather than the free one, warning when usage is under 10% instead of when free space is. A linguistic ambiguity in the query that I didn't notice when I wrote it, that the model resolved the same wrong way both times.
The other bug is more straightforwardly hallucinated: mako send "Disk Space Alert" "...". The mako binary is a notification daemon, not a CLI client. Notifications are sent with notify-send. The model has produced a CLI invocation that does not exist for a tool whose actual API it does not know. None of my retrieved examples used notify-send directly - thermal-sensors.sh wraps it inside a send_notification() function that the model saw the call to but not the body of. So the model knew that mako was the notification system in my corpus, didn't know how to talk to it, and fabricated a plausible-looking invocation.
The pattern is consistent. Style imitation is the part of the task that RAG actually solves. Surface conventions - colour constants, function naming, comment cadence, the shape of the script - transfer cleanly because they're visible in the retrieved examples. The model can copy them. Domain correctness - what mako accepts as arguments, what df --output=pcent returns, what "below 10%" means in context - does not transfer, because the corpus doesn't teach it. The corpus shows the model what my scripts look like, not what they do.
The same query against qwen3.6 produced a more interesting failure. The MoE model identified the linguistic ambiguity in the prompt - "the prompt says 'below 10%'. It's grammatically ambiguous. Standard monitoring tools warn on low free space. I will assume the user wants a warning when free space is below 10%." - and reasoned its way to the correct interpretation. It also wrote that reasoning into the output stream, in plain prose, between code blocks, alongside corrections to its own earlier code. The script that was saved to generated/ is a sequence of: a complete-looking bash script with the inverted logic, two paragraphs of self-correction, a partial code snippet of the fix. Not runnable. The model knew the answer and lost the answer in the act of explaining it.
I switched to a different task to see if the pattern held. A swap monitor: calculate usage, check if any database is running, run swapoff and swapon if not. A more concrete task with less linguistic ambiguity but more technical specificity. The retrieval was different - tor-middle-relay.sh, install-gitea.sh, docker-cleanup.sh, none of which involve swap, all of which involve system administration with conditional logic and service checks. Out-of-domain retrieval. Both models were operating without relevant examples.
qwen-coder produced a script with correct stylistic surface and a calculation that doesn't survive inspection:
local swap_usage=$(awk '/^Swap:/ { print $2 }' /proc/swaps | awk '{ sum += $0 } END { print sum }')
local total_memory=$(awk '/MemTotal/ { print $2 }' /proc/meminfo)
local total_swap=$((total_memory * 1024))
local swap_percentage=$((swap_usage * 100 / total_swap))
/proc/swaps does not contain a line beginning with Swap:. That format belongs to free. The regex never matches; swap_usage is empty. total_memory is RAM, not swap. The arithmetic that follows is nonsense in shape and would either fail at runtime or produce a number with no relationship to actual swap usage. The script runs. The script reports a value. The value means nothing. This is the most dangerous kind of bug: the kind that doesn't crash. Run it on a server, base a swapoff decision on its output, and you've made a destructive system change on the basis of a hallucination.
qwen3.6 wrote the calculation correctly:
total_swap=$(grep '^SwapTotal:' /proc/meminfo | awk '{print $2}')
free_swap=$(grep '^SwapFree:' /proc/meminfo | awk '{print $2}')
used_swap=$((total_swap - free_swap))
local percentage=$(( (used_swap * 100) / total_swap ))
/proc/meminfo is the right source. SwapTotal and SwapFree are the right fields. The arithmetic does what it says. There is also a guard for total_swap == 0, for systems without swap configured. The database check uses both pgrep -x and systemctl is-active to cover processes running directly and processes managed by systemd. The list of database service names includes mysqld separately from mysql, the daemon name distinct from the client name. None of this came from the retrieved context. It came from what the model knew about Linux that the corpus didn't have to teach it.
The two tasks together pointed at the same trade-off, with opposite outcomes. On the disk monitor, qwen3.6 reasoned correctly and delivered the answer in a broken format. On the swap monitor, qwen3.6 reasoned correctly and delivered cleanly. qwen-coder delivered cleanly in both cases but with wrong logic in one and silently nonsensical arithmetic in the other. The delivery format is the surface; the reasoning is what's underneath. RAG can fix neither.
What RAG fixed was the part nobody had been complaining about. The output of a code LLM with no corpus context looks generic. The same model with my corpus injected produces output that visibly belongs to me, down to the colour constant order. The improvement is real, observable, and shallow. The bugs are unchanged and may even be slightly worse, because a familiar-looking script slips past visual inspection more easily than a generic one.
and then...
The system works. python3 bash_rag.py generate "..." returns a bash script in my style after four to seven minutes, depending on the model. The script is sometimes correct, sometimes broken in subtle ways, always recognisably written in the conventions I've accumulated over years of homelab maintenance. The RAG is real - surface style transfers - and shallow - semantic correctness does not.
I had expected one of two outcomes when I started. Either the local LLMs would be too weak on a 6900HX without GPU to produce anything usable, or the system would work convincingly and become a real part of my workflow. Neither happened. What happened is the third option: the system works in a way that requires reading every line of its output. Not because it might fail, but because it will fail invisibly. A familiar-looking script with my colour constants and my function patterns is more dangerous than a generic one, because the visual cues for "this is right" no longer correlate with whether it actually is.
That changes how I'll use it. Not as a generator - too slow, too unreliable - but as a starter. When I have to write a new script in a domain I haven't covered before, four minutes of bash_rag.py generate gives me a structurally complete first draft with the right colour block, the right helper function names, the right cadence of comments, and a logic core that will need to be torn out and rewritten. The boilerplate is what RAG does well. The thinking still has to happen between my chair and the keyboard.
There's a smaller thing the system does well that surprised me. When I run a query and look at the retrieval output before generation - the three files the system pulled and their similarity scores - I learn something about my own corpus. lyrics-finder.sh is the script that gets pulled whenever I mention Gruvbox in a query, because it's the only one with the word gruvbox in the header. dd-to-sd.sh gets pulled whenever I say "convert" or "format". The retrieval system is a mirror of how my scripts are described, not how they work. Which means the action that improves the system most is editing my existing comments to be more precise. Better comments, better retrieval, better generation. The corpus pays for itself if I maintain it.
There are five other models I've mentioned only in passing - granite3.3:8b, llama3.1:8b, dolphin3:latest, mistral:7b, gpt-oss:latest. They're sitting in ollama list and they would have run. I considered including them and decided not to. The post is about a comparison, not a benchmark. Two models from the same family on opposite architectural ends - dense vs MoE, code-tuned vs reasoning-tuned - gave me a clean axis to talk about. Five additional models from four different families would have given me a leaderboard, and a leaderboard of seven LLMs on a corpus of twelve scripts isn't a measurement of anything except how much CPU time I'm willing to burn for a chart. There's a real cost to that kind of comparison and the cost isn't proportional to the insight. Someone is welcome to do it; I'm not going to.
The actual lesson the unrun models point to is a different one. On consumer CPU, in 2026, a 7B-class model and a 30B/3B-active MoE both fit, both run, both produce output in tolerable time. That window did not exist three years ago. The fact that I have too many viable models to test in a Sunday is the news, not which one wins.
The infrastructure itself is the part I'm most likely to keep. Twelve scripts indexed, an embeddings directory of 1.8MB, a 280-line Python file. None of this depended on anything outside the box. No cloud, no API, no telemetry, no service that could be deprecated under me. The whole thing builds on top of Ollama, which builds on top of llama.cpp, which builds on top of the patient work of people who decided that running models locally was a project worth doing for its own sake. That's the layer that matters. The model picks change every six months. The fact that this all runs on a Sunday afternoon in a flat in Gourock, with no permission asked of anyone, does not.
The repository is at codeberg.org/jolek78/bash-rag. GPLv3. The personal corpus that taught the system to write in my style isn't there - that's the part you have to bring yourself.