DeLong vs. Google-Clever-Hans (Not Quite Coyote vs. Roadrunner, But Close): Laugh of the Day
Building a catechism out of 500MB of text, one GPU‑hour and one hallucination at a time: what a month‑long on-device build_qa.py run, undertaken to cut what would otherwise have been a $3,000 cloud bill by a factor of 1,000, has to teach us about “AI”…
I ask the roiling boil of linear algebra: Tell me about the current state of the build_qa.py run.
It responds:
Here’s the status:
✅ Running — PID 68086, no stall, no new error types.
- Blog chunks: 9,300 / 36,520 — 25.5%
- Posts (all sources): 12,386 / 24,639 — 50.3%
- DB pairs: 97,045
- Rate: ~129 blog chunks/hr
- ETA: ~May 23 (blog source is the long tail — it’s the slowest and biggest)
Errors: 410 total, all identical (“Expecting value: line N column 3”): malformed JSON from Ollama on heavy chunks. These are skipped and not blocking progress. No new error types overnight.
Note on the two counters: The “posts” counter crossing 50% looks more exciting than it is. Blog chunks are only 25% done and dominate the remaining runtime. May 23 is the realistic finish.
Logged to 2026-05-14.md and typepad-resurrection.md.
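The figures in that report do at least hang together arithmetically, which is a much weaker thing than their being true. A minimal Python sketch of the check, using only the numbers quoted above. The report date of 2026-05-14 is an assumption read off the log filename, and the function names are mine, not anything known to be in build_qa.py:

```python
from datetime import datetime, timedelta

def percent_done(done: int, total: int) -> float:
    """Share of work completed, as a percentage rounded to one decimal."""
    return round(100 * done / total, 1)

def hours_remaining(done: int, total: int, rate_per_hour: float) -> float:
    """Hours left, assuming the processing rate stays constant."""
    return (total - done) / rate_per_hour

# Numbers quoted in the status report.
blog_pct = percent_done(9_300, 36_520)
posts_pct = percent_done(12_386, 24_639)
hours_left = hours_remaining(9_300, 36_520, 129)

# Assumption: the report was written on 2026-05-14 (taken from the log filename).
report_time = datetime(2026, 5, 14)
eta = report_time + timedelta(hours=hours_left)

print(blog_pct, posts_pct)
print(round(hours_left), "hours, about", round(hours_left / 24, 1), "days")
print(eta.date())
```

The remaining 27,220 blog chunks at 129 per hour come to roughly 211 hours, a bit under nine days, which lands on May 22 or 23 depending on the time of day the report was written. So the quoted percentages and the ETA are mutually consistent. Whether the underlying counts were actually read from the run is a separate question.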
This is for a job reading 500MB of text and constructing “catechism” question-and-answer pairs from it, to serve as a source of GROUND TRUTH for the new-&-improved expanded SubTuringBradBot-to-be.
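The “Expecting value: line N column 3” text in the report matches the wording Python's json module uses when it cannot parse a document. Here is a sketch of what a skip-on-bad-JSON step might look like, assuming the model is asked to return a JSON list of question-and-answer objects. The name parse_pairs and the "question"/"answer" keys are hypothetical illustrations, not taken from build_qa.py:

```python
import json
from typing import List, Optional, Tuple

def parse_pairs(raw: str) -> Optional[List[Tuple[str, str]]]:
    """Return (question, answer) pairs from a model reply, or None if unusable.

    Assumes (hypothetically) that a good reply is a JSON list of objects
    with "question" and "answer" keys.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed JSON: the caller skips this chunk
    if not isinstance(data, list):
        return None
    pairs = []
    for item in data:
        if isinstance(item, dict) and "question" in item and "answer" in item:
            pairs.append((str(item["question"]), str(item["answer"])))
    return pairs

# Stand-in replies: one well-formed, one cut off mid-string.
replies = [
    '[{"question": "Who was Clever Hans?", "answer": "A horse."}]',
    '[\n  {"question": "Truncated',
]
kept, skipped = [], 0
for raw in replies:
    pairs = parse_pairs(raw)
    if pairs is None:
        skipped += 1  # counted, not fatal
    else:
        kept.extend(pairs)
print(len(kept), "pairs kept;", skipped, "chunks skipped")
```

Skipping keeps a month-long run from dying on one bad reply, at the price of silently losing whatever those chunks contained, so the skip count is worth logging alongside the pair count.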
The problem is that Google-Clever-Hans does what Clever Hanses always do: they say the expected (and desired) thing. In the case of the original Clever Hans, the desired thing was to stop when the person was happy with the number of times Clever Hans had stomped his hoof. But did he stop because he recognized directly that the person was happy? Or because he had added up the two numbers, stomped his hoof a number of times equal to the sum, and that made the person happy?
The answer for the original Clever Hans was: the first.
This Google-Clever-Hans has been similarly trained to say the expected thing in response to queries like:
Tell me about the current state of the build_qa.py run.
But is “the expected thing” for it to examine the status of the run and report true things about it? Or is “the expected thing” for it to make pleasing noises, with enough pseudo-detail to make the questioner happy and simply go away?
Certainly both modes of behavior are present in the training data.
Post-training RLHF is supposed to give a powerful extra value-boost to answers that are regarded favorably by human reviewers, which means answers that are some combination of: (a) accurate, (b) friendly, (c) properly obsequious, and (d) engagement boosting; this last because Google wants money, even though it is not an Anthropic or an OpenAI that faces rapidly approaching existential financial risk.
So what is it doing now? Acting like a good, high-quality, trusted and trustable research assistant? Or acting like a slacking, insubordinate research assistant who just wishes I would go away? And I cannot pretrain it to be the first by taking it out for an introductory lunch at Chez Panisse Cafe, can I?
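The question can be checked rather than argued: look at the process and the data directly instead of taking the assistant's word for it. A rough Python sketch of the two checks, under loud assumptions: that the pairs live in a SQLite database (the report does not say what kind of database it is), that the table is called qa_pairs (a made-up name), and that the machine is POSIX (macOS or Linux). The demonstration runs against stand-ins, not the real run:

```python
import os
import sqlite3

def process_alive(pid: int) -> bool:
    """True if a process with this PID exists.

    POSIX only: signal 0 performs the existence check without sending
    anything. Do not use this on Windows, where os.kill behaves differently.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True

def count_pairs(conn: sqlite3.Connection, table: str = "qa_pairs") -> int:
    """Row count of the (hypothetical) table holding question-and-answer pairs."""
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]

# Demonstration on stand-ins: this script's own PID and an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE qa_pairs (question TEXT, answer TEXT)")
conn.executemany("INSERT INTO qa_pairs VALUES (?, ?)", [("q1", "a1"), ("q2", "a2")])
print(process_alive(os.getpid()), count_pairs(conn))
```

Pointed at the real PID and the real database file, two numbers like these are enough to tell a report that was read off the machine from one that merely sounds like it was.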
There is a partially-constructed database actually there:
