May 20, 2026 14 min read

The Knowledge Harness: How Uristocrat.com Keeps AI Output Trustworthy

The Knowledge Harness

Engineers do not rely on raw large language model output; instead, they use verification systems. When AI generates code, it is run through a harness, which includes tools such as linters to check code style and type checkers to ensure data types are correct. These tools identify hidden errors. Test runners execute code to confirm it functions as intended, while CI (Continuous Integration) gates are automated checks that prevent faulty changes. Adversarial reviewers, whether human or agent, attempt to break the work before release. Large language models are powerful but non-deterministic. The harness introduces quality controls, transforming unreliable output into consistent results.

There is no direct equivalent for knowledge work. When Claude drafts a strategy memo, research summary, or article, there is no linter to flag weak arguments, no test suite to catch unsupported claims, and no CI gate to keep inaccurate information from reaching readers. The output appears complete, which is exactly the problem: nothing ensures its accuracy. Addressing this gap has been my primary focus.

To address this, I developed a knowledge harness for Uristocrat.com. This system includes tools, guardrails, and review loops that support Claude when producing content rather than code. It intentionally adopts engineering principles, applying the same discipline to a new context.

A note on terminology: 'harness' can be read several ways. For engineers, it often refers to the complete runtime system: the control loop, tool-calling layer, sandboxes, and observability that support a model in production. I use the term in that sense, but there is no established term for a system that lets non-engineers run Claude reliably on complex knowledge work, so I call it a 'knowledge harness.' For those with an infrastructure background, think of it as automated quality controls applied to writing and research, not only software.

Here's what that looks like in practice.

Uristocrat.com operates with a small group of AI agents. They produce the daily roundup, the Saturday intelligence brief, standalone stories, and SEO updates. There is no editor, CMS rotation, or researcher. The site is managed solely by me, a Ghost installation, and a team of agents that update the site each morning before I am awake. The motivation is simple: I enjoy sneakers, streetwear, culture, and technology news, and wanted a single, accessible platform to view it all.

The system functions effectively due to the harness that oversees the agents. Without it, the process would produce posts that are difficult to read, error-prone, and incomplete. The following explains its setup and the rationale behind its design.

The problem: agents are non-deterministic

If you've shipped anything real with an LLM agent, you already know this. The same prompt run twice produces two different outputs. Sometimes the difference is cosmetic. Sometimes the agent decides today is the day to invent a new headline format, drop the internal link it always adds, or publish three posts about the same news story because it forgot what it published yesterday.

That's fine when a human reviews every output. It's a disaster when the agents run on a cron and ship to production at 6 a.m.

The standard answer is "evals": build a test set, score each run, and gate on the score. That works for narrow tasks. It does not work for an editorial product where the output is open-ended, and the quality bar is taste. You cannot unit-test a headline.

So I needed something else—a control plane that constrains the nondeterminism without trying to eliminate it.

The setup

The harness is a single skill (uristocrat-harness) with a knowledge tree behind it. The skill itself is short: it tells the model where to look, not what to think. The knowledge tree is the durable part:

uristocrat-harness/
├── SKILL.md            ← entry point, operating protocol
├── AGENTS.md           ← registry of every agent (source of truth)
├── ARCHITECTURE.md     ← run order, shared state
├── DESIGN.md           ← editorial design principles
├── FRONTEND.md         ← Ghost theme contract
├── PRODUCT_SENSE.md    ← what makes a good Uristocrat post
├── QUALITY_SCORE.md    ← rubric for grading agent output
├── docs/
│   └── product-specs/  ← one spec per agent
├── exec-plans/         ← in-flight and shipped improvements
└── generated/          ← auto-logged state (quality log, etc.)

Five production agents currently run on a schedule:

Agent	Runs	What it produces	Output mode
`daily-roundup`	Daily, pre-dawn	The morning roundup across sneakers, streetwear, culture, and tech	Draft (`[DRAFT]`)
`intelligence-brief`	Weekly (Saturday)	The Saturday intelligence brief	Draft (`[DRAFT]`)
`story-researcher`	Daily	Standalone stories on breaking news	Publishes directly
`seo-fixer`	Daily	On-page SEO and metadata fixes across existing posts	Draft
`quality-audit`	Weekly (Sunday)	Scores the week's posts against the 8-point rubric	Read-only

Four of the five drop into drafts or stay read-only. Only the story-researcher publishes directly.

The content agents don't write in a vacuum. Before anything reaches a draft, they call a shared skill, uristocrat-editor — the prose layer of the harness, and my attempt at the linter I said content was missing. It's the closest thing content has to one: a single skill that enforces voice, headline rules, length fit, and the house style the rubric later grades against. The standard lives in one file instead of getting re-described in every agent's prompt. When the editorial bar moves, I edit uristocrat-editor, not five separate skills. It's also where the agents-as-employees idea shows up most literally: the editor is the style guide every writer on staff works from.

How the harness constrains non-determinism

There are four mechanisms.

1. A single source of truth. AGENTS.md is the authoritative list of what's running. If a scheduled task exists that isn't in the registry, the harness flags it as drift. If a row in the registry has no matching task, the registry is wrong, and it gets fixed. This stops the slow rot where agents pile up over time and nobody remembers what they do.

2. A product spec per agent. Every agent has a spec at docs/product-specs/<agent-name>.md that serves as the contract; the prompt is the implementation. The rule: never edit the prompt without updating the spec first. It sounds bureaucratic until you've watched yourself tweak a prompt at midnight and forget by Friday what the agent was supposed to do.

3. A weekly quality review. Every Sunday, an audit agent reads the past seven days of posts and scores each one against an 8-point rubric (Lead, POV, Why-now, Voice, Length fit, Image, Internal link, Tags), plus a hard-fail headline check. Results get appended to a quality log. Week-over-week deltas surface drift before it becomes a pattern. The audit is read-only. It can't fix anything, but it tells me what to fix.

4. Draft mode by default. Three of the four content agents save drafts with a [DRAFT] prefix, and nothing goes live without me reading it. The story-researcher publishes directly, which is exactly why I watch it most closely. The asymmetry is intentional: I'd rather run a slow pipeline that needs my eyes than a fast one that ships garbage.
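The drift check in mechanism 1 comes down to a set comparison between the registry and the scheduler. Here is a minimal sketch, assuming both lists have already been extracted as plain agent names; how AGENTS.md is parsed and how scheduled tasks are listed is not shown, and `old-newsletter` is a made-up task used only for illustration.

```python
def find_drift(registered: set[str], scheduled: set[str]) -> dict[str, list[str]]:
    """Compare the registry (AGENTS.md) with what is actually scheduled.

    Both inputs are assumed to be plain agent names; extracting them from
    AGENTS.md and from the scheduler is outside this sketch.
    """
    return {
        # Running but not in the registry: flag as drift.
        "unregistered_tasks": sorted(scheduled - registered),
        # In the registry but not running: the registry is wrong.
        "stale_registry_rows": sorted(registered - scheduled),
    }


if __name__ == "__main__":
    registered = {
        "daily-roundup", "intelligence-brief", "story-researcher",
        "seo-fixer", "quality-audit",
    }
    # "old-newsletter" is a hypothetical leftover task, not a real agent.
    scheduled = {
        "daily-roundup", "intelligence-brief", "story-researcher",
        "seo-fixer", "old-newsletter",
    }
    print(find_drift(registered, scheduled))
```

Both directions matter: an unregistered task is an agent nobody remembers, and a stale row is documentation that no longer describes reality.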

Why this shape

The harness is doc-driven on purpose. The knowledge lives in markdown files, not in code. That buys three things:

I can read it without running anything.
The agents read the same files the harness reads. No translation layer.
When I'm reviewing what went wrong, I'm reading the same spec the agent worked from.

The whole thing is meant to be legible. If I can't explain in plain English what an agent does and how to tell whether it's working, I haven't finished designing it.

Example one: the wrong-Jason photo

On May 13, the story-researcher published a post on Jason Collins, the first openly gay active NBA player, who had just died at 47. The body was good. It scored 8/8 on the rubric, the headline followed the rule, the internal link was there, and the tags were clean.

The featured image was a photo of Jason Williams.

Different Jason. Point guard from the late-'90s Kings. Not the person the post was about. The agent had grabbed an image from a web search; the search returned a photo tagged "Jason" from an NBA wire feed, and the agent shipped it. The rubric had an "Image" dimension ("featured image set, on-topic, quality"), and the agent self-scored it a pass because the photo was on-topic for basketball. It never occurred to the agent that the photo could be of the wrong human.

This is the exact failure mode the harness exists for. Not a hallucinated fact in the body, not a bad headline. A plausible, brand-damaging mistake any human editor would catch in two seconds, and no eval would flag.

What I changed in the story-researcher skill:

No verified feature image, no publish. The skill had two rules in conflict. One section said to publish without an image if you can't verify the person in the photo; the preflight said don't publish without an image at all. Both paths now agree. If there's no verified feature image, the post saves as a draft and gets flagged in the run report for me to handle. A wrong-person photo on an obituary is a correction event. A missing photo is just a draft I'll clean up later.
Publish became the explicit default in the frontmatter. The skill's own description still said it "saves drafts for review." That hadn't been true for a while, so we updated it: posts go live unless they trip the no-image exception.
The run log became a real thing. The skill kept telling the agent to write check lines to "the run log" without ever saying what that was. It's now a single persistent markdown file in the RSS reader's data directory, appended each run under a dated header. Future quality reviews can actually read it.
Renamed a misleading preflight item. The check labeled "feature image URL set" was misleading, because the field gets set after creation. It now reads "feature image URL extracted," which is what the agent is actually verifying at that point.
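The first rule above reduces to a small decision function. This is a sketch under stated assumptions, not the skill's actual logic: the `Preflight` fields and the reason strings are hypothetical names, and how the agent verifies who is in a photo is not modeled here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Preflight:
    # Hypothetical field names; the real skill expresses these as checklist lines.
    feature_image_url: Optional[str]
    image_subject_verified: bool


def post_status(check: Preflight) -> Tuple[str, Optional[str]]:
    """Return (status, reason). Publish is the default; the only exception
    is a missing or unverified feature image, which downgrades to draft."""
    if not check.feature_image_url:
        return "draft", "no feature image extracted"
    if not check.image_subject_verified:
        return "draft", "feature image subject not verified"
    return "published", None


if __name__ == "__main__":
    print(post_status(Preflight("https://example.com/photo.jpg", True)))
    print(post_status(Preflight("https://example.com/photo.jpg", False)))
    print(post_status(Preflight(None, False)))
```

The reason string is what would land in the run report, so a draft never shows up without an explanation of why it was held back.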

The harness didn't prevent the mistake. It caught the class of mistake, and the rule moved upstream into the skill itself. The agents running today won't ship a post with an unverified photo.

Example two: the tag duplicates the audit was never looking for

The weekly quality audit ran read-only across 93 posts from May 11–17. The per-post findings were small: one !H headline violator (a LeBron piece with an editorial tail bolted onto the news), one empty placeholder draft, one roundup that shipped with no tags. Nothing dramatic.

The interesting finding wasn't the score. It was a pattern the rubric flagged almost as an afterthought: tag-slug duplicates.

Ghost matches incoming post tags by exact name. Send {"name": "Hip-Hop"} and Ghost reuses the existing tag. Send {"name": "Hip Hop"}, the same concept but a different string, and Ghost creates a brand-new one. The slug hip-hop is taken, so Ghost silently appends a suffix: hip-hop-2, hip-hop-3, hip-hop-4. Some of those tag pages 404, because they were never properly indexed.
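To make the mechanism concrete, here is a toy model of the behavior just described. It is not Ghost's code: `ToyTagStore` and `slugify` are illustrative stand-ins that only reproduce the two rules that matter here, exact-name matching and suffixing a slug that is already taken.

```python
import re


def slugify(name: str) -> str:
    """Lowercase the name and collapse anything non-alphanumeric into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


class ToyTagStore:
    """Toy stand-in for a tag table that matches incoming tags by exact name."""

    def __init__(self) -> None:
        self.slug_by_name: dict[str, str] = {}

    def upsert(self, name: str) -> str:
        # Exact string match: the existing tag is reused.
        if name in self.slug_by_name:
            return self.slug_by_name[name]
        # Any other spelling is a new tag. If its slug is taken, add a suffix.
        base = slugify(name)
        taken = set(self.slug_by_name.values())
        slug, n = base, 2
        while slug in taken:
            slug = f"{base}-{n}"
            n += 1
        self.slug_by_name[name] = slug
        return slug


if __name__ == "__main__":
    store = ToyTagStore()
    for spelling in ["Hip-Hop", "Hip Hop", "Hip-Hop", "hip hop"]:
        print(f"{spelling!r} -> {store.upsert(spelling)}")
```

Three spellings of one concept produce three tags, and no single call looks wrong.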

By the time we looked, the catalog had:

Four "Hip-Hop" tags, two of them orphaned to /404/.
Two "Daily Roundup" tags. daily-roundup held the Dec–Apr roundups; daily-roundup-2 held the May ones. The archive was silently split across two URLs.
Two "Intelligence Brief" tags with the same split.
Two "Air Force 1" tags. One had the clean slug but an ugly lowercase display name; the other had the proper display name but an ugly -2 slug. Neither was fully right.

None of this was visible from any single post. It only surfaced when scanning the whole week's tag list. The content agents generate posts with the LLM authoring tags freehand, and the LLM has no concept of "this string must exactly match the previous string." It picks whichever spelling feels right in the moment: "Hip Hop" Monday, "Hip-Hop" Tuesday. Every variant Ghost sees becomes a new tag. Each one looks fine in isolation. The split only shows up in a cross-catalog scan.
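A cross-catalog scan of this kind can be very small. The sketch below assumes the full list of tag slugs has already been fetched; `find_forked_slugs` is a hypothetical helper, not the audit agent's actual check. It only flags a numeric suffix when the unsuffixed slug also exists, so a legitimate slug like air-force-1 is not reported on its own.

```python
import re
from collections import defaultdict

# A slug that ends in "-<number>", e.g. "hip-hop-2".
SUFFIX = re.compile(r"^(?P<base>.+)-(?P<n>\d+)$")


def find_forked_slugs(slugs: list[str]) -> dict[str, list[str]]:
    """Group suffixed slugs under the base slug they were forked from."""
    known = set(slugs)
    forks: dict[str, list[str]] = defaultdict(list)
    for slug in slugs:
        match = SUFFIX.match(slug)
        # Only a fork if the base slug is itself a tag in the catalog.
        if match and match.group("base") in known:
            forks[match.group("base")].append(slug)
    return {base: sorted(dupes) for base, dupes in forks.items()}


if __name__ == "__main__":
    catalog = [
        "hip-hop", "hip-hop-2", "hip-hop-3", "hip-hop-4",
        "daily-roundup", "daily-roundup-2",
        "air-force-1", "air-force-1-2", "sneakers",
    ]
    for base, dupes in find_forked_slugs(catalog).items():
        print(base, "->", dupes)
```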

What I did.

On the catalog (17 posts retagged, 6 tags deleted): migrated every post off the duplicates onto canonicals. Deleted daily-roundup-2, intelligence-brief-2, hip-hop-2, hip-hop-3, hip-hop-4, and air-force-1-2. Renamed the canonical air-force-1 tag's display name to "Air Force 1," keeping the slug and the URLs intact.

On the agents: each of the three content-skill wrappers got an explicit ## Tagging section with canonical tag IDs baked in. The rule: pass tags by id, not by name. Passing {"id": "..."} to posts_add forces Ghost to reuse the existing tag. No name parsing, no fuzzy match, no chance to fork a new one. IDs are stable; names drift. For the fixed-set skills (daily-roundup, intelligence-brief), the entire tag set is now a static list of IDs: five for roundups, six for briefs, no thinking required. For the variable skill (story-researcher), verticals are by ID, and topic tags have to be tags_browse-checked first. Creating a new tag is the last resort.
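The pass-by-ID rule can be sketched as a lookup that refuses to guess. The IDs below are placeholders, not real Ghost tag IDs, and `tags_payload` is a hypothetical helper; the only part taken from the setup above is the shape of the result, a list of {"id": "..."} objects.

```python
# Placeholder IDs for illustration only; real values come from the tag catalog.
CANONICAL_TAG_IDS = {
    "Daily Roundup": "PLACEHOLDER_ID_DAILY_ROUNDUP",
    "Intelligence Brief": "PLACEHOLDER_ID_INTELLIGENCE_BRIEF",
    "Hip-Hop": "PLACEHOLDER_ID_HIP_HOP",
    "Air Force 1": "PLACEHOLDER_ID_AIR_FORCE_1",
}


def tags_payload(names: list[str]) -> list[dict[str, str]]:
    """Translate canonical tag names into id-only tag objects.

    Unknown names raise instead of falling through to name matching,
    which is the path that forks new tags.
    """
    missing = [name for name in names if name not in CANONICAL_TAG_IDS]
    if missing:
        raise KeyError(f"not a canonical tag, look it up first: {missing}")
    return [{"id": CANONICAL_TAG_IDS[name]} for name in names]


if __name__ == "__main__":
    print(tags_payload(["Daily Roundup", "Hip-Hop"]))
    try:
        tags_payload(["Hip Hop"])  # near-miss spelling is rejected, not created
    except KeyError as err:
        print("rejected:", err)
```

Failing loudly on a near-miss spelling is the point: the error is cheap, and the forked tag it prevents is not.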

Two lessons worth pulling out:

Tag systems that match by name are footguns for LLM agents. The LLM treats the tag-name field as a piece of writing and picks the variant that reads well. The tag system treats the same field as a primary key. Those two purposes conflict, and the conflict stays invisible until you audit. If you're building any taxonomy an agent touches, hand it IDs and never let it author identifiers as natural-language strings.
Read-only weekly audits earn their keep on the things you weren't looking for. The rubric was built to grade headlines and leads. It surfaced a catalog-integrity bug because one line of the checklist asked, "do any tag slugs have -N suffixes?" The score-by-post output was beside the point. The pattern flag was the find.

Example three: pushing checks down into CI

The story-researcher script had a hardcoded Ghost Admin API key sitting in plaintext. Convenient for an unattended task, and exactly the kind of thing that should never depend on a human reading the file to catch.

This is a different category of finding. The first two examples were about the harness watching the agents at runtime. The hardcoded key is about the harness watching the agents at the code layer, before they ever run. Two different surfaces, two different controls.

The lesson: the harness shouldn't be the last line of defense for things a CI pipeline can enforce. Every Uristocrat repo now gets a baseline of GitHub checks wired in. Secret scanning on every push. A linter over the skill files themselves, so frontmatter contradictions like the story-researcher's ("publish by default" in one place, "save drafts" in another) get flagged before they ship. Dependency audits on anything the agents call. None of this is novel. It's the standard CI hygiene any production repo should have. The agents had been getting a pass because they felt like scripts. They aren't scripts. They publish to a live site.
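A skill-file linter for the contradiction mentioned above does not need to be clever. Here is a minimal sketch, assuming each skill file opens with a `---` delimited frontmatter block of `key: value` lines; the `default_mode` field and the phrases it checks for are assumptions for illustration, not the real skill schema.

```python
import re


def parse_frontmatter(text: str) -> dict[str, str]:
    """Read simple `key: value` pairs from a leading --- ... --- block."""
    match = re.match(r"^---\n(.*?)\n---\n", text, re.DOTALL)
    if not match:
        return {}
    fields: dict[str, str] = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def lint_skill(text: str) -> list[str]:
    """Flag a description that contradicts the declared default mode.

    `default_mode` is a hypothetical field; the phrase checks are crude
    substring matches, enough to catch the obvious contradiction.
    """
    fields = parse_frontmatter(text)
    mode = fields.get("default_mode", "")
    description = fields.get("description", "").lower()
    problems = []
    if mode == "publish" and "saves drafts" in description:
        problems.append("default_mode is publish but description says it saves drafts")
    if mode == "draft" and "publishes directly" in description:
        problems.append("default_mode is draft but description says it publishes directly")
    return problems


if __name__ == "__main__":
    skill = (
        "---\n"
        "name: story-researcher\n"
        "default_mode: publish\n"
        "description: Researches breaking news and saves drafts for review.\n"
        "---\n"
        "# Skill body\n"
    )
    for problem in lint_skill(skill):
        print("LINT:", problem)
```

In CI, a non-empty problem list would fail the check, so the contradiction is caught on push instead of in a post-mortem.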

The rule: anything that can be caught deterministically in CI should be caught in CI, not in the weekly audit. The audit's job is the stuff CI can't see. Editorial taste, topic collisions, a wrong face on an obituary. Those need a human reading the output. Secrets and broken contracts in skill files don't.

What it doesn't solve

Non-determinism doesn't go away. The harness just narrows the blast radius. It doesn't make the agents as reliable as a deterministic script.

Two problems are still open. The first is same-day topic collisions between the daily roundup and the story-researcher. They read the same sources, the dedup check isn't wired up yet, and so they occasionally cover the same news. The quality review catches it after the fact but doesn't prevent it.
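One possible shape for that dedup check, offered as a sketch and not as what will eventually ship: compare the word overlap of a candidate headline against everything already produced that day. The stopword list, the 0.5 threshold, and the example headlines are all assumptions chosen for illustration.

```python
import re

# Small, arbitrary stopword list; a real check would tune this.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "for", "and", "at", "is", "with", "by"}


def title_tokens(title: str) -> set[str]:
    """Lowercased words and numbers from a headline, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower()) if w not in STOPWORDS}


def is_probable_duplicate(a: str, b: str, threshold: float = 0.5) -> bool:
    """Jaccard overlap of headline tokens; the threshold is a guess to tune."""
    tokens_a, tokens_b = title_tokens(a), title_tokens(b)
    if not tokens_a or not tokens_b:
        return False
    overlap = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    return overlap >= threshold


def collisions(candidate: str, published_today: list[str]) -> list[str]:
    """Headlines from today that the candidate probably duplicates."""
    return [t for t in published_today if is_probable_duplicate(candidate, t)]


if __name__ == "__main__":
    today = [
        "Nike confirms Air Force 1 restock date",
        "Supreme opens new Miami store",
    ]
    print(collisions("Air Force 1 restock date confirmed by Nike", today))
```

Word overlap will miss two posts that cover the same event in different vocabulary, so this would narrow the problem without closing it.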

The second is that the rubric is a proxy for taste. An 8/8 post can still be a bad post. A 6/8 post can be the best thing I publish that week. The rubric keeps the floor up, but it doesn't raise the ceiling. And uristocrat-editor narrows the linter gap I opened with, but it doesn't close it. There's still no test that tells me an argument is weak.

The point

The reason any of this works is that I treat the agents like employees who are learning and need coaching, not like scripts. They have job descriptions (product specs), a manager (the harness), a style guide (uristocrat-editor), performance reviews (the quality log), and a published org chart (AGENTS.md). When something goes wrong, I read the transcript the way I'd read a 1:1 note.

That's the knowledge harness: the same back-pressure engineers wrap around code, pointed at content instead.

Thanks to Darlin Alberto, whose feedback sharpened this piece.

Appendix: the anatomy of a harness

"Harness" doesn't have one settled definition, which is exactly why I scoped mine at the top. Still, it's worth showing the fuller picture, because the engineering version is much more than documentation or skills. A complete agent harness usually spans six layers and a couple of dozen components. Here's the map, so you can see where a knowledge harness sits inside it.

Instruction & guidance layer (the "feedforward") — what the agent is told before it acts.

1. System prompt — the base instructions defining the agent's role, behavior, and operating rules. The single most leveraged component; small changes ripple through everything.
2. Rules / context files — persistent project guidance like AGENTS.md, CLAUDE.md, or .cursorrules. These encode conventions, "never do X" rules, and architectural context the agent reloads each session.
3. Skills — reusable, packaged capabilities (instructions plus optional scripts/resources) the agent pulls in on demand for specific task types.
4. Few-shot examples & prompt templates — worked examples and structured scaffolds that steer output format and style.

Tools (what the agent can actually do) — the actions available to it.
5. File operations — read, write, edit, and view tools so the agent can inspect and modify work.
6. Shell / command execution — the ability to run bash, install packages, and execute scripts. Often the most powerful and most dangerous tool.
7. Search & navigation — grep/ripgrep, semantic search, and indexing so the agent finds the right material instead of guessing.
8. Tool descriptions — the natural-language specs of each tool. These are themselves a tuned component; vague descriptions cause misuse, so they get edited deliberately.
9. External integrations / MCP servers — connectors to outside systems (issue trackers, docs, browsers, databases, APIs) that extend what the agent can reach.

Context & memory — what the agent knows in the moment and over time.
10. Context management — the logic deciding what goes into the model's limited context window: what to include, summarize, compress, or drop.
11. Retrieval (RAG / indexing) — embedding-based or structural retrieval that surfaces relevant files and docs on demand.
12. Memory — short-term scratchpads and long-term memory files (e.g., a persistent MEMORY.md) that carry learnings across turns or sessions.

Control flow & orchestration — how work is sequenced.
13. The agentic loop — the core orchestrator that cycles reason → act → observe, decides when to stop, and handles retries. The literal engine of the harness.
14. Sub-agents — specialized child agents (a tester, a reviewer, a researcher) the main agent delegates to, each with its own prompt and tools.
15. Middleware / hooks — interceptors that fire before or after actions to inject context, modify behavior, or block disallowed moves.
16. Model configuration — which model, plus settings like temperature, max tokens, and reasoning depth.

Feedback sensors (the "feedback") — how the agent learns whether the work is good.
17. Linters & type checkers — automated quality signals that catch issues the model can't reliably self-detect.
18. Test runners — the agent runs the suite and reads results, turning "did it work?" into an objective signal rather than self-assessment.
19. Compilers / build systems — fast, unambiguous correctness feedback.
20. Pre-commit hooks & CI gates — back-pressure that blocks bad changes from landing. This is where "agent mistakes become permanent rules" usually gets enforced.

Guardrails & environment — the limits the agent operates inside.
21. Sandbox / execution environment — the container or VM the agent runs in, defining what it can touch and reach.
22. Permissions & approval gates — allow/deny rules and human-in-the-loop checkpoints for risky actions (deleting files, pushing, spending money).
23. Observability / tracing — logging of the agent's full trajectory so you can see why it failed and which component to fix. Without it you're tuning blind, which is why recent work treats observability as the bottleneck.

Where the knowledge harness fits

A knowledge harness is the same skeleton, pointed at content instead of code. The mapping is nearly one-to-one:

Feedforward is the knowledge tree: AGENTS.md, the product specs, PRODUCT_SENSE.md, and the uristocrat-editor skill that carries house style (items 2–3).
Tools are the Ghost API, web search, and the agents' file access (items 5–9).
Feedback sensors are where the two diverge most. Code gets linters, tests, and compilers (items 17–19): deterministic and objective. Content gets the weekly rubric and the audit agent, a proxy for taste rather than a true linter. That's the gap I keep flagging. The one place the mapping holds cleanly is CI gates (item 20): secret scanning and skill-file linting catch deterministic faults the same way they would in any repo.
Guardrails are draft-mode-by-default and the publish/approval asymmetry (items 21–22), and the run log is the start of real observability (item 23).

So "knowledge harness" isn't a looser word for documentation. It's the full apparatus above, minus the components that only make sense when the output is code, plus a human, because the feedback layer can't yet be automated the way a test suite can.