
How the Uristocrat.com Harness Is Set Up (and Why)

Uristocrat.com is run by a small group of AI agents. They write the daily roundup, the Saturday intelligence brief, the standalone stories, and the SEO fixes. There is no editor on staff, no CMS rotation, and no researcher. There is just me, a Ghost install, and a fleet of agents that touch the site every morning before I'm awake.

Why does Uristocrat.com exist? I like sneakers, streetwear, culture, and technology news, and I like having a place where I can see it all in one easy-to-read format.

The whole setup works because of a harness that sits on top of the agents. Without it, the setup ships posts that are hard to read, full of errors, and missing information. Here's how it's set up and why I built it the way I did.

The problem: agents are non-deterministic

If you've shipped anything real with an LLM agent, you already know this. The same prompt run twice produces two different outputs. Sometimes the difference is cosmetic. Sometimes the agent decides today is the day to invent a new headline format, drop the internal link it always adds, or publish three posts about the same news story because it forgot what it published yesterday.

This is fine when a human is reviewing every output. It's a disaster when the agents run on a cron and ship to production at 6 am.

The standard answer is "evals." Build a test set, score each run, and gate on the score. That works for narrow tasks. It does not work for an editorial product where the output is open-ended and the quality bar is taste. You cannot unit-test a headline.

So I needed something else. A control plane that constrains the non-determinism without trying to eliminate it.

The setup

The harness is a single skill (uristocrat-harness) with a knowledge tree behind it. The skill itself is short. It tells the model where to look, not what to think. The knowledge tree is the durable part:

uristocrat-harness/
├── SKILL.md              ← entry point, operating protocol
├── AGENTS.md             ← registry of every agent (source of truth)
├── ARCHITECTURE.md       ← run order, shared state
├── DESIGN.md             ← editorial design principles
├── FRONTEND.md           ← Ghost theme contract
├── PRODUCT_SENSE.md      ← what makes a good Uristocrat post
├── QUALITY_SCORE.md      ← rubric for grading agent output
└── docs/
    ├── product-specs/    ← one spec per agent
    ├── exec-plans/       ← in-flight and shipped improvements
    └── generated/        ← auto-logged state (quality log, etc.)

Five production agents currently run on a schedule:

| Agent | When | What it does | Publish mode |
| --- | --- | --- | --- |
| seo-fixes-for-uristocrat | Daily 05:02 | Pulls non-indexed URLs from Search Console, diagnoses and fixes in place | Auto-apply |
| uristocrat-daily-roundup | Weekdays 06:04 | Writes the morning cultural roundup | Draft |
| uristocrat-intelligence-brief | Saturday 08:08 | Writes the weekly subscriber brief | Draft |
| uristocrat-story-researcher | Daily 09:09 | Finds and writes 3–5 standalone stories | Published directly |
| uristocrat-quality-review | Sunday 09:09 | Audits the past 7 days against the rubric | Read-only |

Two of the five save drafts, and one is read-only. The SEO agent applies its fixes in place, and the story researcher publishes directly.

How the harness constrains non-determinism

There are four mechanisms.

1. A single source of truth. AGENTS.md is the authoritative list of what's running. If a scheduled task exists that isn't in the registry, the harness flags it as drift. If a row in the registry has no matching task, the registry is wrong and gets fixed. This prevents the slow rot in which agents pile up over time, and nobody remembers what they do.

2. A product-spec per agent. Every agent has a spec at docs/product-specs/<agent-name>.md which serves as the contract.

The agent's prompt is the implementation. The rule is: never edit the prompt without updating the spec first. This sounds bureaucratic until you've watched yourself tweak a prompt at midnight and forget what the agent was supposed to do by Friday.

3. A weekly quality review. Every Sunday, an audit agent reads the past 7 days of posts and scores each one against an 8-point rubric (Lead, POV, Why-now, Voice, Length fit, Image, Internal link, Tags) plus a hard-fail headline check. Results get appended to a quality log.

Week-over-week deltas surface drift before it becomes a pattern. The audit is read-only. It cannot fix anything, but it does tell me what to fix.

4. Draft mode by default. Two of the three content agents save drafts with a [DRAFT] prefix, and nothing from them goes live without me reading it. The story-researcher publishes directly, and it's the one I monitor most closely because of that. The asymmetry is intentional: I'd rather have a slow pipeline that requires my eyes than a fast one that ships garbage.
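The drift check behind mechanism 1 comes down to a set comparison. Here is a minimal sketch, assuming the AGENTS.md registry and the scheduler's task list have each already been parsed into a set of agent names. The function name `find_drift`, the report keys, and the `old-experiment` task are hypothetical and only illustrate the idea; this is not the harness's actual code.

```python
def find_drift(registry: set[str], scheduled: set[str]) -> dict[str, list[str]]:
    """Compare the AGENTS.md registry against what is actually scheduled."""
    return {
        # Running on a schedule but never registered: flag as drift.
        "unregistered_tasks": sorted(scheduled - registry),
        # Registered but nothing is scheduled: the registry row is stale.
        "stale_registry_rows": sorted(registry - scheduled),
    }


registry = {
    "seo-fixes-for-uristocrat",
    "uristocrat-daily-roundup",
    "uristocrat-intelligence-brief",
    "uristocrat-story-researcher",
    "uristocrat-quality-review",
}
# Hypothetical scheduler state: one unregistered task, one missing task.
scheduled = (registry - {"uristocrat-quality-review"}) | {"old-experiment"}

report = find_drift(registry, scheduled)
print(report)
```

Both lists being empty is the healthy state. Anything else means either the scheduler or the registry needs a fix before the next run.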

Why this shape

The harness is doc-driven on purpose. The knowledge lives in markdown files, not in code. This means:

  • I can read it without running anything.
  • The agents can read the same files that the harness reads. No translation layer.
  • When I'm reviewing what went wrong, I'm reading the same spec the agent was working from.

The whole thing is meant to be legible. If I can't explain in plain English what an agent does and how to know if it's working, I haven't finished designing it yet.

Example one: the wrong-Jason photo

On May 13, the story-researcher published a post on Jason Collins, the first openly gay active NBA player, who had just died at 47. The body of the post was good. It scored 8/8 on the rubric. The headline followed the rule. The internal link was there. The tags were clean.

The featured image was a photo of Jason Williams.

Different Jason. Point guard from the late-90s Kings, not the person the post was about. On an obituary, that is very bad. The agent had grabbed an image from a web search, the search returned a photo tagged "Jason" from an NBA wire feed, and the agent shipped it. The rubric had an "Image" dimension ("featured image set, on-topic, quality"), and the agent self-scored it as a pass because the image was on-topic for basketball. It didn't occur to the agent that the photo could be of the wrong human.

This is the exact failure mode the harness exists for. Not a hallucinated fact in the body. Not a bad headline. A plausible, brand-damaging mistake any human editor would catch in two seconds, and no eval would flag.

What I changed in the story-researcher skill:

  1. No feature image, no publish. The skill had two rules in conflict. One section said to publish without an image if you can't verify the person in the photo. The preflight said don't publish without an image at all. Both paths now agree. If there's no verified feature image, the post is saved as a draft and flagged in the run report for me to handle. A wrong-person photo on an obituary is a correction event. A missing photo is just a draft I will clean up later.
  2. Publish became the default in the frontmatter. The skill's own description still said it "saves drafts for review." That hadn't been true for a while, so I updated it to reflect that posts go live unless they trip the no-image exception.
  3. The run log became a real thing. The skill kept telling the agent to write check lines to "the run log" without saying what that was. It's now a single persistent markdown file in the RSS reader data directory, appended each run under a dated header. Future quality reviews can actually read it.
  4. Renamed a misleading preflight item. The check labeled "feature image URL set" was misleading because the field gets set after creation. It now says "feature image URL extracted," which is what the agent is actually verifying at that point.

The harness didn't prevent the mistake. It caught the class of mistake, and the rule got moved upstream into the skill itself. The agents running today won't ship a post with an unverified photo.
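The no-image rule is simple enough to sketch as a gate in front of publishing. This is an illustration under assumptions, not the skill's real logic: the `Post` fields, the `image_subject_verified` flag, and the flag messages are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Post:
    title: str
    feature_image_url: Optional[str] = None
    image_subject_verified: bool = False  # hypothetical flag set by the agent


def publish_status(post: Post) -> tuple[str, Optional[str]]:
    """Return (status, flag); status is 'published' or 'draft'."""
    if not post.feature_image_url:
        return "draft", "no feature image extracted"
    if not post.image_subject_verified:
        return "draft", "feature image subject not verified"
    return "published", None


unverified = Post("Obituary", feature_image_url="https://example.com/photo.jpg")
verified = Post("Obituary", "https://example.com/photo.jpg", image_subject_verified=True)

print(publish_status(Post("Obituary")))  # ('draft', 'no feature image extracted')
print(publish_status(unverified))        # ('draft', 'feature image subject not verified')
print(publish_status(verified))          # ('published', None)
```

The point of the shape is that both failure paths land in the same place: a draft plus a flag for the run report, never a live post.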

Example two: the tag duplicates the audit was never looking for

The weekly quality audit ran read-only across 93 posts from May 11–17. The per-post findings were small. One !H headline violator (a LeBron piece with an editorial tail bolted onto the news). One empty placeholder draft. One roundup that shipped with no tags. Nothing dramatic.

The interesting finding wasn't the score. It was a pattern flagged by the rubric almost as an afterthought: tag-slug duplicates.

Ghost matches incoming post tags by exact name. Send {"name": "Hip-Hop"} and Ghost reuses the existing tag. Send {"name": "Hip Hop"}, same concept, different string, and Ghost creates a brand-new tag. The slug hip-hop is taken, so Ghost silently appends a suffix: hip-hop-2, hip-hop-3, hip-hop-4. Some of those tag pages 404 because they were never properly indexed.

By the time we looked, the catalog had:

  • Four "Hip-Hop" tags (two of them orphaned to /404/)
  • Two "Daily Roundup" tags. daily-roundup held Dec–Apr roundups. daily-roundup-2 held the May ones. The roundup archive was silently split across two URLs.
  • Two "Intelligence Brief" tags with the same split.
  • Two "Air Force 1" tags. One had the clean slug with an ugly lowercase display name. The other had the proper display name with an ugly -2 slug. Neither was fully right.

None of this was visible from any single post. It only surfaced when scanning the whole week's tag list.

The content agents generate posts with the LLM authoring tags freehand. The LLM has no concept of "this string must exactly match the previous string." It picks whichever spelling feels right that moment. "Hip Hop" Monday, "Hip-Hop" Tuesday. Every variant Ghost sees becomes a new tag. Each one looks fine in isolation. Nothing fails loudly. The split only becomes visible in a cross-catalog scan.
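A cross-catalog scan of this kind can be sketched in a few lines. The version below is a heuristic, assuming the full list of tag slugs has already been fetched; `find_forked_slugs` and the sample catalog are illustrative names and data, not the audit agent's actual check.

```python
import re
from collections import defaultdict

# A slug that ends in "-<number>", split into its base and the number.
SUFFIX = re.compile(r"^(?P<base>.+)-(?P<n>\d+)$")


def find_forked_slugs(slugs: list[str]) -> dict[str, list[str]]:
    """Group slugs that look like '-N' forks of a slug that also exists."""
    existing = set(slugs)
    forks = defaultdict(list)
    for slug in slugs:
        match = SUFFIX.match(slug)
        # Only flag when the un-suffixed slug is also in the catalog, so a
        # legitimate tag such as "air-force-1" is not reported on its own.
        if match and match.group("base") in existing:
            forks[match.group("base")].append(slug)
    return {base: sorted(dupes) for base, dupes in forks.items()}


catalog = [
    "hip-hop", "hip-hop-2", "hip-hop-3", "hip-hop-4",
    "daily-roundup", "daily-roundup-2",
    "intelligence-brief", "intelligence-brief-2",
    "air-force-1", "air-force-1-2",
    "sneakers",
]
forked = find_forked_slugs(catalog)
print(forked)
```

The base-must-exist condition keeps the check quiet on slugs that naturally end in a number. It would still misfire if a catalog held both "air-force" and "air-force-1" as real tags, so the output is a list of suspects to review, not a delete list.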

What I did:

On the catalog (17 posts retagged, 6 tags deleted): Migrated all posts off the duplicate tags onto canonicals. Deleted daily-roundup-2, intelligence-brief-2, hip-hop-2, hip-hop-3, hip-hop-4, and air-force-1-2. Renamed the canonical air-force-1 tag's display name to "Air Force 1," preserving the slug and keeping URLs intact.

On the agents: Each of the three content-skill wrappers got an explicit ## Tagging section with canonical tag IDs baked in. The rule:

Pass tags by id, not by name.

Passing {"id": "..."} to posts_add forces Ghost to reuse the existing tag. There's no parsing of the name, no fuzzy match, no opportunity to fork a new tag. IDs are stable. Names drift. For the fixed-set skills (daily-roundup, intelligence-brief), the entire tag set is now a static list of IDs. Five for roundups, six for briefs, no thinking required. For the variable skill (story-researcher), verticals are by ID, and topic tags must be tags_browse-checked first. Creating a new tag is the last resort.
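To make the rule concrete, here is a rough sketch of a helper that only ever emits id-based tag entries. The IDs are placeholders and `tags_payload` is a hypothetical name; real IDs would come from the Ghost catalog, and the resulting list is what would be handed to the post-creation call.

```python
# Placeholder IDs for illustration; real values come from the Ghost catalog.
CANONICAL_TAG_IDS = {
    "Daily Roundup": "tag-id-daily-roundup",
    "Hip-Hop": "tag-id-hip-hop",
    "Air Force 1": "tag-id-air-force-1",
}


def tags_payload(tag_names: list[str]) -> list[dict[str, str]]:
    """Turn canonical tag names into id-only entries; refuse unknown strings."""
    unknown = [name for name in tag_names if name not in CANONICAL_TAG_IDS]
    if unknown:
        # Fail loudly instead of letting a freehand spelling fork a new tag.
        raise ValueError(f"not canonical tags: {unknown}")
    return [{"id": CANONICAL_TAG_IDS[name]} for name in tag_names]


print(tags_payload(["Daily Roundup", "Hip-Hop"]))
# [{'id': 'tag-id-daily-roundup'}, {'id': 'tag-id-hip-hop'}]
```

A freehand variant such as "Hip Hop" raises here instead of reaching Ghost, which turns a silent catalog fork into an error the run report can show.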

Two lessons worth pulling out:

  1. Tag systems that match by name are footguns for LLM agents. The LLM treats the tag-name field as a piece of writing. It picks the variant that reads well. Tag systems treat the same field as a primary key. Those two purposes conflict, and the conflict is invisible until you audit. If you're building any taxonomy an agent touches, give the agent IDs and never let it author identifiers as natural-language strings.
  2. Read-only weekly audits earn their keep on the things you weren't looking for. The rubric was built to grade headlines and leads. It surfaced a catalog-integrity bug because one line of the checklist asked "do any tag slugs have -N suffixes?" The score-by-post output was beside the point. The pattern flag was the find.

Example three: pushing checks down into CI

The story-researcher script had a hardcoded Ghost Admin API key sitting in plaintext. Convenient for an unattended task. Also, exactly the kind of thing that should never depend on a human reading the file to catch.

This is a different category of finding. The first two examples were about the harness watching the agents at runtime. The hardcoded key is about the harness watching the agents at the code layer, before they ever run. Two different surfaces, two different controls.

The lesson that came out of it: the harness shouldn't be the last line of defense for things a CI pipeline can enforce. Every Uristocrat repo now gets a baseline of GitHub checks wired in. Secret scanning on every push. A linter pass on the skill files themselves, so frontmatter contradictions like the one the story-researcher had ("publish by default" in one place, "save drafts" in the other) get flagged before they ship. Dependency audits on anything the agents call. None of this is novel. It's the standard CI hygiene that any production repo should have. The agents had been getting a pass because they felt like scripts. They aren't scripts. They publish to a live site.
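For a sense of what the secret check does, here is a toy version of a scan for a hardcoded Admin API key. It assumes the key has the shape of 24 hex characters, a colon, and 64 hex characters; that pattern, the function name, and the `GHOST_ADMIN_KEY` variable are assumptions for illustration. A real pipeline would lean on a dedicated scanner instead of a single regex.

```python
import re

# Assumed key shape: 24 hex chars, a colon, then 64 hex chars.
ADMIN_KEY_PATTERN = re.compile(r"\b[0-9a-f]{24}:[0-9a-f]{64}\b")


def find_hardcoded_keys(source: str) -> list[int]:
    """Return 1-based line numbers that appear to contain an Admin API key."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if ADMIN_KEY_PATTERN.search(line)
    ]


fake_key = "a" * 24 + ":" + "b" * 64  # obviously fake, for illustration only
script = (
    "import os\n"
    f'KEY = "{fake_key}"\n'
    'SAFE = os.environ.get("GHOST_ADMIN_KEY")\n'
)
print(find_hardcoded_keys(script))  # [2]
```

Reading the key from the environment, as on the third line of the sample script, is the version that passes.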

The rule is: anything that can be caught deterministically in CI should be caught in CI, not in the weekly audit. The audit's job is the stuff CI can't see. Editorial taste, topic collisions, a wrong face on an obituary. Those need a human reading output. Secrets and broken contracts in skill files don't.

What it doesn't solve

Non-determinism doesn't go away, but the harness narrows the blast radius. It doesn't make the agents as reliable as a deterministic script.

Two problems are still open. The first is same-day topic collisions between the daily roundup and the story researcher. They both read from the same web sources; the dedup check isn't wired up yet, so occasionally they cover the same news. The harness's quality review catches it after the fact, but doesn't prevent it.
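One cheap way the missing dedup check could look is a token-overlap comparison between the day's titles. This is only a sketch of a possible approach, not something the harness does today; the stopword list, the 0.5 threshold, and the sample headlines are all invented for illustration.

```python
import re

STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "for", "with", "at", "is"}


def title_tokens(title: str) -> set[str]:
    """Lowercase word set for a headline, minus common stopwords."""
    return set(re.findall(r"[a-z0-9]+", title.lower())) - STOPWORDS


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared tokens divided by total distinct tokens (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_collisions(roundup_titles, story_titles, threshold=0.5):
    """Pairs of same-day titles whose token overlap meets the threshold."""
    return [
        (r, s)
        for r in roundup_titles
        for s in story_titles
        if jaccard(title_tokens(r), title_tokens(s)) >= threshold
    ]


roundup = ["Nike Drops New Air Force 1 Colorway"]
stories = ["Nike Drops Air Force 1 Colorway for Summer", "LeBron Signs Extension"]
collisions = find_collisions(roundup, stories)
print(collisions)
```

Title overlap will miss two posts that cover the same event with different wording, so a check like this would narrow the problem without closing it.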

The second is that the rubric is a proxy for taste. An 8/8 post can still be a bad post. A 6/8 post can be the best thing I publish that week. The rubric keeps the floor up, but it doesn't raise the ceiling.

The point

The reason any of this works is that I treat the agents like employees who are learning and need coaching and advising, not like scripts. They have job descriptions (product specs), a manager (the harness), performance reviews (the quality log), and a published org chart (AGENTS.md). When something goes wrong, I read the transcript as I would a 1:1 note.