Post 012 — Methodology

The harness that lied

On the verdict-before-evidence pattern in security research — and how I caught my own PoC confirming itself regardless of the facts.

12 May 2026 · 3 min read · Stuart Thomas

The most expensive lesson I learned this spring cost me nothing except the certainty I thought I had.

In May 2026 I withdrew two reports from Apple's Security Bounty programme on the same morning. Not because Apple told me to. Because I looked at my own test harness and found that it was printing EXPLOIT CONFIRMED regardless of whether anything had been exploited.

The situation

I was researching a time-of-check to time-of-use (TOCTOU) race condition in XProtect Remediator, the component macOS uses to clean up known malware. A closely related race had received a CVE the previous year. I found what appeared to be unfixed siblings in two remediator components, built test harnesses, and watched EXPLOIT CONFIRMED appear in the terminal. I wrote reports and submitted them.

Apple's reviewer flagged concerns on the first. He was right to.

The investigation

I ran fs_usage system-wide during a fresh harness execution, capturing every filesystem operation. Three thousand operations during the race window — every single one from my own harness's rm calls. Zero from any XProtect process.

The EXPLOIT CONFIRMED banner had fired because my cleanup step ran before my verdict check, so the check found the effect of my own rm calls and reported it as the remediator's. The harness was not observing the race. It was staging a scene and then photographing it.

I withdrew both reports the same morning.

The pattern has a name

I call this the verdict-before-evidence pattern. In the same two-week window it happened twice. In the first case — a different finding — a DTrace probe silently failed to attach. The script printed success because the error path never ran. I had confused absence of failure with evidence of success.

The second case was the TOCTOU harness: the script itself generated the only evidence it went on to check. Different root causes, same shape. The observation apparatus was entangled with the thing it was supposed to observe.
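One way to guard against the first failure mode is to demand a positive sign that the tracer was attached before reading anything into an empty result. What follows is a minimal sketch, not the tooling used in either case: classify_capture, the log path, and the canary path are all hypothetical. It assumes the harness deliberately touches a canary file during the capture, so a working tracer must have logged it.

```shell
#!/bin/sh
# Hypothetical helper: classify a capture log before trusting it.
# Assumption: the harness touches the canary path on purpose during the
# capture, so a tracer that was really attached must have logged it.
classify_capture() {
    log="$1"; canary="$2"; target="$3"
    if [ ! -s "$log" ]; then
        echo "NO_DATA"          # missing or empty log: the tracer never ran
    elif ! grep -qF -- "$canary" "$log"; then
        echo "TRACER_BLIND"     # tracer ran but missed an event known to exist
    elif grep -qF -- "$target" "$log"; then
        echo "TARGET_SEEN"
    else
        echo "TARGET_ABSENT"    # a zero backed by proof the tracer could see
    fi
}
```

Only TARGET_SEEN and TARGET_ABSENT are verdicts. The other two states mean the instrument failed, which is different from the target doing nothing.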

Where this instinct comes from

I first understood what a honeynet was in 2001, running one at Oracle. The point is to watch what actually happens — to create conditions where the truth has to show itself, rather than conditions that confirm your expectation. The attacker either appears in the logs or does not. You do not manufacture the log entry and then check whether it looks right.

I knew this principle. I forgot to apply it to my own test tooling. The 2am sessions where this kind of error incubates have a particular quality: you have been running the same experiment for three hours, the terminal is full of output, and something appears that confirms what you suspected. At 2am you do not run fs_usage for fifteen minutes to account for every filesystem operation. You should.

The PoC

# What I should have run — before filing anything
# System-wide fs_usage capture during the race window:

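# -w forces wide output; -e with no PID list excludes fs_usage itself;
# -t 60 (an arbitrary window, size it to the race) ends the capture so that
# sort can flush. stderr stays visible: a failed capture must not read as zero.
sudo fs_usage -w -f filesys -e -t 60 | \
    grep -E "(XProtect|unlink|rmdir)" | \
    awk '{print $NF, $0}' | \
    sort | tee /tmp/race_evidence.txt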

# Verdict rule: if XProtect process names appear on the unlink/rmdir lines,
# the race fired. If only your own harness's processes appear, the harness lied.
# I did not run this before filing. I ran it after Brent flagged concerns.
# Every unlink came from my own rm calls. Zero from XProtect.
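The verdict rule can be applied mechanically rather than by eye. The sketch below is an illustration under stated assumptions: tally_deletes is a hypothetical name, each line of the evidence file is assumed to begin with the process field that the awk step prepends (followed by fs_usage's timestamp and call name), and remediator processes are assumed to carry "XProtect" in their name.

```shell
#!/bin/sh
# Sketch: tally delete events in the evidence file by who issued them.
# Assumed line layout: <process> <timestamp> <call> ... as written by the
# awk step, which copies the last fs_usage column to the front of each line.
tally_deletes() {
    awk '
        $3 ~ /^(unlink|rmdir)/ {
            if ($1 ~ /XProtect/) target++; else other++
        }
        END { printf "xprotect=%d other=%d\n", target, other }
    ' "$1"
}
```

Called as tally_deletes /tmp/race_evidence.txt, it prints one line with both counts. A result with xprotect=0 means the deleting was done by something other than the remediator.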

The rule I now apply

Every PoC verdict must reduce to OS-level evidence that the harness itself cannot produce. If the only record of exploitation is a string the harness printed, the harness has told you nothing. If fs_usage, DTrace, spindump, or an IPS file shows the effect independently — if the kernel has a record your script does not control — then you have something.

The corollary: any cleanup step in a test harness must run after the verdict is recorded and verified. Never before.
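The ordering can be enforced in the harness itself. This is a sketch of the sequencing only, assuming the race, verdict, and cleanup steps are supplied as command names; run_harness and its arguments are hypothetical and do not come from the original harness.

```shell
#!/bin/sh
# Sketch: cleanup is reachable only after a verdict has been written and
# checked. The three hooks are passed in as command or function names.
run_harness() {
    verdict_file="$1"; race="$2"; verdict="$3"; cleanup="$4"
    "$race" || return 1
    # The verdict hook should read independent evidence, not harness output.
    "$verdict" > "$verdict_file" || return 1
    # Refuse to clean up unless a non-empty verdict is on disk.
    [ -s "$verdict_file" ] || return 1
    "$cleanup"
}
```

If the verdict step fails or writes nothing, the function returns before cleanup runs, so the state the verdict depends on is still there to inspect.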

Opinion, clearly flagged: The reviewer who closed the first report was substantively correct. I disagreed at the time. He was right. The binary evidence of the asymmetric guard remains valid; the exploit claim does not. I was measuring the wrong thing and calling the measurement a result.

Disclosure note

I withdrew both reports after confirming the verification flaw through an independent fs_usage capture. Apple's Security Bounty team handled both cases professionally throughout. I have kept the report text on file as a record, not as a finding.

Are you there? I asked the kernel. The kernel said yes. I should have checked what it was basing that on.
