An Executable-Gate Multi-Agent Research Organization: Artifact, Case Study, and a Pre-Registered Gate-Calibration Study
Kacper Saks
Multi-agent systems are increasingly proposed as autonomous research organizations, yet their quality-control mechanisms are rarely measured. This work releases a domain-agnostic, 39-role multi-agent research organization in which every role output must pass an executable gate and adversarial sign-off, so that nothing self-certifies, together with two empirical records of its behavior. The first is a limitations-forward case study of the initial end-to-end run (v0.1.0), in which the organization abandoned all five candidate research directions and retracted its own flagship integrity exhibit after that exhibit failed its verifier. The second is a pre-registered, blinded, seeded gate-calibration study (v0.2.0) that answers the question the first run left open: do the gates discriminate between sound and flawed research, or do they reject uniformly? On a corpus of 20 theses (10 known-flawed with planted defects, 10 known-sound reproductions, SHA-256 sealed labels), the gates detected 15 of 15 planted flaws and cited the correct reason in 13–14 of 15 cases. An ablation isolated the round-1 auto-fail prior as the dominant false-positive source: the prior reduced pooled validity-gate specificity from 0.96 [0.86, 1.00] to 0.72 [0.58, 0.84] and killed 8 of 10 known-sound theses with no sensitivity benefit. The release includes a Citation Fidelity Protocol with SHA-256-pinned verbatim anchors, a full reproducibility harness (make reproduce / make verify), and seven named limitations stated in full.
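As an illustration of what sealing labels with SHA-256 can look like, the following is a minimal commit-and-verify sketch. It is not taken from the released harness: the function names `seal_labels` and `verify_seal`, the thesis identifiers, the JSON serialization, and the use of a random nonce are all assumptions made for this example.

```python
import hashlib
import hmac
import json
import secrets


def seal_labels(labels: dict[str, str], nonce: bytes) -> str:
    """Return a SHA-256 digest committing to the labels (hypothetical helper)."""
    # Canonical serialization: sorted keys and fixed separators, so the same
    # labels always produce the same bytes regardless of insertion order.
    payload = json.dumps(labels, sort_keys=True, separators=(",", ":")).encode("utf-8")
    # The nonce keeps a small label space from being brute-forced from the digest.
    return hashlib.sha256(nonce + payload).hexdigest()


def verify_seal(labels: dict[str, str], nonce: bytes, published_digest: str) -> bool:
    """Check revealed labels and nonce against the digest published earlier."""
    return hmac.compare_digest(seal_labels(labels, nonce), published_digest)


# Hypothetical corpus labels; the identifiers are placeholders, not real theses.
labels = {"T01": "flawed", "T02": "sound", "T03": "flawed"}
nonce = secrets.token_bytes(16)

# Before evaluation: publish only the digest, keep labels and nonce private.
digest = seal_labels(labels, nonce)

# After evaluation: reveal labels and nonce so anyone can re-check the seal.
print(verify_seal(labels, nonce, digest))
```

Under this scheme, changing any label after the digest has been published makes verification fail, which is the property that lets a reader check the labels were fixed before the gates were run.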