CogForce
Mechanics

How a guess becomes a graded judgment.

Most preference data is collected once, weighted equally, and never audited. CogForce treats reviewer judgment as a measurable thing — with all the nuance that “measurable” should imply.

01 · Tasks are small judgment calls

A task is a single decision: pick A or B, score warmth on a five-step scale, mark a refusal as correct, hedged, or paranoid. Each item is standalone. No tasker carries cognitive load between items.

Tasks are domain-tagged (legal, cooking, code review, Korean translation, pediatric tone). Routing is by demonstrated calibration in that domain, not self-reported expertise.
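
A minimal sketch of what a task record and the routing check might look like, with hypothetical field names throughout (nothing here is CogForce's published schema):

    from dataclasses import dataclass
    from typing import Literal

    @dataclass(frozen=True)
    class Task:
        """One standalone judgment call. All field names are illustrative."""
        task_id: str
        domain: str                          # e.g. "code_review", "korean_translation"
        kind: Literal["pairwise", "scale", "label"]
        payload: dict                        # the two replies, the text to score, etc.
        options: tuple[str, ...]             # ("A", "B") or ("correct", "hedged", "paranoid")

    def eligible(calibration: dict[str, float], task: Task, floor: float = 0.7) -> bool:
        # Route on demonstrated calibration in the task's domain, never self-report.
        return calibration.get(task.domain, 0.0) >= floor

Routing reads only the per-domain calibration map, so a strong score in one domain never opens another.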

02 · Some items are probes

A probe is an item where expert consensus is already known and held aside. Probes look identical to non-probes. Taskers cannot tell which is which, and the share rotates over time so memorization fails.

Probe rates by domain
  • Tone & warmth: 11–14%
  • Translation: 18–22%
  • Code review: 9–12%
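
A sketch of how a batch might be assembled at those rates, using midpoints of the published ranges and hypothetical names; the real rotation policy is not specified here:

    import random

    # Midpoints of the published ranges, purely for illustration.
    PROBE_SHARE = {"tone_warmth": 0.125, "translation": 0.20, "code_review": 0.105}

    def build_batch(live_items: list, probe_pool: list, domain: str,
                    rng: random.Random) -> list:
        """Interleave probes with live items so the two are indistinguishable."""
        share = PROBE_SHARE[domain]
        n_probes = round(len(live_items) * share / (1 - share))
        batch = live_items + rng.sample(probe_pool, n_probes)
        rng.shuffle(batch)                   # probes land in random positions
        return batch

Because the share itself rotates over time, memorizing which items were probes in one session buys nothing in the next.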

03 · Accuracy becomes weight

A tasker's accuracy on probes — corrected for item difficulty and reviewer disagreement — sets the weight of their judgment on unprobed items. High calibration means a tasker's call counts more in the consensus signal that goes back to the model.

Calibration is per-domain. A great editorial judge is not automatically a great code reviewer.
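
One plausible reading of that weighting step, sketched under assumptions: each probe carries credit proportional to its difficulty (the miss rate among calibrated reviewers), and the resulting per-domain score scales a tasker's vote in consensus. The correction for reviewer disagreement is omitted for brevity.

    from collections import defaultdict

    def calibration_score(probe_results, difficulty):
        """probe_results: list of (item_id, correct: bool).
        difficulty: item_id -> miss rate in [0, 1).
        Harder probes earn more credit when answered correctly."""
        credit, possible = 0.0, 0.0
        for item_id, correct in probe_results:
            d = difficulty[item_id]
            credit += d if correct else 0.0  # full difficulty credit for a correct call
            possible += d
        return credit / possible if possible else 0.0

    def weighted_consensus(votes):
        """votes: list of (choice, weight). Returns the calibration-weighted winner."""
        totals = defaultdict(float)
        for choice, weight in votes:
            totals[choice] += weight
        return max(totals, key=totals.get)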

04 · Score compounds forward

Calibration carries between sessions. Strong taskers unlock harder, better-paid work and start training the next layer. The unlock ladder is visible: every tasker can see what they need to do to move up a tier.

Score is portable. If you leave, your calibration history goes with you, signed and exportable.
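
A minimal sketch of a signed export, assuming an HMAC-SHA256 scheme purely for illustration; the actual signature format is not stated in this document:

    import hashlib, hmac, json

    def export_history(history: dict, signing_key: bytes) -> str:
        """Serialize a tasker's per-domain calibration history and attach a
        signature so a third party holding the key can detect tampering."""
        body = json.dumps(history, sort_keys=True).encode()
        sig = hmac.new(signing_key, body, hashlib.sha256).hexdigest()
        return json.dumps({"history": history, "sig": sig})

    def verify_export(blob: str, signing_key: bytes) -> bool:
        doc = json.loads(blob)
        body = json.dumps(doc["history"], sort_keys=True).encode()
        expected = hmac.new(signing_key, body, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, doc["sig"])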

Common questions

What kind of work does CogForce route?
Small, well-scoped human judgment calls — picking the warmer of two AI replies, choosing the more on-brand microcopy, marking whether an AI's refusal was right, hedged, or paranoid. Each task is a single decision; no cognitive load is carried between items.
How does CogForce grade a tasker without knowing the right answer?
Two invisible signals run on every task. Probes are items where expert consensus is already known and held aside; near-duplicate items spaced across sessions measure whether a tasker agrees with themselves. Probes look identical to non-probes, the share rotates, and neither signal alone is gameable.
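The second signal, agreement with yourself across near-duplicates, could be computed like this (a sketch, not the production logic):

    def self_agreement(responses: dict[str, str],
                       duplicate_pairs: list[tuple[str, str]]) -> float:
        """responses: item_id -> the tasker's answer.
        duplicate_pairs: near-duplicate items spaced across sessions.
        Returns the share of pairs where the tasker agreed with themselves."""
        scored = [(a, b) for a, b in duplicate_pairs
                  if a in responses and b in responses]
        if not scored:
            return 0.0
        return sum(responses[a] == responses[b] for a, b in scored) / len(scored)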
Is this RLHF, DPO, or something else?
Any of them. CogForce delivers per-item consensus weighted by reviewer calibration, plus disagreement structure and per-domain reviewer scores, usable directly for RLHF reward modeling, DPO preference pairs, or evaluation suites.
How does calibration compound?
Calibration carries between sessions and is per-domain. Strong reviewers unlock harder, better-paid work and start training the next layer. A tasker's score is portable, signed, and exportable.
Where does the work happen?
Anywhere. Tasks are designed to be small enough to do on a phone in five minutes — on the train, at a kitchen table, between meetings. No webcam, no shift schedules, no surveillance dashboards.
For AI labs

What you actually get back.

  • Per-item consensus, weighted by calibrated reviewer judgment.
  • Disagreement structure — where humans split, and along what lines.
  • Per-domain reviewer calibration for downstream RLHF or DPO training.
  • Audit trails: who answered, how confident, how their calibration was earned.
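
Concretely, a single delivered item could look like the following record; every field name here is an assumption for illustration, not a published export format.

    delivered_item = {
        "item_id": "cmp-00481",
        "consensus": {"choice": "A", "weighted_share": 0.83},   # calibration-weighted
        "disagreement": {
            "split": {"A": 0.83, "B": 0.17},
            "axis": "brevity vs. warmth",                       # where humans divide
        },
        "reviewers": [
            {"id": "r-210", "choice": "A", "confidence": 0.9,
             "calibration": {"tone_warmth": 0.91}},             # per-domain, auditable
        ],
    }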