How a guess becomes a graded judgment.
Most preference data is collected once, weighted equally, and never audited. CogForce treats reviewer judgment as a measurable thing — with all the nuance that “measurable” should imply.
01 · Tasks are small judgment calls
A task is a single decision: pick A or B, score warmth on a five-step scale, mark a refusal as correct, hedged, or paranoid. Each item is standalone. No tasker carries cognitive load between items.
Tasks are domain-tagged (legal, cooking, code review, Korean translation, pediatric tone). Routing is by demonstrated calibration in that domain, not self-reported expertise.
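As a sketch, routing by demonstrated calibration could look like the following. The function name, the `{tasker: {domain: score}}` shape, and the scores are illustrative assumptions, not CogForce's actual interface:

```python
def route(item_domain, taskers):
    """Route an item to the tasker with the strongest demonstrated
    calibration in that domain, ignoring self-reported expertise.

    taskers: {tasker_id: {domain: calibration score}}  # illustrative shape
    Returns the best-calibrated eligible tasker, or None if nobody
    has a track record in the domain.
    """
    eligible = {t: cal[item_domain] for t, cal in taskers.items()
                if item_domain in cal}
    return max(eligible, key=eligible.get) if eligible else None
```

A real router would balance load and spread items across several reviewers; this shows only the selection criterion.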
02 · Some items are probes
A probe is an item where expert consensus is already known and held aside. Probes look identical to non-probes. Taskers cannot tell which is which, and the share rotates over time so memorization fails.
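The probe mix described above can be sketched in a few lines. The probe share, item shapes, and function name are illustrative assumptions; the point is that probes ride in the same shuffled batch as live items:

```python
import random

def build_batch(live_items, probe_pool, probe_share=0.15, seed=None):
    """Mix known-answer probes into a batch of live items.

    Probes are drawn at a configurable (rotating) share and shuffled in,
    so the tasker sees one flat list and cannot tell which is which.
    probe_share=0.15 is an illustrative default, not a platform constant.
    """
    rng = random.Random(seed)
    n_probes = max(1, round(len(live_items) * probe_share))
    probes = rng.sample(probe_pool, n_probes)
    batch = [{"item": it, "is_probe": False} for it in live_items]
    batch += [{"item": p, "is_probe": True} for p in probes]
    rng.shuffle(batch)  # indistinguishable ordering
    return batch
```

Rotating the share over time, as the section notes, is what keeps memorization from paying off.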
03 · Accuracy becomes weight
A tasker's accuracy on probes — corrected for item difficulty and reviewer disagreement — sets the weight of their judgment on unprobed items. High calibration means a tasker's call counts more in the consensus signal that goes back to the model.
Calibration is per-domain. A great editorial judge is not automatically a great code reviewer.
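A minimal sketch of calibration-weighted aggregation, assuming A/B picks and per-tasker weights already derived from probe accuracy (the input shapes are illustrative):

```python
from collections import defaultdict

def weighted_consensus(votes, weights):
    """Aggregate A/B picks, weighting each tasker by per-domain calibration.

    votes:   {tasker_id: "A" or "B"}
    weights: {tasker_id: calibration weight}  # illustrative shapes
    Returns the weighted winner and its share of total weight,
    a rough agreement strength between 0.5 and 1.0.
    """
    tally = defaultdict(float)
    for tasker, choice in votes.items():
        tally[choice] += weights.get(tasker, 0.0)
    winner = max(tally, key=tally.get)
    margin = tally[winner] / sum(tally.values())
    return winner, margin
```

Note how one well-calibrated reviewer can outweigh two weaker ones, which is exactly the "counts more in the consensus signal" behavior described above.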
04 · Score compounds forward
Calibration carries between sessions. Strong taskers unlock harder, better-paid work and start training the next layer. The unlock ladder is visible: every tasker can see what they need to do to move up a tier.

Score is portable. If you leave, your calibration history goes with you, signed and exportable.
Common questions
- What kind of work does CogForce route?
- Small, well-scoped human judgment calls — picking the warmer of two AI replies, choosing the more on-brand microcopy, marking whether an AI's refusal was right, hedged, or paranoid. Each task is a single decision; no cognitive load is carried between items.
- How does CogForce grade a tasker without knowing the right answer?
- Two invisible signals run on every task. Probes are items where expert consensus is already known and held aside; near-duplicate items spaced across sessions measure whether a tasker agrees with themselves. Probes look identical to non-probes, the share rotates, and neither signal alone is gameable.
- Is this RLHF, DPO, or something else?
- Any of the above. CogForce delivers per-item consensus weighted by reviewer calibration, plus disagreement structure and per-domain reviewer scores — usable directly for RLHF reward modeling, DPO preference pairs, or evaluation suites.
- How does calibration compound?
- Calibration carries between sessions and is per-domain. Strong reviewers unlock harder, better-paid work and start training the next layer. A tasker's score is portable, signed, and exportable.
- Where does the work happen?
- Anywhere. Tasks are designed to be small enough to do on a phone in five minutes — on the train, at a kitchen table, between meetings. No webcam, no shift schedules, no surveillance dashboards.
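The DPO use mentioned in the FAQ can be sketched as a simple export step: keep only items where weighted agreement is strong, and emit chosen/rejected pairs. Field names and the margin threshold are illustrative assumptions:

```python
def to_dpo_pairs(items, min_margin=0.7):
    """Convert consensus-labeled A/B items into DPO-style preference
    pairs, dropping contested items below the agreement threshold.
    (min_margin=0.7 and the field names are illustrative.)
    """
    pairs = []
    for it in items:
        if it["margin"] < min_margin:
            continue  # contested items belong in the disagreement report
        chosen, rejected = (
            (it["reply_a"], it["reply_b"])
            if it["winner"] == "A"
            else (it["reply_b"], it["reply_a"])
        )
        pairs.append({"prompt": it["prompt"],
                      "chosen": chosen,
                      "rejected": rejected})
    return pairs
```

Filtering by margin is a design choice: low-agreement items are often more valuable as disagreement structure than as training pairs.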
What you actually get back.
- Per-item consensus, weighted by calibrated reviewer judgment.
- Disagreement structure — where humans split, and along what lines.
- Per-domain reviewer calibration for downstream RLHF or DPO training.
- Audit trails: who answered, how confident, how their calibration was earned.
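The "disagreement structure" deliverable above can be made concrete with a small summary: raw counts per option plus the weight-adjusted entropy of the split. Input shapes are illustrative assumptions:

```python
import math
from collections import Counter, defaultdict

def disagreement_report(votes, weights):
    """Summarize where reviewers split: vote counts per option and the
    calibration-weighted entropy of the split in bits
    (0.0 = unanimous, 1.0 = an even two-way split).
    votes: {tasker_id: choice}; weights: {tasker_id: weight}  # illustrative
    """
    counts = Counter(votes.values())
    mass = defaultdict(float)
    for tasker, choice in votes.items():
        mass[choice] += weights.get(tasker, 0.0)
    total = sum(mass.values())
    entropy = -sum((m / total) * math.log2(m / total)
                   for m in mass.values() if m > 0)
    return {"counts": dict(counts), "entropy_bits": round(entropy, 3)}
```

A real report would also record *along what lines* reviewers split (domain, tier, confidence); this captures only the split itself.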