Expert data for code-native models

Ground truth for
code & SWE agents.

We build the expert human-data layer behind frontier coding models — SFT trajectories, RLHF preference data, agentic SWE evaluations, and verifiable benchmarks. Every artifact is reviewed by senior engineers, not crowdworkers.

SOC 2 Type II · Verifiable test harness · 40+ languages

cograde · eval-run #4127 · live

$ cograde eval --suite swe-agent --rubric strict

▸ task repo/auth-service#218 · multi-file patch

unit tests: 42/42 passed
regression suite: clean
reviewer rubric: 8.7 / 10
correctness: 94%
code quality: 88%
instruction-follow: 91%

✓ graded by 3 senior reviewers · κ=0.86

Powering data pipelines for frontier labs & AI-native teams

NEURAFORGE · Hyperion AI · stackmind · Lumina Labs · VECTORA · Orbit Systems

Capabilities

Four data products, one quality bar

Each engagement is staffed by working software engineers and calibrated against a strict, versioned rubric. No bulk crowdsourcing.

SFT & Instruction Data

High-signal supervised fine-tuning examples: multi-file edits, bug-fix pairs, test authoring, and CLI/terminal traces — each with a verifiable solution.

Repo-grounded tasks across 40+ languages
Unit-tested, reproducible solutions
Difficulty-tiered & deduplicated
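
As a rough illustration of the deduplication step, the sketch below drops any task whose text matches an earlier one after whitespace and case normalization. The function names and the normalization rule are assumptions made for this example; a production pipeline may well use fuzzier, near-duplicate matching.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash a task after collapsing whitespace and lowercasing (assumed rule)."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(tasks: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct fingerprint, preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for task in tasks:
        key = fingerprint(task)
        if key not in seen:
            seen.add(key)
            unique.append(task)
    return unique
```

Exact-match hashing is cheap and deterministic, which makes it a reasonable first pass before any similarity-based filtering.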

RLHF Preference Data

Calibrated pairwise & ranked preferences over model completions, with structured rationales engineers can actually audit and replay.

Rubric-anchored comparisons
Inter-annotator agreement tracked (κ)
Rationale captured for every choice

Agentic SWE Evaluation

End-to-end grading of coding agents on real repositories — tool use, multi-step trajectories, and environment outcomes scored against ground truth.

Sandboxed execution & trace capture
Step-level + outcome-level scoring
Failure-mode taxonomies
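
One way to picture step-level plus outcome-level scoring is a weighted blend of the two, as in the sketch below. The `Step` fields, the tool labels, and the `step_weight` default are hypothetical and chosen only for illustration; they are not the rubric itself.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str      # hypothetical label, e.g. "shell" or "edit"
    correct: bool  # reviewer judgment for this single step


def score_trajectory(steps: list[Step], tests_passed: bool,
                     step_weight: float = 0.3) -> dict[str, float]:
    """Blend the share of correct steps with a binary outcome score.

    step_weight is an assumed mixing weight, not a published rubric value.
    """
    step_score = sum(s.correct for s in steps) / len(steps) if steps else 0.0
    outcome_score = 1.0 if tests_passed else 0.0
    combined = step_weight * step_score + (1 - step_weight) * outcome_score
    return {"step": step_score, "outcome": outcome_score, "combined": combined}
```

Keeping the two scores separate in the result makes it possible to tell an agent that reached the right outcome through poor steps from one that worked cleanly but failed the final tests.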

Custom Benchmarks

Private, contamination-resistant benchmarks tailored to your stack — so you measure progress on the code that actually matters to you.

Held-out, leakage-checked sets
Versioned & rerunnable harness
Per-capability score breakdowns
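
A leakage check can be approximated by measuring how many of a task's word n-grams also occur in a reference corpus. The sketch below is one simple way to do that; the function names, whitespace tokenization, and the default n-gram length of 8 are assumptions for this example rather than the actual method.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in text (whitespace tokenization)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def leakage_ratio(task: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the task's n-grams that also appear in the reference corpus."""
    task_grams = ngrams(task, n)
    if not task_grams:
        return 0.0  # task is shorter than n tokens; nothing to compare
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(task_grams & corpus_grams) / len(task_grams)
```

A task whose ratio exceeds some chosen threshold would be flagged for manual review or dropped from the held-out set.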

Pipeline

Quality is a process, not a promise

A four-stage pipeline with measurement baked into every step — the same rigor you'd expect from a senior code review.

01

Scope & calibrate

We co-author the rubric with you, run a calibration batch, and lock the spec before scaling.

02

Expert annotation

Working engineers produce and execute solutions in sandboxed environments with full trace capture.

03

Multi-pass review

Independent senior reviewers grade each item; disagreements are adjudicated, and inter-reviewer agreement is tracked as κ.
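
The page does not say which κ statistic is used. With several reviewers per item, Fleiss' kappa is one common choice, so the sketch below implements it as an assumption. The input is a table where each row is an item and each column counts the reviewers who chose that category.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for a table of per-item category counts.

    counts[i][j] = number of reviewers who put item i in category j.
    Assumes every item was rated by the same number of reviewers (>= 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement, averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement from the overall category proportions.
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_categories)
    )
    if p_e == 1.0:
        return 1.0  # every rating fell in one category; agreement is trivial
    return (p_bar - p_e) / (1 - p_e)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean reviewers agree less often than chance would predict.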

04

Verify & deliver

Automated test harness confirms correctness; data ships with full provenance and metrics.

Benchmarks

Measure what your models actually do

Every dataset ships with a transparent scorecard. Track capability-level performance over time, compare model versions, and catch regressions before they ship.

Contamination-resistant. Held-out tasks are leakage-checked against public corpora.
Reproducible. A versioned harness reruns the exact same suite on demand.
Explainable. Drill from headline score into per-task traces and reviewer notes.

SWE-Agent Benchmark v3

held-out · 480 tasks · +4.2 vs v2

Bug fixing: 82%
Feature implementation: 74%
Multi-file refactor: 68%
Test authoring: 89%

Overall: 78.3 · Reviewer κ: 0.86 · Verified: 100%
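
To show how a scorecard like this could be used to catch regressions between model versions, here is a minimal sketch that compares per-capability scores and reports any drop larger than a tolerance. The function name and the one-point tolerance are assumptions, and the "previous" scores in the usage lines are invented placeholders, not real v2 results.

```python
def find_regressions(previous: dict[str, float], current: dict[str, float],
                     tolerance: float = 1.0) -> dict[str, float]:
    """Return the score change for capabilities that dropped by more than tolerance."""
    return {
        name: current[name] - previous[name]
        for name in previous
        if name in current and previous[name] - current[name] > tolerance
    }


# Current scores are taken from the scorecard above.
current_scores = {
    "Bug fixing": 82.0,
    "Feature implementation": 74.0,
    "Multi-file refactor": 68.0,
    "Test authoring": 89.0,
}
# Placeholder values for an earlier run, made up for this example only.
previous_scores = {
    "Bug fixing": 79.0,
    "Feature implementation": 70.0,
    "Multi-file refactor": 71.0,
    "Test authoring": 88.0,
}
regressions = find_regressions(previous_scores, current_scores)
```

Comparing at the capability level matters because a headline score can rise while one capability quietly gets worse.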

Join as a contributor

Get paid to make AI write better code

We're hiring working software engineers to author, solve, and review high-signal data for frontier coding models. Flexible, remote, and paid per accepted contribution — with a clear path to senior reviewer.

Premium pay. Competitive per-task rates that scale with difficulty and your review tier.
Fully flexible. Work from anywhere, on your own hours. Take batches when you want them.
Interesting work. Real repos, real bugs, agentic trajectories — not toy snippets.

100% remote · Weekly payouts · Async screening

Contributor application

Takes ~2 minutes. We review every application and respond within 3 business days.

Full name *

Email *

Location / timezone

Years of experience *

Primary languages *

GitHub / portfolio

Weekly availability

Resume / CV (PDF, DOC, or DOCX · max 5 MB)

What kind of work interests you most?

I agree to be contacted about contributor opportunities and consent to a short async coding screen. *

Ship a pilot dataset
in under a week.

Tell us your model, your stack, and the capability you're trying to move. We'll scope a calibration batch and send you a sample scorecard.

No spam. We reply to every serious inquiry within 24 hours.

Built for the labs' bar

SOC 2 Type II

IP & NDA protected

Vetted experts only

Full provenance

Reliable cadence

Free re-do guarantee
