Expert data for code-native models

Ground truth for
code & SWE agents.

We build the expert human-data layer behind frontier coding models — SFT trajectories, RLHF preference data, agentic SWE evaluations, and verifiable benchmarks. Every artifact is reviewed by senior engineers, not crowdworkers.

SOC 2 Type II · Verifiable test harness · 40+ languages
cograde · eval-run #4127 live
$ cograde eval --suite swe-agent --rubric strict
▸ task repo/auth-service#218 · multi-file patch
unit tests: 42/42 passed
regression suite: clean
reviewer rubric: 8.7 / 10
correctness: 94%
code quality: 88%
instruction-follow: 91%
graded by 3 senior reviewers · κ=0.86

Powering data pipelines for frontier labs & AI-native teams

NEURAFORGE · Hyperion AI · stackmind · Lumina Labs · VECTORA · Orbit Systems
Capabilities

Four data products, one quality bar

Each engagement is staffed by working software engineers and calibrated against a strict, versioned rubric. No bulk crowdsourcing.

SFT & Instruction Data

High-signal supervised fine-tuning examples: multi-file edits, bug-fix pairs, test authoring, and CLI/terminal traces — each with a verifiable solution.

  • Repo-grounded tasks across 40+ languages
  • Unit-tested, reproducible solutions
  • Difficulty-tiered & deduplicated

RLHF Preference Data

Calibrated pairwise & ranked preferences over model completions, with structured rationales engineers can actually audit and replay.

  • Rubric-anchored comparisons
  • Inter-annotator agreement tracked (κ)
  • Rationale captured for every choice
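
For readers unfamiliar with κ, here is a minimal sketch of Cohen's kappa for two annotators, one common way to measure agreement beyond chance. The page does not say which kappa variant is used (with three or more reviewers, Fleiss' kappa is a frequent choice), so treat the function and the sample labels below as illustrative assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical pairwise preferences: which completion each annotator preferred.
annotator_1 = ["A", "A", "B", "B", "A", "B", "A", "A"]
annotator_2 = ["A", "A", "B", "A", "A", "B", "B", "A"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Raw agreement in this sample is 6 of 8 items, but kappa comes out lower because both annotators pick "A" most of the time, so a good share of that agreement is expected by chance.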

Agentic SWE Evaluation

End-to-end grading of coding agents on real repositories — tool use, multi-step trajectories, and environment outcomes scored against ground truth.

  • Sandboxed execution & trace capture
  • Step-level + outcome-level scoring
  • Failure-mode taxonomies

Custom Benchmarks

Private, contamination-resistant benchmarks tailored to your stack — so you measure progress on the code that actually matters to you.

  • Held-out, leakage-checked sets
  • Versioned & rerunnable harness
  • Per-capability score breakdowns
98.6%

Acceptance rate on lab QA

1,200+

Vetted senior engineers

40+

Languages & frameworks

<72h

Median pilot turnaround

Pipeline

Quality is a process, not a promise

A four-stage pipeline with measurement baked into every step — the same rigor you'd expect from a senior code review.

01

Scope & calibrate

We co-author the rubric with you, run a calibration batch, and lock the spec before scaling.

02

Expert annotation

Working engineers produce and execute solutions in sandboxed environments with full trace capture.

03

Multi-pass review

Independent senior reviewers grade each item; disagreements are adjudicated, and inter-reviewer agreement is tracked as κ.

04

Verify & deliver

Automated test harness confirms correctness; data ships with full provenance and metrics.

Benchmarks

Measure what your models actually do

Every dataset ships with a transparent scorecard. Track capability-level performance over time, compare model versions, and catch regressions before they ship.

  • Contamination-resistant. Held-out tasks are leakage-checked against public corpora.

  • Reproducible. A versioned harness reruns the exact same suite on demand.

  • Explainable. Drill from headline score into per-task traces and reviewer notes.
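
The page does not describe how leakage checks are performed. One simple and widely used heuristic is n-gram overlap between a held-out task and known public text. The sketch below assumes whitespace tokenization, and the helper names, window size, and threshold are all hypothetical choices for illustration, not the harness's actual method.

```python
def ngrams(text, n):
    """Set of n-token windows from whitespace-tokenized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(task_text, corpus_docs, n=8):
    """Fraction of the task's n-grams that also occur in any corpus document."""
    task_grams = ngrams(task_text, n)
    if not task_grams:
        return 0.0  # task shorter than one window: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(task_grams & corpus_grams) / len(task_grams)


def is_leaked(task_text, corpus_docs, n=8, threshold=0.5):
    """Flag a task when at least `threshold` of its n-grams are public."""
    return overlap_ratio(task_text, corpus_docs, n) >= threshold


# Toy data; a small window (n=3) is used because the strings are short.
public_corpus = ["def add(a, b): return a + b"]
copied_task = "def add(a, b): return a + b"
fresh_task = "def mul(x, y): return x * y"
print(is_leaked(copied_task, public_corpus, n=3))
print(is_leaked(fresh_task, public_corpus, n=3))
```

Exact n-gram matching misses paraphrased or reformatted copies, so a production check would likely add normalization or fuzzy matching on top of something like this.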

SWE-Agent Benchmark v3

held-out · 480 tasks

+4.2 vs v2
Bug fixing: 82%
Feature implementation: 74%
Multi-file refactor: 68%
Test authoring: 89%
78.3 overall · 0.86 reviewer κ · 100% verified
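
The scorecard does not state how the overall figure is derived from the four category scores. As a sanity check, the sketch below computes a weighted mean with equal weights by default. Equal weighting is an assumption; the real harness may weight by task count or difficulty.

```python
# Category scores as shown on the scorecard above.
category_scores = {
    "Bug fixing": 82,
    "Feature implementation": 74,
    "Multi-file refactor": 68,
    "Test authoring": 89,
}


def overall_score(scores, weights=None):
    """Weighted mean of per-category scores; equal weights if none given."""
    if weights is None:
        weights = {name: 1.0 for name in scores}  # assumption: equal weights
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


unweighted = overall_score(category_scores)
print(unweighted)
```

With equal weights the mean is 78.25, which is close to the 78.3 headline but does not confirm how that number was actually produced.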
Trust & security

Built for the labs' bar

Frontier buyers are uncompromising on security, IP, and quality. So are we.

SOC 2 Type II

Audited controls, encryption at rest and in transit, and least-privilege access throughout.

IP & NDA protected

Per-project NDAs, segregated workspaces, and full chain-of-custody on every artifact.

Vetted experts only

Every annotator passes a live coding screen. No anonymous crowd labor, ever.

Full provenance

Every label carries authorship, review history, timestamps, and rubric version.
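
As an illustration of what such a provenance record could look like, here is a minimal sketch using Python dataclasses. The class names, field names, and sample values are hypothetical; only the four kinds of information (authorship, review history, timestamps, rubric version) come from the description above.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ReviewEvent:
    reviewer_id: str
    score: float
    reviewed_at: str  # ISO 8601 timestamp


@dataclass
class LabelProvenance:
    item_id: str
    author_id: str       # authorship
    rubric_version: str  # rubric the item was graded against
    created_at: str      # ISO 8601 timestamp
    reviews: list = field(default_factory=list)  # review history

    def add_review(self, reviewer_id, score, when=None):
        """Append a review event, timestamped now (UTC) unless `when` is given."""
        when = when or datetime.now(timezone.utc)
        self.reviews.append(ReviewEvent(reviewer_id, score, when.isoformat()))


# Hypothetical sample record.
record = LabelProvenance(
    item_id="task-0001",
    author_id="eng-042",
    rubric_version="v1.0",
    created_at=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc).isoformat(),
)
record.add_review("rev-007", 8.5, datetime(2024, 1, 16, 14, 0, tzinfo=timezone.utc))

# Serialize alongside the label so the history travels with the data.
serialized = json.dumps(asdict(record), sort_keys=True)
print(serialized)
```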

Reliable cadence

Predictable weekly delivery with live dashboards — no black-box timelines.

Free re-do guarantee

Anything below the agreed quality bar is reworked at no cost. Aligned incentives.

Join as a contributor

Get paid to make AI write better code

We're hiring working software engineers to author, solve, and review high-signal data for frontier coding models. Flexible, remote, and paid per accepted contribution — with a clear path to senior reviewer.

  • Premium pay. Competitive per-task rates that scale with difficulty and your review tier.

  • Fully flexible. Work from anywhere, on your own hours. Take batches when you want them.

  • Interesting work. Real repos, real bugs, agentic trajectories — not toy snippets.

100% remote · Weekly payouts · Async screening

Contributor application

Takes ~2 minutes. We review every application and respond within 3 business days.

Ship a pilot dataset
in under a week.

Tell us your model, your stack, and the capability you're trying to move. We'll scope a calibration batch and send you a sample scorecard.

No spam. We reply to every serious inquiry within 24 hours.