Cubit by Maiden Labs

How It Works

Public data. Transparent methodology. If the numbers matter, you should know how they're produced.

Source Data

Two public datasets. One describes what people do at work; the other measures what AI can actually do.

O*NET (Dept. of Labor)

The Occupational Information Network is the most widely used taxonomy of work in the United States. Maintained by the Department of Labor, it provides:

  • 923 occupations spanning the entire US economy
  • Task-level granularity: 10–25 discrete activities per occupation, each describing specific work performed in context
  • 120 standardized requirements: Skills, Abilities, and Knowledge domains that tasks draw upon
  • Survey-grounded ratings: importance and prevalence data from worker surveys

O*NET answers: What do people actually do at work, and what capabilities does that work require?

Stanford HELM

The Holistic Evaluation of Language Models, developed by Stanford's Center for Research on Foundation Models, provides standardized evaluation of LLM performance across dozens of cognitive domains:

  • 1.2 million+ AI evaluations spanning reasoning, knowledge, language, mathematics, coding, and more
  • 95 leading language models evaluated under consistent protocols
  • Transparent methodology: open data, reproducible results, expert-annotated success/failure labels
  • Strict correctness metrics rather than partial-credit, producing conservative capability estimates

HELM answers: What can current AI models actually do, and how well?

Why these two? O*NET has the best combination of granularity, coverage, and survey grounding, and its SOC codes link to US employment stats. HELM covers the cognitive breadth needed to map AI capability to occupational requirements.

Scoring Pipeline

Raw data becomes usable scores in four stages.

1. Task Annotation

Every occupational task (18,796 in total) is scored on four foundational dimensions, each on a 0–10 scale:

  • Procedural Intensity: How rule-based and repeatable the structure of the work is
  • Digital Accessibility: How accessible the task's inputs and outputs are to software systems
  • Physical Embodiment: Requirements for motor control, physical manipulation, and bodily presence
  • Socio-Emotional Depth: Level of interpersonal engagement, empathy, trust-building, and emotional labor

Each annotation includes a score and a structured natural language explanation, creating an audit trail. Annotations use LLM-based scoring calibrated against expert-reviewed golden-set examples with explicit rubrics and consistency validation.
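As a rough illustration of what one annotation record could look like, here is a minimal sketch. The class names, field names, and identifier are assumptions made for this example, not Cubit's actual schema; only the four dimensions, the 0–10 scale, and the score-plus-explanation pairing come from the description above.

```python
from dataclasses import dataclass

# The four dimension names come from the list above; the snake_case
# spelling is an assumption made for this sketch.
DIMENSIONS = (
    "procedural_intensity",
    "digital_accessibility",
    "physical_embodiment",
    "socio_emotional_depth",
)


@dataclass(frozen=True)
class DimensionScore:
    score: float       # 0-10 scale
    explanation: str   # natural-language rationale (the audit trail)

    def __post_init__(self):
        if not 0.0 <= self.score <= 10.0:
            raise ValueError(f"score {self.score} is outside the 0-10 scale")
        if not self.explanation.strip():
            raise ValueError("every score needs an explanation")


@dataclass(frozen=True)
class TaskAnnotation:
    task_id: str   # hypothetical identifier
    scores: dict   # dimension name -> DimensionScore

    def __post_init__(self):
        missing = set(DIMENSIONS) - set(self.scores)
        if missing:
            raise ValueError(f"missing dimensions: {sorted(missing)}")


# Illustrative values only, not real Cubit annotations.
example = TaskAnnotation(
    task_id="demo-task",
    scores={
        "procedural_intensity": DimensionScore(8.0, "Follows a fixed checklist."),
        "digital_accessibility": DimensionScore(9.0, "Inputs and outputs are digital forms."),
        "physical_embodiment": DimensionScore(1.0, "No physical manipulation needed."),
        "socio_emotional_depth": DimensionScore(2.0, "Little interpersonal contact."),
    },
)
```

Rejecting out-of-range scores and empty explanations at construction time is one simple way to keep every record auditable.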

2. Requirement Linking

Each task is connected to the specific Skills, Abilities, and Knowledge domains it requires. O*NET defines 120 such requirements (35 Skills, 52 Abilities, 33 Knowledge areas). Cubit builds a graph linking each task to its relevant requirements, with relevance scores indicating how central each requirement is to performing the task. Only links exceeding a relevance threshold are retained, eliminating noise from tangential connections.

Tasks are concrete; requirements are abstract. This bridge lets us connect measured AI performance (evaluated against abstract cognitive skills) to specific work activities (what decision-makers care about).
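The thresholding step can be sketched as follows. The 0.5 cutoff, the 0–1 relevance scale, the function name, and the task identifiers are all assumptions for illustration; the actual threshold is not stated here.

```python
from collections import defaultdict

# Assumed cutoff on an assumed 0-1 relevance scale.
RELEVANCE_THRESHOLD = 0.5


def build_requirement_graph(candidate_links, threshold=RELEVANCE_THRESHOLD):
    """Keep only task -> requirement links whose relevance exceeds the threshold."""
    graph = defaultdict(dict)
    for task_id, requirement, relevance in candidate_links:
        if relevance > threshold:
            graph[task_id][requirement] = relevance
    return dict(graph)


# Made-up candidate links; requirement names follow O*NET's vocabulary.
candidate_links = [
    ("task-1", "Mathematics", 0.9),
    ("task-1", "Written Comprehension", 0.7),
    ("task-1", "Fine Arts", 0.1),          # tangential, dropped
    ("task-2", "Active Listening", 0.8),
]
graph = build_requirement_graph(candidate_links)
```

The tangential "Fine Arts" link falls below the cutoff and is dropped, while the other three links survive.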

3. Capability Mapping

Each HELM benchmark evaluation is analyzed to identify which cognitive requirements it tests, creating a bridge from measured AI performance to the standardized requirement vocabulary. Benchmarks don't map one-to-one onto occupational requirements: a math benchmark tests multiple skills; a skill may be tested by multiple benchmarks. Cubit accounts for this many-to-many relationship, weighting by relevance.

AI performance is measured using the highest-performing model per skill across all mapped assessments, capturing peak demonstrated capability even when it is distributed across multiple models.
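One plausible reading of this step, sketched under stated assumptions: for each requirement, compute every model's relevance-weighted average over the benchmarks mapped to that requirement, then keep the best model. The weighting scheme, the function name, and all benchmark and model names below are hypothetical.

```python
def requirement_capability(benchmark_scores, benchmark_to_requirements):
    """Return {requirement: (best_score, best_model)}.

    benchmark_scores: {model: {benchmark: accuracy in [0, 1]}}
    benchmark_to_requirements: {benchmark: {requirement: relevance weight}}
    """
    requirements = {
        req for reqs in benchmark_to_requirements.values() for req in reqs
    }
    result = {}
    for req in sorted(requirements):
        best = None
        for model, scores in benchmark_scores.items():
            weighted_sum = weight_total = 0.0
            for bench, reqs in benchmark_to_requirements.items():
                if req in reqs and bench in scores:
                    weighted_sum += reqs[req] * scores[bench]
                    weight_total += reqs[req]
            if weight_total > 0:
                average = weighted_sum / weight_total
                if best is None or average > best[0]:
                    best = (average, model)
        if best is not None:
            result[req] = best
    return result


# Made-up scores and mappings for illustration only.
benchmark_scores = {
    "model_a": {"math_bench": 0.9, "reading_bench": 0.5},
    "model_b": {"math_bench": 0.6, "reading_bench": 0.9},
}
benchmark_to_requirements = {
    "math_bench": {"Mathematics": 1.0, "Deductive Reasoning": 0.5},
    "reading_bench": {"Deductive Reasoning": 0.5, "Reading Comprehension": 1.0},
}
capability = requirement_capability(benchmark_scores, benchmark_to_requirements)
```

In this toy data the best model differs by requirement (model_a for Mathematics, model_b for the other two), which is the "peak capability distributed across multiple models" situation described above.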

4. Score Computation & Aggregation

Benchmark performance flows through the requirement graph to produce capability estimates, which then aggregate to tasks and jobs. Dimension scores combine into three strategic pillars:

  • Structural Exposure: Procedural Intensity + Digital Accessibility
  • Human Imperative: Physical Embodiment + Socio-Emotional Depth
  • Demonstrated Capability: Requirement → benchmark graph

When rolling task scores up to jobs, Cubit weights by O*NET's survey-derived task importance, so profiles reflect the tasks that matter most to the role rather than an unweighted average that treats minor tasks the same as core responsibilities.
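A minimal sketch of an importance-weighted roll-up, assuming a plain weighted mean. The function name and the example values are illustrative; the exact aggregation formula is not given here.

```python
def weighted_job_score(task_scores, task_importance):
    """Importance-weighted mean of task scores for one job."""
    total_weight = sum(task_importance[t] for t in task_scores)
    if total_weight == 0:
        raise ValueError("task importance weights sum to zero")
    weighted_sum = sum(task_scores[t] * task_importance[t] for t in task_scores)
    return weighted_sum / total_weight


# Made-up job with one core task and one minor task.
task_scores = {"core_task": 8.0, "minor_task": 2.0}
task_importance = {"core_task": 4.5, "minor_task": 1.5}

weighted = weighted_job_score(task_scores, task_importance)   # (36 + 3) / 6
unweighted = sum(task_scores.values()) / len(task_scores)     # (8 + 2) / 2
```

Here the weighted score is 6.5 against an unweighted 5.0: the core task pulls the job profile toward its own score.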

The Four Zones

Structural Exposure and Demonstrated Capability combine into AI Exposure Potential. Cross that with Human Imperative and every task falls into one of four zones.

  • Automation (high AI Exposure, low Human Imperative): AI can access & perform the work, and human presence is not essential. Highest displacement risk. Examples: form processing, data entry, report generation.
  • Augmentation (high AI Exposure, high Human Imperative): AI handles the technical work, but human presence remains essential. Human-AI collaboration territory. Examples: medical diagnosis, legal research, financial advisory.
  • Status Quo (low AI Exposure, low Human Imperative): Not structured enough to automate, not especially human-essential. Varied cognitive or manual work. Examples: ad-hoc problem solving, facility maintenance, exploratory research.
  • Human-Centric (low AI Exposure, high Human Imperative): Too unstructured for AI, deeply dependent on human presence. The durable core of human work. Examples: surgery, crisis counseling, skilled trades, early childhood education.

Why blend both? Structure alone would flag work as “at risk” just because it's digital, even if no AI can do it yet. Capability alone would miss work AI could handle but can't access. The intersection is where real near-term pressure lives.
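The zone logic can be sketched like this. Two things are assumptions, not Cubit's documented method: the blend of Structural Exposure and Demonstrated Capability is shown as a geometric mean (chosen because it is high only when both inputs are high), and the high/low cutoff is placed at the midpoint of an assumed 0–10 scale.

```python
import math


def ai_exposure_potential(structural_exposure, demonstrated_capability):
    """Assumed blend: geometric mean, so a low value on either side drags it down."""
    return math.sqrt(structural_exposure * demonstrated_capability)


def classify_zone(ai_exposure, human_imperative, cutoff=5.0):
    """Map the two axes to one of the four zones; the 5.0 cutoff is an assumption."""
    high_exposure = ai_exposure >= cutoff
    high_human = human_imperative >= cutoff
    if high_exposure and not high_human:
        return "Automation"
    if high_exposure and high_human:
        return "Augmentation"
    if high_human:
        return "Human-Centric"
    return "Status Quo"


# Digital, structured work that no current model handles well (made-up values):
# structure alone would flag it, the blend does not.
blended = ai_exposure_potential(8.0, 2.0)
zone = classify_zone(blended, human_imperative=3.0)
```

With structure at 8 but capability at 2, the blended exposure is 4.0 and the task lands in Status Quo rather than Automation.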

Job-Level Scores

Task scores roll up into three job-level metrics, weighted by how important each task is to the role.

Automation Susceptibility

How much of the role's work is exposed to AI. Weighted by task importance.

Human-Centric Resilience

How strongly the role depends on human presence, judgment, and interpersonal skills.

Balanced Impact Score

Resilience minus susceptibility. A positive score means resilience outweighs susceptibility: the role is net-positive for human workers.

Most tools ask “Can AI do this job?” and get it wrong in both directions. Cubit requires three conditions for high risk: the work is structurally accessible, human presence isn't essential, and current AI can actually perform the required skills.
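The arithmetic and the three-condition gate are simple enough to sketch directly. The function names and the 5.0 cutoff on an assumed 0–10 scale are hypothetical; only the "resilience minus susceptibility" formula and the three conditions come from the text above.

```python
def balanced_impact(resilience, susceptibility):
    """Balanced Impact Score as described above: resilience minus susceptibility."""
    return resilience - susceptibility


def is_high_risk(structural_exposure, human_imperative, demonstrated_capability,
                 cutoff=5.0):
    """All three conditions must hold; the 5.0 cutoff is an assumption."""
    return (
        structural_exposure >= cutoff            # work is structurally accessible
        and human_imperative < cutoff            # human presence is not essential
        and demonstrated_capability >= cutoff    # current AI can perform the skills
    )


# Made-up pillar values for illustration.
score = balanced_impact(resilience=7.5, susceptibility=4.0)
accessible_but_not_capable = is_high_risk(8.0, 2.0, 3.0)
all_three_conditions = is_high_risk(8.0, 2.0, 9.0)
```

Failing any one condition is enough to keep a role out of the high-risk bucket, which is what separates this from a single "can AI do it?" score.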

Validation

Tested against every public dataset on occupational AI exposure. Three exist. Cubit predicts the findings of all three.

Metric | Cubit | GDPval (OpenAI) | RLI (CAIS) | ILO Global
Measures | Multi-dimensional automation susceptibility | Binary task completion | Freelance project automation | Cognitive GenAI exposure
Occupations | 923 | 44 | 10 projects | 427
Tasks scored | 18,796 | 220 | 10 | 2,861
Dimensions per task | 4 continuous | 1 binary | 1 holistic | 1 continuous
Composite pillars | 3 | 0 | 0 | 0
Access | REST API + SDK | File download | File download | PDF report

vs. ILO Global Index: 71% variance explained

The ILO measures cognitive GenAI exposure across 427 ISCO-08 occupations. Using the Bureau of Labor Statistics' published ISCO-08 ↔ SOC 2010 crosswalk, we match 387 of 427 ILO occupations to Cubit (90.6%).

Cubit's three pillars jointly explain 71% of ILO's variance, stable under 10-fold cross-validation with negligible shrinkage. At the group level, Cubit's predicted ranking matches ILO's actual ranking on 5 of 9 groups exactly, and 7 of 9 within one rank.
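The group-level comparison boils down to counting rank differences. A minimal sketch, with placeholder groups and ranks that are not the ILO or Cubit rankings:

```python
def rank_agreement(predicted_rank, actual_rank):
    """Count groups ranked identically and groups ranked within one position."""
    if predicted_rank.keys() != actual_rank.keys():
        raise ValueError("both rankings must cover the same groups")
    diffs = [abs(predicted_rank[g] - actual_rank[g]) for g in predicted_rank]
    exact = sum(d == 0 for d in diffs)
    within_one = sum(d <= 1 for d in diffs)
    return exact, within_one, len(diffs)


# Placeholder ranks for five made-up groups.
predicted = {"group_a": 1, "group_b": 2, "group_c": 3, "group_d": 4, "group_e": 5}
actual = {"group_a": 3, "group_b": 1, "group_c": 2, "group_d": 4, "group_e": 5}

exact, within_one, total = rank_agreement(predicted, actual)
```

Note that "within one rank" includes the exact matches, so the second count is always at least as large as the first.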

The relationship is strictly asymmetric. ILO's single score overlaps substantially with only one of Cubit's three pillars (Human Imperative) and is nearly orthogonal to the other two. Cubit's framework contains the information ILO measures; ILO cannot reconstruct Cubit's multi-dimensional profile.

Critically, collapsing Cubit into a single composite score captures almost none of ILO's variance: about 3%, against 71% for the three pillars. That 68-percentage-point jump indicates that single-score approaches destroy discriminatory information.

vs. OpenAI GDPval: 100% coverage, significant differentiation

GDPval provides 220 binary task evaluations across 44 occupations in 9 economic sectors. Cubit covers all 44 GDPval occupations plus 879 more. GDPval covers just 4.8% of Cubit's occupations, drawn from 13 of 22 major groups. Nine groups (Construction, Education, Transportation, and others) have zero representation.

Cubit's continuous scores significantly differentiate GDPval's sector classifications (F(8,34) = 2.31, p = 0.042, η² = 0.35, large effect size), confirmed by permutation testing. Pairwise concordance across all 821 cross-sector occupation pairs yields 71.7% agreement (p < 10⁻³⁶).

Within-sector variation in Cubit scores exceeds between-sector variation, demonstrating that sector-level analysis obscures more variation than it explains.
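For readers who want to check how η² relates to the F statistic and to the within-sector claim, here is a small pure-Python sketch. The toy groups are made up and are not GDPval or Cubit data.

```python
def eta_squared(groups):
    """Share of total variance explained by group membership (one-way ANOVA)."""
    values = [v for vals in groups.values() for v in vals]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in groups.values()
    )
    return ss_between / ss_total


def f_from_eta_squared(eta_sq, df_between, df_within):
    """F statistic implied by eta squared and the two degrees of freedom."""
    return (eta_sq / df_between) / ((1.0 - eta_sq) / df_within)


# Toy scores for three made-up sectors.
toy = {"a": [1.0, 2.0, 3.0], "b": [2.0, 3.0, 4.0], "c": [6.0, 7.0, 8.0]}
toy_eta = eta_squared(toy)                     # 42 / 48

# Plugging in the reported eta squared and degrees of freedom.
implied_f = f_from_eta_squared(0.35, 8, 34)
```

With η² = 0.35 and F(8, 34), the implied F is about 2.29, which agrees with the reported 2.31 once rounding of η² to two decimals is taken into account. An η² of 0.35 also means roughly 65% of the variance sits within sectors rather than between them.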

vs. CAIS Remote Labor Index: 2.8% converges with 2.5%

The RLI tested AI agents on 10 real freelance projects. The highest-performing agent achieved an automation rate of just 2.5%. We decompose each mapped occupation into Cubit-scored tasks (178 total across 9 unique SOC codes).

Under stringent criteria, Cubit independently predicts 5 of 178 tasks as fully automatable (2.8%), converging with the RLI's empirical 2.5%. The result is robust: tightening thresholds yields 2.2%, loosening yields 5.1%. Across that range, the predicted share stays at roughly 5% or below.
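A threshold sweep of this kind might look like the sketch below. The criteria (high Structural Exposure, high Demonstrated Capability, low Human Imperative), the cutoff values, and the five toy tasks are assumptions for illustration; the actual stringent criteria are not spelled out here.

```python
def fully_automatable_rate(tasks, exposure_min, capability_min, human_max):
    """Share of tasks that pass all three cutoffs at once."""
    hits = [
        t for t in tasks
        if t["structural_exposure"] >= exposure_min
        and t["demonstrated_capability"] >= capability_min
        and t["human_imperative"] <= human_max
    ]
    return len(hits) / len(tasks)


# Five made-up tasks, not the 178 tasks from the RLI mapping.
toy_tasks = [
    {"structural_exposure": 9.0, "demonstrated_capability": 9.0, "human_imperative": 1.0},
    {"structural_exposure": 8.0, "demonstrated_capability": 7.5, "human_imperative": 2.5},
    {"structural_exposure": 7.0, "demonstrated_capability": 8.0, "human_imperative": 4.0},
    {"structural_exposure": 9.0, "demonstrated_capability": 3.0, "human_imperative": 1.0},
    {"structural_exposure": 3.0, "demonstrated_capability": 9.0, "human_imperative": 8.0},
]

# (exposure_min, capability_min, human_max) for tightened, baseline, loosened.
settings = [(9.0, 9.0, 1.0), (8.0, 7.5, 2.5), (7.0, 7.0, 4.0)]
rates = [fully_automatable_rate(toy_tasks, *s) for s in settings]

# The headline figure itself is plain arithmetic: 5 of 178 tasks.
headline_pct = round(100 * 5 / 178, 1)
```

The rate can only rise as the cutoffs loosen, so reporting the tightened and loosened values brackets the baseline estimate.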

Zone assignments show strong face validity: 7 of 10 projects fall in the Automation zone (consistent with freelance platforms self-selecting for outsourceable digital work), while the two Human-Centric exceptions (interior design and music arrangement) are the two most expensive projects.

Developer Access

REST API with a Python SDK. Sync and async clients.

# pip install cubit-api
from cubit import CubitClient

with CubitClient("cubit_xxxxxxxxxxxx") as client:
    job = client.get_job("15-1252.00")
    print(f"{job['title']}: {job['scores']['balanced_impact_score']}")

# Async client for high-throughput workflows
import asyncio
from cubit import CubitAsyncClient

async def analyze_portfolio(soc_codes):
    async with CubitAsyncClient("cubit_xxxxxxxxxxxx") as client:
        jobs = await asyncio.gather(
            *[client.get_job(soc) for soc in soc_codes]
        )
    return jobs

Full Technical Paper

Complete methodology, validation details, and sensitivity analyses. Available on request.

Current data version: O*NET 29.0, HELM Lite v1.13.0, 2026-Q1 annotations. Updated quarterly as upstream sources release new data. Every score is auditable. Every benchmark is public.