Cubit by Maiden Labs

The Science Behind Cubit

Cubit is built on public, auditable data and a transparent methodology. This page describes our source data, scoring pipeline, and external validation — because trust in AI workforce intelligence starts with knowing exactly how the numbers are produced.

Foundation: Source Data

Cubit is built on two gold-standard public datasets, selected for their rigor, granularity, and complementary strengths. One describes what people do at work; the other measures what AI can actually do.

O*NET (Dept. of Labor)

The Occupational Information Network is the definitive taxonomy of work in the United States. Maintained by the Department of Labor, it provides:

  • 900+ occupations spanning the entire US economy
  • Task-level granularity: 10–25 discrete activities per occupation, each describing specific work performed in context
  • 120 standardized requirements: Skills, Abilities, and Knowledge domains that tasks draw upon
  • Survey-grounded ratings: importance and prevalence data from worker surveys

O*NET answers: What do people actually do at work, and what capabilities does that work require?

Stanford HELM

The Holistic Evaluation of Language Models, developed by Stanford's Center for Research on Foundation Models, provides standardized evaluation of LLM performance across dozens of cognitive domains:

  • 182,000+ evaluations spanning reasoning, knowledge, language, mathematics, coding, and more
  • 95 leading language models evaluated under consistent protocols
  • Transparent methodology: open data, reproducible results, expert-annotated success/failure labels
  • Strict correctness metrics rather than partial credit, producing conservative capability estimates

HELM answers: What can current AI models actually do, and how well?

Why these two sources? Alternative occupational taxonomies exist, but O*NET offers the best combination of granularity, coverage, and survey grounding — and its SOC job codes link directly to US employment statistics, enabling real-world application at scale. HELM offers the breadth of cognitive coverage needed to map AI capability to occupational requirements, with standardized methodology that enables temporal tracking as models improve.

From Raw Data to Actionable Scores

Raw data points are not insight. The value of Cubit lies in the proprietary process that transforms public source data into actionable intelligence. The pipeline has four stages.

Stage 1: Task Annotation

Every occupational task (18,796 in total) is scored on four foundational dimensions, each on a 0–10 scale:

  • Procedural Intensity: how rule-based and repeatable the structure of the work is
  • Digital Accessibility: how accessible the task's inputs and outputs are to software systems
  • Physical Embodiment: requirements for motor control, physical manipulation, and bodily presence
  • Socio-Emotional Depth: level of interpersonal engagement, empathy, trust-building, and emotional labor

Each annotation includes not just a score but a structured natural language explanation — creating an audit trail. Annotations use LLM-based scoring calibrated against expert-reviewed golden-set examples with explicit rubrics and consistency validation.
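
As a rough illustration of what a single annotation carries, the sketch below shows one possible record shape. The field names, identifier, and example values are ours, not Cubit's production schema.

from dataclasses import dataclass

@dataclass
class TaskAnnotation:
    task_id: str                  # O*NET task identifier (format here is hypothetical)
    procedural_intensity: float   # 0-10: how rule-based and repeatable the work is
    digital_accessibility: float  # 0-10: software access to the task's inputs and outputs
    physical_embodiment: float    # 0-10: motor control and bodily presence required
    socio_emotional_depth: float  # 0-10: empathy, trust-building, emotional labor
    rationale: str                # structured natural-language explanation (the audit trail)

example = TaskAnnotation(
    task_id="15-1252.00-T01",     # hypothetical identifier for illustration
    procedural_intensity=7.0,
    digital_accessibility=9.0,
    physical_embodiment=1.0,
    socio_emotional_depth=3.0,
    rationale="Fully digital, rule-guided work with limited physical or interpersonal load.",
)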

Stage 2: Requirement Linking

Each task is connected to the specific Skills, Abilities, and Knowledge domains it requires. O*NET defines approximately 120 such requirements (35 Skills, 52 Abilities, 33 Knowledge areas). Cubit builds a graph linking each task to its relevant requirements with relevance scores indicating how central each is to performing the task. Only links exceeding a relevance threshold are retained, eliminating noise from tangential connections.

Tasks are concrete; requirements are abstract. This bridge lets us connect measured AI performance (evaluated against abstract cognitive skills) to specific work activities (what decision-makers care about).
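
A minimal sketch of the thresholded linking step, assuming invented relevance values and an arbitrary 0.4 cutoff; the real threshold and link weights come from Cubit's pipeline.

RELEVANCE_THRESHOLD = 0.4  # assumed value, for illustration only

raw_links = {
    "task_A": {"Programming": 0.95, "Critical Thinking": 0.70, "Persuasion": 0.10},
    "task_B": {"Systems Analysis": 0.80, "Near Vision": 0.20},
}

# Keep only links above the relevance threshold, dropping tangential connections.
task_requirement_graph = {
    task: {req: rel for req, rel in links.items() if rel >= RELEVANCE_THRESHOLD}
    for task, links in raw_links.items()
}
# "Persuasion" and "Near Vision" fall away as noise.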

Stage 3: Capability Mapping

Each HELM benchmark evaluation is analyzed to identify which cognitive requirements it tests, creating a bridge from measured AI performance to the standardized requirement vocabulary. Benchmarks don't map one-to-one onto occupational requirements — a math benchmark tests multiple skills; a skill may be tested by multiple benchmarks. Cubit accounts for this many-to-many relationship, weighting by relevance.

AI performance is measured using the highest-performing model per skill across all mapped assessments, capturing peak demonstrated capability even when it is distributed across multiple models.
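
The sketch below shows one way the many-to-many mapping and the per-skill maximum could be computed; the benchmark names, scores, and relevance weights are invented for illustration.

# Each benchmark maps to one or more requirements with a relevance weight (many-to-many).
benchmark_to_requirements = {
    "math_reasoning": {"Mathematics": 1.0, "Critical Thinking": 0.6},
    "code_generation": {"Programming": 1.0, "Critical Thinking": 0.4},
}

# Best observed accuracy per benchmark across all evaluated models.
best_model_score = {"math_reasoning": 0.82, "code_generation": 0.74}

# Demonstrated capability per requirement: peak relevance-weighted performance,
# even when the peaks come from different models.
capability = {}
for bench, reqs in benchmark_to_requirements.items():
    for req, weight in reqs.items():
        capability[req] = max(capability.get(req, 0.0), best_model_score[bench] * weight)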

Stage 4: Score Computation & Aggregation

Benchmark performance flows through the requirement graph to produce capability estimates, which then aggregate to tasks and jobs. Dimension scores combine into three strategic pillars:

  • Structural Exposure: Procedural Intensity + Digital Accessibility
  • Human Imperative: Physical Embodiment + Socio-Emotional Depth
  • Demonstrated Capability: requirement → benchmark graph

When rolling task scores up to jobs, Cubit weights by O*NET's survey-derived task importance — so profiles reflect what workers actually spend time on, not unweighted averages that treat minor tasks equally with core responsibilities.
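
As a simplified sketch of this stage, the snippet below treats each pillar as a plain average of its two dimensions and the job rollup as an importance-weighted mean. Cubit's actual weighting is proprietary and may differ; the task IDs and values are illustrative.

def pillars(dims):
    # dims holds the four 0-10 dimension scores for one task.
    return {
        "structural_exposure": (dims["procedural_intensity"] + dims["digital_accessibility"]) / 2,
        "human_imperative": (dims["physical_embodiment"] + dims["socio_emotional_depth"]) / 2,
    }

def job_rollup(task_scores, importance):
    # Importance-weighted mean: core responsibilities outweigh minor tasks.
    total = sum(importance[t] for t in task_scores)
    return sum(score * importance[t] for t, score in task_scores.items()) / total

job_level = job_rollup({"T01": 7.4, "T02": 3.1}, {"T01": 4.5, "T02": 1.2})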

The Automation Landscape

Cubit blends Structural Exposure with Demonstrated Capability into a composite measure called AI Exposure Potential, capturing work that is both accessible and performable by AI. By crossing AI Exposure Potential with Human Imperative, every task falls into one of four strategic zones.

Automation (high AI Exposure, low Human Imperative)

AI can access and perform the work, and human presence is not essential. Highest displacement risk. Examples: form processing, data entry, report generation.

Augmentation (high AI Exposure, high Human Imperative)

AI handles the technical work, but human presence remains essential. Human-AI collaboration territory. Examples: medical diagnosis, legal research, financial advisory.

Status Quo (low AI Exposure, low Human Imperative)

Not structured enough to automate, not especially human-essential. Varied cognitive or manual work. Examples: ad-hoc problem solving, facility maintenance, exploratory research.

Human-Centric (low AI Exposure, high Human Imperative)

Too unstructured for AI, deeply dependent on human presence. The durable core of human work. Examples: surgery, crisis counseling, skilled trades, early childhood education.

Why blend Structural Exposure with Demonstrated Capability? Pure structural analysis would flag work as “at risk” simply because it's digital and procedural, even if no AI system can perform the required skills today. Pure capability analysis would miss work that AI could do in theory but cannot access due to analog interfaces.

By combining both signals, Cubit identifies work that is both accessible and performable: the intersection representing genuine near-term pressure. This is more accurate, more nuanced, and more actionable than binary “at risk / not at risk” classifications.
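
A minimal classification sketch, assuming a 0-10 scale and a simple midpoint cutoff of 5.0; Cubit's calibrated thresholds are not published here.

def zone(ai_exposure_potential, human_imperative, cutoff=5.0):
    exposed = ai_exposure_potential >= cutoff
    human_essential = human_imperative >= cutoff
    if exposed and not human_essential:
        return "Automation"
    if exposed and human_essential:
        return "Augmentation"
    if not exposed and human_essential:
        return "Human-Centric"
    return "Status Quo"

zone(8.5, 2.0)   # -> "Automation"
zone(3.0, 9.0)   # -> "Human-Centric"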

From Tasks to Jobs

While task-level analysis provides the foundation, decision-makers need job-level summaries. Cubit aggregates task scores using importance-weighted methods into three composite metrics designed for practical decision-making.

Automation Susceptibility

Importance-weighted average of task-level AI Exposure Potential across the role. Measures overall vulnerability to AI automation.

Human-Centric Resilience

Combines Human Imperative with dispositional factors — work styles emphasizing adaptability, interpersonal orientation, and situational judgment. Protection beyond the task itself.

Balanced Impact Score

Net positioning: resilience minus susceptibility. Positive values indicate a role is net-positive for human workers as AI advances.
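
The sketch below shows how the three composites relate to one another, under the simplifying assumptions that all inputs share a comparable scale and that the dispositional work-style contribution is a single placeholder term; the real inputs come from Cubit's importance-weighted aggregation.

def job_composites(task_exposure, task_imperative, importance, work_style_bonus=0.0):
    w = sum(importance.values())
    susceptibility = sum(task_exposure[t] * importance[t] for t in task_exposure) / w
    resilience = sum(task_imperative[t] * importance[t] for t in task_imperative) / w + work_style_bonus
    return {
        "automation_susceptibility": susceptibility,
        "human_centric_resilience": resilience,
        "balanced_impact_score": resilience - susceptibility,   # positive = net-positive for workers
    }

scores = job_composites(
    task_exposure={"T01": 7.4, "T02": 3.1},
    task_imperative={"T01": 2.0, "T02": 6.5},
    importance={"T01": 4.5, "T02": 1.2},
    work_style_bonus=0.5,   # placeholder for dispositional work-style factors
)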

Traditional assessments ask “Can AI do this job?” and produce systematic errors in both directions: false positives (flagging work as “at risk” while ignoring that patients want human doctors) and false negatives (classifying work as “safe” while ignoring that AI already performs many judgment-intensive skills). Cubit avoids both by requiring three conditions for high displacement risk: the work must be structurally accessible, human presence must not be essential, and current AI must actually be capable of the required skills.

External Validation

Cubit has been tested against every publicly available dataset that addresses occupational AI exposure. Three exist. Cubit's scores predict the findings of all three, with statistically significant results across each.

  • Cubit: measures multi-dimensional automation susceptibility; 923 occupations; 18,796 tasks scored; 4 continuous dimensions per task; 3 composite pillars; access via REST API + SDK
  • GDPval (OpenAI): measures binary task completion; 44 occupations; 220 tasks scored; 1 binary dimension per task; 0 composite pillars; access via file download
  • RLI (CAIS): measures freelance project automation; 10 projects; 10 tasks scored; 1 holistic dimension per task; 0 composite pillars; access via file download
  • ILO Global: measures cognitive GenAI exposure; 427 occupations; 2,861 tasks scored; 1 continuous dimension per task; 0 composite pillars; access via PDF report

vs. ILO Global Index — 71% variance explained

The ILO measures cognitive GenAI exposure across 427 ISCO-08 occupations. Using the Bureau of Labor Statistics' published ISCO-08 ↔ SOC 2010 crosswalk, we match 387 of 427 ILO occupations to Cubit (90.6%).

Cubit's three pillars jointly explain 71% of ILO's variance, stable under 10-fold cross-validation with negligible shrinkage. At the group level, Cubit's predicted ranking matches ILO's actual ranking on 5 of 9 groups exactly, and 7 of 9 within one rank.

The relationship is strictly asymmetric. ILO's single score overlaps substantially with only one of Cubit's three dimensions (Human Imperative) and is nearly orthogonal to the other two. Cubit's framework contains the information ILO measures; ILO cannot reconstruct Cubit's multi-dimensional profile.

Critically, collapsing Cubit into a single composite score captures almost none of ILO's variance — the 68-percentage-point jump from composite to three pillars proves that single-score approaches destroy discriminatory information.
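
For readers who want to run this kind of check on their own matched data, the sketch below applies the same style of analysis to synthetic stand-in values (387 rows, three pillar columns, one ILO column). Only the procedure carries over, not the reported numbers; column names are illustrative.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Stand-in data shaped like the study: 387 matched occupations, three pillar scores.
df = pd.DataFrame(
    rng.uniform(0, 10, size=(387, 3)),
    columns=["structural_exposure", "human_imperative", "demonstrated_capability"],
)
df["ilo_exposure"] = rng.uniform(0, 1, size=387)   # placeholder target

X = df[["structural_exposure", "human_imperative", "demonstrated_capability"]]
y = df["ilo_exposure"]

model = LinearRegression().fit(X, y)
print("in-sample R^2:", model.score(X, y))
print("10-fold CV R^2:", cross_val_score(model, X, y, cv=10, scoring="r2").mean())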

vs. OpenAI GDPval — 100% coverage, significant differentiation

GDPval provides 220 binary task evaluations across 44 occupations in 9 economic sectors. Cubit covers all 44 GDPval occupations plus 879 more. GDPval covers just 4.8% of Cubit's occupations, drawn from 13 of 22 major groups — nine groups (Construction, Education, Transportation, and others) have zero representation.

Cubit's continuous scores significantly differentiate GDPval's sector classifications (F(8,34) = 2.31, p = 0.042, η² = 0.35 — large effect size), confirmed by permutation testing. Pairwise concordance across all 821 cross-sector occupation pairs yields 71.7% agreement (p < 10⁻³⁶).

Within-sector variation in Cubit scores exceeds between-sector variation, demonstrating that sector-level analysis obscures more variation than it explains.
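
A sketch of the permutation-style check described above, using synthetic stand-in scores and sector labels: the observed one-way ANOVA F statistic is compared against F statistics computed under shuffled sector labels.

import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
scores = rng.uniform(0, 10, size=44)            # stand-in Cubit scores, one per occupation
sectors = rng.permutation(np.arange(44) % 9)    # stand-in assignment to 9 sectors

def f_stat(values, labels):
    return f_oneway(*[values[labels == g] for g in np.unique(labels)]).statistic

observed = f_stat(scores, sectors)
null = np.array([f_stat(scores, rng.permutation(sectors)) for _ in range(10_000)])
p_perm = (1 + (null >= observed).sum()) / (1 + len(null))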

vs. CAIS Remote Labor Index — 2.8% converges with 2.5%

The RLI tested AI agents on 10 real freelance projects. The highest-performing agent achieved an automation rate of just 2.5%. We decompose each mapped occupation into Cubit-scored tasks (178 total across 9 unique SOC codes).

Under stringent criteria, Cubit independently predicts 5 of 178 tasks as fully automatable (2.8%), converging with the RLI's empirical 2.5%. The result is robust: tightening thresholds yields 2.2%, loosening yields 5.1%. At any stringent threshold, fewer than 5% of tasks qualify.
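
One way to read the robustness claim is as a threshold sweep like the sketch below; the cutoffs and toy task scores are illustrative, not Cubit's stringent criteria, and the real analysis runs over the 178 Cubit-scored tasks.

tasks = [
    {"ai_exposure_potential": 9.1, "human_imperative": 1.2},
    {"ai_exposure_potential": 6.4, "human_imperative": 4.0},
    {"ai_exposure_potential": 8.3, "human_imperative": 0.8},
]

def automatable_share(tasks, exposure_cut, imperative_cap):
    # Count tasks that clear a "fully automatable" bar under the given cutoffs.
    hits = sum(
        1 for t in tasks
        if t["ai_exposure_potential"] >= exposure_cut and t["human_imperative"] <= imperative_cap
    )
    return hits / len(tasks)

for cuts in [(9.0, 1.0), (8.0, 2.0), (7.0, 3.0)]:   # tighter -> looser
    print(cuts, automatable_share(tasks, *cuts))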

Zone assignments show strong face validity: 7 of 10 projects fall in the Automation zone (consistent with freelance platforms self-selecting for outsourceable digital work), while the two Human-Centric exceptions (interior design and music arrangement) are the two most expensive projects.

Developer Access

RESTful JSON API with an official Python SDK. Typed methods, structured exceptions, sync and async clients.

# pip install cubit-api

from cubit import CubitClient

with CubitClient("cubit_xxxxxxxxxxxx") as client:
    job = client.get_job("15-1252.00")
    print(f"{job['title']}: {job['scores']['balanced_impact_score']}")

# Async client for high-throughput workflows

import asyncio
from cubit import CubitAsyncClient

async def analyze_portfolio(soc_codes):
    async with CubitAsyncClient("cubit_xxxxxxxxxxxx") as client:
        jobs = await asyncio.gather(
            *[client.get_job(soc) for soc in soc_codes]
        )
        return jobs
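
One way to invoke the async helper (the SOC codes here are just examples):

results = asyncio.run(analyze_portfolio(["15-1252.00", "29-1141.00"]))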

Full Technical Paper

The complete methodology, all validation appendices with statistical details, robustness checks, sensitivity analyses, and additional demonstrations are documented in a comprehensive technical paper. Available to qualified prospects upon request.

Current data version: O*NET 29.0, HELM Lite v1.13.0, 2026-Q1 annotations. Updated quarterly as upstream sources release new data. Every score is auditable. Every benchmark is public.