How we evaluate talent in the AI era

In 2026, Anthropic published a randomized controlled trial that quantified something many hiring managers had suspected: developers who used AI for conceptual inquiry — asking "why does this work?" and "what are the tradeoffs?" — scored 65%+ on comprehension tests. Those who delegated code generation scored below 40%. Same tools, same tasks, completely different outcomes.

A year earlier, METR's RCT found that experienced developers using AI completed tasks 19% slower than without it, while believing they were 20% faster. Perception and reality did not merely differ in size. They pointed in opposite directions.

These findings crystallized what we had been building toward at NouSpark. Traditional technical assessments evaluate output: does the code compile, do the tests pass, is the architecture sound? For decades, output was a reliable proxy for understanding. AI broke that proxy. A junior developer with Claude or Cursor can now produce code indistinguishable from what a senior engineer would write. The output is identical. The understanding behind it is not.

This post describes the evaluation methodology we developed to measure the thing that actually matters: how engineers think and work with AI, not just what they produce.

The evaluation pipeline

NouSpark is async and environment-agnostic. Candidates work in their own setup — Claude Code, Cursor, ChatGPT, Copilot, whatever they actually use at work. They complete a task, then submit their result alongside an export of their AI chat log. We evaluate both.

The chat log is the core signal. It is a structured record of how a candidate thinks, not just what they produce. We parse logs from any major AI tool — Claude Code JSONL, ChatGPT JSON exports, Cursor SQLite databases, Copilot Chat JSON — and normalize them into a common format: an ordered sequence of user and assistant messages with timestamps and indices.

Candidate messages are preserved in full. AI assistant messages are truncated to 500 characters. We care about what the candidate said, not the AI's complete response — though we retain message length metadata for review-time analysis.
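To make the normalized format concrete, here is a minimal sketch in Python. The field names, the role labels, and the tuple-based input are illustrative assumptions, not NouSpark's actual schema. Only the 500-character limit and the retained length metadata come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple

AI_TRUNCATE_CHARS = 500  # limit stated in the text; everything else is assumed


@dataclass(frozen=True)
class Message:
    index: int                  # position in the conversation (0-based, assumed)
    role: str                   # "user" (candidate) or "assistant" (AI)
    text: str                   # full for candidates, truncated for the AI
    original_length: int        # length before truncation, kept as metadata
    timestamp: Optional[datetime] = None


def normalize(raw_turns: List[Tuple[str, str, Optional[datetime]]]) -> List[Message]:
    """Convert (role, text, timestamp) tuples into ordered Message records."""
    messages = []
    for index, (role, text, timestamp) in enumerate(raw_turns):
        if role not in ("user", "assistant"):
            raise ValueError(f"unknown role: {role!r}")
        # Candidate text is preserved in full; AI text is cut to the limit.
        kept = text if role == "user" else text[:AI_TRUNCATE_CHARS]
        messages.append(Message(index, role, kept, len(text), timestamp))
    return messages
```

A tool-specific parser (for Claude Code JSONL, a ChatGPT export, and so on) would sit in front of `normalize` and produce the tuples it expects.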

The normalized chat then passes through three parallel analysis passes, each evaluating two orthogonal dimensions. A final synthesis pass takes all results, applies role-specific weights, and produces a scored report with evidence. Four LLM calls total. Wall time: 15–30 seconds.

Three parallel passes

We deliberately separated evaluation into three independent passes rather than a single monolithic prompt. Each pass receives the full chat and evaluates two dimensions. Running the passes in parallel reduces latency, and keeping them independent prevents an assessment in one pass from biasing the dimensions scored in another.

Pass 1 evaluates Problem Decomposition and Iteration Quality — how the candidate structured the work and how they refined it over time. Pass 2 evaluates Quality Control and AI Fluency — whether they verified output and how skillfully they steered AI. Pass 3 evaluates Depth of Understanding and Independence — whether they actually understand what they're building and what they contributed beyond prompting.

Each pass returns a score (0–100) per dimension, a 2–3 sentence reason with turn references, specific evidence quotes from the chat with turn numbers, and flags categorized as positive, warning, or red_flag — each pointing to a specific moment in the conversation.
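The shape of a single dimension's result could look roughly like the sketch below. The class and field names are hypothetical. The 0 to 100 score range and the three flag categories are taken from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

FLAG_KINDS = {"positive", "warning", "red_flag"}  # categories named in the text


@dataclass
class Flag:
    kind: str   # one of FLAG_KINDS
    turn: int   # the conversation turn the flag points to
    note: str   # short description of what happened at that turn

    def __post_init__(self):
        if self.kind not in FLAG_KINDS:
            raise ValueError(f"unknown flag kind: {self.kind!r}")


@dataclass
class DimensionResult:
    dimension: str
    score: int                                   # 0 to 100
    reason: str                                  # 2-3 sentences with turn references
    evidence: List[Tuple[int, str]] = field(default_factory=list)  # (turn, quote)
    flags: List[Flag] = field(default_factory=list)

    def __post_init__(self):
        if not 0 <= self.score <= 100:
            raise ValueError("score must be between 0 and 100")
```

Validating in `__post_init__` is one simple way to reject a malformed model response before it reaches the synthesis pass.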

The synthesis pass then takes all six dimension scores, applies role-specific weights, and produces the final report: overall score, candidate pattern classification, a summary for the hiring manager, top strengths and gaps with linked evidence, a hiring recommendation, and suggested follow-up interview questions.
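The four-call flow described above can be sketched with `asyncio`. This is an illustration under stated assumptions: `run_pass` and `synthesize` are hypothetical stand-ins for the LLM calls, the snake_case dimension identifiers are invented for the example, and the placeholder return values carry no real scores.

```python
import asyncio

# Pass-to-dimension grouping as described in the text; identifiers are assumed.
PASSES = {
    "pass_1": ("problem_decomposition", "iteration_quality"),
    "pass_2": ("quality_control", "ai_fluency"),
    "pass_3": ("depth_of_understanding", "independence"),
}


async def run_pass(name, dimensions, chat):
    """Stand-in for one analysis LLM call; returns placeholder results."""
    await asyncio.sleep(0)  # a real implementation would await the model here
    return {dimension: None for dimension in dimensions}


async def synthesize(pass_results, role):
    """Stand-in for the fourth LLM call that builds the final report."""
    await asyncio.sleep(0)
    merged = {}
    for result in pass_results:
        merged.update(result)
    return {"role": role, "dimensions": merged}


async def evaluate(chat, role):
    # The three passes are independent, so they can run concurrently.
    results = await asyncio.gather(
        *(run_pass(name, dims, chat) for name, dims in PASSES.items())
    )
    # Synthesis needs all six dimension results, so it runs last.
    return await synthesize(results, role)
```

Calling `asyncio.run(evaluate(chat, "engineering"))` returns a report dictionary with one entry per dimension. With real model calls, wall time is bounded by the slowest analysis pass plus the synthesis call rather than the sum of all four.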

Six evaluation dimensions

Each dimension targets a distinct cognitive behavior observable in chat logs. We arrived at these six through a combination of published research (Barke et al., Anthropic's RCT, the PEEM framework) and our own analysis of several hundred candidate sessions.

1. Problem decomposition

What it measures: how the candidate breaks down a task before and while working with AI.

The strongest signal here is the first message. Does the candidate dump the entire spec — "do everything" — or do they provide structured context with constraints? We look at the number of distinct subtasks identified, whether the candidate tackles them sequentially, and whether they provide requirements and constraints upfront.

For engineering roles, we focus on technical decomposition: architecture decisions, module boundaries, data flow. For non-technical roles, we focus on clarity of requirements, goal explanation, and logical step ordering.

2. Quality control

What it measures: whether the candidate verifies AI output before accepting it.

This is where timestamps become critical. A pause of less than 10 seconds after a complex code response almost certainly means the candidate did not read it. That is rubber-stamping. A pause of 1–5 minutes suggests actual review. Five to thirty minutes suggests they tested it in their IDE.
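The timing bands above translate into a small classifier. The sketch below is illustrative: the label names are invented, the 5-minute boundary is assigned to the "tested" band by assumption, and pauses the text does not cover (10 to 60 seconds, or longer than 30 minutes) are returned as "unclassified" rather than guessed. It also assumes the caller has already decided that the AI response was complex enough for timing to matter.

```python
from datetime import datetime, timedelta


def classify_pause(ai_time: datetime, candidate_time: datetime) -> str:
    """Label the gap between an AI response and the candidate's next message."""
    seconds = (candidate_time - ai_time).total_seconds()
    if seconds < 0:
        raise ValueError("candidate message precedes the AI response")
    if seconds < 10:
        return "rubber_stamp"       # too fast to have read complex code
    if 60 <= seconds < 300:
        return "likely_review"      # 1 to 5 minutes
    if 300 <= seconds <= 1800:
        return "likely_tested"      # 5 to 30 minutes
    return "unclassified"           # bands the text does not define


# Example: a reply 4 seconds after the AI response.
start = datetime(2026, 1, 15, 10, 0, 0)
label = classify_pause(start, start + timedelta(seconds=4))
```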

Beyond timing, we track the correction ratio (corrections divided by total candidate turns), whether the candidate catches bugs or issues in AI responses, whether they challenge the AI's approach, and whether they reference specific code from AI output — a strong indicator that they actually read it.
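The correction ratio is simple arithmetic once candidate turns have been tagged. A minimal sketch, assuming each candidate turn already carries a string tag such as "correction" (the tagging step itself is not shown):

```python
from typing import List


def correction_ratio(candidate_tags: List[str]) -> float:
    """Corrections divided by total candidate turns; 0.0 for an empty session."""
    if not candidate_tags:
        return 0.0
    corrections = sum(1 for tag in candidate_tags if tag == "correction")
    return corrections / len(candidate_tags)


# Example: two corrections across four candidate turns gives 0.5.
ratio = correction_ratio(["instruction", "correction", "approval", "correction"])
```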

For engineers, this dimension is critical and is weighted at 25%. Accepting 47 lines of code in 3 seconds, with no time to read them, is a red flag. For non-technical roles, it carries minimal weight (5%). A marketer who doesn't review generated code is behaving normally: they care about the output, not the implementation.

3. Depth of understanding

What it measures: does the candidate actually understand what they are building?

"Why" questions are the single strongest predictor we have, consistent with Anthropic's RCT findings. Beyond explicit questions, we look for demonstrations of understanding in the candidate's own messages: mentions of edge cases, error handling, performance considerations, testing strategies, and tradeoff discussions between approaches.

For engineers, we expect deep technical reasoning — not just what to build, but why one approach is better than another. For non-technical roles, we look for problem domain understanding: knowing what the system should do, even if they can't evaluate the code that does it.

4. Iteration quality

What it measures: how the candidate refines AI output through conversation.

We track refinement chains: sequences of instruction → AI response → refinement → AI response → refinement. Chain depth of 1 means one-shot prompting. Depth of 3+ indicates deliberate iteration. More importantly, we measure specificity growth within chains — whether each refinement gets more precise than the previous one, or whether the candidate repeats the same vague instruction hoping for better output.

The refinement ratio — refinement turns divided by total turns — is a useful aggregate metric. This dimension is weighted equally across roles.
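Both metrics can be computed from the sequence of candidate turn tags. The sketch below rests on two assumptions that the text leaves open: a chain starts at an instruction and grows by one for each refinement that follows it directly (any other turn type ends the chain), and the refinement ratio is taken over candidate turns only.

```python
from typing import List


def max_chain_depth(candidate_tags: List[str]) -> int:
    """Longest run of one instruction followed by consecutive refinements."""
    best = current = 0
    for tag in candidate_tags:
        if tag == "instruction":
            current = 1                 # a new chain starts at depth 1
        elif tag == "refinement" and current > 0:
            current += 1                # each refinement deepens the chain
        else:
            current = 0                 # any other turn type ends the chain
        best = max(best, current)
    return best


def refinement_ratio(candidate_tags: List[str]) -> float:
    """Refinement turns divided by total candidate turns."""
    if not candidate_tags:
        return 0.0
    return candidate_tags.count("refinement") / len(candidate_tags)


tags = ["instruction", "refinement", "refinement", "approval", "instruction"]
depth = max_chain_depth(tags)    # 3: one instruction plus two refinements
share = refinement_ratio(tags)   # 0.4: two refinements out of five turns
```

Measuring specificity growth within a chain is a judgment about content rather than counting, so it is not something this kind of tally can capture.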

5. AI fluency

What it measures: how skillfully the candidate uses AI as a tool.

Signals include providing good context and constraints, using AI for appropriate tasks (not trivially simple, not full delegation), maintaining control of the overall direction, and combining their own knowledge with AI capabilities.

One specific signal stands out: setting collaboration terms upfront. Only about 30% of candidates do this — establishing the scope, approach, or constraints before diving in. It is a strong predictor of session quality across all roles.

For engineers, AI fluency carries lower weight (10%) — it is an expected competency. For non-technical roles, it is a key dimension (25%). If a marketer or PM's primary value-add is working with AI to produce output, fluency is their core skill.

6. Independence

What it measures: the candidate's own contribution beyond prompting.

The core question: could this task have been completed by anyone copy-pasting the requirements into ChatGPT? If yes, the candidate demonstrated no unique value. We look for candidates who add their own ideas, constraints, and domain knowledge — who write code or logic themselves (for engineers), or who bring specific business requirements and edge cases (for non-technical roles).

Weighted equally at 10% across roles.

Role-based weighting

The same six dimensions are evaluated for every candidate, but the weights differ by role. Engineering roles weight Quality Control and Depth of Understanding at 25% each — these are the dimensions where AI creates the most dangerous blind spots. Non-technical roles weight Problem Decomposition at 30% and AI Fluency at 25% — clarity of communication and tool mastery matter more than code review.

The reasoning is straightforward. An engineer who accepts complex code without reading it is shipping bugs. A non-technical hire who can't articulate what they need from AI is bottlenecking their team. Different roles, different failure modes, different weights.
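As a worked illustration, the weights can be written as a table and applied as a weighted average. Two caveats apply. First, the post states only some of the weights: the values marked "assumed" below are filled in so that each role sums to 100% and Iteration Quality stays equal across roles, and they are not confirmed figures. Second, a plain weighted average is an assumption about how the synthesis pass combines scores.

```python
ROLE_WEIGHTS = {
    "engineering": {
        "problem_decomposition": 0.15,   # assumed
        "iteration_quality": 0.15,       # assumed
        "quality_control": 0.25,         # stated
        "ai_fluency": 0.10,              # stated
        "depth_of_understanding": 0.25,  # stated
        "independence": 0.10,            # stated
    },
    "non_technical": {
        "problem_decomposition": 0.30,   # stated
        "iteration_quality": 0.15,       # assumed
        "quality_control": 0.05,         # stated
        "ai_fluency": 0.25,              # stated
        "depth_of_understanding": 0.15,  # assumed
        "independence": 0.10,            # stated
    },
}


def overall_score(scores: dict, role: str) -> float:
    """Weighted average of six 0-100 dimension scores for the given role."""
    weights = ROLE_WEIGHTS[role]
    if set(scores) != set(weights):
        raise ValueError("scores must cover exactly the six dimensions")
    return sum(scores[name] * weight for name, weight in weights.items())


scores = {
    "problem_decomposition": 80,
    "iteration_quality": 70,
    "quality_control": 34,
    "ai_fluency": 60,
    "depth_of_understanding": 91,
    "independence": 50,
}
engineering_total = round(overall_score(scores, "engineering"), 2)
non_technical_total = round(overall_score(scores, "non_technical"), 2)
```

Under these assumed weights, the same session scores 64.75 for an engineering role and 69.85 for a non-technical one, because the weak Quality Control score counts five times as much for engineers.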

Candidate pattern classification

Beyond dimension scores, we classify each candidate into one of four behavioral patterns based on the overall shape of their session.

One-shot: 1–2 messages, copied AI output verbatim, no review. Red flag for any role. The session contains almost no signal — which is itself a signal.
Passive consumer: delegated most work, minimal corrections, no verification. The candidate was present but not engaged. Low competence signal.
Collaborator: good back-and-forth, reviews output, makes corrections. AI does the heavy lifting but the candidate maintains control. Strong signal for most roles.
Driver: drives architecture and decisions, uses AI as an accelerator, contributes significant independent work. The strongest signal of senior-level capability.

Evidence-linked scoring

Every score in a NouSpark report links to specific moments in the conversation. When a hiring manager sees "Quality Control: 34/100," they can see the exact turn where the candidate accepted a 200-line database migration in 4 seconds. When they see "Depth of Understanding: 91/100," they can read the candidate's own words explaining why they chose one approach over another.

Each piece of evidence includes the turn number and timestamp, the candidate's quote, a classification tag (instruction, question, correction, refinement, delegation, approval, or context_dump), and an explanation of why it matters for that dimension.

This is not a black box. It is a documented argument.

Chat turn taxonomy

To classify evidence, we developed a taxonomy of candidate turn types. Each message from the candidate is tagged as one of seven types.

Instruction — an imperative, clear ask: "Add error handling to the login endpoint."
Question — seeks understanding: "Why does this throw a null reference?"
Correction — references AI output and identifies an error: "No, the API returns an array, not an object."
Refinement — builds on previous output, adds a constraint: "Good, but also handle 429 rate limiting."
Delegation — hands off an entire task without specifics: "Write the whole auth module."
Approval — short acceptance with no evidence of review: "Looks good," "ok," "ship it."
Context dump — pastes code or errors without a clear ask: a stack trace followed by "help."
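The seven turn types and the evidence record described earlier fit naturally into an enum and a small data class. The sketch below is illustrative: the class names, the `dimension` field, and the example's turn number, timestamp, and explanation are invented, while the tag values and the quoted correction come from the text.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class TurnType(str, Enum):
    INSTRUCTION = "instruction"
    QUESTION = "question"
    CORRECTION = "correction"
    REFINEMENT = "refinement"
    DELEGATION = "delegation"
    APPROVAL = "approval"
    CONTEXT_DUMP = "context_dump"


@dataclass(frozen=True)
class Evidence:
    turn: int             # turn number in the normalized chat
    timestamp: datetime   # when the candidate sent the message
    quote: str            # the candidate's own words
    tag: TurnType         # one of the seven turn types
    dimension: str        # dimension this evidence supports (assumed field)
    explanation: str      # why the quote matters for that dimension


# Illustrative record; the turn number and timestamp are made up.
example = Evidence(
    turn=12,
    timestamp=datetime(2026, 1, 15, 10, 42, 7),
    quote="No, the API returns an array, not an object.",
    tag=TurnType("correction"),
    dimension="quality_control",
    explanation="Candidate caught a wrong assumption in the AI's response.",
)
```

Constructing the tag with `TurnType("correction")` raises a `ValueError` for any string outside the taxonomy, which keeps unrecognized tags out of a report.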

How this differs from existing approaches

Several companies are working on AI-era technical assessment. The approaches differ in one fundamental dimension: where the candidate works.

CodeSignal runs agentic assessments where candidates work in CodeSignal's environment, with live session recording and transcript analysis. Meta's AI-enabled coding round puts candidates in CoderPad with AI access, observed by an interviewer in real time. Foretoken AI runs real-work simulations with AI — again, in their environment. HackerRank added an AI-assisted IDE with prompt engineering questions that score prompt quality.

To our knowledge, NouSpark is the only platform where candidates work in their own environment, with their own tools, asynchronously. No sandbox, no proctoring, no unfamiliar IDE. The candidate exports their chat log when they're done. We evaluate the real workflow, not a performance.

This matters because engineers have deeply personalized environments — terminal aliases, custom keybindings, snippet libraries, specific AI tools they've built fluency with. Forcing them into an unfamiliar setup doesn't test their ability. It tests their ability to adapt to your testing tool.

Research foundation

This methodology draws on several published frameworks and studies.

Barke et al. (2023) identified two modes of AI-assisted development: acceleration (the developer knows what to do and uses AI for speed) and exploration (the developer is uncertain and uses AI to explore options). Our dimensions capture both modes — iteration quality and AI fluency map to acceleration, while depth of understanding and independence map to exploration.
METR's RCT (2025) demonstrated that AI made experienced developers 19% slower while they perceived a 20% speedup. This perception gap motivated our emphasis on quality control — the dimension most likely to reveal whether speed gains are real or illusory.
Anthropic's RCT (2026) established the correlation between conceptual inquiry and comprehension. This directly informed our weighting of depth of understanding for engineering roles.
The PEEM framework (Prompt Engineering Evaluation Metrics) provides a structured rubric for scoring prompts on clarity, specificity, and context. We adapted elements of this for our AI fluency dimension.
DX Core 4 (2026) proposed four dimensions for measuring developer productivity: speed, effectiveness, quality, and business impact. Traditional proxy metrics such as lines of code and commit frequency are broken in the AI era. Our six dimensions are designed to measure effectiveness directly.
Zapier's AI Fluency Rubric defines four tiers from Unacceptable to Transformative, updated yearly with a rising bar. This influenced our candidate pattern classification — the expectation of what constitutes baseline AI competence will continue to increase.

What we're measuring

The ability to use AI is rapidly becoming table stakes. Zapier updates their fluency rubric yearly — what was advanced last year is expected this year. The bar will keep rising.

What separates great hires from mediocre ones is not whether they can use AI, but whether they understand what they're building and can steer the process toward a good outcome. Whether they review before they ship. Whether they ask "why" before they ask "how." Whether their contribution would be missed if they were replaced by someone else with the same prompt.

That is what we measure. Not keystrokes. Not syntax. Not LeetCode performance. We measure how people actually work — and whether they are building understanding or just accumulating output.