What it tests

Whether the candidate can build a working, production-quality AI application using LLM APIs — including prompt engineering, agent design, and delivering something a customer could actually use.

Format

1. Candidate is given a customer use case (e.g., 'A legal team wants to extract obligations from contracts; build a working prototype')
2. Candidate builds a working LLM-powered solution in Python using any available APIs; internet access is allowed
3. Candidate explains architecture decisions: prompt design, context management, output reliability, error handling
4. Final 10 minutes: candidate presents the output as if demoing to the customer's CTO

What to look for

Production mindset — do they handle edge cases, failures, and prompt reliability rather than just a happy path?
Prompt engineering craft — are prompts structured, testable, and adaptable or just naive one-liners?
Architecture judgment — do they choose the right pattern (single call, chain, agent) for the problem?
Customer communication — can they explain a working AI system to a non-AI technical leader?
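
As a rough illustration of the production-mindset criterion, the sketch below shows one way a candidate might guard against malformed model output: parse the reply as JSON, check the expected fields, and retry a bounded number of times with the error fed back into the prompt. The `call_llm` callable, the `party`/`obligation` schema, and the prompt wording are all assumptions made for this example, not part of any particular API.

```python
import json
from typing import Callable

# Hypothetical schema: each obligation names a party and what they must do.
REQUIRED_FIELDS = {"party", "obligation"}


class ExtractionError(Exception):
    """Raised when no valid output is obtained within the retry budget."""


def parse_obligations(raw: str) -> list:
    """Parse model output and verify it has the expected shape."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of obligations")
    for item in data:
        if not isinstance(item, dict) or not REQUIRED_FIELDS.issubset(item):
            raise ValueError(f"item missing required fields: {item!r}")
    return data


def extract_obligations(
    contract_text: str,
    call_llm: Callable[[str], str],  # assumed wrapper around whichever LLM API is used
    max_attempts: int = 3,
) -> list:
    """Ask the model for structured output, retrying on invalid replies."""
    prompt = (
        "Extract every obligation from the contract below. "
        "Respond with only a JSON list of objects with keys "
        "'party' and 'obligation'.\n\n" + contract_text
    )
    last_error = None
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            return parse_obligations(raw)
        except ValueError as exc:
            last_error = exc
            # Feed the validation error back so the next attempt can correct it.
            prompt += f"\n\nYour previous reply was invalid ({exc}). Return valid JSON only."
    raise ExtractionError(
        f"no valid output after {max_attempts} attempts: {last_error}"
    )
```

Because `call_llm` is injected, the retry logic can be tested with a stub that returns canned replies. A candidate who writes this kind of test is also showing that their prompts are testable.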

Adaptation guide

Swap the legal use case for any vertical where your product is deployed (healthcare, finance, logistics). The key is that the deliverable must be something a real customer could immediately evaluate — not a stub. Allow internet access to simulate real working conditions. Score heavily on reliability and explainability, not just whether it runs.

Full description

Format:

Candidate receives a real customer use case — something an enterprise team actually needs solved
Candidate builds a working LLM-powered prototype in Python using any APIs (internet allowed)
Candidate explains architecture decisions: prompt design, context management, output reliability, error handling
Final 10 minutes: candidate presents the output as if demoing to the customer's technical lead

Time: 60 minutes

What to look for:

Production mindset — do they handle edge cases and failures, not just the happy path?
Prompt engineering craft — are prompts structured, testable, and reliable?
Architecture judgment — right pattern (single call, chain, agent) for the actual problem?
Customer communication — can they explain an AI system to a non-AI technical leader?
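
The architecture-judgment criterion can be made concrete with a small sketch contrasting two of the patterns named above: a single call versus a chain that splits a long document, extracts per chunk, and merges the results. The `llm` callable, the prompts, and the character-based `chunk_size` threshold are hypothetical choices for illustration. A real solution would likely split on tokens or document sections, and an agent pattern would add tool use and iteration on top of this.

```python
from typing import Callable

LLM = Callable[[str], str]  # assumed wrapper around whichever LLM API is used


def single_call(document: str, llm: LLM) -> str:
    """One prompt does everything; suits inputs that fit in one request."""
    return llm(f"List the obligations in this contract:\n\n{document}")


def chain(document: str, llm: LLM, chunk_size: int = 2000) -> str:
    """Split a long input, extract per chunk, then merge in a final call."""
    chunks = [
        document[i:i + chunk_size]
        for i in range(0, len(document), chunk_size)
    ]
    partials = [
        llm(f"List the obligations in this excerpt:\n\n{chunk}")
        for chunk in chunks
    ]
    return llm(
        "Merge and deduplicate these obligation lists:\n\n"
        + "\n---\n".join(partials)
    )


def choose_pattern(document: str, llm: LLM, chunk_size: int = 2000) -> str:
    """Use the simplest pattern that handles the input."""
    if len(document) <= chunk_size:
        return single_call(document, llm)
    return chain(document, llm, chunk_size)
```

What an interviewer is listening for is the reasoning behind the choice: the chain costs more calls and adds a merge step that can itself fail, so it should be justified by input size or task complexity.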

Adaptation: Swap the use case for any vertical where your customers operate. The deliverable must be something a real customer could evaluate immediately. Score heavily on reliability and explainability, not just whether it runs.