Senior Software Engineer - AI Interaction Evaluator (Codex / Claude Code, up to $200/hr)

$50 – $200/hr Miami, US remote contract senior May 8, 2026

About this role

SENIOR AI INTERACTION EVALUATOR (CODEX / CLAUDE CODE)

These roles are currently filled, but we hire on a rolling basis as new projects open up. Apply now to join our talent bench; qualified candidates will be contacted directly when roles become available.

Contract | $50-200/hr | 10–20 hrs/week | Start ASAP (through early May)

Check out this Loom video for more details! https://www.loom.com/share/b0d1b0bf24c44ae8b95dca84b9db60e5

We’re looking for highly experienced software engineers (Senior+) to help evaluate the quality of interactions with modern coding agents such as OpenAI Codex and Claude Code.

This is not a traditional engineering role. You won’t be writing production code. You’ll be evaluating something harder: whether the model thinks like a great engineer.

WHAT THIS ROLE ACTUALLY IS

You will assess how AI coding agents behave in real-world scenarios, focusing on:
- Whether the response makes sense
- Whether the preamble and reasoning are useful
- Whether the output reflects strong engineering judgment
- Whether the interaction feels right to an experienced developer

This role is about engineering taste, not syntax correctness.

WHAT YOU’LL BE DOING

- Evaluate AI-generated coding interactions end-to-end
- Judge whether outputs are:
  - Useful
  - Correct (at a high level)
  - Aligned with how a strong engineer would think
- Assess the quality of explanations and reasoning, not just code
- Distinguish between different levels of response quality (e.g. what makes something a 2 vs. a 4)
- Provide clear, opinionated feedback on:
  - What worked
  - What didn’t
  - What felt “off” or misleading
- Help define what great looks like when interacting with tools like Cursor

WHAT WE MEAN BY “TASTE”

We’re specifically looking for engineers who can answer questions like:
- Does this feel like something a strong engineer would actually say?
- Is this explanation helpful, or just technically correct?
- Is the model guiding the user well, or just dumping output?
- Would this interaction build or erode trust?

You should be comfortable making subjective but rigorous judgments.

WHO YOU ARE

- Staff / Principal-level engineer (or equivalent experience)
- Strong background in at least one of the following:
  - TypeScript / JavaScript
  - Python
- Hands-on experience using:
  - OpenAI Codex
  - Claude Code
  - Cursor
- Deep familiarity with modern AI-assisted dev workflows
- Able to evaluate code without needing to fully execute or deeply review every line
- Comfortable giving direct, opinionated feedback
- High bar for what “good engineering” looks like

NICE TO HAVE

- Experience with tools like Cursor or similar AI-first IDEs
- Prior exposure to prompt design or evaluation workflows
- Experience mentoring senior engineers or defining engineering standards

ENGAGEMENT DETAILS

- US and Canada: up to $200/hr
- EU and Latam: up to $150/hr
- Other locations: up to $100/hr
- Hours: ~10–20 hours/week
- Duration: Through early May (with possible extension)
- Start: ASAP
- Process:
  - Take-home evaluation exercise
  - One behavioral interview