Job Description

Role Overview

The AI Agent Evaluation Engineer is responsible for the quality, accuracy, explainability, and reliability of AI agent systems across the Proof-of-Concept, Pilot, and Production stages. The role focuses on establishing enterprise-grade evaluation frameworks for agentic AI, LLMs, and AI-driven workflows so that outputs are trustworthy, measurable, and continuously improving.

Key Responsibilities

• Design and implement evaluation frameworks for AI agents, LLMs, and RAG-based systems.

• Measure accuracy, relevance, consistency, hallucination rates, and task success across AI outputs.

• Establish baseline and comparative evaluations across models, prompts, and agent strategies.

• Validate agent decision logic, reasoning paths, and tool usage for explainability and traceability.

• Support human-in-the-loop (HITL) evaluation for high-impact or high-risk use cases.

• Partner with engineering teams to improve prompts, retrieval strategies, and agent orchestration.

• Validate AI observability, monitoring, drift detection, and regression controls.

• Support vendor PoCs, pilots, and RFP evaluations with fact-based assessments.

Required Qualifications

• Experience evaluating Generative AI, LLMs, and agentic AI systems.

• Strong understanding of AI/ML evaluation metrics and error analysis.

• Hands-on experience with Python and AI evaluation workflows.

• Familiarity with RAG architectures, prompt evaluation, and agent orchestration.

• Experience with cloud AI platforms (Azure or GCP preferred).

Preferred Qualifications

• Experience in Education, Healthcare, or other regulated domains.

• Exposure to synthetic data generation and test scenario design.

• Familiarity with AI governance, risk, and compliance practices.

Success Measures

• Measurable improvement in AI accuracy, reliability, and trustworthiness.

• Clear visibility into why AI agents made specific decisions.

• Standardized evaluation frameworks adopted across AI initiatives.

• Increased leadership confidence in AI-driven outcomes.

[Job - 28525] AI Quality Engineer Senior, QA

Job Description

Remote Brazil

Software development

6 hours ago

Senior
