Title:
AI Reliability Engineer (AI SRE)
Job Type:
Contract
Contract Length:
12 Months
Pay Range:
$50/hr – $175/hr
Start Date:
ASAP
Location:
Remote
About the Opportunity:
Our client, a leader in AI testing and Generative AI solutions, is looking for a skilled AI Reliability Engineer (AI SRE) to join their team for a 12-month engagement. This project involves ensuring the reliability, availability, and performance of mission-critical AI systems by defining SLOs, implementing automated resilience measures, and leading incident response. This is a high-impact role that requires a self-motivated professional who can hit the ground running and deliver results quickly.
Key Responsibilities & Deliverables:
This role is focused on the successful completion of specific tasks and deliverables. Your responsibilities will include:
- Defining and maintaining Service Level Objectives (SLOs) for AI inference latency and availability.
- Building automated "circuit breakers" and fallback logic (e.g., switching to a smaller model if the primary fails).
- Leading incident response and root-cause analysis (RCA) for complex AI system failures.
- Developing stress-testing and chaos engineering scenarios specifically for AI agent swarms.
- Reducing "cold start" latency and scale-up time for serverless AI functions.
Qualifications:
We are looking for someone with a proven track record of successful contract engagements. The ideal candidate will have:
- 4+ years of experience in Site Reliability Engineering (SRE).
- Deep expertise in system monitoring, incident management, and cloud resilience. This is not a learning role; you need to be a subject matter expert.
- Demonstrated ability to work autonomously and manage your own time effectively to meet project goals.
- Experience with Python/Go, Kubernetes, and observability stacks (Datadog, New Relic).
- Strong communication skills to provide clear and concise status updates to the project team.





