About the Job
Join Microsoft's CoreAI division as a Senior Software Engineer on the Azure SRE Agent Platform team. We build and operate AI Agents as a Service that help Microsoft customers detect, diagnose, and resolve production issues across their services and workloads on Microsoft platforms. These agents act as virtual SRE teammates: they continuously monitor systems, investigate problems, and recommend or perform fixes, with a focus on quality, safety, security, enterprise scale, and real-world impact.

Your work will span the full lifecycle of agentic systems in production, from designing core agent behaviors (tool design, planning, execution, orchestration, evaluation, safety guardrails) to building the operational foundations for dependability (observability, progressive delivery, reliability engineering, live-site learning). You'll also contribute to a seamless user experience for our customers.

We seek engineers with a strong product quality mindset, end-to-end ownership, and the attention to detail that turns prototypes into trusted systems. You'll have high autonomy in a highly agile environment with short cycles and constant learning. We value a strong owner's mindset and a bias for action, and we encourage you to tackle ambiguous problems, adopt modern research and engineering practices, move quickly, and continuously raise the quality bar.
**Responsibilities:**
- Own critical areas of the Azure SRE Agent Platform, including agent capabilities, orchestration, evaluation, multi-form-factor user experiences, and supporting platform services.
- Build and iterate on agentic systems, encompassing tools, planning/execution loops, evaluations, and safety mechanisms.
- Design and deploy reliable features to enhance incident detection, diagnosis, mitigation, and operational learning.
- Use telemetry, experiments, evaluations, and user feedback to guide iteration and investment decisions.
- Contribute to resilient, observable systems that operate safely and effectively in production.
- Collaborate closely with engineers, SREs, and product teams to transform ambiguous challenges into high-quality, delivered solutions.
- Participate in debugging, live-site learning, and post-incident hardening to continuously improve system quality.
- Influence architecture, engineering standards, and development practices across the team.