About the Job
Join Microsoft's CoreAI division as a Software Engineer II on the Azure SRE Agent Platform team. We build and operate AI Agents as a Service that empower Microsoft customers to detect, diagnose, and resolve production issues across their services and workloads on Microsoft platforms. These agents act as virtual SRE teammates, continuously monitoring systems, investigating problems, and recommending or implementing fixes with a strong focus on quality, safety, security, enterprise scale, and real-world impact.
Our work covers the entire lifecycle of agentic systems in production. You will design and enhance core agent capabilities, including tool design, planning and execution, orchestration, evaluation, and safety guardrails. You will also build the operational foundations for dependability, such as observability, progressive delivery, reliability engineering, and live-site learning. Furthermore, you'll contribute to creating a seamless and intuitive user experience for our agents.
We are seeking talented full-stack Software Engineers who are passionate about product quality, end-to-end ownership, and delivering systems that customers can trust during critical moments. You'll operate with high autonomy in an agile environment, embracing short cycles, feature flags, progressive delivery, and continuous learning. A strong owner's mindset and a bias for action are essential: we are looking for engineers who tackle ambiguous problems, leverage modern research and engineering practices, move quickly, learn from production, and consistently raise the quality bar.
Responsibilities:
- Own critical components of the Azure SRE Agent Platform, including agent capabilities, orchestration, evaluation, user experiences, and supporting platform services.
- Build and iterate on agentic systems, focusing on tools, planning and execution loops, evaluations, and safety mechanisms.
- Design and deploy reliable features to improve incident detection, diagnosis, mitigation, and operational learning.
- Use telemetry, experiments, evaluations, and user feedback to guide product iteration and investment decisions.
- Contribute to building resilient, observable systems that operate safely and effectively in production.
- Collaborate closely with engineers, SREs, and product managers to transform ambiguous challenges into high-quality shipped solutions.
- Participate in debugging, live-site learning, and post-incident analysis to enhance system quality.
- Contribute to architectural decisions, engineering standards, and development practices.
Required Qualifications:
- Bachelor's or Master's degree in Computer Science, or equivalent practical experience.
- 4+ years of experience building production software using languages like C#, C++, Go, Java, or Python.
- Solid understanding of generative AI fundamentals, software engineering principles, and data structures, with strong problem-solving skills.
- Proven ability to quickly learn new technologies and deliver customer and business impact.
Preferred Qualifications:
- Experience building and operating LLM-powered agentic systems in production, with ownership of quality and reliability.
- 3+ years of experience building and operating cloud platforms or distributed services, with expertise in service architecture, deployment, and observability.
- Strong product sense with a history of owning ambiguous problem spaces and driving them to successful outcomes.
- Deep understanding of systems design, performance, and debugging in complex production environments.
- Experience designing, running, and optimizing evaluations for agentic systems.
- Expertise with Kubernetes, container orchestration, or cloud-native infrastructure is a significant advantage.