About the Job
JPMorgan Chase is seeking a skilled Site Reliability Engineer (SRE) to join our Asset and Wealth Management Technology team in Bengaluru. In this role, you will be instrumental in enhancing the reliability and resilience of our advanced AI systems, which are transforming how we service and advise clients. You'll focus on ensuring the robustness, availability, and optimal performance of AI models, ultimately deepening client engagement and driving process transformation.
We are looking for passionate individuals who excel in applying advanced reliability engineering practices, AI observability, and incident response strategies to solve complex business challenges. You will contribute to high-quality, cloud-centric software delivery within a culture that values experimentation, continuous improvement, and intellectual curiosity. Join a collaborative and trusting environment that embraces diversity of thought and fosters innovative solutions for our global clientele.
Key Responsibilities:
- Define and refine Service Level Objectives (SLOs) for large language model serving and training systems, covering accuracy, fairness, latency, and drift targets, as well as token-level metrics such as time to first token (TTFT) and time per output token (TPOT).
- Design, implement, and continuously enhance monitoring systems for availability, latency, and other critical metrics.
- Collaborate on the design and implementation of highly available language model serving infrastructure for high-traffic internal workloads.
- Champion Site Reliability Engineering (SRE) culture and practices, providing technical leadership and influence across teams.
- Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers.
- Create incident response playbooks for AI-specific failures (e.g., sudden drift, bias spikes), including automated rollbacks and AI circuit breakers.
- Lead incident response for critical AI services, ensuring rapid recovery and systematic learning.
- Build and maintain cost optimization systems for large-scale AI infrastructure.
- Engineer for scale and security using techniques such as load balancing, caching, optimized GPU scheduling, and AI Gateways.
- Collaborate with ML engineers to ensure seamless integration and operation of AI infrastructure.
- Implement continuous evaluation processes, including pre-deployment and pre-release checks and post-deployment monitoring for drift and degradation.