Site Reliability Engineer (SRE) - AI Platforms

JP Morgan Chase & Co. Bengaluru / Bangalore, Karnataka
Permanent Job Not disclosed
Load Balancing, Incident Response Strategies, Site Reliability Culture

Join JPMorgan Chase's Asset and Wealth Management Technology team as a Site Reliability Engineer (SRE) and play a crucial role in enhancing the reliability and resilience of our cutting-edge AI systems. You will be instrumental in ensuring the robustness and availability of AI models that transform how we serve and advise clients. This role focuses on deepening client engagements and driving process transformation through high-quality, cloud-centric software delivery. We are looking for individuals passionate about applying advanced reliability engineering practices, AI observability, and incident response strategies to solve complex business challenges.

Your responsibilities will include:

  • Defining and refining Service Level Objectives (SLOs) for large language model (LLM) serving and training systems, balancing performance with development velocity.
  • Designing, implementing, and continuously improving comprehensive monitoring systems.
  • Collaborating on the design and implementation of high-availability LLM serving infrastructure.
  • Championing Site Reliability Engineering (SRE) culture and practices, providing technical leadership and influencing teams.
  • Developing and managing automated failover and recovery systems for model deployments across multiple regions and cloud providers.
  • Creating incident response playbooks for AI-specific failures, including automated rollbacks and circuit breakers.
  • Leading incident response for critical AI services, ensuring rapid recovery and continuous improvement.
  • Building and maintaining cost optimization systems for large-scale AI infrastructure.
  • Engineering for scale and security using techniques such as load balancing, caching, optimized GPU scheduling, and AI Gateways.
  • Collaborating with ML engineers to ensure seamless integration and operation of AI infrastructure.
  • Implementing continuous evaluation processes for drift and degradation monitoring.
