About the Job
Google Cloud is seeking an AI/ML Software Engineer to join our team in Hyderabad/Secunderabad, Telangana. You will enable and optimize foundational AI models, including Large Language Models (LLMs) and diffusion models, within key frameworks such as vLLM, MaxText, and MaxDiffusion. The role involves working with customers and Customer Engineering teams to measure AI/ML model performance on Google Cloud infrastructure, identify and resolve technical bottlenecks, and drive customer success.
You will partner with internal infrastructure teams to enhance support for demanding AI workloads, contribute to product improvements by identifying bugs and recommending enhancements, and profile, debug, and troubleshoot training and inference workloads. The role also includes maintaining and updating documentation and educational content, triaging and resolving system issues, and designing and implementing specialized machine learning solutions.
We are looking for engineers who are passionate about AI technologies and thrive in a dynamic environment. Experience with distributed computing, GPUs, and TPUs, along with a strong understanding of ML infrastructure (including model deployment, evaluation, and data processing), is essential. Excellent debugging skills and the ability to collaborate effectively with cross-functional teams are key to success in this role.