Software Engineer 2 – Azure AI Infrastructure
Overview:
The Azure AI Infrastructure team at Microsoft is dedicated to building and scaling the largest deep-learning infrastructure service globally. The team develops solutions that run high-scale AI workloads on the Azure AI Platform, enabling distributed deep learning training and inference. Its work lies at the intersection of AI, machine learning, distributed systems, and cloud infrastructure.
Location: Bangalore, Karnataka, India
Work Site: Up to 50% work from home
Travel: 0-25%
Employment Type: Full-Time
Job Number: 1761291
Date Posted: September 13, 2024
As a Software Engineer 2, you will collaborate with some of the brightest engineering talent within Microsoft to tackle challenges like cluster orchestration, job scheduling, networking, containerization, and operating system integration. Your work will play a key role in driving innovation in AI Infrastructure, enabling data scientists and AI practitioners to seamlessly experiment, iterate, and scale their models.
In addition to driving key technological developments, you will also be responsible for building and maintaining infrastructure components that are critical to Microsoft Service Fabric and Kubernetes clusters. This role will allow you to participate in frontline customer support, architecture design, and service excellence initiatives.
Who We Are:
At Azure AI Infrastructure, we believe that building a planet-scale AI Supercomputer from the ground up is an unparalleled opportunity to revolutionize the field of AI. Our vision is to eliminate the fundamental pain points faced by data scientists and AI engineers, offering them a seamless platform to train large-scale models and run inference on them.
If you're passionate about tackling the most complex engineering challenges and want to work on a globally distributed infrastructure that enables AI at unprecedented scales, this is your opportunity to join a team with an ambitious and transformative mission.
What Is Azure AI Infrastructure?
Azure AI Infrastructure is designed to handle the most demanding AI workloads, including large-scale model training and inference across vast volumes of data using hundreds to thousands of GPUs. Azure's infrastructure abstracts the underlying complexities, enabling a highly distributed and cost-effective system that supports AI applications with full utilization of GPU compute.
The platform empowers data scientists to focus on building, scaling, experimenting, and iterating models without worrying about the complexities of the infrastructure. Azure AI Infrastructure operates as a globally distributed, multi-tenant service that optimizes compute, networking, and storage, making it one of the most advanced platforms for AI workloads today.
Our team applies innovative approaches from fields like distributed systems, machine learning, information retrieval, networking, and security to ensure that Azure AI remains at the forefront of the AI revolution.
Key Responsibilities:
As a Software Engineer 2 on the Azure AI Infrastructure team, your work will be central to delivering world-class AI services and infrastructure. You will:
Design and implement a robust container orchestration platform for Azure AI Infrastructure, which will ensure the efficient execution of distributed AI workloads.
Develop the scheduling subsystem responsible for meeting Service Level Agreements (SLAs) for AI training and inference workloads. This includes optimizing job scheduling to maximize resource usage and efficiency.
Build and optimize storage and caching systems to support deep neural network (DNN) training and inference processes. Ensuring that large volumes of data are efficiently accessed and processed is critical for high-performance AI systems.
Design and build control plane APIs that enable the creation and management of AI training jobs and inference model metadata. This API layer will provide a user-friendly interface for data scientists and engineers to interact with the AI infrastructure.
Deliver systems for node management, fault detection, and node repair as a service, ensuring that AI jobs and models are reliable and resilient to system failures.
Develop monitoring systems and telemetry pipelines to enhance the observability of the services you build. This ensures that both end-users and operators have visibility into system performance, job execution, and potential issues.
Codify security and compliance requirements, integrating them into the infrastructure to protect against malicious attacks and exploits. Ensuring the security of Azure AI workloads is a critical part of delivering a reliable and trustworthy platform.
Utilize performance and profiling tools to identify and resolve performance bottlenecks across the hardware-software stack. This includes analyzing performance from CPU and GPU utilization down to networking, microcode, and operating system-level optimizations.
Qualifications:
Required:
4+ years of experience in programming languages such as C#, C, C++, Rust, or Go. Strong expertise in one or more of these languages is essential for success in this role.
Experience with the Linux operating system and proficiency in Kubernetes cluster orchestration. Hands-on experience with containerization and cluster management is vital for this role.
Proven experience in improving service operations or engineering fundamentals, demonstrating a strong commitment to delivering high-quality, reliable systems.
Excellent collaboration skills and the ability to work effectively across teams in a fast-paced and innovative environment.
A Bachelor’s or Master’s degree in Computer Science or a related field. Academic and practical experience in distributed systems, machine learning, or cloud infrastructure is highly valued.
3+ years of experience in building and shipping production software or services. The ability to deliver complex projects on time, while ensuring quality and reliability, is critical.
Security and Compliance:
This position requires meeting Microsoft, customer, and government security screening requirements. These screenings include the Microsoft Cloud Background Check, which is mandatory upon hire and must be renewed every two years. Compliance with security protocols is crucial to protecting both Microsoft's infrastructure and its users.
Why Join Us?
This role offers an opportunity to be a part of Microsoft’s most innovative AI efforts, contributing to the development of one of the world’s most advanced AI infrastructure platforms. You’ll have the chance to work on large-scale, distributed systems and be at the forefront of shaping the future of AI by building the infrastructure that powers some of the world’s most sophisticated AI applications.
The Azure AI Infrastructure team is driven by a shared passion for creating a globally distributed, resilient, and performant AI platform. Our work helps data scientists and AI practitioners push the boundaries of AI, creating solutions that have the potential to change industries and improve lives.
Application Procedure:
Microsoft is accepting applications for this role. If you are excited about building the future of AI and have the technical expertise required for this role, we encourage you to apply as soon as possible.
This role is ideal for engineers passionate about cloud computing, AI infrastructure, and large-scale distributed systems. If you are eager to work on cutting-edge technologies and want to be part of a team that delivers innovative AI services to customers worldwide, this opportunity on Microsoft’s Azure AI Infrastructure team could be a strong fit for you.