Staff AI Infrastructure Site Reliability Engineer
Company: Tbwa Chiat/Day Inc
Location: Santa Clara
Posted on: March 29, 2025
Job Description:
XPeng Motors is
one of China's leading smart electric vehicle (EV) companies. We
design, develop, and manufacture smart EVs that are seamlessly
integrated with advanced Internet, AI and autonomous driving
technologies. We are committed to in-house R&D and intelligent
manufacturing to create a better mobility experience for our
customers. We strive to transform smart electric vehicles with
technology and data, shaping the mobility experience of the
future.
As a Staff AI Infrastructure SRE, you will be instrumental
in leading the design and implementation of robust, cloud-native AI
infrastructure solutions that support our autonomous driving
initiatives. Your expertise will guide the development of systems
capable of handling large-scale, real-time data processing and
advanced machine learning models.
Job Responsibilities:
- Architect and lead the development of scalable, secure AI
infrastructure on cloud-native platforms to support autonomous
driving technologies.
- Collaborate closely with ML teams to facilitate seamless
integration and optimal performance of AI algorithms.
- Identify and address system bottlenecks and instabilities,
applying innovative solutions to enhance system reliability and
efficiency.
- Foster technological advancements through research and
implementation of state-of-the-art AI tools and methodologies.
- Act as a key technical leader and mentor, promoting a culture
of technical excellence and collaborative innovation within the AI
infrastructure team.
Minimum Skill Requirements:
- Bachelor's or Master's degree in Computer Science, Engineering,
or a related technical field.
- 5+ years of experience in designing, deploying, and managing
GPU clusters for high-performance computing in AI applications,
particularly within cloud environments.
- Proficient in cloud services (AWS, Azure, Alibaba Cloud) and
building containerized applications using Kubernetes and
Docker.
- Strong programming skills in Python and Golang, and experience
with AI/ML frameworks (TensorFlow, PyTorch).
Preferred Skill Requirements:
- Expertise in designing and managing high-availability,
high-throughput systems that support machine learning and deep
learning workloads.
- Demonstrable leadership skills with a track record of mentoring
and leading technical teams.
- In-depth understanding of data structures, algorithms, and
software engineering principles relevant to AI and autonomous
systems.
What we provide:
- A fun, supportive and engaging environment.
- Opportunity to make a significant impact on the transportation
revolution by advancing autonomous driving.
- Opportunity to work on cutting-edge technologies with the top
talent in the field.
- Competitive compensation package.
- Snacks, lunches, and fun activities.
The base salary range for
this full-time position is $215,280-$364,320, in addition to bonus,
equity, and benefits. Our salary ranges are determined by role,
level, and location. The range displayed on each job posting
reflects the minimum and maximum target for new hire salaries for
the position across all US locations. Within the range, individual
pay is determined by work location and additional factors,
including job-related skills, experience, and relevant education or
training.
We are an Equal Opportunity Employer. It is our policy to
provide equal employment opportunities to all qualified persons
without regard to race, age, color, sex, sexual orientation,
religion, national origin, disability, veteran status, marital
status, or any other prescribed category set forth in federal or
state regulations.