DataLake AI Platform Operations Engineer



Software Engineering, Data Science
Shanghai, China
Posted on Monday, July 1, 2024

We help the world run better

At SAP, we enable you to bring out your best. Our company culture is focused on collaboration and a shared passion to help the world run better. How? We focus every day on building the foundation for tomorrow and creating a workplace that embraces differences, values flexibility, and is aligned to our purpose-driven and future-focused work. We offer a highly collaborative, caring team environment with a strong focus on learning and development, recognition for your individual contributions, and a variety of benefit options for you to choose from.

We are seeking a skilled and motivated individual to join our team as a DataLake AI Platform Operations Engineer. This role focuses on Cloud Infrastructure, Kubernetes (K8S), Machine Learning, and AI Model Training tooling solutions. In this position, you will set up and manage AI and general computing infrastructure connected to an OpenStack-based private cloud and provision cloud resources from IaaS. You will also implement service components that support distributed model training tasks and production use-case serving instances across K8S clusters, monitor the runtime metrics of each component, and continuously optimize them.

What You'll Do:


  • Infrastructure Operation: Utilize OpenStack-based IaaS resources and optimize their provisioning to ensure efficient infrastructure operations.
  • Cross-Node Resource Management: Manage Kubernetes clusters across different regions and availability zones, ensuring optimal performance for use-cases and shared services while minimizing resource consumption.
  • Logging, Auditing, and Metrics: Implement distributed logging solutions using Loki and OpenSearch. Configure auditing for each use-case and collect Prometheus-based metrics from both platform services and use-cases.
  • Dashboarding and Monitoring: Develop dashboards tailored to specific needs and monitor the platform using the dashboard tools you create.
  • Support Platform Use-Cases: Assist use-case development teams in maximizing the platform's capabilities for their projects.
  • TCO Management: Automate the calculation of the total cost of ownership for platform infrastructure and licenses, and allocate these costs to each specific use-case.
  • Collaboration, Documentation, and Training: Collaborate with peers across regions to support various projects, document new changes, and provide training to platform users.

What You Bring:


  • Bachelor's degree in Computer Science, Engineering, or a related field; advanced degrees are a plus.
  • Basic understanding of GPU-based computing concepts, and familiarity with AI/ML frameworks and tools such as CUDA, Kubeflow, Spark, or PyTorch.
  • Solid knowledge of Kubernetes and container orchestration concepts.
  • Proficiency in coding languages (e.g., Python, Go, Shell) for automation and infrastructure management.
  • Proven experience in infrastructure and operations management for cloud service solutions.
  • Strong problem-solving skills and the ability to diagnose and resolve complex technical issues.
  • Excellent communication and collaboration skills to work effectively with cross-functional teams.
  • Strong attention to detail and the ability to manage multiple priorities in a fast-paced environment.

Join our dynamic team and contribute to cutting-edge solutions in AI and cloud infrastructure!

Bring out your best

SAP innovations help more than four hundred thousand customers worldwide work together more efficiently and use business insight more effectively. Originally known for leadership in enterprise resource planning (ERP) software, SAP has evolved to become a market leader in end-to-end business application software and related services for database, analytics, intelligent technologies, and experience management. As a cloud company with two hundred million users and more than one hundred thousand employees worldwide, we are purpose-driven and future-focused, with a highly collaborative team ethic and commitment to personal development. Whether connecting global industries, people, or platforms, we help ensure every challenge gets the solution it deserves. At SAP, you can bring out your best.

We win with inclusion

SAP’s culture of inclusion, focus on health and well-being, and flexible working models help ensure that everyone – regardless of background – feels included and can run at their best. At SAP, we believe we are made stronger by the unique capabilities and qualities that each person brings to our company, and we invest in our employees to inspire confidence and help everyone realize their full potential. We ultimately believe in unleashing all talent and creating a better and more equitable world.
SAP is proud to be an equal opportunity workplace and is an affirmative action employer. We are committed to the values of Equal Employment Opportunity and provide accessibility accommodations to applicants with physical and/or mental disabilities. If you are interested in applying for employment with SAP and are in need of accommodation or special assistance to navigate our website or to complete your application, please send an e-mail with your request to Recruiting Operations Team:
For SAP employees: Only permanent roles are eligible for the SAP Employee Referral Program, according to the eligibility rules set in the SAP Referral Policy. Specific conditions may apply for roles in Vocational Training.

EOE AA M/F/Vet/Disability:

Qualified applicants will receive consideration for employment without regard to their age, race, religion, national origin, ethnicity, gender (including pregnancy, childbirth, etc.), sexual orientation, gender identity or expression, protected veteran status, or disability.
Successful candidates might be required to undergo a background verification with an external vendor.

Requisition ID: 398759 | Work Area: Software-Development Operations | Expected Travel: 0 - 10% | Career Status: Professional | Employment Type: Regular Full Time | Additional Locations: #LI-Hybrid.