We are looking for a self-driven AI SRE engineer with a software engineering mindset to:
- Define and implement robust architectural patterns for Machine Learning (ML) models, Large Language Models (LLMs), AI agents, and computer vision solutions, driving new shift-left activities that apply Site Reliability Engineering (SRE) and quality assurance principles within application design, architecture, and the project roadmap to enable resilient outcomes.
- Collaborate closely with PepsiCo's Data and AI Architecture team to ensure these patterns align with our Data and AI Strategy and technical standards.
- Apply a pre-emptive approach in production to minimize business impact, using SRE-driven orchestration that connects all components of the ecosystem, diagnoses anomalies before they reach users, and remediates them through automation and AI.
This is a critical enabler of high resiliency during operations and of continuous improvement through design during the software development lifecycle.
The DPA SRE Principal Engineer is an integral part of the global team, whose main purpose is to provide a delightful customer experience for users of the global consumer, commercial, supply chain, and enablement functions in the PepsiCo digital products application portfolio of 260+ applications, enabling a full SRE practice built on an incident prevention / proactive resolution model.
The scope of this role is focused on cloud-architecture full-stack application development, B2B Pepsi Connect, Direct to Customer, and other S&T roadmap applications.
The role ensures that PepsiCo DPA applications deliver the service performance, reliability, and availability expected by our customers and internal groups.
It requires a blend of technical expertise in SRE tools and modern cloud application architecture (i.e., full stack), IT operations experience, and analytical and influencing skills.
Responsibilities
- Architectural Pattern Definition: Define platform and model development patterns for ML models, LLMs, and AI agents in collaboration with the Data and AI Architecture team.
- AI Observability: Establish and standardize AI observability patterns for ML, computer vision systems, LLMs, and AI agents to enable robust model monitoring, explainability, and drift detection.
- Cross-Functional Collaboration: Engage with cross-functional teams (data science, engineering, operations, etc.) to standardize AI architecture practices and patterns across the enterprise.
- Enterprise Alignment: Work closely with the Enterprise Architecture (EA) group to align AI/ML architecture patterns with PepsiCo's broader IT strategy and standards.
- Technology Evaluation: Evaluate and recommend AI/ML frameworks, platforms, and tools to ensure best-in-class solutions are utilized within PepsiCo's AI ecosystem.
- Scalability & Reliability: Ensure AI solutions are scalable, reliable, secure, and high-performing across cloud environments (Azure, AWS, and GCP), following cloud-native architecture best practices.
- Innovation & Strategy: Stay up to date with emerging AI/ML trends and technologies, and provide strategic technical direction for future AI initiatives and capabilities.
- Ensure ecosystem availability and performance in production environments, proactively preventing P1, P2, and potential P3 incidents.
- Engage and influence product and engineering teams during the design and development phases to embed reliability and operability into new services, defining and enforcing event, logging, monitoring, and observability standards across applications.
- Work closely with customer-facing support teams to empower them with SRE insights and tooling.
- Observe, diagnose, and improve the end-to-end ecosystem performance of the modern-architecture application portfolio, i.e., develop a technical understanding of the interactions within a full-stack application, working alongside peer SRE team members.
- Continuously optimize L2/support operations work via AI and agentic flows.
- Be the key architect for AI capabilities and use cases in the SRE orchestration platform design, with input from Production Operations, Business usage, and Product and Engineering teams.
- Actively engage and drive AIOps adoption across teams.
Qualifications
- Education: Master's or Ph.D. in Computer Science, AI/ML, or a related field.
- Experience: 12+ years of experience in AI/ML architecture and data science, with a focus on designing and deploying enterprise-scale AI solutions.
- Cloud Expertise: Strong expertise in Azure and AWS AI/ML services and cloud-native architectures for machine learning.
- Advanced AI Technologies: Hands-on experience with LLMs, AI agents, and Machine Learning Operations (MLOps) pipelines.
- AI Observability: Deep knowledge of AI observability (model monitoring, explainability, and drift detection) to ensure models remain trustworthy and effective over time.
- Programming & Frameworks: Strong programming skills in Python and experience with AI/ML frameworks such as TensorFlow, PyTorch, or similar libraries.
- Collaboration: Ability to collaborate effectively with data engineers, data scientists/AI scientists, and IT leadership to drive AI initiatives forward.
- Governance & Compliance: Solid understanding of enterprise AI governance, security best practices, and compliance requirements (data privacy, model ethics, etc.).
- Experience with popular AI frameworks such as LangChain, LangGraph, and CrewAI.
- Knowledge of AI ethics and responsible AI practices.
- Strong communication skills with the ability to explain complex technical concepts to non-technical stakeholders.
