AI Solutions and Platforms Observability Engineer

PepsiCo

India · Hyderabad, Telangana, India

Posted on May 19, 2026

Overview

The Agentic AI Observability Senior Engineer is responsible for deploying, integrating, and operating a scaled Agentic AI observability platform across both internal and external agent frameworks. This role focuses on production-ready instrumentation and telemetry pipelines that provide end-to-end visibility across multi-step agent workflows (including planner/executor loops, tool/function calls, RAG retrieval, and memory/state), ensuring reliability, safety, performance, and cost governance at scale.

Responsibilities

Agentic AI Observability at Scale (0%)

Platform Deployment & Operations (Agentic AI Observability at Scale)
- Deploy and run the Agentic AI observability platform across dev/uat/prod with HA, resiliency, and controlled rollouts.
- Implement release automation (CI/CD), canary deployments, rollback strategies, and configuration management for platform components.
- Own operational readiness: on-call runbooks, incident response, and production support for agent observability services.
End-to-End Agent Workflow Tracing (Planner → Tools → Retrieval → Response)
- Implement distributed tracing for full agent execution graphs, including correlation across prompts, intermediate reasoning steps (where permitted), tool calls, external APIs, retrieval queries, and final responses.
- Enforce consistent trace context propagation, correlation IDs, and semantic conventions across agent services.
- Build instrumentation patterns to represent agent flows as spans (e.g., plan span, tool span, retrieval span, memory span, response span).
Agent Framework Integrations & Standardized Instrumentation
- Deploy and maintain integrations for internal agent frameworks and external ecosystems such as CrewAI, LangChain, Semantic Kernel, AutoGen, and custom orchestrators.
- Create reusable SDKs/middleware/sidecar patterns for teams to instrument agents with minimal effort.
- Define and implement tagging standards for agent name/version, tool name, model provider, prompt template ID, retrieval source, tenant/app, and environment.
Agentic AI Telemetry Pipelines & AI-Specific Signals
- Build scalable pipelines for agent telemetry (logs/metrics/traces) using OpenTelemetry and platform observability tooling.
- Capture AI-specific metrics including token usage, cost per task, tool-call latency, retrieval latency, grounding score proxies, error rates, and agent loop iterations.
- Implement sampling and redaction strategies for sensitive agent payloads (prompts, responses, retrieved content) aligned to governance requirements.

Collaboration with Teams (10%)
- Collaborate with transformation teams and business stakeholders to understand requirements and tailor AI agents to specific domains.
- Work closely with AI platform teams to build scalable and cross-domain AI agents while ensuring end-to-end observability.
Integration & Deployment (10%)
- Build and maintain CI/CD pipelines for agent services and operations center components, including automated testing and deployment.
- Automate onboarding for new agent use cases (templates, scaffolding, configuration checks).
- Drive best practices for secure, scalable, and cost-effective agent deployments.
Continuous Learning (10%)
- Stay updated with the latest advancements in AI and machine learning technologies and integrate these into existing or new AI agents.
- Conduct thorough testing and validation to ensure the reliability and accuracy of AI agents and solutions.

Qualifications

Education: Bachelor’s or Master’s degree in Computer Science, AI/ML, Data Science, or a related field.
Experience: 4-8+ years of software engineering experience; 2-3+ years building and observing AI/ML or GenAI applications preferred.
Required Expertise:

- Strong hands-on experience deploying observability solutions (Prometheus/Grafana/Elastic/Splunk/Datadog or equivalent)
- Deep working knowledge of OpenTelemetry instrumentation and telemetry pipeline operations
- Experience observing agentic AI systems: tool/function calls, orchestration, routing, memory/state, and RAG pipelines
- Familiarity with CrewAI, LangChain, Semantic Kernel, AutoGen, or similar agent frameworks
- Experience with evaluation/quality monitoring and safe logging strategies for LLM systems
- FinOps experience for tracking token and GPU spend, chargeback/showback, and cost anomaly detection
- Experience implementing data governance controls for AI telemetry (PII redaction, retention, auditability)
- Strong Kubernetes experience (AKS/EKS/GKE) including Helm, operators, ingress, and service networking
- Strong automation skills (Python/Bash/Go) and CI/CD experience
- Infrastructure-as-Code (Terraform/Bicep/CloudFormation)
- Agent workflow tracing and telemetry correlation
- Production operations and debugging distributed systems
- Observability-as-a-platform enablement and automation
- Strong documentation, collaboration, and stakeholder influence
- Technical Proficiency: Implement monitoring for agent failure modes: tool-call failures, infinite loops, timeouts, hallucination risk signals, retrieval misses, and degraded response quality. Create alerts aligned to operational SLOs (availability, latency, tool reliability) and AI-specific indicators (cost spikes, loop bursts, retrieval anomalies). Support guardrail observability: policy blocks, content filtering events, and safety classifier outcomes (where applicable). Build onboarding automation (IaC, templates, CI checks) that makes observability “default-on” for all agentic services.
- Problem-Solving: Ability to translate business challenges into technical solutions.
- Collaboration Skills: Effective at working within cross-functional teams.
- Agility: Flexibility to adapt to changing requirements and new technologies.
- Communication Skills: Capable of explaining complex technical concepts to non-technical stakeholders.