The AI Observability Engineer (Agentic Frameworks & AI Agent Operations Center Developer) builds and operationalizes agentic AI solutions using modern orchestration frameworks and contributes to an AI Agent Operations Center that enables safe, reliable, and observable agent behavior at scale. This role focuses on developing agent workflows (planning, tool execution, memory, and RAG), integrating guardrails and evaluations, and delivering operational capabilities such as run management, telemetry, and incident triage for production agents.
Responsibilities
- AI Agent Operations Center (70%)
  - Build “operations center” capabilities for agent runtime management: agent registry, versioning, deployment tracking, and run histories
  - Enable operational workflows such as incident triage, replay/debug runs, trace correlation, and root-cause analysis across agent steps
  - Implement operational dashboards and views for agent health: success rate, latency, tool failure rate, cost per run, and loop detection
  - Instrument agent flows end-to-end using OpenTelemetry (or equivalent), enabling correlation across prompts, tool calls, retrieval, and responses
  - Implement semantic conventions and tagging standards (agent name/version, tool name, model provider, environment, tenant/app)
  - Partner with SRE/observability teams to ensure production-grade monitoring, alerting, and operational readiness
- Collaboration with Teams (10%)
  - Collaborate with transformation teams and business stakeholders to understand requirements and tailor AI agents to specific domains
  - Work closely with AI platform teams to build scalable, cross-domain AI agents while ensuring end-to-end observability
- Integration & Deployment (10%)
  - Build and maintain CI/CD pipelines for agent services and operations center components, including automated testing and deployment
  - Automate onboarding for new agent use cases (templates, scaffolding, configuration checks)
  - Drive best practices for secure, scalable, and cost-effective agent deployments
- Continuous Learning (10%)
  - Stay current with the latest advancements in AI and machine learning technologies and integrate them into existing or new AI agents
  - Conduct thorough testing and validation to ensure the reliability and accuracy of AI agents and solutions
Key Skills/Experience Required
Minimum Qualifications:
