AIOps Architect
IBM
At IBM, work is more than a job – it’s a calling: To build. To design. To code. To consult. To think along with clients and sell. To make markets. To invent. To collaborate. Not just to do something better, but to attempt things you’ve never thought possible. Are you ready to lead in this new era of technology and solve some of the world’s most challenging problems? If so, let’s talk.
Your Role and Responsibilities
We are looking for an AIOps Architect to lead the development and deployment of AI-enhanced solutions for IT operations. In this role, you will architect cloud-native, compliant platforms that integrate AIOps, cognitive computing, and machine learning models to improve infrastructure performance, reduce downtime, and enhance system observability. You will design scalable, secure, and resilient systems, develop automated operations, and implement robust security practices to ensure compliance and operational excellence.
As an AIOps Architect, you will guide clients in their digital transformation, utilizing state-of-the-art technologies to build intelligent operations platforms that drive efficiency, enhance system reliability, and support business growth.
Core Responsibilities
- Architect and deploy hybrid, multi-cloud, and cloud-native solutions to support payments transformation, aligning infrastructure, systems, networking, and data center strategies
- Produce and implement comprehensive Solution Architectures, High-Level Designs (HLD), and Low-Level Designs (LLD) that ensure seamless integration of cloud-native technologies, AI-enhanced monitoring, and automation tools, adhering to best practices in security, compliance, and governance
- Develop and deploy strategies to enhance scalability, resilience, and operational efficiency across hybrid and multi-cloud environments, integrating automation, observability, and robust security protocols to support seamless, high-performing, and compliant systems
- Design and implement solutions that optimize cloud operations, infrastructure management, application performance, DevOps pipelines, security frameworks, network architecture, MLOps, and LLMOps
- Apply deep expertise in monitoring tools (AppDynamics, Dynatrace, Splunk, Instana, QRadar, AWS CloudWatch, Azure Monitor, Google Operations Suite), with a focus on LLM observability and security for real-time analytics and anomaly detection
- Develop advanced monitoring and observability frameworks leveraging LLM observability and security, enabling robust tracking of application performance, anomaly detection, and real-time analytics for Large Language Models and other AI/ML workloads
- Integrate supervised learning models for predictive analytics, employing techniques such as data cleaning, event correlation, and root cause analysis to generate actionable insights that drive proactive incident resolution and optimize system performance
- Design and implement IT Service Management (ITSM) and ITIL frameworks, encompassing incident management, problem management, change management, and service level management, to standardize operational workflows and enhance service reliability
- Utilize AI/ML models, including machine learning-based anomaly detection and reinforcement learning, to automate incident response, performance tuning, and infrastructure scaling, reducing Mean Time to Detection (MTTD) and Mean Time to Resolution (MTTR)
- Engineer robust security architectures that include Cloud Native Application Protection Platforms (CNAPP), Zero Trust Network Access (ZTNA), and fully automated DevSecOps pipelines, ensuring compliance with stringent regulatory requirements and maintaining security posture across multi-cloud ecosystems
- Design and deploy High Availability (HA) and Disaster Recovery (DR) solutions using distributed architectures, multi-zone redundancy, data replication, and automated failover, ensuring minimal service disruption and business continuity in multi-region deployments
- Implement chaos engineering practices, conducting FURPS (Functionality, Usability, Reliability, Performance, Supportability) testing to identify potential failure points, validate system resilience, and ensure seamless recovery under high-stress conditions
- Lead end-to-end project lifecycle management, including agile project methodologies, DevOps pipelines, resource allocation, risk management, and milestone tracking, to ensure the successful deployment of scalable, robust, and secure solutions aligned with client objectives
Required Technical and Professional Expertise
- 8+ years of experience in the design, delivery, and scaling of complex, large-scale IT projects, with a focus on cutting-edge technology solutions across hybrid, multi-cloud, and on-premises environments
- 3+ years of technical leadership as a solution architect, driving the design, integration, and management of hybrid cloud solutions, including seamless coordination across various cloud environments
- Demonstrated success in leading highly complex projects, from initial solution design through to deployment, managing diverse teams, coordinating multiple vendors, and ensuring alignment with strategic business goals
- Strong background in architecting complex, multi-cloud systems, leveraging hyperscalers (AWS, Azure, IBM Cloud, Google Cloud), with experience in multi-region deployments, multi-cloud networking, and cross-cloud service integration
- Proven expertise in designing cloud-native solutions with microservices, containers (Docker, Podman), and orchestration platforms (Kubernetes, OpenShift), ensuring modular, scalable, and resilient deployments
- In-depth understanding of regulatory compliance, security frameworks, and best practices in designing secure, resilient architectures
- Familiarity with integrating AI/ML models to enhance monitoring, incident response, and predictive maintenance processes
- Expertise in emerging technologies, such as AI-enhanced operations, automation frameworks, and cloud-native security, to future-proof systems and improve operational efficiency
Preferred Technical and Professional Expertise
Same as the required expertise listed above