Reliability engineering jobs

29+ Reliability engineering Jobs in India

DevOps Engineer / Site Reliability Engineer (SRE)

at NeoGenCode Technologies Pvt Ltd

2 candid answers

Posted by Akshay Patil

Gurugram

5 - 10 yrs

₹12L - ₹18L / yr

DevOps

Reliability engineering

Amazon Web Services (AWS)

Terraform

Ansible

Job Title : DevOps Engineer / Site Reliability Engineer (SRE)

Experience : 5+ Years

Location : Gurugram, Haryana

Work Mode : On-site (Full-time)

About the Role :

We are looking for a skilled DevOps Engineer with 5+ years of experience in cloud infrastructure, CI/CD, automation, Kubernetes, and Site Reliability Engineering (SRE). The ideal candidate will be responsible for building scalable cloud infrastructure, automating deployments, improving system reliability, and ensuring high availability across production environments.

Mandatory Skills :

AWS, Terraform, Ansible, CloudFormation, Jenkins, GitLab CI, GitHub Actions, Docker, Kubernetes, Helm, Python, Bash, Grafana, Prometheus, ELK Stack, CloudWatch, New Relic, SRE, CI/CD, Infrastructure as Code (IaC), Linux

Key Responsibilities :

Design, deploy, and manage cloud infrastructure primarily on AWS (EC2, VPC, IAM, S3, RDS, Route53, ALB, Auto Scaling, Lambda).
Build and maintain Infrastructure as Code (IaC) using Terraform, Ansible, and CloudFormation.
Develop and optimize CI/CD pipelines using Jenkins, GitLab CI, and GitHub Actions.
Deploy and manage containerized applications using Docker, Kubernetes, and Helm.
Implement monitoring and observability using Grafana, Prometheus, ELK Stack, CloudWatch, and New Relic.
Drive SRE practices by defining SLIs, SLOs, SLAs, handling production incidents, conducting RCA, and improving system reliability.
Automate operational tasks using Python, Bash, and Groovy scripting.
Collaborate with Development, QA, Security, and Operations teams to ensure reliable and secure software delivery.

Required Skills & Qualifications :

Bachelor's degree in Computer Science, IT, Electronics, or a related field.
5+ years of experience in DevOps, SRE, or Cloud Infrastructure.
Strong expertise in AWS, with exposure to Azure/GCP.
Hands-on experience with Terraform, Ansible, CloudFormation, Docker, Kubernetes, Helm, Jenkins, GitLab CI, GitHub Actions, and Git.
Strong scripting skills in Python and Bash.
Experience with monitoring tools such as Grafana, Prometheus, ELK Stack, CloudWatch, and New Relic.
Good understanding of Linux, networking, SQL, and cloud security best practices.

Preferred Skills :

Experience with multi-cloud environments and DevSecOps practices.
Knowledge of disaster recovery, automation, and microservices architecture.
Strong troubleshooting, communication, and problem-solving skills.

Lead Cloud Security & Reliability Engineer - Pune

at Searce Inc

3 recruiters

Posted by Karthika Senthilkumar

Pune

7 - 12 yrs

Best in industry

Google Cloud Platform (GCP)

Terraform

Kubernetes

GKE

Reliability engineering

About Searce

Searce is a global, AI-native, engineering-led technology consultancy and a Premier Google Cloud Partner, recognized as the Google Cloud Workplace AI Transformation Partner of the Year, APAC (2026). With 20+ years of experience and 3,000+ clients across 10+ countries, we help businesses stay ahead of the cloud curve.

The Role

We're looking for a Lead Cloud Security & Reliability Engineer with deep GCP expertise to own end-to-end cloud reliability and security for APAC enterprise clients. As Lead, you'll set the architectural direction, mentor your squad, and drive measurable client outcomes across multi-cloud environments.

What You'll Do

Own Client Delivery: Lead 24x7 GCP cloud operations for APAC clients. Define SLO frameworks and ensure adherence.

Architect Solutions: Design scalable, secure GCP-primary architectures with multi-cloud awareness.

Drive Reliability: Lead incident response, RCA, and long-term remediation across production systems.

Mentor & Elevate: Coach and grow a squad of Senior CSREs.

Drive FinOps: Own cloud cost governance and optimization with quantified impact.

Be the Expert: Represent Searce's technical depth in global client conversations.

What We're Looking For

Experience

7–12 years total with 5+ years on GCP cloud infrastructure

Strong background in Cloud Managed Services / MSP environments

Proven experience leading a team in client-facing delivery

Multi-cloud exposure (AWS/Azure secondary) preferred

Technical Skills (Must-Have)

GCP: GKE, IAM, VPC, Cloud Monitoring, Stackdriver, KMS (demonstrated in work experience)
Kubernetes: GKE — production cluster management, Helm
IaC: Terraform — module-level, reusable frameworks
Observability: Prometheus, Grafana, Thanos or equivalent
Security: IAM, Zero-trust, DevSecOps, CSPM tools
Scripting: Python or Go
FinOps: GCP cost governance demonstrated

Nice to Have

GCP Professional Cloud Architect / Pro DevOps Engineer certification
AWS / Azure secondary experience
CKA (Certified Kubernetes Administrator)
ITIL / change management awareness
APAC client delivery experience

Why Searce?

🏆 Google Cloud Partner of the Year — APAC 2026

🌍 Work with APAC enterprise clients across multiple industries

🤖 AI-first, engineering-led culture

📈 Lead-level ownership with real career growth

🤝 HAPPIER values: Humble, Adaptable, Positive, Passionate, Innovative, Excellence, Responsible

Senior SRE

A product-based company (Domain: EV Charging)

Agency job

via Unique Occupational by Mantasha Naaz

Bengaluru (Bangalore)

6 - 8 yrs

₹18L - ₹20L / yr

SRE

Reliability engineering

on call Support

Incident management

Amazon Web Services (AWS)

Job Title: Senior Site Reliability Engineer

Location: Bengaluru, India (Hybrid)

Employment Type: Full-time

Experience: 6+ years

About the Company

The company is driving the electric mobility revolution through cutting-edge software, infrastructure, and professional services. Our technology empowers utilities, cities, fleets, transit agencies, and automakers to deploy EV charging infrastructure at scale, safely, efficiently, and sustainably. With a global footprint spanning three continents and operations in 13 countries, we are passionate about shaping the future of sustainable transport.

Operating over 70,000 charge points globally, the company is driving the transition toward cleaner, smarter, and more efficient mobility. The India team serves as a critical operational hub, supporting global platforms focused on decarbonization, digitalization, and scalable infrastructure growth.

We value purpose-driven individuals who want to make a meaningful impact and help create a cleaner, smarter, and more connected world.

Role Overview

We are seeking a skilled and proactive Site Reliability Engineer (SRE) to join our growing team. In this role, you will be responsible for maintaining system reliability, scalability, and performance across our EV charging platforms. You will collaborate closely with development and operations teams to build resilient, automated, and observable systems.

Key Responsibilities

Ensure high availability, performance, and reliability of production systems
Design, implement, and manage scalable infrastructure solutions
Build and maintain CI/CD pipelines for efficient software delivery
Monitor system health using observability tools and respond to incidents proactively
Automate operational processes using scripting and Infrastructure as Code (IaC)
Manage containerized environments using Docker and Kubernetes
Collaborate with cross-functional teams to improve system architecture and resilience
Participate in on-call rotations and incident management processes
Continuously optimize cloud infrastructure for cost, performance, and scalability

Required Qualifications & Skills

Bachelor’s degree in Computer Science, IT, or related field
4+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles
Strong experience with containerization (Docker) and orchestration (Kubernetes)
Proficiency in Linux administration, networking, and system security
Hands-on experience with cloud platforms, especially AWS (EKS, EC2, S3, RDS, Lambda)
Experience with CI/CD tools such as Jenkins, GitLab CI/CD, or similar
Knowledge of Infrastructure as Code tools (Terraform, AWS CloudFormation, Ansible)
Proficiency in scripting languages (Python, Bash, or PowerShell)
Experience with monitoring tools like Dynatrace, Prometheus, Grafana, or Zabbix
Solid understanding of system architecture, microservices, and SaaS/PaaS models
Strong analytical and problem-solving skills

What We Offer

Work with some of the brightest minds in the emerging EV industry.
Make a tangible impact in reducing carbon emissions and enabling sustainable energy.
Freedom to suggest, implement, and innovate on systems, processes, and technologies.
Daily ownership in a high-growth, challenging environment.
Flexible work environment with hybrid schedules and virtualization options.
Competitive pay and benefits including health coverage, innovative PTO program, and performance bonuses.

Platform Engineer/ SRE

A product-based company (Domain: EV Charging)

Agency job

via Unique Occupational by Mantasha Naaz

Bengaluru (Bangalore)

2 - 4 yrs

₹10L - ₹14L / yr

SRE

Reliability engineering

Incident management

24/7

Platform Engineer

Location: Bengaluru, India (Hybrid)

Employment Type: Full-time

Experience: 2-4 years

About the Company

The company is driving the electric mobility revolution through cutting-edge software, infrastructure, and professional services. Our technology empowers utilities, cities, fleets, transit agencies, and automakers to deploy EV charging infrastructure at scale, safely, efficiently, and sustainably. With a global footprint spanning three continents and operations in 13 countries, we are passionate about shaping the future of sustainable transport.

Operating over 70,000 charge points globally, the company is driving the transition toward cleaner, smarter, and more efficient mobility. The India team serves as a critical operational hub, supporting global platforms focused on decarbonization, digitalization, and scalable infrastructure growth.

Role Overview

What you’ll do:

Ensure system reliability, uptime, and performance of the global platform.
Conduct real-time surveillance of our EV charging systems on a near-24/7 basis to proactively identify and mitigate performance issues and anomalies. To that end, collaborate with IDT and FMC players to ensure incident detection also happens outside office hours (monitoring shifts are shared among team members according to the duty schedule).
Deliver changes and releases, such as firmware updates, and feed insights and intelligence back into testing processes and technical discussions with the wider organization.
Deliver and project-manage first-time-right commissioning activities alongside our Engineering Procurement Contract Management (EPCM) partners to bring charge points onto our Charge Point Management System (CPMS).
Provide technical guidance and support to DC specialists during the commissioning of EV charging solutions.
Work closely with Shell, Engineering, and IT colleagues to ensure projects are completed on time and to specification.
Act as a liaison with the Engineering Procurement Contract Management (EPCM) partner to manage projects from start to finish, ensuring charge points are successfully onboarded on the Charge Point Management System (CPMS).
Collaborate with development, operations and support teams to build scalable and resilient systems.
Contribute to incident response, root-cause analysis, and post-mortem reviews, driving continuous improvement.
Participate in capacity planning, performance tuning, and resource optimization.
Integrate security and compliance best practices into all infrastructure operations.
Stay current with emerging SRE tools, frameworks, and cloud technologies to continuously improve reliability practices.
Participate in and lead on-call rotations and incident response, conducting detailed postmortems and RCA reports.
Be flexible to resolve blocking issues during off-hours or on weekends if required.

What We’re Looking For:

Basic Qualifications and Skills

Bachelor’s degree in Engineering, Electrical, ECE, Computer Science, Information Technology, or related field.
2–4 years of overall experience, with at least 1 year as a Site Reliability Engineer, DevOps Engineer, or Technical Project Coordinator.
Proven experience in DevOps, SRE, or Technical Project Coordination with IoT or connected-device platforms.
Experience with incident management and on-call best practices. Provide support to on-call engineers.
Excellent analytical and problem-solving skills with a proactive mindset.
Expertise with monitoring and observability tools (Dynatrace, Prometheus, Grafana, Zabbix, etc.).
Solid understanding of cloud platforms (AWS) and AWS native services (EKS, EC2, S3, RDS, Lambda).
Proactively monitor the network, triage performance outliers, and coordinate corrective actions to ensure optimal system functionality.
Fluency in English (spoken and written).
Successfully recommission or decommission chargers following changes in our network.
Responsible for the go-live of the chargers on Shell’s public network following commissioning attempts.

Additional Information

This role involves managing infrastructure for a global platform operating in over ten countries, requiring effective communication and collaboration across regions.
Strong verbal and written communication skills, along with availability and flexibility to resolve blocking issues, are essential to support on-call engineers.
This role may involve EU or US time-zone shifts based on business requirements.
Shift timing: 2 PM IST to 11 PM IST.

What is required to be successful in this role:

Global platform experience (B2C or B2B).
AWS native service experience.
Firmware deployment and cloud cost optimization experience.
Strong exposure to monitoring and alerts.
Experience with firmware rollout and IoT device onboarding and offboarding is an added advantage.
Experience as an SRE or DevOps Engineer with some exposure to Project Management or Technical Project Management in IoT-based projects will be helpful.

What We Offer

Work with some of the brightest minds in the emerging EV industry.
Make a tangible impact in reducing carbon emissions and enabling sustainable energy.
Freedom to suggest, implement, and innovate on systems, processes, and technologies.
Daily ownership in a high-growth, challenging environment.
Flexible work environment with hybrid schedules and virtualization options.
Competitive pay and benefits including health coverage, innovative PTO program, and performance bonuses.

Enterprise Platform Engineer - AI Agents

at Recruiting Bond

2 candid answers

Posted by Pavan Kumar

Bengaluru (Bangalore)

7 - 12 yrs

₹70L - ₹110L / yr

Platform as a Service (PaaS)

Platform Engineering

Agentic AI

AI Agents

Model Context Protocol (MCP)

About My Client Company

We're building the learning infrastructure that transforms AI agents into true digital workers. While today's agents can reason and plan, they fail to do meaningful work because they lack real experience operating in apps. My Client Product gives agents continuously improving, reusable skills across 1000+ production-grade app connectors including Gmail, Linear, and Hubspot. We handle authentication, tool routing, retries, failure handling, and observability, making every action safe and dependable.

About the Role

Every enterprise is racing to make AI work — not as a demo, but as infrastructure that runs their business. My Client Product is becoming the critical layer that makes this possible: the platform that connects AI agents to 250+ real-world applications with production-grade auth, execution, and reliability.

We've built this for the cloud. Now we need to build it for the enterprise — and that means rethinking the platform from the ground up with the right abstractions, primitives, and architectural decisions that let us serve a massive, diverse set of enterprise customers without bespoke engineering for each one. This is a founding role.

Your Impact

Agent infrastructure platform: The foundational layer that enterprise AI agents run on — governance, observability, and control planes for MCP-powered agent ecosystems. You'll define how organizations monitor, audit, and manage AI agents operating at scale across their systems
The integration gateway: The secure, reliable bridge between an enterprise's AI agents and the outside world — every SaaS tool, internal system, and API they need to act on. Not just connectors, but a platform-grade gateway with the right trust, permissioning, and routing primitives
Platform primitives for scale: Multi-tenancy, isolation, configuration, and extensibility abstractions that let Composio serve thousands of enterprise customers without linear engineering cost
Enterprise-grade architecture: Deployment flexibility, security, and compliance as first-class platform capabilities — not bolted-on afterthoughts
The repeatable deployment motion: Turn enterprise onboarding from a services engagement into a product experience. Shorter cycles, fewer custom touches, more self-serve

What you bring

You've built platforms at genuine scale — not just high user counts, but high complexity: many customer types, deployment models, and integration surfaces
You think in abstractions and primitives. Your instinct is to find the right foundational model, not to solve each problem individually
You've shipped enterprise product capabilities (deployment flexibility, security, admin tooling, compliance) and understand them as product problems, not just checkboxes
You've built or shipped an AI product — or you're the person who can't stop tinkering. You're building agents on weekends, stress-testing the latest models, experimenting with MCP, and forming your own opinions on where agent architectures are headed. You have a point of view on this space, not just a resume line
You're a force multiplier. When you join a team, the entire product moves faster because the platform decisions are right

Skills & Expertise

Platform Engineering, AI Infrastructure, Agentic AI, AI Agents, MCP (Model Context Protocol), Distributed Systems, Enterprise Architecture, Multi-Tenant Architecture, Backend Platform Engineering, Enterprise SaaS, API Platform Engineering, Integration Platforms, SaaS Connectors, Cloud Infrastructure, AWS, GCP, Kubernetes, Docker, Terraform, Microservices, Event-Driven Architecture, API Gateway, OAuth 2.0, RBAC, IAM, Observability, OpenTelemetry, Prometheus, Grafana, Reliability Engineering, SRE, Python, Golang, Node.js, TypeScript, REST APIs, GraphQL, AI Orchestration, LLM Infrastructure, LangChain, LangGraph, OpenAI APIs, Claude APIs, RAG, Workflow Automation, AI Tool Routing, Enterprise Security, Compliance Engineering, Deployment Architecture, Configuration Management, Extensible Systems, Scalability Engineering, High-Scale Systems, Technical Strategy, Platform Primitives, Developer Platforms, Enterprise Integrations, Infrastructure Engineering, Founding Engineer Mindset.

This role demands deep platform thinking. You've designed systems where the abstractions were the product — where getting the primitives right meant the difference between a product that scales and one that drowns in customer-specific code.

You've done this within large organizations and seen what "enterprise-grade" actually means when thousands of teams depend on your platform. But you've also operated in environments where you had to build fast, make tradeoffs, and ship before the architecture was perfect.

The combination matters. Big-company pattern recognition with small-company intensity.

What We Offer

Lunch and dinner are provided in the office
$200/month learning and development budget
$1,000/month AI tool experimentation budget to automate, accelerate, and improve how you work
High-ownership role with direct exposure to leadership and company-building decisions
Competitive salary and equity

About My Client Company

About the Role

Your Impact

Agent infrastructure platform: The foundational layer that enterprise AI agents run on — governance, observability, and control planes for MCP-powered agent ecosystems. You'll define how organizations monitor, audit, and manage AI agents operating at scale across their systems
The integration gateway: The secure, reliable bridge between an enterprise's AI agents and the outside world — every SaaS tool, internal system, and API they need to act on. Not just connectors, but a platform-grade gateway with the right trust, permissioning, and routing primitives
Platform primitives for scale: Multi-tenancy, isolation, configuration, and extensibility abstractions that let Composio serve thousands of enterprise customers without linear engineering cost
Enterprise-grade architecture: Deployment flexibility, security, and compliance as first-class platform capabilities — not bolted-on afterthoughts
The repeatable deployment motion: Turn enterprise onboarding from a services engagement into a product experience. Shorter cycles, fewer custom touches, more self-serve

What you bring

You've built platforms at genuine scale — not just high user counts, but high complexity: many customer types, deployment models, and integration surfaces
You think in abstractions and primitives. Your instinct is to find the right foundational model, not to solve each problem individually
You've shipped enterprise product capabilities (deployment flexibility, security, admin tooling, compliance) and understand them as product problems, not just checkboxes
You've built or shipped an AI product — or you're the person who can't stop tinkering. You're building agents on weekends, stress-testing the latest models, experimenting with MCP, and forming your own opinions on where agent architectures are headed. You have a point of view on this space, not just a resume line
You're a force multiplier. When you join a team, the entire product moves faster because the platform decisions are right

Lead | Senior Site Reliability Engineer

at Searce Inc

3 recruiters

Posted by Reena Bandekar

Pune

4 - 9 yrs

₹10L - ₹26L / yr

DevOps

Kubernetes

Reliability engineering

Network Security

Amazon VPC

+4 more

Lead Cloud Reliability Engineer

Job Responsibilities

● Lead and manage the Cloud Reliability teams to provide strong Managed Services support to end-customers.

● Isolate, troubleshoot and resolve issues reported by CMS clients in their cloud environment

● Drive communication with the customer, providing details about the issue, current steps, next plan of action, and ETA

● Gather client's requirements related to use of specific cloud services and provide assistance in setting them up and resolving issues

● Create SOPs and knowledge articles for use by the L1 teams to resolve common issues

● Identify recurring issues, perform root cause analysis and propose/implement preventive actions

● Follow change management procedure to identify, record and implement changes

● Plan and deploy OS and security patches in Windows/Linux environments, and upgrade k8s clusters

● Identify the recurring manual activities and contribute to automation

● Provide technical guidance and educate team members on development and operations. Monitor metrics and develop ways to improve.

● System troubleshooting and problem-solving across platform and application domains. Ability to use a wide variety of open-source technologies and cloud services.

● Build, maintain, and monitor configuration standards.

● Ensure critical system security using best-in-class cloud security solutions.

Qualifications

● 4-7 years of experience in the Cloud Infrastructure and Operations domains, plus IT operational experience, preferably in a global enterprise environment.

● Specialize in one or two cloud deployment platforms: AWS, GCP

● Hands-on experience with AWS/GCP services (EKS, ECS, EC2, VPC, RDS, Lambda, GKE, Compute Engine)

● Understanding of one or more programming languages (Python, JavaScript, Ruby, Java, .Net)

● Logging and Monitoring tools (ELK, Stackdriver, CloudWatch)

● Knowledge of Configuration Management tools such as Ansible, Terraform, Puppet, Chef

● Experience working with deployment and orchestration technologies (such as Docker, Kubernetes, Mesos)

● Good analytical, communication, problem solving, and learning skills.

● Knowledge of programming against cloud platforms such as Google Cloud Platform, and of lean development methodologies.

● Strong service attitude and a commitment to quality.

● Willingness to work in shifts.