16+ Reliability engineering Jobs in Bangalore (Bengaluru) | Reliability engineering Job openings in Bangalore (Bengaluru)
Apply to 16+ Reliability engineering Jobs in Bangalore (Bengaluru) on CutShort.io. Explore the latest Reliability engineering Job opportunities across top companies like Google, Amazon & Adobe.

It is a product-based company (Domain: EV Charging)
Job Title: Senior Site Reliability Engineer
Location: Bengaluru, India (Hybrid)
Employment Type: Full-time
Experience: 6+ years
About Company
It is driving the electric mobility revolution through cutting-edge software, infrastructure, and professional services. Our technology empowers utilities, cities, fleets, transit agencies, and automakers to deploy EV charging infrastructure at scale safely, efficiently, and sustainably. With a global footprint spanning three continents and operations in 13 countries, we are passionate about shaping the future of sustainable transport.
Operating over 70,000 charge points globally, it is driving the transition toward cleaner, smarter, and more efficient mobility. The India team serves as a critical operational hub, supporting global platforms focused on decarbonization, digitalization, and scalable infrastructure growth.
We value purpose-driven individuals who want to make a meaningful impact and help create a cleaner, smarter, and more connected world.
Role Overview
We are seeking a skilled and proactive Site Reliability Engineer (SRE) to join our growing team. In this role, you will be responsible for maintaining system reliability, scalability, and performance across our EV charging platforms. You will collaborate closely with development and operations teams to build resilient, automated, and observable systems.
Key Responsibilities
- Ensure high availability, performance, and reliability of production systems
- Design, implement, and manage scalable infrastructure solutions
- Build and maintain CI/CD pipelines for efficient software delivery
- Monitor system health using observability tools and respond to incidents proactively
- Automate operational processes using scripting and Infrastructure as Code (IaC)
- Manage containerized environments using Docker and Kubernetes
- Collaborate with cross-functional teams to improve system architecture and resilience
- Participate in on-call rotations and incident management processes
- Continuously optimize cloud infrastructure for cost, performance, and scalability
Required Qualifications & Skills
- Bachelor’s degree in Computer Science, IT, or related field
- 4+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure roles
- Strong experience with containerization (Docker) and orchestration (Kubernetes)
- Proficiency in Linux administration, networking, and system security
- Hands-on experience with cloud platforms, especially AWS (EKS, EC2, S3, RDS, Lambda)
- Experience with CI/CD tools such as Jenkins, GitLab CI/CD, or similar
- Knowledge of Infrastructure as Code tools (Terraform, AWS CloudFormation, Ansible)
- Proficiency in scripting languages (Python, Bash, or PowerShell)
- Experience with monitoring tools like Dynatrace, Prometheus, Grafana, or Zabbix
- Solid understanding of system architecture, microservices, and SaaS/PaaS models
- Strong analytical and problem-solving skills
What We Offer
- Work with some of the brightest minds in the emerging EV industry.
- Make a tangible impact in reducing carbon emissions and enabling sustainable energy.
- Freedom to suggest, implement, and innovate on systems, processes, and technologies.
- Daily ownership in a high-growth, challenging environment.
- Flexible work environment with hybrid schedules and virtualization options.
- Competitive pay and benefits including health coverage, innovative PTO program, and performance bonuses.

It is a product-based company (Domain: EV Charging)
Platform Engineer
Location: Bengaluru, India (Hybrid)
Employment Type: Full-time
Experience: 2-4 years
About Company
This is driving the electric mobility revolution through cutting-edge software, infrastructure, and professional services. Our technology empowers utilities, cities, fleets, transit agencies, and automakers to deploy EV charging infrastructure at scale safely, efficiently, and sustainably. With a global footprint spanning three continents and operations in 13 countries, we are passionate about shaping the future of sustainable transport.
Operating over 70,000 charge points globally, this is driving the transition toward cleaner, smarter, and more efficient mobility. The India team serves as a critical operational hub, supporting global platforms focused on decarbonization, digitalization, and scalable infrastructure growth.
Role Overview
What you’ll do:
- Ensure system reliability, uptime, and performance of the global platform.
- Conduct real-time surveillance of our EV charging systems on a near-24/7 basis to proactively identify and mitigate performance issues and anomalies. To that end, you collaborate with IDT and FMC players to ensure incident detection also happens outside office hours (monitoring shifts are shared among team members according to the duty schedule).
- Deliver on change & releases like firmware changes and drive insights & intelligence back into testing processes and tech discussions with the wider organization.
- Successfully deliver and project manage first time right commissioning activities alongside our Engineering Procurement Contract Management (EPCM) partners to successfully bring charge points onto our Charge Point Management System (CPMS).
- Provide technical guidance and support to DC specialists during the commissioning of EV charging solutions.
- Work closely with Shell, Engineering, and IT colleagues to ensure projects are completed on time and to specification.
- Act as a liaison with the Engineering Procurement Contract Management (EPCM) partner to manage projects from start to finish, ensuring charge points are successfully onboarded on the Charge Point Management System (CPMS).
- Collaborate with development, operations and support teams to build scalable and resilient systems.
- Contribute to incident response, root-cause analysis, and post-mortem reviews, driving continuous improvement.
- Participate in capacity planning, performance tuning, and resource optimization.
- Integrate security and compliance best practices into all infrastructure operations.
- Stay current with emerging SRE tools, frameworks, and cloud technologies to continuously improve reliability practices.
- Participate in and lead on-call rotations and incident response, conducting detailed postmortems and RCA reports.
- Flexible to resolve blocking issues during off hours or weekends if required.
What We’re Looking For:
Basic Qualifications and Skills
- Bachelor’s degree in Engineering, Electrical, ECE, Computer Science, Information Technology, or related field.
- 2–4 years of overall experience, with at least 1 year of experience as a Site Reliability Engineer, DevOps Engineer, or Technical Project Coordinator.
- Proven experience of DevOps, SRE or Technical Project Coordination with IoT or connected devices-based platforms.
- Experience with incident management and on-call best practices. Provide support to on-call engineers.
- Excellent analytical and problem-solving skills with a proactive mindset.
- Expertise with monitoring and observability tools (Dynatrace, Prometheus, Grafana, Zabbix, etc.).
- Solid understanding of cloud platforms (AWS) and AWS native services (EKS, EC2, S3, RDS, Lambda).
- Proactively monitor the network, triage performance outliers, and coordinate corrective actions to ensure optimal system functionality.
- Fluency in English (spoken and written).
- Successfully recommission or decommission chargers following changes in our network.
- Responsible for the go-live of the chargers on Shell’s public network following commissioning attempts.
Additional Information
- This role involves managing infrastructure for a global platform operating in over ten countries, requiring effective communication and collaboration across regions.
- Strong verbal and written communication skills, along with availability and flexibility to resolve blocking issues, are essential to support on-call engineers.
- This role may involve EU or US time-zone shifts based on business requirements.
- Shift timing: 2 PM IST to 11 PM IST.
What is required to be successful in this role:
- Global platform experience (B2C or B2B).
- AWS native service experience.
- Firmware deployment and cloud cost optimization experience.
- Strong exposure to monitoring and alerts.
- Experience with firmware rollout, IoT devices onboarding and offboarding will be an added advantage.
- Experience as an SRE or DevOps Engineer with some exposure to Project Management or Technical Project Management in IoT-based projects will be helpful.
What We Offer
- Work with some of the brightest minds in the emerging EV industry.
- Make a tangible impact in reducing carbon emissions and enabling sustainable energy.
- Freedom to suggest, implement, and innovate on systems, processes, and technologies.
- Daily ownership in a high-growth, challenging environment.
- Flexible work environment with hybrid schedules and virtualization options.
- Competitive pay and benefits including health coverage, innovative PTO program, and performance bonuses.
About My Client Company
We're building the learning infrastructure that transforms AI agents into true digital workers. While today's agents can reason and plan, they fail to do meaningful work because they lack real experience operating in apps. My Client Product gives agents continuously improving, reusable skills across 1000+ production-grade app connectors including Gmail, Linear, and Hubspot. We handle authentication, tool routing, retries, failure handling, and observability, making every action safe and dependable.
About the Role
Every enterprise is racing to make AI work — not as a demo, but as infrastructure that runs their business. My Client Product is becoming the critical layer that makes this possible: the platform that connects AI agents to 250+ real-world applications with production-grade auth, execution, and reliability.
We've built this for the cloud. Now we need to build it for the enterprise — and that means rethinking the platform from the ground up with the right abstractions, primitives, and architectural decisions that let us serve a massive, diverse set of enterprise customers without bespoke engineering for each one. This is a founding role.
Your Impact
- Agent infrastructure platform: The foundational layer that enterprise AI agents run on — governance, observability, and control planes for MCP-powered agent ecosystems. You'll define how organizations monitor, audit, and manage AI agents operating at scale across their systems
- The integration gateway: The secure, reliable bridge between an enterprise's AI agents and the outside world — every SaaS tool, internal system, and API they need to act on. Not just connectors, but a platform-grade gateway with the right trust, permissioning, and routing primitives
- Platform primitives for scale: Multi-tenancy, isolation, configuration, and extensibility abstractions that let Composio serve thousands of enterprise customers without linear engineering cost
- Enterprise-grade architecture: Deployment flexibility, security, and compliance as first-class platform capabilities — not bolted-on afterthoughts
- The repeatable deployment motion: Turn enterprise onboarding from a services engagement into a product experience. Shorter cycles, fewer custom touches, more self-serve
What you bring
- You've built platforms at genuine scale — not just high user counts, but high complexity: many customer types, deployment models, and integration surfaces
- You think in abstractions and primitives. Your instinct is to find the right foundational model, not to solve each problem individually
- You've shipped enterprise product capabilities (deployment flexibility, security, admin tooling, compliance) and understand them as product problems, not just checkboxes
- You've built or shipped an AI product — or you're the person who can't stop tinkering. You're building agents on weekends, stress-testing the latest models, experimenting with MCP, and forming your own opinions on where agent architectures are headed. You have a point of view on this space, not just a resume line
- You're a force multiplier. When you join a team, the entire product moves faster because the platform decisions are right
Skills & Expertise
Platform Engineering, AI Infrastructure, Agentic AI, AI Agents, MCP (Model Context Protocol), Distributed Systems, Enterprise Architecture, Multi-Tenant Architecture, Backend Platform Engineering, Enterprise SaaS, API Platform Engineering, Integration Platforms, SaaS Connectors, Cloud Infrastructure, AWS, GCP, Kubernetes, Docker, Terraform, Microservices, Event-Driven Architecture, API Gateway, OAuth 2.0, RBAC, IAM, Observability, OpenTelemetry, Prometheus, Grafana, Reliability Engineering, SRE, Python, Golang, Node.js, TypeScript, REST APIs, GraphQL, AI Orchestration, LLM Infrastructure, LangChain, LangGraph, OpenAI APIs, Claude APIs, RAG, Workflow Automation, AI Tool Routing, Enterprise Security, Compliance Engineering, Deployment Architecture, Configuration Management, Extensible Systems, Scalability Engineering, High-Scale Systems, Technical Strategy, Platform Primitives, Developer Platforms, Enterprise Integrations, Infrastructure Engineering, Founding Engineer Mindset.
This role demands deep platform thinking. You've designed systems where the abstractions were the product — where getting the primitives right meant the difference between a product that scales and one that drowns in customer-specific code.
You've done this within large organizations and seen what "enterprise-grade" actually means when thousands of teams depend on your platform. But you've also operated in environments where you had to build fast, make tradeoffs, and ship before the architecture was perfect.
The combination matters. Big-company pattern recognition with small-company intensity.
What We Offer
- Lunch and dinner are provided in the office
- $200/month learning and development budget
- $1,000/month AI tool experimentation budget to automate, accelerate, and improve how you work
- High-ownership role with direct exposure to leadership and company-building decisions
- Competitive salary and equity
Lead Cloud Reliability Engineer
Job Responsibilities
● Lead and manage the Cloud Reliability teams to provide strong Managed Services support to end-customers.
● Isolate, troubleshoot and resolve issues reported by CMS clients in their cloud environment
● Drive the communication with the customer providing details about the issue, current steps, next plan of action, ETA
● Gather client's requirements related to use of specific cloud services and provide assistance in setting them up and resolving issues
● Create SOPs and knowledge articles for use by the L1 teams to resolve common issues
● Identify recurring issues, perform root cause analysis and propose/implement preventive actions
● Follow change management procedure to identify, record and implement changes
● Plan and deploy OS, security patches in Windows/Linux environment and upgrade k8s clusters
● Identify the recurring manual activities and contribute to automation
● Provide technical guidance and educate team members on development and operations. Monitor metrics and develop ways to improve.
● System troubleshooting and problem-solving across platform and application domains. Ability to use a wide variety of open-source technologies and cloud services.
● Build, maintain, and monitor configuration standards.
● Ensure critical system security by using best-in-class cloud security solutions.
Qualifications
● 4-7 years experience in Cloud Infrastructure and Operations domains and IT operational experience preferably in a global enterprise environment.
● Specialize in one or two cloud deployment platforms: AWS, GCP
● Hands on experience with AWS/GCP services (EKS, ECS, EC2, VPC, RDS, Lambda, GKE, Compute Engine)
● Understanding of one or more programming languages (Python, JavaScript, Ruby, Java, .Net)
● Logging and Monitoring tools (ELK, Stackdriver, CloudWatch)
● Knowledge on Configuration Management tools such as Ansible, Terraform, Puppet, Chef
● Experience working with deployment and orchestration technologies (such as Docker, Kubernetes, Mesos)
● Good analytical, communication, problem solving, and learning skills.
● Knowledge on programming against cloud platforms such as Google Cloud Platform and lean development methodologies.
● Strong service attitude and a commitment to quality.
● Willingness to work in shifts.
Location: Bangalore
Skill/Experience Expectations:
1. Total experience of 7-11 years
2. 3-4 years in managing scalable production environments
3. 2-4 years of experience in managing Google Cloud infrastructure
4. Proficient in Terraform and any programming language
5. Expert in designing and managing observability solutions
6. 5 years of experience in DevOps and SRE practices and troubleshooting critical incidents
Department: S&C – Site Reliability Engineering (SRE)
Experience Required: 4–8 Years
Location: Bangalore / Pune /Mumbai
Employment Type: Full-time
- Provide Tier 2/3 technical product support to internal and external stakeholders.
- Develop automation tools and scripts to improve operational efficiency and support processes.
- Manage and maintain system and software configurations; troubleshoot environment/application-related issues.
- Optimize system performance through configuration tuning or development enhancements.
- Plan, document, and deploy applications in Unix/Linux, Azure, and GCP environments.
- Collaborate with Development, QA, and Infrastructure teams throughout the release and deployment lifecycles.
- Drive automation initiatives for release and deployment processes.
- Coordinate with infrastructure teams to manage hardware/software resources, maintenance, and scheduled downtimes across production and non-production environments.
- Participate in on-call rotations (minimum one week per month) to address critical incidents and off-hour maintenance tasks.
Key Competencies
- Strong analytical, troubleshooting, and critical thinking abilities.
- Excellent cross-functional collaboration skills.
- Strong focus on documentation, process improvement, and system reliability.
- Proactive, detail-oriented, and adaptable in a fast-paced work environment.
Dear Candidate,
Greetings from Wissen Technology.
We have an exciting Job opportunity for GCP SRE Engineer Professionals. Please refer to the Job Description below and share your profile if interested.
About Wissen Technology:
- The Wissen Group was founded in the year 2000. Wissen Technology, a part of Wissen Group, was established in the year 2015.
- Wissen Technology is a specialized technology company that delivers high-end consulting for organizations in the Banking & Finance, Telecom, and Healthcare domains. We help clients build world class products.
- Our workforce consists of 1000+ highly skilled professionals, with leadership and senior management executives who have graduated from Ivy League Universities like Wharton, MIT, IITs, IIMs, and NITs and with rich work experience in some of the biggest companies in the world.
- Wissen Technology has grown its revenues by 400% in these five years without any external funding or investments.
- Globally present with offices in the US, India, UK, Australia, Mexico, and Canada.
- We offer an array of services including Application Development, Artificial Intelligence & Machine Learning, Big Data & Analytics, Visualization & Business Intelligence, Robotic Process Automation, Cloud, Mobility, Agile & DevOps, Quality Assurance & Test Automation.
- Wissen Technology has been certified as a Great Place to Work®.
- Wissen Technology has been voted as the Top 20 AI/ML vendor by CIO Insider in 2020.
- Over the years, Wissen Group has successfully delivered $650 million worth of projects for more than 20 of the Fortune 500 companies.
- The technology and thought leadership that the company commands in the industry is the direct result of the kind of people Wissen has been able to attract. Wissen is committed to providing them the best possible opportunities and careers, which extends to providing the best possible experience and value to our clients.
We have served clients across sectors like Banking, Telecom, Healthcare, Manufacturing, and Energy. They include the likes of Morgan Stanley, MSCI, State Street Corporation, Flipkart, Swiggy, Trafigura, and GE, to name a few.
Job Description:
Please find below details:
Experience - 4+ Years
Location- Bangalore/Mumbai/Pune
Team Responsibilities
The successful candidate shall be part of the S&C – SRE Team. Our team provides a tier 2/3 support to S&C Business. This position involves collaboration with the client facing teams like Client Services, Product and Research teams and Infrastructure/Technology and Application development teams to perform Environment and Application maintenance and support.
Resource's key Responsibilities
• Provide Tier 2/3 product technical support.
• Building software to help operations and support activities.
• Manage system/software configurations and troubleshoot environment issues.
• Identify opportunities for optimizing system performance through changes in configuration or suggestions for development.
• Plan, document and deploy software applications on our Unix/Linux/Azure and GCP based systems.
• Collaborate with development and software testing teams throughout the release process.
• Analyze release and deployment processes to identify key areas for automation and optimization.
• Manage hardware and software resources and coordinate maintenance and planned downtimes with the infrastructure group across all environments (Production / Non-Production).
• Must spend minimum one week a month as on call support to help with off-hour emergencies and maintenance activities.
Required skills and experience
• Bachelor's degree, Computer Science, Engineering or other similar concentration (BE/MCA)
• Master’s degree a plus
• 6-8 years’ experience in Production Support/ Application Management/ Application Development (support/ maintenance) role.
• Excellent problem-solving/troubleshooting skills, fast learner
• Strong knowledge of Unix Administration.
• Strong scripting skills in Shell, Python, and Batch are a must.
• Strong Database experience – Oracle
• Strong knowledge of Software Development Life Cycle
• PowerShell is nice to have
• Software development skillsets in Java or Ruby.
• Worked upon any of the cloud platforms – GCP/Azure/AWS is nice to have
Job Title: Site Reliability Engineer (SRE)
Experience: 4+ Years
Work Location: Bangalore / Chennai / Pune / Gurgaon
Work Mode: Hybrid or Onsite (based on project need)
Domain Preference: Candidates with past experience working in shoe/footwear retail brands (e.g., Nike, Adidas, Puma) are highly preferred.
🛠️ Key Responsibilities
- Design, implement, and manage scalable, reliable, and secure infrastructure on AWS.
- Develop and maintain Python-based automation scripts for deployment, monitoring, and alerting.
- Monitor system performance, uptime, and overall health using tools like Prometheus, Grafana, or Datadog.
- Handle incident response, root cause analysis, and ensure proactive remediation of production issues.
- Define and implement Service Level Objectives (SLOs) and Error Budgets in alignment with business requirements.
- Build tools to improve system reliability, automate manual tasks, and enforce infrastructure consistency.
- Collaborate with development and DevOps teams to ensure robust CI/CD pipelines and safe deployments.
- Conduct chaos testing and participate in on-call rotations to maintain 24/7 application availability.
✅ Must-Have Skills
- 4+ years of experience in Site Reliability Engineering or DevOps with a focus on reliability, monitoring, and automation.
- Strong programming skills in Python (mandatory).
- Hands-on experience with AWS cloud services (EC2, S3, Lambda, ECS/EKS, CloudWatch, etc.).
- Expertise in monitoring and alerting tools like Prometheus, Grafana, Datadog, CloudWatch, etc.
- Strong background in Linux-based systems and shell scripting.
- Experience implementing infrastructure as code using tools like Terraform or CloudFormation.
- Deep understanding of incident management, SLOs/SLIs, and postmortem practices.
- Prior working experience in footwear/retail brands such as Nike or similar is highly preferred.
The candidate should have a background in development/programming with experience in at least one of the following: .NET, Java (Spring Boot), ReactJS, or AngularJS.
Primary Skills:
- AWS or GCP Cloud
- DevOps CI/CD pipelines (e.g., Azure DevOps, Jenkins)
- Python/Bash/PowerShell scripting
Secondary Skills:
- Docker or Kubernetes
Company Description
Smarsh is the leader in communications compliance, archiving, and analytics. We provide compliance across the broadest set of communications channels with insights on what’s being captured. Smarsh customers manage over 500 million daily conversations across 80 channels and growing. Customers include the top 10 U.S., top 8 European, top 5 Canadian, and top 3 Asian banks. The Smarsh advantage is customers stay ahead of compliance and uncover patterns and relationships hidden within their data.
At Smarsh, we’ve been helping our customers manage new forms of communication since 1998. We work closely with regulators including the SEC, FINRA, IIROC, and the PRA and FCA, and with our customers, to ensure that they understand the capabilities of today’s technology and that our platform meets their most stringent requirements. Our products include Connected Capture, Connected Archive, Web Archive & Business Solutions.
About the team
Are you an SRE with excellent Observability, Containerization and Orchestration skills? As a Site Reliability Engineer (SRE) in the Smarsh SaaS Operations team, you'll be part of a team who measures and improves production performance reliability through sustainable engineering practices for our suite of applications. Toil will be your number one enemy, observability your closest friend and your mission will be to drive operational burden as close to zero as you can.
Responsibilities
- Responsible for technical direction at the platform solutions level. Is able to weigh the pros and cons of various solutions and credibly argue for the best path
- Work closely with Product Management and the rest of the engineering team to define features and their implementations with careful attention to quality, scalability, and maintainability
- Can break down complex technical solutions into abstractions that the rest of the team can understand
- Can investigate and solve complex bugs, performance, and scalability issues
- Collaborates with multiple agile teams to ensure their solutions integrate effectively
- Track work in ticketing system (JIRA)
- Participate in Pull Request reviews. Provide and receive feedback to continuously improve.
- Other duties as assigned.
Desired skills & experience
- A minimum of 10 years of industry experience
- Masters in CS or equivalent
- Must have experience in Azure or AWS, either running some large-scale app there or migrating to Azure/AWS.
- Experience operating Cloud Foundry in production environments
- Experience managing CI/CD systems (Concourse, Jenkins, TravisCI etc.)
- Experience deploying and/or operating ELK stack
- Experience with container technologies and orchestration platforms (Docker, Kubernetes, Cloud Foundry)
- Experience working with monitoring and observability tools (We use Datadog and New Relic)
- Familiarity with working with PostgreSQL and MongoDB
- Background working in a multi-platform environment (Linux, Windows)
- Experience with running on a cloud platform, AWS preferred (S3, RDS, SQS)
- Familiarity with Agile/Scrum/Kanban methodologies
- Familiarity with programming/scripting languages (e.g. Python, Bash, PowerShell, Go)
Additional Skills
- Expert programming skills in relevant languages
- Exceptional analytical and problem-solving skills
- Strong communication and collaboration skills
- Deep understanding of modern software architecture
- Deep domain knowledge of the industry, platform, and existing processes
- Fault-tolerant design & maintenance
- Knowledge and understanding of modern software programming/engineering.
- Product delivery lifecycle - requirement refinement through ops
Why Smarsh?
Ready to join a thriving tech company that’s redefining digital archiving and business intelligence?
Smarsh is the leading comprehensive archiving platform. Recognized as one of today’s fastest growing companies in the U.S., Smarsh delivers innovative cloud-based solutions that help organizations manage and enforce flexible and secure records retention and compliance strategies for electronic communications, including social media and enterprise social networks (Yammer, Chatter, Facebook, LinkedIn and more).
Our motto is ‘People First. Inspire Confidence. Embrace the Impossible.’ We hire lifelong learners who have a passion for their discipline and a track record of excellence. To learn more about us, visit www.smarsh.com/careers
We are looking for a Senior SRE with a proven track record of success leading complex cloud-hybrid environments. You will have:
- Strong sense of Being an Owner, Wearing the Customer Shoes, with the ability to Empower Others demonstrated through clear communication and collaboration.
- Skills to work independently with multiple global teams, developing, configuring, deploying, and operating our global infrastructure on AWS and on-prem.
- Operational experience in complex distributed and real-time systems, including experience with SLOs/SLAs towards high availability, reliability and DR goals.
- DevOps experience in building tools and frameworks, with an understanding of continuous deployment processes.
- Ability to think at scale, bringing a focus on continuous delivery methodologies from design through deployment and operations.
- Experience building and managing systems with tools including Kubernetes, Chef/Ansible/Puppet, Kafka, Docker, and Terraform.
- 5+ years experience in a Software and/or Site Reliability Engineering role
- Experience writing automation code in GoLang, Python or Java
- Experience developing and operating large scale distributed systems with Kubernetes and Docker
- Experience running real-time, low-latency, highly available applications (Kafka, gRPC, RTP)
- Experience running public cloud environments on AWS
- Experience running hybrid clouds and on-prem infrastructures on Red Hat Enterprise Linux / CentOS
- Bachelor's degree in Engineering, Computer Science or equivalent experience
- The ability to lead, partner, and collaborate cross-functionally across an engineering organization

My client is a leader in the Ed Tech space
Job Title: Sr. Reliability Engineer/ Engineering Manager
Location: Powai (Mumbai)/ Bengaluru/ Delhi NCR
- System maintenance and administration of Applications and Software
- Integration with different systems and apps
- Must have knowledge of microservices
- Should be proficient in Python scripting; knowledge of Java is good to have
- Excellent at root cause analysis
- Should be a technical as well as process-oriented candidate
- Excellent Debugging and Monitoring skills
- Responsible for daily trouble ticket resolution, client interaction and customer support via email and video calls
- Responsible for the setup for the Continuous Integration build and deployment for DEV and UAT environments
- Responsible for operational support and problem resolution for application users and internal operations team
- Identifying, tracking, managing and resolving project issues effectively and efficiently
- Supporting projects during the normalization phase after Go-Live and ensuring the smooth transitions to operations
- Review and implement the processes related to IT Support and Operations before handing over to Support Services
- Strong ability to work independently on complex issues
- Collaborate efficiently with internal experts to resolve customer issues quickly
- Both proactive and reactive in work approach
- Should be willing to work in a 24/7 work environment
Ideal candidate has:
- 6-10 years of experience
- Product Background
● Research, propose and evaluate with a 5-year vision, the architecture, design, technologies, processes and profiles related to Telco Cloud.
● Participate in the creation of a realistic technical-strategic roadmap of the network to transform it to Telco Cloud and be prepared for 5G.
● Using your deep technical expertise, you will provide detailed feedback to Product Management and Engineering, as well as contribute directly to the platform code base to enhance both the Customer experience of the service, as well as the SRE quality of life.
● The individual must be aware of trends in network infrastructure as well as within the network engineering and OSS community. What technologies are being developed or launched?
● The individual should stay current with infrastructure trends in the telco network cloud domain.
● Be responsible for the Engineering of Lab and Production Telco Cloud environments, including patches, upgrades, and reliability and performance improvements.
Required Minimum Qualifications: (Education and Technical Skills/Knowledge)
● Software Engineering degree, MS in Computer Science or equivalent experience
● Years of experience in an SRE, DevOps, development and/or support-related role
● 0-5 years of professional experience for a junior position
● At least 8 years of professional experience for a senior position
● Unix server administration and tuning: Linux / RedHat / CentOS / Ubuntu
● Deep knowledge of networking layers 1-4
● Cloud / Virtualization (at least two): Helm, Docker, Kubernetes, AWS, Azure, Google Cloud, OpenStack, OpenShift, VMware vSphere / Tanzu
● In-depth knowledge of cloud storage solutions on top of AWS, GCP, Azure and/or on-prem private cloud, such as Ceph, CephFS, GlusterFS
● DevOps: Jenkins, Git, Azure DevOps, Ansible, Terraform
● Backend knowledge: Bash, Python, Go (knowledge of other scripting languages is a plus).
● PaaS-level solutions such as Keycloak for IAM, Prometheus, Grafana, ELK, DBaaS (such as MySQL, Cassandra)
About the Organisation:
The team at Coredge.io combines experienced and young professionals, with many years of experience in edge computing, telecom application development and Kubernetes. The company has continuously collaborated with the open source community, universities and major industry players in furthering its goal of providing the industry with an indispensable tool to offer improved services to its customers. Coredge.io has a global market presence with offices in the US and New Delhi, India.
Who You Are
- Creative thinker and strong problem solver with meticulous attention to detail
- Highly organized, creative, motivated, and passionate about achieving results
- Able to balance multiple tasks and projects effectively and quickly adapt to new situations and technologies
- Able to work both independently and as part of a team
- Systematic problem-solver, coupled with a strong sense of ownership and drive
What you need
- 3-7 years of experience as a Site Reliability Engineer or a mix of a software engineer and DevOps.
- Strong hands-on knowledge of Linux fundamentals, System administration scripting, performance tuning/scalability, troubleshooting.
- Write great quality code using SOLID principles including unit and integration tests.
- Hands-on development experience in an object-orientated programming language like Python.
- Hands-on experience developing task automations
- Experience using tools to create and manage CI (continuous integration) and CD (continuous delivery) pipelines.
- Familiarity with software development tools: source code management (SCM systems), code review systems, issue tracking tools, build tools, test frameworks, code quality tools.
- Experience implementing open-source observability and alerting tools, like Prometheus, Grafana, Cortex, Thanos, Alertmanager, etc.
- Decent knowledge of networking (VPC, VNet, DNS, etc.) and of the TCP/IP stack, internet routing and load balancing.
- Experience working with log and configuration management tools
- Prior experience of working with AWS, Azure, GCP is a plus
- Prior experience of working with Kubernetes, Docker and containers is a plus
- Strong interpersonal communication skills (including listening, speaking, and writing) and ability to work well in a diverse, team-focused environment with other SREs, Engineers, Product Managers, etc.
- Documenting your work should be in your DNA
What you get
- A chance to develop and build something (probably from scratch) which you can be proud of
- Build and Implement modern systems observability solutions including monitoring, alerting, metrics, logging, and APM & distributed tracing.
- Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
- Maintain business continuity by identifying and driving opportunities to make systems highly resilient and human-free.
- Closely work with the software engineering team to ensure accurate monitoring and metrics are being built into applications before going to production.
- Develop and maintain software modules for use and re-use in cloud and on-premise systems automation.
- Identify process gaps and implement process improvements to increase operational reliability
- Drive standardization efforts across the services, infrastructure, systems, and practices
- Develop systems and tools to help the development team uphold reliability principles
Technical Leader reporting to the CTO/CEO. Your responsibilities include the following, but are not limited to:
- Architecting, Designing and Developing Software Programmes based on requirements provided
- Designing and Developing with a high quality of code that is modular, scalable and re-usable at all times
- Promote SRE (Site Reliability Engineering) to ensure all of the services are Highly-Available and Fault Tolerant at all times
- Communicate effectively the system requirements to other software development teams
- Engage proactively with clients and their requirements
- Evaluate and select appropriate software or hardware and suggest integration methods
- Oversee assigned programs (e.g. conduct code review) and provide guidance to team members
- Assist with solving technical problems when they arise
- Ensure the implementation of agreed architecture and infrastructure
- Address technical concerns, ideas and suggestions
- Monitor systems to ensure they meet both user needs and business goals





