4+ Reliability engineering Jobs in Chennai | Reliability engineering Job openings in Chennai
Apply to 4+ Reliability engineering Jobs in Chennai on CutShort.io. Explore the latest Reliability engineering Job opportunities across top companies like Google, Amazon & Adobe.
We’re looking for an experienced Site Reliability Engineer to fill the mission-critical role of ensuring that our complex, web-scale systems are healthy, monitored, automated, and designed to scale. You will use your background as an operations generalist to work closely with our development teams from the early stages of design all the way through identifying and resolving production issues. The ideal candidate will be passionate about an operations role that involves deep knowledge of both the application and the product, and will also believe that automation is a key component to operating large-scale systems.
6-Month Accomplishments
- Familiarize yourself with the Poshmark tech stack and functional requirements.
- Get comfortable with the automation tools and frameworks used within the CloudOps organization and the associated deployment processes.
- Gain in-depth knowledge of the related product functionality and the infrastructure it requires.
- Start contributing by working on small- to medium-scale projects.
- Understand and follow the on-call rotation as a secondary to become familiar with the on-call process.
12+ Month Accomplishments
- Execute projects related to comms functionality independently, with little guidance from the lead.
- Create meaningful alerts and dashboards for the various subsystems involved in the targeted infrastructure.
- Identify gaps in infrastructure and suggest improvements or work on them.
- Get involved in the on-call rotation.
Responsibilities
- Serve as a primary point of contact responsible for the overall health, performance, and capacity of one or more of our Internet-facing services.
- Gain deep knowledge of our complex applications.
- Assist in the roll-out and deployment of new product features and installations to facilitate our rapid iteration and constant growth.
- Develop tools to improve our ability to rapidly deploy and effectively monitor custom applications in a large-scale UNIX environment.
- Work closely with development teams to ensure that platforms are designed with "operability" in mind.
- Function well in a fast-paced, rapidly changing environment.
- Participate in a 24x7 on-call rotation.
Desired Skills
- 5+ years of experience in a Systems Engineering/Site Reliability Operations role is required, ideally in a startup or fast-growing company.
- 5+ years in a UNIX-based, large-scale web operations role.
- 5+ years of experience providing 24/7 support for large-scale production environments.
- Battle-proven, real-life experience running a large-scale production operation.
- Experience working on cloud-based infrastructure (e.g., AWS, GCP, Azure).
- Hands-on experience with continuous integration tools such as Jenkins, configuration management with Ansible, and systems monitoring and alerting with tools such as Nagios, New Relic, and Graphite.
- Experience scripting/coding.
- Ability to use a wide variety of open source technologies and tools.
Technologies we use:
- Ruby, JavaScript, Node.js, Tomcat, Nginx, HAProxy
- MongoDB, RabbitMQ, Redis, Elasticsearch
- Amazon Web Services (EC2, RDS, CloudFront, S3, etc.)
- Terraform, Packer, Jenkins, Datadog, Kubernetes, Docker, Ansible and other DevOps tools.
About Poshmark
Poshmark is a leading fashion resale marketplace powered by a vibrant, highly engaged community of buyers and sellers and real-time social experiences. Designed to make online selling fun, more social and easier than ever, Poshmark empowers its sellers to turn their closet into a thriving business and share their style with the world. Since its founding in 2011, Poshmark has grown its community to over 130 million users and generated over $10 billion in GMV, helping sellers realize billions in earnings, delighting buyers with deals and one-of-a-kind items, and building a more sustainable future for fashion. For more information, please visit www.poshmark.com, and for company news, visit newsroom.poshmark.com.
We’re looking for an experienced Site Reliability Engineer to fill the mission-critical role of ensuring that our complex, web-scale systems are healthy, monitored, automated, and designed to scale. You will use your background as an operations generalist to work closely with our development teams from the early stages of design all the way through identifying and resolving production issues. The ideal candidate will be passionate about an operations role that involves deep knowledge of both the application and the product, and will also believe that automation is a key component to operating large-scale systems.
6-Month Accomplishments
- Familiarize yourself with the Poshmark tech stack and functional requirements.
- Get comfortable with the automation tools and frameworks used within the CloudOps organization and the associated deployment processes.
- Gain in-depth knowledge of the related product functionality and the infrastructure it requires.
- Start contributing by working on small- to medium-scale projects.
- Understand and follow the on-call rotation as a secondary to become familiar with the on-call process.
12+ Month Accomplishments
- Execute projects related to comms functionality independently, with little guidance from the lead.
- Create meaningful alerts and dashboards for the various subsystems involved in the targeted infrastructure.
- Identify gaps in infrastructure and suggest improvements or work on them.
- Get involved in the on-call rotation.
Responsibilities
- Serve as a primary point of contact responsible for the overall health, performance, and capacity of one or more of our Internet-facing services.
- Gain deep knowledge of our complex applications.
- Assist in the roll-out and deployment of new product features and installations to facilitate our rapid iteration and constant growth.
- Develop tools to improve our ability to rapidly deploy and effectively monitor custom applications in a large-scale UNIX environment.
- Work closely with development teams to ensure that platforms are designed with "operability" in mind.
- Function well in a fast-paced, rapidly changing environment.
- Participate in a 24x7 on-call rotation.
Desired Skills
- 4+ years of experience in a Systems Engineering/Site Reliability Operations role is required, ideally in a startup or fast-growing company.
- 4+ years in a UNIX-based, large-scale web operations role.
- 4+ years of experience providing 24/7 support for large-scale production environments.
- Battle-proven, real-life experience running a large-scale production operation.
- Experience working on cloud-based infrastructure (e.g., AWS, GCP, Azure).
- Hands-on experience with continuous integration tools such as Jenkins, configuration management with Ansible, and systems monitoring and alerting with tools such as Nagios, New Relic, and Graphite.
- Experience scripting/coding.
- Ability to use a wide variety of open source technologies and tools.
Technologies we use:
- Ruby, JavaScript, Node.js, Tomcat, Nginx, HAProxy
- MongoDB, RabbitMQ, Redis, Elasticsearch
- Amazon Web Services (EC2, RDS, CloudFront, S3, etc.)
- Terraform, Packer, Jenkins, Datadog, Kubernetes, Docker, Ansible and other DevOps tools.
We are hiring a Site Reliability Engineer (SRE) to join our high-performance engineering team. In this role, you'll be responsible for driving reliability, performance, scalability, and security across cloud-native systems while bridging the gap between development and operations.
Key Responsibilities
- Design and implement scalable, resilient infrastructure on AWS
- Take ownership of the SRE function – availability, latency, performance, monitoring, incident response, and capacity planning
- Partner with product and engineering teams to improve system reliability, observability, and release velocity
- Set up, maintain, and enhance CI/CD pipelines using Jenkins, GitHub Actions, or AWS CodePipeline
- Conduct load and stress testing, identify performance bottlenecks, and implement optimization strategies
Required Skills & Qualifications
- Proven hands-on experience in cloud infrastructure design (AWS strongly preferred)
- Strong background in DevOps and SRE principles
- Proficiency with performance testing tools like JMeter, Gatling, k6, or Locust
- Deep understanding of cloud security and best practices for reliability engineering
- AWS Solution Architect Certification – Associate or Professional (preferred)
- Solid problem-solving skills and a proactive approach to systems improvement
Why Join Us?
- Work with cutting-edge technologies in a cloud-native, fast-paced environment
- Collaborate with cross-functional teams driving meaningful impact
- Hybrid work culture with flexibility and autonomy
- Open, inclusive work environment focused on innovation and excellence
Job Title: Site Reliability Engineer (SRE)
Experience: 4+ Years
Work Location: Bangalore / Chennai / Pune / Gurgaon
Work Mode: Hybrid or Onsite (based on project need)
Domain Preference: Candidates with past experience working in shoe/footwear retail brands (e.g., Nike, Adidas, Puma) are highly preferred.
🛠️ Key Responsibilities
- Design, implement, and manage scalable, reliable, and secure infrastructure on AWS.
- Develop and maintain Python-based automation scripts for deployment, monitoring, and alerting.
- Monitor system performance, uptime, and overall health using tools like Prometheus, Grafana, or Datadog.
- Handle incident response, root cause analysis, and ensure proactive remediation of production issues.
- Define and implement Service Level Objectives (SLOs) and Error Budgets in alignment with business requirements.
- Build tools to improve system reliability, automate manual tasks, and enforce infrastructure consistency.
- Collaborate with development and DevOps teams to ensure robust CI/CD pipelines and safe deployments.
- Conduct chaos testing and participate in on-call rotations to maintain 24/7 application availability.
✅ Must-Have Skills
- 4+ years of experience in Site Reliability Engineering or DevOps with a focus on reliability, monitoring, and automation.
- Strong programming skills in Python (mandatory).
- Hands-on experience with AWS cloud services (EC2, S3, Lambda, ECS/EKS, CloudWatch, etc.).
- Expertise in monitoring and alerting tools like Prometheus, Grafana, Datadog, CloudWatch, etc.
- Strong background in Linux-based systems and shell scripting.
- Experience implementing infrastructure as code using tools like Terraform or CloudFormation.
- Deep understanding of incident management, SLOs/SLIs, and postmortem practices.
- Prior working experience in footwear/retail brands such as Nike or similar is highly preferred.


