
Site Reliability Engineer - Product
Position: Site Reliability Engineer
Location: Pune (currently WFH; relocation required after the pandemic)
About the Organization:
A funded product development company, headquartered in Singapore with offices in Australia, the United States, Germany, the United Kingdom, and India. You will gain work experience in a global environment.
Job Description:
We are looking for an experienced DevOps / Site Reliability Engineer to join our team and be instrumental in taking our products to the next level.
In this role, you will be working on bleeding-edge hybrid cloud / on-premise infrastructure handling billions of events and terabytes of data a day.
You will be responsible for working closely with various engineering teams to design, build and maintain a globally distributed infrastructure footprint.
As part of this role, you will be responsible for researching new technologies, managing a large fleet of active services and their underlying servers, automating the deployment, monitoring and scaling of components, and optimizing the infrastructure for cost and performance.
Day-to-day responsibilities
- Ensure the operational integrity of the global infrastructure
- Design repeatable continuous integration and delivery systems
- Test and measure new methods, applications and frameworks
- Analyze and leverage various AWS-native functionality
- Support and build out an on-premise data center footprint
- Support other teams and diagnose issues related to our infrastructure
- Participate in 24/7 on-call rotation (If Required)
Requirements
- Expert-level administrator of Linux-based systems
- Experience managing distributed data platforms (Kafka, Spark, Cassandra, etc.); Aerospike experience is a plus.
- Experience with production deployments of Kubernetes clusters
- Experience in automating provisioning and managing Hybrid-Cloud infrastructure (AWS, GCP and On-Prem) at scale.
- Knowledge of monitoring platforms (Prometheus, Grafana, Graphite).
- Experience in Distributed storage systems such as Ceph or GlusterFS.
- Experience in virtualisation with KVM, oVirt and OpenStack.
- Hands-on experience with infrastructure-as-code and configuration management tools such as Terraform and Ansible
- Bash and Python Scripting Expertise
- Network troubleshooting experience (TCP, DNS, IPv6 and tcpdump)
- Experience with continuous delivery systems (Jenkins, GitLab, Bitbucket, Docker)
- Experience managing hundreds to thousands of servers globally
- Enjoy automating tasks, rather than repeating them
- Capable of estimating costs of various approaches, and finding simple and inexpensive solutions to complex problems
- Strong verbal and written communication skills
- Ability to adapt to a rapidly changing environment
- Comfortable collaborating and supporting a diverse team of engineers
- Ability to troubleshoot problems in complex systems
- Flexible working hours and ability to participate in 24/7 on-call support with other team members whenever required.

Candidate MUST HAVE product-based company experience and a minimum of 3 years of experience in DevOps.
What you will do (or learn) :
1. Build our application stack on AWS. Infrastructure as code (read Terraform)
2. Build state-of-the-art CI/CD pipelines.
3. Manage data warehouses and data pipelines.
4. Work on infrastructure and data security.
5. Build a state-of-the-art log management system and the tooling around it.
6. Build monitoring and alerting systems.
What do we expect from you?
1. 3 to 10 years of experience with DevOps or SRE principles.
2. Good fundamentals of database management and other distributed systems management.
3. Experience in infrastructure as code or other configuration management systems.
4. Experience in scripting languages (such as Bash, Python, Go, etc.)
5. Good understanding of Linux systems
6. Strong debugging and troubleshooting skills
7. Experience in tooling around monitoring, CI/CD, log management systems.
Nvizion Solutions is hiring for the position of Site Reliability Engineer.
If interested, kindly share your resume along with contact details.
Title: Site Reliability Engineer
No. of job openings: 2
Location: Gurgaon/Hyderabad/Bengaluru/Mumbai/Chennai (remote location)
Remuneration: Best in the industry
· Experience required: 2 to 4 yrs in the industry
· Ensure overall system reliability
· Add automation and alerting to the system
· Provide troubleshooting support
· Cross-team communication: work closely with the Product team and the Customer Success team.
· Proactive support to ensure the system returns to a healthy state
· R&D for new tools/technologies to support the product and support teams
· Good verbal/written communication to connect with the client.
· Good team player with a zeal to learn new technologies.
· The candidate will be part of the team responsible for 24x7 monitoring of a distributed global platform.
- Linux Scripting
- CI/CD knowledge (Jenkins/Bitbucket Pipelines/GitOps)
- Version Control
- Cloud platform knowledge (GCP/AWS/Azure/DigitalOcean)
- Docker, Kubernetes
Senior Site Reliability Engineer
Experience: 5-8 Years
Location: Pune
Type: Full-time
About Digit88:
Digit88 empowers digital transformation for innovative and high growth B2B and B2C SaaS companies as their trusted offshore software product engineering partner!
We are a lean mid-stage software company, with a team of 75+ fantastic technologists, backed by executives with deep understanding of and extensive experience in consumer and enterprise product development across large corporations and startups. We build highly efficient and effective engineering teams that solve real and complex problems for our partners.
With 50+ years of collective experience in areas ranging from B2B and B2C SaaS, web and mobile apps, e-commerce platforms and solutions, custom enterprise SaaS platforms and domains spread across Conversational AI, Chatbots, IoT, Health-tech, ESG/Energy Analytics, Data Engineering, the founding team thrives in a fast-paced and challenging environment that allows us to showcase our best.
The Vision: To be the most trusted technology partner to innovative software product companies world-wide
The Opportunity:
Digit88 is expanding the extended software product engineering team for its partner, a US-based Energy Analytics SaaS platform company. Our partner is building a suite of cloud-based business operation support platforms in the Utilities Rate Lifecycle space in the Energy sector/domain. This is a bleeding edge AI and Big Data platform that helps large energy utility companies in the US plan, manage, review and optimize their new product and rate design, billing, rate analysis, forecasting, and CRM. The candidate would be joining an existing team of product engineers in the US and Pune/India and be at the forefront of SaaS product engineering.
Digit88 is seeking an enthusiastic Site Reliability Engineer (DevOps) with 5-8 years of hands-on experience. Experience in a fast-paced India/US B2B or B2C SaaS product engineering environment, setting up and maintaining a high-availability, high-performance real-time system, is mandatory. Applicants must have a passion for engineering with accuracy and efficiency, be highly motivated and organized, able to work as part of a team, and also possess the ability to work independently with minimal supervision.
Responsibilities:
In this role, you'll get to:
- Build and maintain cross-team platform components: infrastructure based on Infrastructure-as-Code, CI/CD pipelines, application/infrastructure monitoring, and automation of other development-related processes
- Design and Deploy Automation of Container Applications using Kubernetes and Docker
- Set up application/system monitoring
- Work with Developers/QA to build and validate containerized applications
- Manage geographically deployed server farms
- Document Deployment Processes, Services and Environments
Skills/Requirements:
- BE/BTech with CS or related discipline
- 5+ years of experience as a DevOps Engineer, Site Reliability Engineer (SRE) or Systems Engineer
- Advanced understanding of AWS services and components - VPC, IAM, EC2, ALB, ECS/EKS
- Strong background in Linux Shell Programming.
- Strong experience with SQL and NoSQL (Cassandra or DynamoDB)
- Strong experience in Distributed Streaming Platform (Kafka/ Spark)
- Strong experience in Docker Containers
- Hands on experience with one or more of Java/Python/Go/NodeJS languages
- Implementation experience in automation tools and frameworks (CI/CD pipelines) like Git (source repo), Maven/Gradle (build tool), Jenkins/TeamCity and Docker.
- Hands-on experience in Kubernetes
- Experience in Package Management Tools like npm.
- Experience with automation/configuration management tools like Salt/Puppet/Chef/Ansible.
- Ability to use a wide variety of open source technologies and cloud services (experience with AWS is required) for application deployment.
- Knowledge of best practices and IT operations in an always-up, always-available service
- Experience in troubleshooting production issues and coordinating with the development team to streamline code deployment.
- Proven experience in optimizing the company’s computing architecture.
Good to have:
- AWS certification (Architect, Operations) is a plus
- Experience with monitoring tools like Grafana/Prometheus and AppDynamics is a plus
Additional Project/Soft Skills:
- Strong verbal and written communication with ability to articulate problems and solutions over the phone and emails.
- Strong sense of urgency, with a passion for accuracy and timeliness.
- Ability to work calmly in high pressure situations and manage multiple projects/tasks.
- Ability to work independently and possess superior skills in issue resolution.
Benefits/Culture @ Digit88:
- Comprehensive Insurance (Life, Health, Accident)
- Flexible Work Model
- Accelerated learning & non-linear growth
- Flat organisation structure driven by ownership and accountability.
- Global Peers - Working with some of the best engineers/professionals globally from the likes of Apple, Amazon, IBM Research, Adobe and other innovative product companies
- Ability to make a global impact with your work, leading innovations in Conversational AI, Tele-Medicine, Healthcare and more.
You will work with a founding team of serial entrepreneurs with multiple successful exits to their credit. The learning will be immense, as will the challenges.
This is the right time to join us and partner in our growth!
Principal Site Reliability Engineer (SRE)
Experience: 10-12 Years
Location: Pune
Type: Full-time
About Digit88:
Digit88 empowers digital transformation for innovative and high growth B2B and B2C SaaS companies as their trusted offshore software product engineering partner!
We are a lean mid-stage software company, with a team of 75+ fantastic technologists, backed by executives with deep understanding of and extensive experience in consumer and enterprise product development across large corporations and startups. We build highly efficient and effective engineering teams that solve real and complex problems for our partners.
With 50+ years of collective experience in areas ranging from B2B and B2C SaaS, web and mobile apps, e-commerce platforms and solutions, custom enterprise SaaS platforms and domains spread across Conversational AI, Chatbots, IoT, Health-tech, ESG/Energy Analytics, Data Engineering, the founding team thrives in a fast-paced and challenging environment that allows us to showcase our best.
The Vision: To be the most trusted technology partner to innovative software product companies world-wide
The Opportunity:
Digit88 is expanding its product development team for its US-based partner, which is building next-generation Big Data, cloud-based Business Operation Support technology for utilities, retail energy suppliers and Community Choice Aggregators (CCA). The candidate would join an existing team of engineers in Pune and the US, help us expand the product engineering team, and work on different products and on different layers of the infrastructure.
Digit88 is seeking an enthusiastic Principal Site Reliability Engineer (SRE) with 10-12 years of hands-on experience to join the team. Experience with a fast-paced India/US product/engineering services company in a DevOps engineer role, setting up and maintaining a high-availability, high-performance real-time system is mandatory. Applicants must have a passion for engineering with accuracy and efficiency, be highly motivated and organized, able to work as part of a team, and also possess the ability to work independently with minimal supervision.
Responsibilities:
In this role, you'll get to:
- Build and maintain cross-team platform components: infrastructure based on Infrastructure-as-Code, CI/CD pipelines, application/infrastructure monitoring, and automation of other development-related processes
- Design and Deploy Automation of Container Applications
- Set up application/system monitoring
- Work with Developers/QA to build and validate containerized applications
- Manage geographically deployed server farms
- Document Deployment Processes, Services and Environments
Requirements:
- BE/BTech with CS or related discipline
- 8+ years of experience as a DevOps Engineer and Site Reliability Engineer (SRE)
- Advanced understanding of AWS services and components - VPC, IAM, EC2, ALB, ECS/EKS
- Strong hands-on expertise in Linux Shell Programming.
- Expert level experience in Terraform
- Strong experience with SQL and NoSQL (Cassandra or DynamoDB)
- Strong experience in Docker Containers
- Hands on experience with one or more of Java/Python/Go/NodeJS languages
- Implementation experience in automation tools and frameworks (CI/CD pipelines) like Git (source repo), Maven/Gradle (build tool), Jenkins/TeamCity and Docker.
- Hands-on experience in Kubernetes
- Experience in Package Management Tools like npm.
- Experience with automation/configuration management tools like Salt/Puppet/Chef/Ansible.
- Experience in Distributed Streaming Platform (Kafka/ Spark)
- Ability to use a wide variety of open source technologies and cloud services (experience with AWS is required) for application deployment.
- Knowledge of best practices and IT operations in an always-up, always-available service
- Experience in troubleshooting production issues and coordinating with the development team to streamline code deployment.
- Proven experience in optimizing the company’s computing architecture.
Good to have:
- AWS certification (Architect, Operations) is a plus
- Experience with monitoring tools like Grafana/Prometheus and AppDynamics is a plus
Additional Project/Soft Skills:
- Strong verbal and written communication with ability to articulate problems and solutions over the phone and emails.
- Strong sense of urgency, with a passion for accuracy and timeliness.
- Ability to work calmly in high pressure situations and manage multiple projects/tasks.
- Ability to work independently and possess superior skills in issue resolution.
Benefits/Culture @ Digit88:
- Comprehensive Insurance (Life, Health, Accident)
- Flexible Work Model
- Accelerated learning & non-linear growth
- Flat organisation structure driven by ownership and accountability.
- Global Peers - Working with some of the best engineers/professionals globally from the likes of Apple, Amazon, IBM Research, Adobe and other innovative product companies
- Ability to make a global impact with your work, leading innovations in Conversational AI, Tele-Medicine, Healthcare and more.
You will work with a founding team of serial entrepreneurs with multiple successful exits to their credit. The learning will be immense, as will the challenges.
This is the right time to join us and partner in our growth!
JD: Site Reliability Engineers
Location: PUNE, Remote
Sarvaha would like to welcome experienced SRE specialists with a minimum of 5 years of professional experience in Google Cloud Platform or AWS based deployments and automation. Sarvaha is a niche software development company that works with some of the best-funded startups and established companies across the globe. You will be expected to work with a globally distributed team and to contribute independently as well as lead a team of engineers. This is a hands-on position that requires you to be responsible for production software deployments across global availability zones.
Key Responsibilities
- Design, write and run services that provide visibility into a leading IoT platform & underlying services
- Automate deployments, diagnostic and debugging tools
- Participate in on-call rotations
- Adhere to industry-standard security best practices
- Work with other teams in troubleshooting and keeping the systems up and running
Skills Required
- Minimum Bachelor’s Degree in Computer Science or related degree
- 5+ years of total experience, with at least 4 years of experience in an SRE, DevOps or similar role. More experience is highly desired
- 4+ years of hands-on experience with one of AWS/Azure/GCP is a must-have for this position
- 1+ years of experience debugging code written in Python, Java or any strongly typed language
- 3+ years of experience with Kubernetes, Prometheus, ELK, Grafana, Nagios
- 2+ years of experience with Jenkins or similar build and deploy orchestration tool
- 2+ years of experience with RDBMS and NoSQL databases (MySQL, Oracle, Cassandra, CDH)
- 1+ years of experience writing infrastructure as code using Terraform
- Excellent verbal and written communication and strong interpersonal skills are requisite for success in this position
- Strong listening skills and attention to detail are highly desired
Position Benefits
- Top-notch remuneration with non-linear growth
- Work with industry-best cloud architects, DevOps teams and developers
- Excellent, no-nonsense work environment with the very best people to work with
- Cutting-edge work with Fortune 500 businesses, and the chance to learn from high-visibility, public-facing, high-traffic systems
SRE - Tech Lead (DevOps):
Location: Permanent Work From Home Option
Notice: Candidates with a notice period of 30 days or less are preferred
SRE-DevOps- Tech Lead - JD:
Srijan is hiring for Site Reliability Engineering (SRE). We are looking for an SRE/DevOps Tech Lead or Sr. Tech Lead with strong automation skills and a good understanding of how to build and run secure, reliable platforms for cloud-native applications. Please find the detailed job description below:
Minimum Experience: 6+ years in DevOps/SRE
Permanent WFH option
Job Description:-
The focus of this role is to build scalable, resilient, secure infrastructure for cloud-native applications while automating every mundane task you can think of, and to build observability dashboards, set up alerts, etc. to provide visibility to relevant stakeholders. In a nutshell: “You are the keepers of Production environments”. You must be a problem solver with the ability to multitask, and bring strong collaboration and communication skills.
Key Responsibilities:-
- Proactively monitor and review application performance
- Handle on-call and emergency support
- Ensure software has good logging and diagnostics
- Create and maintain operational runbooks
- Contribute to solution design and the evaluation of technical debt
- Set the right practices for a well-defined architecture and to minimize toil
- Own SLI and SLO configuration as per the error budget
- Maintain production services by measuring and monitoring availability, latency, and overall system health
- Practice sustainable incident response and blameless postmortems
- Not be afraid to contribute changes back to the software engineering team to improve the systems
- Manage the delivery pipeline into production
- Mentor junior members on a regular basis
- Troubleshoot issues with web applications
- Understand security principles and best practices
- Ensure that critical data is backed up
- Configure monitoring systems, including infrastructure monitoring and Application Performance Monitoring systems such as New Relic
- Ensure that web application infrastructure is built
- Act as Customer Technical Advocate and negotiate well with peers on technical fronts
- Be flexible enough to work different shifts when business needs require it
- Handle multiple global clients on the technical front and generate the reports needed to represent the health of SRE delivery
Skills/Experience:-
- A key skill of an SRE Tech Lead is deep knowledge of the application, the code, and how it runs, is configured, and scales. That knowledge is what makes them so valuable at monitoring and supporting it as site reliability engineers.
- System administration, security, and networking
- The SRE Tech Lead is expected to have a good understanding of system administration (Linux or Windows) and networking.
- Essential commands
- User and group management
- Knowledge of networking concepts (DNS, TCP/IP, and firewalls)
- Service configuration
- Storage management
- Good grasp of fundamental security concepts
- Good understanding of infrastructure as code principles.
- Knowledge of a scripting language such as Bash
- Ability to configure infrastructure using a configuration management technology such as Puppet, Chef, or Ansible.
- Familiarity with Jenkins or any other CI/CD tool
- Proficiency in a high-level programming language such as Python or Go.
- Understanding of container technologies such as Docker and Kubernetes
- 2+ years of hands-on experience with container orchestration technologies such as ECS, EKS, AKS or Kubernetes would be beneficial.
- Use Terraform and other IaC to deploy cloud infrastructure.
Cloud technologies:-
- Experience designing available, cost-efficient, fault-tolerant, and scalable distributed systems on AWS/Azure
- Hands-on experience using compute, networking, storage, and database AWS/Azure services
- 4+ years of hands-on experience with AWS/Azure deployment and management services
- Ability to identify and define technical requirements for an AWS/Azure-based application
- Ability to identify which AWS/Azure services meet a given technical requirement
- Knowledge of recommended best practices for building secure and reliable applications on the AWS/Azure platform
- An understanding of the AWS/Azure global infrastructure
- An understanding of network technologies as they relate to AWS/Azure
- An understanding of security features and tools that AWS/Azure provides and how they relate to traditional services
● Research, propose and evaluate, with a 5-year vision, the architecture, design, technologies, processes and profiles related to Telco Cloud.
● Participate in the creation of a realistic technical-strategic roadmap of the network to transform it to Telco Cloud and be prepared for 5G.
● Using your deep technical expertise, you will provide detailed feedback to Product Management and Engineering, and contribute directly to the platform code base to enhance both the customer experience of the service and the SRE quality of life.
● The individual must be aware of trends in network infrastructure as well as within the network engineering and OSS community. What technologies are being developed or launched?
● The individual should stay current with infrastructure trends in the telco network cloud domain.
● Be responsible for the engineering of Lab and Production Telco Cloud environments, including patches, upgrades, and reliability and performance improvements.
Required Minimum Qualifications: (Education and Technical Skills/Knowledge)
● Software Engineering degree, MS in Computer Science or equivalent experience
● Years of experience in an SRE, DevOps, Development and/or Support-related role
● 0-5 years of professional experience for a junior position
● At least 8 years of professional experience for a senior position
● Unix server administration and tuning: Linux / RedHat / CentOS / Ubuntu
● Deep knowledge of networking layers 1-4
● Cloud / Virtualization (at least two): Helm, Docker, Kubernetes, AWS, Azure, Google Cloud, OpenStack, OpenShift, VMware vSphere / Tanzu
● In-depth knowledge of cloud storage solutions on top of AWS, GCP, Azure and/or on-prem private cloud, such as Ceph, CephFS, GlusterFS
● DevOps: Jenkins, Git, Azure DevOps, Ansible, Terraform
● Backend knowledge: Bash, Python, Go (knowledge of other scripting languages is a plus).
● PaaS-level solutions such as Keycloak for IAM, Prometheus, Grafana, ELK, DBaaS (such as MySQL, Cassandra)
About the Organisation:
The team at Coredge.io combines experienced and young professionals with many years of experience in Edge computing, Telecom application development and Kubernetes. The company has continuously collaborated with the open source community, universities and major industry players in furthering its goal of providing the industry with an indispensable tool to offer improved services to its customers. Coredge.io has a global market presence, with offices in the US and New Delhi, India.
• Develop and maintain IaC using Terraform and Ansible
• Draft design documents that translate requirements into code.
• Deal with challenges associated with scale.
• Assume responsibilities from technical design through technical client support.
• Manage expectations with internal stakeholders and context-switch in a fast paced environment.
• Thrive in an environment that uses Elasticsearch extensively.
• Keep abreast of technology and contribute to the engineering strategy.
• Champion best development practices and provide mentorship.
What we’re looking for
• An AWS Certified Engineer with strong skills in
o Terraform
o Ansible
o *nix and shell scripting
• Preferably with experience in:
o Elasticsearch
o CircleCI
o CloudFormation
o Python
o Packer
o Docker
o Prometheus and Grafana
o Challenges of scale
o Production support
• Sharp analytical and problem-solving skills.
• Strong sense of ownership.
• Demonstrable desire to learn and grow.
• Excellent written and oral communication skills.
• Mature collaboration and mentoring abilities.
- 5+ years of software development or site reliability engineering or equivalent experience
- Skilled at problem solving, algorithms, and data structures
- Building tools and scripting frameworks from scratch
- Working with Cloud Automation tools like CloudFormation, Terraform, CDK, aws-cli
- Scripting languages like Python, Groovy, PowerShell, Bash, Perl etc.
- Configuration automation using Ansible or equivalent tools
- Exposure to Windows and Linux administration
- Project management tools like Jira, Trello
- Prior experience in dealing with Datastore technologies like Postgres, MySQL, SQL, DynamoDB is desirable
- Familiarity with basic networking, security and cloud engineering concepts
- Team player who is eager to help others to succeed through mentoring and leading by example
- Highly collaborative with effective written and verbal communication skills

