Site Reliability Engineer - Product
Position: Site Reliability Engineer
Location: Pune (currently WFH; relocation required post-pandemic)
About the Organization:
A funded product development company, headquartered in Singapore, with offices in Australia, the United States, Germany, the United Kingdom, and India. You will gain work experience in a global environment.
Job Description:
We are looking for an experienced DevOps / Site Reliability Engineer to join our team and be instrumental in taking our products to the next level.
In this role, you will be working on bleeding-edge hybrid cloud / on-premise infrastructure handling billions of events and terabytes of data a day.
You will be responsible for working closely with various engineering teams to design, build and maintain a globally distributed infrastructure footprint.
As part of this role, you will be responsible for researching new technologies, managing a large fleet of active services and their underlying servers, automating the deployment, monitoring, and scaling of components, and optimizing the infrastructure for cost and performance.
Day-to-day responsibilities
- Ensure the operational integrity of the global infrastructure
- Design repeatable continuous integration and delivery systems
- Test and measure new methods, applications and frameworks
- Analyze and leverage various AWS-native functionality
- Support and build out an on-premise data center footprint
- Support other teams and diagnose issues related to our infrastructure
- Participate in 24/7 on-call rotation (If Required)
- Expert-level administrator of Linux-based systems
- Experience managing distributed data platforms (Kafka, Spark, Cassandra, etc.); Aerospike experience is a plus.
- Experience with production deployments of Kubernetes clusters
- Experience in automating provisioning and managing Hybrid-Cloud infrastructure (AWS, GCP and On-Prem) at scale.
- Knowledge of monitoring platforms (Prometheus, Grafana, Graphite).
- Experience in Distributed storage systems such as Ceph or GlusterFS.
- Experience in virtualisation with KVM, oVirt, and OpenStack.
- Hands-on experience with infrastructure-as-code and configuration management tools such as Terraform and Ansible
- Bash and Python Scripting Expertise
- Network troubleshooting experience (TCP, DNS, IPv6 and tcpdump)
- Experience with continuous delivery systems (Jenkins, GitLab, Bitbucket, Docker)
- Experience managing hundreds to thousands of servers globally
- Enjoy automating tasks, rather than repeating them
- Capable of estimating costs of various approaches, and finding simple and inexpensive solutions to complex problems
- Strong verbal and written communication skills
- Ability to adapt to a rapidly changing environment
- Comfortable collaborating and supporting a diverse team of engineers
- Ability to troubleshoot problems in complex systems
- Flexible working hours and the ability to participate in 24/7 on-call support with other team members whenever required.
Similar jobs
CoinFantasy is looking for a tech enthusiast working primarily on blockchain technology to be part of the core blockchain team at CoinFantasy. You would be a part of the Roadmap team that is working on the architecture, design, development, and deployment of our decentralised platform.
Your primary responsibilities would be analysing requirements, designing blockchain technology around a certain business model, and writing smart contracts.
Job Responsibilities
- Administer our blockchain, database, and DevOps infrastructure.
- Collaborate across teams to coordinate safe, efficient releases.
- Build complex pipelines for databases, messaging, storage, and compute in AWS.
- Build deployment pipelines with GitHub CI (Actions).
- Build tools to reduce occurrences of errors and improve our protocols.
- Develop software to integrate with internal back-end systems.
- Perform root cause analysis for production errors.
- Investigate and resolve technical issues.
- Design procedures for system troubleshooting and maintenance.
Requirements
- 8+ years of experience working in DevOps, Infrastructure, Site Reliability, or Cloud Engineering
- Understanding of the entire tech stack of blockchain dApps
- Strong experience working with any configuration management tools
- Languages: Any modern programming language
- Experience working with some of the major public clouds, e.g., AWS, Azure
- Competent with the “basics”, e.g., computer networking
- Self-motivated individual with enthusiasm for learning and building things
- Collaborative, communicative, and confident in their abilities to work well with all team members at all seniority and skill levels
- Hands-on experience with Rust/Substrate and contributions to open-source blockchain projects are an added advantage
About Us
CoinFantasy is a Play to Invest platform that brings the world of investment to users through engaging games. With multiple categories of games, it aims to make investing fun, intuitive, and enjoyable for users.
It features a sandbox environment in which users are exposed to the end-to-end investment journey without risking financial losses.
Website: https://www.coinfantasy.io/
Benefits
- Competitive Salary
- An opportunity to be part of the Core team in a fast-growing company
- A fulfilling, challenging and flexible work experience
- Practically unlimited professional and career growth opportunities
With a core belief that advertising technology can measurably improve the lives of patients, DeepIntent is leading the healthcare advertising industry into the future. Built purposefully for the healthcare industry, the DeepIntent Healthcare Advertising Platform is proven to drive higher audience quality and script performance with patented technology and the industry’s most comprehensive health data. DeepIntent is trusted by 600+ pharmaceutical brands and all the leading healthcare agencies to reach the most relevant healthcare provider and patient audiences across all channels and devices. For more information, visit DeepIntent.com or find us on LinkedIn.
We are seeking a skilled and experienced Site Reliability Engineer (SRE) to join our dynamic team. The ideal candidate will have a minimum of 3 years of hands-on experience in managing and maintaining production systems, with a focus on reliability, scalability, and performance. As an SRE at DeepIntent, you will play a crucial role in ensuring the stability and efficiency of our infrastructure, as well as contributing to the development of automation and monitoring tools.
Responsibilities:
- Deploy, configure, and maintain Kubernetes clusters for our microservices architecture.
- Utilize Git and Helm for version control and deployment management.
- Implement and manage monitoring solutions using Prometheus and Grafana.
- Work on continuous integration and continuous deployment (CI/CD) pipelines.
- Containerize applications using Docker and manage orchestration.
- Manage and optimize AWS services, including but not limited to EC2, S3, RDS, and AWS CDN.
- Maintain and optimize MySQL databases, Airflow, and Redis instances.
- Write automation scripts in Bash or Python for system administration tasks.
- Perform Linux administration tasks and troubleshoot system issues.
- Utilize Ansible and Terraform for configuration management and infrastructure as code.
- Demonstrate knowledge of networking and load-balancing principles.
- Collaborate with development teams to ensure applications meet reliability and performance standards.
Additional Skills (Good to Know):
- Familiarity with ClickHouse and Druid for data storage and analytics.
- Experience with Jenkins for continuous integration.
- Basic understanding of Google Cloud Platform (GCP) and data center operations.
Qualifications:
- Minimum 3 years of experience in a Site Reliability Engineer role or similar.
- Proven experience with Kubernetes, Git, Helm, Prometheus, Grafana, CI/CD, Docker, and microservices architecture.
- Strong knowledge of AWS services, MySQL, Airflow, Redis, and AWS CDN.
- Proficient in scripting languages such as Bash or Python.
- Hands-on experience with Linux administration.
- Familiarity with Ansible and Terraform for infrastructure management.
- Understanding of networking principles and load balancing.
Education:
Bachelor's degree in Computer Science, Information Technology, or a related field.
DeepIntent is committed to bringing together individuals from different backgrounds and perspectives. We strive to create an inclusive environment where everyone can thrive, feel a sense of belonging, and do great work together.
DeepIntent is an Equal Opportunity Employer, providing equal employment and advancement opportunities to all individuals. We recruit, hire and promote into all job levels the most qualified applicants without regard to race, color, creed, national origin, religion, sex (including pregnancy, childbirth and related medical conditions), parental status, age, disability, genetic information, citizenship status, veteran status, gender identity or expression, transgender status, sexual orientation, marital, family or partnership status, political affiliation or activities, military service, immigration status, or any other status protected under applicable federal, state and local laws. If you have a disability or special need that requires accommodation, please let us know in advance.
DeepIntent’s commitment to providing equal employment opportunities extends to all aspects of employment, including job assignment, compensation, discipline and access to benefits and training.
Candidate MUST HAVE product-based company experience and a minimum of 3 years of experience in DevOps.
What you will do (or learn):
1. Build our application stack on AWS. Infrastructure as code (read Terraform)
2. Build state-of-the-art CI/CD pipelines.
3. Manage data warehouses and data pipelines.
4. Work on infrastructure and data security.
5. Build a state-of-the-art log management system and the tooling around it.
6. Build monitoring and alerting systems.
What do we expect from you?
1. 3 to 10 years of experience with DevOps or SRE principles.
2. Good fundamentals of database management and other distributed systems management.
3. Experience in infrastructure as code or other configuration management systems.
4. Experience in scripting languages (such as Bash, Python, Go, etc.)
5. Good understanding of Linux systems
6. Strong debugging and troubleshooting skills
7. Experience in tooling around monitoring, CI/CD, log management systems.
Nvizion Solutions is hiring for the position of Site Reliability Engineer.
If interested, kindly share your resume along with contact details.
Title: Site Reliability Engineer
No. of job openings: 2
Location: Gurgaon / Hyderabad / Bengaluru / Mumbai / Chennai (remote)
Remuneration: Best in the industry
· Experience required: 2 to 4 yrs in the industry
· Ensuring overall system reliability
· Add automation and alerting in the system
· Providing Troubleshooting support
· Cross team communications. Working closely with Product team and Customer success team.
· Proactive support to ensure the system returns to a healthy state
· R&D for new tools/technologies to support product and support team
· Good verbal/written communication to connect with the client.
· Good team player with a zeal to learn new technologies.
· The candidate will be part of the team responsible for 24x7 monitoring of a distributed global platform.
- Linux Scripting
- CI/CD knowledge (Jenkins / Bitbucket Pipelines / GitOps)
- Version Control
- Cloud platform knowledge (GCP/AWS/Azure/DigitalOcean)
- Docker, Kubernetes
Senior Cloud Engineer / Jr. Cloud Solutions Architect
Roles and Responsibilities
- Define, implement, deploy, and maintain development, QA, and production environments for cloud-based Azure architecture.
- Create a strategy for establishing a secure and well-managed enterprise environment in Azure.
- Define and implement security architecture for production, and ensure data security at all levels.
- Provision infrastructure as code using Azure CLI, PowerShell, ARM templates, and/or Terraform with Ansible or other tools.
- Develop scripts to automate the deployment of resource stacks and associated configurations.
- Extend MLP standard systems management processes into the cloud, including change, incident, and problem management.
- Establish and implement monitoring and management infrastructure for both availability and performance management.
- Implement observability patterns using Azure Monitor, Azure Application Insights, and Log Analytics Workspace.
- Provide internal training to the team.
Primary Skills/Requirements
- 5+ years of experience in IT and infrastructure
- 3+ years of experience in Azure design, support, and management for a large-scale organization
- Experience in the design and implementation of high-availability architecture
- Strong experience in Azure CLI, PowerShell, ARM templates, and Terraform
- Strong understanding of IT security and related audits
- Experience with deploying applications on Linux (Ubuntu)
- Should know Azure offerings (Storage, OS instances, Availability Zones, DR, load balancers, VPN tunnel, Application Gateway, etc.)
- Cloud monitoring experience with Azure Log Analytics and Azure Monitor
- Experience with log collection tools and analysis, as well as infrastructure performance monitoring tools and optimization practices
- Microsoft Azure certification (MCSE: Cloud Platform and Infrastructure) or an equivalent certification would be an added advantage
- Experience with PostgreSQL databases
Behavioural
- Positive work ethic
- Ability to adapt to a dynamic environment
- Time management
- Team player
- Communication skills
- Ability to work independently
Experience automating systems engineering tasks.
Experience in fast-paced and dynamic SRE or Production Support engineering teams
A proven track record of managing successful complex internet-based product platforms/architectures.
Experience building metrics and monitoring platforms and defining alerting strategies.
Strong analytical ability with a focus on making data driven decisions.
Capable of technical deep-dives, yet verbally and cognitively agile enough to hold their own in a strategy discussion with senior technical or executive leadership
Experience working in a managed services environment.
Good communication skills, both written and oral.
Solid understanding of Engineering, DevOps and cloud computing fundamentals.
Good understanding of cloud services including AWS.
Strong automation and CI / CD experience.
Solid experience with containerized applications/orchestration and serverless functions.
Experience with GitHub and CI/CD tools.
If I asked your previous team members about you, they would say you were a great leader and they would very much welcome an opportunity to work for you once again.
Experience in high SLA environments.
Computer Science, Engineering or Sciences degree required or equivalent work experience.
SRE - Tech Lead (DevOps):
Location: Permanent Work From Home Option
Notice: Candidates with a notice period of 30 days or less are preferred
SRE-DevOps- Tech Lead - JD:
Srijan is hiring for Site Reliability Engineering (SRE). We are looking for an SRE/DevOps Tech Lead or Sr. Tech Lead with strong automation skills and a good understanding of how to build and run secure, reliable platforms for cloud-native applications. The detailed job description follows:
Minimum Experience: 6+ years in DevOps/SRE
Permanent WFH option
Job Description:-
The focus of this role is to build scalable, resilient, secure infrastructure for cloud-native applications while automating every mundane task you can think of, building observability dashboards, setting up alerts, and so on, to give relevant stakeholders visibility. In a nutshell: “You are the keepers of production environments.” You must be a problem solver with the ability to multitask and strong collaboration and communication skills.
Key Responsibilities:-
- Proactively monitor and review application performance
- Handle on-call and emergency support
- Ensure software has good logging and diagnostics
- Create and maintain operational runbooks
- Contribute to solution design and evaluate technical debt
- Establish sound practices for well-defined architecture and to minimize toil
- Own SLI and SLO configuration in line with the error budget
- Maintain production services through measuring and monitoring availability, latency, and overall system health
- Practice sustainable incident response and blameless postmortems
- Not be afraid to contribute changes back to the software engineering team to improve the systems
- Manage the delivery pipeline into production
- Mentor junior members on a regular basis
- Troubleshoot issues with web applications
- Understand security principles and best practices
- Ensure that critical data is backed up
- Configure monitoring systems, including infrastructure monitoring and Application Performance Monitoring systems such as New Relic
- Ensure that web application infrastructure is built
- Act as Customer Technical Advocate and negotiate well with peers on technical fronts
- Be flexible enough to work in different shifts to meet urgent business requirements
- Handle multiple global clients on the technical front and generate the reports needed to represent the health of SRE delivery
Skills/Experience:-
- A key skill of an SRE Tech Lead is deep knowledge of the application, the code, and how it runs, is configured, and scales. That knowledge is what makes them so valuable at monitoring and supporting it as site reliability engineers.
- System administration, security, and networking
- The SRE Tech Lead is expected to have a good understanding of system administration (Linux or Windows) and networking.
- Essential commands
- User and group management
- Knowledge of networking concepts (DNS, TCP/IP, and firewalls)
- Service configuration
- Storage management
- Good grasp of fundamental security concepts
- Good understanding of infrastructure-as-code principles
- Knowledge of a scripting language such as Bash
- Ability to configure infrastructure using a configuration management technology such as Puppet, Chef, or Ansible
- Familiarity with Jenkins or any other CI/CD tool
- Proficiency in a high-level programming language such as Python or Go
- Understanding of container technologies such as Docker and Kubernetes
- 2+ years of hands-on experience with container orchestration technologies such as ECS, EKS, AKS, or Kubernetes would be beneficial
- Use of Terraform and other IaC tools to deploy cloud infrastructure
Cloud technologies:-
- Experience designing available, cost-efficient, fault-tolerant, and scalable distributed systems on AWS/Azure
- Hands-on experience using compute, networking, storage, and database AWS/Azure services
- 4+ years of hands-on experience with AWS/Azure deployment and management services
- Ability to identify and define technical requirements for an AWS/Azure-based application
- Ability to identify which AWS/Azure services meet a given technical requirement
- Knowledge of recommended best practices for building secure and reliable applications on the AWS/Azure platform
- An understanding of the AWS/Azure global infrastructure
- An understanding of network technologies as they relate to AWS/Azure
- An understanding of the security features and tools that AWS/Azure provides and how they relate to traditional services
About the Role
Dremio’s SREs ensure that our internal and externally visible services have reliability and uptime appropriate to users' needs and a fast rate of improvement. You will be joining a newly formed team that will spearhead our efforts to launch a cloud service. This is an opportunity to join a very fast-growing startup and help build a cloud service from the ground up.
Responsibilities and Ownership
- Ability to debug and optimize code and automate routine tasks.
- Evangelize and advocate for reliability practices across our organization.
- Collaborate with other Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, monitoring/alerting, capacity planning and launch reviews.
- Analyze and optimize our core product by developing and implementing reliability and performance practices.
- Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
- Be on-call for services that the SRE team owns.
- Practice sustainable incident response and blameless postmortems.
Qualifications
- 6+ years of relevant experience in the following areas: SRE, DevOps, Cloud Operations, Systems Engineering, or Software Engineering.
- Excellent command of cloud services on AWS/GCP/Azure, Kubernetes and CI/CD pipelines.
- You have moderate-to-advanced experience in Java, C, C++, Python, Go, or other object-oriented programming languages.
- You are interested in designing, analyzing, and troubleshooting large-scale distributed systems.
- You have a systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
- You have a great ability to debug and optimize code and automate routine tasks.
- You have a solid background in software development and architecting resilient and reliable applications.
- 5+ years of software development or site reliability engineering or equivalent experience
- Skilled at problem solving, algorithms, and data structures
- Building tools and scripting frameworks from scratch
- Working with Cloud Automation tools like CloudFormation, Terraform, CDK, aws-cli
- Scripting languages like Python, Groovy, PowerShell, Bash, Perl, etc.
- Configuration automation using Ansible or equivalent tools
- Exposure to Windows and Linux administration
- Project management tools like Jira, Trello
- Prior experience in dealing with Datastore technologies like Postgres, MySQL, SQL, DynamoDB is desirable
- Familiarity with basic networking, security and cloud engineering concepts
- Team player who is eager to help others to succeed through mentoring and leading by example
- Highly collaborative with effective written and verbal communication skills