12+ Reliability engineering Jobs in India
Position: Windows SRE
Responsibilities:
- Windows Site Reliability Engineer with experience managing large websites that serve millions of customers
- Manage and monitor all installed systems and infrastructure
- Install, configure, test and maintain operating systems, application software and system management tools
- Among your responsibilities will be the installation and configuration of storage, servers, Microsoft servers (Cluster Services, File Services, Active Directory Services, Certificate Authority services), Virtual Infrastructure (Hyper-V), IIS, MS SQL and backup system
- Proactively ensure the highest levels of systems and infrastructure availability
- Monitor and test application performance for potential bottlenecks, identify possible solutions, and work with developers to implement those fixes
- Maintain security, backup, and redundancy strategies
- Write and maintain custom scripts to increase system efficiency and lower the human intervention time on any tasks
- Participate in the design of information and operational support systems
- Provide 2nd and 3rd level support
- Liaise and collaborate with vendors and Zacks personnel for problem resolution, decision making, knowledge sharing
Requirements:
- Minimum 5 years of Windows support experience (Windows 7, 8, 10, and all Microsoft Server versions)
- Windows server expertise
- Familiar with WAN/LAN technologies
- Understanding of the OSI model
- Virtualization: MS Hyper-V, VMware, vSAN
- Strong understanding of Internet protocols including HTTP(S), SSL, TCP, IP
- MS IIS administration and configuration
- MS Active Directory
- MS Storage Spaces
- DNS and DHCP
- SSL certificates and PKI
- Familiar with the ITIL framework
- Strong PowerShell experience
- Information Security experience a plus
Other Qualifications:
- Excellent attention to detail
Experience: 8-10 years
Job Summary:
We are seeking a Senior DevOps & SRE Engineer to join our team and help us build, deploy, and maintain our infrastructure and applications. The ideal candidate will have experience working in a fast-paced environment and a strong background in DevOps and Site Reliability Engineering (SRE). You will be responsible for ensuring the reliability, scalability, and security of our applications and infrastructure.
Responsibilities:
- Build and maintain our CI/CD pipeline and deployment automation tools
- Design and implement monitoring and alerting systems to ensure the health of our applications and infrastructure
- Work closely with development teams to ensure that code is deployed in a reliable and scalable manner
- Participate in on-call rotations to provide 24/7 support for our production systems
- Develop and maintain disaster recovery plans and processes
- Continuously improve our infrastructure and processes to ensure scalability, reliability, and security
- Mentor and provide technical leadership to junior team members
- Keep up-to-date with industry best practices and emerging technologies in DevOps and SRE
Requirements:
- Bachelor’s degree in Computer Science, Engineering, or a related field
- 5+ years of experience in DevOps or SRE
- Strong programming skills in at least one of the following languages: Python, Go, Ruby, or Java
- Experience with infrastructure as code tools such as Terraform or CloudFormation
- Experience with containerization technologies such as Docker and Kubernetes
- Strong understanding of networking concepts such as TCP/IP, DNS, and load balancing
- Experience with monitoring and logging tools such as Prometheus, Grafana, and ELK stack
- Excellent problem-solving skills and the ability to troubleshoot complex issues in a fast-paced environment
- Strong communication and collaboration skills with both technical and non-technical stakeholders
Preferred Qualifications:
- Experience with cloud providers such as AWS or Azure
- Experience with building and maintaining large-scale distributed systems
- Experience with database technologies such as MySQL, PostgreSQL, or MongoDB
- Experience with automation tools such as Ansible or Chef
- Experience with Agile development methodologies such as Scrum or Kanban
If you are passionate about DevOps and SRE and have the skills and experience we are looking for, we encourage you to apply for this exciting opportunity.
Company Description
Smarsh is the leader in communications compliance, archiving, and analytics. We provide compliance across the broadest set of communications channels with insights on what’s being captured. Smarsh customers manage over 500 million daily conversations across 80 channels and growing. Customers include the top 10 U.S., top 8 European, top 5 Canadian, and top 3 Asian banks. The Smarsh advantage is customers stay ahead of compliance and uncover patterns and relationships hidden within their data.
At Smarsh, we’ve been helping our customers manage new forms of communication since 1998. We work closely with regulators including the SEC, FINRA, IIROC, and the PRA and FCA, and with our customers, to ensure that they understand the capabilities of today’s technology and that our platform meets their most stringent requirements. Our products include Connected Capture, Connected Archive, Web Archive & Business Solutions.
About the team
Are you an SRE with excellent Observability, Containerization and Orchestration skills? As a Site Reliability Engineer (SRE) in the Smarsh SaaS Operations team, you'll be part of a team who measures and improves production performance reliability through sustainable engineering practices for our suite of applications. Toil will be your number one enemy, observability your closest friend and your mission will be to drive operational burden as close to zero as you can.
Responsibilities
- Responsible for technical direction at the platform solutions level. Is able to weigh the pros and cons of various solutions and credibly argue for the best path
- Work closely with Product Management and the rest of the engineering team to define features and their implementations with careful attention to quality, scalability, and maintainability
- Can break down complex technical solutions into abstractions that the rest of the team can understand
- Can investigate and solve complex bugs, performance, and scalability issues
- Collaborates with multiple agile teams to ensure their solutions integrate effectively
- Track work in ticketing system (JIRA)
- Participate in Pull Request reviews. Provide and receive feedback to continuously improve.
- Other duties as assigned.
Desired skills & experience
- A minimum of 10 years of industry experience
- Masters in CS or equivalent
- Must have experience in Azure or AWS, either running some large-scale app there or migrating to Azure/AWS.
- Experience operating Cloud Foundry in production environments
- Experience managing CI/CD systems (Concourse, Jenkins, TravisCI etc.)
- Experience deploying and/or operating ELK stack
- Experience with container technologies and orchestration platforms (Docker, Kubernetes, Cloud Foundry)
- Experience working with monitoring and observability tools (We use Datadog and New Relic)
- Familiarity with working with PostgreSQL and MongoDB
- Background working in a multi-platform environment (Linux, Windows)
- Experience with running on a cloud platform, AWS preferred (S3, RDS, SQS)
- Familiarity with Agile/Scrum/Kanban methodologies
- Familiarity with programming/scripting languages (e.g. Python, Bash, PowerShell, Go)
Additional Skills
- Expert programming skills in relevant languages
- Exceptional analytical and problem-solving skills
- Strong communication and collaboration skills
- Deep understanding of modern software architecture
- Deep domain knowledge of the industry, platform, and existing processes
- Fault-tolerant design & maintenance
- Knowledge and understanding of modern software programming/engineering.
- Product delivery lifecycle - requirement refinement through ops
Why Smarsh?
Ready to join a thriving tech company that’s redefining digital archiving and business intelligence?
Smarsh is the leading comprehensive archiving platform. Recognized as one of today’s fastest growing companies in the U.S., Smarsh delivers innovative cloud-based solutions that help organizations manage and enforce flexible and secure records retention and compliance strategies for electronic communications, including social media and enterprise social networks (Yammer, Chatter, Facebook, LinkedIn and more).
Our motto is ‘People First. Inspire Confidence. Embrace the Impossible.’ We hire lifelong learners who have a passion for their discipline and a track record of excellence. To learn more about us, visit www.smarsh.com/careers
Bangalore-based startup
Experience automating systems engineering tasks.
Experience in fast-paced and dynamic SRE or Production Support engineering teams
A proven track record of managing successful complex internet-based product platforms/architectures.
Experience building metrics and monitoring platforms and defining alerting strategies.
Strong analytical ability with a focus on making data driven decisions.
Capable of technical deep-dives, yet verbally and cognitively agile enough to hold their own in a strategy discussion with senior technical or executive leadership
Experience working in a managed services environment.
Good communication skills, both written and oral.
Solid understanding of Engineering, DevOps and cloud computing fundamentals.
Good understanding of cloud services including AWS.
Strong automation and CI / CD experience.
Solid experience with containerized applications/orchestration and serverless functions.
GitHub, CD/CI tools experience.
If I asked your previous team members about you, they would say you were a great leader and they would very much welcome an opportunity to work for you once again.
Experience in high SLA environments.
Computer Science, Engineering or Sciences degree required or equivalent work experience.
- We are looking for a Senior SRE with a proven track record of success leading complex cloud-hybrid environments. You will have:
- Strong sense of Being an Owner, Wearing the Customer Shoes, with the ability to Empower Others demonstrated through clear communication and collaboration.
- Skills to work independently with multiple global teams, developing, configuring, deploying, and operating our global infrastructure on AWS and on-prem.
- Operational experience in complex distributed and real-time systems, including experience with SLOs/SLAs towards high availability, reliability and DR goals.
- DevOps experience in building tools and frameworks, with an understanding of continuous deployment processes.
- Ability to think at scale, bringing a focus on continuous delivery methodologies from design through deployment and operations.
- Experience building and managing systems with tools including Kubernetes, Chef/Ansible/Puppet, Kafka, Docker, and Terraform.
- 5+ years experience in a Software and/or Site Reliability Engineering role
- Experience writing automation code in GoLang, Python or Java
- Experience developing and operating large scale distributed systems with Kubernetes and Docker
- Experience in running real time and low latency high available applications (Kafka, gRPC, RTP)
- Experience running public cloud environments on AWS
- Experience running hybrid clouds and on-prem infrastructures on Red Hat Enterprise Linux / CentOS
- Bachelor’s degree in Engineering, Computer Science or equivalent experience
- The ability to lead, partner, and collaborate cross functionally across an engineering organization
Global on-demand content marketplace platform
• Run the production environment by monitoring availability and taking a holistic view of system health
• Build software and systems to manage platform infrastructure and applications
• Improve reliability, quality, and time-to-market of our suite of software solutions
• Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
• Provide primary operational support and engineering for multiple large distributed software applications
• Drive cross-team alignment across development teams around reliability initiatives
The ideal candidate must have:
• Bachelor’s degree in computer science or other highly technical, scientific discipline
• Ability to program (structured and OO) with one or more high level languages, such as Python, Java, C/C++, Ruby, and JavaScript
• Good experience with microservices architecture and serverless technologies
• Exposure to event driven architecture and state machines
• A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
Building decentralized financial protocols and applications
Managing cloud-based serverless infrastructure on AWS, GCP (Firebase) with IaC (Terraform, CloudFormation, etc.)
Deploying and maintaining products, services, and network components with a focus on security, reliability, and zero downtime
Automating and streamlining existing processes to aid the development team
Working with the development team to create ephemeral environments, simplifying the development lifecycle
Driving forward our blockchain infrastructure by creating and managing validators for a wide variety of new and existing blockchains
Requirements:
1-3+ years in an SRE / DevOps / DevSecOps or Infrastructure Engineering role
Strong working knowledge of Amazon Web Services (AWS), GCP, or a similar cloud ecosystem
Experience working with declarative Infrastructure-as-Code frameworks (Terraform, CloudFormation)
Experience with containerization technologies and tools (Docker, Kubernetes), CI/CD pipelines and Linux/Unix administration
Bonus points if you know more about crypto, staking, DeFi, proof-of-stake, validators, delegations
Benefits:
Competitive CTC on par with market along with ESOPs/Tokens
My client is a leader in the Ed Tech space
Job Title: Sr. Reliability Engineer/ Engineering Manager
Location: Powai (Mumbai)/ Bengaluru/ Delhi NCR
- System maintenance and administration of Applications and Software
- Integration with different systems and apps
- Must have knowledge of microservices
- Should be proficient in Python scripting; knowledge of Java is good to have
- Excellent at Root Cause Analysis
- Should be a technical as well as process-oriented candidate
- Excellent Debugging and Monitoring skills
- Responsible for daily trouble ticket resolution, client interaction and customer support via email and video calls
- Responsible for the setup for the Continuous Integration build and deployment for DEV and UAT environments
- Responsible for operational support and problem resolution for application users and internal operations team
- Identifying, tracking, managing and resolving project issues effectively and efficiently
- Supporting projects during the normalization phase after Go-Live and ensuring the smooth transitions to operations
- Review and implement the processes related to IT Support and Operations before handing over to Support Services
- Strong ability to work independently on complex issues
- Collaborate efficiently with internal experts to resolve customer issues quickly
- Both proactive and reactive in work approach
- Should be willing to work in a 24/7 work environment
Ideal candidate has:
- 6-10 years
- Product Background
● Research, propose and evaluate, with a 5-year vision, the architecture, design, technologies, processes and profiles related to Telco Cloud.
● Participate in the creation of a realistic technical-strategic roadmap of the network to transform it to Telco Cloud and be prepared for 5G.
● Using your deep technical expertise, you will provide detailed feedback to Product Management and Engineering, as well as contribute directly to the platform code base to enhance both the customer experience of the service and the SRE quality of life.
● The individual must be aware of trends in network infrastructure as well as within the network engineering and OSS community, including which technologies are being developed or launched.
● The individual should stay current with infrastructure trends in the telco network cloud domain.
● Be responsible for the Engineering of Lab and Production Telco Cloud environments, including patches, upgrades, and reliability and performance improvements.
Required Minimum Qualifications: (Education and Technical Skills/Knowledge)
● Software Engineering degree, MS in Computer Science or equivalent experience
● Years of experience in an SRE, DevOps, Development and/or Support related role
● 0-5 years of professional experience for a junior position
● At least 8 years of professional experience for a senior position
● Unix server administration and tuning: Linux / RedHat / CentOS / Ubuntu
● You have deep knowledge in Networking Layers 1-4
● Cloud / Virtualization (at least two): Helm, Docker, Kubernetes, AWS, Azure, Google Cloud, OpenStack, OpenShift, VMware vSphere / Tanzu
● You have in-depth knowledge of cloud storage solutions on top of AWS, GCP, Azure and/or on-prem private cloud, such as Ceph, CephFS, GlusterFS
● DevOps: Jenkins, Git, Azure DevOps, Ansible, Terraform
● Backend knowledge: Bash, Python, Go (knowledge of other scripting languages is a plus).
● PaaS Level solutions such as Keycloak for IAM, Prometheus, Grafana, ELK, DBaaS (such as MySQL, Cassandra)
About the Organisation:
The team at Coredge.io is a combination of experienced and young professionals alike having many years of experience in working with Edge computing, Telecom application development and Kubernetes. The company has continuously collaborated with the open source community, universities and major industry players in furthering its goal of providing the industry with an indispensable tool to offer improved services to its customers. Coredge.io has a global market presence with its offices in the US and New Delhi, India.
A startup company providing AI based software platforms
Who You Are
- Creative thinker and strong problem solver with meticulous attention to detail
- Highly organized, creative, motivated, and passionate about achieving results
- Able to balance multiple tasks and projects effectively and quickly adapt to new situations and technologies
- Able to work both independently and as part of a team
- Systematic problem-solver, coupled with a strong sense of ownership and drive
What you need
- 3-7 years of experience as a Site Reliability Engineer or a mix of a software engineer and DevOps.
- Strong hands-on knowledge of Linux fundamentals, System administration scripting, performance tuning/scalability, troubleshooting.
- Write great quality code using SOLID principles including unit and integration tests.
- Hands-on development experience in an object-orientated programming language like Python.
- Hands-on experience developing task automations
- Experience using tools to create and manage CI (continuous integration) and CD (continuous delivery) pipelines.
- Familiarity with software development tools: source code management (SCM systems), code review systems, issue tracking tools, build tools, test frameworks, code quality tools.
- Experience implementing open-source observability and alerting tools, like Prometheus, Grafana, Cortex, Thanos, Alertmanager etc
- Decent knowledge of networking (VPC, VNet, DNS, etc.) and of the TCP/IP stack, internet routing and load balancing.
- Experience working with log and configuration management tools
- Prior experience of working with AWS, Azure, GCP is a plus
- Prior experience of working with Kubernetes, Docker and containers is a plus
- Strong interpersonal communication skills (including listening, speaking, and writing) and ability to work well in a diverse, team-focused environment with other SREs, Engineers, Product Managers, etc.
- Documenting your work should be in your DNA
What you get
- A chance to develop and build something (probably from scratch) which you can be proud of
- Build and Implement modern systems observability solutions including monitoring, alerting, metrics, logging, and APM & distributed tracing.
- Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
- Maintain business continuity by identifying and driving opportunities to make systems highly resilient and human-free.
- Closely work with the software engineering team to ensure accurate monitoring and metrics are being built into applications before going to production.
- Develop and maintain software modules for use and re-use in cloud and on-premise systems automation.
- Identify process gaps and implement process improvements to increase operational reliability
- Drive standardization efforts across the services, infrastructure, systems, and practices
- Develop systems and tools to help the development team uphold reliability principles
Technical Leader reporting to the CTO/CEO. Your responsibilities include the following, but are not limited to:
- Architecting, Designing and Developing Software Programmes based on requirements provided
- Designing and Developing with a high quality of code that is modular, scalable and re-usable at all times
- Promote SRE (Site Reliability Engineering) to ensure all of the services are Highly-Available and Fault Tolerant at all times
- Communicate effectively the system requirements to other software development teams
- Engage proactively with clients and their requirements
- Evaluate and select appropriate software or hardware and suggest integration methods
- Oversee assigned programs (e.g. conduct code review) and provide guidance to team members
- Assist with solving technical problems when they arise
- Ensure the implementation of agreed architecture and infrastructure
- Address technical concerns, ideas and suggestions
- Monitor systems to ensure they meet both user needs and business goals
Requirements
Technical Skills
- Ability to solution & deliver all of Operations/SRE services & processes including managing L2 Environment Support
- 5-12 years of overall environment support experience with 5+ years of experience as support / SRE engineer
- Experience in implementing monitoring solutions using APM tools (for example AppDynamics, Graylog, Dynatrace, Datadog), including setting up and testing proactive monitoring alerts
- Have a broad knowledge profile and really excel in some areas, such as HTTP/TLS, DNS, networking or containerization
- Comfortable with large scale production systems and technologies, for example load balancing, monitoring, distributed systems, microservices, and configuration management.
Process Skills
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
- Interest in designing, analyzing and troubleshooting large-scale distributed systems.
Behavioral Skills
- Practice sustainable incident response and blameless postmortems.
- Proven ability in developing relationships with stakeholders, communicating project/program status, and understanding detailed business requirements across multiple project initiatives
- This role requires candidates to work in rotational shifts, providing 24x7 support
Benefits
LOCATION: Mumbai
COMPENSATION: Competitive
WHY ZYCUS? :
- Be a part of one of the fastest growing product companies in India
- Come join a young, dynamic & enterprising team
- Work on the latest technologies
- Flexible working hours (As per business requirement).
Zycus Global Leader Procurement: https://www.zycus.com/newsroom/press-releases.html