Job Description
- Monitoring Olam's entire infrastructure using monitoring tools such as SCOM, SolarWinds, Telegraph, and OEM.
- Monitoring various types of alerts, including:
- CPU Utilization
- Memory Utilization
- Database related alerts
- DR Replication issues
- Backup Failure Alerts
- Exchange Mail Queue Threshold Alerts
- Service Mailbox quota breach alert
- Adobe Experience Manager / Site 24/7 Alerts
- Application URL Alerting
- Scheduling maintenance mode for planned activities.
- Daily repeat-CI analysis of events, alerts, and incidents, and raising proactive problem tickets, which helps reduce major incidents.
- Handling major incidents, driving the major incident bridge, and sending communications about major incidents to stakeholders.
- CMDB inventory management: onboarding and offboarding devices as they are commissioned or decommissioned.
- Coordinating with the service provider for MPLS-related outages.
- Daily follow-ups with regional and internal teams to ensure all nodes are up and running.
Similar jobs
SRE - Tech Lead (DevOps):
Location: Permanent Work From Home Option
Notice: Candidates with a notice period of 30 days or less are preferred
SRE-DevOps- Tech Lead - JD:
Srijan is hiring for Site Reliability Engineering (SRE). We are looking for an SRE/DevOps Tech Lead or Sr. Tech Lead with strong automation skills and a good understanding of how to build and run secure, reliable platforms for cloud-native applications. The detailed job description follows for reference:
Minimum Experience: 6+ years in DevOps/SRE
Permanent WFH option
Job Description:-
The focus of this role is to build scalable, resilient, secure infrastructure for cloud-native applications while automating every mundane task you can think of, building observability dashboards, and setting up alerts to give relevant stakeholders visibility. In a nutshell: “You are the keepers of production environments”. You must be a problem solver who can multitask and who brings strong collaboration and communication skills.
Key Responsibilities:-
- Proactively monitor and review application performance
- Handle on-call and emergency support
- Ensure software has good logging and diagnostics
- Create and maintain operational runbooks
- Contribute to solution design and evaluate technical debt
- Set the right practices for a well-defined architecture and to minimize toil
- Own SLI and SLO configuration in line with the error budget
- Maintain production services by measuring and monitoring availability, latency, and overall system health
- Practice sustainable incident response and blameless postmortems
- Not be afraid to contribute changes back to the software engineering team to improve the systems
- Manage the delivery pipeline into production
- Mentor junior members on a regular basis
- Troubleshoot issues with web applications
- Understand security principles and best practices
- Ensure that critical data is backed up
- Configure monitoring systems, including infrastructure monitoring and Application Performance Monitoring systems such as New Relic
- Ensure that web application infrastructure is built
- Act as Customer Technical Advocate and negotiate well with peers on technical fronts
- Be flexible enough to work different shifts to meet business requirements
- Handle multiple global clients on the technical front and generate the reports needed to represent the health of SRE delivery
Skills/Experience:-
- A key skill of an SRE Tech Lead is deep knowledge of the application, the code, and how it runs, is configured, and scales. That knowledge is what makes them so valuable in monitoring and supporting it as site reliability engineers.
- System administration, security, and networking
- The SRE Tech Lead is expected to have a good understanding of system administration (Linux or Windows) and networking.
- Essential commands
- User and group management
- Knowledge of networking concepts (DNS, TCP/IP, and firewalls)
- Service configuration
- Storage management
- Good grasp of fundamental security concepts
- Good understanding of infrastructure as code principles
- Knowledge of a scripting language such as Bash
- Ability to configure infrastructure using a configuration management technology such as Puppet, Chef, or Ansible
- Familiarity with Jenkins or any other CI/CD tool
- Proficiency in a high-level programming language such as Python or Go
- Understanding of container technologies such as Docker and Kubernetes
- 2+ years of hands-on experience with container orchestration technologies such as ECS, EKS, AKS, or Kubernetes would be beneficial
- Use Terraform and other IaC tools to deploy cloud infrastructure
Cloud technologies:-
- Experience designing available, cost-efficient, fault-tolerant, and scalable distributed systems on AWS/Azure
- Hands-on experience using compute, networking, storage, and database AWS/Azure services
- 4+ years of hands-on experience with AWS/Azure deployment and management services
- Ability to identify and define technical requirements for an AWS/Azure-based application
- Ability to identify which AWS/Azure services meet a given technical requirement
- Knowledge of recommended best practices for building secure and reliable applications on the AWS/Azure platform
- An understanding of the AWS/Azure global infrastructure
- An understanding of network technologies as they relate to AWS/Azure
- An understanding of security features and tools that AWS/Azure provides and how they relate to traditional services
Requirements
Technical Skills
- Ability to design and deliver all Operations/SRE services and processes, including managing L2 environment support
- 5-12 years of overall environment support experience, with 5+ years as a support/SRE engineer
- Experience implementing monitoring solutions using APM tools (for example, AppDynamics, Graylog, Dynatrace, Datadog), including setting up and testing proactive monitoring alerts
- Have a broad knowledge profile and really excel in some areas, such as HTTP/TLS, DNS, networking or containerization
- Comfortable with large scale production systems and technologies, for example load balancing, monitoring, distributed systems, microservices, and configuration management.
Process Skills
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
- Interest in designing, analyzing and troubleshooting large-scale distributed systems.
Behavioral Skills
- Practice sustainable incident response and blameless postmortems.
- Proven ability in developing relationships with stakeholders, communicating project/program status, and understanding detailed business requirements across multiple project initiatives
- This role requires candidates to work in rotational shifts to provide 24x7 support
Benefits
LOCATION: Mumbai
COMPENSATION: Competitive
WHY ZYCUS? :
- Be a part of one of the fastest-growing product companies in India
- Come join a young, dynamic & enterprising team
- Work on the latest technologies
- Flexible working hours (as per business requirements)
Zycus Global Leader Procurement: https://www.zycus.com/newsroom/press-releases.html