We are looking for a highly skilled Site Reliability Engineer to manage and scale our on-premises payments infrastructure. You will work in an on-premises environment spanning virtual machines and containerized workloads on bare metal, ensuring high availability, security, and performance for mission-critical systems.
Key Responsibilities
● Operate and optimize virtualized environments (VMs) and containerized workloads (Docker on bare metal)
● Manage and scale middleware systems such as:
o Nginx (traffic routing, reverse proxy, load balancing)
o Redis (caching, HA setup)
o Kafka (streaming, partitioning, fault tolerance)
● Build and maintain CI/CD pipelines using Jenkins
● Manage infrastructure and application configurations using Git-based version control
● Ensure high availability, resilience, and performance tuning across systems
● Perform Linux system administration (RHEL/CentOS/Ubuntu)
● Implement and maintain automation frameworks using:
o Ansible
o Shell scripting
● Manage and troubleshoot networking components:
o TCP/IP, DNS, load balancing
o Firewalls, WAF policies
o Akamai (CDN and edge security)
● Handle security and compliance requirements
● Maintain accurate inventory and asset management systems
● Participate in incident response, RCA, and system reliability improvements
● Collaborate with application, security, and DevOps teams
Required Skills & Qualifications
Core Infrastructure
● Strong hands-on experience with Linux system administration
● Experience managing on-prem data center environments
● Solid understanding of:
o Virtualization (VMware / KVM or similar)
o Bare metal provisioning
Containers & Middleware
● Experience running Docker in production (non-Kubernetes setups preferred)
● Strong operational knowledge of:
o Nginx
o Redis
o Kafka
o RDBMS
o Java
Observability, Alerting & Reliability
● Design and manage observability platforms:
o Elastic Stack (ELK)
o Grafana / Prometheus stack
● Build and maintain:
o Metrics, logs, and tracing pipelines
o Dashboards for system health and business KPIs
● Develop intelligent alerting strategies:
o Reduce noise (alert fatigue)
o Improve signal quality
● Build correlation mechanisms / alert aggregation systems to:
o Reduce MTTD (Mean Time to Detect)
o Reduce MTTR (Mean Time to Recover)
● Drive proactive monitoring and anomaly detection
● Lead incident response, debugging, and RCA with data-driven insights
CI/CD & Version Control
● Hands-on experience with:
o Git (branching strategies, code reviews, infra-as-code workflows)
o Jenkins (pipeline creation, build automation, deployment orchestration)
Networking & Security
● Good understanding of:
o Networking fundamentals (L3/L4 concepts)
o Firewalls and WAF (rule tuning, debugging)
● Experience handling secure production environments
Automation
● Hands-on experience with:
o Ansible and Shell scripting (bash)
Operations
● Experience with:
o Monitoring, alerting, and logging systems
o Incident management & RCA
o Capacity planning
Preferred Qualifications (Good to Have)
● Experience in UPI / Payments domain
● Understanding of:
o High TPS systems
o Low latency architecture
● Exposure to:
o Ceph / SAN / storage systems
o HA/DR design patterns
● Knowledge of observability stacks (Prometheus, ELK, etc.)
● Experience working in regulated environments (PCI-DSS, RBI guidelines)