About the team
The Site Reliability Engineering team at Media.Net is responsible for the scaling, performance, monitoring, security, and availability of the production environment. The focus is to architect, develop, automate, and deploy products and infrastructure based on Linux and Linux application stacks. Our environment consists of our own bare-metal and private cloud infrastructure across a co-located datacenter facility and the AWS public cloud. Our engineering teams follow DevOps practices, and we rely heavily on open source tools such as Jenkins, Selenium, Git, Puppet, Docker, Kubernetes, OpenStack, Nagios/Icinga, Kafka, Graphite, Hadoop, ELK, and Vault. SRE teams work mainly in Python and Go.

What is the job like?
● Engage with product and engineering teams to design, build, and maintain systems and software for high availability, and proactively drive operational best practices
● Identify and drive opportunities to build resilient systems that help maintain business continuity
● Proactively troubleshoot, perform RCA, and implement permanent resolutions for issues across the stack: hardware, software, database, network, and so on
● Implement proactive monitoring, alerting, trend analysis, and self-healing systems
● Develop continuous delivery for multiple platforms in production and staging environments
● Find areas of existing manual intervention and replace them with automation wherever possible
● Demonstrate the ability to design, implement, and manage highly available, scalable, and reliable systems
● Infrastructure and platform security
● Effectively use and maintain infrastructure and configuration management tools such as Puppet, Chef, Ansible, and Terraform to deploy and manage infrastructure
● Provide technical mentoring and coaching to team members
● Adapt to a fast-paced environment and shift priorities as business needs change

Job Description:
● Develop and deliver the automation software required to build and improve the functionality, reliability, availability, and manageability of applications and cloud platforms
● Champion and drive the adoption of Infrastructure as Code (IaC) practices and mindset
● Design, architect, and build self-service, self-healing, synthetic monitoring, and alerting platforms and tools
● Automate development and testing processes through a CI/CD pipeline (Git, Jenkins, SonarQube, Artifactory, Docker containers)
● Build a container hosting platform using Kubernetes
● Introduce new cloud technologies, tools, and processes to keep innovating in the commerce area and drive greater business value

Skills Required:
● Excellent written and verbal communication skills; a good listener
● Proficiency in deploying and maintaining cloud-based infrastructure services (AWS, GCP, Azure; good hands-on experience in at least one of them)
● Well versed in service-oriented architecture, cloud-based web services architecture, design patterns, and frameworks
● Good knowledge of cloud services such as compute, storage, network, messaging (e.g. SNS, SQS), and automation (e.g. CFT/Terraform)
● Experience with relational SQL and NoSQL databases, including Postgres and Cassandra
● Experience with systems management and automation tools (Puppet/Chef/Ansible, Terraform)
● Strong Linux system administration experience with excellent troubleshooting and problem-solving skills
● Hands-on experience with programming languages (Bash/Python/Core Java/Scala)
● Experience with CI/CD pipelines (Jenkins, Git, Maven, etc.)
● Experience integrating solutions in a multi-region environment
● Self-motivated, able to learn quickly and deliver results with minimal supervision
● Experience with Agile/Scrum/DevOps software development methodologies

Nice to Have:
● Experience setting up the Elasticsearch, Logstash, Kibana (ELK) stack
● Experience working with large-scale data
● Experience with monitoring tools such as Splunk, Nagios, Grafana, Datadog, etc.
● Previous experience working with distributed architectures such as Hadoop, MapReduce, etc.

Important to have:
1. Linux experience
2. Nginx or any web tier
3. Any scripting or automation tooling, such as Terraform or Ansible
4. Jenkins pipelines
5. CI/CD
6. Containers and Docker