Responsibilities
- Building and maintenance of resilient and scalable production infrastructure
Improvement of monitoring systems
- Improvement of monitoring systems
- Creation and support of development automation processes (CI / CD)
- Participation in infrastructure development
- Detection of problems in architecture and proposing of solutions for solving them
- Creation of tasks for system improvements for system scalability, performance and monitoring
- Analysis of product requirements in the aspect of devops
- Incident analysis and fixing
Skills and Experience
- Understanding of the distributed systems principles
- Understanding of principles for building a resistant network infrastructure
- Experience of Ubuntu Linux administration (Debian-like will be a plus)
- Strong knowledge of Bash
- Experience of working with LXC-containers
- Understanding and experience with infrastructure as a code approach
- Experience of development idempotent Ansible roles
- Experience of working with git
Preferred experience
Experience with relational databases (PostgeSQL), ability to create simple SQL queries
- Experience with monitoring and metric collect systems (Prometheus, Grafana, Zabbix)
- Understanding of dynamic routing (OSPF)
- Knowledge and experience of working with network equipment Cisco
- Experience of working with Cisco NX-OS
- Experience of working with IPsec, VXLAN, Open vSwitch
- Knowledge of principles of multicast protocols IGMP, PIM
- Experience of setting multicast on Cisco equipment
- Experience administering Atlassian products

Similar jobs
Must Have -
a. Background working with Startups
b. Good knowledge of Kubernetes & Docker
c. Background working in Azure
What you’ll be doing
- Ensure that our applications and environments are stable, scalable, secure and performing as expected.
- Proactively engage and work in alignment with cross-functional colleagues to understand their requirements, contributing to and providing suitable supporting solutions.
- Develop and introduce systems to aid and facilitate rapid growth including implementation of deployment policies, designing and implementing new procedures, configuration management and planning of patches and for capacity upgrades
- Observability: ensure suitable levels of monitoring and alerting are in place to keep engineers aware of issues.
- Establish runbooks and procedures to keep outages to a minimum. Jump in before users notice that things are off track, then automate it for the future.
- Automate everything so that nothing is ever done manually in production.
- Identify and mitigate reliability and security risks. Make sure we are prepared for peak times,
- DDoS attacks and fat fingers.
- Troubleshoot issues across the whole stack - software, applications and network.
- Manage individual project priorities, deadlines, and deliverables as part of a self-organizing team.
- Learn and unlearn every day by exchanging knowledge and new insights, conducting constructive code reviews, and participating in retrospectives.
Requirements
- 2+ years extensive experience of Linux server administration include patching, packaging (rpm), performance tuning, networking, user management, and security.
- 2+ years of implementing systems that are highly available, secure, scalable, and self-healingon Azure cloud platform
- Strong understanding of networking, especially in cloud environments along with a good understanding of CICD.
- Prior experience implementing industry standard security best practices, including those recommended by Azure
- Proficiency with Bash, and any high-level scripting language.
- Basic working knowledge of observability stacks like ELK, prometheus, grafana, Signoz etc
- Proficiency with Infrastructure as Code and Infrastructure Testing, preferably using Pulumi/Terraform.
- Hands-on experience in building and administering VMs and Containers using tools such as Docker/Kubernetes.
- Excellent communication skills, spoken as well as written, with a demonstrated ability to articulate technical problems and projects to all stakeholders.
Key Skills Required:
· You will be part of the DevOps engineering team, configuring project environments, troubleshooting integration issues in different systems also be involved in building new features for next generation of cloud recovery services and managed services.
· You will directly guide the technical strategy for our clients and build out a new capability within the company for DevOps to improve our business relevance for customers.
· You will be coordinating with Cloud and Data team for their requirements and verify the configurations required for each production server and come with Scalable solutions.
· You will be responsible to review infrastructure and configuration of micro services and packaging and deployment of application
To be the right fit, you'll need:
· Expert in Cloud Services like AWS.
· Experience in Terraform Scripting.
· Experience in container technology like Docker and orchestration like Kubernetes.
· Good knowledge of frameworks such as Jenkins, CI/CD pipeline, Bamboo Etc.
· Experience with various version control system like GIT, build tools (Mavan, ANT, Gradle ) and cloud automation tools (Chef, Puppet, Ansible)
DESIRED SKILLS AND EXPERIENCE
Strong analytical and problem-solving skills
Ability to work independently, learn quickly and be proactive
3-5 years overall and at least 1-2 years of hands-on experience in designing and managing DevOps Cloud infrastructure
Experience must include a combination of:
o Experience working with configuration management tools – Ansible, Chef, Puppet, SaltStack (expertise in at least one tool is a must)
o Ability to write and maintain code in at least one scripting language (Python preferred)
o Practical knowledge of shell scripting
o Cloud knowledge – AWS, VMware vSphere o Good understanding and familiarity with Linux
o Networking knowledge – Firewalls, VPNs, Load Balancers
o Web/Application servers, Nginx, JVM environments
o Virtualization and containers - Xen, KVM, Qemu, Docker, Kubernetes, etc.
o Familiarity with logging systems - Logstash, Elasticsearch, Kibana
o Git, Jenkins, Jira
The Key Responsibilities Include But Not Limited to:
Help identify and drive Speed, Performance, Scalability, and Reliability related optimization based on experience and learnings from the production incidents.
Work in an agile DevSecOps environment in creating, maintaining, monitoring, and automation of the overall solution-deployment.
Understand and explain the effect of product architecture decisions on systems.
Identify issues and/or opportunities for improvements that are common across multiple services/teams.
This role will require weekend deployments
Skills and Qualifications:
1. 3+ years of experience in a DevOps end-to-end development process with heavy focus on service monitoring and site reliability engineering work.
2. Advanced knowledge of programming/scripting languages (Bash, PERL, Python, Node.js).
3. Experience in Agile/SCRUM enterprise-scale software development including working with GiT, JIRA, Confluence, etc.
4. Advance experience with core microservice technology (RESTFul development).
5. Working knowledge of using Advance AI/ML tools are pluses.
6. Working knowledge in the one or more of the Cloud Services: Amazon AWS, Microsoft Azure
7. Bachelors or Master’s degree in Computer Science or equivalent related field experience
Key Behaviours / Attitudes:
Professional curiosity and a desire to a develop deep understanding of services and technologies.
Experience building & running systems to drive high availability, performance and operational improvements
Excellent written & oral communication skills; to ask pertinent questions, and to assess/aggregate/report the responses.
Ability to quickly grasp and analyze complex and rapidly changing systemsSoft skills
1. Self-motivated and self-managing.
2. Excellent communication / follow-up / time management skills.
3. Ability to fulfill role/duties independently within defined policies and procedures.
4. Ability to balance multi-task and multiple priorities while maintaining a high level of customer satisfaction is key.
5. Be able to work in an interrupt-driven environment.Work with Dori Ai world class technology to develop, implement, and support Dori's global infrastructure.
As a member of the IT organization, assist with the analyze of existing complex programs and formulate logic for new complex internal systems. Prepare flowcharting, perform coding, and test/debug programs. Develop conversion and system implementation plans. Recommend changes to development, maintenance, and system standards.
Leading contributor individually and as a team member, providing direction and mentoring to others. Work is non-routine and very complex, involving the application of advanced technical/business skills in a specialized area. BS or equivalent experience in programming on enterprise or department servers or systems.
Now, more than ever, the Toast team is committed to our customers. We’re taking steps to help restaurants navigate these unprecedented times with technology, resources, and community. Our focus is on building a restaurant platform that helps restaurants adapt, take control, and get back to what they do best: building the businesses they love. And because our technology is purpose-built for restaurants by restaurant people, restaurants can trust that we’ll deliver on their needs for today while investing in experiences that will power their restaurant of the future.
At Toast, our Site Reliability Engineers (SREs) are responsible for keeping all customer-facing services and other Toast production systems running smoothly. SREs are a blend of pragmatic operators and software craftspeople who apply sound software engineering principles, operational discipline, and mature automation to our environments and our codebase. Our decisions are based on instrumentation and continuous observability, as well as predictions and capacity planning.
About this roll* (Responsibilities)
- Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
- Partner with development teams to improve services through rigorous testing and release procedures
- Participate in system design consulting, platform management, and capacity planning
- Create sustainable systems and services through automation and uplift
- Balance feature development speed and reliability with well-defined service level objectives
Troubleshooting and Supporting Escalations:
- Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
- Diagnose performance bottlenecks and implement optimizations across infrastructure, databases, web, and mobile applications
- Implement strategies to increase system reliability and performance through on-call rotation and process optimization
- Perform and run blameless RCAs on incidents and outages aggressively, looking for answers that will prevent the incident from ever happening again
Do you have the right ingredients? (Requirements)
- Extensive industry experience with at least 7+ years in SRE and/or DevOps roles
- Polyglot technologist/generalist with a thirst for learning
- Deep understanding of cloud and microservice architecture and the JVM
- Experience with tools such as APM, Terraform, Ansible, GitHub, Jenkins, and Docker
- Experience developing software or software projects in at least four languages, ideally including two of Go, Python, and Java
- Experience with cloud computing technologies ( AWS cloud provider preferred)
Bread puns are encouraged but not required
- Preferred experience in development associated with Kafka or big data technologies understand essential Kafka components like Zookeeper, Brokers, and optimization of Kafka clients applications (Producers & Consumers). -
Experience with Automation of Infrastructure, Testing , DB Deployment Automation, Logging/Monitoring/alerting
- AWS services experience on CloudFormation, ECS, Elastic Container Registry, Pipelines, Cloudwatch, Glue, and other related services.
- AWS Elastic Kubernetes Services (EKS) - Kubernetes and containers managing and auto-scaling -
Good knowledge and hands-on experiences with various AWS services like EC2, RDS, EKS, S3, Lambda, API, Cloudwatch, etc.
- Good and quick with log analysis to perform Root Cause Analysis (RCA) on production deployments and container errors on cloud watch.
Working on ways to automate and improve deployment and release processes.
- High understanding of the Serverless architecture concept. - Good with Deployment automation tools and Investigating to resolve technical issues.
technical issues. - Sound knowledge of APIs, databases, and container-based ETL jobs.
- Planning out projects and being involved in project management decisions. Soft Skills
- Adaptability
- Collaboration with different teams
- Good communication skills
- Team player attitude
Hands on Experience with Linux administration
Experience using Python or Shell scripting (for Automation)
Hands-on experience with Implementation of CI/CD Processes
Experience working with one cloud platforms (AWS or Azure or Google)
Experience working with configuration management tools such as Ansible & Chef
Experience working with Containerization tool Docker.
Experience working with Container Orchestration tool Kubernetes.
Experience in source Control Management including SVN and/or Bitbucket
& GitHub
Experience with setup & management of monitoring tools like Nagios, Sensu & Prometheus or any other popular tools
Hands-on experience in Linux, Scripting Language & AWS is mandatory
Troubleshoot and Triage development, Production issues
We are hiring candidates who are looking to work in a cloud environment and ready to learn and adapt to the evolving technologies.
Linux Administrator Roles & Responsibilities:
- 5+ or more years of professional experience with strong working expertise in Agile environments
- Deep knowledge in managing Linux servers.
- Managing Windows servers(Not Mandatory).
- Manage Web servers (Apache, Nginx).
- Manage Application servers.
- Strong background & experience in any one scripting language (Bash, Python)
- Manage firewall rules.
- Perform root cause analysis for production errors.
- Basic administration of MySQL, MSSQL.
- Ready to learn and adapt to business requirements.
- Manage information security controls with best practises and processes.
- Support business requirements beyond working hours.
- Ensuring highest uptimes of the services.
- Monitoring resource usages.
Skills/Requirements
- Bachelor’s Degree or Diploma in Computer Science, Engineering, Software Engineering or a relevant field.
- Experience with Linux-based infrastructures, Linux/Unix administration.
- Knowledge in managing databases such as My SQL, MS SQL.
- Knowledge of scripting languages such as Python, Bash.
- Knowledge in open-source technologies and cloud services like AWS, Azure is a plus. Candidates willing to learn will be preferred.
- Experience in managing web applications.
- Problem-solving attitude.
- 5+ years experience in the IT industry.
- Should have strong experience in working Configuration Management area, with tools preferably, TFSVC, TFS vNext OR GIT, SVN, Jenkins
- Strong working experience in tools related to Application Lifecycle Management, like Microsoft TFS
- Should have hands-on working experience in CICD (Continuous Integration Continuous Deployment) practices
- Strong be expertise in handling software builds & release mgmt. activities in Dotnet based environment / Java environment
- Strong skills in Perl or PowerShell or in any other scripting/automation/programming language
- Should have exposure to various build environments like dotNet, Java
- Should have experience with writing build scripts, automation of daily/nightly builds & deployment
- Good knowledge in Merging /Branching concepts.
- Good understanding of product life cycle management"
- Shall be very good technically; possess systems mindset and good problem-solving abilities
- Working with multisite teams, Quality conscious and Process & customer Oriented
- Self-starter and quick learner and ability to work with minimal supervision
- Can play a key role in the team
- Strong team player with a “can-do” attitude
- Ability to handle conflicts
- Ability to stay focused on the target in an ambiguous situation
- Good communication and documentation skills"








