Site Reliability Engineer
at a startup company providing AI-based software platforms
Who You Are
- Creative thinker and strong problem solver with meticulous attention to detail
- Highly organized, creative, motivated, and passionate about achieving results
- Able to balance multiple tasks and projects effectively and quickly adapt to new situations and technologies
- Able to work both independently and as part of a team
- Systematic problem-solver, coupled with a strong sense of ownership and drive
What you need
- 3-7 years of experience as a Site Reliability Engineer or in a combined software engineering and DevOps role.
- Strong hands-on knowledge of Linux fundamentals, system administration, scripting, performance tuning/scalability, and troubleshooting.
- Ability to write high-quality code following SOLID principles, with unit and integration tests.
- Hands-on development experience in an object-oriented programming language like Python.
- Hands-on experience developing task automation.
- Experience using tools to create and manage CI (continuous integration) and CD (continuous delivery) pipelines.
- Familiarity with software development tools: source code management (SCM systems), code review systems, issue tracking tools, build tools, test frameworks, code quality tools.
- Experience implementing open-source observability and alerting tools, like Prometheus, Grafana, Cortex, Thanos, Alertmanager, etc.
- Solid knowledge of networking (VPC, VNet, DNS, etc.) and of the TCP/IP stack, internet routing, and load balancing.
- Experience working with log and configuration management tools.
- Prior experience working with AWS, Azure, or GCP is a plus.
- Prior experience working with Kubernetes, Docker, and containers is a plus.
- Strong interpersonal communication skills (including listening, speaking, and writing) and ability to work well in a diverse, team-focused environment with other SREs, Engineers, Product Managers, etc.
- Documenting your work should be in your DNA
What you get
- A chance to develop and build something (probably from scratch) which you can be proud of
- Build and implement modern systems observability solutions including monitoring, alerting, metrics, logging, APM, and distributed tracing.
- Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
- Maintain business continuity by identifying and driving opportunities to make systems highly resilient and free of manual intervention.
- Work closely with the software engineering team to ensure accurate monitoring and metrics are built into applications before they go to production.
- Develop and maintain software modules for use and re-use in cloud and on-premise systems automation.
- Identify process gaps and implement process improvements to increase operational reliability
- Drive standardization efforts across the services, infrastructure, systems, and practices
- Develop systems and tools that help the development team uphold reliability principles
Similar jobs
(deployment, troubleshooting, maintenance,
Helm charts) and deployment and administration
of one or more of: ELK stack, Kafka, Prometheus
or Grafana, with working knowledge of at least
one cloud platform (GCP, AWS or Azure) and some
configuration management system (such as Salt
or Ansible). Good understanding of networking
concepts (architecture, components, protocols)
and solid understanding of OS concepts and
internals of Linux is a must.
Senior Site Reliability Engineer
at one of the largest equity broking houses in India
● Be on a PagerDuty rotation to respond to availability incidents and provide support
for service engineers.
● Run the production environment by monitoring availability and taking a holistic view
of system health
● Build and implement services that make IT and support teams better at their jobs.
● Improve reliability, quality, and time-to-market of our suite of software solutions
● Measure and optimize system performance, with an eye toward pushing our
capabilities forward, getting ahead of customer needs, and innovating to continually
improve
● Gather and analyze metrics from both operating systems and applications to assist in
performance tuning and fault finding
● Experience working in an agile development environment
● Participate in system design consulting, platform management, and capacity planning
● Balance feature development speed and reliability with well-defined service level
objectives
Required Skills and Qualifications:
● 3+ years of experience working within DevOps or SRE teams.
● 3+ years of experience with AWS Cloud
● Ability to program (structured and OO) with one or more high level languages, such
as Python, Go, Java, and JavaScript
● Must have experience with Ansible, Helm, Terraform and Kubernetes.
● Document every action so your findings turn into repeatable actions–and then into
automation.
● Hands-on experience with a distributed version control system such as Git, AWS
CodeCommit, or equivalent
● Know your way around Linux and the Unix Shell.
● Experience or familiarity with ELK stack
● Ability to use Azure DevOps
● Experience with distributed storage technologies like NFS, Ceph, S3 as well as
dynamic resource management frameworks (Mesos, Kubernetes)
● A proactive approach to spotting problems, areas for improvement, and performance
bottlenecks
Candidate MUST HAVE product-based company experience and a minimum of 3 years of experience in DevOps.
What you will do (or learn) :
1. Build our application stack on AWS. Infrastructure as code (read Terraform)
2. Build state-of-the-art CI/CD pipelines.
3. Manage data warehouses and data pipelines.
4. Work on infrastructure and data security.
5. Build a state-of-the-art log management system and the tooling around it.
6. Build monitoring and alerting systems.
What do we expect from you?
1. 3 to 10 years of experience with DevOps or SRE principles.
2. Good fundamentals of database management and other distributed systems management.
3. Experience in infrastructure as code or other configuration management systems.
4. Experience in scripting languages (like Bash, Python, Go, etc.)
5. Good understanding of Linux systems
6. Strong debugging and troubleshooting skills
7. Experience in tooling around monitoring, CI/CD, log management systems.
Sourcewiz was founded by a passionate team of serial entrepreneurs and alumni of IIT Delhi, UC Berkeley, and well-known tech companies such as Uber and Zomato.
Sourcewiz is on a mission to increase India’s export GDP. This is a unique opportunity to join a funded early-stage startup and have a massive impact on our product, culture, and direction. It's a lot of work and a roller coaster ride. But if you are up for it, you can join us in replacing the tiresome and slow sales process for importers and exporters and have a significant impact on our customers.
We are not a company that believes engineers should be hidden away from decisions, churning out code for features decided from on high. Instead, our engineers form strong bonds with cross-functional peers in Product Management, Product Design, and other functions to become experts in their product domain.
We’re looking for people who have a strong interest in building successful products or systems,
are comfortable dealing with lots of moving pieces, have exquisite attention to detail, and
are comfortable learning new technologies and systems.
As a Site Reliability Engineer at Sourcewiz, you will...
• Own and improve the scalability and reliability of our products
• Work directly with the product engineering team
• Work with RDBMS, search, caching, and queuing systems
• Contribute expertise towards architectural planning and ensure the company builds
sustainable services that meet our customer expectations while leveraging appropriate
tools and frameworks.
• Ongoing participation in the review and testing
Senior Engineer - Cloud Reliability
at Searce Inc
● 4-8 years experience in Cloud Infrastructure and Operations domains
● Experience with Linux systems and/or Windows servers
● Specialize in one or two cloud deployment platforms: AWS, GCP, Azure
● Hands-on experience with cloud services on AWS (EKS, ECS, EC2, VPC, RDS, Lambda) and GCP (GKE, Compute Engine)
● Experience with one or more programming languages (Python, JavaScript, Ruby, Java,
.NET)
● Good understanding of Apache Web Server, Nginx, MySQL, MongoDB, Nagios
● Logging and Monitoring tools (ELK, Stackdriver, CloudWatch)
● DevOps Technologies
● Knowledge of configuration management tools such as Ansible, Terraform, Puppet,
Chef
● Experience working with deployment and orchestration technologies (such as Docker,
Kubernetes, Mesos)
Roles and Responsibilities
- Managing Availability, Performance, Capacity of infrastructure and applications.
- Building and implementing observability for applications health/performance/capacity.
- Optimizing On-call rotations and processes.
- Documenting “tribal” knowledge.
- Managing infrastructure platforms like Mesos/Kubernetes, CI/CD, observability (Prometheus/New Relic/ELK), cloud platforms (AWS/Azure), databases, and data platform infrastructure.
- Helping onboard new services through the production readiness review process.
- Providing reports on services SLO/Error Budgets/Alerts and Operational Overhead.
- Working with Dev and Product teams to define SLO/Error Budgets/Alerts.
- Working with Dev team to have in depth understanding of the application architecture
and its bottlenecks.
- Identifying observability gaps in product services and infrastructure, and working with
stakeholders to fix them.
- Managing outages, doing detailed RCA with developers, and identifying ways to
prevent recurrence.
- Managing/Automating upgrades of the infrastructure services.
- Automate toil work.
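The SLO and error budget reporting mentioned in the responsibilities above can be illustrated with a small calculation. This is only a minimal sketch, assuming a request-based availability SLI; the posting does not say how SLOs are measured, and the function name `error_budget_report`, its parameters, and the example numbers are all hypothetical.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarize error-budget consumption for one reporting window.

    Assumption: availability is measured as the fraction of successful
    requests, and slo_target is a fraction such as 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("invalid request counts")

    # The budget is the number of failures the SLO tolerates in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    availability = 1.0 - failed_requests / total_requests
    budget_used = failed_requests / allowed_failures

    return {
        "availability": availability,
        "allowed_failures": allowed_failures,
        "budget_used": budget_used,             # 1.0 means fully spent
        "budget_remaining": 1.0 - budget_used,  # negative means overspent
        "slo_met": availability >= slo_target,
    }


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    report = error_budget_report(0.999, 1_000_000, 250)
    for key, value in report.items():
        print(f"{key}: {value}")
```

A report like this gives Dev and Product teams a shared number to discuss: when `budget_remaining` approaches zero, reliability work can be prioritized over new feature releases.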
Experience & Skills
- 6+ years of total experience
- Experience as an SRE/DevOps/Infrastructure Engineer on large scale microservices and infrastructure.
- A collaborative spirit with the ability to work across disciplines to influence, learn, and
deliver.
- A deep understanding of computer science, software development, and networking principles.
- Demonstrated experience with languages, such as Python, Java, Golang etc.
- Extensive experience with Linux administration and a good understanding of the various
Linux kernel subsystems (memory, storage, network, etc.).
- Extensive experience in DNS, TCP/IP, UDP, gRPC, routing, and load balancing.
- Expertise in GitOps, Infrastructure as Code tools such as Terraform, and configuration management tools such as Chef, Puppet, SaltStack, and Ansible.
- Expertise in Amazon Web Services (AWS) and/or other relevant cloud infrastructure
solutions like Microsoft Azure or Google Cloud.
- Experience in building CI/CD solutions with tools such as Jenkins, GitLab, Spinnaker,
Argo etc.
- Experience in managing and deploying containerized environments using Docker,
Mesos/Kubernetes is a plus.
The Role
The Data Lead is responsible for the data journey in a product, covering data security, data acquisition/retrieval, data massaging, etc.
How You Will Make an Impact:
Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions.
Ensuring that Innovapptive products are data-rich and data-efficient.
What You Bring to the Team:
A seasoned data engineer with a solid understanding of how data-rich SaaS products retrieve and consume data.
To be successful in this role, we believe that you need to possess the following attributes.
- Bachelor's degree in IT, Computer Engineering, Computer Science, or an equivalent field
- 7-12 years of relevant experience
- This position addresses cloud data operations and classical database developer needs.
- Cloud Data Operations: Hands-on experience with cloud data services on AWS (AWS RDS for MySQL and SQL Server) and knowledge of the latest cloud database services, such as Aurora Serverless.
- Hands-on experience in designing stable, reliable, and effective databases
- Provisioning cloud (AWS) DB services.
- Installing DB servers on AWS (IAAS model).
- Storage services (S3, EBS, EFS, etc.)
- Optimizing DB services.
- Performance tuning, DB service optimization.
- Building fault-tolerant cloud data services.
- Experience with NoSQL technologies (e.g., DocumentDB), including creating, maintaining, and consuming them on the cloud (AWS)
- Cloud Data security
- Hands-on experience with handling large data sets/transactions and operations.
- Exposure to data analytics and associated tools (Athena)
- Experience in handling data strategies and data life cycles in SaaS products.
- Exposure to cloud (AWS) networking.
- Query planning and optimization.
- SQL
- Knowledge of GDPR, physical/logical/conceptual data segregation in multi-tenant applications.
- Data Modeling
- Enforcing the appropriate security compliance in Customer environments as agreed with the client’s Information Security Council
- Excellent verbal and written communication skills
Site Reliability Engineer
We are looking for a Senior SRE with a proven track record of success leading complex cloud-hybrid environments. You will have:
- A strong sense of Being an Owner, Wearing the Customer Shoes, with the ability to Empower Others, demonstrated through clear communication and collaboration.
- Skills to work independently with multiple global teams, developing, configuring, deploying, and operating our global infrastructure on AWS and on-prem.
- Operational experience in complex distributed and real-time systems, including experience with SLOs/SLAs for high availability, reliability, and DR goals.
- DevOps experience in building tools and frameworks, with an understanding of continuous deployment processes.
- Ability to think at scale, bringing a focus on continuous delivery methodologies from design through deployment and operations.
- Experience building and managing systems with tools including Kubernetes, Chef/Ansible/Puppet, Kafka, Docker, and Terraform.
- 5+ years of experience in a Software and/or Site Reliability Engineering role
- Experience writing automation code in Go, Python, or Java
- Experience developing and operating large-scale distributed systems with Kubernetes and Docker
- Experience in running real-time, low-latency, highly available applications (Kafka, gRPC, RTP)
- Experience running public cloud environments on AWS
- Experience running hybrid clouds and on-prem infrastructures on Red Hat Enterprise Linux / CentOS
- Bachelor's degree in Engineering, Computer Science, or equivalent experience
- The ability to lead, partner, and collaborate cross-functionally across an engineering organization
● Research, propose and evaluate with a 5-year vision, the architecture, design, technologies,
processes and profiles related to Telco Cloud.
● Participate in the creation of a realistic technical-strategic roadmap of the network to transform
it to Telco Cloud and be prepared for 5G.
● Using your deep technical expertise, you will provide detailed feedback to Product Management
and Engineering, as well as contribute directly to the platform code base to enhance both the
Customer experience of the service, as well as the SRE quality of life.
● The individual must be aware of trends in network infrastructure as well as within the network
engineering and OSS community. What technologies are being developed or launched?
● The individual should stay current with infrastructure trends in the telco network cloud domain.
● Be responsible for the Engineering of Lab and Production Telco Cloud environments, including
patches, upgrades, and reliability and performance improvements.
Required Minimum Qualifications: (Education and Technical Skills/Knowledge)
● Software Engineering degree, MS in Computer Science or equivalent experience
● Years of experience in an SRE, DevOps, development, and/or support-related role
● 0-5 years of professional experience for a junior position
● At least 8 years of professional experience for a senior position
● Unix server administration and tuning : Linux / RedHat / CentOS / Ubuntu
● You have deep knowledge of networking layers 1-4
● Cloud / Virtualization (at least two): Helm, Docker, Kubernetes, AWS, Azure, Google Cloud,
OpenStack, OpenShift, VMware vSphere / Tanzu
● You have in-depth knowledge of cloud storage solutions on top of AWS, GCP, Azure and/or
on-prem private cloud, such as Ceph, CephFS, GlusterFS
● DevOps: Jenkins, Git, Azure DevOps, Ansible, Terraform
● Backend knowledge: Bash, Python, Go (knowledge of other scripting languages is a plus).
● PaaS-level solutions such as Keycloak for IAM, Prometheus, Grafana, ELK, and DBaaS (such as MySQL,
Cassandra)
About the Organisation:
The team at Coredge.io combines experienced and young professionals with
many years of experience in edge computing, telecom application development,
and Kubernetes. The company has continuously collaborated with the open source community,
universities, and major industry players in furthering its goal of providing the industry with an
indispensable tool to offer improved services to its customers. Coredge.io has a global market
presence, with offices in the US and New Delhi, India.
Site Reliability Engineer
at SteelEye, a fast-growing FinTech company based in London
• Develop and maintain IaC using Terraform and Ansible
• Draft design documents that translate requirements into code.
• Deal with challenges associated with scale.
• Assume responsibilities from technical design through technical client support.
• Manage expectations with internal stakeholders and context-switch in a fast-paced environment.
• Thrive in an environment that uses Elasticsearch extensively.
• Keep abreast of technology and contribute to the engineering strategy.
• Champion best development practices and provide mentorship.
What we’re looking for
• An AWS Certified Engineer with strong skills in
o Terraform
o Ansible
o *nix and shell scripting
• Preferably with experience in:
o Elasticsearch
o CircleCI
o CloudFormation
o Python
o Packer
o Docker
o Prometheus and Grafana
o Challenges of scale
o Production support
• Sharp analytical and problem-solving skills.
• Strong sense of ownership.
• Demonstrable desire to learn and grow.
• Excellent written and oral communication skills.
• Mature collaboration and mentoring abilities.