- 5+ years of software development or site reliability engineering or equivalent experience
- Skilled at problem solving, algorithms, and data structures
- Building tools and scripting frameworks from scratch
- Working with Cloud Automation tools like CloudFormation, Terraform, CDK, aws-cli
- Scripting languages like Python, Groovy, PowerShell, Bash, Perl etc.
- Configuration automation using Ansible or equivalent tools
- Exposure to Windows and Linux administration
- Project management tools like Jira, Trello
- Prior experience in dealing with Datastore technologies like Postgres, MySQL, SQL, DynamoDB is desirable
- Familiarity with basic networking, security and cloud engineering concepts
- Team player who is eager to help others to succeed through mentoring and leading by example
- Highly collaborative with effective written and verbal communication skills
Similar jobs
Job Description
As a Site Reliability Engineer, you will be involved in exciting technical challenges by analyzing, troubleshooting, and designing vital services, platforms, and infrastructure while always thinking about reliability, scalability, resilience, security, and performance.
Requirements
- 1+ years of strong experience deploying AWS cloud infrastructure.
- 1-5 years of working experience in infrastructure support and CICD platform, leveraging DevOps, SRE & Agile methodologies.
- Hands-on experience in provisioning Infrastructure as Code (IaC) using Terraform Enterprise or community edition.
- Experience in cloud environments (AWS/GCP/Azure), container technology (Docker, Kubernetes, Docker Swarm, Helm), DevOps tooling (Git + CI/CD pipelines), and Jenkins.
- Valid AWS Solutions Architect Professional certification.
- At least 1 year of experience in programming/scripting in Python/Ruby/Bash.
- Experience in monitoring and analyzing infrastructure performance using standard performance monitoring tools - Grafana/Prometheus, DataDog, Nagios, New Relic.
Responsibilities
- Owning Infra architecture and non-functional requirements, ensuring they fit into a cohesive vision aligned with the rest of the Technology roadmap of the platform for the launch
- Propagate DevOps culture across the organization by sharing industry best practices, standards, approaches, documentation, and code with other engineering teams
- Design, test and troubleshoot the CICD pipeline for containerized applications from build until deployment
- Apply automation and software to any manual or mechanical tasks and to any parts of the system that would benefit from it.
- Troubleshoot complicated, cross-platform issues spanning OS, networking, and databases in a cloud-based SaaS environment; handle live production incidents; debug application and infrastructure issues; and follow and implement SRE best practices.
Benefits
- Work-Life Balance
- Learning & Development
- Sabbatical Leave
- Parental Leaves
- Office Perks (Free Meal, Snacks)
With a core belief that advertising technology can measurably improve the lives of patients, DeepIntent is leading the healthcare advertising industry into the future. Built purposefully for the healthcare industry, the DeepIntent Healthcare Advertising Platform is proven to drive higher audience quality and script performance with patented technology and the industry’s most comprehensive health data. DeepIntent is trusted by 600+ pharmaceutical brands and all the leading healthcare agencies to reach the most relevant healthcare provider and patient audiences across all channels and devices. For more information, visit DeepIntent.com or find us on LinkedIn.
We are seeking a skilled and experienced Site Reliability Engineer (SRE) to join our dynamic team. The ideal candidate will have a minimum of 3 years of hands-on experience in managing and maintaining production systems, with a focus on reliability, scalability, and performance. As an SRE at DeepIntent, you will play a crucial role in ensuring the stability and efficiency of our infrastructure, as well as contributing to the development of automation and monitoring tools.
Responsibilities:
- Deploy, configure, and maintain Kubernetes clusters for our microservices architecture.
- Utilize Git and Helm for version control and deployment management.
- Implement and manage monitoring solutions using Prometheus and Grafana.
- Work on continuous integration and continuous deployment (CI/CD) pipelines.
- Containerize applications using Docker and manage orchestration.
- Manage and optimize AWS services, including but not limited to EC2, S3, RDS, and AWS CDN.
- Maintain and optimize MySQL databases, Airflow, and Redis instances.
- Write automation scripts in Bash or Python for system administration tasks.
- Perform Linux administration tasks and troubleshoot system issues.
- Utilize Ansible and Terraform for configuration management and infrastructure as code.
- Demonstrate knowledge of networking and load-balancing principles.
- Collaborate with development teams to ensure applications meet reliability and performance standards.
Additional Skills (Good to Know):
- Familiarity with ClickHouse and Druid for data storage and analytics.
- Experience with Jenkins for continuous integration.
- Basic understanding of Google Cloud Platform (GCP) and data center operations.
Qualifications:
- Minimum 3 years of experience in a Site Reliability Engineer role or similar.
- Proven experience with Kubernetes, Git, Helm, Prometheus, Grafana, CI/CD, Docker, and microservices architecture.
- Strong knowledge of AWS services, MySQL, Airflow, Redis, AWS CDN.
- Proficient in scripting languages such as Bash or Python.
- Hands-on experience with Linux administration.
- Familiarity with Ansible and Terraform for infrastructure management.
- Understanding of networking principles and load balancing.
Education:
Bachelor's degree in Computer Science, Information Technology, or a related field.
DeepIntent is committed to bringing together individuals from different backgrounds and perspectives. We strive to create an inclusive environment where everyone can thrive, feel a sense of belonging, and do great work together.
DeepIntent is an Equal Opportunity Employer, providing equal employment and advancement opportunities to all individuals. We recruit, hire and promote into all job levels the most qualified applicants without regard to race, color, creed, national origin, religion, sex (including pregnancy, childbirth and related medical conditions), parental status, age, disability, genetic information, citizenship status, veteran status, gender identity or expression, transgender status, sexual orientation, marital, family or partnership status, political affiliation or activities, military service, immigration status, or any other status protected under applicable federal, state and local laws. If you have a disability or special need that requires accommodation, please let us know in advance.
DeepIntent’s commitment to providing equal employment opportunities extends to all aspects of employment, including job assignment, compensation, discipline and access to benefits and training.
Candidate MUST HAVE product-based company experience and a minimum of 3 years of experience in DevOps.
What you will do (or learn) :
1. Build our application stack on AWS. Infrastructure as code (read Terraform)
2. Build state-of-the-art CI/CD pipelines.
3. Manage data warehouses and data pipelines.
4. Work on infrastructure and data security.
5. Build a state-of-the-art log management system and the tooling around it.
6. Build monitoring and alerting systems.
What do we expect from you?
1. 3 to 10 years of experience with DevOps or SRE principles.
2. Good fundamentals of database management and other distributed systems management.
3. Experience in infrastructure as code or other configuration management systems.
4. Experience in scripting languages (like Bash, Python, Go, etc.)
5. Good understanding of Linux systems
6. Strong debugging and troubleshooting skills
7. Experience in tooling around monitoring, CI/CD, log management systems.
SRE - Tech Lead (DevOps):
Location: Permanent Work From Home Option
Notice: Candidates with a notice period of 30 days or less are preferred
SRE-DevOps- Tech Lead - JD:
Srijan is hiring for Site Reliability Engineering (SRE). We are looking for an SRE/DevOps Tech Lead or Sr. Tech Lead with strong automation skills and a good understanding of how to build and run secure, reliable platforms for cloud-native applications. The detailed job description follows:
Minimum Experience: 6+ years in DevOps/SRE
Permanent WFH option
Job Description:-
The focus of this role is to build scalable, resilient, and secure infrastructure for cloud-native applications while automating every mundane task you can think of, building observability dashboards, setting up alerts, and so on, to give relevant stakeholders visibility. In a nutshell: “You are the keepers of Production environments”. You must be a problem solver who can multitask and who brings strong collaboration and communication skills.
Key Responsibilities:-
- Proactively monitor and review application performance
- Handle on-call and emergency support
- Ensure software has good logging and diagnostics
- Create and maintain operational runbooks
- Contribute to solution design and the evaluation of technical debt
- Set the right practices for a well-defined architecture and to minimize toil
- Own SLI and SLO configuration as per the error budget
- Maintain production services by measuring and monitoring availability, latency, and overall system health
- Practice sustainable incident response and blameless postmortems
- Not be afraid to contribute changes back to the software engineering team to improve the systems
- Manage the delivery pipeline into production
- Mentor junior members on a regular basis
- Troubleshoot issues with web applications
- Understand security principles and best practices
- Ensure that critical data is backed up
- Configure monitoring systems, including infrastructure monitoring and Application Performance Monitoring systems such as New Relic
- Ensure that web application infrastructure is built
- Act as Customer Technical Advocate and negotiate well with peers on technical fronts
- Be flexible enough to work different shifts as business requirements demand
- Handle multiple global clients on the technical front and generate the reports needed to represent the health of SRE delivery
Skills/Experience:-
- A key skill of an SRE Tech Lead is deep knowledge of the application, the code, and how it runs, is configured, and scales. That knowledge is what makes them so valuable at monitoring and supporting it as site reliability engineers.
- System administration, security, and networking
- The SRE Tech Lead is expected to have a good understanding of system administration (Linux or Windows) and networking.
- Essential commands
- User and group management
- Knowledge of networking concepts (DNS, TCP/IP, and firewalls)
- Service configuration
- Storage management
- Good grasp of fundamental security concepts
- Good understanding of infrastructure as code principles
- Knowledge of a scripting language such as Bash
- Ability to configure infrastructure using a configuration management technology such as Puppet, Chef, or Ansible
- Familiarity with Jenkins or any other CI/CD tool
- Proficiency in a high-level programming language such as Python or Go
- Understanding of container technologies such as Docker and Kubernetes
- 2+ years of hands-on experience with container orchestration technologies such as ECS, EKS, AKS, or Kubernetes would be beneficial
- Use Terraform and other IaC tools to deploy cloud infrastructure
Cloud technologies:-
- Experience designing available, cost-efficient, fault-tolerant, and scalable distributed systems on AWS/Azure
- Hands-on experience using compute, networking, storage, and database AWS/Azure services
- Hands-on experience of 4+ years with AWS/Azure deployment and management services
- Ability to identify and define technical requirements for an AWS/Azure-based application
- Ability to identify which AWS/Azure services meet a given technical requirement
- Knowledge of recommended best practices for building secure and reliable applications on the AWS/Azure platform
- An understanding of the AWS/Azure global infrastructure
- An understanding of network technologies as they relate to AWS/Azure
- An understanding of security features and tools that AWS/Azure provides and how they relate to traditional services
A network of the world's best developers - full-time, long-term remote software jobs with better compensation and career growth. We enable our clients to accelerate their Cloud Offering and Capitalize on Cloud. We have our own IoT/AI platform and we provide professional services on that platform to build custom clouds for their IoT devices. We also build mobile apps, run 24x7 DevOps/site reliability engineering for our clients.
We are looking for a friendly, dependable, and highly hands-on technical professional with plenty of experience as a backend and cloud engineer to provide site reliability services to our internal teams and end customers. We expect you to deliver top quality at high speed. You must have experience designing and developing amazing UI screens.
This person MUST have:
- BE Computer Science or equivalent
- Cloud app development experience.
- Strong Troubleshooting and debugging skills
- A strong passion for writing simple, clean, and efficient code.
- 3 years of experience with the Django framework and other backend technologies.
- Knowledge of NodeJS
- Experience with building, modifying, and extending API endpoints (REST or GraphQL) for data retrieval and persistence.
- Understand how to use a database like Postgres (preferred choice), SQLite, MongoDB, MySQL.
- Experience creating high-performance applications.
- Experience with messaging and broker tools - Rabbitmq, MQTT
- Experience with SQL and NoSQL databases
- Experience with the full software development life cycle, including requirements collection, design, implementation, testing, and operational support.
- Knowledge of web services
- Proficient understanding of code versioning tools Git.
- Hands-on experience deploying and managing infrastructure with CloudFormation/Terraform
- Experience managing AWS infrastructure.
- Hands-on experience in Linux environment.
- Basic understanding of Kubernetes/Docker orchestration.
- Manage existing infrastructure, pipelines, and engineering tools (on-prem or AWS) for the engineering team (build servers, Jenkins nodes, etc.)
- Experience with scrum or other agile software development methodology.
- Excellent verbal and written communication, teamwork, decision making and influencing skills.
- Handle customer calls/emails regarding technical issues for end-users.
- Strong communication skills
- Attention to detail.
Experience:
- Minimum 3 years of experience
Location:
- Ahmedabad Office Or,
- Work from home
Timings:
- 40 hours a week with a rotational shift every month.
Position:
- Full time/Direct
- We have great benefits such as PF, medical insurance, 12 annual company holidays, 12 PTO leaves per year, annual increments, Diwali bonus, spot bonuses and other incentives, etc.
- We don't believe in locking people in with long notice periods. You will stay here because you love the company. Our notice period is only 30 days.
Roles and Responsibilities
- Managing Availability, Performance, Capacity of infrastructure and applications.
- Building and implementing observability for applications health/performance/capacity.
- Optimizing On-call rotations and processes.
- Documenting “tribal” knowledge.
- Managing infra platforms like Mesos/Kubernetes, CI/CD, observability (Prometheus/New Relic/ELK), cloud platforms (AWS/Azure), databases, and data platform infrastructure.
- Providing help in onboarding new services with production readiness review process.
- Providing reports on services SLO/Error Budgets/Alerts and Operational Overhead.
- Working with Dev and Product teams to define SLO/Error Budgets/Alerts.
- Working with the Dev team to gain an in-depth understanding of the application architecture and its bottlenecks.
- Identifying observability gaps in product services and infrastructure, and working with stakeholders to fix them.
- Managing outages, doing detailed RCA with developers, and identifying ways to avoid a recurrence.
- Managing/Automating upgrades of the infrastructure services.
- Automate toil work.
Experience & Skills
- 6+ years of total experience
- Experience as an SRE/DevOps/Infrastructure Engineer on large scale microservices and infrastructure.
- A collaborative spirit with the ability to work across disciplines to influence, learn, and deliver.
- A deep understanding of computer science, software development, and networking principles.
- Demonstrated experience with languages such as Python, Java, Golang, etc.
- Extensive experience with Linux administration and a good understanding of the various Linux kernel subsystems (memory, storage, network, etc.).
- Extensive experience in DNS, TCP/IP, UDP, gRPC, routing, and load balancing.
- Expertise in GitOps, Infrastructure as Code tools such as Terraform, and configuration management tools such as Chef, Puppet, SaltStack, and Ansible.
- Expertise in Amazon Web Services (AWS) and/or other relevant cloud infrastructure solutions like Microsoft Azure or Google Cloud.
- Experience in building CI/CD solutions with tools such as Jenkins, GitLab, Spinnaker, Argo, etc.
- Experience in managing and deploying containerized environments using Docker and Mesos/Kubernetes is a plus.
Who You Are
- Creative thinker and strong problem solver with meticulous attention to detail
- Highly organized, creative, motivated, and passionate about achieving results
- Able to balance multiple tasks and projects effectively and quickly adapt to new situations and technologies
- Able to work both independently and as part of a team
- Systematic problem-solver, coupled with a strong sense of ownership and drive
What you need
- 3-7 years of experience as a Site Reliability Engineer or a mix of a software engineer and DevOps.
- Strong hands-on knowledge of Linux fundamentals, System administration scripting, performance tuning/scalability, troubleshooting.
- Write great quality code using SOLID principles including unit and integration tests.
- Hands-on development experience in an object-oriented programming language like Python.
- Hands-on experience developing task automations
- Experience using tools to create and manage CI (continuous integration) and CD (continuous delivery) pipelines.
- Familiarity with software development tools: source code management (SCM systems), code review systems, issue tracking tools, build tools, test frameworks, code quality tools.
- Experience implementing open-source observability and alerting tools, like Prometheus, Grafana, Cortex, Thanos, Alertmanager etc
- Solid knowledge of networking (VPC, VNet, DNS, etc.) and of the TCP/IP stack, internet routing, and load balancing.
- Experience with log and configuration management tools
- Prior experience of working with AWS, Azure, or GCP is a plus
- Prior experience of working with Kubernetes, Docker, and containers is a plus
- Strong interpersonal communication skills (including listening, speaking, and writing) and ability to work well in a diverse, team-focused environment with other SREs, Engineers, Product Managers, etc.
- Documenting your work should be in your DNA
What you get
- A chance to develop and build something (probably from scratch) which you can be proud of
- Build and Implement modern systems observability solutions including monitoring, alerting, metrics, logging, and APM & distributed tracing.
- Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
- Maintain business continuity by identifying and driving opportunities to make systems highly resilient and human-free.
- Closely work with the software engineering team to ensure accurate monitoring and metrics are being built into applications before going to production.
- Develop and maintain software modules for use and re-use in cloud and on-premise systems automation.
- Identify process gaps and implement process improvements to increase operational reliability
- Drive standardization efforts across the services, infrastructure, systems, and practices
- Develop systems and tools that help the development team uphold reliability principles
About the Role
Dremio’s SREs ensure that our internal and externally visible services have reliability and uptime appropriate to users' needs and a fast rate of improvement. You will be joining a newly formed team that will spearhead our efforts to launch a cloud service. This is an opportunity to join a very fast growth startup and help build a cloud service from the ground up.
Responsibilities and Ownership
- Ability to debug and optimize code and automate routine tasks.
- Evangelize and advocate for reliability practices across our organization.
- Collaborate with other Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, monitoring/alerting, capacity planning and launch reviews.
- Analyze and optimize our core product by developing and implementing reliability and performance practices.
- Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
- Be on-call for services that the SRE team owns.
- Practice sustainable incident response and blameless postmortems.
Qualifications
- 6+ years of relevant experience in the following areas: SRE, DevOps, Cloud Operations, Systems Engineering, or Software Engineering.
- Excellent command of cloud services on AWS/GCP/Azure, Kubernetes and CI/CD pipelines.
- You have moderate to advanced experience in Java, C, C++, Python, Go, or similar programming languages.
- You are interested in designing, analyzing and troubleshooting large-scale distributed systems.
- You have a systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
- You have a great ability to debug and optimize code and automate routine tasks.
- You have a solid background in software development and architecting resilient and reliable applications.
• Develop and maintain IaC using Terraform and Ansible
• Draft design documents that translate requirements into code.
• Deal with challenges associated with scale.
• Assume responsibilities from technical design through technical client support.
• Manage expectations with internal stakeholders and context-switch in a fast paced environment.
• Thrive in an environment that uses Elasticsearch extensively.
• Keep abreast of technology and contribute to the engineering strategy.
• Champion best development practices and provide mentorship.
What we’re looking for
• An AWS Certified Engineer with strong skills in
o Terraform
o Ansible
o *nix and shell scripting
• Preferably with experience in:
o Elasticsearch
o CircleCI
o CloudFormation
o Python
o Packer
o Docker
o Prometheus and Grafana
o Challenges of scale
o Production support
• Sharp analytical and problem-solving skills.
• Strong sense of ownership.
• Demonstrable desire to learn and grow.
• Excellent written and oral communication skills.
• Mature collaboration and mentoring abilities.