Requirements
- At least 3 years of experience, including relevant experience in managing development operations
- Hands-on experience with AWS
- Thorough knowledge of setting up release pipelines and managing multiple environments such as Beta, Staging, UAT, and Production
- Thorough knowledge of cloud best practices and architecture
- Hands-on experience with benchmarking and performance monitoring
- Identifying bottlenecks and taking pre-emptive measures to avoid downtime
- Hands-on knowledge of at least one of Chef, Puppet, or Ansible
- Hands-on experience with CloudFormation, Terraform, or other Infrastructure as Code tools is a plus
- Thorough experience with shell scripting; should not shy away from learning new technologies or programming languages
- Experience with other cloud providers like Azure and GCP is a plus
- Should be open to R&D for creative ways to improve performance while keeping costs low

What we want the person to do
- Manage, monitor, and provision infrastructure, primarily on AWS
- Be responsible for maintaining 100% uptime on production servers (Site Reliability)
- Set up a release pipeline for current releases and automate releases for Beta, Staging, and Production
- Maintain near-production replica environments on Beta and Staging
- Automate releases and versioning of static assets (experience with Chef/Puppet/Ansible)
- Hands-on experience with build tools like Jenkins, GitHub Actions, AWS CodeBuild, etc.
- Identify performance gaps and ways to fix them
- Hold weekly meetings with the Engineering Team to discuss changes and upgrades; these can relate to code issues or architecture bottlenecks
- Find creative ways of reducing cloud computing costs
- Convert infrastructure deployment and provisioning to Infrastructure as Code for reusability and scaling

Responsibilities
Our site reliability engineers work on improving the availability, scalability, performance, and reliability of enterprise production services for our products as well as our customers' data lake environments. You will use your expertise to improve the reliability and performance of Hadoop data lake clusters and data management services. Like our products, our SREs are expected to be platform- and vendor-agnostic when it comes to implementing, stabilizing, and tuning Hadoop ecosystems.
You will be required to provide implementation guidance, a best-practices framework, and technical thought leadership to our customers for their Hadoop data lake implementation and migration initiatives. You need to be 100% hands-on and, as required, test, monitor, administer, and operate multiple data lake clusters across data centers.
Troubleshoot issues across the entire stack: hardware, software, application, and network. Dive into problems with an eye to both immediate remediation and the follow-through changes and automation that will prevent future occurrences. You must demonstrate exceptional troubleshooting and strong architectural skills, and describe them clearly and effectively both verbally and in writing.

Requirements
- Customer-focused, self-driven, and motivated, with a strong work ethic and a passion for problem-solving.
- 4+ years of designing, implementing, tuning, and managing services in distributed, enterprise-scale on-premise and public/private cloud environments.
- Familiarity with infrastructure management and operations lifecycle concepts and ecosystem.
- Hadoop cluster design, implementation, management, and performance tuning experience with HDFS, YARN, Hive/Impala, Spark, Kerberos, and related Hadoop technologies is a must.
- Must have strong SQL/HQL query troubleshooting and tuning skills on Hive/HBase.
- Must have strong capacity planning experience for Hadoop ecosystems/data lakes.
- Good to have: hands-on experience with Kafka, Ranger/Sentry, NiFi, Ambari, Cloudera Manager, and HBase.
- Good to have: data modeling, data engineering, and data security experience within the Hadoop ecosystem.
- Good to have: deep JVM/Java debugging and tuning skills.

Who You Are
- Creative thinker and strong problem solver with meticulous attention to detail
- Highly organized, creative, motivated, and passionate about achieving results
- Able to balance multiple tasks and projects effectively and quickly adapt to new situations and technologies
- Able to work both independently and as part of a team
- Systematic problem-solver with a strong sense of ownership and drive

What you need
- 3-7 years of experience as a Site Reliability Engineer or in a mix of software engineering and DevOps roles.
- Strong hands-on knowledge of Linux fundamentals, system administration scripting, performance tuning/scalability, and troubleshooting.
- Ability to write high-quality code using SOLID principles, including unit and integration tests.
- Hands-on development experience in an object-oriented programming language like Python.
- Hands-on experience developing task automations.
- Experience using tools to create and manage CI (continuous integration) and CD (continuous delivery) pipelines.
- Familiarity with software development tools: source code management (SCM) systems, code review systems, issue tracking tools, build tools, test frameworks, and code quality tools.
- Experience implementing open-source observability and alerting tools like Prometheus, Grafana, Cortex, Thanos, Alertmanager, etc.
- Decent knowledge of networking (VPC, VNet, DNS, etc.) and of the TCP/IP stack, internet routing, and load balancing.
- Experience with log and configuration management tools.
- Prior experience working with AWS, Azure, or GCP is a plus.
- Prior experience working with Kubernetes, Docker, and containers is a plus.
- Strong interpersonal communication skills (including listening, speaking, and writing) and the ability to work well in a diverse, team-focused environment with other SREs, Engineers, Product Managers, etc.
- Documenting your work should be in your DNA.

What you get
- A chance to develop and build something (probably from scratch) that you can be proud of.
- Build and implement modern systems observability solutions, including monitoring, alerting, metrics, logging, APM, and distributed tracing.
- Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
- Maintain business continuity by identifying and driving opportunities to make systems highly resilient and free of manual intervention.
- Work closely with the software engineering team to ensure accurate monitoring and metrics are built into applications before they go to production.
- Develop and maintain software modules for use and re-use in cloud and on-premise systems automation.
- Identify process gaps and implement process improvements to increase operational reliability.
- Drive standardization efforts across services, infrastructure, systems, and practices.
- Develop systems and tools that help the development team uphold reliability principles.

Roles and Responsibilities
- Managing availability, performance, and capacity of infrastructure and applications.
- Building and implementing observability for application health, performance, and capacity.
- Optimizing on-call rotations and processes.
- Documenting "tribal" knowledge.
- Managing infrastructure platforms such as Mesos/Kubernetes, CI/CD, observability (Prometheus/New Relic/ELK), cloud platforms (AWS/Azure), databases, and data platform infrastructure.
- Helping onboard new services through the production readiness review process.
- Providing reports on service SLOs, error budgets, alerts, and operational overhead.
- Working with Dev and Product teams to define SLOs, error budgets, and alerts.
- Working with the Dev team to gain an in-depth understanding of the application architecture and its bottlenecks.
- Identifying observability gaps in product services and infrastructure, and working with stakeholders to fix them.
- Managing outages, performing detailed RCA with developers, and identifying ways to avoid recurrence.
- Managing and automating upgrades of infrastructure services.
- Automating toil.

Experience & Skills
- 6+ years of total experience.
- Experience as an SRE/DevOps/Infrastructure Engineer on large-scale microservices and infrastructure.
- A collaborative spirit with the ability to work across disciplines to influence, learn, and deliver.
- A deep understanding of computer science, software development, and networking principles.
- Demonstrated experience with languages such as Python, Java, Golang, etc.
- Extensive experience with Linux administration and a good understanding of the various Linux kernel subsystems (memory, storage, network, etc.).
- Extensive experience in DNS, TCP/IP, UDP, gRPC, routing, and load balancing.
- Expertise in GitOps, Infrastructure as Code tools such as Terraform, and configuration management tools such as Chef, Puppet, SaltStack, and Ansible.
- Expertise in Amazon Web Services (AWS) and/or other relevant cloud infrastructure solutions like Microsoft Azure or Google Cloud.
- Experience in building CI/CD solutions with tools such as Jenkins, GitLab, Spinnaker, Argo, etc.
- Experience in managing and deploying containerized environments using Docker and Mesos/Kubernetes is a plus.

What are we looking for:
● Research, propose, and evaluate, with a 5-year vision, the architecture, design, technologies, processes, and profiles related to Telco Cloud.
● Participate in the creation of a realistic technical-strategic roadmap to transform the network to Telco Cloud and prepare it for 5G.
● Using your deep technical expertise, provide detailed feedback to Product Management and Engineering, and contribute directly to the platform code base to enhance both the customer experience of the service and the SRE quality of life.
● Be aware of trends in network infrastructure as well as within the network engineering and OSS community, for example which technologies are being developed or launched.
● Stay current with infrastructure trends in the telco network cloud domain.
● Be responsible for the engineering of Lab and Production Telco Cloud environments, including patches, upgrades, and reliability and performance improvements.

Required Minimum Qualifications: (Education and Technical Skills/Knowledge)
● Software Engineering degree, MS in Computer Science, or equivalent experience
● Years of experience in an SRE, DevOps, development, and/or support related role
● 0-5 years of professional experience for a junior position
● At least 8 years of professional experience for a senior position
● Unix server administration and tuning: Linux / RedHat / CentOS / Ubuntu
● Deep knowledge of networking layers 1-4
● Cloud / Virtualization (at least two): Helm, Docker, Kubernetes, AWS, Azure, Google Cloud, OpenStack, OpenShift, VMware vSphere / Tanzu
● In-depth knowledge of cloud storage solutions on top of AWS, GCP, Azure, and/or on-prem private cloud, such as Ceph, CephFS, GlusterFS
● DevOps: Jenkins, Git, Azure DevOps, Ansible, Terraform
● Backend knowledge: Bash, Python, Go (knowledge of other scripting languages is a plus)
● PaaS-level solutions such as Keycloak for IAM, Prometheus, Grafana, ELK, DBaaS (such as MySQL, Cassandra)

About the Organisation:
The team at Coredge.io combines experienced and young professionals with many years of experience in edge computing, telecom application development, and Kubernetes. The company has continuously collaborated with the open source community, universities, and major industry players in furthering its goal of providing the industry with an indispensable tool to offer improved services to its customers. Coredge.io has a global market presence, with offices in the US and New Delhi, India.

- Ability to clearly articulate and demonstrate the value proposition.
- Provides technical support, configuration, and administration of the Citrix environment.
- Develops processes, procedures, and technical documentation for management of the Citrix environment.
- Performs system and application health monitoring and reporting.
- Consults with application owners to install and tune their applications.
- Provides technical support, configuration, and administration of Windows Server as required.
- Assists with architectural design to improve reliability, performance, and efficiency.
- Responds to situations where standard procedures have failed to isolate or fix a problem.
- Provides support for a 24 x 7 operation, when required.
- Establishes strong working relationships with key staff members in departments.

- 5+ years' experience as a Wintel Engineer/Architect with an outstanding track record of progressively increasing responsibilities.
- Strong ability to communicate effectively and multi-task with minimal supervision.
- Current certifications with Cisco, Citrix, and Microsoft required.
- Strong understanding of TCP/IP and the OSI model.
- Experience in Citrix XenApp, XenDesktop, and XenServer technologies.
- 5+ years of network, storage, and infrastructure (data center, etc.) experience.
- A strong understanding of leading manufacturers' routing and switching architectures, and experience with storage and virtualization products from leading manufacturers.

Requirements

Technical Skills
- Ability to design and deliver all Operations/SRE services and processes, including managing L2 environment support
- 5-12 years of overall environment support experience, with 5+ years of experience as a support/SRE engineer
- Experience in implementing monitoring solutions using APM tools (for example AppDynamics, Graylog, Dynatrace, Datadog, etc.), and in setting up and testing proactive monitoring alerts
- A broad knowledge profile, with real strength in some areas such as HTTP/TLS, DNS, networking, or containerization
- Comfortable with large-scale production systems and technologies, for example load balancing, monitoring, distributed systems, microservices, and configuration management

Process Skills
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
- Interest in designing, analyzing, and troubleshooting large-scale distributed systems

Behavioral Skills
- Practice sustainable incident response and blameless postmortems
- Proven ability in developing relationships with stakeholders, communicating project/program status, and understanding detailed business requirements across multiple project initiatives
- This role requires candidates to work in rotational shifts (24x7 support)

Benefits
LOCATION: Mumbai
COMPENSATION: Competitive
WHY ZYCUS?
- Be a part of one of the fastest-growing product companies in India
- Come join a young, dynamic, and enterprising team
- Work on the latest technologies
- Flexible working hours (as per business requirement)
Zycus Global Leader Procurement: https://www.zycus.com/newsroom/press-releases.html