- Our Site Reliability Engineers work on improving the availability, scalability, performance, and reliability of enterprise production services for our products as well as our customers’ data lake environments.
- You will use your expertise to improve the reliability and performance of Hadoop data lake clusters and data management services. Like our products, our SREs are expected to be platform- and vendor-agnostic when it comes to implementing, stabilizing, and tuning Hadoop ecosystems.
- You will be required to provide implementation guidance, best-practice frameworks, and technical thought leadership to our customers for their Hadoop data lake implementation and migration initiatives.
- You need to be 100% hands-on and, as required, test, monitor, administer, and operate multiple data lake clusters across data centers.
- Troubleshoot issues across the entire stack: hardware, software, application, and network.
- Dive into problems with an eye to both immediate remediation and the follow-through changes and automation that will prevent future occurrences.
- Must demonstrate exceptional troubleshooting and strong architectural skills, and be able to explain findings clearly and effectively, both verbally and in writing.
- Customer-focused, self-driven, and motivated, with a strong work ethic and a passion for problem-solving.
- 4+ years of designing, implementing, tuning, and managing services in distributed, enterprise-scale on-premises and public/private cloud environments.
- Familiarity with infrastructure management and operations lifecycle concepts and ecosystem.
- Hadoop cluster design, implementation, management, and performance tuning experience with HDFS, YARN, Hive/Impala, Spark, Kerberos, and related Hadoop technologies is a must.
- Must have strong SQL/HQL query troubleshooting and tuning skills on Hive/HBase.
- Must have strong capacity planning experience for Hadoop ecosystems/data lakes.
- Good to have hands-on experience with Kafka, Ranger/Sentry, NiFi, Ambari, Cloudera Manager, and HBase.
- Good to have data modeling, data engineering, and data security experience within the Hadoop ecosystem.
- Good to have deep JVM/Java debugging and tuning skills.
Senior SRE - Acceldata (IC3 Level)
About the Job
You will join a team of highly skilled engineers who are responsible for delivering Acceldata’s support services. Our Site Reliability Engineers are trained to be active listeners and demonstrate empathy when customers encounter product issues. In our fun and collaborative environment, Site Reliability Engineers develop strong business, interpersonal, and technical skills to deliver high-quality service to our valued customers.
When you arrive for your first day, we’ll want you to have:
- Solid skills in troubleshooting to repair failed products or processes on a machine or a system using a logical, systematic search for the source of a problem in order to solve it, and make the product or process operational again
- A strong ability to understand the feelings of our customers as we empathize with them on the issue at hand
- A strong desire to increase your product and technology skillset and your confidence in supporting our products, so you can help our customers succeed
In this position you will…
- Provide support services to our Gold & Enterprise customers using our flagship Acceldata Pulse, Flow & Torch product suites. This may include assistance provided during the engineering and operations of distributed systems as well as responses for mission-critical systems and production customers.
- Demonstrate the ability to actively listen to customers and show empathy for the business impact they face when they experience issues with our products
- Participate in the queue management and coordination process by owning customer escalations and managing the unassigned queue.
- Be involved with and work on other support-related activities, such as performing POCs and assisting with onboarding deployments of Acceldata and Hadoop distribution products.
- Triage, diagnose and escalate customer inquiries when applicable during their engineering and operations efforts.
- Collaborate and share solutions with both customers and the internal team.
- Investigate product-related issues, both for particular customers and for common trends that may arise
- Study and understand critical system components and large cluster operations
- Differentiate between issues that arise in operations, user code, or product
- Coordinate enhancement and feature requests with product management and the Acceldata engineering team.
- Be flexible about working in shifts.
- Participate in a rotational weekend on-call roster for critical support needs.
- Participate as a designated or dedicated engineer for specific customers. This engagement involves building long-term, successful relationships with customers, leading weekly status calls, and making occasional visits to customer sites
In this position, you should have…
- A strong desire and aptitude to become a well-rounded support professional. Acceldata Support considers the service we deliver as our core product.
- A positive attitude towards feedback and continual improvement
- A willingness to give direct feedback to and partner with management to improve team operations
- A tenacity to bring calm and order to the often stressful situations of customer cases
- The ability to multitask across many customer situations simultaneously
- Bachelor’s degree in Computer Science or Engineering, or equivalent experience. Master’s degree is a plus
- 2+ years of experience managing and supporting cloud infrastructure on at least one of the following platforms: Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP). Knowledge of Kubernetes and Docker is a must.
- Strong troubleshooting skills (for example, TCP/IP, DNS, file systems, load balancing, databases, Java)
- Excellent communication skills in English (written and verbal)
- Prior enterprise support experience in a technical environment strongly preferred
Strong hands-on experience working with or supporting the following
- 8-12 years of experience with a highly scalable, distributed, multi-node environment (50+ nodes)
- Hadoop operations, including ZooKeeper, HDFS, YARN, Hive, and related components like the Hive metastore, Cloudera Manager/Ambari, etc.
- Authentication and security configuration and tuning (Knox, LDAP, Kerberos, SSL/TLS; second priority: SSO/OAuth/OIDC, Ranger/Sentry)
- Java troubleshooting, e.g., collection and evaluation of jstacks, heap dumps
You might also have…
- Linux, NFS, Windows, including application installation, scripting, basic command line
- Docker and Kubernetes configuration and troubleshooting, including Helm charts, storage options, logging, and basic kubectl CLI
- Experience working with scripting languages (Bash, PowerShell, Python)
- Working knowledge of application, server, and network security management concepts
- Familiarity with virtual machine technologies
- Knowledge of databases like MySQL and PostgreSQL
- Certification from any of the leading cloud providers (AWS, Azure, GCP) and/or in Kubernetes is a big plus
The right person in this role has an opportunity to make a huge impact at Acceldata and add value to our future decisions. If this position has piqued your interest and you have what we described, we invite you to apply! An adventure in data awaits.
Learn more at https://www.acceldata.io/about-us