Apache Oozie Jobs in Pune
Data Engineers develop modern data architecture approaches to meet key business objectives and provide end-to-end data solutions. You might spend a few weeks with a new client on a deep technical review or a complete organizational review, helping them to understand the potential that data brings to solve their most pressing problems. On other projects, you might be acting as the architect, leading the design of technical solutions, or perhaps overseeing a program inception to build a new product. It could also be a software delivery project where you're equally happy coding and tech-leading the team to implement the solution.
You’ll spend time on the following:
- You will partner with teammates to create complex data processing pipelines in order to solve our clients’ most ambitious challenges
- You will collaborate with Data Scientists in order to design scalable implementations of their models
- You will pair to write clean and iterative code based on TDD
- Leverage various continuous delivery practices to deploy data pipelines
- Advise and educate clients on how to use different distributed storage and computing technologies from the plethora of options available
- Develop modern data architecture approaches to meet key business objectives and provide end-to-end data solutions
- Create data models and speak to the tradeoffs of different modeling approaches
Here’s what we’re looking for:
- You have a good understanding of data modelling and experience with data engineering tools and platforms such as Kafka, Spark, and Hadoop
- You have built large-scale data pipelines and data-centric applications using any of the distributed storage platforms such as HDFS, S3, NoSQL databases (Hbase, Cassandra, etc.) and any of the distributed processing platforms like Hadoop, Spark, Hive, Oozie, and Airflow in a production setting
- Hands on experience in MapR, Cloudera, Hortonworks and/or cloud (AWS EMR, Azure HDInsights, Qubole etc.) based Hadoop distributions
- You are comfortable taking data-driven approaches and applying data security strategy to solve business problems
- Working with data excites you: you can build and operate data pipelines, and maintain data storage, all within distributed systems
- Strong communication and client-facing skills with the ability to work in a consulting environment
consulting & implementation services in the area of Oil & Gas, Mining and Manufacturing Industry
- Data Engineer
Required skill set: AWS GLUE, AWS LAMBDA, AWS SNS/SQS, AWS ATHENA, SPARK, SNOWFLAKE, PYTHON
- Experience in AWS Glue
- Experience in Apache Parquet
- Proficient in AWS S3 and data lake
- Knowledge of Snowflake
- Understanding of file-based ingestion best practices.
- Scripting language - Python & pyspark
- Create and manage cloud resources in AWS
- Data ingestion from different data sources which exposes data using different technologies, such as: RDBMS, REST HTTP API, flat files, Streams, and Time series data based on various proprietary systems. Implement data ingestion and processing with the help of Big Data technologies
- Data processing/transformation using various technologies such as Spark and Cloud Services. You will need to understand your part of business logic and implement it using the language supported by the base data platform
- Develop automated data quality check to make sure right data enters the platform and verifying the results of the calculations
- Develop an infrastructure to collect, transform, combine and publish/distribute customer data.
- Define process improvement opportunities to optimize data collection, insights and displays.
- Ensure data and results are accessible, scalable, efficient, accurate, complete and flexible
- Identify and interpret trends and patterns from complex data sets
- Construct a framework utilizing data visualization tools and techniques to present consolidated analytical and actionable results to relevant stakeholders.
- Key participant in regular Scrum ceremonies with the agile teams
- Proficient at developing queries, writing reports and presenting findings
- Mentor junior members and bring best industry practices
- 5-7+ years’ experience as data engineer in consumer finance or equivalent industry (consumer loans, collections, servicing, optional product, and insurance sales)
- Strong background in math, statistics, computer science, data science or related discipline
- Advanced knowledge one of language: Java, Scala, Python, C#
- Production experience with: HDFS, YARN, Hive, Spark, Kafka, Oozie / Airflow, Amazon Web Services (AWS), Docker / Kubernetes, Snowflake
- Proficient with
- Data mining/programming tools (e.g. SAS, SQL, R, Python)
- Database technologies (e.g. PostgreSQL, Redshift, Snowflake. and Greenplum)
- Data visualization (e.g. Tableau, Looker, MicroStrategy)
- Comfortable learning about and deploying new technologies and tools.
- Organizational skills and the ability to handle multiple projects and priorities simultaneously and meet established deadlines.
- Good written and oral communication skills and ability to present results to non-technical audiences
- Knowledge of business intelligence and analytical tools, technologies and techniques.
Familiarity and experience in the following is a plus:
- AWS certification
- Spark Streaming
- Kafka Streaming / Kafka Connect
- ELK Stack
- Cassandra / MongoDB
- CI/CD: Jenkins, GitLab, Jira, Confluence other related tools
DeepIntent is a marketing technology company that helps healthcare brands strengthen communication with patients and healthcare professionals by enabling highly effective and performant digital advertising campaigns. Our healthcare technology platform, MarketMatch™, connects advertisers, data providers, and publishers to operate the first unified, programmatic marketplace for healthcare marketers. The platform’s built-in identity solution matches digital IDs with clinical, behavioral, and contextual data in real-time so marketers can qualify 1.6M+ verified HCPs and 225M+ patients to find their most clinically-relevant audiences, and message them on a one-to-one basis in a privacy compliant way. Healthcare marketers use MarketMatch to plan, activate, and measure digital campaigns in ways that best suit their business, from managed service engagements to technical integration or self-service solutions. DeepIntent was founded by Memorial Sloan Kettering alumni in 2016 and acquired by Propel Media, Inc. in 2017. We proudly serve major pharmaceutical and Fortune 500 companies out of our offices in New York, Bosnia and India.
Roles and Responsibilities
- Establish formal data practice for the organisation.
- Build & operate scalable and robust data architectures.
- Create pipelines for the self-service introduction and usage of new data
- Implement DataOps practices
- Design, Develop, operate Data Pipelines which support Data scientists and machine learning Engineers.
- Build simple, highly reliable Data storage, ingestion, transformation solutions which are easy to deploy and manage.
- Collaborate with various business stakeholders, software engineers, machine learning engineers, analysts.
- Experience in designing, developing and operating configurable Data pipelines serving high volume and velocity data.
- Experience working with public clouds like GCP/AWS.
- Good understanding of software engineering, DataOps, and data architecture, Agile and DevOps methodologies.
- Experience building Data architectures that optimize performance and cost, whether the components are prepackaged or homegrown
- Proficient with SQL,Python or JVM based language, Bash.
- Experience with any of Apache open source projects such as Spark, Druid, Beam, Airflow etc.and big data databases like BigQuery, Clickhouse, etc
- Good communication skills with ability to collaborate with both technical and non technical people.
- Ability to Think Big, take bets and innovate, Dive Deep, Bias for Action, Hire and Develop the Best, Learn and be Curious.
5+ years of professional experience in experiment design and applied machine learning predicting outcomes in large-scale, complex datasets.
Proficiency in Python, Azure ML, or other statistics/ML tools.
Proficiency in Deep Neural Network, Python based frameworks.
Proficiency in Azure DataBricks, Hive, Spark.
Proficiency in deploying models into production (Azure stack).
Moderate coding skills. SQL or similar required. C# or other languages strongly preferred.
Outstanding communication and collaboration skills. You can learn from and teach others.
Strong drive for results. You have a proven record of shepherding experiments to create successful shipping products/services.
Experience with prediction in adversarial (energy) environments highly desirable.
Understanding of the model development ecosystem across platforms, including development, distribution, and best practices, highly desirable.
A Masters or Ph.D degree with coursework in Statistics, Data Science, Experimentation Design, and Machine Learning highly desirable
In-person Interview- 24th Sept, Saturday- Pune Office
- Partners with business stakeholders to translate business objectives into clearly defined analytical projects.
- Identify opportunities for text analytics and NLP to enhance the core product platform, select the best machine learning techniques for the specific business problem and then build the models that solve the problem.
- Own the end-end process, from recognizing the problem to implementing the solution.
- Define the variables and their inter-relationships and extract the data from our data repositories, leveraging infrastructure including Cloud computing solutions and relational database environments.
- Build predictive models that are accurate and robust and that help our customers to utilize the core platform to the maximum extent.
Skills and Qualification
- 12 to 15 yrs of experience.
- An advanced degree in predictive analytics, machine learning, artificial intelligence; or a degree in programming and significant experience with text analytics/NLP. He shall have a strong background in machine learning (unsupervised and supervised techniques). In particular, excellent understanding of machine learning techniques and algorithms, such as k-NN, Naive Bayes, SVM, Decision Forests, logistic regression, MLPs, RNNs, etc.
- Experience with text mining, parsing, and classification using state-of-the-art techniques.
- Experience with information retrieval, Natural Language Processing, Natural Language
- Understanding and Neural Language Modeling.
- Ability to evaluate the quality of ML models and to define the right performance metrics for models in accordance with the requirements of the core platform.
- Experience in the Python data science ecosystem: Pandas, NumPy, SciPy, sci-kit-learn, NLTK, Gensim, etc.
- Excellent verbal and written communication skills, particularly possessing the ability to share technical results and recommendations to both technical and non-technical audiences.
- Ability to perform high-level work both independently and collaboratively as a project member or leader on multiple projects.
We’re hiring a talented Data Engineer and Big Data enthusiast to work in our platform to help ensure that our data quality is flawless. As a company, we have millions of new data points every day that come into our system. You will be working with a passionate team of engineers to solve challenging problems and ensure that we can deliver the best data to our customers, on-time. You will be using the latest cloud data warehouse technology to build robust and reliable data pipelines.
Exceptional candidates will have:
Data Engineer – SQL, RDBMS, pySpark/Scala, Python, Hive, Hadoop, Unix
Data engineering services required:
- Build data products and processes alongside the core engineering and technology team;
- Collaborate with senior data scientists to curate, wrangle, and prepare datafor use in their advanced analytical models;
- Integrate datafrom a variety of sources, assuring that they adhere to data quality and accessibility standards;
- Modify and improve data engineering processes to handle ever larger, more complex, and more types of data sources and pipelines;
- Use Hadoop architecture and HDFS commands to design and optimize data queries at scale;
- Evaluate and experiment with novel data engineering tools and advises information technology leads and partners about new capabilities to determine optimal solutions for particular technical problems or designated use cases.
Big data engineering skills:
- Demonstrated ability to perform the engineering necessary to acquire, ingest, cleanse, integrate, and structure massive volumes of data from multiple sources and systems into enterprise analytics platforms;
- Proven ability to design and optimize queries to build scalable, modular, efficient data pipelines;
- Ability to work across structured, semi-structured, and unstructured data, extracting information and identifying linkages across disparate data sets;
- Proven experience delivering production-ready data engineering solutions, including requirements definition, architecture selection, prototype development, debugging, unit-testing, deployment, support, and maintenance;
- Ability to operate with a variety of data engineering tools and technologies
● Frame ML / AI use cases that can improve the company’s product
● Implement and develop ML / AI / Data driven rule based algorithms as software items
● For example, building a chatbot that replies an answer from relevant FAQ, and
reinforcing the system with a feedback loop so that the bot improves
Must have skills:
● Data extraction and ETL
● Python (numpy, pandas, comfortable with OOP)
● Knowledge of basic Machine Learning / Deep Learning / AI algorithms and ability to
● Good understanding of SDLC
● Deployed ML / AI model in a mobile / web product
● Soft skills : Strong communication skills & Critical thinking ability
Good to have:
● Full stack development experience
B.Tech. / B.E. degree in Computer Science or equivalent software engineering
- Minimum 1 years of relevant experience, in PySpark (mandatory)
- Hands on experience in development, test, deploy, maintain and improving data integration pipeline in AWS cloud environment is added plus
- Ability to play lead role and independently manage 3-5 member of Pyspark development team
- EMR ,Python and PYspark mandate.
- Knowledge and awareness working with AWS Cloud technologies like Apache Spark, , Glue, Kafka, Kinesis, and Lambda in S3, Redshift, RDS
- Design, create, test, and maintain data pipeline architecture in collaboration with the Data Architect.
- Build the infrastructure required for extraction, transformation, and loading of data from a wide variety of data sources using Java, SQL, and Big Data technologies.
- Support the translation of data needs into technical system requirements. Support in building complex queries required by the product teams.
- Build data pipelines that clean, transform, and aggregate data from disparate sources
- Develop, maintain and optimize ETLs to increase data accuracy, data stability, data availability, and pipeline performance.
- Engage with Product Management and Business to deploy and monitor products/services on cloud platforms.
- Stay up-to-date with advances in data persistence and big data technologies and run pilots to design the data architecture to scale with the increased data sets of consumer experience.
- Handle data integration, consolidation, and reconciliation activities for digital consumer / medical products.
- Bachelor’s or master's degree in Computer Science, Information management, Statistics or related field
- 5+ years of experience in the Consumer or Healthcare industry in an analytical role with a focus on building on data pipelines, querying data, analyzing, and clearly presenting analyses to members of the data science team.
- Technical expertise with data models, data mining.
- Hands-on Knowledge of programming languages in Java, Python, R, and Scala.
- Strong knowledge in Big data tools like the snowflake, AWS Redshift, Hadoop, map-reduce, etc.
- Having knowledge in tools like AWS Glue, S3, AWS EMR, Streaming data pipelines, Kafka/Kinesis is desirable.
- Hands-on knowledge in SQL and No-SQL database design.
- Having knowledge in CI/CD for the building and hosting of the solutions.
- Having AWS certification is an added advantage.
- Having Strong knowledge in visualization tools like Tableau, QlikView is an added advantage
- A team player capable of working and integrating across cross-functional teams for implementing project requirements. Experience in technical requirements gathering and documentation.
- Ability to work effectively and independently in a fast-paced agile environment with tight deadlines
- A flexible, pragmatic, and collaborative team player with the innate ability to engage with data architects, analysts, and scientists