- Big data developer with 8+ years of professional IT experience with expertise in Hadoop ecosystem components in ingestion, Data modeling, querying, processing, storage, analysis, Data Integration and Implementing enterprise level systems spanning Big Data.
- A skilled developer with strong problem solving, debugging and analytical capabilities, who actively engages in understanding customer requirements.
- Expertise in Apache Hadoop ecosystem components like Spark, Hadoop Distributed File Systems(HDFS), HiveMapReduce, Hive, Sqoop, HBase, Zookeeper, YARN, Flume, Pig, Nifi, Scala and Oozie.
- Hands on experience in creating real - time data streaming solutions using Apache Spark core, Spark SQL & DataFrames, Kafka, Spark streaming and Apache Storm.
- Excellent knowledge of Hadoop architecture and daemons of Hadoop clusters, which include Name node,Data node, Resource manager, Node Manager and Job history server.
- Worked on both Cloudera and Horton works in Hadoop Distributions. Experience in managing Hadoop clustersusing Cloudera Manager tool.
- Well versed in installation, Configuration, Managing of Big Data and underlying infrastructure of Hadoop Cluster.
- Hands on experience in coding MapReduce/Yarn Programs using Java, Scala and Python for analyzing Big Data.
- Exposure to Cloudera development environment and management using Cloudera Manager.
- Extensively worked on Spark using Scala on cluster for computational (analytics), installed it on top of Hadoop performed advanced analytical application by making use of Spark with Hive and SQL/Oracle .
- Implemented Spark using PYTHON and utilizing Data frames and Spark SQL API for faster processing of data and handled importing data from different data sources into HDFS using Sqoop and performing transformations using Hive, MapReduce and then loading data into HDFS.
- Used Spark Data Frames API over Cloudera platform to perform analytics on Hive data.
- Hands on experience in MLlib from Spark which are used for predictive intelligence, customer segmentation and for smooth maintenance in Spark streaming.
- Experience in using Flume to load log files into HDFS and Oozie for workflow design and scheduling.
- Experience in optimizing MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
- Working on creating data pipeline for different events of ingestion, aggregation, and load consumer response data into Hive external tables in HDFS location to serve as feed for tableau dashboards.
- Hands on experience in using Sqoop to import data into HDFS from RDBMS and vice-versa.
- In-depth Understanding of Oozie to schedule all Hive/Sqoop/HBase jobs.
- Hands on expertise in real time analytics with Apache Spark.
- Experience in converting Hive/SQL queries into RDD transformations using Apache Spark, Scala and Python.
- Extensive experience in working with different ETL tool environments like SSIS, Informatica and reporting tool environments like SQL Server Reporting Services (SSRS).
- Experience in Microsoft cloud and setting cluster in Amazon EC2 & S3 including the automation of setting & extending the clusters in AWS Amazon cloud.
- Extensively worked on Spark using Python on cluster for computational (analytics), installed it on top of Hadoop performed advanced analytical application by making use of Spark with Hive and SQL.
- Strong experience and knowledge of real time data analytics using Spark Streaming, Kafka and Flume.
- Knowledge in installation, configuration, supporting and managing Hadoop Clusters using Apache, Cloudera (CDH3, CDH4) distributions and on Amazon web services (AWS).
- Experienced in writing Ad Hoc queries using Cloudera Impala, also used Impala analytical functions.
- Experience in creating Data frames using PySpark and performing operation on the Data frames using Python.
- In depth understanding/knowledge of Hadoop Architecture and various components such as HDFS and MapReduce Programming Paradigm, High Availability and YARN architecture.
- Establishing multiple connections to different Redshift clusters (Bank Prod, Card Prod, SBBDA Cluster) and provide the access for pulling the information we need for analysis.
- Generated various kinds of knowledge reports using Power BI based on Business specification.
- Developed interactive Tableau dashboards to provide a clear understanding of industry specific KPIs using quick filters and parameters to handle them more efficiently.
- Well Experience in projects using JIRA, Testing, Maven and Jenkins build tools.
- Experienced in designing, built, and deploying and utilizing almost all the AWS stack (Including EC2, S3,), focusing on high-availability, fault tolerance, and auto-scaling.
- Good experience with use-case development, with Software methodologies like Agile and Waterfall.
- Working knowledge of Amazon's Elastic Cloud Compute( EC2 ) infrastructure for computational tasks and Simple Storage Service ( S3 ) as Storage mechanism.
- Good working experience in importing data using Sqoop, SFTP from various sources like RDMS, Teradata, Mainframes, Oracle, Netezza to HDFS and performed transformations on it using Hive, Pig and Spark .
- Extensive experience in Text Analytics, developing different Statistical Machine Learning solutions to various business problems and generating data visualizations using Python and R.
- Proficient in NoSQL databases including HBase, Cassandra, MongoDB and its integration with Hadoop cluster.
- Hands on experience in Hadoop Big data technology working on MapReduce, Pig, Hive as Analysis tool, Sqoop and Flume data import/export tools.
About Molecular Connections
Similar jobs
RESPONSIBILITIES:
Requirement understanding and elicitation, analyze, data/workflows, contribute to product
project and Proof of concept (POC)
Contribute to prepare design documents and effort estimations.
Develop AI/ML Models using best in-class ML models.
Building, testing, and deploying AI/ML solutions.
Work with Business Analysts and Product Managers to assist with defining functional user
stories.
Ensure deliverables across teams are of high quality and clearly documented.
Recommend best ML practices/Industry standards for any ML use case.
Proactively take up R and D and recommend solution options for any ML use case.
REQUIREMENTS:
Required Skills
Overall experience of 4 to 7 Years working on AI/ML framework development
Good programming knowledge in Python is must.
Good Knowledge of R and SAS is desired.
Good hands on and working knowledge SQL, Data Model, CRISP-DM.
Proficiency with Uni/multivariate statistics, algorithm design, and predictive AI/ML modelling.
Strong knowledge of machine learning algorithms, linear regression, logistic regression, KNN,
Random Forest, Support Vector Machines and Natural Language Processing.
Experience with NLP and deep neural networks using synthetic and artificial data.
Involved in different phases of SDLC and have good working exposure on different SLDC’s like
Agile Methodologies.
Data Scientist – Delivery & New Frontiers Manager
Job Description:
We are seeking highly skilled and motivated data scientist to join our Data Science team. The successful candidate will play a pivotal role in our data-driven initiatives and be responsible for designing, developing, and deploying data science solutions that drives business values for stakeholders. This role involves mapping business problems to a formal data science solution, working with wide range of structured and unstructured data, architecture design, creating sophisticated models, setting up operations for the data science product with the support from MLOps team and facilitating business workshops. In a nutshell, this person will represent data science and provide expertise in the full project cycle. Expectation of the successful candidate will be above that of a typical data scientist. Beyond technical expertise, problem solving in complex set-up will be key to the success for this role.
Responsibilities:
- Collaborate with cross-functional teams, including software engineers, product managers, and business stakeholders, to understand business needs and identify data science opportunities.
- Map complex business problems to data science problem, design data science solution using GCP/Azure Databricks platform.
- Collect, clean, and preprocess large datasets from various internal and external sources.
- Streamlining data science process working with Data Engineering, and Technology teams.
- Managing multiple analytics projects within a Function to deliver end-to-end data science solutions, creation of insights and identify patterns.
- Develop and maintain data pipelines and infrastructure to support the data science projects
- Communicate findings and recommendations to stakeholders through data visualizations and presentations.
- Stay up to date with the latest data science trends and technologies, specifically for GCP companies
Education / Certifications:
Bachelor’s or Master’s in Computer Science, Engineering, Computational Statistics, Mathematics.
Job specific requirements:
- Brings 5+ years of deep data science experience
∙ Strong knowledge of machine learning and statistical modeling techniques in a in a clouds-based environment such as GCP, Azure, Amazon
- Experience with programming languages such as Python, R, Spark
- Experience with data visualization tools such as Tableau, Power BI, and D3.js
- Strong understanding of data structures, algorithms, and software design principles
- Experience with GCP platforms and services such as Big Query, Cloud ML Engine, and Cloud Storage
- Experience in configuring and setting up the version control on Code, Data, and Machine Learning Models using GitHub.
- Self-driven, be able to work with cross-functional teams in a fast-paced environment, adaptability to the changing business needs.
- Strong analytical and problem-solving skills
- Excellent verbal and written communication skills
- Working knowledge with application architecture, data security and compliance team.
- Bring in industry best practices around creating and maintaining robust data pipelines for complex data projects with/without AI component
- programmatically ingesting data from several static and real-time sources (incl. web scraping)
- rendering results through dynamic interfaces incl. web / mobile / dashboard with the ability to log usage and granular user feedbacks
- performance tuning and optimal implementation of complex Python scripts (using SPARK), SQL (using stored procedures, HIVE), and NoSQL queries in a production environment
- Industrialize ML / DL solutions and deploy and manage production services; proactively handle data issues arising on live apps
- Perform ETL on large and complex datasets for AI applications - work closely with data scientists on performance optimization of large-scale ML/DL model training
- Build data tools to facilitate fast data cleaning and statistical analysis
- Ensure data architecture is secure and compliant
- Resolve issues escalated from Business and Functional areas on data quality, accuracy, and availability
- Work closely with APAC CDO and coordinate with a fully decentralized team across different locations in APAC and global HQ (Paris).
You should be
- Expert in structured and unstructured data in traditional and Big data environments – Oracle / SQLserver, MongoDB, Hive / Pig, BigQuery, and Spark
- Have excellent knowledge of Python programming both in traditional and distributed models (PySpark)
- Expert in shell scripting and writing schedulers
- Hands-on experience with Cloud - deploying complex data solutions in hybrid cloud / on-premise environment both for data extraction/storage and computation
- Hands-on experience in deploying production apps using large volumes of data with state-of-the-art technologies like Dockers, Kubernetes, and Kafka
- Strong knowledge of data security best practices
- 5+ years experience in a data engineering role
- Science / Engineering graduate from a Tier-1 university in the country
- And most importantly, you must be a passionate coder who really cares about building apps that can help people do things better, smarter, and faster even when they sleep
Requirements:
● Understanding our data sets and how to bring them together.
● Working with our engineering team to support custom solutions offered to the product development.
● Filling the gap between development, engineering and data ops.
● Creating, maintaining and documenting scripts to support ongoing custom solutions.
● Excellent organizational skills, including attention to precise details
● Strong multitasking skills and ability to work in a fast-paced environment
● 5+ years experience with Python to develop scripts.
● Know your way around RESTFUL APIs.[Able to integrate not necessary to publish]
● You are familiar with pulling and pushing files from SFTP and AWS S3.
● Experience with any Cloud solutions including GCP / AWS / OCI / Azure.
● Familiarity with SQL programming to query and transform data from relational Databases.
● Familiarity to work with Linux (and Linux work environment).
● Excellent written and verbal communication skills
● Extracting, transforming, and loading data into internal databases and Hadoop
● Optimizing our new and existing data pipelines for speed and reliability
● Deploying product build and product improvements
● Documenting and managing multiple repositories of code
● Experience with SQL and NoSQL databases (Casendra, MySQL)
● Hands-on experience in data pipelining and ETL. (Any of these frameworks/tools: Hadoop, BigQuery,
RedShift, Athena)
● Hands-on experience in AirFlow
● Understanding of best practices, common coding patterns and good practices around
● storing, partitioning, warehousing and indexing of data
● Experience in reading the data from Kafka topic (both live stream and offline)
● Experience in PySpark and Data frames
Responsibilities:
You’ll
● Collaborating across an agile team to continuously design, iterate, and develop big data systems.
● Extracting, transforming, and loading data into internal databases.
● Optimizing our new and existing data pipelines for speed and reliability.
● Deploying new products and product improvements.
● Documenting and managing multiple repositories of code.
- Mandatory - Hands on experience in Python and PySpark.
- Build pySpark applications using Spark Dataframes in Python using Jupyter notebook and PyCharm(IDE).
- Worked on optimizing spark jobs that processes huge volumes of data.
- Hands on experience in version control tools like Git.
- Worked on Amazon’s Analytics services like Amazon EMR, Lambda function etc
- Worked on Amazon’s Compute services like Amazon Lambda, Amazon EC2 and Amazon’s Storage service like S3 and few other services like SNS.
- Experience/knowledge of bash/shell scripting will be a plus.
- Experience in working with fixed width, delimited , multi record file formats etc.
- Hands on experience in tools like Jenkins to build, test and deploy the applications
- Awareness of Devops concepts and be able to work in an automated release pipeline environment.
- Excellent debugging skills.
You will be responsible for designing, building, and maintaining data pipelines that handle Real-world data at Compile. You will be handling both inbound and outbound data deliveries at Compile for datasets including Claims, Remittances, EHR, SDOH, etc.
You will
- Work on building and maintaining data pipelines (specifically RWD).
- Build, enhance and maintain existing pipelines in pyspark, python and help build analytical insights and datasets.
- Scheduling and maintaining pipeline jobs for RWD.
- Develop, test, and implement data solutions based on the design.
- Design and implement quality checks on existing and new data pipelines.
- Ensure adherence to security and compliance that is required for the products.
- Maintain relationships with various data vendors and track changes and issues across vendors and deliveries.
You have
- Hands-on experience with ETL process (min of 5 years).
- Excellent communication skills and ability to work with multiple vendors.
- High proficiency with Spark, SQL.
- Proficiency in Data modeling, validation, quality check, and data engineering concepts.
- Experience in working with big-data processing technologies using - databricks, dbt, S3, Delta lake, Deequ, Griffin, Snowflake, BigQuery.
- Familiarity with version control technologies, and CI/CD systems.
- Understanding of scheduling tools like Airflow/Prefect.
- Min of 3 years of experience managing data warehouses.
- Familiarity with healthcare datasets is a plus.
Compile embraces diversity and equal opportunity in a serious way. We are committed to building a team of people from many backgrounds, perspectives, and skills. We know the more inclusive we are, the better our work will be.
• Help build a Data Science team which will be engaged in researching, designing,
implementing, and deploying full-stack scalable data analytics vision and machine learning
solutions to challenge various business issues.
• Modelling complex algorithms, discovering insights and identifying business
opportunities through the use of algorithmic, statistical, visualization, and mining techniques
• Translates business requirements into quick prototypes and enable the
development of big data capabilities driving business outcomes
• Responsible for data governance and defining data collection and collation
guidelines.
• Must be able to advice, guide and train other junior data engineers in their job.
Must Have:
• 4+ experience in a leadership role as a Data Scientist
• Preferably from retail, Manufacturing, Healthcare industry(not mandatory)
• Willing to work from scratch and build up a team of Data Scientists
• Open for taking up the challenges with end to end ownership
• Confident with excellent communication skills along with a good decision maker
Skills- Informatica with Big Data Management
1.Minimum 6 to 8 years of experience in informatica BDM development
2.Experience working on Spark/SQL
3.Develops informtica mapping/Sql
The Data Engineer would be responsible for selecting and integrating Big Data tools and frameworks required. Would implement Data Ingestion & ETL/ELT processes
Required Experience, Skills and Qualifications:
- Hands on experience on Big Data tools/technologies like Spark, Databricks, Map Reduce, Hive, HDFS.
- Expertise and excellent understanding of big data toolset such as Sqoop, Spark-streaming, Kafka, NiFi
- Proficiency in any of the programming language: Python/ Scala/ Java with 4+ years’ experience
- Experience in Cloud infrastructures like MS Azure, Data lake etc
- Good working knowledge in NoSQL DB (Mongo, HBase, Casandra)
The candidate,
1. Must have a very good hands-on technical experience of 3+ years with JAVA or Python
2. Working experience and good understanding of AWS Cloud; Advanced experience with IAM policy and role management
3. Infrastructure Operations: 5+ years supporting systems infrastructure operations, upgrades, deployments using Terraform, and monitoring
4. Hadoop: Experience with Hadoop (Hive, Spark, Sqoop) and / or AWS EMR
5. Knowledge on PostgreSQL/MySQL/Dynamo DB backend operations
6. DevOps: Experience with DevOps automation - Orchestration/Configuration Management and CI/CD tools (Jenkins)
7. Version Control: Working experience with one or more version control platforms like GitHub or GitLab
8. Knowledge on AWS Quick sight reporting
9. Monitoring: Hands on experience with monitoring tools such as AWS CloudWatch, AWS CloudTrail, Datadog and Elastic Search
10. Networking: Working knowledge of TCP/IP networking, SMTP, HTTP, load-balancers (ELB) and high availability architecture
11. Security: Experience implementing role-based security, including AD integration, security policies, and auditing in a Linux/Hadoop/AWS environment. Familiar with penetration testing and scan tools for remediation of security vulnerabilities.
12. Demonstrated successful experience learning new technologies quickly
WHAT WILL BE THE ROLES AND RESPONSIBILITIES?
1. Create procedures/run books for operational and security aspects of AWS platform
2. Improve AWS infrastructure by developing and enhancing automation methods
3. Provide advanced business and engineering support services to end users
4. Lead other admins and platform engineers through design and implementation decisions to achieve balance between strategic design and tactical needs
5. Research and deploy new tools and frameworks to build a sustainable big data platform
6. Assist with creating programs for training and onboarding for new end users
7. Lead Agile/Kanban workflows and team process work
8. Troubleshoot issues to resolve problems
9. Provide status updates to Operations product owner and stakeholders
10. Track all details in the issue tracking system (JIRA)
11. Provide issue review and triage problems for new service/support requests
12. Use DevOps automation tools, including Jenkins build jobs
13. Fulfil any ad-hoc data or report request queries from different functional groups