Good with website content extraction
- Act as a Python programmer/developer
- Extract information from structured and unstructured websites
- Handle complex websites and extract data from any form of source
- Expertise in web scraping with Python (libraries: BeautifulSoup, Scrapy)
- Should have worked with REST APIs (libraries: requests, urllib, etc.)
- Spidering and pagination
- Quick learner who can adapt to our existing web crawling infrastructure
- Experience with MongoDB and OCR (pdf2text, tesseract, tabula) is an added advantage
Responsibilities: Build a real-time and batch analytics platform for analytics and machine learning. Design, propose and develop solutions with growing scale and business requirements in mind. As an integral part of the Data Engineering team, be involved in the entire development lifecycle, from conceptualisation through architecture and coding to unit testing. Help us design the data model for our data warehouse and other data engineering solutions.

Requirements: Deep understanding of real-time and batch-processing big data solutions (Spark, Storm, Kafka, KSQL, Flink, MapReduce, YARN, Hive, HDFS, Pig, etc.). Extensive experience developing applications that work with NoSQL stores (e.g., Elasticsearch, HBase, Cassandra, MongoDB). Strong understanding of data and a fair amount of data modelling experience. Proven programming experience in Java or Scala. Experience gathering and processing raw data at scale, including writing scripts, web scraping, calling APIs and writing SQL queries. Experience with cloud-based data stores such as Redshift and BigQuery is an advantage. Previous experience in a high-growth tech startup is an advantage.