Job Description
Internship Description:
Sayari is looking for a Data Engineer Intern specializing in web crawling to join its Data Engineering team! Sayari has developed a robust web crawling project that collects hundreds of millions of documents every year from a diverse set of sources around the world. These documents serve as source records for Sayari's flagship graph product, a global network of corporate and trade entities and relationships. As a member of Sayari's data team, your primary objective will be to maintain and improve Sayari's web crawling framework, with an emphasis on scalability and reliability. You will work with our Product and Software Engineering teams to ensure our crawling deployment meets product requirements and integrates efficiently with our ETL pipelines.
This is a remote, paid internship with an expected workload of 20-30 hours per week.
Job Responsibilities:
- Investigate and implement web crawlers for new sources
- Maintain and improve existing crawling infrastructure
- Improve metrics and reporting for web crawling
- Help improve and maintain ETL processes
- Contribute to the development and design of Sayari's data product
Required Skills & Experience:
- Experience with Python
- Experience managing web crawling at scale in any framework (Scrapy is a plus)
- Experience working with Kubernetes
- Experience working collaboratively with git
- Experience working with selectors such as XPath, CSS, and JMESPath
- Experience with browser developer tools (Chrome/Firefox)
Desired Skills & Experience:
- Experience with Apache projects such as Spark, Avro, NiFi, and Airflow
- Experience with datastores such as Postgres and/or RocksDB
- Experience working on a cloud platform such as GCP, AWS, or Azure
- Working knowledge of API frameworks, primarily REST
- Understanding of, or interest in, knowledge graphs
- Experience with *nix environments
- Experience with reverse engineering
- Proficiency in bypassing anti-crawling techniques
- Experience with JavaScript
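For candidates unfamiliar with the day-to-day work, the sketch below is a rough, hypothetical illustration of the kind of crawler this role involves: a minimal Scrapy spider that extracts records with CSS and XPath selectors and follows pagination. The spider name, URL, and selectors are placeholders for illustration only, not an actual Sayari source.

    import scrapy

    class ExampleRegistrySpider(scrapy.Spider):
        # Hypothetical spider; the name, URL, and selectors are placeholders.
        name = "example_registry"
        start_urls = ["https://example.com/companies?page=1"]

        def parse(self, response):
            # Emit one record per result row, extracted with CSS and XPath selectors.
            for row in response.css("table.results tr"):
                yield {
                    "name": row.xpath("./td[1]/text()").get(),
                    "registration_no": row.xpath("./td[2]/text()").get(),
                }
            # Follow the "next" link until pagination runs out.
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)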