Internship Description:
Sayari is looking for an intern to join its Data Engineering team! Sayari's flagship product, Sayari Graph, provides instant access to structured business information drawn from billions of corporate, legal, and trade records. As a member of Sayari's data team, you will work with our Product and Software Engineering teams to collect data from around the globe, maintain existing ETL pipelines, and develop new pipelines that power Sayari Graph.
Our application tier is built primarily in TypeScript, running in Kubernetes, and backed by Postgres, Cassandra, Elasticsearch, and Memgraph. Our data ingest tier runs on Spark, processing terabytes of data collected from hundreds of data sources. The platform allows users to explore a large knowledge graph sourced from hundreds of millions of structured and unstructured records from over 200 countries and 30 languages. As part of this team, you'll have the chance to contribute to our growing library of open-source work, including our WebGL-powered network visualization library Trellis.
This is a paid remote internship with an expected workload of 20-30 hours per week.
Job Responsibilities:
- Write and deploy crawling scripts to collect source data from the web
- Write and run data transformers in Scala Spark to standardize bulk data sets
- Write and run modules in Python to parse entity references and relationships from source data
- Diagnose and fix bugs reported by internal and external users
- Analyze and report on internal datasets to answer questions and inform feature work
- Work collaboratively on and across a team of engineers using basic agile principles
- Give and receive feedback through code reviews
Required Skills & Experience:
- Experience with Python and/or a JVM language (e.g., Scala)
- Experience working collaboratively with git
Desired Skills & Experience:
- Experience with Apache Spark and Apache Airflow
- Experience working on a cloud platform like GCP, AWS, or Azure
- Understanding of or interest in knowledge graphs