Hybrid Founding Data Engineer

Posted last month

About the role

  • Architect and implement robust, scalable systems to handle data ingestion while maintaining high performance and quality
  • Build and optimize the academic research paper pipeline: efficiently deduplicate hundreds of millions of research papers and compute their embeddings
  • Make Elicit the most complete and up-to-date database of scholarly sources
  • Expand the datasets Elicit works over (court documents, SEC filings, spreadsheets, presentations, audio, video, etc.) and ingest less-structured documents
  • Define and build secure, reliable, fast, and auditable private data connectors for customers
  • Preprocess and prepare data to make it useful to models; work with ML engineers and evaluation experts to find, gather, version, and apply datasets for training
  • Lead data pipeline optimization and enhancement projects and contribute to CI/CD, monitoring, and documentation
  • Collaborate with cross-functional teams and spend regular in-person time with teammates (approx. 1 week every 6)

Requirements

  • 5+ years of experience as a data engineer: owning make-or-break decisions about how to ingest, manage, and use data
  • Strong proficiency in Python (5+ years experience)
  • You have created and owned a data platform at rapidly growing startups—gathering needs from colleagues, planning an architecture, deploying the infrastructure, and implementing the tooling
  • Experience with architecting and optimizing large data pipelines, ideally with particular experience with Spark
  • Strong SQL skills, including understanding of aggregation functions, window functions, UDFs, self-joins, partitioning, and clustering approaches
  • Experience with columnar data storage formats like Parquet
  • Strong opinions, weakly held, about approaches to data quality management
  • Creative and user-centric problem-solving
  • You should be excited to play a key role in shipping new features to users—not just building out a data platform!
  • Nice to have: experience developing deduplication processes for large datasets
  • Nice to have: hands-on experience with full-text extraction and processing from various document formats (PDF, HTML, XML, etc.)
  • Nice to have: familiarity with machine learning concepts and their application in search technologies
  • Nice to have: experience with distributed computing frameworks beyond Spark (e.g., Dask, Ray)
  • Nice to have: experience in science and academia: familiarity with academic publications
  • Nice to have: hands-on experience with Airflow, DBT, or Hadoop
  • Nice to have: experience with data lake, data warehouse, or lakehouse paradigms

Benefits

  • Flexible work environment: work from our office in Oakland or remotely with time zone overlap (between GMT and GMT-8), as long as you can travel for in-person retreats and coworking events
  • Fully covered health, dental, vision, and life insurance for you, generous coverage for the rest of your family
  • Flexible vacation policy, with a minimum recommendation of 20 days/year + company holidays
  • 401(k) with a 6% employer match
  • A new Mac + $1,000 budget to set up your workstation or home office in your first year, then $500 every year thereafter
  • $1,000 quarterly AI Experimentation & Learning budget
  • A team administrative assistant who can help you with personal and work tasks
  • Above-market equity and employee-friendly equity terms (10-year exercise period)

Job title

Founding Data Engineer

Experience level

Mid level, Senior

Salary

$185,000 - $305,000 per year

Degree requirement

No Education Requirement
