Responsibilities
Architect and implement robust, scalable systems to handle data ingestion while maintaining high performance and quality
Build and optimize the academic research paper pipeline: efficiently deduplicate hundreds of millions of research papers and calculate embeddings
Make Elicit the most complete and up-to-date database of scholarly sources
Expand the datasets Elicit works over (court documents, SEC filings, spreadsheets, presentations, audio, video, etc.) and ingest less-structured documents
Define and build secure, reliable, fast, and auditable private data connectors for customers
Preprocess and prepare data to make it useful to models; work with ML engineers and evaluation experts to find, gather, version, and apply datasets for training
Lead data pipeline optimization and enhancement projects and contribute to CI/CD, monitoring, and documentation
Collaborate with cross-functional teams and spend regular in-person time with teammates (approx. 1 week every 6)
Requirements
5+ years of experience as a data engineer: owning make-or-break decisions about how to ingest, manage, and use data
Strong proficiency in Python (5+ years experience)
You have created and owned a data platform at rapidly growing startups: gathering needs from colleagues, planning an architecture, deploying the infrastructure, and implementing the tooling
Experience architecting and optimizing large data pipelines, ideally with Spark
Strong SQL skills, including understanding of aggregation functions, window functions, UDFs, self-joins, partitioning, and clustering approaches
Experience with columnar data storage formats like Parquet
Strong opinions, weakly held, about approaches to data quality management
Creative and user-centric problem-solving
You should be excited to play a key role in shipping new features to users—not just building out a data platform!
Nice to have: experience developing deduplication processes for large datasets
Nice to have: hands-on experience with full-text extraction and processing from various document formats (PDF, HTML, XML, etc.)
Nice to have: familiarity with machine learning concepts and their application in search technologies
Nice to have: experience with distributed computing frameworks beyond Spark (e.g., Dask, Ray)
Nice to have: experience in science or academia, including familiarity with academic publications
Nice to have: hands-on experience with Airflow, DBT, or Hadoop
Nice to have: experience with data lake, data warehouse, or lakehouse paradigms
Benefits
Flexible work environment: work from our office in Oakland or remotely with time zone overlap (between GMT and GMT-8), as long as you can travel for in-person retreats and coworking events
Fully covered health, dental, vision, and life insurance for you, generous coverage for the rest of your family
Flexible vacation policy, with a minimum recommendation of 20 days/year + company holidays
401(k) with a 6% employer match
A new Mac + $1,000 budget to set up your workstation or home office in your first year, then $500 every year thereafter
$1,000 quarterly AI Experimentation & Learning budget
A team administrative assistant who can help you with personal and work tasks
Above-market equity and employee-friendly equity terms (10-year exercise period)