Web Crawling and Data Indexing Engineer at Mistral AI creating web data extraction tools. Join a team committed to AI innovation and excellence in engineering.
Responsibilities
Develop and maintain web crawlers using Python libraries such as Beautiful Soup to extract data from target websites.
Use headless browsing techniques, such as driving Chrome through the Chrome DevTools Protocol, to automate and optimize data collection processes.
Collaborate with cross-functional teams to identify, scrape, and integrate data from APIs to support business objectives.
Create and implement efficient parsing patterns using regular expressions, XPaths, and CSS selectors to ensure accurate data extraction.
Design and manage distributed job queues using technologies such as Redis, Kubernetes, and Postgres to handle large-scale data processing tasks.
Develop strategies to monitor and ensure data quality, accuracy, and integrity throughout the crawling and indexing process.
Continuously improve and optimize existing web crawling infrastructure to maximize efficiency and adapt to new challenges.
Requirements
Proficiency in Python, Java, or C++.
Strong understanding of HTTP/HTTPS protocols and web communication.
Knowledge of HTML, CSS, and JavaScript for parsing and navigating web content.
Mastery of queues, stacks, hash maps, and other data structures for efficient data handling.
Ability to design and optimize algorithms for large-scale web crawling.
Hands-on experience with web scraping libraries/frameworks (e.g., Scrapy, Beautiful Soup, Selenium, Playwright).
Understanding of how search engines work and best practices for web crawling optimization.
Experience with SQL and/or NoSQL databases (e.g., PostgreSQL, MongoDB) for storing and managing crawled data.
Familiarity with data warehousing and scalable storage solutions.
Knowledge of distributed systems (e.g., Hadoop, Spark) for processing large datasets.
Proficiency in Pandas, NumPy, and Matplotlib for analyzing and visualizing scraped data.
Experience applying machine learning to improve crawling efficiency or accuracy.
Familiarity with cloud platforms (AWS, GCP) and containerization (Docker) for deployment.