AI Engineer
Job Description
We are building AI-powered systems that enhance multilingual communication, improve interpreter workflows, and support next-generation AI applications across text, speech, and multimodal experiences.
Propio is hiring an AI Data Strategy Engineer / Applied Scientist, LLM Data, to own the data strategy, curation pipelines, annotation workflows, and evaluation datasets that power our multilingual AI systems. This is a hands‑on technical role for someone who can manage the full AI data lifecycle, from acquisition, curation, annotation, and quality control through evaluation datasets and post‑training data, with the goal of directly improving model performance. The ideal candidate can build scalable data pipelines, design high‑quality annotation and QA processes, identify model failure modes, and close performance gaps through targeted data acquisition, curation, and synthetic data generation.
Responsibilities
- Define the end‑to‑end data roadmap for multilingual and multimodal AI systems, including text, speech, translation, interpretation, low‑resource languages, and agentic AI workflows.
- Design and build dataset curation pipelines for training, post‑training, and evaluation, including cleaning, deduplication, filtering, PII redaction, quality scoring, sampling, balancing, and versioning.
- Create annotation schemas, labeling guidelines, QA rubrics, golden datasets, and reviewer workflows for multilingual, speech, translation, and agentic AI data.
- Build evaluation datasets and benchmarks, analyze model failure modes, and translate performance gaps into targeted data improvements.
- Support post‑training data workflows such as SFT, instruction tuning, preference data, RLHF/DPO‑style data, reward model data, and synthetic data generation.
- Use modern annotation tools and AWS‑based data infrastructure to scale secure, traceable, and compliant AI data workflows.
Qualifications
- Bachelor’s degree in Computer Science, Machine Learning, Data Science, Computational Linguistics, Linguistics, Statistics, or a related field, or equivalent practical experience.
- 4+ years of experience in AI data, ML data operations, NLP data engineering, applied ML, speech/translation data, or LLM data workflows.
- Strong hands‑on experience with Python, SQL, and dataset curation pipelines.
- Experience with annotation workflows, QA rubrics, evaluation datasets, or human‑in‑the‑loop data processes.
- Familiarity with multilingual NLP, speech data, translation data, low‑resource languages, conversational AI, or agentic AI datasets.
- Working knowledge of AWS data and ML tools such as S3, Glue, SageMaker, Bedrock, Lambda, Step Functions, EKS/ECS, IAM, or KMS.
- Strong communication skills and ability to work with ML engineers, applied scientists, product teams, linguists, data teams, and vendors.
Preferred Qualifications
- Master’s or PhD in Computer Science, Machine Learning, NLP, Computational Linguistics, Data Science, Statistics, or a related field.
- Experience with LLM post‑training workflows such as SFT, instruction tuning, preference data, RLHF, DPO, reward modeling, or evaluation data generation.
- Experience with synthetic data generation, active learning, weak supervision, LLM‑as‑judge workflows, or automated data quality scoring.
- Experience with modern annotation and data platforms such as Labelbox, Scale AI, Prodigy, Argilla, Snorkel, Humanloop, or custom internal tooling.