Data Curation Intern

80k – 120k/yr | Bengaluru, IN | On-site | Internship | Posted 13d ago

About this role

About Karya

Why was Karya on the cover of Time magazine, highlighted by Satya Nadella, and invited to present its work to Sundar Pichai one on one? In part, because Karya is on a mission to provide AI-enabled earning and learning opportunities to communities with high talent but low access to opportunities. Karya achieves this while also delivering high-quality, timely, and price-competitive data to its clients.

Karya builds high-quality datasets for large companies like Google and Microsoft, while providing ethical work opportunities and fair wages to its workforce. Karya’s workers make nearly 20 times the Indian minimum wage. Through our one-of-a-kind digital work platform, we have delivered over 40 million digital tasks and have positively impacted over 100 thousand workers.

In the coming years, our goal is to rapidly scale our impact by bringing economic opportunities to millions of underserved users in India. With a rapidly growing global presence, we are also looking to expand our client base in the Indian market by partnering with leading Indian enterprises.

About the Role

We are looking for a detail-oriented and curious Data Curation Intern to help build high-quality datasets for training AI/ML models, with a specific focus on Indian language and multilingual data. You will work with large open-source datasets (e.g., Sangraha by AI4Bharat) that require significant cleaning, structuring, and enrichment before they can be used effectively in model training pipelines.

This is a hands-on, high-impact role at the intersection of data engineering, linguistics, and AI. You will start with text data pipelines and progressively move toward preparing data for read-speech and voice model training.

What You'll Do

Phase 1: Text Data Curation

- Audit and profile open-source datasets (Sangraha, Common Crawl, IndicCorp, etc.) to assess quality, coverage, and noise levels
- Design and implement data cleaning pipelines: deduplication, script normalisation, encoding fixes, noise removal, sentence boundary detection
- Create and apply metadata tagging schemas, labelling text by domain (news, legal, literature, health, etc.), subdomain, language, register, and quality tier
- Build validation checklists and quality scorecards to benchmark dataset readiness for model training
- Document data provenance, licensing, and processing steps for reproducibility

Phase 2: Speech & Voice Data Preparation

- Curate high-quality, phonetically diverse text passages suitable for read-speech recording
- Ensure text selection covers the domain, prosodic, and phonemic variety required for TTS/ASR model training
- Assist in defining metadata standards for audio datasets (speaker demographics, recording conditions, transcription format)
- Support the pipeline transition from text corpus to aligned speech dataset

What We're Looking For

Must Have

- Strong attention to detail: you notice inconsistencies others miss
- Comfort with Python for data processing (pandas, regex, basic NLP libraries like spaCy or NLTK)
- Familiarity with text data formats: CSV, JSONL, Parquet, plain text corpora
- Curiosity about AI/ML, language technology, or computational linguistics
- Ability to work independently, document work clearly, and communicate blockers early

Good to Have

- Prior exposure to NLP datasets or open-source language resources (IndicNLP, AI4Bharat, Hugging Face datasets)
- Knowledge of one or more Indian languages beyond English
- Experience with data versioning tools (DVC, Git-LFS) or dataset platforms (Hugging Face Hub)
- Basic understanding of how language models or speech models are trained

Why This Role

- Work directly on real data pipelines that feed AI model training, not toy projects
- Gain hands-on experience with large-scale multilingual and Indic language datasets
- Build skills that are in high demand across AI labs, speech companies, and NLP startups
- Clear progression path: text → read speech → voice data, with increasing responsibility
- Mentorship from people who have built data and AI systems at scale

Karya celebrates diversity and is an equal opportunity employer. All applicants will be considered without regard to race, religion, gender identity, sexual orientation, disability, or any other protected status.

Offices: Bengaluru, Karnataka, India (Bangalore Office)