Data Science Institute looking for students for project using AI models to better understand new languages

Languages are constantly evolving and adapting to the world around them. Most approaches to learning new languages are rigid and don’t account for colloquialisms and outside influence, especially in the case of low-resource languages such as Hindi. The formal Hindi vocabulary taught by most textbooks is hardly going to help someone trying to discuss a complex topic such as climate change or open-heart surgery.

The Data Science Institute at Vanderbilt University is working with Dr. Elliott McCarter, a Senior Lecturer in Asian Studies at Vanderbilt, on a new research project this semester to bridge that language acquisition gap through deep learning. They’re looking for students to join the project and help harness the latest developments in data-driven research to build a better way to learn new languages.

The project is headed by Dr. McCarter and Data Science Institute data scientist Umang Chaudhry. Using transformer models, the project team hopes to develop a better vocabulary for discussing any topic, no matter how complex. Instead of the traditional language textbook, the project will pull from data sources like Hindi newspaper articles, movie and television scripts, and other less formal sources. The resulting model could even identify regional variations in the language. It would allow a surgeon with only a basic understanding of Hindi to gain the vocabulary to discuss a complex health procedure with a patient in Jaipur, or it could give a physicist the right vocabulary to discuss quantum mechanics in Kolkata at a level no textbook can teach.

“I hope that this project will develop a product that can serve the entire Hindi language pedagogy community and other researchers in Hindi linguistics,” Dr. McCarter said. “Additionally, the project will serve as a model for less commonly taught languages to provide data-supported instructional materials.”

Students who want to help unlock a better way to learn Hindi and other languages are invited to join this innovative research project. Prospective students should have an understanding of the Python programming language (CS 2204 or above preferred), transformers and deep learning, and Hugging Face Datasets or related tools, as well as a familiarity with Hindi or other low-resource languages. Click here to fill out the Google Document form. For questions about how the project will run, students can email Umang Chaudhry. For more general questions about the project itself, contact Dr. Elliott McCarter.