Building Indian Parliamentary Datasets

A Princeton University Studio Lab research initiative to create the largest unified dataset for linguistic analysis of Indian politics. Engineered a resumable Selenium crawler to index the Parliament Digital Library, handling dynamic pagination and session state; processed 40+ GB of election speech audio (Modi and Gandhi, 2014 and 2019 campaigns) through an AWS Transcribe and Translate pipeline; and digitized 40+ years (1981–2024) of Lok Sabha and Rajya Sabha debates. Designed a text extraction engine built on PyMuPDF and FuzzyWuzzy to structure thousands of raw PDF statements, mapping OCR output to standardized Ministry entities at scale.
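
The entity-mapping step described above can be illustrated with a minimal sketch. It is not the project's actual code: it uses Python's standard-library difflib as a stand-in for FuzzyWuzzy so that it runs without extra dependencies, and the ministry list, the `normalize` and `match_ministry` helpers, and the 0.8 cutoff are all assumptions made for illustration.

```python
import difflib

# Hypothetical canonical list; the project's real entity list is not given here.
CANONICAL_MINISTRIES = [
    "Ministry of Finance",
    "Ministry of Home Affairs",
    "Ministry of External Affairs",
    "Ministry of Defence",
]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so spacing noise does not affect matching."""
    return " ".join(text.lower().split())


def match_ministry(raw: str, cutoff: float = 0.8):
    """Return the closest canonical ministry name, or None if nothing clears the cutoff.

    difflib.get_close_matches stands in for a FuzzyWuzzy scorer; the cutoff of 0.8
    is an assumed value, not one taken from the project.
    """
    lookup = {normalize(name): name for name in CANONICAL_MINISTRIES}
    hits = difflib.get_close_matches(normalize(raw), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None
```

With this sketch, a header with a one-letter slip such as "MINISTRY OF FINANCF" resolves to "Ministry of Finance", while unrelated text such as a page footer returns None and can be set aside for manual review. Returning None rather than the nearest weak match keeps low-confidence mappings out of the structured dataset.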

Technologies Used