Session

Building a Data Telescope to Study the Vast Expanse of Social Media

Overview

Experience: In Person
Track: Data Engineering & Streaming
Industry: Communications, Media & Entertainment, Public Sector
Technologies: AI/BI, Databricks SQL, Unity Catalog
Skill Level: Intermediate

Researchers studying extremism and misinformation need massive social media datasets but face infrastructure and budget constraints. Traditional solutions offer dashboard access to pre-aggregated data, limiting flexibility. Princeton’s Accelerator gives researchers direct Databricks workspace access with pre-built Typical Activity Datasets, enabling custom queries and analysis. We built a multi-tenant data and AI-powered platform serving R1 institutions.

Our medallion architecture ingests data via Lakeflow Spark Declarative Pipelines: Bronze captures raw data, Silver applies filtering/text enrichments, and researchers can query via semantic search powered by MLOps. Leveraging Databricks Asset Bundles, REST APIs, and Unity Catalog, we built solutions for cross-workspace monitoring. Users who spent months building scrapers start work in days, querying 100+ languages and discovering content patterns while learning valuable skills. This talk shares our architecture for democratizing data infrastructure with engineering rigor.
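
As a rough illustration of the Bronze-to-Silver step described above, here is a minimal sketch in plain Python. It is not the team's Lakeflow pipeline: the record fields (`id`, `text`, `lang`), the filtering rule (drop posts with no text), and the enrichment (a whitespace-normalized text plus a token count) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BronzeRecord:
    """Raw post as ingested (hypothetical fields)."""
    id: str
    text: str
    lang: str


@dataclass(frozen=True)
class SilverRecord:
    """Filtered and enriched post (hypothetical fields)."""
    id: str
    text: str
    lang: str
    token_count: int


def to_silver(bronze):
    """Drop records with no text and attach a simple text enrichment."""
    silver = []
    for rec in bronze:
        cleaned = " ".join(rec.text.split())  # normalize whitespace
        if not cleaned:
            continue  # filtering step: skip empty posts
        tokens = len(cleaned.split())  # enrichment step: naive token count
        silver.append(SilverRecord(rec.id, cleaned, rec.lang, tokens))
    return silver


# Example input standing in for a Bronze table.
bronze_rows = [
    BronzeRecord("1", "  hello   world ", "en"),
    BronzeRecord("2", "   ", "en"),
    BronzeRecord("3", "bonjour", "fr"),
]
silver_rows = to_silver(bronze_rows)
```

In a real deployment the same filter-then-enrich shape would be expressed as pipeline table definitions rather than an in-memory loop; the sketch only shows the layering idea.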

Session Speakers

Kai Pak

Director of Engineering
Princeton University

Lilly Amirjavadi

Data Scientist
Princeton University