Session

Building a Data Telescope to Study the Vast Expanse of Social Media

Overview

Experience	In Person
Track	Data Engineering & Streaming
Industry	Communications, Media & Entertainment, Public Sector
Technologies	AI/BI, Databricks SQL, Unity Catalog
Skill Level	Intermediate

Researchers studying extremism and misinformation need massive social media datasets but face infrastructure and budget constraints. Traditional solutions offer only dashboard access to pre-aggregated data, which limits flexibility. Princeton’s Accelerator instead gives researchers direct Databricks workspace access with pre-built Typical Activity Datasets, enabling custom queries and analysis. We built a multi-tenant, AI-powered data platform that serves R1 institutions.

Our medallion architecture ingests data via Lakeflow Spark Declarative Pipelines: Bronze captures raw data, Silver applies filtering and text enrichments, and researchers query the results through semantic search powered by MLOps. Using Databricks Asset Bundles, REST APIs, and Unity Catalog, we also built solutions for cross-workspace monitoring. Researchers who previously spent months building scrapers now start work in days, querying content in 100+ languages and discovering content patterns while learning valuable skills. This talk shares our architecture for democratizing data infrastructure with engineering rigor.
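
As a rough illustration of the Bronze-to-Silver step described above, the sketch below uses plain Python rather than the Lakeflow pipeline API, so it is not the speakers' implementation. The record fields (`id`, `lang`, `text`), the sample posts, and the two enrichments (`has_url`, `token_count`) are all hypothetical, chosen only to show filtering followed by text enrichment.

```python
import re

# Hypothetical raw posts standing in for Bronze-layer records.
BRONZE = [
    {"id": 1, "lang": "en", "text": "  Breaking: NEW claims spread fast!  "},
    {"id": 2, "lang": "es", "text": ""},
    {"id": 3, "lang": "en", "text": "Check https://example.org for details"},
]

URL_PATTERN = re.compile(r"https?://\S+")


def to_silver(records):
    """Drop empty posts, then attach simple text enrichments."""
    silver = []
    for rec in records:
        text = rec["text"].strip()
        if not text:
            # Filtering step: skip posts with no usable text.
            continue
        silver.append({
            **rec,
            "text": text,
            # Enrichment step: illustrative derived columns.
            "has_url": bool(URL_PATTERN.search(text)),
            "token_count": len(text.split()),
        })
    return silver


SILVER = to_silver(BRONZE)
```

In a real pipeline the same two stages would presumably be declared as tables over streaming data; the point here is only the separation between keeping raw records untouched and deriving a cleaned, enriched layer from them.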

Session Speakers

Kai Pak

Lilly Amirjavadi