Building a Data Telescope to Study the Vast Expanse of Social Media
Overview
| Experience | In Person |
|---|---|
| Track | Data Engineering & Streaming |
| Industry | Communications, Media & Entertainment, Public Sector |
| Technologies | AI/BI, Databricks SQL, Unity Catalog |
| Skill Level | Intermediate |
Researchers studying extremism and misinformation need massive social media datasets but face infrastructure and budget constraints. Traditional solutions offer only dashboard access to pre-aggregated data, which limits flexibility. Princeton’s Accelerator instead gives researchers direct Databricks workspace access with pre-built Typical Activity Datasets, enabling custom queries and analysis. We built a multi-tenant, AI-powered data platform serving R1 institutions.

Our medallion architecture ingests data via Lakeflow Spark Declarative Pipelines: Bronze captures raw data, Silver applies filtering and text enrichments, and researchers query the data through semantic search powered by MLOps. Using Databricks Asset Bundles, REST APIs, and Unity Catalog, we also built solutions for cross-workspace monitoring. Researchers who once spent months building scrapers now start work in days, querying content in 100+ languages and discovering content patterns while learning valuable skills. This talk shares our architecture for democratizing data infrastructure with engineering rigor.
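To make the Bronze-to-Silver step concrete, here is a minimal sketch in plain Python rather than Lakeflow Spark Declarative Pipelines. It is not the Accelerator's pipeline: the record types, field names, and the specific filter and enrichment (dropping empty posts, normalizing whitespace, counting words) are all hypothetical stand-ins for the "filtering and text enrichments" the abstract mentions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class BronzePost:
    """Hypothetical raw record, stored as ingested."""
    post_id: str
    raw_text: Optional[str]


@dataclass(frozen=True)
class SilverPost:
    """Hypothetical cleaned record with one simple text enrichment."""
    post_id: str
    text: str
    word_count: int


def to_silver(bronze: Iterable[BronzePost]) -> List[SilverPost]:
    """Filter out empty posts and attach normalized text plus a word count."""
    silver = []
    for post in bronze:
        # Collapse runs of whitespace; a missing body is treated as empty.
        text = " ".join((post.raw_text or "").split())
        if not text:
            continue  # filtering step: drop posts with no usable text
        silver.append(SilverPost(post.post_id, text, len(text.split())))
    return silver
```

In a real medallion pipeline each layer would be a managed table rather than an in-memory list, but the shape is the same: Bronze is kept as received, and Silver is derived from it by a deterministic transformation.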
Session Speakers
Kai Pak
Director of Engineering
Princeton University
Lilly Amirjavadi
Data Scientist
Princeton University