Dan Corbiani is a Data Scientist and Solutions Architect who designs, develops, and deploys analytic solutions for research programs. His primary focus is the intersection of large-scale geospatial processing and Spark, particularly for vector datasets such as critical infrastructure assets and entity paths. Dan has been working to implement distributed geospatial algorithms for pattern-of-life analysis and disaster response, and has implemented common geospatial algorithms such as DBSCAN and Getis-Ord Gi* within the Spark framework. He has a long history in software development, with a few tangents into materials science and systems engineering, which has allowed him to understand researchers' requirements as well as the implementation options available in the cloud.
May 26, 2021 12:05 PM PT
Geospatial pipelines in Apache Spark are difficult because of the diversity of datasets and the challenge of harmonizing them into a single dataframe. Over the past year we have reviewed pipeline tools that allow us to quickly combine steps to create new workflows or operate on new datasets, including Dagster, Apache Spark MLflow pipelines, Prefect, and our own custom solutions. The talk will go over the pros and cons of each of these solutions and show an actionable workflow implementation that any geospatial analyst can leverage. We will show how a pipeline can be used to run a traditional geospatial hotspot analysis, and we will demonstrate interactive mapping within the Databricks platform.
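The hotspot analysis referred to above is typically the Getis-Ord Gi* statistic mentioned in the speaker bio. The talk's distributed Spark implementation is its own; as background, here is a minimal plain-Python sketch of the Gi* z-score on a small grid with binary queen-contiguity weights. All names (`gi_star`, `queen_weights`) and the toy grid are illustrative choices, not code from the talk.

```python
import math

def gi_star(values, weights, i):
    """Getis-Ord Gi* z-score for cell i.

    values : observed value per cell
    weights: weights[i][j] = 1 if cell j is in cell i's
             neighborhood (Gi* includes i itself), else 0
    """
    n = len(values)
    x_bar = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - x_bar ** 2)
    w = weights[i]
    sw = sum(w)
    swx = sum(wj * xj for wj, xj in zip(w, values))
    sw2 = sum(wj * wj for wj in w)
    denom = s * math.sqrt((n * sw2 - sw ** 2) / (n - 1))
    return (swx - x_bar * sw) / denom

def queen_weights(rows, cols):
    """Binary contiguity: 1 for the 8 surrounding cells and self."""
    def idx(r, c):
        return r * cols + c
    w = [[0] * (rows * cols) for _ in range(rows * cols)]
    for r in range(rows):
        for c in range(cols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        w[idx(r, c)][idx(rr, cc)] = 1
    return w

# 5x5 grid: background value 1, with a 3x3 "hot" cluster of 10s
values = [10 if 1 <= r <= 3 and 1 <= c <= 3 else 1
          for r in range(5) for c in range(5)]
W = queen_weights(5, 5)
z_center = gi_star(values, W, 2 * 5 + 2)  # middle of the cluster: strongly positive
z_corner = gi_star(values, W, 0)          # far corner: negative (cold)
```

A large positive z-score flags a statistically significant hotspot; in a Spark setting the same arithmetic would be distributed, with the neighborhood join being the expensive step.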
June 24, 2020 05:00 PM PT
Geospatial data appears to be simple right up until the moment it becomes intractable. There are many gotcha moments with geospatial data in Spark, and we will break those down in our talk. Users who are new to geospatial analysis in Spark will find this portion useful, as projections, geometry types, indices, and geometry storage can all cause issues. We will begin by discussing the basics of geospatial data and why it can be so challenging. This discussion will be brief and framed around how geospatial data can cause scaling problems in Spark. Critically, we will show how we have approached these issues to limit errors and reduce cost.

There are many geospatial packages available for Spark. We have tried many of them and will discuss the pros and cons of each, using common examples across libraries. New users will benefit from this discussion, as each library has advantages in specific scenarios.

Lastly, we will discuss how we migrate geospatial data. This will include our best practices for ingesting geospatial data as well as how we store it for long-term use. Users may be specifically interested in our evaluation of spatial indexing for rapid retrieval of records.
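The talk's indexing evaluation is its own, but the core idea behind spatial indexing for rapid retrieval can be sketched generically: encode each point's coordinates into a single sortable key so that nearby points land in nearby key ranges. Below is a minimal Z-order (Morton) encoder in plain Python, the same family of technique as geohash; the function name and bit width are arbitrary choices for this sketch, not any particular library's API.

```python
def morton_key(lon, lat, bits=16):
    """Interleave quantized lon/lat bits into one Z-order key.

    Nearby points tend to share key prefixes, so a table sorted or
    range-partitioned by this key can answer bounding-box queries by
    scanning a narrow key range instead of the whole dataset.
    """
    # Quantize each coordinate to an integer in [0, 2**bits - 1]
    x = int((lon + 180.0) / 360.0 * (2 ** bits - 1))
    y = int((lat + 90.0) / 180.0 * (2 ** bits - 1))
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)       # even bit positions: longitude
        key |= ((y >> b) & 1) << (2 * b + 1)   # odd bit positions:  latitude
    return key
```

Sorting records by such a key clusters spatially close points together on disk, which is what makes range pruning cheap at query time; production systems layer edge-case handling (Z-curve boundary jumps, precision choices) on top of this basic scheme.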