Slater Victoroff is the Founder and CTO of Indico Data Solutions, an enterprise AI solution for unstructured content with an emphasis on text and NLP. He has been building machine learning solutions for startups, governments, and Fortune 100 companies for the past five years and is a frequent speaker at AI conferences.
April 23, 2019 05:00 PM PT
There is a growing feeling that privacy concerns dampen innovation in machine learning and AI applied to personal and/or sensitive data. After all, ML and AI are hungry for rich, detailed data, and sanitizing data to improve privacy typically involves redacting or fuzzing inputs, which multiple studies have shown can seriously degrade model quality and predictive power. While this is technically true for some privacy-safe modeling techniques, it's not true in general. The root cause of the problem is twofold. First, most data scientists have never learned how to produce great models with great privacy. Second, most companies lack the systems to make privacy-preserving machine learning & AI easy.
This talk will challenge the implicit assumption that more privacy means worse predictions. Using practical examples from production environments involving personal and sensitive data, the speakers will introduce a wide range of techniques, from simple hashing to advanced embeddings, for high-accuracy, privacy-safe model development. Key topics include pseudonymous ID generation, semantic scrubbing, structure-preserving data fuzzing, task-specific vs. task-independent sanitization, and ensuring downstream privacy in multi-party collaborations. In addition, we will dig into embeddings as a unique deep learning-based approach to privacy-preserving modeling over unstructured data. Special attention will be given to Spark-based production environments.
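To make the "simple hashing" end of that spectrum concrete, below is a minimal PySpark sketch of pseudonymous ID generation using a salted (keyed) hash. The column names, the sample data, and the way the salt is supplied are assumptions made for illustration, not the exact pipeline covered in the talk.

# Minimal sketch: pseudonymous ID generation with a salted hash in PySpark.
# The column name "user_id" and the salt handling are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pseudonymize-sketch").getOrCreate()

df = spark.createDataFrame(
    [("alice@example.com", 42.0), ("bob@example.com", 17.5)],
    ["user_id", "spend"],
)

# Keep the salt outside the dataset (e.g., in a secrets manager) so the hash
# cannot be reversed with a dictionary attack on low-entropy identifiers.
SALT = "replace-with-a-secret-from-a-vault"

# sha2 over salt + raw ID yields a stable pseudonymous key that still supports
# joins and per-user aggregation, while the raw identifier is dropped entirely.
pseudonymized = df.withColumn(
    "pseudo_id", F.sha2(F.concat(F.lit(SALT), F.col("user_id")), 256)
).drop("user_id")

pseudonymized.show(truncate=False)

Because the salt never ships with the data, the pseudonymous key preserves joinability and per-user statistics for modeling, whereas a plain, unsalted hash of something like an email address would remain vulnerable to dictionary reversal.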
March 31, 2023 08:36 AM PT
The General Data Protection Regulation (GDPR), which came into effect on May 25, 2018, establishes strict guidelines for managing personal and sensitive data, backed by stiff penalties. GDPR's requirements have forced some companies to shut down services and others to flee the EU market altogether. GDPR's goal of giving consumers control over their data, and thus increasing consumer trust in the digital ecosystem, is laudable.
However, there is a growing feeling that GDPR has dampened innovation in machine learning & AI applied to personal and/or sensitive data. After all, ML & AI are hungry for rich, detailed data, and sanitizing data to improve privacy typically involves redacting or fuzzing inputs, which multiple studies have shown can seriously degrade model quality and predictive power. While this is technically true for some privacy-safe modeling techniques, it's not true in general.
The root cause of the problem is twofold. First, most data scientists have never learned how to produce great models with great privacy. Second, most companies lack the systems to make privacy-safe machine learning & AI easy. This talk will challenge the implicit assumption that more privacy means worse predictions. Using practical examples from production environments involving personal and sensitive data, the speakers will introduce a wide range of techniques, from simple hashing to advanced embeddings, for high-accuracy, privacy-safe model development.
Key topics include pseudonymous ID generation, semantic scrubbing, structure-preserving data fuzzing, task-specific vs. task-independent sanitization, and ensuring downstream privacy in multi-party collaborations. Special attention will be given to Spark-based production environments.
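As a rough illustration of what structure-preserving data fuzzing can look like in a Spark-based environment, the sketch below adds bounded multiplicative noise to a numeric column and coarsens a date column to month granularity, so the schema and approximate distribution survive while exact values do not. The column names and the noise scale are assumptions made for the example, not recommended production settings.

# Illustrative sketch: structure-preserving fuzzing in PySpark. Numeric fields get
# bounded random noise; dates are coarsened to month granularity. Column names and
# the noise scale are assumptions for this example.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fuzzing-sketch").getOrCreate()

df = spark.createDataFrame(
    [("a1", 1250.0, "2019-03-14"), ("b2", 980.0, "2019-04-02")],
    ["pseudo_id", "salary", "visit_date"],
)

NOISE_SCALE = 0.05  # roughly +/-5% multiplicative noise

fuzzed = (
    df
    # scale each salary by a random factor in [1 - NOISE_SCALE, 1 + NOISE_SCALE)
    .withColumn(
        "salary",
        F.round(F.col("salary") * (1.0 + (F.rand(seed=7) - 0.5) * 2 * NOISE_SCALE), 2),
    )
    # truncate exact visit dates to the first day of the month
    .withColumn("visit_date", F.trunc(F.to_date("visit_date"), "month"))
)

fuzzed.show()

How much noise a downstream task can tolerate is exactly the task-specific vs. task-independent sanitization trade-off mentioned above: noise calibrated for one model may be too aggressive, or too weak, for another.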
Session hashtag: #SAISDD13