Vini Jaiswal is a Senior Developer Advocate at Databricks, where she helps data practitioners succeed in building on Databricks and open source technologies like Apache Spark, Delta, and MLflow. She has extensive experience working with unicorns, digital natives, and some Fortune 500 companies, helping them implement data and AI use cases in production at scale, both on-premises and in the cloud. Vini also worked as the Data Science Engineering Lead under Citi’s Enterprise Operations & Technology group and interned as a Data Analyst at Southwest Airlines. She holds an MS in Information Technology and Management from the University of Texas at Dallas.
May 27, 2021 11:00 AM PT
This talk focuses on the importance of data access and on why granular, openly available data is crucial: it gives researchers and data teams the material they need to fuel their work.
We present the research conducted by the DS4C (Data Science for Covid-19) team, who made a large and highly detailed set of South Korean Covid-19 data available to the wider community. The DS4C dataset was one of the most impactful datasets on Kaggle, with over fifty thousand cumulative downloads and 300 unique contributors. What makes the DS4C dataset so potent is the sheer amount of data collected for each patient. The Korean government has been collecting and releasing patient information with unprecedented levels of detail. The data released includes infected people’s travel routes, the public transport they took, and the medical institutions that are treating them. This extremely fine-grained detail is what makes the DS4C dataset valuable: it makes it easier for researchers and data scientists to identify trends, find evidence to support their hypotheses, track down causes, and gain additional insights. We will cover the data challenges and the impact that publishing this data on a public forum had on the community, and conclude with an insightful visual representation.
May 26, 2021 11:30 AM PT
Data is the new oil, and to transform it into new products you need a high-performing oil refinery. Organizations are realizing the value of creating a data-driven culture to accelerate innovation, increase revenue, and improve their products.
While most organizations have standardized on Apache Spark for data processing, Delta Lake brings performance, transactionality, and reliability to their data. With the next-generation Lakehouse architecture, organizations can unlock significant value for their downstream machine learning, AI, and analytics applications. In this session, we will explain how you can leverage the Lakehouse platform to make data a part of each business function (leadership, Sales, Customer Success, Marketing, Product, and HR) so that teams can produce actionable insights, accelerate innovation further, and drive revenue for the business.
In a nutshell, we will cover:
November 18, 2020 04:00 PM PT
Games earn more money than movies and music combined. That means a lot of data is generated as well. One of the development considerations for an ML pipeline is that it must be easy to use, maintain, and integrate. However, it doesn't necessarily have to be developed from scratch. By using well-known libraries and frameworks and choosing efficient tools whenever possible, we can avoid "reinventing the wheel" while keeping the pipeline flexible and extensible.
Moreover, a fully automated ML pipeline must be reproducible at any point in time for any model, which allows for faster development and makes it easy to debug and test each step of the model. This session walks through how to develop a fully automated and scalable machine learning pipeline, using the example of an innovative gaming company whose games are played by millions of people every day. That means terabytes of data growth, which can be used to build great products and generate insights for improving them.
Wildlife leverages data to drive its product development lifecycle and deploys data science to drive core product decisions and features, which helps the company stay ahead of the market. We will also cover one of the use cases: improving user acquisition through improved LTV models and the use of Apache Spark. Spark's distributed computing enabled data scientists to run more models in parallel and to innovate faster by onboarding more machine learning use cases. For example, using Spark allowed the company to have around 30 models for different kinds of tasks in production.
Speakers: Vini Jaiswal and Arthur Gola
November 18, 2020 04:00 PM PT
Who knew time travel could be possible!
You can use the features of Delta Lake, but what is actually happening under the covers? We will walk you through the concepts of ACID transactions, the Delta time machine, the transaction protocol, and how Delta brings reliability to data lakes. Organizations can finally standardize on a clean, centralized, versioned big data repository in their own cloud storage for analytics.
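As a rough illustration of the idea behind time travel over a transaction log, the toy model below keeps an append-only list of commits, each adding or removing data files, and rebuilds the table state at any version by replaying the log up to that commit. The class `ToyTableLog`, its methods, and the file names are hypothetical and written for this sketch only; this is not Delta Lake's implementation or API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTableLog:
    """Toy append-only commit log (hypothetical; not the real Delta Lake code)."""

    commits: list = field(default_factory=list)

    def commit(self, add=(), remove=()):
        """Record one commit as a single log entry and return its version number."""
        self.commits.append({"add": tuple(add), "remove": tuple(remove)})
        return len(self.commits) - 1

    def snapshot(self, version=None):
        """Replay commits 0..version and return the set of live data files."""
        if version is None:
            version = len(self.commits) - 1  # default to the latest version
        if not 0 <= version < len(self.commits):
            raise ValueError(f"unknown version {version}")
        live = set()
        for entry in self.commits[: version + 1]:
            live -= set(entry["remove"])
            live |= set(entry["add"])
        return live


log = ToyTableLog()
v0 = log.commit(add=["part-000.parquet"])
v1 = log.commit(add=["part-001.parquet"])
# A rewrite replaces the first file; the old file is only marked as removed.
v2 = log.commit(add=["part-002.parquet"], remove=["part-000.parquet"])

latest = log.snapshot()             # state after all commits
as_of_v0 = log.snapshot(version=v0)  # "time travel" back to the first commit
```

Because earlier commits are never rewritten, reading an older version is just a shorter replay of the same log, which is why versioning and reproducible reads fall out of the same design.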
What can attendees learn from the session?
Speakers: Vini Jaiswal and Burak Yavuz
June 25, 2020 05:00 PM PT
This talk will focus on the journey of technical challenges, trade-offs, and ground-breaking achievements in building performant and scalable pipelines, drawing on our experience working with customers. The problems encountered are shared by many organizations, so the lessons learned and best practices are widely applicable.
Attendees will come out of the session with Best Practices and Strategies that can be applied to their Big Data architecture, such as:
Audience: Attendees should have some knowledge of setting up big data pipelines and of Apache Spark.