If you’re subscribed to [email protected], or work in a large company, you have probably seen some common Spark error messages. If you have attended Spark Summit over the past few years, you have likely also seen talks like the “Top K Mistakes in Spark.” Cool tools that examine Spark’s logs already exist, but because they don’t use machine learning they are not as cool, and they are limited by the amount of effort humans can put into writing rules for them. This talk will look at what happens when we train “regular” clustering models on stack traces, and explore DL models for classifying user messages to the Spark list. Come for the reassurance that the robots are not yet able to fix themselves, and stay to learn how to work better with the help of our robot friends. The tl;dr of this talk is that Spark ML on Spark output, plus a little bit of TensorFlow, is fun for the whole family, but probably shouldn’t automatically respond to user list posts just yet.
Session hashtag: #SAISML10
Holden is a transgender Canadian open source developer with a focus on Apache Spark, Airflow, Kubeflow, and related "big data" tools. She is the co-author of Learning Spark, High Performance Spark, and Kubeflow for Machine Learning. She is a committer and PMC member on Apache Spark. She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal.
Gris Cuevas is an Open Source Program Manager at Google Cloud and an aspiring Data Scientist. She recently graduated with a Master's in Operations Research and Data Science from UC Berkeley. Gris has worked on developing online communities for the past 7 years and is now collaborating at Google on the design of an algorithm to predict author quality in online forums. Gris is interested in Natural Language Processing, Information Retrieval, and Open Source technologies. She loves The Beatles, juggling, and Mexican food, of course.