Maziyar Panahi is a Senior Data Scientist and Spark NLP Lead at John Snow Labs with more than a decade of experience in public research. He is a senior Big Data engineer and Cloud architect with extensive experience in computer networks and software engineering, and has been developing software and planning networks for the last 15 years. Earlier in his career, he worked as a network engineer in senior roles after completing his Microsoft and Cisco training (MCSE, MCSA, and CCNA).
He has been designing and implementing large-scale databases and real-time Web services in public and private Clouds such as AWS, Azure, and OpenStack for the past decade. He is one of the early adopters and main maintainers of the Spark NLP library. He is currently employed by The French National Centre for Scientific Research (CNRS) as a Big Data engineer and System/Network Administrator working at the Institute of Complex Systems of Paris (ISCPIF).
November 17, 2020 04:00 PM PT
Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks. This talk introduces Spark NLP, an NLP library for Apache Spark. Spark NLP natively extends the Spark ML pipeline APIs, enabling zero-copy, distributed, combined NLP and ML pipelines that leverage all of Spark's built-in optimizations. Benchmarks and design best practices for building NLP, ML, and DL pipelines on Spark will be shared.
The library implements core NLP algorithms including lemmatization, part-of-speech tagging, dependency parsing, named entity recognition, spell checking, and sentiment detection. Spark NLP was also the first library to deliver production-grade, trainable, and scalable implementations of named entity recognition using BERT embeddings. The talk will cover how to use these capabilities, as well as support for "post-BERT" embeddings and multi-lingual, multi-domain natural language understanding challenges. Recent accuracy benchmarks against state-of-the-art results will be shared.
The talk will demonstrate how to use these algorithms to build commonly used pipelines, with PySpark notebooks that will be made publicly available after the talk.
Speakers: David Talby and Maziyar Panahi