Spark NLP: State of the Art Natural Language Processing at Scale

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks. This talk introduces the NLP library for Apache Spark. Spark NLP natively extends the Spark ML pipeline APIs, enabling zero-copy, distributed, combined NLP and ML pipelines that leverage all of Spark’s built-in optimizations. Benchmarks and design best practices for building NLP, ML and DL pipelines on Spark will be shared.

The library implements core NLP algorithms including lemmatization, part of speech tagging, dependency parsing, named entity recognition, spell checking and sentiment detection. Spark NLP was also the first library to deliver production-grade, trainable, and scalable implementations of named entity recognition using BERT embeddings. Using these capabilities will be covered in this talk, as well as support for “post-BERT” embeddings, multi-lingual, and multi-domain natural language understanding challenges. Recent accuracy benchmarks against state-of-the-art results will be shared.

The talk will demonstrate using these algorithms to build commonly used pipelines, using PySpark on notebooks that will be made publicly available after the talk.

Speakers: David Talby and Maziyar Panahi


– Hello everyone. My name is David and I’m here with my friend, Maziyar Panahi. Our goal today is to tell you a bit about Spark NLP, an open-source library that will hopefully help you in your next natural language processing projects. I’m going to introduce the library: its background, its goals, its design and its functionality. Then I’ll hand off to Maziyar, who will talk in more detail about its accuracy, performance and speed, and give some examples of actual code that show how you can use the library in your own projects. Spark NLP is an Apache 2.0 licensed software library for Python, Java and Scala, and there are also some R bindings. This means, first of all, that everything we show you today is completely free and open source for both personal and commercial use. The library has been around for three years and we’ve had, I believe, almost 80 releases. So it’s heavily used in production, it’s stable, and you’ll see some of the functionality it comes with at this point. Spark NLP is based on Apache Spark, and actually every Spark NLP pipeline is a Spark ML pipeline as well. You can run it locally, you don’t have to run it on a cluster, but if you do choose to scale it, it’s the only open source library that natively scales to a cluster, and you can run it on any Apache Spark cluster: on premise, on Databricks, on the cloud or in any other environment. You can see on the slide some of the environments that we support in an official manner, meaning we test in every release that it does work. We also have specialized builds, both for the latest NVIDIA chips and for the latest Intel chips, to make sure that from a deep learning perspective we make the most of modern compute platforms. Spark NLP is the most widely used NLP library by practitioners in the enterprise. It has been so since February 2019, when the O’Reilly AI Adoption in the Enterprise survey came out. 
At least in that data, it was by far the most widely used NLP library in their survey, and two other surveys in 2019 and 2020 confirmed this with different bases of people. What you’re seeing here is the most recent NLP industry survey, which came out just last month, in September of 2020. It specifically asked people which NLP library they use, and it was targeted specifically at practitioners. So Spark NLP is a natural language processing library, and in terms of what it does, it’s very complete. One of the things we wanted from the get-go was to make sure that you can build your entire NLP pipeline with one library. First of all, because of the simplicity and elegance of the programming, you know, minimization of code. But another piece of that was the ability to scale the pipeline and natively scale to a cluster. This just doesn’t work if you’re doing half of your work in, say, spaCy or NLTK and then another part of the work in another library. We wanted to make sure that for your entire NLP pipeline we avoid unnecessary copying of data, we avoid moving data between libraries or even operating system processes, and all the optimizations can be done together. So Spark NLP functionality starts with really basic things like how you split a document into sentences and how you tokenize or split into words. Then come grammatical features like stemming, lemmatization, part-of-speech tagging and dependency parsing, and then the more advanced or deep learning based algorithms like spell checking, named entity recognition, sentiment analysis, emotion analysis or, more generally, document classification algorithms. Spark NLP also comes with the latest embeddings built in: word embeddings, chunk embeddings and sentence embeddings. 
So you do not need a separate transformer library. Spark NLP handles loading, caching and distributing to the cluster any resources you need, including, you know, BERT, BioBERT and other BERT-like embeddings if that’s what your pipeline needs. All of that is done for you behind the scenes. Spark NLP is somewhat different from other natural language processing libraries in the sense that it was geared from day one not just to production-grade use cases but to large, enterprise-grade production use cases. So it’s built for scale, it’s built for compliance and, of course, it’s built for real production-grade scenarios. As a result, as you can see, some of the largest companies in the world use Spark NLP regularly and in production. Spark NLP is also heavily used in the healthcare and life science space and in other high-compliance industries like finance and insurance. We often see cases where teams that have in the past used, you know, spaCy or Hugging Face for some prototyping or to get things working then move to Spark NLP when they want to take things into production or to a bigger scale. So Spark NLP has three basic design goals, and that is to give the open source community three things: state-of-the-art accuracy, state-of-the-art performance or speed, and state-of-the-art scalability. By state of the art we don’t mean the marketing term, we mean the actual academic term. So state-of-the-art accuracy means the best accuracy that’s been achieved on academically accepted benchmarks in peer-reviewed papers. The beautiful thing about the NLP space is that it’s been progressing really fast, and it’s been amazing in the past few years how high the achievable accuracy in common use cases has become. Our goal is to provide you that functionality and that level of accuracy out of the box, in a production-grade, trainable and scalable manner. 
Other than accuracy, which is really the first and most important thing we care about, the next thing to think about is speed. Nowadays speed doesn’t just mean the optimization of the code itself, the pipeline and memory use. It also means: can you make the most of, you know, GPUs or the latest Intel chips? It means: can you make use of the latest lighter BERT embeddings and transformers, which take far less memory, operate much faster and give you almost the same level of accuracy? And we also care about scalability, because if you want to move beyond one machine and work on 10, 50 or 500 machines, first of all it needs to work, and second, it needs to actually give you the benefits of scale. So those were the three top design goals when we started three years ago, and as you’ve seen, it’s taken hold in the community, which we are very appreciative of. We understand the responsibility. We keep releasing new software every two weeks, as we have done for the last three years and as we plan to do going forward. And now I’ll hand over to Maziyar, who will walk us through some of the details for each of those three points and then show some code examples of the library in action.

– Thank you David. Hi, my name is Maziyar Panahi and I am the Spark NLP lead at John Snow Labs. State of the art means the best peer-reviewed academic results, for instance the highest F1 score on CoNLL-2003 for the named entity recognition task. We have implemented a bi-directional LSTM with character-level CNN plus CRF, backed by word embeddings for transfer learning. Based on the scores provided by other open source libraries, the best F1 score on CoNLL-2003 NER for a system in production was achieved by using Spark NLP. For this specific benchmark we used BERT Large, case sensitive obviously, to train our NER model, which achieved 93.3 F1 on the test data set and 95.9 on the dev data set. However, we also did our own comparisons. We took a well-known open source NLP library and tried to compare our own NER accuracy with theirs. The point of this comparison was that everything should be working right out of the box, with no extra steps to make something work. All parameters are default, and as you can see, spaCy could reach almost 92% on the dev data set in the NER task, while Spark NLP has very good accuracy from the smaller GloVe embeddings all the way to the BERT Large model. The reason we chose to have everything right out of the box is that most open source libraries don’t come with many word embeddings for transfer learning. For instance, spaCy only has three pipelines in the spaCy 2 release that can provide vectors, and four models for new and advanced state-of-the-art transformers. On the other hand, Spark NLP comes with 50 word embeddings models right out of the box, including some of the most recent and state-of-the-art language models such as BERT, ALBERT, ELECTRA, XLNet and ELMo, and other fine-tuned and domain-specific transformers such as BioBERT, ClinicalBERT and CovidBERT. You can choose any of these word embeddings models to train your own named entity recognition. 
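The F1 scores quoted for CoNLL-2003 are conventionally computed over whole entities rather than individual tokens. As a rough illustration, the sketch below computes entity-level F1 from BIO tags in plain Python. It is not Spark NLP's evaluation code: the function names `bio_to_spans` and `entity_f1` and the sample tags are assumptions made for this example, and stray `I-` tags without a preceding `B-` are simply ignored.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type); a hypothetical representation
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO-tagged sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    # The extra "O" acts as a sentinel that closes a span ending at the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues_span = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues_span:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Entity-level F1: a prediction counts only if boundaries and type both match."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O", "O", "B-LOC"]
print(entity_f1(gold, pred))  # 0.5
```

In this sample one of the two gold entities is predicted with the wrong boundary, so precision and recall are both 0.5 even though three of the four tags match.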
Spark NLP is one of the very few open source libraries that come with built-in multi-class and multi-label text classification. You can use multi-class text classification for problems such as detecting emotions, cyberbullying, fake news, spam and so on. On the other hand, multi-label text classification is used to detect toxic comments or movie genres, or for any other real-world problem where each document needs more than one label or class. Our classifiers in Spark NLP can use over 90 pre-trained word and sentence embeddings right out of the box, including state-of-the-art sentence embeddings such as BERT or Universal Sentence Encoder, which both scored very high in STS benchmarks. We have designed and implemented these models in TensorFlow by using the latest approaches, such as Bi-GRU followed by convolutional neural nets, to achieve high accuracy in multi-class and multi-label document classification. We also managed to accept over 90 word and sentence embeddings, with dimensions from 100 all the way to 1024 and with up to 100 classes, without compromising the accuracy of our classifiers. Spark NLP is one of the very few open source NLP libraries that not only offer language detection or identification but also achieve such high accuracy. The models are designed and trained by using TensorFlow Keras. There is no need for any pre-processing or any word embeddings. This makes the final model as small as three to five megabytes, depending on how many languages it can detect. The models were trained on over 8 million Wikipedia pages. They reached an accuracy of 97% for 32 languages and 99% for 22 languages. Almost none of the top 10 most used NLP libraries comes with a spell checking feature. Spark NLP has one of the best spell checking models among the NLP libraries. It was initially created to handle bad inputs from OCR and it can leverage the context of the sentence and document. 
It preserves and corrects custom patterns, and it gives you the flexibility to incorporate your very own custom patterns as well. Context spell checking in Spark NLP has already beaten almost every publicly available benchmark. I would like to talk about performance now, the second mission of Spark NLP. We are constantly monitoring and profiling Spark NLP tasks and functionality to improve not only the accuracy but also the speed of each specific task during prediction or inference. In early 2020 we improved the speed of training in named entity recognition by over 60% and improved the accuracy by 4% in the final F1 when we moved to TensorFlow 1.15 to support Windows the same as macOS and Linux operating systems. BERT embeddings are a member of the transformers family, a new kind of language modeling that generates contextualized embeddings by considering the context of the text, in the simplest definition obviously. Transformers are slow by nature. They are slow because of the many encoders and layers involved in extracting features for predictions. They require a very good, high-end GPU to accelerate them, and the length of the text being fed into their layers highly impacts the final performance. That being said, in our latest release we managed to improve memory use by 30% and performance by more than 70% by implementing dynamic shapes in our BERT embeddings. If that wasn’t enough, we also added 24 new and smaller models, such as Tiny BERT, Mini BERT, Small BERT and Medium BERT. BERT-Base has 12 encoders, or 12 layers. Compared to the base model, BERT-Tiny has only two layers and only 128 dimensions. BERT-Tiny is 24 times smaller and 28 times faster than BERT-Base. We care about the size in megabytes because the smaller the model is, the faster it can be downloaded and the sooner and faster it can be loaded into TensorFlow. 
So when it comes to BERT-Tiny, the download and the loading into TensorFlow are almost seamless. We have done intensive benchmarks on different hardware with different accelerators. Using a GPU in Spark NLP is as easy as setting one parameter. By using a GPU, any task that uses TensorFlow as a backend can perform between 40 and 50% faster in both training and prediction. However, by using only the newest Intel processors, Cascade Lake on AWS, the performance is close to GPU, and if we compile our own custom TensorFlow optimized for Intel, which is actually called MKL, the final result in training, for instance here for French NER, was 19% faster than GPU, which is already 40% faster than a generic Intel Xeon, and up to 46% cheaper. No matter how fast something is, you either want it to be faster, or you have so much data that either it doesn’t fit on a single machine anymore or it will take much longer with that amount of data. Spark NLP needs zero code changes to scale its pipeline on a Spark cluster. It is the only natively distributed open source NLP library, and it takes advantage of Apache Spark execution planning, caching, serialization and data shuffling. The catch is that scaling is not linear. It depends on your task, and 10 more machines won’t necessarily mean 10 times faster. The Spark configurations do matter and tuning your cluster is highly advised, meaning that knowing your cluster and knowing your data set will help you take full advantage of distribution and parallelism. So I have done a few benchmarks. The first one is for a pipeline we have called Recognize Entities DL. I’ve done this benchmark over the Amazon full review data set. It has 50 million sentences and overall 255 million tokens. I used a single-node machine with 32 gigabytes of memory and 32 cores, and on the other hand I used 10 workers with 32 gigabytes of memory and 16 cores on Databricks on AWS for this comparison. 
Now, sometimes you see documents per second for a benchmark, sometimes you see sentences per second. To me, how many words you can process per second is the actual benchmark, because sentences could be short or long, and that all affects the final benchmark. Here you can see that on a single machine you can do tokenization at up to 340,000 words per second. I rounded all these numbers way down to be fair. If you go to 10 nodes, you can tokenize 7 million words per second. At the same time, for something much more complicated involving deep learning, you can do almost 38,000 words per second for extracting entities from those sentences. But if you move to a cluster in a 10-node setup you can increase that to 136. As I told you before, it’s not linear scaling. On tokenization it wasn’t 10 times more, it was way more than 10 times. On the NER it wasn’t 10 times more, it was just almost five times. Perhaps with a little bit of tuning we could have done better, but this is just right out of the box, and we wanted to know what would happen. Another benchmark I’ve done for distribution is for BERT embeddings, and as I told you before, BERT models are very slow. So with the same data set we had before, I chose a max sequence length of 128 for BERT, and as you can see you could achieve 19,000 words per second on a single machine and you can increase that up to 76,000 words per second on a 10-node setup on Databricks on AWS. But that’s just BERT-Base. If you move to the BERT-Tiny model, which we discussed before, that 19,000 on a single machine becomes 170,000 words per second. And if you have a setup of 10 machines, that can be increased to almost half a million words per second, which is pretty impressive for such a complicated transformer as BERT. So now we have established that the library is state of the art: it’s very accurate, its performance is fast and at the same time it can scale easily. But how easy is it to use? 
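The point about non-linear scaling can be checked with simple arithmetic on the throughput figures quoted in the talk. The sketch below is plain Python, not a Spark NLP API: the helper names `speedup` and `efficiency` are assumptions for this example, and only the tokenization and BERT-Base figures are used because both of their values are stated in words per second.

```python
def speedup(single_node_wps: float, cluster_wps: float) -> float:
    """How many times faster the cluster is than the single node."""
    return cluster_wps / single_node_wps


def efficiency(single_node_wps: float, cluster_wps: float, nodes: int) -> float:
    """Speedup divided by node count; 1.0 would be perfectly linear scaling."""
    return speedup(single_node_wps, cluster_wps) / nodes


# (single node words/sec, 10-node cluster words/sec), as quoted in the talk
benchmarks = {
    "tokenization": (340_000, 7_000_000),
    "BERT-Base embeddings": (19_000, 76_000),
}

for name, (single, cluster) in benchmarks.items():
    print(
        f"{name}: {speedup(single, cluster):.1f}x speedup, "
        f"{efficiency(single, cluster, 10):.0%} per-node efficiency"
    )
```

With these figures tokenization comes out at roughly 20.6 times faster on 10 nodes, while BERT-Base comes out at 4 times faster. The single node and the workers have different core counts, so the per-node efficiency is only a rough guide.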
If you want to use Spark NLP you need to know three things. It supports Python, Scala and Java right out of the box. You don’t have to do anything, and the code and the APIs are exactly the same. In using Spark NLP you have pre-trained pipelines, you have pre-trained models and you have training your own models. Let’s look into each one of them and see how it works. Pre-trained pipelines are pretty easy. It’s just one line: you load the pipeline, you have your text and you just transform it. We already have over 90 pre-trained pipelines right out of the box. They support up to 13 languages, they’re simple, they’re easy, and they work both online and offline. The caveat is that they’re not flexible, meaning that if there is a task in the pipeline you don’t need, you have to pay for it in computation and time. And if there is a task that the pipeline doesn’t have but you need, again, it’s not flexible, so the pipelines come as they come. There is no guarantee they will fit your purpose, but they are perfect for starting out, checking everything and seeing what you need. On the other hand we have pre-trained models. We offer over 250 pre-trained models. They support up to 46 languages, they work both online and offline, and they allow flexible and customized pipelines. It means that you can easily construct your own pipeline and use just the tasks that you need. If you don’t need part of speech, you don’t use part of speech, whereas some of our pipelines may come with part of speech. And here you may need a different model, a different BERT, a different NER. The caveat here is that some of the models depend on each other. For instance, you have to use the NER model that was trained on a specific BERT model, so you can’t use a BERT that has 256 dimensions with an NER that was trained with 128. So that’s the caveat of pre-trained models. On the other hand, you could train your own models. 
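The dependency between an NER model and the embeddings it was trained on comes down to matching vector dimensions. The sketch below shows that check in plain Python. The class and function names are hypothetical and are not part of Spark NLP, which keeps its own model metadata; the 256 and 128 dimensions are the ones from the example in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingsInfo:
    """Hypothetical description of an embeddings model."""
    name: str
    dimension: int


@dataclass(frozen=True)
class NerModelInfo:
    """Hypothetical description of an NER model and the embedding size it expects."""
    name: str
    trained_on_dimension: int


def check_compatible(embeddings: EmbeddingsInfo, ner: NerModelInfo) -> None:
    """Raise ValueError when the embeddings do not match what the NER model was trained on."""
    if embeddings.dimension != ner.trained_on_dimension:
        raise ValueError(
            f"{ner.name} expects {ner.trained_on_dimension}-dimensional embeddings, "
            f"but {embeddings.name} produces {embeddings.dimension}"
        )


# Matching dimensions pass silently.
check_compatible(EmbeddingsInfo("bert_128", 128), NerModelInfo("ner_on_bert_128", 128))

# The mismatch from the talk: a 256-dimensional BERT with an NER trained on 128.
try:
    check_compatible(EmbeddingsInfo("bert_256", 256), NerModelInfo("ner_on_bert_128", 128))
except ValueError as error:
    print(error)
```

Running a check like this before assembling a custom pipeline surfaces the mismatch early instead of at prediction time.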
Now let’s see how hard it is. For instance, for a very simple task such as part of speech or POS tagging, you can see how easy it is to train your own part-of-speech tagger. We have classes that convert your data set into a format that is accepted and compatible, such as this POS class that you see there. It’s pretty much a simple thing: we just use the perceptron approach, which comes from the averaged perceptron algorithm. It’s fast and it’s language agnostic, meaning it supports all languages, and in pretty much five or six iterations you get a highly accurate part-of-speech tagger. So you can just go to websites such as Universal Dependencies, which supports over 100 languages, download a data set and train your own part-of-speech tagger. Now, how easy is it to train your own NER? NER is a very complicated task. As you can see here, again we have training classes that help you transform your data set into something that is compatible with the NER DL annotator we have here. That said, you just call the BERT embeddings that we have, or any word embeddings that you choose, then you set your hyperparameters and you start training. The accepted format here is CoNLL 2003. We have over 50 word embeddings models already available to you, and you can train on CPU or on GPU, it doesn’t matter which. And you get, out of the box, extended metrics and evaluation with a built-in validation split, also with metrics to help you get through this training more accurately. So I would like to show you an example that I did on Google Colab. I used BERT with two layers and 768 dimensions. It took over 16 minutes to train the whole thing and achieve 91% micro F1 on the dev data set, and 90% with conll_eval, which evaluates whole entities rather than each tag separately. 
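For readers unfamiliar with the CoNLL 2003 layout mentioned here: each line holds a token followed by its part-of-speech tag, chunk tag and entity tag, blank lines separate sentences, and `-DOCSTART-` lines separate documents. The sketch below parses a small made-up sample in that layout with plain Python. It is an illustration only: `read_conll` and the sample text are assumptions for this example, and in Spark NLP the library's own training helper classes do this conversion.

```python
from typing import List, Tuple

# A tiny made-up sample in the four-column CoNLL 2003 layout:
# token, part-of-speech tag, chunk tag, named entity tag.
SAMPLE = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""


def read_conll(text: str) -> List[List[Tuple[str, str]]]:
    """Return one list of (token, entity tag) pairs per sentence."""
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        token, _pos, _chunk, entity = line.split()
        current.append((token, entity))
    if current:  # the file may not end with a blank line
        sentences.append(current)
    return sentences


for sentence in read_conll(SAMPLE):
    print(sentence)
```

Each sentence comes back as a list of token and tag pairs, which is the shape a sequence tagger needs for training.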
We made improvements to our BERT embeddings, so we were able to use the full CoNLL 2003 data set, whereas before we were only able to use it partially. Right now the whole data set can fit into a small shared virtual machine such as Google Colab. We used the GPU, which is free on Google Colab, and it took only 16 minutes to achieve that. Another thing that you can do is train your own multi-class classifier. It supports up to 100 classes, accepts over 90 word and sentence embedding models, works with CPU and GPU, and also gives you extended evaluation and metrics. And I would like to say that the field of NLP is one of the most dynamic and rapidly growing fields in AI and computer science. We have had over 73 releases in the past three years, which means one release every two weeks, as David mentioned before, to keep adapting to the new trends in NLP and NLU. We have an active community on Slack and GitHub, which I would like to thank, all of you, if you are listening right now. Spark NLP has over 330 pre-trained models and pipelines supporting over 46 languages. The many features that are only available in Spark NLP and the wide range of out-of-the-box functionality make it a unified solution for all of your NLP needs. Thank you very much, and don’t forget to check the website and join our Slack channel. I hope you enjoyed the talk.

About David Talby

John Snow Labs

David Talby is a chief technology officer at John Snow Labs, helping healthcare & life science companies put AI to good use. David is the creator of Spark NLP – the world’s most widely used natural language processing library in the enterprise. He has extensive experience building and running web-scale software platforms and teams – in startups, for Microsoft’s Bing in the US and Europe, and to scale Amazon’s financial systems in Seattle and the UK. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

About Maziyar Panahi

John Snow Labs

Maziyar Panahi is a Senior Data Scientist and Spark NLP Lead at John Snow Labs with over a decade long experience in public research. He is a senior Big Data engineer and a Cloud architect with extensive experience in computer networks and software engineering. He has been developing software and planning networks for the last 15 years. In the past, he also worked as a network engineer in high-level places after he completed his Microsoft and Cisco training (MCSE, MCSA, and CCNA).

He has been designing and implementing large-scale databases and real-time Web services in public and private Clouds such as AWS, Azure, and OpenStack for the past decade. He is one of the early adopters and main maintainers of the Spark NLP library. He is currently employed by The French National Centre for Scientific Research (CNRS) as a Big Data engineer and System/Network Administrator working at the Institute of Complex Systems of Paris (ISCPIF).