Building a Pipeline for State-of-the-Art Natural Language Processing Using Hugging Face Tools

The natural language processing (NLP) landscape has radically changed with the arrival of transformer networks in 2017. From BERT to XLNet, ALBERT and ELECTRA, huge neural networks now manage to obtain unprecedented scores on benchmarks for tasks like sequence classification, question answering and named entity recognition. The pipeline from text to prediction remains complex, but tools like huggingface/transformers and huggingface/tokenizers take most of the burden off of the user, offering a simple API. This talk will focus on the entire NLP pipeline, from text to tokens with huggingface/tokenizers and from tokens to predictions with huggingface/transformers.

Video Transcript

– Hi everyone. Today we’ll be talking about the pipeline for state-of-the-art NLP. My name is Anthony. I’m an engineer at Hugging Face and the main maintainer of tokenizers. With my colleague Lysandre, who is also an engineer and a maintainer of Hugging Face transformers, we’ll be talking about the pipeline in NLP and how you can use tools from Hugging Face to help you with that. Your feedback is important to us, so don’t forget to rate and review the sessions. Before we start, I’d like to quickly introduce Hugging Face for those of you who have never heard about us. It all started when we were trying to build your best friend, a conversational AI: an AI capable of listening to you and talking with you about any subject. If you’ve seen the movie “Her”, you know what I’m talking about. Let’s say it’s quite an ambitious goal, and we had a lot of fun while trying to achieve it. We started to open source tools we built along the way, and we’ve been very surprised by the way they’ve been received by the community.

Hugging Face

Soon enough, we were spending all our time on these tools, making them better day after day, and here we are today. Transformers is our most successful project and is the most popular open source library for NLP today. We have close to 30,000 stars on GitHub, more than 1,000 research papers mentioning it, and many companies using it in production every day. We are now pursuing a new goal, which is to help improve NLP and make it accessible to everybody, and we recently raised a new round of funding to pursue this goal.

We now have multiple open source projects that can help you at many different steps of the NLP pipeline, and we are going to show you two of them. On today’s menu: we’ll talk about transfer learning in NLP and how it applies to transformer networks, then we’ll dive into tokenization, followed by the Transformers models. Now let’s get started with my colleague Lysandre, who will start with transfer learning.

Subjects we’ll dive in today

– All right, thank you, Anthony. So we’ll take a look at transfer learning, especially transfer learning in NLP, and especially as applied to transformers. In a few words, NLP took a gigantic turn in 2018 with the arrival of the transformer architecture. It arrived in 2018 with the Vaswani transformer, and immediately after, the GPT and BERT transformers arrived and obtained state-of-the-art results on many different NLP tasks. First of all, before we dive into how transformers work and how transfer learning makes them so efficient, let’s take a quick pause to understand what exactly transfer learning is and how it differs from traditional machine learning. In traditional machine learning, you would have multiple tasks and multiple learning systems, each of which needs to learn on its specific task. So if you had three different tasks, you would need three different learning systems. There were not really multitask systems; you weren’t focusing your efforts on the different tasks at once. And this works.

Transfer learning

All right. However, some tasks can actually leverage knowledge shared across different tasks. For example, if you’re working with text, with two different tasks that come from text, the underlying principle is language. So if you have one task that understands language and another task, you can surely take the knowledge that was acquired and use it for a specific task, which is exactly what transfer learning tries to do. You have one or multiple source tasks on which you train your learning system. That’s not really your target task, but your target task can use the knowledge acquired during training on those source tasks to get better results with less data. How does that apply to NLP? With sequential transfer learning, which is very similar to what I explained. It’s done in two steps: the first step is pre-training, and the second step is adaptation, also called fine-tuning. Pre-training is a very computationally intensive step where you have a lot of data and require a lot of compute, and you’re basically trying to cram as much knowledge as possible into different systems. Those systems can range from word embeddings, like word2vec or GloVe, to very recent transformer networks, like GPT or BERT. Once you have this general-purpose model, you’re going to try to adapt it to different tasks, and to do so you require less data, since you already have a strong knowledge base.

Sequential transfer learning

So now how does this apply to Transformer Networks?

Transformer networks, in a few words, are very, very large neural networks. They went from a few million parameters to billions of parameters. The transformer network with the most parameters is GPT-3, which was released last week and has 175 billion trainable parameters. With such a large number of parameters, you have a very big capacity, so you can train those very big neural networks on very big datasets.

However, one flaw is that such a network requires a lot of compute to be trained, and that’s where transfer learning really kicks in. Starting from a base model, in order to obtain a general-purpose language model (a model that is specific to a domain, let’s say text or language, but completely task agnostic), you need thousands and thousands in compute, a very, very large corpus, and days or even weeks of training just to get to that stage. However, once you have the pre-trained language model, fine-tuning it on different tasks is actually very easy. You only need small datasets, because most of the knowledge is already in the model. You don’t need your base model to acquire an understanding of language from that small dataset alone; you’re relying on the previous training and just fine-tuning a bit for the specific task. So transfer learning is really useful in the case of very large models like transformer networks, and that’s where model sharing is especially important, because reproducing a pre-training is completely impractical. It costs a lot, and since the general-purpose model is completely task agnostic, it’s very important to share that pre-trained model so that other users can just use it and fine-tune it on their downstream tasks or their own datasets. This is something we’re very proud of at Hugging Face, where we allow the easy retrieval and distribution of models entirely for free, so that users can share compute. Right, so now let’s take a deeper look at the inner mechanisms of the transfer learning pipeline for natural language processing: pre-training and fine-tuning of transformer architectures. The transfer learning pipeline in NLP is composed of two big steps: the first part is the tokenization aspect, which Anthony will present in a bit, and the second part is the prediction aspect.

Model Sharing Reduced compute, cost, energy footprint

Transfer Learning pipeline in NLP

The tokenizer’s goal is to get some input, like a sequence, a sentence, or a dump of text, and convert it into inputs understandable by the model.

Once the sequences have been converted to understandable inputs, they can be fed to the model, which can then make a prediction on top of them. So how does pre-training a language model work? There are different ways of doing it, but the most frequent and the best way is language modeling: learning to predict text given other text. One huge advantage of this method is that it doesn’t require human annotation. Since pre-training requires a lot of data, not requiring human annotation is a very big deal, because you don’t want to have to annotate millions of examples when you can just leverage the self-supervised aspect of language modeling.


That’s also very interesting in the case of languages that only have a little data, low-resource languages, because in these languages it’s very hard to acquire datasets with enough annotated data to perform a pre-training. However, just obtaining a dump of text is easy enough, and it can be used to obtain a sufficiently good pre-trained model. So let’s take a quick look at how language modeling operates in NLP. We’ll look at two objectives. The first one is masked language modeling, or MLM, which is also known as the cloze task and is a very old language modeling method. What we’re trying to do here is mask some of the inputs and ask the model to predict what was in place of the mask; we’re asking it to fill the mask. In a traditional NLP pipeline, this would look like the following.

Language Modeling

So you have a sentence, “The pipeline for state-of-the-art natural language processing”, that you first tokenize, so you convert it into tokens, and then you mask some of those tokens. For example, here the token “natural” was masked: it was replaced by a mask token. You then ask the model to replace that mask token. Here we asked the model to replace it and give its five most probable answers, which are “natural”, “artificial”, “machine”, “processing”, and “speech”, which all make sense in that context. That’s an already pre-trained model that did this completion. At first, of course, the model wouldn’t make such correct predictions; it’s by masking a lot of text and training on it that you obtain such a language model. The second objective is causal language modeling, the CLM objective. Here, instead of masking some of the inputs, you want the model to predict the next token, the token following the sequence. This is interesting because it actually trains the model to do text generation: since it’s always trying to generate the next token, you can chain that, and in the end it will generate a sequence just from the context that you gave it at first. So this is interesting, but it only attends to the left-hand-side context, compared to the MLM objective that we saw previously, where, since you’re masking a token that can be in the middle of the sentence, you have access to both the right and the left context. Here, you only have access to the left context. So CLM usually doesn’t get results as good as MLM on downstream tasks. However, it allows for text generation, which is a very useful feature. All right, so now we’ll let Anthony present tokenization, which is one major aspect of natural language processing.
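
As a rough illustration of the two objectives described above, the following Python sketch builds training examples for masked and causal language modeling from a list of tokens. It is not how any particular library prepares its data; the `[MASK]` symbol, the function names, and the masking probability are assumptions made for this example.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol for this sketch


def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    """Masked LM: hide some tokens; the labels are the hidden originals."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)   # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)  # this position is not scored
    return inputs, labels


def make_clm_examples(tokens):
    """Causal LM: predict each token from the tokens to its left only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "the pipeline for state of the art natural language processing".split()
mlm_inputs, mlm_labels = make_mlm_example(tokens, mask_prob=0.3, seed=1)
clm_examples = make_clm_examples(tokens)
```

In the masked case the model sees context on both sides of each hidden token, while every causal example only exposes the left-hand context, which matches the trade-off described in the talk.
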
– Yeah, let’s dive a little bit more into tokenization. I want to start by quickly talking about the role of tokenization. As we saw, in NLP our inputs, the data that we generally process, are basically raw text, like in this example, “Jim Henson was a puppeteer”, but our models obviously only work with numbers. So we need to find a meaningful way to transform this raw text, these strings, into numbers. That’s what tokenizers do. There are a lot of different ways to do this, but our goal is generally to find the most meaningful representation, the one that makes the most sense for the model, and possibly the smallest one. So let’s see some examples of tokenization algorithms and the questions we get to ask ourselves with them.

The first kind of tokenization that comes to mind is simply based on words. It’s really easy to set up and use, generally with just a few rules, and it often gives good results. In this case, we want to separate the raw text into words and find a numerical representation for each of them. This usually requires splitting the text somehow, and we have to choose how to do it. Do we want to keep the punctuation with the words, or separate it into its own tokens, or even apply another rule? With this kind of tokenizer we generally end up with some fairly large vocabularies. A word like “dog” is a different word from “dogs”, and they end up with different representations; the same applies to “run” versus “running”, for example. All of these different words get assigned an ID, generally starting from zero and going up to the size of the vocabulary, that we use to identify each word. So the more words we have, the more IDs we need, and the bigger the vocabulary. This also means that we generally need some token to represent any word that is not in our vocabulary. This is what we call an out-of-vocabulary token, also known as an unknown token. Now, we don’t like this token, because it means we are not able to represent some of the words we might get as input, and we are losing information along the way. So that’s something we want to avoid as much as possible.

One way to reduce the risk of having out-of-vocabulary tokens is to go one level deeper: I’m talking about character-based tokenizers. In this case we split our text into characters. One of the advantages is that we end up with much smaller vocabularies, and we also have far fewer out-of-vocabulary tokens, as words can be built from these characters. Here we get to ask ourselves, too, whether we want to keep the spaces and punctuation or not. One thing about this is that, intuitively, this representation is lacking in terms of information, since each character doesn’t mean a lot separately. This is not true for some languages like Chinese, for example, where each character carries more information than in Latin languages, but in English, for example, a character doesn’t mean much by itself. Another thing to consider is that by using such techniques, we end up with large numbers of tokens to be processed by our models, and this can have an impact on the size of the context we can carry around, for example.
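
To make the contrast between word-based and character-based tokenization concrete, here is a small Python sketch. The splitting rule, the `[UNK]` symbol, and the tiny corpus are assumptions chosen for illustration only.

```python
import re

UNK = "[UNK]"  # hypothetical unknown-token symbol


def split_words(text):
    # One possible rule: lowercase, and keep punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(tokens):
    # IDs start at zero and grow with every new token we meet.
    vocab = {UNK: 0}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    # Anything missing from the vocabulary falls back to the unknown token.
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


corpus = "The dog runs. The dogs run in circles."
word_vocab = build_vocab(split_words(corpus))
char_vocab = build_vocab(list(corpus.lower()))

new_text = "The dog is running."
word_ids = encode(split_words(new_text), word_vocab)
char_ids = encode(list(new_text.lower()), char_vocab)
```

With this corpus, the word-level encoding maps “is” and “running” to the unknown ID, while the character-level encoding has no unknown tokens but produces a much longer sequence for the same sentence.
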

Now, we’ve seen some good results with this kind of tokenizer, so it really is interesting to consider in some cases. We can also think about another technique that builds on the word-based one we saw previously, with character-based tokenization as a fallback when a word is not in the vocabulary.

But actually, let’s see something that is even better than that: subword tokenization. You might have heard about BPE, or byte pair encoding, before. This algorithm was initially used for compression, as introduced in 1994 by Philip Gage, before it got applied to NLP in 2016 by Rico Sennrich and his co-authors in the paper “Neural Machine Translation of Rare Words with Subword Units”, and this algorithm brought some really interesting improvements.

Byte Pair Encoding

The idea with BPE is to start by building an alphabet composed of Unicode characters that will serve as the base vocabulary. Then we start building new tokens from the most frequent pairs we find in the original corpus. For example, in English, the letters T and H are often seen together, so they end up being merged into a new token, TH. Later on, we might see that TH is actually often seen next to the letter E, so we merge them into the token THE, and we keep building up more and more tokens, up to some target vocabulary size. Since we start from the characters and build up to words, when a word is not part of the vocabulary, we are generally able to use multiple tokens to represent it. In this example, you see that “tokenization” can be represented by “token” and “ization”.
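
The merge procedure described above can be sketched in a few lines of Python. This is a toy version for illustration, not the implementation used in the tokenizers library; the function names and the word counts are made up.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges from a {word: count} dictionary."""
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        corpus = {merge_pair(s, best): c for s, c in corpus.items()}
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe({"the": 5, "then": 3, "this": 2, "that": 2}, num_merges=2)
pieces = tokenize("there", merges)
```

On this toy corpus the first merge joins “t” and “h”, and the second joins “th” and “e”, so the unseen word “there” is split into “the”, “r”, and “e” rather than being mapped to an unknown token.
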

The same would apply to “standardization”, for example, which could be split into “standard” and “ization”. So these subwords end up carrying a lot of meaning while also being space efficient: we can have very good coverage with relatively small vocabularies and fewer unknown tokens. There are obviously other techniques out there, like byte-level BPE, which is very interesting. It was introduced with GPT-2 by OpenAI and uses bytes as the base alphabet instead of Unicode characters. This means that the initial vocabulary fits in 256 different symbols, which is the number of values a byte can have, instead of the more than 100,000 different Unicode characters. There is also WordPiece, which is a form of subword tokenization too, actually very similar to the way BPE works, used by Google in models like BERT, and also the more recent Unigram, generally implemented in SentencePiece, which brings some improvements over BPE by improving the way we merge tokens.

So now, knowing all of this, I can tell you why we decided to build the tokenizers library and how it works. The first reason is simply performance. Our models usually run with frameworks like PyTorch, TensorFlow, or even ONNX, and all of these provide great performance today. But tokenization was often happening in pure Python, and it ends up being really slow, sometimes the bottleneck in the NLP pipeline. Having to pre-process your entire dataset before actually training the model shouldn’t be a requirement, because this means that whenever you want to change something, you have to start over, and you can see how it can be really cumbersome. So we definitely wanted to improve this experience and provide real-time tokenization that you don’t even notice. It is also a great occasion to have all the various tokenization algorithms under one roof, with a shared API. This makes experimenting and switching tokenizers way easier. We want it to be easy to share your work and access the work of others. And finally, we also want to make it really easy to train a tokenizer, for example on a new language, or a new dataset, or whatever.

So now, let’s see how we actually do this. A tokenizer is actually a pipeline: the input text goes through this pipeline, and in the end we get something ready to be fed to the model.

The tokenization pipeline

The first step is normalization. That is where we transform our input: generally, that is where we’re going to treat whitespace, for example, lowercase everything, and maybe apply some Unicode normalization. Then we have the pre-tokenization.

In this step we take care of pre-segmenting the input. In most cases, this means simply splitting on whitespace, for languages that use whitespace, but it could be anything that makes sense for our specific use case. Once all of this is done, we are ready to apply the actual tokenization algorithm. This is where BPE, Unigram, word-level, or any other tokenization algorithm does its magic. And finally, the last step is post-processing.

Here we add the special tokens, like, for example, the CLS and SEP tokens in BERT, we take care of truncating the input so that it fits the model if necessary, et cetera.
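
The four stages described above can be sketched as four small Python functions. This is a toy illustration only: a word-level lookup stands in for a real tokenization model, and the vocabulary, IDs, and function names are assumptions made for this example.

```python
import unicodedata


def normalize(text):
    # Normalization: Unicode NFKC, lowercase, strip surrounding whitespace.
    return unicodedata.normalize("NFKC", text).lower().strip()


def pre_tokenize(text):
    # Pre-tokenization: split on whitespace.
    return text.split()


def apply_model(words, vocab, unk_id=0):
    # Model: a word-level lookup stands in for BPE, Unigram, or WordPiece.
    return [vocab.get(word, unk_id) for word in words]


def post_process(ids, cls_id, sep_id, max_length):
    # Post-processing: truncate, then wrap with the special tokens.
    ids = ids[: max_length - 2]
    return [cls_id] + ids + [sep_id]


vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4}


def encode(text, max_length=8):
    words = pre_tokenize(normalize(text))
    ids = apply_model(words, vocab)
    return post_process(ids, vocab["[CLS]"], vocab["[SEP]"], max_length)


ids = encode("  Hello   WORLD again ")
```

Keeping the stages separate is what makes it possible to swap one of them, for example the model, without touching the others.
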

So now let’s see some code and how to build a custom tokenizer. This first example shows how to build a byte-level BPE. As you can see, we start by building a Tokenizer based on an empty BPE model. We attach a normalizer; in this case, a classic Unicode NFKC normalization. We also attach our pre-tokenizer; here it is ByteLevel, which will take care of splitting the raw text into words, if possible, and transforming the input into the right representation, so that we can process the text at the byte level. This also means that we need a decoder. This one will take care of transforming the tokens back to readable Unicode characters when we want to decode from IDs back to text. And that’s it: our tokenizer is ready and we can now train it. For the training, we specify the target vocabulary size, as well as the special tokens we plan on using, we give it a bunch of files, and we just let it work. The training step actually trains the model, so when the training is done, it just means that our BPE model is not empty anymore: it now has a vocabulary.

I also want to show you another example of a tokenizer, in this case a WordPiece like the one used in BERT. Just like before, we initialize the tokenizer, this time with an empty WordPiece model. This time we use Sequence as the main normalizer, which is just a utility helper that we can use to combine multiple normalizers. In this case, we want to strip each string we receive as input and also lowercase everything. We use a simple whitespace pre-tokenizer that will split on whitespace and keep the punctuation as separate tokens, and we set up our decoder. In this case, we need it to decode the IDs while handling the double hash prefix, which is specific to WordPiece. And once again, we’re ready for training. There is a final step here, though, which is about setting up the post-processor, if you remember from before, when we described the pipeline.
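
The slide code is not part of the transcript, so the following is only a sketch of what the byte-level BPE setup described at the start of this passage might look like with a recent release of the tokenizers library. The API has changed since the talk; this version assumes `train_from_iterator` is available, trains on an in-memory list instead of files, and uses a placeholder vocabulary size and placeholder special tokens.

```python
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

# Start from an empty BPE model and attach each pipeline component.
tokenizer = Tokenizer(models.BPE())
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=300,                     # placeholder target size
    special_tokens=["<pad>", "<unk>"],  # placeholder special tokens
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    show_progress=False,
)

# The talk trains from files; a small in-memory corpus keeps this self-contained.
corpus = ["the pipeline for state of the art natural language processing"] * 10
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("the natural language pipeline")
text = tokenizer.decode(encoding.ids)
```

After training, the BPE model is no longer empty, and decoding the IDs with the byte-level decoder should give back the original text.
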


This is the part that actually adds the special tokens. We’re defining it here, after the training, because we need to initialize it with a token and a token ID for both the SEP and CLS tokens, and we can use the token_to_id method of the tokenizer only after it’s been trained. So now, after these examples, our tokenizers are ready to be used. You can also specify a maximum length if you want, by enabling truncation; this ensures that the inputs to the model will always be of the right size. You can also enable padding, for example, so that all the sequences we encode as batches have the same length. In this case, we specify the pad token and the pad ID that we want to use.
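
What truncation and padding do to a batch can be shown in plain Python. The sketch below is an illustration of the idea rather than the library’s implementation; the IDs, the maximum length, and the choice of 0 as the pad ID are made up.

```python
def truncate(ids, max_length):
    # Keep at most `max_length` IDs so the input fits the model.
    return ids[:max_length]


def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one and build attention masks."""
    longest = max(len(ids) for ids in batch)
    padded = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    masks = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return padded, masks


batch = [[5, 6, 7, 8, 9, 10], [5, 6], [7, 8, 9]]
batch = [truncate(ids, max_length=4) for ids in batch]
padded, masks = pad_batch(batch, pad_id=0)
```

The attention mask marks which positions hold real tokens (1) and which are padding (0), so the model can ignore the padded positions.
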

And we can save these tokenizers. Everything we defined before will get saved in this file: the normalizer, the pre-tokenizer, the padding options, everything. This means that whenever we want to use this tokenizer in the future, all we need is a single line of code to initialize it from the file, and it’s ready to use. If you want to try another tokenizer, your code doesn’t change; you just change the file from which you load the tokenizer.

Now, of course, each tokenizer can be used to encode some input text, and here we actually support multiple types of inputs. We can encode single sequences, but also pairs of sequences; for example, if you want to do question answering, you would encode the context and the question as a pair of sequences. You can also encode batches of sequences, both single sequences and pairs of sequences, and even mix these if that’s relevant. And we even support pre-tokenized inputs, for example, if you want to do some entity extraction and your dataset is already tokenized.

In all of these cases, when you encode, you get an Encoding back. This encoding contains everything you may need: the IDs, obviously, but also the generated tokens, if you want to check what they look like. This example shows you what some byte-level tokenization would look like. Notice the strange characters in front of the tokens; this character actually represents a whitespace at the byte level. We also provide the offsets for each token, so that you can extract text from your original input if needed. That’s really helpful in the case of question answering, for example, where you want to extract the answer from the original text and maybe highlight it. You can also find the special tokens mask, the attention mask, and a lot of other things that I’ll let you discover. So yeah, try it. It’s as easy as using pip install tokenizers, or npm install tokenizers for Node, and we’ll keep on adding bindings for new languages in the future. Now I’ll let Lysandre talk to you about transformers.
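
To show why offsets are useful, here is a plain-Python sketch that records character offsets for each token and uses them to pull an answer span out of the original text. The splitting rule and the “predicted” token indices are made up for illustration; a real tokenizer and model would supply them.

```python
import re


def tokenize_with_offsets(text):
    """Return (token, (start, end)) pairs; offsets index into the original text."""
    return [(m.group().lower(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


context = "Hugging Face is based in New York City."
encoded = tokenize_with_offsets(context)

# Suppose a model picked tokens 5 to 7 as the answer span (made-up indices).
start_token, end_token = 5, 7
start_char = encoded[start_token][1][0]
end_char = encoded[end_token][1][1]
answer = context[start_char:end_char]
```

Even though the tokens themselves were lowercased, the offsets point back into the untouched input, so the extracted answer keeps its original casing.
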

– All right, thank you, Anthony. So Transformers is another big part of the natural language processing pipeline, because it takes care of the prediction right after the tokenization.

So, since 2018 and the arrival of the first transformer architecture, there really has been an explosion in the number of transformers. Just to name a few, GPT-3 was released last week, and before that there was Meena from Google and several others over the past few months. Each of those transformer architectures is a bit different from the previous ones. For example, BERT uses WordPiece tokenization and was trained with masked language modeling and next sentence prediction, while GPT-2 uses byte-level BPE tokenization and was trained with causal language modeling. That’s the case for every single transformer model: even though they’re very similar, they have some inner quirks and some slightly different APIs. What we try to do in Transformers is make all of those transformer architectures very simply usable under the exact same API. While doing so, we try to make them accessible to as many users as possible. We first started by only supplying our models in PyTorch; they have all been available in both PyTorch and TensorFlow since last September, and since last month we have our first two models, BERT and RoBERTa, in the JAX framework from Google. All of those models train and run on CPU, GPU, and TPU, with framework-specific optimizations: XLA for TensorFlow, TorchScript for PyTorch, and others, like half precision. To understand what an inference script looks like using Transformers: it leverages two abstract classes, PreTrainedTokenizer, which is completely based on the Hugging Face tokenizers that Anthony just presented, and PreTrainedModel, which is an abstract class encompassing the models.

Both PreTrainedTokenizer and PreTrainedModel offer the exact same API, whichever model and tokenizer pair you choose to use. For example, if you wanted to write an inference script with BERT, switching would be as simple as changing BERT to GPT-2 to completely change the transformer. With Transformers, we publicly host pre-trained tokenizer vocabularies and model weights on a model hub, which allows easy sharing and use of pre-trained models and tokenizers. Right now, we have more than 1,600 model and tokenizer pairs that you can use very simply, as shown in the code sample here, using just the tokenizer’s from_pretrained and the model’s from_pretrained as well.


Another abstraction available in Transformers is the pipeline abstraction, which handles both the tokenization and the prediction. It uses reasonable defaults, which means that you can get state-of-the-art predictions without tuning any settings. However, pipelines are still very customizable: if, for example, you want to use a pipeline in a language other than English, you can just specify which pre-trained model and tokenizer pair you would like to use, and it will automatically download that pair from the model hub and use it. So let’s check a few use cases; that’s where it gets really interesting. Here we have sentiment analysis, which is a kind of sequence classification. For the next three examples, we’ll be using the pipeline abstraction; for the last example, we’ll look at what it looks like without the pipeline abstraction, just using the tokenizer and model variables, to see what really happens inside the pipeline. With the pipeline, it’s very easy to do sentiment classification. You start by importing the pipeline, and then initialize it with the task that you want to complete, here sentiment analysis. Then you just call this object, nlp, with the sequence that you want to classify according to a sentiment, positive or negative. So calling nlp on “I hate you” results in a negative label with a score of 99.9%, and calling nlp on “I love you” results in a positive label with a score of 99.9%.

Now for our next task, question answering, which I find absolutely incredible. The question answering task in natural language processing is: given a context and a question, extract an answer from the context to answer the question. Here, like on the previous slide, you initialize the pipeline with the task, question answering, and then define a context, which in this case is, I will read it out loud: “Extractive question answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a model on a SQuAD task, you may leverage the run_squad.py script.” Then you call the nlp variable with a question and the context that I just read. For example, calling nlp with “What is extractive question answering?” and the context returns a result with the score, the start and end location of the answer, and the answer itself, in this case “the task of extracting an answer from a text given a question”, which is entirely correct. For the second question, “What is a good example of a question answering dataset?”, the answer is “SQuAD dataset”, which is correct as well.

Now for causal language modeling, or text generation, which is the same thing, as we’ve seen previously. You first initialize the pipeline as we’ve done previously, with the text generation string, and then create a sequence that you would like to be completed, for example here “The Spark + AI Summit”. We want the model to complete the sentence following that initial context, so we call nlp on the sequence, which results in the following generated text: “The Spark + AI Summit was a special event held in January 2016 to celebrate the emergence of a major and enduring innovation, the Spark Plus. The event opened with a keynote by Dr. Stephen Hawking, creator of AI, although it still has many of”, et cetera.
So you can see here that the text is syntactically coherent, even though it’s factually incorrect, right now is a an

example using the tokenizer and models without relying on the pipeline obstruction, so here, we’re going to do sequence classification, again, similar to the sentiment analysis, but we’re going to do a different task here. We’re going as you can see in the in the code sample, we’re loading the tokenizer and model checkpoints bert-base case to fine tune the MRPC bert base case means that is the original bert base with the case vocabulaty that was released by Google when it was bert. And finally MRPC means that we find in these checkpoint to the MRPC test, MRPC is the Microsoft Research paraphrase corpus, which is a data set that tries to identify when two sequences or two sentences are paraphrases of each other, so here we initialize the tokenizer and model from checkpoint, next when you find

the classes, so not paraphrase and is paraphrase, and we define three sequences, the first one is the company Hugging Face is based in New York City, the second one is apples are especially bad for your health and the last one is Hugging Face’s headquarters are situated in Manhattan. The first 10 sequences are completely unrelated. So definitely not paraphrases of each other, however, the first and last sequences are practically mean the same thing, so maybe they’re not actually paraphrases, but they’re very, very similar to each other, so we want our model to predict that the first and last sequences are paraphrases, so we define that the we now use the tokenizer to encode the sequence pairs, so that it

basically combines the two sequences into a pair using the model’s special tokens and builds tensors that can then be used by the model to output the prediction. That’s why we have return_tensors equal to “pt”, which means that we want PyTorch tensors to be returned. So paraphrase and not_paraphrase here are two dictionaries containing PyTorch tensors that the model will use to make its prediction. So now we unpack

the paraphrase and not_paraphrase dictionaries into the model and get the first result, the first return value of the model. Since the model returns a tuple of outputs, which can contain a lot of different data on how the prediction was computed, we only want the first result, which is the classification logits: the paraphrase classification logits and the not-paraphrase classification logits.
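The steps just described (load the checkpoint, encode a sequence pair, unpack the dictionary into the model, keep the logits) can be sketched as below. This is a hedged sketch, not the slide code: it assumes PyTorch and a transformers release in which the tokenizer object is directly callable, and the function names softmax and paraphrase_probabilities are illustrative. The softmax is written in plain Python here only to show how logits become probabilities.

```python
import math

# Class order used for this checkpoint in the talk.
CLASSES = ["not paraphrase", "is paraphrase"]


def softmax(logits):
    """Turn raw classification logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def paraphrase_probabilities(sequence_a, sequence_b):
    # Imported lazily: requires `pip install torch transformers`
    # and downloads the checkpoint on first use.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    checkpoint = "bert-base-cased-finetuned-mrpc"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    model.eval()

    # Encode the pair with the model's special tokens as PyTorch tensors.
    inputs = tokenizer(sequence_a, sequence_b, return_tensors="pt")

    # Unpack the dictionary into the model; the first output is the logits.
    with torch.no_grad():
        logits = model(**inputs)[0]

    return dict(zip(CLASSES, softmax(logits[0].tolist())))
```

For example, paraphrase_probabilities("The company Hugging Face is based in New York City", "Hugging Face's headquarters are situated in Manhattan") returns one probability per class; the exact values depend on the checkpoint and library version.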

Now we just pass these results through a softmax to make sure that our results are between zero and one, so we can read them as probabilities or percentages, and we print the results. The first case, the first and last sequences, should be a paraphrase, and the result is not paraphrase 10% and is paraphrase 90%. So it did identify that these two sequences were very similar. The second case should not be a paraphrase: it’s not paraphrase at 94% and is paraphrase at 6%. So here it identified that the first and second sequences were not paraphrases of each other. So using the model and tokenizer is still very easy and seamless, since we offer these high-level methods that make converting sequences to model inputs very simple. But Transformers isn’t only

limited to inference; it can also train models. We offer quite a lot of example scripts in our library, in both TensorFlow and PyTorch, for a few tasks that are listed here: named entity recognition, sequence classification, question answering, language modeling (both fine-tuning and from scratch, if you want to pre-train a language model) and multiple choice. All of those can be trained on TPU, CPU and GPU. As well as this, since our models are simply bare-bones models in both PyTorch and TensorFlow, you can use them with the respective training frameworks. So for example, in PyTorch you can just use them with a simple training loop or with PyTorch Lightning, and in TensorFlow you could use them with the Keras fit method or with TensorFlow Estimators. We also offer a Trainer class that can simply be overridden in Python, just to build a very simple training setup specific to NLP.

So with this presentation we just scratched the surface of Transformers, which offers many, many different possibilities. To name just a few that were added in the past few weeks: we’ve added the ELECTRA model and the ELECTRA pre-training method, the Reformer for very high efficiency, the Longformer for very, very long sequences, and the encoder-decoder architectures for translation and summarization.

So now we’ve seen both the tokenization and prediction aspects of the full NLP pipeline, but is that really everything that covers the NLP pipeline? Well, not really, because there are still the data part and the metrics part that need to be covered, since obtaining data and feeding it to the tokenizer in a memory-friendly way, and computing metrics afterwards, are two different tasks. So we’ve created the Hugging Face NLP library, so that it offers memory mapping

and computing the metrics automatically, and the NLP

library will soon be built on top of Spark, and currently offers more than 100 different datasets that can be used very, very simply by the tokenizers and transformers libraries. So that’s it for the full pipeline of current natural language processing using Hugging Face tools. Thank you very much for listening to our talk. Your feedback is very important to us, so please don’t forget to rate and review the sessions.

About Lysandre Debut

Hugging Face

Lysandre Debut is a Machine Learning Engineer at Hugging Face, the leading NLP startup, based in NYC and Paris, that raised more than $20M from prominent investors. The company created Transformers, the fastest growing open-source library enabling thousands of companies to leverage natural language processing, of which Lysandre is a maintainer and core contributor.

About Anthony Moi

Hugging Face

Anthony Moi is the tech lead at Hugging Face, the leading NLP startup, based in NYC and Paris, that raised more than $20M from prominent investors. The company creates tools for deep learning applied to natural language processing, namely huggingface/transformers and huggingface/tokenizers. Anthony is the lead maintainer and contributor to huggingface/tokenizers.