MPT-30B: Raising the bar for open-source foundation models


Introducing MPT-30B, a new, more powerful member of our Foundation Series of open-source models, trained with an 8k context length on NVIDIA H100 Tensor Core GPUs.

Since the launch of MPT-7B in May, the ML community has eagerly embraced open-source MosaicML Foundation Series models. The MPT-7B base, -Instruct, -Chat, and -StoryWriter models have collectively been downloaded over 3M times!

We've been overwhelmed by what the community has built with MPT-7B. To highlight a few: LLaVA-MPT adds vision understanding to MPT, GGML optimizes MPT on Apple Silicon and CPUs, and GPT4All lets you run a GPT4-like chatbot on your laptop using MPT as a backend model.

Today, we are excited to expand the MosaicML Foundation Series with MPT-30B, a new, open-source model licensed for commercial use that is significantly more powerful than MPT-7B and outperforms the original GPT-3. In addition, we are releasing two fine-tuned variants, MPT-30B-Instruct and MPT-30B-Chat, that are built on top of MPT-30B and excel at single-turn instruction following and multi-turn conversations, respectively.

All MPT-30B models come with special features that differentiate them from other LLMs, including an 8k token context window at training time, support for even longer contexts via ALiBi, and efficient inference + training performance via FlashAttention. The MPT-30B family also has strong coding abilities thanks to its pre-training data mixture. This model was extended to an 8k context window on NVIDIA H100 GPUs, making it (to the best of our knowledge) the first LLM trained on H100 GPUs, which are now available to MosaicML customers!
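
Because ALiBi applies a linear positional bias rather than a learned positional embedding table, the usable context window can be raised at load time. Below is a minimal sketch using the Hugging Face transformers API with the mosaicml/mpt-30b checkpoint; the `max_seq_len` attribute follows MPT's custom model code, and quality at contexts much longer than training should be validated for your workload.

```python
import torch
import transformers

# Load the MPT-30B config and raise the maximum sequence length.
# ALiBi lets the model extrapolate beyond the 8k tokens seen in training.
name = "mosaicml/mpt-30b"
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 16384  # e.g. extend from 8k to 16k

model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = transformers.AutoTokenizer.from_pretrained(name, trust_remote_code=True)
```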

The size of MPT-30B was also specifically chosen to make it easy to deploy on a single GPU—either 1x NVIDIA A100-80GB in 16-bit precision or 1x NVIDIA A100-40GB in 8-bit precision. Other comparable LLMs such as Falcon-40B have larger parameter counts and cannot be served on a single datacenter GPU (today); this necessitates 2+ GPUs, which increases the minimum inference system cost.
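
As a rough illustration of the 8-bit path, here is a hedged sketch using the standard transformers + bitsandbytes integration; actual memory headroom depends on sequence length, batch size, and activation overhead.

```python
import transformers

# Load MPT-30B with 8-bit weights so it can fit on a single 40GB GPU.
# Requires the `bitsandbytes` and `accelerate` packages to be installed.
model = transformers.AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-30b",   # Hugging Face model id
    load_in_8bit=True,    # quantize weights to int8 at load time
    device_map="auto",    # place layers on the available GPU(s)
    trust_remote_code=True,
)
```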

If you want to start using MPT-30B in production, there are several ways to customize and deploy it using the MosaicML Platform.

  • MosaicML Training. Customize MPT-30B using your private data via fine-tuning, domain-specific pre-training, or training from scratch. You always own the final model weights, and your data is never stored on our platform. Pricing is per GPU-minute.
  • MosaicML Inference. Talk to our hosted endpoints for MPT-30B-Instruct (and MPT-7B-Instruct) using our Python API, with standard pricing per-1K-tokens.

We are so excited to see what our community and customers build next with MPT-30B. To learn more about the models and how you can customize them using the MosaicML platform, read on!

MPT-30B Family

Mosaic Pretrained Transformer (MPT) models are GPT-style decoder-only transformers with several improvements, including higher speed, greater stability, and longer context lengths. Thanks to these improvements, customers can train MPT models efficiently (40-60% MFU) without divergence from loss spikes, and can serve MPT models with both standard HuggingFace pipelines and FasterTransformer.

MPT-30B (Base)

MPT-30B is a commercially usable, Apache 2.0-licensed, open-source foundation model that exceeds the quality of GPT-3 (from the original paper) and is competitive with other open-source models such as LLaMa-30B and Falcon-40B.

Using our publicly available LLM Foundry codebase, we trained MPT-30B over the course of two months, moving between several NVIDIA A100 clusters as hardware availability changed, with an average MFU of >46%. In mid-June, after we received our first batch of 256 NVIDIA H100 GPUs from CoreWeave, we seamlessly moved MPT-30B to the new cluster and resumed training on H100s with an average MFU of >35%. To the best of our knowledge, MPT-30B is the first public model to be (partially) trained on H100 GPUs! We found that per-GPU throughput increased by 2.44x, and we expect this speedup to grow as the H100 software stack matures.

As mentioned earlier, MPT-30B was trained with a long context window of 8k tokens (vs. 2k for LLaMa and Falcon) and can handle arbitrarily long context windows via ALiBi or with fine-tuning. To build 8k support into MPT-30B efficiently, we first pre-trained on 1T tokens using sequences that were 2k tokens long, and continued training for an additional 50B tokens using sequences that were 8k tokens long.
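
For intuition on why this works: ALiBi adds a fixed, head-specific linear penalty to attention scores based on query-key distance instead of using learned positional embeddings, so a model trained at 8k can still attend over longer sequences. The sketch below shows the bias computation in isolation (a simplified illustration, not the LLM Foundry implementation):

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Return an (n_heads, seq_len, seq_len) tensor of ALiBi attention biases."""
    # Head-specific slopes form a geometric sequence, e.g. 1/2, 1/4, ..., 1/256 for 8 heads.
    start = 2.0 ** (-8.0 / n_heads)
    slopes = torch.tensor([start ** (i + 1) for i in range(n_heads)])

    # Relative distance between each query position i and key position j.
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]       # entry [i, j] = j - i
    distance = distance.clamp(max=0).float()     # only penalize past positions

    # Bias = slope * distance, broadcast per head; added to the attention logits.
    return slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(n_heads=8, seq_len=16)
print(bias.shape)  # torch.Size([8, 16, 16])
```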

The data mix used for MPT-30B pre-training is very similar to MPT-7B (see the MPT-7B blog post for details). For the 2k context window pre-training we used 1T tokens from the same 10 data subsets as the MPT-7B model (Table 1), but in slightly different proportions.

Table 1: Data mix for MPT-30B pre-training. We collected 1T tokens of pre-training data from ten different open-source text corpora. We tokenized the text using the EleutherAI GPT-NeoX-20B tokenizer and sampled according to the above ratios.

For the 8k context window fine-tuning, we created two data mixes from the same 10 subsets we used for the 2k context window pre-training (Figure 1). The first 8k fine-tuning mix is similar to the 2k pre-training mix, but we increased the relative proportion of code by 2.5x. To create the second 8k fine-tuning mix, which we refer to as the "long sequence" mix, we extracted all sequences of length ≥ 4096 tokens from the 10 pre-training data subsets. We then fine-tuned on a combination of these two data mixes. See the Appendix for more details on the 8k context window fine-tuning data.
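
Conceptually, the "long sequence" mix is just a length filter over the tokenized pre-training subsets. A simplified sketch (the real pipeline operates on StreamingDataset shards rather than in-memory lists):

```python
from typing import Iterable, List

def long_sequence_mix(
    tokenized_samples: Iterable[List[int]],
    min_tokens: int = 4096,
) -> List[List[int]]:
    """Keep only samples long enough to exercise the 8k context window."""
    return [tokens for tokens in tokenized_samples if len(tokens) >= min_tokens]

# Example: three toy "documents" of 100, 5000, and 9000 tokens.
corpus = [[0] * 100, [1] * 5000, [2] * 9000]
print([len(s) for s in long_sequence_mix(corpus)])  # [5000, 9000]
```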

Figure 1: Data subset distribution for 8k context window fine-tuning. For 8k context window fine-tuning, we took each data subset and extracted all the samples with ≥ 4096 tokens in order to create a new “long sequence” data mix. We then fine-tuned on a combination of both the long sequence and original data mixes.

In Figure 2, we measure six core model capabilities and find that MPT-30B significantly improves over MPT-7B in every one. In Figure 3, we perform the same comparison between similarly sized MPT, LLaMa, and Falcon models. Overall, we find that the 7B models across the different families are quite similar, while LLaMa-30B and Falcon-40B score slightly higher on text capabilities than MPT-30B, consistent with their larger pre-training compute budgets (a rough sketch of this calculation follows the list):

  • MPT-30B FLOPs ~= 6 * 30e9 [params] * 1.05e12 [tokens] = 1.89e23 FLOPs
  • LLaMa-30B FLOPs ~= 6 * 32.5e9 [params] * 1.4e12 [tokens] = 2.73e23 FLOPs (1.44x more)
  • Falcon-40B FLOPs ~= 6 * 40e9 [params] * 1e12 [tokens] = 2.40e23 FLOPs (1.27x more)
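
These estimates use the standard ~6 x parameters x tokens approximation for dense-transformer training compute. A quick sketch that reproduces the ratios above:

```python
def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute via the ~6 * N * D rule of thumb."""
    return 6.0 * params * tokens

mpt_30b    = train_flops(30e9,   1.05e12)  # ~1.89e23
llama_30b  = train_flops(32.5e9, 1.4e12)   # ~2.73e23
falcon_40b = train_flops(40e9,   1e12)     # ~2.40e23

print(f"LLaMa-30B / MPT-30B:  {llama_30b / mpt_30b:.2f}x")   # ~1.44x
print(f"Falcon-40B / MPT-30B: {falcon_40b / mpt_30b:.2f}x")  # ~1.27x
```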

On the other hand, we find that MPT-30B is significantly better at programming, which we credit to its pre-training data mixture including a substantial amount of code. We dig into programming ability further in Table 2, where we compare the HumanEval scores of MPT-30B, MPT-30B-Instruct, and MPT-30B-Chat to existing open source models including those designed for code generation. We find that MPT-30B models are very strong at programming and MPT-30B-Chat outperforms all models except WizardCoder. We hope that this combination of text and programming capabilities will make MPT-30B models a popular choice for the community.
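
For reference, the HumanEval numbers reported here are pass@1: a problem counts as solved only if a single sampled completion passes all of its unit tests. The sketch below shows the unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021), which reduces to a plain pass rate when one sample is drawn per problem:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples per problem, c of which pass."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With one completion per problem (n=1, k=1), pass@1 is simply the
# fraction of problems whose single completion passes its unit tests.
results = [1, 0, 1, 1, 0]                # toy per-problem pass/fail outcomes
print(sum(results) / len(results))       # 0.6
print(pass_at_k(n=1, c=1, k=1))          # 1.0 for a solved problem
```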

Finally in Table 3, we show how MPT-30B outperforms GPT-3 on the smaller set of eval metrics that are available from the original GPT-3 paper. Just about 3 years after the original publication, we are proud to surpass this famous baseline with a smaller model (17% of GPT-3 parameters) and significantly less training compute (60% of GPT-3 FLOPs).

For more detailed evaluation data, or if you want to reproduce our results, you can see the raw data and scripts we used in our LLM Foundry eval harness here. Note that we are still polishing our HumanEval methodology and will release it soon via Composer and LLM-Foundry.

Figure 2: MPT-7B vs. MPT-30B. Our new MPT-30B model significantly improves over our previous MPT-7B model.

Figure 3: MPT vs. LLaMa vs. Falcon models. Left: Comparing models with 7 billion parameters. Right: Comparing models with 30 to 40 billion parameters.

Table 2: Zero-shot accuracy (pass@1) of MPT-30B models vs. general-purpose and GPT-distilled code generation models on HumanEval, a corpus of Python coding problems.

We find that MPT-30B models outperform LLaMa-30B and Falcon-40B by a wide margin, and even outperform many purpose-built coding models such as StarCoder. See the Appendix for a disclaimer about Falcon-40B-Instruct and Falcon-40B. External sources: [1], [2], [3], [4], [5]

Table 3: Zero-shot accuracy of MPT-30B vs. GPT-3 on nine in-context-learning (ICL) tasks. We find that MPT-30B outperforms GPT-3 in six out of the nine metrics. GPT-3 numbers are copied from the original paper.

MPT-30B-Instruct

Figure 4: A sample instruction and completion from MPT-30B-Instruct.

LLM pre-training teaches the model to continue generating text based on the input it was provided. But in practice, users expect LLMs to treat the input as instructions to follow. Instruction fine-tuning is the process of training LLMs to perform instruction-following. By reducing the reliance on clever prompt engineering, instruction fine-tuning makes LLMs more accessible, intuitive, and immediately usable. The progress of instruction fine-tuning has been driven by open-source datasets like FLAN, P3, Alpaca, and Dolly-15k.

We created a commercially-usable, instruction-following variant of our model called MPT-30B-Instruct. We liked the commercial license of Dolly, but we wanted to add more training data, so we augmented Dolly with a subset of Anthropic's Helpful & Harmless dataset, doubling the dataset size while maintaining a commercial CC-By-SA-3.0 license. Then, to take advantage of MPT-30B's 8,192 token context length, we further augmented the data with some open source datasets: CompetitionMath, GradeSchoolMath, DialogSum, DuoRC, QASPER, QuALITY, SummScreen, and Spider.
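
At inference time, instruction-tuned models expect prompts wrapped in the template they were fine-tuned on. The sketch below uses a generic Dolly/Alpaca-style template for illustration; the exact template expected by MPT-30B-Instruct is documented on its Hugging Face model card and may differ in minor details.

```python
# A Dolly/Alpaca-style prompt template for instruction-tuned models.
# The exact template for MPT-30B-Instruct is documented on its model card
# and may differ slightly from this sketch.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def format_instruction(instruction: str) -> str:
    """Wrap a raw user instruction in the instruction-following template."""
    return PROMPT_TEMPLATE.format(instruction=instruction)

print(format_instruction("Summarize the key features of MPT-30B in one sentence."))
```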

This new instruction-following dataset is an improvement upon the dataset we used to train MPT-7B-Instruct, and we plan to release an updated MPT-7B-Instruct-v2 in the near future to bring it up to parity with MPT-30B-Instruct.

MPT-30B-Chat

Figure 5: A sample conversation with MPT-30B-Chat.

We also created MPT-30B-Chat, a conversational version of MPT-30B. MPT-30B-Chat has been fine-tuned on a large collection of chat datasets, ensuring that it is ready for a wide array of conversational tasks and applications. The combined fine-tuning dataset is composed of 1.54 billion tokens and the model is trained for 6 epochs. The dataset uses the ChatML format, which provides a convenient and standardized way to pass system messages to the model and helps prevent malicious prompt injection.
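
For illustration, here is a hedged sketch of how a multi-turn conversation is serialized into the ChatML format, with each turn wrapped in <|im_start|> and <|im_end|> markers so the system message stays clearly separated from user input:

```python
from typing import Dict, List

def to_chatml(messages: List[Dict[str, str]]) -> str:
    """Serialize a list of {role, content} turns into a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful, harmless assistant."},
    {"role": "user", "content": "Write a haiku about long context windows."},
])
print(prompt)
```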

MPT-30B-Chat is a research artifact and not meant for commercial use, and we have used a non-commercial CC-By-NC-SA-4.0 license accordingly. We have released it because it demonstrates the power of MPT-30B when combined with large, high-quality fine-tuning datasets.

Despite being trained as a general conversational model, MPT-30B-Chat is also surprisingly good at programming and scores 37.2% on HumanEval; this places it above almost all open source models other than WizardCoder. See Table 2 above for more details!

Deploy MPT-30B models with MosaicML Inference

With the launch of our MosaicML Inference service, we offer low-latency, high-throughput hosting for open-source models like MPT and LLaMa. You can use our inference software stack to serve these models either on MosaicML hardware or on your own private hardware. With MosaicML Inference, you can send API requests to MosaicML-hosted endpoints for MPT-7B-Instruct, MPT-30B-Instruct, and other open-source text generation and embedding models. These endpoints are priced per token and are significantly cheaper than comparable OpenAI APIs at the same quality (see Figure 6). MosaicML Inference is a great option for quickly prototyping AI-powered features, and it can also be a suitable choice when sharing data with a third-party API is acceptable.

Figure 6: Querying MPT-30B or MPT-7B models using MosaicML Inference is 4x cheaper than using OpenAI APIs and offers comparable quality.

Customize MPT-30B with MosaicML Training

MPT-30B comes with strong generation abilities out of the box. But for the best performance on your specific task, we recommend fine-tuning MPT-30B on your private data. This process can be done in hours for as little as a few hundred dollars.

For more advanced use cases, such as custom languages, custom domains (e.g. Replit's code-generation model), or custom tasks (long-document question answering), you can customize MPT-30B further by either adding domain-specific pre-training or training a custom model from scratch.

LLM Foundry

To make training custom language models as easy as possible, we've open sourced our production-ready training code as LLM Foundry. It's the exact same codebase our NLP team used to build MPT-7B and MPT-30B. This repository uses Composer, StreamingDataset, and FSDP to train custom models of any size on any number of GPUs. It can also stream data directly from your private object storage, with easy model export to HuggingFace, ONNX, or FasterTransformer. LLM Foundry has been battle-tested on cloud A100s and H100s, and we are rapidly adding support for more hardware options.

LLM Foundry also includes scripts for evaluating your model on standard eval metrics, your own custom data, or both. Thanks to our multi-GPU and FSDP support, evaluation is extremely fast—you can measure eval metrics offline in minutes, or even live during training, giving you instant feedback on how your model is performing.

Whether you want to do a small fine-tuning run or train a huge model from scratch, LLM Foundry handles all of these workloads efficiently. Check out our public training performance and inference performance tables!

As a customer of MosaicML, you also get access to up-to-date recipes that ensure your training runs will be stable (no loss spikes) as well as our MCLI orchestration software. The latter gracefully handles hardware failures and automatic resumption so that you don't waste compute or need to babysit your runs.

Training Times + Costs

How much time and money does it take to train custom MPT-30B models? Let's start with the base model. In Table 4, we show the times and costs to pre-train MPT-30B from scratch using either A100 or H100 GPUs. With MosaicML infrastructure, you can train your own custom MPT-30B from scratch on 1T tokens in under 2 weeks.
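
The wall-clock estimates in Table 4 follow from the same ~6 x parameters x tokens compute rule divided by delivered throughput (GPU count x peak FLOP/s x MFU). The sketch below is illustrative only and does not reproduce the exact Table 4 configurations; it assumes ~312 TFLOP/s BF16 peak per A100 and the ~46% MFU reported above.

```python
def train_days(params: float, tokens: float, n_gpus: int,
               peak_flops_per_gpu: float, mfu: float) -> float:
    """Estimate wall-clock training days from the ~6 * N * D compute rule."""
    total_flops = 6.0 * params * tokens
    flops_per_sec = n_gpus * peak_flops_per_gpu * mfu
    return total_flops / flops_per_sec / 86_400

# Illustrative only -- not the exact Table 4 configuration.
days = train_days(params=30e9, tokens=1e12, n_gpus=1024,
                  peak_flops_per_gpu=312e12, mfu=0.46)
print(f"{days:.1f} days")  # ~14 days on 1024 A100s under these assumptions
```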

What about if you want to fine-tune an existing model? In Table 5 we break down the times and per-1B-token costs to fine-tune MPT-30B. With MosaicML infrastructure, you can perform full fine-tuning of MPT-30B models without worrying about system memory constraints—and it only costs a few hundred dollars!

Table 4: Times and costs to pre-train MPT-30B from scratch on 1 trillion tokens. Times for H100 are extrapolated from a 256xH100 system. Costs are based on current MosaicML reserved cluster pricing of $2.50/A100-40GB/hour and $5.00/H100-80GB/hour as of June 22nd, 2023. Note that H100 AMP_FP8 training convergence is still being validated by MosaicML's research team but will be coming soon!

Table 5: Times and costs to fine-tune MPT-30B on 1 billion tokens. Costs are based on current MosaicML reserved cluster pricing of $2.50/A100-40GB/hour and $5.00/H100-80GB/hour as of June 22nd, 2023. Note that H100 AMP_FP8 training convergence is still being validated by MosaicML's research team but will be coming soon!

What's next?

Ready to kick the tires of our new MPT-30B family? As a reminder, our Foundation Series models are fully supported by the MosaicML platform, giving you the tools and expertise to easily and efficiently build, customize, and deploy on your secure cloud of choice. Sign up for a demo here. And stay tuned for many more models to come in our Foundation Series!

Appendix

Acknowledgements

We gratefully acknowledge our friends at OCI, who host the NVIDIA A100 GPUs we used to complete the primary training phase for MPT-30B.

We also gratefully acknowledge our friends at CoreWeave, who host the NVIDIA H100 GPUs we used to complete the 8k-context training phase for MPT-30B and supported us as we got up to speed with a new GPU architecture.

We also gratefully acknowledge our friends at AI2, who shared immensely valuable technical expertise as we developed the MPT family of models.

Data

MPT-30B 8k Context Window Fine-tuning Data

For 8k context window fine-tuning, we took each data subset and extracted all the samples with at least 4096 tokens in order to create a new "long sequence" data mix. We then fine-tuned on a combination of both the long sequence and original data mixes.

Table: MPT-30B 8k context window fine-tuning data mix by subset.

MPT-30B-Instruct Fine-tuning Data

Table: MPT-30B-Instruct fine-tuning data sources.

MPT-30B-Chat Fine-tuning Data

Chat fine-tuning data. Note that each token was seen 6 times. These token counts include both the prompts and their target responses, so not all 1.54B tokens are loss-generating.

Table: MPT-30B-Chat fine-tuning data mix.

Evaluation

Table: MPT-30B vs. open-source models on HumanEval (pass@1).

MPT-30B vs. open-source models on our code evaluation suite. We test each model on the HumanEval dataset of code prompts, using zero-shot evaluation and benchmarking using the pass@1 metric, or the percent of test cases that the model passes when only allowed to generate one possible code continuation. We also provide cited external values to verify the replicability of our in-house code evaluation suite, which will be released as open source in a future release of Composer/LLM-Foundry.

External sources: [1], [2], [3], [4], [5], [6], [7], [8]

Falcon Code Eval Disclaimer

In our eval framework, Falcon-40B and Falcon-40B-Instruct appear to be outliers among similarly sized models. While the majority of our self-evaluated scores match external results, our Falcon-40B-Instruct pass rate is significantly lower than reported in WizardCoder. We use the same prompts and LLM Foundry eval harness for all models. If you have suggestions on how to better prompt / use Falcon-40B or more external HumanEval scores we can reference, please reach out and let us know!