Generative AI

Fast, Secure and Reliable: Enterprise-grade LLM Inference

After a whirlwind year of developments in 2023, many enterprises are eager to adopt increasingly capable generative AI models to supercharge their...
Generative AI

Serving Quantized LLMs on NVIDIA H100 Tensor Core GPUs

Quantization is a technique for making machine learning models smaller and faster. We quantize Llama2-70B-Chat, producing an equivalent-quality model that generates 2.2x more...
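The core idea behind quantization can be sketched with a minimal int8 example. This is an illustrative sketch only, not the method from the post (`quantize_int8` and `dequantize` are hypothetical helper names): each float32 weight is mapped to an 8-bit integer plus a shared scale, cutting storage 4x at a small accuracy cost.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-tensor quantization: map floats onto int8 in [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an approximation of the original float weights.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 4x smaller than float32; rounding error is bounded
# by half a quantization step (scale / 2).
print(q.dtype, w_hat.dtype)  # int8 float32
```

Production serving stacks typically go further (per-channel scales, activation quantization, calibration), but the storage/accuracy trade-off shown here is the same one the post measures.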
Generative AI

LLM Training and Inference with Intel Gaudi 2 AI Accelerators

January 4, 2024 by Abhi Venigalla and Daya Khudia in Mosaic Research
At Databricks, we want to help our customers build and deploy generative AI applications on their own data without sacrificing data privacy or...
Generative AI

Integrating NVIDIA TensorRT-LLM with the Databricks Inference Stack

Over the past six months, we've been working with NVIDIA to get the most out of their new TensorRT-LLM library. TensorRT-LLM provides an easy-to-use Python interface to integrate with a web server for fast, efficient inference performance with LLMs. In this post, we're highlighting some key areas where our collaboration with NVIDIA has been particularly important.
Engineering blog

Introducing Mixtral 8x7B with Databricks Model Serving

Today, Databricks is excited to announce support for Mixtral 8x7B in Model Serving. Mixtral 8x7B is a sparse Mixture of Experts (MoE)...
Generative AI

LLM Inference Performance Engineering: Best Practices

In this blog post, the MosaicML engineering team shares best practices for how to capitalize on popular open source large language models (LLMs)...
Generative AI

Introducing Llama2-70B-Chat with MosaicML Inference

Llama2-70B-Chat is a leading AI model for text completion, comparable with ChatGPT in terms of quality. Today, organizations can leverage this state-of-the-art model...
Generative AI

Benchmarking Large Language Models on NVIDIA H100 GPUs with CoreWeave (Part 1)

April 27, 2023 by Daya Khudia and Vitaliy Chiley in Mosaic Research
The research and engineering teams here at MosaicML collaborated with CoreWeave, one of...
Generative AI

MosaicML Delivers Leading NLP Performance in MLPerf v2.1

MosaicML leads the MLPerf NLP results, delivering a score of 7.9 minutes on 8x NVIDIA A100 GPUs in the Open Division, thanks to...
Generative AI

MosaicML Satisfies the Need for Speed with MLPerf Results

MosaicML’s Open Division submission to the MLPerf Image Classification benchmark delivers a score of 23.8 minutes (4.5x speed-up relative to our baseline) on...