
Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM

Two weeks ago, we released Dolly, a large language model (LLM) trained for less than $30 to exhibit ChatGPT-like human interactivity (aka instruction-following). Today, we’re releasing Dolly 2.0, the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.

Dolly 2.0 is a 12B parameter language model based on the EleutherAI pythia model family and fine-tuned exclusively on a new, high-quality, human-generated instruction-following dataset, crowdsourced among Databricks employees.

We are open-sourcing the entirety of Dolly 2.0, including the training code, the dataset, and the model weights, all suitable for commercial use. This means that any organization can create, own, and customize powerful LLMs that can talk to people, without paying for API access or sharing data with third parties.

databricks-dolly-15k dataset

databricks-dolly-15k contains 15,000 high-quality human-generated prompt / response pairs specifically designed for instruction tuning large language models. Under the licensing terms for databricks-dolly-15k (Creative Commons Attribution-ShareAlike 3.0 Unported License), anyone can use, modify, or extend this dataset for any purpose, including commercial applications.
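The dataset ships as JSON Lines: one JSON object per line with instruction, context, response, and category fields (field names follow the published dataset). A minimal sketch of parsing it, with invented sample records rather than real entries:

```python
import json

# Two illustrative records in the style of databricks-dolly-15k (JSON Lines).
# Field names match the published dataset; the content here is made up.
sample_jsonl = """\
{"instruction": "What is the capital of France?", "context": "", "response": "Paris is the capital of France.", "category": "open_qa"}
{"instruction": "Name five things to pair with peanut butter in a sandwich.", "context": "", "response": "Honey, raisins, fruit preserves, bananas, sliced apples.", "category": "brainstorming"}
"""

def load_records(text: str) -> list[dict]:
    """Parse JSON Lines text into a list of record dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = load_records(sample_jsonl)
print(len(records))            # 2
print(records[0]["category"])  # open_qa
```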

To the best of our knowledge, this dataset is the first open source, human-generated instruction dataset specifically designed to make large language models exhibit the magical interactivity of ChatGPT. databricks-dolly-15k was authored by more than 5,000 Databricks employees during March and April of 2023. These training records are natural, expressive, and designed to represent a wide range of behaviors, from brainstorming and content generation to information extraction and summarization.

Why did we create a new dataset?

As soon as we released Dolly 1.0, we were inundated by requests from people who wanted to try it out. The number one question that we kept getting was “can I use this commercially?”

A critical step in the creation of Dolly 1.0, or any instruction-following LLM, is to train the model on a dataset of instruction and response pairs. Dolly 1.0 was trained for $30 using a dataset that the Stanford Alpaca team had created using the OpenAI API. That dataset contained output from ChatGPT, and as the Stanford team pointed out, the terms of service seek to prevent anyone from creating a model that competes with OpenAI. So, unfortunately, the answer to this common question was, “probably not!”

As far as we know, all the existing well-known instruction-following models (Alpaca, Koala, GPT4All, Vicuna) suffer from this limitation, prohibiting commercial use. To get around this conundrum, we started looking for ways to create a new dataset not “tainted” for commercial use.

How did we do it?

We knew from the OpenAI research paper that the original InstructGPT model was trained on a dataset consisting of 13,000 demonstrations of instruction following behavior. Inspired by this, we set out to see if we could achieve a similar result with Databricks employees leading the charge.

Turns out, generating 13k questions and answers is harder than it looks. Every answer has to be original; it can’t be copied from ChatGPT or anywhere on the web, or it would taint our dataset. It seemed daunting, but Databricks has over 5,000 employees who are very interested in LLMs. So we thought we could crowdsource among them to create an even higher-quality dataset than the one OpenAI’s 40 labelers had created. But we knew they were all busy with full-time jobs, so we needed to incentivize them to do this.

We set up a contest, where the top 20 labelers would get a big award. We also outlined 7 very specific tasks:

  • Open Q&A: For instance, “Why do people like comedy movies?” or “What is the capital of France?” In some cases, there’s not a correct answer, and in others, it requires drawing on knowledge of the world at large.
  • Closed Q&A: These are questions that can be answered using only the information contained in a passage of reference text. For instance, given a paragraph from Wikipedia on the atom, one might ask, “What is the ratio between protons and neutrons in the nucleus?”
  • Extract information from Wikipedia: Here an annotator would copy a paragraph from Wikipedia and extract entities or other factual information such as weights or measurements from the passage.
  • Summarize information from Wikipedia: For this, annotators provided a passage from Wikipedia and were asked to distill it to a short summary.
  • Brainstorming: This task asked for open-ended ideation and an associated list of possible options. For instance, “What are some fun activities I can do with my friends this weekend?”
  • Classification: For this task, annotators were asked to make judgments about class membership (e.g. are the items in a list animals, minerals or vegetables) or to judge the properties of a short passage of text, such as the sentiment of a movie review.
  • Creative writing: This task would include things like writing a poem or a love letter.
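Downstream, these task types show up as per-record category labels, which makes it easy to slice the dataset by behavior. A rough sketch, with invented records and label strings modeled on the seven tasks above (the exact strings in databricks-dolly-15k may differ):

```python
from collections import Counter

# Illustrative records tagged with task-type labels modeled on the seven
# categories above; both the records and the label strings are made up.
records = [
    {"instruction": "Why do people like comedy movies?", "category": "open_qa"},
    {"instruction": "Summarize this passage about the atom.", "category": "summarization"},
    {"instruction": "Write a poem about spring.", "category": "creative_writing"},
    {"instruction": "Is a carrot an animal, a mineral, or a vegetable?", "category": "classification"},
    {"instruction": "What are some fun weekend activities?", "category": "brainstorming"},
]

# Tally how many records fall into each task type.
by_category = Counter(r["category"] for r in records)
print(by_category["open_qa"])  # 1
print(len(by_category))        # 5
```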

Example of Open Q&A in databricks-dolly-15k

How do I build a campfire?

Safety should always come first when starting a campfire. Ensure you have water or another way to extinguish your fire readily available. Use a campfire ring, pit, or clear a large area around your campfire location. Also, make sure campfires are permitted in your area.
 

With safety out of the way, all a fire needs are heat, fuel, and oxygen. Any combustible material will start a fire, but good campfire fuel sources include tinder, kindling, and dry logs.

Use stuff like dry pine needles, dry leaves, grasses, and wood shavings for tinder. You could also use newspaper (or any paper), dryer lint, or cardboard. For kindling, use smaller pieces of wood, such as twigs and sticks no thicker than 1 inch in diameter (about the thickness of your thumb).

Dry or well-seasoned logs will ensure a good burn. Stack the wood in a way that guarantees oxygen can easily reach your flame. Many prefer a teepee or log cabin stacking design. Now, use a heat source to create a flame or spark and apply this to your tinder. Once the tinder combusts, it should light the kindling, and the kindling should eventually ignite your logs. If all goes well, you should now be enjoying your campfire.

Example of Brainstorming in databricks-dolly-15k

Instead of making a peanut butter and jelly sandwich, what else could I combine peanut butter with in a sandwich? Give five ideas.
 

Instead of jelly, try one of the following with peanut butter in a sandwich:

  1. Honey
  2. Raisins
  3. Fruit preserves
  4. Bananas
  5. Sliced apples

We were initially skeptical whether we would get to 10,000 results. But with nightly leaderboard gamification, we managed to break 15,000 results within a week. Out of fear of eating into our productivity, we closed the contest.
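The nightly leaderboard amounts to a tally of submissions per annotator. A tiny sketch of that mechanic, with invented names and counts purely for illustration:

```python
from collections import Counter

# Invented submission log: one entry per record an annotator contributed.
submissions = ["alice", "bob", "alice", "carol", "alice", "bob"]

def leaderboard(authors: list[str], top_n: int = 20) -> list[tuple[str, int]]:
    """Rank annotators by number of contributed records."""
    return Counter(authors).most_common(top_n)

print(leaderboard(submissions, top_n=3))  # [('alice', 3), ('bob', 2), ('carol', 1)]
```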

Our journey to create a commercially viable model

We also wanted to produce an open source model that can be commercially used. Despite databricks-dolly-15k being substantially smaller than Alpaca, the dataset on which Dolly 1.0 was trained, the resulting Dolly 2.0 model, based on EleutherAI’s pythia-12b, exhibited high-quality instruction following behavior. In hindsight, this isn’t surprising. Many of the instruction tuning datasets released in recent months contain synthesized data, which often contains hallucinations and factual errors.


databricks-dolly-15k, on the other hand, is generated by professionals, is high quality, and contains long answers to most tasks.

You can see for yourself some examples of how Dolly 2.0 can be used for summarization and content generation. Based on our initial customer feedback, it’s clear that capabilities like these will have wide-ranging applications across the enterprise.

 

Dolly 2.0 summarizes Databricks documentation

Instruction

What is the default configuration for a new DBSQL warehouse?

Input:

Databricks SQL Serverless supports serverless compute. Admins can create serverless SQL warehouses (formerly SQL endpoints) that provide instant, Databricks-managed compute. Serverless SQL warehouses use compute clusters in your Databricks account. Use them with Databricks SQL queries just as you would the original customer-hosted SQL warehouses, now called classic SQL warehouses. Databricks changed the name from SQL endpoint to SQL warehouse because, in the industry, an endpoint refers to a remote computing device or an entry point to a cloud service that communicates with the network it is connected to. A data warehouse is a data management system that stores current and historical data from multiple sources in a business-friendly way for easier insights and reporting. SQL warehouse accurately describes the full capabilities of this compute resource. If serverless SQL warehouses are enabled for your account, note the following: New SQL warehouses are serverless by default when created in the UI. New SQL warehouses are not serverless by default when created with the API; you must explicitly specify serverless. You can also create new pro or classic SQL warehouses using either method. You can upgrade a pro or classic SQL warehouse to a serverless SQL warehouse, or a classic SQL warehouse to a pro SQL warehouse. You can also downgrade from serverless to pro or classic. This feature affects only Databricks SQL. It does not affect how Databricks Runtime clusters work with notebooks and jobs in the Data Science & Engineering or Databricks Machine Learning workspace environments; Databricks Runtime clusters always run in the classic data plane in your AWS account. See serverless quotas. If your account needs updated terms of use, workspace admins are prompted in the Databricks SQL UI. If your workspace has an AWS instance profile, you may need to update the trust relationship to support serverless compute, depending on how and when it was created.

Response

When you create a new DBSQL warehouse in the UI, it is a serverless SQL warehouse by default. However, when you create the same warehouse using the API, you must explicitly specify that it is a pro SQL warehouse.
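Examples like the one above pair an instruction with a reference passage. One common way to feed such pairs to an instruction-tuned model is an Alpaca-style prompt template with an optional input section; the exact wording below is an assumption for illustration, not the template from Dolly’s training code:

```python
# Hypothetical Alpaca-style prompt template; the intro sentence and
# section markers are illustrative assumptions, not Dolly's actual template.
INTRO = ("Below is an instruction that describes a task. "
         "Write a response that appropriately completes the request.")

def build_prompt(instruction: str, context: str = "") -> str:
    """Assemble an instruction prompt, inserting the reference passage
    as an '### Input:' section only when one is provided."""
    parts = [INTRO, f"### Instruction:\n{instruction}"]
    if context:
        parts.append(f"### Input:\n{context}")
    parts.append("### Response:\n")
    return "\n\n".join(parts)

prompt = build_prompt(
    "What is the default configuration for a new DBSQL warehouse?",
    "Databricks SQL Serverless supports serverless compute...",
)
print("### Input:" in prompt)                     # True
print("### Input:" in build_prompt("Say hello"))  # False
```

Closed Q&A, information extraction, and summarization records would fill the context slot; open Q&A, brainstorming, and creative writing records would leave it empty.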

Dolly 2.0 generates tweet content

Instruction

Write me a tweet announcing the release of Dolly 2.0, our new LLM.

Response

Dolly 2.0 is here 🤩

We’ve upgraded our LLM, making it more efficient, more powerful, and accessible to more users.

Learn more ⬇️

https://t.co/KxzT3HXCPb

A truly open large language model

We have heard repeatedly from our customers that they would be best served by owning their models, allowing them to create high-quality models for their domain-specific applications without handing their sensitive data over to third parties.

We also believe that important issues such as bias, accountability, and AI safety should be addressed by a broad community of diverse stakeholders rather than just a few large companies. Open-source datasets and models encourage commentary, research, and innovation that help ensure everyone benefits from advances in artificial intelligence technology.

As a technical and research artifact, we don’t expect Dolly to be state-of-the-art in terms of effectiveness. However, we do expect Dolly and its open-source dataset to act as the seed for a multitude of follow-on works, helping to bootstrap even more powerful language models.

How do I get started today?

To download the Dolly 2.0 model weights, visit the Databricks Hugging Face page, and visit the Dolly repo on databricks-labs to download the databricks-dolly-15k dataset. Also join our webinar to discover how you can harness LLMs for your own organization.

Resources
