SESSION

SEA-LION: Representing the Diverse Languages of Southeast Asia with LLMs

OVERVIEW

EXPERIENCE: In Person
TYPE: Breakout
TRACK: Generative AI
INDUSTRY: Public Sector
TECHNOLOGIES: AI/Machine Learning, Apache Spark, GenAI/LLMs
SKILL LEVEL: Beginner
DURATION: 40

Southeast Asia is one of the world's most culturally diverse regions, spanning countries such as Singapore, Vietnam, Thailand, and Indonesia. Its people speak multiple languages and draw cultural influences from China, India, and the West. To reflect these cultural contexts and linguistic influences, AI Singapore, a Singapore government entity, worked with Databricks MosaicML to build SEA-LION, an open-source large language model trained on regional languages such as Thai, Indonesian, and Tamil. This localized LLM does more than support low-resource languages: it can also handle contexts distinctive to the region, such as code-switching between multiple dialects within a single sentence. This session walks through the design considerations behind SEA-LION, from customizing a tokenizer for regional languages to building a model cost-effective enough to appeal to resource-constrained organizations in the region. It also covers potential applications of the model and the long-term vision for it.

SESSION SPEAKERS

Jeanne Choo

APJ ML Practice Lead
Databricks