Application Spotlight: Trifacta

Published: October 9, 2014

This post is guest authored by our friends at Trifacta, whose data transformation platform has been “Certified on Spark.”

Today we announced v2 of the Trifacta Data Transformation Platform, a release that emphasizes the important role that Hadoop plays in the new big data enterprise architecture. With Trifacta v2 we now support transforming data of all shapes and sizes in Hadoop. This means supporting Hadoop-specific data formats, such as Avro, ORC and Parquet, as both inputs and outputs. It also means intelligently executing data transformation scripts not only through MapReduce, which was available in Trifacta v1, but also through Spark. Trifacta v2 has been officially Certified on Spark by Databricks.

Our partnership with Databricks brings the performance and flexibility of the Spark data processing engine to the world of data wrangling. It has been a pleasure to work with the original team that started the Spark research project at UC Berkeley, which later became Apache Spark, and to introduce a new category of applications to the Spark community. Our inspiration to integrate with Spark was in part the sheer power of the technology, but it was also prompted by the tremendous momentum of the open source Apache Spark community and project. We’ve seen a growing number of technology and Fortune 500 companies select Spark as a critical component of their investments in Hadoop implementations.

With support now from all of the major Hadoop distributions, including Cloudera, Hortonworks, MapR and Pivotal, Spark is certainly here to stay as a foundational component of the Hadoop ecosystem. And having tested Spark against many different data transformation use cases, we now know why. With Spark under the hood of Trifacta, we can now execute large-scale data transformations at interactive response rates. This capability complements the execution frameworks that we introduced in Trifacta v1, where we supported instant and batch execution. In Trifacta v1, we could either execute over small data instantaneously in the browser or we could operate over large volumes in batch mode by compiling transformation scripts to execute in MapReduce. Now in Trifacta v2, we can intelligently select between in-browser, Spark and MapReduce execution for the user, so that our customers can focus on analysis instead of technical details.
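As a rough illustration of what choosing between in-browser, Spark and MapReduce execution could look like, here is a minimal Python sketch. It is not Trifacta's actual selection logic, which this post does not describe: the `select_engine` function, the `Engine` enum and both size thresholds are hypothetical names and values made up for the example.

```python
from enum import Enum


class Engine(Enum):
    BROWSER = "in-browser"
    SPARK = "spark"
    MAPREDUCE = "mapreduce"


# Hypothetical thresholds, chosen only for illustration.
BROWSER_MAX_BYTES = 10 * 1024**2   # 10 MiB: small enough to process in the browser
SPARK_MAX_BYTES = 100 * 1024**3    # 100 GiB: beyond this, prefer batch execution


def select_engine(input_bytes: int, interactive: bool) -> Engine:
    """Pick an execution engine from the input size and the desired latency."""
    if input_bytes <= BROWSER_MAX_BYTES:
        return Engine.BROWSER
    # Interactive work on large data goes to Spark regardless of size;
    # non-interactive work goes to Spark only up to the batch threshold.
    if interactive or input_bytes <= SPARK_MAX_BYTES:
        return Engine.SPARK
    return Engine.MAPREDUCE
```

A real selector would presumably weigh more than input size, for example cluster availability or the kinds of transformations in the script; the point is only that the user never has to make the choice.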

We leverage Spark’s flexible execution model to drive low-latency processing for a variety of data transformation workloads. For instance, Spark is suitable for interactive data structuring and cleaning transformations, iterative machine learning routines that power Trifacta’s Predictive Interaction™ technology and efficient analytic queries for Visual Data Profiling.

What made it relatively easy to plug native Spark execution into our Data Transformation Platform was Trifacta’s declarative transformation language, Wrangle. Wrangle is designed to translate the visual interactions that users have in Trifacta into natively executable code that can run on a variety of processing frameworks, including MapReduce and Spark.
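The idea of a declarative script that several backends can consume is sketched below in plain Python. This is not Wrangle syntax and not how Trifacta compiles scripts: the `Step` record, the three operation names, `run_local` and `explain_for_cluster` are all hypothetical, invented for illustration. The script is plain data, one function executes it in memory, and another renders the same script as a list of stages, standing in for generating a job for a cluster framework.

```python
from dataclasses import dataclass
from typing import Dict, List

Row = Dict[str, str]


@dataclass(frozen=True)
class Step:
    """One declarative transformation step (hypothetical, not Wrangle)."""
    op: str          # "split", "drop" or "filter_nonempty"
    column: str
    arg: str = ""


# The script says what to do, not how or where to run it.
SCRIPT: List[Step] = [
    Step("split", "name", " "),        # name -> name_1, name_2, ...
    Step("drop", "name"),
    Step("filter_nonempty", "city"),
]


def run_local(script: List[Step], rows: List[Row]) -> List[Row]:
    """Backend 1: execute the script in memory on a small sample."""
    out = [dict(r) for r in rows]      # copy so the input rows are not mutated
    for step in script:
        if step.op == "split":
            for r in out:
                parts = r[step.column].split(step.arg)
                for i, part in enumerate(parts, start=1):
                    r[f"{step.column}_{i}"] = part
        elif step.op == "drop":
            for r in out:
                r.pop(step.column, None)
        elif step.op == "filter_nonempty":
            out = [r for r in out if r.get(step.column)]
        else:
            raise ValueError(f"unknown operation: {step.op}")
    return out


def explain_for_cluster(script: List[Step]) -> List[str]:
    """Backend 2: render the same script as stage descriptions,
    a stand-in for generating a job for a cluster framework."""
    return [
        f"stage {i}: {s.op}({s.column!r}, {s.arg!r})"
        for i, s in enumerate(script, start=1)
    ]
```

Because the script carries no execution details, adding another backend means writing one more function over the same `Step` list rather than changing the script itself.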

With the immense technical talent of both organizations, I am looking forward to seeing what we’re able to build together and the impact it will have on big data. If you’re interested in learning more about Wrangle, Trifacta’s domain-specific language, or how our architecture makes it easy to plug into multiple data processing frameworks, I’ll be speaking on the topic next week with Joe Hellerstein at Strata New York. You can also stay tuned to the product page on our website.
