Open Variant Data Type in Delta Lake and Apache Spark
What you’ll learn
Variant is a new data type for storing semi-structured data; ingress and egress of hierarchical data through JSON will be supported. Without Variant, customers had to choose between flexibility and performance. To stay flexible, they stored JSON as strings in a single column. To get better performance, they enforced a strict schema with structs, which requires separate processes to maintain and update whenever the schema changes. With Variant, customers keep the flexibility (there is no need to define an explicit schema) and get substantially better performance than querying the JSON as a string.
Variant is particularly useful when the JSON sources have unknown or frequently changing schemas. For example, customers have shared endpoint detection and response (EDR) use cases that need to read and combine logs with different JSON schemas. Variant is equally well suited to ad-click and application telemetry, where the schema is unknown and changes all the time. In both cases, the flexibility of the Variant data type lets the data be ingested and queried efficiently without requiring an explicit schema.
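The idea of ingesting mixed-schema JSON once and then querying it by path can be sketched in plain Python. This is only an analogy for how Variant is used, not Spark's implementation: Spark's own Variant functions (such as `parse_json` and `variant_get`) are not called here, and the sample log lines, `ingest`, and `get_path` are hypothetical names made up for illustration.

```python
import json
from typing import Any, Optional

# Hypothetical log lines from sources with different JSON shapes.
RAW_LOGS = [
    '{"source": "edr", "host": {"name": "web-01"}, "process": {"pid": 4212}}',
    '{"source": "adclick", "user": {"id": 77}, "campaign": "spring"}',
    '{"source": "edr", "host": {"name": "db-02"}}',
]


def ingest(lines: list[str]) -> list[Any]:
    """Parse each line once at ingest time; no schema is declared up front."""
    return [json.loads(line) for line in lines]


def get_path(record: Any, path: str) -> Optional[Any]:
    """Return the value at a dotted path, or None if any step is missing."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


records = ingest(RAW_LOGS)

# The same path query runs over every record, whatever its shape.
host_names = [get_path(r, "host.name") for r in records]
pids = [get_path(r, "process.pid") for r in records]

print(host_names)  # ['web-01', None, 'db-02']
print(pids)        # [4212, None, None]
```

Records that lack a field simply yield `None` instead of failing, which is the behavior that makes combining logs with different schemas practical. Parsing once at ingest, rather than re-parsing a JSON string on every query, is the part of the sketch that stands in for the performance benefit described above.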