Sophia Sun

Big Data Software Engineer, Intel

Sophia Sun is a big data software engineer at Intel, focusing on Spark workload performance analysis and tuning. She has extensive experience analyzing and tuning big data benchmarks (such as TPC-DS, TPCx-BB, and HiBench) on large-scale clusters.

Past sessions

Field-Programmable Gate Arrays (FPGAs) are now widely used in data centers, and for a wide range of data center workloads, FPGA-enabled data centers have shown great potential for dramatic improvements in performance and energy efficiency.

How to efficiently integrate FPGAs to accelerate popular big data processing frameworks such as Apache Spark is therefore an interesting topic. In this talk, we present the feasibility of incorporating FPGA acceleration into Spark, based on Intel's recently announced FPGA Programmable Acceleration Cards (PACs) for Xeon servers, and use TPCx-HS, an industry-standard benchmark for big data systems, to show that acceleration is possible.

Through a step-by-step TPCx-HS case study, we demonstrate how a straightforward FPGA integration can deliver a 1.2x overall system speedup along with improved energy efficiency.

Session hashtag: #SAISEco1

People are creating, sharing, and storing data at a faster pace than ever before, so effective data compression and decompression can significantly reduce the cost of data usage. Apache Spark is a general-purpose distributed computing engine for big data analytics. At runtime it stores and shuffles large amounts of data across the cluster, so the choice of compression/decompression codec can affect end-to-end application performance in many ways.

However, there is a trade-off between storage size and compression/decompression throughput (CPU computation). Balancing compression speed against compression ratio is an interesting topic, particularly while both software algorithms and the CPU instruction set keep evolving. Apache Spark provides a very flexible compression codec interface with default implementations such as GZip, Snappy, LZ4, and ZSTD. The Intel Big Data Technologies team has also implemented more codecs for Apache Spark based on the latest Intel platforms, such as ISA-L (igzip), LZ4-IPP, Zlib-IPP, and ZSTD. In this session, we compare the characteristics of those algorithms and implementations by running micro workloads as well as end-to-end workloads on different generations of Intel x86 platforms and disks.
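The speed versus ratio trade-off described above can be illustrated with a small micro benchmark. This is only a sketch: it uses codecs from the Python standard library (`zlib`, `bz2`, `lzma`) as stand-ins, not the Spark or Intel codecs named in the abstract, and the `make_payload` and `benchmark` helpers and the synthetic CSV-like data are assumptions made for illustration. Absolute numbers will vary by machine and data.

```python
import bz2
import lzma
import time
import zlib


def make_payload(rows):
    """Build a deterministic, repetitive CSV-like byte string (hypothetical data)."""
    return b"".join(
        f"{i % 1000},user_{i % 97},click\n".encode() for i in range(rows)
    )


def measure(name, compress, decompress, payload):
    """Time one compress/decompress round trip and report ratio and throughput."""
    start = time.perf_counter()
    packed = compress(payload)
    compress_s = max(time.perf_counter() - start, 1e-9)

    start = time.perf_counter()
    restored = decompress(packed)
    decompress_s = max(time.perf_counter() - start, 1e-9)

    # A codec that does not round-trip is useless, so fail loudly.
    if restored != payload:
        raise ValueError(f"{name} did not round-trip the payload")

    size_mb = len(payload) / 1e6
    return {
        "codec": name,
        "ratio": len(payload) / len(packed),
        "compress_mb_s": size_mb / compress_s,
        "decompress_mb_s": size_mb / decompress_s,
    }


def benchmark(payload):
    """Run every stand-in codec over the same payload."""
    codecs = [
        ("zlib-1", lambda d: zlib.compress(d, 1), zlib.decompress),
        ("zlib-9", lambda d: zlib.compress(d, 9), zlib.decompress),
        ("bz2", bz2.compress, bz2.decompress),
        ("lzma", lzma.compress, lzma.decompress),
    ]
    return [measure(name, c, d, payload) for name, c, d in codecs]


if __name__ == "__main__":
    for row in benchmark(make_payload(50_000)):
        print(
            f"{row['codec']:>7}  ratio={row['ratio']:.1f}  "
            f"compress={row['compress_mb_s']:.1f} MB/s  "
            f"decompress={row['decompress_mb_s']:.1f} MB/s"
        )
```

In Spark itself, the codec used for internal data such as shuffle outputs is selected with the `spark.io.compression.codec` configuration property, so a similar comparison can be repeated at the workload level by changing that setting between runs.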

Choosing the proper compression/decompression codec for an application should be a best practice for big data software engineers. We will also present methodologies for measuring and tuning the performance bottlenecks of typical Apache Spark workloads.

Session hashtag: #Exp1SAIS