Spark performance tuning refers to the process of adjusting settings for the memory, cores, and instances used by the system. Tuning these settings helps Spark run efficiently and prevents resource bottlenecks.
To reduce memory usage, you may need to store Spark RDDs in serialized form. Data serialization also affects network performance. You can improve Spark performance by tuning how data is serialized and how memory is used.
Spark supports two serialization libraries. Java serialization is the default: it can handle any class that implements java.io.Serializable, but it is relatively slow and produces large serialized objects. Kryo serialization is significantly faster and more compact, but it does not support all Serializable types and requires you to register the classes you use for best performance.
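As a sketch, switching to Kryo is a configuration change; the setting names below are real Spark configuration keys, and the values are illustrative:

```
# spark-defaults.conf -- illustrative values
spark.serializer                 org.apache.spark.serializer.KryoSerializer
# Optional: fail fast when a class is used without being registered with Kryo
spark.kryo.registrationRequired  true
```

To actually keep cached data in serialized form, persist RDDs with a serialized storage level such as StorageLevel.MEMORY_ONLY_SER, which stores each partition as a single byte array.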
While tuning memory usage, three aspects stand out: the amount of memory used by your objects, the cost of accessing those objects, and the overhead of garbage collection.
One option for reducing memory consumption is to stay away from Java features that add overhead, such as pointer-based data structures and wrapper objects. Here are a few ways to do this:
Garbage collection can cause a large amount of "churn" when a program keeps many RDDs cached: the JVM must discard old objects to make room for new ones. Using data structures with fewer objects greatly reduces this cost; one example is employing an array of Ints instead of a linked list. Alternatively, you can store objects in serialized form, so there is only a single object (a byte array) per RDD partition.
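The JVM-specific example above is hard to demonstrate outside Spark, but the underlying principle, that a flat array of primitives uses far less memory than a collection of boxed objects, can be sketched in plain Python:

```python
import sys
from array import array

n = 10_000
boxed = list(range(n))        # a list of pointers to individually boxed ints
flat = array("i", range(n))   # one contiguous buffer of 32-bit ints

# Total footprint of the list: the pointer array plus every boxed int object.
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
# The array stores its values inline, so one measurement covers everything.
flat_bytes = sys.getsizeof(flat)

print(boxed_bytes, flat_bytes)  # the flat array is several times smaller
```

The same trade-off applies on the JVM: fewer, flatter objects also mean less work for the garbage collector.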
Efficient memory use is essential to good performance. Spark uses memory mainly for storage and execution. Storage memory is used to cache data that will be reused later; execution memory, on the other hand, is used for computation in shuffles, sorts, joins, and aggregations. Memory contention poses three challenges for Apache Spark: arbitrating memory between execution and storage, between tasks running in parallel, and between operators within the same task.
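Spark exposes the split between storage and execution memory through two real configuration keys; the values below are the defaults, shown for illustration:

```
# spark-defaults.conf -- defaults shown, tune for your workload
# Fraction of (heap - 300MB) shared by execution and storage (default 0.6)
spark.memory.fraction          0.6
# Fraction of that region protected from eviction by execution (default 0.5)
spark.memory.storageFraction   0.5
```

Within the shared region, execution can evict cached blocks down to the storage fraction, which is why the two workloads can borrow memory from each other.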
Instead of statically reserving memory in advance, you can deal with memory contention when it arises by forcing tasks to spill to disk.
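The idea behind spilling can be sketched with an external sort: when the in-memory buffer exceeds a budget, the sorted buffer is written to disk as a "run", and the runs are merged at the end. This is an illustration of the concept under a toy memory budget, not Spark's actual implementation:

```python
import heapq
import os
import tempfile

def external_sort(values, memory_budget=4):
    """Sort values while holding at most `memory_budget` items in memory,
    spilling sorted runs to temporary files on disk (a toy illustration)."""
    run_files = []
    buffer = []

    def spill():
        # Write the sorted in-memory buffer to a temp file, then clear it.
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        for v in sorted(buffer):
            f.write(f"{v}\n")
        f.seek(0)
        run_files.append(f)
        buffer.clear()

    for v in values:
        buffer.append(v)
        if len(buffer) >= memory_budget:
            spill()
    if buffer:
        spill()

    # Merge the sorted runs lazily; only one value per run is held in memory.
    runs = [(int(line) for line in f) for f in run_files]
    result = list(heapq.merge(*runs))
    for f in run_files:
        f.close()
        os.unlink(f.name)
    return result

print(external_sort([5, 3, 8, 1, 9, 2, 7], memory_budget=3))
# -> [1, 2, 3, 5, 7, 8, 9]
```

Spark's shuffle and aggregation operators apply the same principle: rather than failing when memory runs out, they spill partial results to disk and merge them later.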