Spark executor OOM: not enough memory to build hash map
Background
One of the pipelines I maintain failed last week: the first attempt failed, but the retry succeeded. The pipeline’s structure is quite simple; like most ETL pipelines, it consists of 3 steps (a rough PySpark sketch follows the list):
- Load data from database
- Broadcast join a 2GiB dataset
- Write to delta table
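For context, here is a minimal PySpark sketch of the three steps. The JDBC source, table names, and join key below are made-up placeholders, not the real pipeline code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Step 1: load data from the database (hypothetical JDBC source)
facts = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/warehouse")  # placeholder
    .option("dbtable", "public.fact_events")                    # placeholder
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Step 2: broadcast join against the ~2GiB dimension dataset
dim = spark.read.table("silver.dim_accounts")                   # placeholder
joined = facts.join(broadcast(dim), on="account_id", how="left")

# Step 3: write the result to a Delta table
joined.write.format("delta").mode("overwrite").saveAsTable("gold.enriched_events")
```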
The error message was “There is not enough memory to build the hash map”, and it happened in step 2.
Environment Details
Databricks Runtime: 11.3 LTS, Scala 2.12, Spark 3.3.0
Cluster: (AWS EC2 instances)
Driver: r5.xlarge (32GiB memory, 4 vcores)
Workers: 4 × i3en.xlarge (32GiB memory, 4 vcores each)
Spark config (selected):
spark.driver.memory 24g
spark.driver.maxResultSize 12g
spark.executor.cores 2
spark.executor.memory 12g
spark.executor.memoryOverhead 2048
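To put these numbers in perspective, here is a back-of-the-envelope calculation of the on-heap unified memory pool each executor gets (the pool shared by execution and storage), assuming Spark’s default spark.memory.fraction of 0.6 and the fixed 300 MiB of reserved memory:

```python
# Usable unified pool per executor, per Spark's documented formula:
#   (spark.executor.memory - 300 MiB reserved) * spark.memory.fraction
executor_memory_mib = 12 * 1024   # spark.executor.memory = 12g
reserved_mib = 300                # fixed reserved memory
memory_fraction = 0.6             # spark.memory.fraction (default)

unified_pool_mib = (executor_memory_mib - reserved_mib) * memory_fraction
print(f"execution + storage pool ≈ {unified_pool_mib / 1024:.1f} GiB per executor")
# ≈ 7.0 GiB: the broadcast hash map (execution memory) and any cached
# blocks (storage memory) all have to fit in here.
```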
Observation
Detailed error log and traceback:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 312 in stage 189.0 failed 4 times, most recent failure: Lost task 312.3 in stage 189.0 (TID 2483) (10.241.80.62 executor 11): org.apache.spark.sql.execution.OutOfMemorySparkException: There is not enough memory to build the hash map
Caused by: org.apache.spark.sql.execution.OutOfMemorySparkException: There is not enough memory to build the hash map
Okay, the first and most important question is:
Where did the OOM happen? On the driver, or on an executor?
If you’re not familiar with Spark’s architecture, this is hard to answer. But an engineer with basic JVM knowledge can still answer it (and, even better, back the answer with solid evidence). We can check the driver’s and the executors’ stderr and stdout logs respectively: if there are a lot of GC records in a process’s stdout log, we can confirm that that process is under memory pressure.
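As a rough illustration, here is a small Python helper that scans a stdout log for GC records near the failure time; the log path is a made-up placeholder (in practice I just read the logs in the Databricks cluster UI):

```python
import re

LOG_PATH = "/dbfs/cluster-logs/example/stdout"  # placeholder path

# ParallelGC stdout lines start with an ISO timestamp, e.g.
# "2023-08-02T13:44:02.190+0000: [Full GC (Ergonomics) ..."
GC_LINE = re.compile(r"^(\S+): \[(Full GC|GC)")

with open(LOG_PATH) as f:
    gc_events = [m.groups() for line in f if (m := GC_LINE.match(line))]

# A burst of (Full) GC events right before the failure timestamp is a
# strong hint that this process was under memory pressure.
for timestamp, kind in gc_events[-5:]:
    print(timestamp, kind)
```

With that in mind, let’s look at the actual logs.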
Driver side stderr log
ERROR: [2023-08-02 13:44:48,551][*******][:91]: pipeline failed
Traceback (most recent call last):
**********************************
py4j.protocol.Py4JJavaError: An error occurred while calling o15767.saveAsTable.
......
Caused by org.apache.spark.sql.execution.OutOfMemorySparkException: There is not enough memory to build the hash map
Driver side stdout log
Unfortunately, there are no GC records right before 13:44:48; the most recent GC record before the error was raised is at 13:35:22. Besides, there is plenty of free memory in ParOldGen. So the driver does not look memory-constrained.
Executor side stderr log
23/08/02 13:43:57 INFO MemoryStore: Block rdd_4467_312 stored as values in memory (estimated size 3.1 MiB, free 3.9 GiB)
23/08/02 13:44:04 INFO TaskMemoryManager: 2112293120 bytes of memory are used for execution and 2458893004 bytes of memory are used for storage
23/08/02 13:44:05 ERROR Executor: Exception in task 312.0 in stage 189.0 (TID 2459)
org.apache.spark.sql.execution.OutOfMemorySparkException: There is not enough memory to build the hash map
23/08/02 13:44:38 ERROR Executor: Exception in task 311.0 in stage 189.0 (TID 2458)
org.apache.spark.sql.execution.OutOfMemorySparkException: There is not enough memory to build the hash map
23/08/02 13:44:45 INFO TaskMemoryManager: 2110837364 bytes of memory are used for execution and 2461804515 bytes of memory are used for storage
23/08/02 13:44:45 ERROR Executor: Exception in task 312.3 in stage 189.0 (TID 2483)
org.apache.spark.sql.execution.OutOfMemorySparkException: There is not enough memory to build the hash map
Okay, we see a lot of errors here, all with the same “There is not enough memory to build the hash map” message, and they were raised just a minute before the pipeline failed!
Executor side stdout log
2023-08-02T13:43:12.466+0000: [GC (Allocation Failure) [PSYoungGen: 2350233K->285749K(2410496K)] 4691094K->2626610K(6597632K), 0.0487498 secs] [Times: user=0.18 sys=0.00, real=0.05 secs]
2023-08-02T13:44:01.866+0000: [GC (Allocation Failure) [PSYoungGen: 2099028K->410926K(2380288K)] 4439889K->4259124K(6567424K), 0.3241892 secs] [Times: user=0.79 sys=0.47, real=0.32 secs]
2023-08-02T13:44:02.190+0000: [Full GC (Ergonomics) [PSYoungGen: 410926K->0K(2380288K)] [ParOldGen: 3848197K->3860421K(5868544K)] 4259124K->3860421K(8248832K), [Metaspace: 99036K->96953K(1157120K)], 0.4974180 secs] [Times: user=1.55 sys=0.04, real=0.50 secs]
2023-08-02T13:46:39.501+0000: [GC (Allocation Failure) [PSYoungGen: 1969152K->85311K(2463232K)] 5829573K->3945740K(8331776K), 0.0275780 secs] [Times: user=0.09 sys=0.00, real=0.03 secs]
There are GC records on the executor side right before 13:44:48, and not much memory is left in either PSYoungGen or ParOldGen. We can confirm that the executor is under memory pressure.
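For readers who don’t stare at ParallelGC logs often: the bracketed numbers read as before->after(capacity) in KiB. A small worked example decoding the Full GC line at 13:44:02 above:

```python
# ParOldGen: 3848197K->3860421K(5868544K) from the 13:44:02 Full GC line
old_before_k, old_after_k, old_cap_k = 3_848_197, 3_860_421, 5_868_544

print(f"ParOldGen after Full GC: {old_after_k / old_cap_k:.0%} full")
print(f"Change during Full GC: {(old_after_k - old_before_k) / 1024:+.0f} MiB")
# The old generation actually *grew* slightly during a Full GC, i.e. nearly
# everything in it was still live and nothing could be reclaimed.
```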
Root Cause
The root cause is that Executor Side Broadcast Join (EBJ) was introduced in Databricks Runtime 11.3 LTS. With it, the memory pressure of a broadcast join moves from the driver to the executors. This post explains it very well: Job fails with “not enough memory to build the hash map” error
However, the post doesn’t explicitly suggest adding more memory to the executors, which may be confusing for beginners. And for engineers who were used to driver-side broadcast joins before Databricks Runtime 11.3 LTS (e.g. me and my colleagues), the post itself can be confusing. To confirm that it is an executor memory issue, we’d better double-check the stdout and GC logs.
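Besides reading the logs, a quick way to confirm that the join really runs as a broadcast hash join is to inspect the physical plan; and if giving executors more memory is not an option, falling back to a shuffle-based join is a standard (slower) workaround. A minimal sketch, assuming `joined` is the DataFrame from step 2:

```python
# A BroadcastExchange / BroadcastHashJoin node in the physical plan confirms
# the dimension table is broadcast and the hash map is built in memory.
joined.explain()

# Workaround: disable automatic broadcast so Spark falls back to a
# sort-merge join (note: an explicit broadcast() hint would still force
# the broadcast, so the hint has to be removed as well).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```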
We fixed it by switching the workers to i3en.3xlarge (96 GiB memory, 12 vcores), setting spark.executor.cores to 3 and spark.executor.memory to 20g. The pipeline has run smoothly without issues since.
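Plugging the new setting into the same back-of-the-envelope formula as above (again assuming the default spark.memory.fraction of 0.6) shows why this helps: the unified pool per executor roughly grows from ~7 GiB to ~12 GiB.

```python
def unified_pool_gib(executor_memory_mib, fraction=0.6, reserved_mib=300):
    """Approximate on-heap execution + storage pool, in GiB."""
    return (executor_memory_mib - reserved_mib) * fraction / 1024

print(f"before: {unified_pool_gib(12 * 1024):.1f} GiB  (executor.memory = 12g)")
print(f"after:  {unified_pool_gib(20 * 1024):.1f} GiB  (executor.memory = 20g)")
```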
Lessons Learned
- Use the stdout/stderr logs to pinpoint where an OOM happened (driver or executor): frequent GC records in a process’s stdout log mean that process is under memory pressure.