scala - Spark: Incremental collect() to a partition causes OutOfMemory in Heap


I have the following code. I need to print the RDD to the console, so I am collecting the large RDD in smaller chunks by collecting it one partition at a time, to avoid collecting the entire RDD at once. However, when I monitor the heap and the GC log, it seems nothing is ever being GC'd: the heap keeps growing until it hits an OutOfMemory error. If my understanding is correct, once the println statement below has executed for a collected chunk, that chunk is no longer needed and should be safe to GC, but that is not what I see in the GC log; each call to collect accumulates until OOM. Does anyone know why the collected data is not being GC'd?

 val writes = partitions.foreach { partition =>
   val rddPartition = rdds.mapPartitionsWithIndex({
     case (index, data) => if (index == partition.index) data else Iterator[Words]()
   }, false).collect().toSeq
   val partialReport = Report(rddPartition, reportId, dateCreated)
   println(partialReport.name)
 }

If the dataset is huge, the master (driver) node can't handle it and will shut down. You may try writing the partitions out to files instead (e.g. with saveAsTextFile) and then reading each file back one at a time.
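A minimal sketch of that idea, assuming an RDD of strings standing in for the question's RDD and a hypothetical output directory; saveAsTextFile writes one part-file per partition, and each part-file is then read back and collected individually so only one chunk is ever held on the driver (the Report construction from the question is omitted):

 import org.apache.spark.sql.SparkSession
 import org.apache.hadoop.fs.{FileSystem, Path}

 object PartitionedReport {
   def main(args: Array[String]): Unit = {
     val spark = SparkSession.builder().appName("PartitionedReport").getOrCreate()
     val sc = spark.sparkContext

     // Hypothetical RDD standing in for `rdds` from the question.
     val rdds = sc.parallelize(Seq("alpha", "beta", "gamma"), numSlices = 3)

     // Write each partition as its own part-file instead of collect()-ing it all.
     val outputDir = "/tmp/words-report" // assumed path, adjust as needed
     rdds.saveAsTextFile(outputDir)

     // List the part-files (one per partition of the original RDD).
     val fs = FileSystem.get(sc.hadoopConfiguration)
     val partFiles = fs.listStatus(new Path(outputDir))
       .map(_.getPath)
       .filter(_.getName.startsWith("part-"))
       .sortBy(_.getName)

     // Read the part-files back one at a time, so only a single chunk
     // of data is materialized on the driver at any moment.
     partFiles.foreach { path =>
       val chunk = sc.textFile(path.toString).collect().toSeq
       println(s"${path.getName}: ${chunk.size} records")
     }

     spark.stop()
   }
 }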

