Hadoop is great for dealing with large datasets, however sometimes it struggles with really big datasets. If the number of reducers is quite small then seeing outOfMemoryErrors isn’t all that uncommon. Fixing this can be as simple as increasing the number of reduces available, however that isn’t always an option.
I’m currently working with a week’s worth of data from the Twitter gardenhose, which totals around 500GB of uncompressed JSON data. I’m using Hadoop to filter this down, removing obvious spam and non-English tweets. Doing this is quick and easy with Hadoop, however it’s not so easy keeping them all in order.
Although Hadoop sorts the output of each reducer, it does not sort the output globally. This means that if I have 16 reducers, then the output is 16 files, each one containing sorted tweets from across the entire dataset.
If I want all of the data in single file, and in order, then the most obvious solution would appear to be a single reducer. However, try that and you’ll probably end up with an OutOfMemoryError.
Map output copy failure: java.lang.OutOfMemoryError: Java heap space
What’s the problem and how do I fix it?
By default, Hadoop tries to buffer 70% of the data from a mapper before it starts sorting; however when you’re dealing with a massive dataset, 70% is massive too.
The solution seems to be reducing the input buffer percentage in Hadoop’s mapred-site.xml. You’ll probably have to add the property yourself, as I don’t believe the default mapred-site.xml includes it.
<property> <name>mapred.job.shuffle.input.buffer.percent</name> <value>0.20</value> </property>
20% seems to be a good place to start, although you may have to tweak the value depending on the size of your dataset and the amount of RAM available. A larger buffer means quicker reduces, so it’s worth spending some time tweaking the value for the best performance.