Method for finding memory leak in large Java heap dumps
This answer expands upon @Will-Hartung's. I applied to same process to diagnose one of my memory leaks and thought that sharing the details would save other people time.
The idea is to have postgres 'plot' time vs. memory usage of each class, draw a line that summarizes the growth and identify the objects that are growing the fastest:
^
|
s | Legend:
i | * - data point
z | -- - trend
e |
( |
b | *
y | --
t | --
e | * -- *
s | --
) | *-- *
| -- *
| -- *
--------------------------------------->
time
Convert your heap dumps (need multiple) into a format this is convenient for consumption by postgres from the heap dump format:
num #instances #bytes class name
----------------------------------------------
1: 4632416 392305928 [C
2: 6509258 208296256 java.util.HashMap$Node
3: 4615599 110774376 java.lang.String
5: 16856 68812488 [B
6: 278914 67329632 [Ljava.util.HashMap$Node;
7: 1297968 62302464
...
To a csv file with a the datetime of each heap dump:
2016.09.20 17:33:40,[C,4632416,392305928
2016.09.20 17:33:40,java.util.HashMap$Node,6509258,208296256
2016.09.20 17:33:40,java.lang.String,4615599,110774376
2016.09.20 17:33:40,[B,16856,68812488
...
Using this script:
# Example invocation: convert.heap.hist.to.csv.pl -f heap.2016.09.20.17.33.40.txt -dt "2016.09.20 17:33:40" >> heap.csv
my $file;
my $dt;
GetOptions (
"f=s" => \$file,
"dt=s" => \$dt
) or usage("Error in command line arguments");
open my $fh, '<', $file or die $!;
my $last=0;
my $lastRotation=0;
while(not eof($fh)) {
my $line = <$fh>;
$line =~ s/\R//g; #remove newlines
# 1: 4442084 369475664 [C
my ($instances,$size,$class) = ($line =~ /^\s*\d+:\s+(\d+)\s+(\d+)\s+([\$\[\w\.]+)\s*$/) ;
if($instances) {
print "$dt,$class,$instances,$size\n";
}
}
close($fh);
Create a table to put the data in
CREATE TABLE heap_histogram (
histwhen timestamp without time zone NOT NULL,
class character varying NOT NULL,
instances integer NOT NULL,
bytes integer NOT NULL
);
Copy the data into your new table
\COPY heap_histogram FROM 'heap.csv' WITH DELIMITER ',' CSV ;
Run the slop query against size (num of bytes) query:
SELECT class, REGR_SLOPE(bytes,extract(epoch from histwhen)) as slope
FROM public.heap_histogram
GROUP BY class
HAVING REGR_SLOPE(bytes,extract(epoch from histwhen)) > 0
ORDER BY slope DESC
;
Interpret the results:
class | slope
---------------------------+----------------------
java.util.ArrayList | 71.7993806279174
java.util.HashMap | 49.0324576155785
java.lang.String | 31.7770770326123
joe.schmoe.BusinessObject | 23.2036817108056
java.lang.ThreadLocal | 20.9013528767851
The slope is bytes added per second (since the unit of epoch is in seconds). If you use instances instead of size, then that's the number of instances added per second.
My one of the lines of code creating this joe.schmoe.BusinessObject was responsible for the memory leak. It was creating the object, appending it to an array without checking if it already existed. The other objects were also created along with the BusinessObject near the leaking code.
It's almost impossible without some understanding of the underlying code. If you understand the underlying code, then you can better sort the wheat from chaff of the zillion bits of information you are getting in your heap dumps.
Also, you can't know if something is a leak or not without know why the class is there in the first place.
I just spent the past couple of weeks doing exactly this, and I used an iterative process.
First, I found the heap profilers basically useless. They can't analyze the enormous heaps efficiently.
Rather, I relied almost solely on jmap histograms.
I imagine you're familiar with these, but for those not:
jmap -histo:live <pid> > histogram.out
creates a histogram of the live heap. In a nutshell, it tells you the class names, and how many instances of each class are in the heap.
I was dumping out heap regularly, every 5 minutes, 24hrs a day. That may well be too granular for you, but the gist is the same.
I ran several different analyses on this data.
I wrote a script to take two histograms, and dump out the difference between them. So, if java.lang.String was 10 in the first dump, and 15 in the second, my script would spit out "5 java.lang.String", telling me it went up by 5. If it had gone down, the number would be negative.
I would then take several of these differences, strip out all classes that went down from run to run, and take a union of the result. At the end, I'd have a list of classes that continually grew over a specific time span. Obviously, these are prime candidates for leaking classes.
However, some classes have some preserved while others are GC'd. These classes could easily go up and down in overall, yet still leak. So, they could fall out of the "always rising" category of classes.
To find these, I converted the data in to a time series and loaded it in a database, Postgres specifically. Postgres is handy because it offers statistical aggregate functions, so you can do simple linear regression analysis on the data, and find classes that trend up, even if they aren't always on top of the charts. I used the regr_slope function, looking for classes with a positive slope.
I found this process very successful, and really efficient. The histograms files aren't insanely large, and it was easy to download them from the hosts. They weren't super expensive to run on the production system (they do force a large GC, and may block the VM for a bit). I was running this on a system with a 2G Java heap.
Now, all this can do is identify potentially leaking classes.
This is where understanding how the classes are used, and whether they should or should not be their comes in to play.
For example, you may find that you have a lot of Map.Entry classes, or some other system class.
Unless you're simply caching String, the fact is these system classes, while perhaps the "offenders", are not the "problem". If you're caching some application class, THAT class is a better indicator of where your problem lies. If you don't cache com.app.yourbean, then you won't have the associated Map.Entry tied to it.
Once you have some classes, you can start crawling the code base looking for instances and references. Since you have your own ORM layer (for good or ill), you can at least readily look at the source code to it. If you ORM is caching stuff, it's likely caching ORM classes wrapping your application classes.
Finally, another thing you can do, is once you know the classes, you can start up a local instance of the server, with a much smaller heap and smaller dataset, and using one of the profilers against that.
In this case, you can do unit test that affects only 1 (or small number) of the things you think may be leaking. For example, you could start up the server, run a histogram, perform a single action, and run the histogram again. You leaking class should have increased by 1 (or whatever your unit of work is).
A profiler may be able to help you track the owners of that "now leaked" class.
But, in the end, you're going to have to have some understanding of your code base to better understand what's a leak, and what's not, and why an object exists in the heap at all, much less why it may be being retained as a leak in your heap.
Take a look at Eclipse Memory Analyzer. It's a great tool (and self contained, does not require Eclipse itself installed) which 1) can open up very large heaps very fast and 2) has some pretty good automatic detection tools. The latter isn't perfect, but EMA provides a lot of really nice ways to navigate through and query the objects in the dump to find any possible leaks.
I've used it in the past to help hunt down suspicious leaks.