I recently spent some time investigating which monitoring tools can help solve production issues. Here are some thoughts on the subject.
Issues I usually run into in production and that I'd like to solve
- Memory footprint/leaks (caches, collections).
- Threads/pools/executors. They should have a limited, preferably configurable, number of threads.
- Database connections/throughput.
- Number of different requests coming into the system.
- The most CPU consuming tasks.
- Application availability (whether or not it is started and reachable)
These are the issues that come up in production and are usually hard to solve.
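To make the thread-pool item concrete, here is a minimal sketch of an executor whose thread count is bounded and read from configuration. The class name, the property name app.worker.threads and the queue size are my own assumptions for illustration, not something taken from a particular application.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedPools {
    /** Builds a pool that never runs more than maxThreads threads at once. */
    static ThreadPoolExecutor newBoundedPool(int maxThreads, int queueCapacity) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads,                          // core == max: hard upper bound
                60L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<Runnable>(queueCapacity), // bounded backlog
                new ThreadPoolExecutor.CallerRunsPolicy());      // caller runs the task when the queue is full
        pool.allowCoreThreadTimeOut(true);                       // idle threads are released after 60 seconds
        return pool;
    }

    /** Reads the limit from a system property (hypothetical name), falling back to a default. */
    static int configuredThreads(String property, int defaultValue) {
        return Integer.getInteger(property, defaultValue);
    }

    public static void main(String[] args) {
        int threads = configuredThreads("app.worker.threads", 4);
        ThreadPoolExecutor pool = newBoundedPool(threads, 100);
        pool.execute(() -> System.out.println("running on " + Thread.currentThread().getName()));
        pool.shutdown();
        System.out.println("max threads = " + pool.getMaximumPoolSize());
    }
}
```

With a bounded queue and the caller-runs policy, overload slows the submitting thread down instead of growing the number of threads or the backlog without limit.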
Monitoring tools
Here is a list of the frameworks, tools, and suites I have found and looked at so far:
- JConsole
- VisualVM
- Lambda probe
- GlassBox
- JAMon
- Spring AMS/Hyperic HQ
- JXInsight
Let me describe them and point out some key features in the context of the issues listed above.
JConsole
JConsole provides access to a running VM through various kinds of MBeans/MXBeans. The following information can be extracted from the running VM:
System information
- Threads (peak, current number, and so on)
- Memory (heap/non-heap)
- Loaded classes
- OS state (operating system vendor, system properties, paths, and so on)
- Garbage collector info
MBeans:
- Custom application MBeans
- VM MBeans (runtime, threading, memory pools, garbage collector)
In most cases this information is enough to identify the problem in general terms, i.e. whether the issue is a memory leak or perhaps an over-threaded application. It is a baseline that one can get quickly and for free, but it won't give you any details about application bottlenecks.
JConsole supports pluggable modules that are simple to write and integrate into it.
Looking ahead, I would say that this tool already gives enough data and, together with some of the applications provided by Sun, could be the best choice for monitoring and finding all kinds of bottlenecks.
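The figures JConsole displays come from the platform MXBeans, so they can also be read directly from code. Below is a small sketch using the standard java.lang.management API; the class name and the output format are my own choices for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

final class JvmSnapshot {
    /** Formats the basic thread, memory, class-loading, OS and GC figures of this JVM. */
    static String describe() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        StringBuilder out = new StringBuilder();
        out.append("threads.current=").append(threads.getThreadCount()).append('\n');
        out.append("threads.peak=").append(threads.getPeakThreadCount()).append('\n');
        out.append("heap.used=").append(memory.getHeapMemoryUsage().getUsed()).append('\n');
        out.append("nonheap.used=").append(memory.getNonHeapMemoryUsage().getUsed()).append('\n');
        out.append("classes.loaded=")
           .append(ManagementFactory.getClassLoadingMXBean().getLoadedClassCount()).append('\n');
        out.append("os.name=")
           .append(ManagementFactory.getOperatingSystemMXBean().getName()).append('\n');
        // One line per collector: number of collections and accumulated time in milliseconds
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            out.append("gc.").append(gc.getName())
               .append(".count=").append(gc.getCollectionCount())
               .append(" timeMs=").append(gc.getCollectionTime()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Logging such a snapshot periodically gives a rough history even when nobody has a console attached.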
VisualVM
A heavyweight framework based on the NetBeans API, which it therefore requires in order to start. At the same time, it is very flexible in how data is presented. Pluggable modules allow any monitored data to be displayed the way you like best (charts, histograms, textual views). But this approach just gives you yet another presentation layer for the same information you can get with JConsole.
As I understand it, VisualVM does not offer much over JConsole, except perhaps some integrated CPU/memory profiling features, and it won't help you find database bottlenecks.
JAMon
JAMon is not a tool but rather a monitoring framework that wraps your code with proxy objects and logs execution time. It can wrap almost any call or object, even database ones, and thus provides a comprehensive view of what happens inside the application. It also has a very simple user interface based on a few servlets packaged in a WAR file.
One has to either change the code and wrap every monitored place with JAMon classes, or use aspect pointcuts to instrument the code at runtime. Both ways have pros and cons. It would be great not to change the code (avoiding any dependency on JAMon) and at the same time not to lose performance (and memory) to runtime instrumentation.
This framework can be used instead of a profiler even in production, but only in very exceptional cases when we know exactly where the issue happens.
JAMon coding example:
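import com.jamonapi.Monitor;
import com.jamonapi.MonitorFactory;
...
Monitor mon = MonitorFactory.start("myMonitor");
try {
    // ... code being timed ...
} finally {
    mon.stop(); // stop the monitor even if the timed code throws
}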
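To illustrate the proxy-based approach without touching the monitored code, here is a sketch built only on the JDK's java.lang.reflect.Proxy. It is not JAMon's implementation; the class name and the two maps are hypothetical stand-ins for a real monitor store.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class TimingProxy implements InvocationHandler {
    /** Call count and total nanoseconds per method name. */
    static final Map<String, AtomicLong> HITS = new ConcurrentHashMap<>();
    static final Map<String, AtomicLong> TOTAL_NANOS = new ConcurrentHashMap<>();

    private final Object target;

    private TimingProxy(Object target) {
        this.target = target;
    }

    /** Returns a proxy that implements iface, delegates to target and times every call. */
    @SuppressWarnings("unchecked")
    static <T> T wrap(T target, Class<T> iface) {
        return (T) Proxy.newProxyInstance(
                TimingProxy.class.getClassLoader(), new Class<?>[] {iface}, new TimingProxy(target));
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        long start = System.nanoTime();
        try {
            return method.invoke(target, args);
        } catch (InvocationTargetException e) {
            throw e.getCause(); // rethrow the exception thrown by the real object
        } finally {
            long elapsed = System.nanoTime() - start;
            HITS.computeIfAbsent(method.getName(), k -> new AtomicLong()).incrementAndGet();
            TOTAL_NANOS.computeIfAbsent(method.getName(), k -> new AtomicLong()).addAndGet(elapsed);
        }
    }

    public static void main(String[] args) {
        Runnable work = () -> System.out.println("work");
        Runnable timed = wrap(work, Runnable.class);
        timed.run();
        System.out.println("run: hits=" + HITS.get("run") + " totalNanos=" + TOTAL_NANOS.get("run"));
    }
}
```

The trade-off is the one described above: the calling code stays free of monitoring calls, but every call pays for reflection, and only calls made through an interface can be wrapped.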
Lambda Probe
This one suits application servers much better (it is even designed for them). Lambda Probe can be easily integrated into a web container or application server and shows additional container-specific data such as database connections, running servlets, thread pools, and even a particular thread in one of them.
So it looks like it could help with database issues, but in practice it only gives the number of active connections the application has. It does not show MBeans and should be used together with JConsole.
GlassBox
A simple and probably useful monitoring application, but the lack of documentation makes it hard to dig into.
It runs as a wrapper around the web container/app server and, as far as I can see, instruments everything at runtime using AspectJ. The main idea is to intercept all possible Java calls and then filter out those that are not interesting from a performance point of view. But this makes execution slower (mostly at startup, but also at runtime, when the application accesses a particular class for the first time). It also consumes a lot of memory to instrument all classes.
The tool's developers make performance assumptions based on internal criteria, and there is no way to configure them. The framework has only a few maintainers (3-5), and the last commit to their SVN was about a month ago. So I wouldn't recommend using it.
Spring AMS
The most powerful suite (an application management suite), based on Hyperic HQ, a leading monitoring framework.
Features:
- Joins together all application information under the same roof.
- Different applications can be grouped to provide useful views.
- Physical box availability with a lot of operating system specific parameters.
- Depicts comprehensive Spring based details (contexts, executor services, db connections – everything that can be declared in Spring configuration).
- Has integration with almost all application servers (Tomcat, WebLogic, WebSphere, GlassFish, Spring DM)
- Configurable alerts can report failures, low memory, or overloaded CPUs in time.
At the same time it is:
- Heavyweight (AMS requires a server to be installed along with a PostgreSQL database, plus agents set up on every monitored box)
- Proprietary
- Complicated, detail-overloaded web-based console.
Besides the basics provided by JConsole, Spring AMS gives some additional useful details about Spring-based applications, such as database commits/rollbacks (overall, average). The better your integration with Spring, the more information can be shown on the console.
One can even expose all application MBeans by changing the MBeans domain to “spring.application” (this is a Spring requirement and the way they define what should be shown in the console), thus adding application-specific metrics.
Another requirement is to use compile-time instrumented Spring files. Spring lets you download all the libraries from their site and use the instrumented versions. They also provide instrumented logging, Hibernate, collections, and Ehcache. So it is oriented toward Spring applications, but IMHO we have a lot of them.
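As an illustration of an application-specific MBean placed in that domain, here is a sketch using plain JMX. Only the domain name comes from the text above; the class names, the type=RequestStats key and the request counter are assumptions of mine, and I have not verified which key properties the AMS console actually expects.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicInteger;
import javax.management.MBeanServer;
import javax.management.ObjectName;

final class AppMetrics {
    /** Standard MBean contract: the interface name is the class name plus "MBean". */
    public interface RequestStatsMBean {
        int getRequestCount();
    }

    public static final class RequestStats implements RequestStatsMBean {
        private final AtomicInteger requests = new AtomicInteger();

        public void recordRequest() {
            requests.incrementAndGet();
        }

        @Override
        public int getRequestCount() {
            return requests.get();
        }
    }

    /** Registers the bean in the platform MBean server under the given domain. */
    static ObjectName register(RequestStats stats, String domain) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName(domain + ":type=RequestStats"); // key property is an assumption
            if (server.isRegistered(name)) {
                server.unregisterMBean(name); // replace an earlier registration
            }
            server.registerMBean(stats, name);
            return name;
        } catch (Exception e) {
            throw new IllegalStateException("cannot register MBean", e);
        }
    }

    /** Reads an attribute the same way a JMX console would. */
    static Object read(ObjectName name, String attribute) {
        try {
            return ManagementFactory.getPlatformMBeanServer().getAttribute(name, attribute);
        } catch (Exception e) {
            throw new IllegalStateException("cannot read " + attribute, e);
        }
    }

    public static void main(String[] args) {
        RequestStats stats = new RequestStats();
        stats.recordRequest();
        ObjectName name = register(stats, "spring.application");
        System.out.println(name + " RequestCount=" + read(name, "RequestCount"));
    }
}
```

Any JMX client, JConsole included, will then list the bean under the spring.application domain.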
JXInsight
This product is mostly oriented toward the development phase but can be used in production as well.
Pros:
- Integration with a lot of frameworks and products
- Support for distributed environment
- Probes for metering and traces for capturing paths (stack traces). These are common ways to detect CPU consumption and hotspots, but very feature-rich ones.
- True JDBC monitoring at the transaction level, with detection of long-running statements.
- Allows offline analysis by taking snapshots at runtime.
- High-resolution clock.
Cons:
- Proprietary.
- JDBC monitoring is not recommended for production.
- Adds overhead through its own Java agent and instrumentation.
- Does not support alerts and thus needs to be watched all the time.
So it won't provide database activity monitoring in production. The only useful things are the probes that meter resource consumption across different customizable groups (read: packages/classes).
Profiling
Memory footprints/leaks
Why is memory so important? Simply because system resources are limited; but even if you have enough hardware resources, the Java GC can take time and thus slow down your application.
Let's assume that we know the issue (whatever tool we used gave us enough information to decide) and it is a memory footprint/leak or application over-threading. The next step is to determine where exactly the problem occurs: which point in the code, or at least which class, causes it.
Starting with Java 5, Sun provides a set of very convenient tools to identify it. First of all, there is a memory dumper: a tool called “jmap” produces a memory dump for a given process id (the only issue is that it fails on JDK versions up to 5.0_14). It works very fast, especially if it dumps “live” objects only, and can produce a 4 GB heap dump in roughly a couple of minutes. The dump is a memory snapshot with all objects and the references between them, so it can be analyzed later to find an excessive number of threads or other objects.
Another case is an out-of-memory error (OOME). Here it is recommended to start the production application with the Java parameter -XX:+HeapDumpOnOutOfMemoryError, which takes a memory snapshot right when the error occurs, so we still have a dump file to analyze for leaks.
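Besides jmap and the OOME flag, a HotSpot JVM can also write a dump from inside the process. The sketch below relies on the HotSpot-specific com.sun.management.HotSpotDiagnosticMXBean, so it assumes a HotSpot-based JVM (Java 7 or later for the lookup call used here); the class name and the file location are my own choices.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;

final class HeapDumper {
    /**
     * Writes a heap dump of the current JVM to the given file.
     * liveOnly=true dumps only reachable objects, like jmap's "live" option.
     */
    static File dump(File target, boolean liveOnly) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(target.getAbsolutePath(), liveOnly); // fails if the file already exists
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Recent JDKs require the .hprof extension for the dump file
        File out = new File(System.getProperty("java.io.tmpdir"),
                "heap-" + System.currentTimeMillis() + ".hprof");
        System.out.println("dump written to " + dump(out, true));
    }
}
```

The resulting .hprof file can be opened in the same analysis tools as a dump taken with jmap.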
CPU overloading
This is the most complicated issue because a spike usually lasts only a few seconds and is hard to catch.
But let's assume that the application consumes 100% of the CPU and we see it in our monitoring tool. In this case we can go through the list of active threads and find the ones responsible. The thread name should point us to the place in the code that caused the problem.
In most cases this means that the application uses all hardware resources and needs to be either optimized or scaled. Talking about scalability, we should remember its two types. One can scale up the application by adding more power to the same box, or scale out by moving some calculations off the original box and thus building grids (of both types, computational and data).
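Going through the threads to find the busy ones can also be done from code with the standard ThreadMXBean. This is a sketch under the assumption that the JVM supports per-thread CPU time; the class name and the output format are illustrative. Note that it reports CPU time accumulated since each thread started, so to see who is busy right now one would take two samples and compare them.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

final class BusyThreads {
    /** Returns "threadName cpuMs" lines for live threads, highest CPU time first. */
    static List<String> byCpuTime() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> out = new ArrayList<>();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return out; // nothing to report on this JVM
        }
        List<long[]> rows = new ArrayList<>(); // each row is {threadId, cpuNanos}
        for (long id : mx.getAllThreadIds()) {
            long cpu = mx.getThreadCpuTime(id); // -1 if the thread has already died
            if (cpu >= 0) {
                rows.add(new long[] {id, cpu});
            }
        }
        rows.sort((a, b) -> Long.compare(b[1], a[1]));
        for (long[] row : rows) {
            ThreadInfo info = mx.getThreadInfo(row[0]);
            if (info != null) {
                out.add(info.getThreadName() + " " + (row[1] / 1_000_000L) + "ms");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = byCpuTime();
        for (String row : rows.subList(0, Math.min(5, rows.size()))) {
            System.out.println(row);
        }
    }
}
```

This only helps if threads carry meaningful names, which is one more reason to name pool threads explicitly.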
Database monitoring
It often happens that an application does almost nothing, has a small memory footprint, and yet works very slowly. A possible bottleneck is a database that consumes a lot of resources on a remote box and cannot handle all of the application's requests.
Databases usually provide their own tools for monitoring and optimization, but it can be worthwhile to trace database activity on the application side as well.
One way is to use J2EE data sources registered in the application server, which can show all SQL statements, the slowest ones, and all connections with their activity.
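When the application server does not offer this, a very small application-side trace can be built by hand. The sketch below is a hypothetical helper of my own, not part of any server or driver: it records how long each statement took and returns the slowest ones.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CopyOnWriteArrayList;

final class SqlTimer {
    /** One timed statement. */
    static final class Sample {
        final String sql;
        final long nanos;

        Sample(String sql, long nanos) {
            this.sql = sql;
            this.nanos = nanos;
        }
    }

    private final List<Sample> samples = new CopyOnWriteArrayList<>();

    /** Stores a measurement taken elsewhere. */
    void record(String sql, long nanos) {
        samples.add(new Sample(sql, nanos));
    }

    /** Runs the database call and records how long it took, even if it throws. */
    <T> T time(String sql, Callable<T> call) throws Exception {
        long start = System.nanoTime();
        try {
            return call.call();
        } finally {
            record(sql, System.nanoTime() - start);
        }
    }

    /** Returns up to n recorded statements, slowest first. */
    List<Sample> slowest(int n) {
        List<Sample> sorted = new ArrayList<>(samples);
        sorted.sort(Comparator.comparingLong((Sample s) -> s.nanos).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) throws Exception {
        SqlTimer timer = new SqlTimer();
        // A real caller would execute a JDBC statement inside the lambda
        timer.time("select 1", () -> { Thread.sleep(5); return null; });
        for (Sample s : timer.slowest(10)) {
            System.out.println(s.nanos / 1_000_000L + "ms " + s.sql);
        }
    }
}
```

In a real application the list would have to be bounded or reset periodically, otherwise the trace itself becomes the memory leak.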
IO stat
The last major performance issue to take into consideration is input/output throughput. This means both network and hardware activity, which can be easily monitored with the Linux “iostat” and “netstat” utilities.
The solution in this case is to fix it at the hardware level or, if possible, change the application code to reduce the amount of data sent/saved.
Conclusion
So, as I see it, all bottlenecks can be found with JConsole plus some useful tools such as jmap, SAP MAT, operating system tools (iostat, netstat), and database-specific applications. The only thing left to solve is application availability, but this can be handled with Apache/shell scripts.
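For those who prefer to keep the availability check in Java rather than in a shell script, a minimal sketch is a plain TCP connect with a timeout. The class name, the default host and port 8080 are assumptions for illustration. It only proves that something is listening on the port, not that the application behind it is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class AvailabilityCheck {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, timed out, or host could not be resolved
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8080;
        System.out.println(host + ":" + port + " reachable=" + isReachable(host, port, 2000));
    }
}
```

Run from cron or a scheduler, a check like this can raise an alert when the port stops answering.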