We recently ran into what initially appeared to be the resurrection of a long vanquished JDK bug, but turned out to be something new!
On one of our new clients, running ATG 11.1 on JBoss EAP 6.1, during pre-launch preparation, we discovered that the application was unusually slow. Not all the time, but randomly, pages would take tens or hundreds of seconds to return. Our APM, New Relic, pointed to extremely slow getConnection() calls on the JDBC pools. Most getConnection() calls were very quick, less than a millisecond, but some took 10 seconds or more. This obviously adds up when a page has dozens or hundreds of DB calls to make.
Slow performance around JDBC getConnection call
Delays on getConnection() calls immediately made us check two things: the JDBC pool size and health, and the Database performance.
We saw no errors about timeouts getting resources from the pool, and using the JBoss web admin, were able to confirm that the pools were not exhausted at any point. We focused on improving pool connection health checks, which was a good exercise, but didn’t have any impact on the issues we were seeing.
The database looked healthy and had plenty of open session and connection slots. I wrote a simple test class that would open up hundreds or thousands of JDBC connections, and time the creation. It showed no slowness or irregularities in establishing large volumes of connections to the database.
The Ghost of Christmas Past?
The symptoms rang a bell for me. The irregular, almost random, slowness with JDBC calls made me think of the old JDK /dev/random bug from JDK 1.4 and 1.5. For those of you who never ran into this, the short version is that a combination of JDK bugs meant that the JVM would try to get random bytes from /dev/random instead of /dev/urandom (even when configured to point to /dev/urandom in 1.5). /dev/random is a source of TRUE random bytes, and as such the pool is refreshed based on entropy sources such as mouse movements, and other more low level OS runtime actions. If you have a server, with no mouse/keyboard interaction, and a pretty steady state of processes/swap/etc… this unfortunately means that the entropy pool for /dev/random can become exhausted. Basically /dev/random would run out of random bytes. Then anything which needed random data, would be blocked and wait until enough events were captured to provide the entropy fuel to create some new random bytes. Oracle JDBC drivers use SSL to encrypt the connection to the DB, and rely on the random pool to setup those connections.
(See JDK-4705093, JDK-4820090, JDK-4952383, JDK-6202721, and many more for details)
So the symptom of this bug was that sometimes, JDBC connections would take a super long time to happen. It was semi-random, and presented more under load – just like what we were seeing here. But we solved this, using the /dev/./urandom workaround for years. So was this a new similar bug?
After confirming that the workaround was indeed in place, I modified my JDBC test class to use the same ojdbc.jar that JBoss was setup to use, and did some trace analysis to confirm which random source was being used, etc… Everything looked correct.
We were able to analyze a large number of stack traces, and found that the JBoss Periodic Recovery service (com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery) was holding a lock on the org.jboss.as.connector.subsystems.datasources.AbstractDataSourceService, which was blocking other threads for betting a connection back from the org.jboss.jca.adapters.jdbc.WrapperDataSource.getConnection call.
We dug into that, but other than disabling the Periodic Recovery service entirely (which is a long term bad idea), we couldn’t get a fix.
We opened a support case with RedHat/JBoss. They quickly came back confirming that this is a known bug in JBoss EAP 6.1!
It is fixed in EAP 6.2 and later, but unfortunately Oracle Commerce doesn’t yet support anything later than EAP 6.1, so luckily we were able to get a hotfix for EAP 6.1.0.
Once applied the hotfix solved the issue. Performance was good, getConnection calls returned quickly and consistently, etc….
If you see slow performance around JDBC getConnection calls and you’re running on JBoss EAP 6.1, you need a hotfix to solve a pretty critical ironjacamar bug.
That is the fix for the old issue I mentioned in the “The Ghost of Christmas Past” section.
The issue we found here more recently is not related to the crypo/random side of things, and requires the hot fix I mention above!
Should the issue be fixed with something like this in startup param?