uPortal2 Memory Leak Issues and Fixes

Several schools reported memory management issues with their uPortal2 implementations. As a result, the uPortal code base was investigated with YourKit (a memory profiling tool) to find the areas of code that cause memory leaks.

Before I jump into the memory leak issues, I would like to define some terms.

Memory Leaks in Java?

A memory leak refers to the loss of free memory in the heap. Yes, memory leaks can happen in Java, and they can be quite severe, but a Java memory leak is not like a C/C++ memory leak.
Memory leak in C/C++:
The object has been allocated, but it is no longer reachable, so it can never be freed.

Memory leak in Java:
The object is reachable but no longer live. That is, the object has reached the end of its lifecycle, and the GC should be able to reclaim it, but some erroneous reference prevents it from being garbage collected. Most of the time, a single lingering reference can have a massive memory impact.
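
For illustration, here is a minimal sketch of a reachable-but-dead object (the class and field names are invented, not from uPortal): a static collection that is only ever added to keeps every object it references reachable forever.

    import java.util.ArrayList;
    import java.util.List;

    public class SessionRegistry {
        // A static collection lives as long as the class is loaded.
        private static final List<Object> SESSIONS = new ArrayList<Object>();

        public static void register(Object session) {
            SESSIONS.add(session);
        }

        // Bug: nothing ever removes entries, so every registered session
        // stays strongly reachable and the GC can never reclaim it, even
        // after the session has logically ended.
    }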

Following is a list of memory leak issues that have been found in the uPortal code base.

Memory leak Issues List

1. PersonDirectory memory leak related to caching IPersons in a WeakHashMap

Description:
PersonDirectory.java has a memory leak resulting from the use of a WeakHashMap to cache IPerson objects.
The (broken) behavior is as follows: as users log in, an entry is added to the WeakHashMap that can never be garbage collected, because the value (the IPerson) holds a hard reference to the key (person.getAttribute(IPerson.USERNAME)). This means that for sites with 30,000+ registered users this map could grow to 30,000+ entries and would never shrink (until the JVM is bounced).
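
To see the pitfall in isolation, here is a minimal sketch (not the actual PersonDirectory code): a WeakHashMap can only evict an entry once its key becomes weakly reachable, so a value that holds a strong reference to its own key pins the entry forever.

    import java.util.Map;
    import java.util.WeakHashMap;

    public class WeakValuePitfall {
        // Simplified stand-in for IPerson: it holds a hard reference to
        // the same username string that is used as the map key.
        static class Person {
            final String username;
            Person(String username) { this.username = username; }
        }

        public static void main(String[] args) {
            Map<String, Person> cache = new WeakHashMap<String, Person>();
            String key = new String("jdoe");
            cache.put(key, new Person(key));
            key = null; // drop our own reference to the key

            System.gc();
            // The entry is NOT evicted: the Person value still strongly
            // references the key, so the key never becomes weakly
            // reachable. Typically prints 1.
            System.out.println(cache.size());
        }
    }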

Proposed Fixes:

  1. Remove the WeakHashMap altogether, with the understanding that the only downside is that PersonDirectory.getRestrictedPerson(String uid) will have to query the data store each time it is invoked.
  2. Replace the WeakHashMap with org.jasig.portal.utils.WeakValueMap from the uP3 sandbox code. That implementation is tested and should do what was originally intended for the PersonDirectory caching.

Implemented Fix & Reasoning:
The first proposed fix was implemented, for the following reasons.

  1. The assumption is that most users don't make heavy use of the cache, so it provides little value while consuming memory, which is the problem at hand.
  2. LDAP queries and primary-key selects tend to be very fast.
  3. WeakReference behavior is not a substitute for a proper caching strategy.

We can work from that known-good state to determine whether the loss of caching introduces unacceptable PersonDirectory performance, or whether it is more than good enough as a baseline onto which caching-inclined uPortal adopters can re-introduce caching, depending on where each adopter wants to sit on the performance / data-freshness / memory-conservation trade-off curve.

Status:
The proposed fix (fix number 1) has been implemented on the uPortal2 HEAD and the uportal-2-4-patches branch.
Important Notes:
Removing the IPerson cache in PersonDirectory should help with memory consumption, but it is sure to increase the workload of PersonDirectory, because PAGS is heavily dependent on PersonDirectory.getRestrictedPerson(), which will now instantiate an IPerson object and populate it with attributes every time it is called. That involves a trip to LDAP and to the database. The plan is to assume that the extra overhead will not have a significant impact on performance and to leave the cache out until a more solid and proven caching strategy is known to be required and available.
Important Links:
UP-744@JIRA

2. ChannelManager has a memory leak: when it swaps out a channel for the CError channel, the end session events never propagate to the original channel.

&

3. CSecureInfo has a memory leak: when ChannelManager swaps out a channel for the CSecureInfo channel, the end session events never propagate to the original channel.

Description:
When the ChannelManager swaps out a channel for the CError channel, the end session events never propagate to the original channel. This was particularly bad with the CPortletChannel channel, which keeps a static Map of ChannelState objects. Luckily, all my portlets on the "Portlet Examples" tab were misconfigured and all failed, which uncovered the problem.

Proposed Fixes:
We could add a method to CError to retrieve the "the_channel" attribute, then have ChannelManager propagate the SESSION_DONE event (and possibly others). There is similar behavior in CSecureInfo. Both CSecureInfo and CError should implement receiveEvent and pass the events they receive along to the channels they hold, as sketched below.
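
A minimal sketch of that forwarding (the interface and event type below are simplified stand-ins, not the real org.jasig.portal.IChannel/PortalEvent signatures):

    interface SimpleChannel {
        void receiveEvent(String event);
    }

    class ErrorWrapperChannel implements SimpleChannel {
        // The original channel this wrapper replaced (CError keeps it
        // in its "the_channel" attribute).
        private final SimpleChannel theChannel;

        ErrorWrapperChannel(SimpleChannel theChannel) {
            this.theChannel = theChannel;
        }

        // Forward lifecycle events such as SESSION_DONE so the wrapped
        // channel gets a chance to release its state instead of leaking.
        public void receiveEvent(String event) {
            if (theChannel != null) {
                theChannel.receiveEvent(event);
            }
        }
    }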
Status:
The proposed fix has been implemented on the uPortal2 HEAD and the uportal-2-4-patches branch.
Important Links:
UP-745@JIRA
UP-746@JIRA

4. Synchronize ChannelFactory's instantiateChannel method

Description:
During the analysis of memory snapshots, we found multiple instances of some multithreaded channels. It was surprising to see multiple instances, and it suggested a bug in the code. The problem was found in instantiateChannel: this method should have been synchronized so that there is no possibility of unintentionally creating more than one instance of a particular multithreaded channel, as sketched below.
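
A minimal sketch of the fix, assuming a class-name-keyed instance table (the names are illustrative, not the actual ChannelFactory code):

    import java.util.HashMap;
    import java.util.Map;

    class MultithreadedChannelFactory {
        // One shared instance per multithreaded channel class.
        private static final Map<String, Object> INSTANCES =
                new HashMap<String, Object>();

        // Without "synchronized", two threads can both miss the lookup,
        // both instantiate the channel, and both put it into the map,
        // leaving duplicate instances of a supposedly shared channel.
        static synchronized Object instantiateChannel(String className)
                throws Exception {
            Object channel = INSTANCES.get(className);
            if (channel == null) {
                channel = Class.forName(className).newInstance();
                INSTANCES.put(className, channel);
            }
            return channel;
        }
    }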
Status:
Fixed. UP-753@JIRA

5. Deploy Xalan to Tomcat's endorsed directory

Description:
uPortal ships with Xalan 2.6.0 (xalan.jar), which currently gets deployed into the webapp lib directory. In this location the JVM doesn't see it, because it loads Xalan from another location: JDK 1.4 bundles Xalan 2.4.1, and that is what gets loaded at run time. Xalan 2.4.1 has a known memory leak, which we caught after profiling uPortal.
Fix:
To make sure Xalan 2.6.0 gets used, the portal's Xalan should be copied to Tomcat's common/endorsed directory. Since our build.xml file is already Tomcat-specific, it won't hurt to add this change.
Status:
Fixed. UP-759@JIRA

Note: Xalan 2.6.0 has a bug. For details, please check the Known Problems section on the UCB Webmail page.

6. UBC Webmail ComposeMessage has a memory leak related to its finalize method

Description:
ComposeMessage has a memory leak related to its finalize method; this is causing ChannelManager objects to leak as well.
Fix:
Remove the finalizer method and provide an explicit termination method such as cleanup(), as sketched below.
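
A minimal sketch of the change (the field is invented for illustration; the real ComposeMessage internals differ):

    class ComposeMessageSketch {
        private java.util.List<byte[]> attachments =
                new java.util.ArrayList<byte[]>();

        // Before the fix, this cleanup lived in finalize(), which delays
        // reclamation: the object must pass through the finalizer queue
        // before its memory can be freed. After the fix, the caller
        // invokes this explicit termination method as soon as the
        // message is done with.
        public void cleanup() {
            attachments.clear();
            attachments = null;
        }
    }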
Status:
Fixed. UBCWEBMAIL-4@JIRA

7. CPortletAdapter and PortletStateManager memory leaks

Description:
Under stress conditions, when the PortletStateManager is asked to clear the window state for a particular session, that session cannot be obtained because the response has already been committed. Since this exception is thrown, the subsequent cleaning of the static channelStateMap Map does not happen, which results in a memory leak.

Fix:
We tried changing PortletStateManager and CPortletAdapter to use the PortletWindow's ID (which is initialized in CPortletAdapter to be session dependent – sessionID+subscribedId) to store window states. It seems to function correctly and does resolve the leak; a sketch follows.
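
A sketch of the keying change, with illustrative names (the real map and classes differ):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    class WindowStateStore {
        // Keyed by a session-dependent window ID (e.g. sessionID +
        // subscribedId) instead of being looked up through the
        // HttpSession, so the state can still be cleared even when the
        // session is no longer obtainable.
        private static final Map<String, Object> WINDOW_STATES =
                Collections.synchronizedMap(new HashMap<String, Object>());

        static void putState(String windowId, Object state) {
            WINDOW_STATES.put(windowId, state);
        }

        static void clearState(String windowId) {
            WINDOW_STATES.remove(windowId);
        }
    }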
Status:
Fixed in HEAD and portal_rel-2-5-patches: UP-1125@JIRA
Fixed for portal_rel-2-4-patches: UP-1119@JIRA

8. PortletStateManager.clearState(PortletWindow) can lead to a ConcurrentModificationException and can cause a memory leak.

Under stress, you can run into the following scenario, which can cause a ConcurrentModificationException:

Two users end/begin the lifecycle of this channel at the same time (likely under stress), and both have access to the same Map object at the same time – classic concurrency stuff. Once the remove operation throws that exception, the ChannelState object gets left behind.

Fix:
To fix this problem, removes and puts into the channelStateMap need to be broken out into synchronized methods, as in the sketch below. The implemented fix consists of synchronizing the maps as well as caching HashMaps of the current and previous window states. CPortletAdapter was also modified to use the clearState(PortletWindow) method rather than clearState(HttpServletRequest).
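
A minimal sketch of the synchronized accessors, assuming a static channelStateMap (illustrative only):

    import java.util.HashMap;
    import java.util.Map;

    class ChannelStateMapSketch {
        private static final Map<String, Object> channelStateMap =
                new HashMap<String, Object>();

        // All mutation goes through synchronized methods, so two users
        // ending/beginning the channel lifecycle at the same time can no
        // longer corrupt the map or trigger a
        // ConcurrentModificationException.
        static synchronized void putState(String key, Object state) {
            channelStateMap.put(key, state);
        }

        static synchronized Object removeState(String key) {
            return channelStateMap.remove(key);
        }
    }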
Status:
Fixed for head.
Fixed for portal_rel-2-4-patches.
Fixed for portal_rel-2-5-patches.
UP-1123@JIRA
UP-1124@JIRA

9. ca.ubc.itservices.channels.webmail.MessageParts finalize methods implicated in memory leaks

Description:
During the analysis of memory snapshots we found a lot of loitering objects in memory because of the MessageParts finalizer. When I set the filter to get objects retained because of the MessageParts class, I found 74,500 objects retained in memory, with a retained size of 3,980,240 KB. All paths go to the MessageParts.finalize() method.

Fix:
Remove Finalizer from this class.

Status:
Fixed. UBCWEBMAIL-35@JIRA
Following are links to the discussion that took place regarding this issue:
https://list.unm.edu/cgi-bin/wa?A2=ind0503&L=jasig-dev&D=0&P=4621 and
https://list.unm.edu/cgi-bin/wa?A2=ind0503&L=jasig-dev&P=R18341

10. ca.ubc.itservices.channels.webmail.ActiveFolder finalizer and its superclass (MsgFolder) finalizer cause loitering objects.

Observation:
When I set the filter to get objects retained because of the ActiveFolder class, I found 82,170 objects retained in memory, with a retained size of 6,889,496 KB. Most paths go to the ActiveFolder.finalize() method.
Fix:

  • I renamed the ActiveFolder.finalize() and MsgFolder.finalize() methods to close() and fixed the codebase to call this method explicitly to clean up the object's state.
    The finalize methods have been removed from these classes. Why? Because finalizers are highly unreliable and can cause memory leaks. We try to use finalizers to clean up non-memory resources and end up causing memory leaks in our applications. If we can clean up our own mess (non-memory resources), we should do so rather than rely on something unpredictable. For this approach we have to make sure the clients of these classes explicitly call the termination method appropriately to clean up the state, as in the sketch below. To me, the only legitimate use of a finalizer is to clean up the state of a native peer object, since the GC is unaware of native objects.
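
A minimal usage sketch (the folder type and method bodies are illustrative): the explicit termination call belongs in a finally block so it runs even when an exception is thrown.

    class CloseableFolderSketch {
        void open() { /* acquire mail resources */ }
        void read() { /* use the folder */ }
        void close() { /* explicit termination; replaces finalize() */ }

        public static void main(String[] args) {
            CloseableFolderSketch folder = new CloseableFolderSketch();
            folder.open();
            try {
                folder.read();
            } finally {
                // Guaranteed to run, so the folder's state is cleaned up
                // deterministically instead of waiting for the
                // finalizer queue.
                folder.close();
            }
        }
    }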

This suggestion has been posted on the JASIG DEV list:
https://list.unm.edu/cgi-bin/wa?A2=ind0503&L=jasig-dev&P=R38229

Status:
Fixed. UBCWEBMAIL-35@JIRA

11. ThreadPooling problem:

Description:
There were numerous issues discovered with the home-grown thread pooling library, including the potential for deadlock and infinite loops, as well as misplaced cleanup calls. We had already seen in production, through memory snapshots, that some of these issues were causing memory leaks.
Fix:
After discussion on the list, the thread pool was replaced with the backport-concurrent library. A patch was made for the uPortal HEAD. (At the time the patch was made, the fix had not yet been implemented on rel_2-4-patches.)
Status:
Fixed in both the HEAD and the rel_2-4-patches head.
UP-862@JIRA
UP-1016@JIRA

12. Worker.java

The following calls, which clean up resources, are in the try block. If an exception is thrown before these lines execute, the cleanup will not happen:

cleanState();
// Release this thread
pool.releaseThread(this);

Fix:
These should be moved to a finally block, as shown below.
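
A self-contained sketch of the corrected structure (the pool and task methods are simplified stand-ins for the real Worker internals):

    class WorkerSketch extends Thread {
        private final ThreadPoolSketch pool;

        WorkerSketch(ThreadPoolSketch pool) { this.pool = pool; }

        public void run() {
            try {
                doWork(); // may throw
            } finally {
                // Runs even when doWork() throws, so the worker's state
                // is always reset and the thread is always returned to
                // the pool.
                cleanState();
                pool.releaseThread(this);
            }
        }

        void doWork() { }
        void cleanState() { }

        static class ThreadPoolSketch {
            void releaseThread(WorkerSketch w) { }
        }
    }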
Status:
Fixed for both head and portal_release-2-4-patches.
UP-1058@JIRA
This issue was discussed on the uPortal Dev list. Following is the link to this discussion thread.
https://list.unm.edu/cgi-bin/wa?A2=ind0503&L=jasig-dev&D=0&m=6774&P=5397

13. Memory leak caused by ChannelRenderer finalizer method

The ChannelRenderer finalizer is causing lots of loitering objects in memory, resulting in a serious memory leak. The finalizer in this class was basically added to take care of runaway threads; a badly written channel can cause a runaway thread, e.g. a channel that goes into an infinite loop. We observed during memory snapshot analysis with YourKit that the finalizer is to blame for a lot of loitering objects.

Fix:
Rename the finalize() method to kill(). This method should be explicitly called from ChannelManager to kill the runaway threads.
The ChannelManager.finishedRenderingCycle() method is modified to explicitly loop through the ChannelRenderers held in ChannelManager.rendererTable and kill each one by explicitly calling its ChannelRenderer.kill() method, as sketched below.
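
A sketch of the explicit kill loop (the collection type and names are illustrative):

    import java.util.Iterator;
    import java.util.Map;

    class RenderingCycleSketch {
        interface Renderer { void kill(); }

        // Called at the end of each rendering cycle: explicitly kill
        // every renderer so runaway threads are stopped
        // deterministically, rather than waiting for a finalizer that
        // may never run.
        static void finishedRenderingCycle(Map<String, Renderer> rendererTable) {
            for (Iterator<Renderer> it = rendererTable.values().iterator(); it.hasNext();) {
                it.next().kill();
            }
            rendererTable.clear();
        }
    }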
Status:
Fixed for both head and portal_release-2-4-patches.
UP-931@JIRA

14. Finalizer in XRTreeFrag (xalan.jar) causes lots of loitering objects (memory leak).

The finalizer in XRTreeFrag (xalan.jar) causes lots of loitering objects. There is already a cleanup method, destruct(), available, and all clients should call it rather than depend on the finalizer.
Because of the finalizer, these objects get put on the finalizer queue and don't get cleaned up quickly when they go out of scope. The finalizer was probably provided as a safety net, but it hurts the application by causing a memory leak.
I took a snapshot from a production system with the heap size set to 128 MB.

Snapshot file size: 56,601 KB
XRTreeFrag retained size: 44,273 KB
XRTreeFrag number of retained objects: 170,321

Fix:
The fix is simple: remove the finalizer from this class. I have tested it, and it did clean up the resources and fix the memory leak.
Status:
I patched Xalan 2.6.0 with the finalizer removed from XRTreeFrag and deployed it in our production system, then took a snapshot again with the heap size set to 128 MB.

Snapshot file size: 36,823 KB
XRTreeFrag retained size: 636 KB (the retained objects are mostly due to the stylesheetRootCache)
XRTreeFrag number of retained objects: 2,647

In practice, this has improved portal performance significantly. I have filed a bug report in the Xalan JIRA:
http://issues.apache.org/jira/browse/XALANJ-2128

15. MultipartDataSource.finalize() is not implemented properly

There are two problems with this implementation:

1. The method signature is wrong.
2. Adding a finalize method delays memory cleanup. Snapshots taken from the production system prove this.
Fix:
Get rid of finalize and add a new method as follows:

 
  protected void dispose() {
      // Explicit termination method that replaces finalize(): release
      // the buffer and delete the temporary file, if one exists.
      buff = null;
      if (tempfile != null) {
          tempfile.delete();
          tempfile = null;
      }
  }

Status:
Fixed for both head and portal_release-2-4-patches.
UP-992@JIRA

Some Rutgers-specific Problems & Solutions:

During this round of memory analysis and improvement we also made some fixes that are more Rutgers-specific, but you can look into them to get some ideas. Following is a summarized list of these issues.

1. Disable persistent sessions

Description:
Disable persistent sessions if you do not really need them. You can disable them by adding the following lines to the application's context definition, either in server.xml or in the portal.xml that lives in the portal project:

<!-- Disables restart persistence of sessions -->
<Manager pathname=""/>

We noticed that persistent sessions were enabled by profiling the portal with YourKit and seeing a bunch of lingering PersonImpl objects that seemed to be held by Tomcat's PersistentSessionManager.

2. JavaServer Pages (JSP) needlessly created sessions

Description:
Several JavaServer Pages (JSPs) needlessly created sessions that were left to time out. Monitoring tools that were reading some of our diagnostic JSPs added as much as one session per minute. The solution was to add
<%@ page session="false" %> to the top of each JSP that does not need to participate in the session.

3. AuthorizationImpl memory leak:

It was pointed out on the mailing list that AuthorizationImpl has a serious memory leak, in that its HashMap is never cleaned up. It also includes a performance problem, in the form of a HashMap clone on the addition of every new principal.
Fix:
AuthorizationImpl was rewritten to use an LRU cache and to remove the cloning. This should reduce memory requirements and increase performance, along the lines of the sketch below.
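
A minimal LRU-cache sketch using LinkedHashMap's access-order eviction (illustrative only; the actual AuthorizationImpl rewrite may use a different cache implementation):

    import java.util.LinkedHashMap;
    import java.util.Map;

    class PrincipalLruCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        PrincipalLruCache(int maxEntries) {
            // accessOrder=true: iteration order is least-recently-used
            // first, which is what removeEldestEntry evicts.
            super(16, 0.75f, true);
            this.maxEntries = maxEntries;
        }

        // Evict the least-recently-used entry once the cap is exceeded,
        // so the cache can never grow without bound.
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;
        }
    }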