uPortal Load Testing Tips

This page collects tips on load testing / performance testing uPortal.  Load testing uPortal is generally the same as load testing any web application, but because uPortal aggregates content and caches extensively, there are a few differences compared to other web applications.

Tools

Any tool that supports load testing a web application will work.  Many institutions use JMeter since it is free and works well.

Traps to avoid

  • uPortal does extensive caching for the guest user, and also caches based on username.  Do not use a single user account (or a small number of accounts) for your simulated users. How large is your user community? Use a number of accounts similar to that. In any case, use 10k or more.  If you have multiple types of users (students, staff, faculty), use a set of test accounts that includes all user types, ideally in the same proportions as your actual community.

  • Don't forget to include pauses (between actions) in the scripts. Hundreds of threads sending requests to the portal as fast as they possibly can is not a load test -- it's a Denial of Service attack.  Use at least 5 seconds between actions.  Real users typically pause much longer than that; 5 seconds is pretty aggressive.

  • If you're using a load balancer that uses sticky connections (based on client IP address), you will need to use multiple injectors and configure your load balancer to route each injector to a different server. Consider bypassing the load balancer and configuring each injector to make requests directly to the real server (modifying the hosts file on the injector machines is a good way to avoid issues caused by redirects / certificates / cookies etc., e.g. the CAS authentication service URL).
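
As an illustration of the first point, here is a minimal Python sketch that builds a large, mixed pool of test accounts in fixed proportions and renders it as CSV (for example, to feed a data-driven login step such as JMeter's CSV Data Set Config). The username pattern, the shared password, and the 80/12/8 mix are assumptions made up for this example; substitute whatever matches the test accounts actually provisioned in your environment.

```python
import csv
import io
import random


def build_accounts(total, proportions, password="changeme", seed=0):
    """Return a shuffled list of (username, password, user_type) tuples.

    `proportions` maps a user type to its share of the community (0..1).
    The last type listed absorbs any rounding remainder so that exactly
    `total` accounts are produced.
    """
    accounts = []
    remaining = total
    types = list(proportions)
    for index, user_type in enumerate(types):
        if index == len(types) - 1:
            count = remaining
        else:
            count = min(round(total * proportions[user_type]), remaining)
        remaining -= count
        # Hypothetical naming scheme: student00000, student00001, ...
        accounts.extend(
            (f"{user_type}{n:05d}", password, user_type) for n in range(count)
        )
    # Shuffle so consecutive virtual users are not all of the same type.
    random.Random(seed).shuffle(accounts)
    return accounts


def to_csv(accounts):
    """Render the accounts as CSV text, one account per line, no header."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(accounts)
    return buffer.getvalue()


if __name__ == "__main__":
    mix = {"student": 0.80, "staff": 0.12, "faculty": 0.08}  # assumed mix
    accounts = build_accounts(10_000, mix)
    print(len(accounts), "accounts generated; first rows:")
    print(to_csv(accounts[:5]), end="")
```

Writing the output of `to_csv(accounts)` to a file gives each virtual user its own row, so the portal's per-username caching is exercised the way it would be with a real community.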

Tips preparing your plan

  • Typically the expensive thing is logging in. Most folks find that the concurrent user limit is constrained by the number of logins that can be performed.  A reasonable rate for a warmed-up portal is 4 seconds between logins.  A well-tuned and functioning portal on adequate hardware, including the dependent systems it accesses, can handle a faster rate than that, especially if you have more time between user actions.
    • Give the script a ramp-up time; e.g. on the first pass through the script, have it space the logins out a bit. JMeter's Thread Group has a ramp-up period setting that starts the script execution threads over a period of time rather than all at once. What I do is determine a login rate I want, say 1 login every 4 seconds, then calculate how many script execution threads I'd need based on the number of JMeter machines (you will likely need multiple) and the number of uPortal servers in the cluster.
    • Optionally configure a random variability in the time between user actions.  A common approach is to use 5 seconds plus 0-3 seconds.  This will cause a bit more randomness in the test but is more similar to the real world.
  • Using JMeter or a similar tool, you basically record the browser traffic to capture the flow of steps you want to test. It is usually a good idea to clear the browser cache before recording so that all requests are captured if you are trying to simulate a 'new user' vs. someone who has accessed the portal some time in the past (or to go for more of a 'worst case' vs. 'best case' type of comparison). Whatever you do, record what your process is so you have parity if you plan on comparing those numbers to previous results or some future results. JMeter has options for clearing the cache between runs, but you have to start out with all the browser requests by clearing the cache at script recording time if you are trying to get more of that 'worst case' scenario.
  • You have options in the JMeter configuration to have JMeter request external resources (images from institution web assets servers, etc.), or you can apply a filter to request only from a particular domain. Typically do the latter to minimize external dependencies that can add variability to results. It's not a bad idea to do one run with all external dependencies involved for comparison, because real users will hit all external dependencies and it might indicate a downstream issue your real users might run into. I wouldn't do it for all runs, though; it's basically 'static' web assets that would be accessed, which generally do not have a big resource impact on the dependent systems.

  • Create a few alternate scenarios (such as switching tabs / using focused window state / searching / adding portlets to layout) and mix them up so that not all virtual users are performing the same actions.
  • Set your thread count to 1 initially and enable Results Tree reporting when developing the script. Then increase to multiple threads on a single machine.  When doing real load tests you probably want to disable the Results Tree report (too extensive and slow).
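
The thread-count arithmetic mentioned in the ramp-up tip can be sketched roughly as follows. This is an illustration under stated assumptions, not a prescribed method: it assumes the target login interval applies per uPortal server, that each thread logs in once per pass through the script, and that one pass lasts roughly (number of actions) x (pause + average jitter + response time). All function and parameter names, and the 1-second average response time, are made up for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class LoadPlan:
    total_threads: int          # threads across all injectors
    threads_per_injector: int   # thread count to set on each JMeter machine
    ramp_up_seconds: float      # ramp-up period to set on each machine


def estimate_iteration_seconds(actions, base_pause_s=5.0, max_jitter_s=3.0,
                               avg_response_s=1.0):
    """Rough duration of one pass through the script.

    Uses the mean of a uniform 0..max_jitter_s random delay added to a
    constant pause (e.g. 5 seconds plus 0-3 seconds between actions).
    """
    return actions * (base_pause_s + max_jitter_s / 2 + avg_response_s)


def plan_threads(login_interval_s, iteration_s, portal_servers, injectors):
    """Size the test so steady-state logins match the target rate."""
    # Assumption: the login interval is per uPortal server.
    logins_per_second = portal_servers / login_interval_s
    # Each thread logs in once per iteration, so the number of threads
    # needed is (login rate) x (iteration duration).
    needed = math.ceil(logins_per_second * iteration_s)
    per_injector = math.ceil(needed / injectors)
    total = per_injector * injectors
    # Start threads at the target login rate during the first pass.
    ramp_up = total / logins_per_second
    return LoadPlan(total, per_injector, ramp_up)


if __name__ == "__main__":
    iteration = estimate_iteration_seconds(actions=16)
    print(plan_threads(login_interval_s=4, iteration_s=iteration,
                       portal_servers=2, injectors=3))
```

With these assumptions the ramp-up period comes out close to the duration of one pass through the script, which is what keeps the login rate steady as threads loop back to the login step.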

Tips executing your load test

  • Try to ensure as much consistency between tests as possible; e.g. eliminate external dependencies, make sure you aren't load testing when other activity might be occurring on the network, etc.
  • Restart Tomcat on all uPortal servers in the cluster prior to each load test to get a more consistent start state.  Otherwise the extensive caching can skew results.
  • It may also be worthwhile to perform an initdb between test runs (only if it's safe to do so without losing data you care about).

Monitoring

  • Work with your team to have monitoring occur on dependent systems during the load test (database, LDAP/AD, RSS feed sources, etc.)
  • Monitor a variety of things, but minimally CPU% and IO, and ensure you have garbage collection logging enabled (see JVM Configuration and Tuning for more information).  Also monitor your JMeter load generator machines.  It is possible that your load generators are not keeping up and are not able to send requests to uPortal or results to the JMeter server fast enough.
  • Using a monitoring tool such as NewRelic can be extremely beneficial for identifying both functional and non-functional issues. The visualisations provided by NewRelic can also be useful when interpreting the results.
  • JVM profiling tools (e.g. NewRelic / YourKit) can provide in-depth insights when diagnosing code-based performance issues.

Tips interpreting results

  • Interpreting the results can be challenging as there are many things happening. The portal often pulls in content or data from other systems and it is not uncommon for the portal to have to wait on the dependent systems. Looking at cpu% on the portal and on dependent systems can help determine where possible bottlenecks are.
  • It can be relatively difficult to define when the 'end' of the test is.  Due to the variability in user inputs and response handling, it is common for some of the test threads to complete while others are still running.  For the purpose of calculations, I usually look for a noticeable, consistent change in the CPU usage of the uPortal servers to determine when I consider the test ended on a particular server.  For instance, if you are hitting the systems fairly hard and the average CPU% used is in the upper 80s, when the average CPU% drops consistently below 70% I consider that the end of the test, at least on that uPortal node.
    • When I start a test, I'll often run a command such as the following in a shell on each uPortal server to record the CPU% every 15 seconds for 240 entries (adjust as appropriate for your test duration).  After the test, this provides a way to determine how long to consider the test 'active' when computing time-based test averages.

      sar -u 15 240 > cpuUsage.txt

  • Review the average CPU load on each uPortal server in the cluster during the load test.  They should be relatively equal.  If they are not, there is an inconsistency or error in the load balancer's load balancing algorithm/method or between the test script and load balancer.  For example if the load balancer is using IP-address-based routing, that's a problem especially for a load test, or if the load balancer sets a session cookie, the test script may not be resending that cookie value.  
  • We often find that the timeout value in the portlet definition files is fairly low, and under heavy load such as a load test you can get a lot of warnings in the portal.log file about portlets timing out. It's generally recommended to have higher values for the timeout in the portlet definition XML files to reduce this noise. The purpose of the portlet's timeout field is threefold: 1) to avoid holding up the page longer than that number of milliseconds while waiting on a dependent system, so you can give the user a satisfactory response, 2) to provide a mechanism for the portal to detect and count poorly performing portlets/dependent-systems so at some point the portal won't even attempt to try the dependent system (i.e. a short-circuit design pattern if you are familiar with it), and 3) to give the portal a mechanism to indicate a thread might be running out of control and request that it terminate (though it is only a request, and the thread has to call back into the portal for information for the thread to actually abort). I wouldn't have any timeout less than 10000ms, and those that have dependencies upon other systems should probably have a higher value (20000ms or 30000ms). It certainly cuts down on the noise during load tests and on production systems when traffic spikes occur.
  • One thing to look carefully at is the database connections. We have often found that there are too few connections to the database under load and this causes unnecessary delays. If the DB connections during a load test tend to be fully consumed but the DB has extra processing capacity, I'd increase the number of DB connections between the portal servers and the DB to see if you get more favorable results.
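
The 'end of test' rule of thumb above (CPU% dropping consistently below a threshold) can be applied to the recorded cpuUsage.txt with a short script. The following Python sketch is only an illustration: it assumes the usual sysstat layout for `sar -u`, where each data row contains the token `all` and `%idle` is the last column, and it treats "consistently" as a configurable number of consecutive samples. The function names, the 70% threshold default, and the 4-sample window are choices made for the example; check the column layout against your own sar output before relying on it.

```python
def parse_busy(text):
    """Extract CPU busy % (100 - %idle) from `sar -u` output text."""
    busy = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows contain "all"; skip headers, blanks and the Average row.
        if "all" not in fields or fields[0].startswith("Average"):
            continue
        try:
            idle = float(fields[-1])  # assumed: %idle is the last column
        except ValueError:
            continue
        busy.append(100.0 - idle)
    return busy


def find_test_end(busy, threshold=70.0, consecutive=4):
    """Index of the first sample of the first sustained drop below threshold.

    Samples before the load first reaches the threshold (ramp-up) are
    ignored. Returns len(busy) if no sustained drop is found.
    """
    seen_load = False
    run = 0
    for index, value in enumerate(busy):
        if value >= threshold:
            seen_load = True
            run = 0
        elif seen_load:
            run += 1
            if run == consecutive:
                return index - consecutive + 1
    return len(busy)


if __name__ == "__main__":
    interval_s = 15  # must match the interval given to sar
    with open("cpuUsage.txt") as handle:
        samples = parse_busy(handle.read())
    end = find_test_end(samples)
    active = samples[:end]
    print("active seconds:", end * interval_s)
    if active:
        print("mean CPU% while active:", sum(active) / len(active))
```

Running this against the file from each uPortal server also gives a per-server average that can be compared across the cluster to check that the load was balanced evenly.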

Example jmeter scripts

  File: csus_example_uPortal_jMeter_Script_base.jmx
  Modified: May 11, 2016 by James Wennmacher

CSU - Sacramento

  • Demonstrates use of variables, script to distribute load across student/faculty/staff accounts, CAS authentication, looping through portal's 4 tabs multiple times, variable wait times (3 seconds + 0 to 1 seconds)