Poor platform user lookup performance
Description
Environment
Attachments
- 28 Mar 2014, 11:02 PM
- 28 Mar 2014, 09:42 PM
- 27 Mar 2014, 08:29 PM
Activity
Daniel McCallum
Rolled back the patch for 2.4. Wasn't quite as impactful as I'd hoped, and not going to have time to vet the security implications for 2.4. Set to Critical for 2.5, though, because I do think this is becoming an increasing problem in the field.
Daniel McCallum
SSP-2050-NR-trace-without-person-attr-filtering.png shows that the patch does help query counts... there are still several N+1s, but at least N is the user count, not the attribute count. Total response time is still slower than you'd like. But in a situation where each of those queries would be hopping the network, this would probably be a significant improvement.
Daniel McCallum
This problem has become rather acute at several recent real-world deploys, so I just pushed an experimental commit to see what happens if we just comment out the problematic code block (we don't really need it, as noted in the comments above).
Daniel McCallum
Attaching SSP-2050-local-attr-dao-queries.txt, which is the db query horror show for a coach lookup, as reported by New Relic in Linux CI today (https://rpm.newrelic.com/accounts/643066/applications/3420204/transactions#id=452998208&tab-detail_3011473517=sql_statements&app_trace_id=3011473517)
Daniel McCallum
While working on unrelated issues and watching New Relic in Linux CI, I noticed that our SSP->uPortal coach lookups exhibit an N+1 query pattern. Attaching the NR trace, which shows roughly 450 db queries for a coach lookup result set that netted ~45 coaches. That the performance here is even halfway reasonable is kind of shocking; I could see it being much worse if hitting a db across a network, or just hitting a larger user store, period.
This might or might not be the same cause as what is described earlier in this ticket (it could just be a JPA/Hibernate mapping issue). But it's at least another data point to check when/if working on improvements to person lookup performance.
This was originally reported in the field, where looking up ~350 coaches via UPortalPersonAttributesService#getCoaches() was taking ~1.5m. In that case we demonstrated that only a negligible portion of that time could be attributed to the "core" LDAP search. Further, even searches with result sets smaller by an order of magnitude would take several seconds to return, which is rather unexpected for what should be heavily cached data.

The fix in the field was to dramatically increase the TTLs on the org.jasig.services.persondir.USER_INFO.merged cache, which did fix the perceived poor performance, but meant that changes in external attribute stores (LDAP) would be very slow to appear in SSP (Platform).

Replicating the problem locally was difficult until I added a third PersonAttributeDAO to the default two uPortal-internal DAOs, at which point I could get even small datasets, e.g. 2 coaches, to take up to a second to return, and larger datasets, e.g. 40 coaches, to take up to 15 seconds. I think this happens because the query pattern triggered by PersonLookupHelperImpl#searchForPeople has at least two problems:

1. Before the "main" search runs, the attribute filter ends up querying each DAO for each possible attribute as if it were a username.
2. After the "main" search runs, the results are filtered for user-scoped visibility (i.e. only the users the current user has permission to see are returned), which results in each found user being queried for again, but by username.
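A minimal simulation of the pattern just described (using simplified, hypothetical types; the real uPortal classes differ) shows how the query count balloons to M + 1 + N for M candidate attributes and N matching users, even against a single backing store. In the real case, problem #1 repeats per configured DAO, which is why adding a third DAO made things so much worse:

```java
import java.util.*;

// Hedged simulation of the lookup pattern described in this ticket.
// All names here are illustrative, not the real uPortal API.
class LookupSimulation {
    int queries = 0;

    Map<String, Object> getPersonByUsername(String username) {
        queries++; // one backing-store hit per username lookup
        Map<String, Object> person = new HashMap<>();
        person.put("username", username);
        return person;
    }

    List<Map<String, Object>> runMainSearch(int resultCount) {
        queries++; // the single "main" search
        List<Map<String, Object>> results = new ArrayList<>();
        for (int i = 0; i < resultCount; i++) {
            Map<String, Object> p = new HashMap<>();
            p.put("username", "coach" + i);
            results.add(p);
        }
        return results;
    }

    // Mimics the shape of PersonLookupHelperImpl#searchForPeople as
    // described above; returns the total backing-store query count.
    int searchForPeople(List<String> possibleAttributes, int matchingUsers) {
        // Problem #1: each possible attribute is tried as a username
        // before the main search even runs (per DAO, in the real case).
        for (String attr : possibleAttributes) {
            getPersonByUsername(attr);
        }
        List<Map<String, Object>> results = runMainSearch(matchingUsers);
        // Problem #2: the visibility filter re-fetches every result by username.
        for (Map<String, Object> person : results) {
            getPersonByUsername((String) person.get("username"));
        }
        return queries;
    }
}
```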
#1 seems like a flat out bug. I don't understand why that would ever make sense to do. Need to talk with the uPortal guys about that one.
#2 seems like we need both to recommend that all "external" DAOs get a caching wrapper, and to make that wrapper always cache both query results and per-username results, so that if the app turns around and immediately asks for the same users by username, we won't hit the underlying data store again. It's still going to be terribly inefficient, but it would be an improvement (and, FWIW, not all that dissimilar from the Hibernate identity cache).
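A caching wrapper along those lines might be sketched as follows. PersonDao and its methods are simplified stand-ins for the real persondir interfaces, not the actual API; the point is just that the search path populates the per-username cache as a side effect:

```java
import java.util.*;

// Hedged sketch: simplified stand-ins for the real persondir types.
interface PersonDao {
    // per-username lookup
    Map<String, Object> getPerson(String username);
    // the "main" attribute search
    List<Map<String, Object>> searchPeople(Map<String, Object> query);
}

// Wrapper that caches both whole query results and each returned person
// by username, so an immediate follow-up getPerson() for a user found by
// the search never hits the underlying store.
class CachingPersonDao implements PersonDao {
    private final PersonDao delegate;
    private final Map<Map<String, Object>, List<Map<String, Object>>> queryCache =
            new HashMap<>();
    private final Map<String, Map<String, Object>> usernameCache = new HashMap<>();

    CachingPersonDao(PersonDao delegate) {
        this.delegate = delegate;
    }

    @Override
    public Map<String, Object> getPerson(String username) {
        return usernameCache.computeIfAbsent(username, delegate::getPerson);
    }

    @Override
    public List<Map<String, Object>> searchPeople(Map<String, Object> query) {
        return queryCache.computeIfAbsent(query, q -> {
            List<Map<String, Object>> results = delegate.searchPeople(q);
            // Populate the per-username cache from the search results.
            for (Map<String, Object> person : results) {
                Object username = person.get("username");
                if (username != null) {
                    usernameCache.put((String) username, person);
                }
            }
            return results;
        });
    }
}
```

A production version would of course need eviction/TTL handling (e.g. via ehcache, as persondir's existing caching DAO uses), but the dual-population behavior is the part missing today.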
Lastly, to avoid regressions in this area, and because we can't set per-search cache policies on the Platform side, we should probably implement one or more caches on the SSP side... maybe some simple wrappers on the PersonAttributesService interface? (http://docs.spring.io/spring/docs/3.1.4.RELEASE/spring-framework-reference/htmlsingle/#cache)

Here is an example of problem #1, where I captured a stack trace just to demonstrate where the call was coming from. You should expect to see ~10 similar lookups, basically one for every possible user attribute:
And here is an example of #2. You should expect to see one of these for each person returned from the "main" coach search:
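As a rough illustration of the SSP-side cache idea above, here is a plain-Java sketch of a per-username attribute cache with a TTL, standing in for what a Spring @Cacheable wrapper would provide; all names are hypothetical. The TTL embodies the same trade-off noted earlier: a longer TTL hides the redundant lookups, but delays visibility of changes made in external stores like LDAP:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hedged sketch of an SSP-side per-username attribute cache with a TTL.
// Names are illustrative; this is not the real PersonAttributesService API.
class TtlAttributeCache {
    private static final class Entry {
        final Map<String, Object> attributes;
        final long expiresAtMillis;
        Entry(Map<String, Object> attributes, long expiresAtMillis) {
            this.attributes = attributes;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;

    TtlAttributeCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Returns cached attributes for the user, invoking the loader (e.g. the
    // underlying platform lookup) only when the entry is absent or expired.
    Map<String, Object> getAttributes(String username,
                                      Function<String, Map<String, Object>> loader) {
        long now = System.currentTimeMillis();
        Entry entry = cache.get(username);
        if (entry == null || entry.expiresAtMillis <= now) {
            entry = new Entry(loader.apply(username), now + ttlMillis);
            cache.put(username, entry);
        }
        return entry.attributes;
    }
}
```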