Poor platform user lookup performance
Description
Environment
Attachments
- 28 Mar 2014, 11:02 PM
- 28 Mar 2014, 09:42 PM
- 27 Mar 2014, 08:29 PM
Activity
Daniel McCallum
Rolled back the patch for 2.4. Wasn't quite as impactful as I'd hoped, and not going to have time to vet the security implications for 2.4. Set to Critical for 2.5, though, because I do think this is becoming an increasing problem in the field.
Daniel McCallum
SSP-2050-NR-trace-without-person-attr-filtering.png shows that the patch does help query counts... there are still several N+1s, but at least N is the user count, not the attribute count. Total response time is still slower than you'd like. But in a situation where each of those queries would be hopping the network, this would probably be a significant improvement.
Daniel McCallum
This problem has become rather acute at several recent real-world deploys, so I just pushed an experimental commit to see what happens if we just comment out the problematic code block (we don't really need it, as noted in the comments above).
Daniel McCallum
Attaching SSP-2050-local-attr-dao-queries.txt, which is the db query horror show for a coach lookup, as reported by New Relic in Linux CI today (https://rpm.newrelic.com/accounts/643066/applications/3420204/transactions#id=452998208&tab-detail_3011473517=sql_statements&app_trace_id=3011473517)
Daniel McCallum
While working on unrelated issues and watching New Relic in Linux CI, I noticed that our SSP->uPortal coach lookups exhibit an N+1 query pattern. Attaching the NR trace, which shows roughly 450 db queries for a coach lookup result set that netted ~45 coaches. That the performance here is even halfway reasonable is kind of shocking; I could see it being much worse if hitting a db across a network, or just hitting a larger user store, period.
This might or might not be the same cause as what is described earlier in this ticket (it could just be a JPA/Hibernate mapping issue). But it's at least another data point to check when/if working on improvements to person lookup performance.
This was originally reported in the field, where looking up ~350 coaches via UPortalPersonAttributesService#getCoaches() was taking ~1.5m. In that case we demonstrated that only a negligible portion of that time could be attributed to the "core" LDAP search. Further, even searches with result sets smaller by an order of magnitude would take several seconds to return, which is rather unexpected for what should be heavily cached data.

The fix in the field was to dramatically increase the TTLs on the org.jasig.services.persondir.USER_INFO.merged cache, which did fix the perceived poor performance, but meant that changes in external attribute stores (LDAP) would be very slow to appear in SSP (Platform).

Replicating the problem locally was difficult until I added a third PersonAttributeDAO to the default two uPortal-internal DAOs, at which point I could get even small datasets, e.g. 2 coaches, to take up to a second to return, and larger datasets, e.g. 40 coaches, to take up to 15 seconds. I think this happens because the query pattern triggered by PersonLookupHelperImpl#searchForPeople has at least two problems:

1. Before the "main" search runs, the attribute filter ends up querying each DAO for each possible attribute as if it were a username.
2. After the "main" search runs, the results are filtered for user-scoped visibility (i.e. only the users the current user has permission to see are returned), which results in each found user being queried for again, but by username.
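A minimal simulation of the pattern just described (using simplified, hypothetical types; the real uPortal classes differ) shows how the query count balloons to M + 1 + N for M candidate attributes and N matching users, even against a single backing store. In the real case, problem #1 repeats per configured DAO, which is why adding a third DAO made things so much worse:

```java
import java.util.*;

// Hedged simulation of the lookup pattern described in this ticket.
// All names here are illustrative, not the real uPortal API.
class LookupSimulation {
    int queries = 0;

    Map<String, Object> getPersonByUsername(String username) {
        queries++; // one backing-store hit per username lookup
        Map<String, Object> person = new HashMap<>();
        person.put("username", username);
        return person;
    }

    List<Map<String, Object>> runMainSearch(int resultCount) {
        queries++; // the single "main" search
        List<Map<String, Object>> results = new ArrayList<>();
        for (int i = 0; i < resultCount; i++) {
            Map<String, Object> p = new HashMap<>();
            p.put("username", "coach" + i);
            results.add(p);
        }
        return results;
    }

    // Mimics the shape of PersonLookupHelperImpl#searchForPeople as
    // described above; returns the total backing-store query count.
    int searchForPeople(List<String> possibleAttributes, int matchingUsers) {
        // Problem #1: each possible attribute is tried as a username
        // before the main search even runs (per DAO, in the real case).
        for (String attr : possibleAttributes) {
            getPersonByUsername(attr);
        }
        List<Map<String, Object>> results = runMainSearch(matchingUsers);
        // Problem #2: the visibility filter re-fetches every result by username.
        for (Map<String, Object> person : results) {
            getPersonByUsername((String) person.get("username"));
        }
        return queries;
    }
}
```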
#1 seems like a flat out bug. I don't understand why that would ever make sense to do. Need to talk with the uPortal guys about that one.
#2 seems like we need both to recommend that all "external" DAOs get a caching wrapper, and to make that wrapper always cache both query results and per-username results, so that if the app turns around and immediately asks for the same users by username, we won't hit the underlying data store again. It's still going to be terribly inefficient, but it would be an improvement (and, FWIW, not all that dissimilar from the Hibernate identity cache).
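A caching wrapper along those lines might be sketched as follows. PersonDao and its methods are simplified stand-ins for the real persondir interfaces, not the actual API; the point is just that the search path populates the per-username cache as a side effect:

```java
import java.util.*;

// Hedged sketch: simplified stand-ins for the real persondir types.
interface PersonDao {
    // per-username lookup
    Map<String, Object> getPerson(String username);
    // the "main" attribute search
    List<Map<String, Object>> searchPeople(Map<String, Object> query);
}

// Wrapper that caches both whole query results and each returned person
// by username, so an immediate follow-up getPerson() for a user found by
// the search never hits the underlying store.
class CachingPersonDao implements PersonDao {
    private final PersonDao delegate;
    private final Map<Map<String, Object>, List<Map<String, Object>>> queryCache =
            new HashMap<>();
    private final Map<String, Map<String, Object>> usernameCache = new HashMap<>();

    CachingPersonDao(PersonDao delegate) {
        this.delegate = delegate;
    }

    @Override
    public Map<String, Object> getPerson(String username) {
        return usernameCache.computeIfAbsent(username, delegate::getPerson);
    }

    @Override
    public List<Map<String, Object>> searchPeople(Map<String, Object> query) {
        return queryCache.computeIfAbsent(query, q -> {
            List<Map<String, Object>> results = delegate.searchPeople(q);
            // Populate the per-username cache from the search results.
            for (Map<String, Object> person : results) {
                Object username = person.get("username");
                if (username != null) {
                    usernameCache.put((String) username, person);
                }
            }
            return results;
        });
    }
}
```

A production version would of course need eviction/TTL handling (e.g. via ehcache, as persondir's existing caching DAO uses), but the dual-population behavior is the part missing today.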
Lastly, to avoid regressions in this area, and because we can't set per-search cache policies on the Platform side, we should probably implement one or more caches on the SSP side... maybe some simple wrappers on the PersonAttributesService interface? (http://docs.spring.io/spring/docs/3.1.4.RELEASE/spring-framework-reference/htmlsingle/#cache)

Here is an example of problem #1, where I captured a stack trace just to demonstrate where the call was coming from. You should expect to see ~10 similar lookups, basically one for every possible user attribute:
And here is an example of #2. You should expect to see one of these for each person returned from the "main" coach search:
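As a rough illustration of the SSP-side cache idea above, here is a plain-Java sketch of a per-username attribute cache with a TTL, standing in for what a Spring @Cacheable wrapper would provide; all names are hypothetical. The TTL embodies the same trade-off noted earlier: a longer TTL hides the redundant lookups, but delays visibility of changes made in external stores like LDAP:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hedged sketch of an SSP-side per-username attribute cache with a TTL.
// Names are illustrative; this is not the real PersonAttributesService API.
class TtlAttributeCache {
    private static final class Entry {
        final Map<String, Object> attributes;
        final long expiresAtMillis;
        Entry(Map<String, Object> attributes, long expiresAtMillis) {
            this.attributes = attributes;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;

    TtlAttributeCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Returns cached attributes for the user, invoking the loader (e.g. the
    // underlying platform lookup) only when the entry is absent or expired.
    Map<String, Object> getAttributes(String username,
                                      Function<String, Map<String, Object>> loader) {
        long now = System.currentTimeMillis();
        Entry entry = cache.get(username);
        if (entry == null || entry.expiresAtMillis <= now) {
            entry = new Entry(loader.apply(username), now + ttlMillis);
            cache.put(username, entry);
        }
        return entry.attributes;
    }
}
```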