uPortal IRC Logs-2012-08-28

[11:56:10 CDT(-0500)] <sjungling> morning all… it looks our uPortal 4 cluster is hammering our LDAP servers with queries to the point where LDAP stops accepting connections

[11:56:17 CDT(-0500)] <sjungling> i have LDAP pooling enabled

[11:56:39 CDT(-0500)] <sjungling> are there other properties that can be configured for LDAP to fine-tune the pooling?

[11:58:12 CDT(-0500)] <EricDalquist> sjungling: where did we get with caching?

[12:01:31 CDT(-0500)] <sjungling> got it working

[12:01:47 CDT(-0500)] <sjungling> albiet the 3 servers use two different flavors of caching

[12:01:52 CDT(-0500)] <sjungling> one has caching only on merged attribuets

[12:02:05 CDT(-0500)] <sjungling> the other has caching on individual data sources AND on merged attributes

[12:03:21 CDT(-0500)] <sjungling> here's the latest with individual caching that's deployed to 1/3 servers https://gist.github.com/b70101a55e5bc66e6333

[12:04:03 CDT(-0500)] <sjungling> it could very well be that this is just our LDAP server, but as one of the largest consumers of our LDAP… uPortal is where we're starting to troubleshoot

[12:06:34 CDT(-0500)] <sjungling> even got Regex working to not search LDAP/Secondary Oracle for "guest" user… since they won't exist

[12:15:36 CDT(-0500)] <sjungling> thinking it might be worth going above and beyond with the LDAP pooling a la http://static.springsource.org/spring-ldap/site/reference/html/pooling.html

[12:17:15 CDT(-0500)] <sjungling> kinda like this https://gist.github.com/b70101a55e5bc66e6333#file_ldap_context.xml

[12:18:06 CDT(-0500)] <EricDalquist> well I wrote the ldap pooling in spring-ldap

[12:18:10 CDT(-0500)] <EricDalquist> and while it depends on your ldap server

[12:18:17 CDT(-0500)] <EricDalquist> we actually found pooling connections reduced performance

[12:18:24 CDT(-0500)] <EricDalquist> though it also reduced load on the ldap server

[12:18:33 CDT(-0500)] <EricDalquist> since with pooling you need connection validation

[12:18:45 CDT(-0500)] <EricDalquist> and dbcp uses a lot of synchronization

[12:18:57 CDT(-0500)] <EricDalquist> I'm grabbing a bite but I'll look at your configs in a few minutes

[12:19:04 CDT(-0500)] <sjungling> kk

[12:19:09 CDT(-0500)] <EricDalquist> in theory with caching you should only see 1 query per user login

[12:19:20 CDT(-0500)] <EricDalquist> any idea how many logins you see/minute?

[12:19:34 CDT(-0500)] <sjungling> not off hand

[12:19:37 CDT(-0500)] <sjungling> too much text scrolling

[12:20:02 CDT(-0500)] <EricDalquist> like 10/minute ... 100/minute ... etc

[12:20:29 CDT(-0500)] <sjungling> probably more toward the 100/min

[12:20:37 CDT(-0500)] <EricDalquist> ok

[12:20:38 CDT(-0500)] <sjungling> Google Analytics Real Time shows about 480 active users

[12:20:54 CDT(-0500)] <EricDalquist> do you have an idea of how many LDAP queries/minute your server is seeing?

[12:21:13 CDT(-0500)] <sjungling> sigh

[12:21:22 CDT(-0500)] <sjungling> it would be super helpful if I had that kind of info

[12:21:28 CDT(-0500)] <sjungling> at present, that cannot be supplied to me

[12:21:33 CDT(-0500)] <EricDalquist> :/

[12:21:46 CDT(-0500)] <sjungling> so… disabling pooling would be your recommendation for a first step?

[12:22:10 CDT(-0500)] <EricDalquist> well that would incrase ldap server load

[12:22:16 CDT(-0500)] <EricDalquist> but in our limited testing case

[12:22:21 CDT(-0500)] <EricDalquist> it performed a little better

[12:22:27 CDT(-0500)] <EricDalquist> but our ldap servers were not overloaded

[12:22:48 CDT(-0500)] <sjungling> our admin says we're experiencing load spikes which in turn starts refusing connections to everyone

[12:23:05 CDT(-0500)] <sjungling> that's why I was thinking maxTotal property from pooling might help curb the spikes

[12:23:21 CDT(-0500)] <EricDalquist> yeah

[12:23:26 CDT(-0500)] <EricDalquist> it would just shift the bottleneck

[12:23:27 CDT(-0500)] <sjungling> this is old skool uP 2.6, but this is how we had LDAP configured before

[12:23:28 CDT(-0500)] <sjungling> https://gist.github.com/b70101a55e5bc66e6333#file_ldap.xml

[12:23:46 CDT(-0500)] <EricDalquist> and depending on your config either throw errors or cause long wait times

[12:27:29 CDT(-0500)] <sjungling> long wait times i could live with… errors… less than optimal

[12:27:48 CDT(-0500)] <sjungling> but it beats the heck outta being the person that's responsible for the app bringing down all network authentication

[12:32:16 CDT(-0500)] <sjungling> so… additional pooling options… got to work locally… signed in. checked caching info for kicks:

[12:32:16 CDT(-0500)] <sjungling> LDAP: 25% (3 hits, 9 misses)

[12:32:34 CDT(-0500)] <sjungling> are misses == queries?

[12:32:49 CDT(-0500)] <EricDalquist> that would be 12 queries

[12:32:52 CDT(-0500)] <EricDalquist> 3 hit

[12:32:54 CDT(-0500)] <EricDalquist> 9 missed

[12:33:20 CDT(-0500)] <sjungling> k. so why does one login generate 12 queries against LDAP??

[12:33:34 CDT(-0500)] <sjungling> gah, didn't mean that extra '?'

[12:33:38 CDT(-0500)] <EricDalquist>

[12:33:53 CDT(-0500)] <EricDalquist> so uPortal is generally "naive" in data access

[12:34:01 CDT(-0500)] <EricDalquist> when it needs a piece of data it requests it from the source

[12:34:07 CDT(-0500)] <EricDalquist> the intent is that there is a caching layouer

[12:34:25 CDT(-0500)] <EricDalquist> which means that the data layer only gets asked once in a while

[12:35:29 CDT(-0500)] <EricDalquist> but one login by one user should be 1 miss and X hits

[12:36:08 CDT(-0500)] <sjungling> https://gist.github.com/b70101a55e5bc66e6333#file_gistfile1.txt

[12:36:17 CDT(-0500)] <sjungling> that's the cache stats from my local box after one login

[12:36:48 CDT(-0500)] <sjungling> maybe that looks a little more correct with 'merged' getting more hits and misses

[12:36:55 CDT(-0500)] <EricDalquist> yes

[12:36:58 CDT(-0500)] <EricDalquist> that does look correct

[12:37:00 CDT(-0500)] <EricDalquist> also

[12:37:10 CDT(-0500)] <EricDalquist> can you try login as a different local user

[12:37:14 CDT(-0500)] <sjungling> sure

[12:37:14 CDT(-0500)] <EricDalquist> right after the first login

[12:37:19 CDT(-0500)] <sjungling> one moment

[12:37:25 CDT(-0500)] <EricDalquist> I'm guessing that the higher numbers are queries for related user data

[12:37:29 CDT(-0500)] <EricDalquist> for things like the fragment owners

[12:37:41 CDT(-0500)] <sjungling> :-/

[12:37:48 CDT(-0500)] <sjungling> trying to query LDAP for fragment owners?

[12:37:53 CDT(-0500)] <EricDalquist> but those should only happen once as well

[12:37:58 CDT(-0500)] <sjungling> that would account for a lot of those misses

[12:38:01 CDT(-0500)] <EricDalquist> well for uPortal a user is a user

[12:40:03 CDT(-0500)] <sjungling> right;

[12:40:11 CDT(-0500)] <EricDalquist> but again those misses should get cached

[12:40:13 CDT(-0500)] <sjungling> but again, in theory, that only happens once

[12:41:02 CDT(-0500)] <sjungling> ok, had another user login and then i logged in again

[12:41:02 CDT(-0500)] <sjungling> USER_INFO.csuc_ldap 00% (1 / 500) 30% (8 hits, 19 misses)

[12:41:02 CDT(-0500)] <sjungling> USER_INFO.csuc_rds 00% (1 / 500) 33% (8 hits, 16 misses)

[12:41:02 CDT(-0500)] <sjungling> USER_INFO.merged 00% (1 / 1000) 67% (57 hits, 28 misses)

[12:41:02 CDT(-0500)] <sjungling> USER_INFO.up_user 00% (1 / 500) 33% (8 hits, 16 misses)

[12:42:55 CDT(-0500)] <EricDalquist> hrm

[12:43:12 CDT(-0500)] <EricDalquist> so one thing you could do is turn on debug logging for org.jasig.services.persondir.support.CachingPersonAttributeDaoImpl

[12:43:53 CDT(-0500)] <EricDalquist> that should log each hit/miss

[12:43:56 CDT(-0500)] <EricDalquist> and the associated key

[12:44:04 CDT(-0500)] <EricDalquist> which might give us some hints

[12:45:18 CDT(-0500)] <sjungling> like this?

[12:45:18 CDT(-0500)] <sjungling> log4j.logger.org.jasig.services.persondir.support.CachingPersonAttributeDaoImpl=DEBUG, R

[12:45:18 CDT(-0500)] <sjungling> log4j.additivity.org.jasig.services.persondir.support.CachingPersonAttributeDaoImpl.type=false

[12:45:37 CDT(-0500)] <EricDalquist> drop the .type off the second line

[12:45:42 CDT(-0500)] <EricDalquist> then it is right

[12:46:06 CDT(-0500)] <sjungling> kewl… restarting

[12:46:48 CDT(-0500)] <EricDalquist> warning ... it will likely be a lot of stuff to sift through in your logs

[12:46:49 CDT(-0500)] <sjungling> preppin' for the wall-o-log msgs

[12:47:59 CDT(-0500)] <sjungling> here's the welcome page: https://gist.github.com/b8a08ce3480f5e46c8b8

[12:48:54 CDT(-0500)] <sjungling> and here's after I logged in https://gist.github.com/b8a08ce3480f5e46c8b8#file_u_portal_user_login.log

[12:49:01 CDT(-0500)] <EricDalquist> uh ... hrm

[12:49:08 CDT(-0500)] <EricDalquist> athena: you there?

[12:49:12 CDT(-0500)] <athena> yep

[12:49:13 CDT(-0500)] <athena> what's up?

[12:49:23 CDT(-0500)] <EricDalquist> take a look at https://gist.github.com/b8a08ce3480f5e46c8b8

[12:49:27 CDT(-0500)] <EricDalquist> starting at line 124

[12:49:36 CDT(-0500)] <EricDalquist> why would uPortal be querying personDir for username=ALL

[12:49:47 CDT(-0500)] <EricDalquist> username=ALL_PORTLETS

[12:50:13 CDT(-0500)] <athena> hmmm . . . it seems like that has to be permission-related

[12:50:19 CDT(-0500)] <EricDalquist> yeah

[12:50:19 CDT(-0500)] <athena> i think those are permission targets, right?

[12:50:49 CDT(-0500)] <EricDalquist> thats what I thought

[12:50:50 CDT(-0500)] <athena> i think we'd had some notion of trying to have something that represented everything and we'd used those to replace some of the non-performant regex calls that had gotten added into the logic

[12:50:57 CDT(-0500)] <athena> that's what i'm remembering offhand, at least

[12:51:08 CDT(-0500)] <athena> to be honest, i'm not quite sure how much and where those targets are being used

[12:51:13 CDT(-0500)] <athena> we still don't have good UI support for them

[12:51:29 CDT(-0500)] <EricDalquist> yeah

[12:51:34 CDT(-0500)] <EricDalquist> I see them in the code ...

[12:52:01 CDT(-0500)] <athena> not sure why they're being called there though

[12:52:05 CDT(-0500)] <EricDalquist> yeah

[12:52:12 CDT(-0500)] <EricDalquist> so that log looks pretty good sjungling

[12:52:39 CDT(-0500)] <sjungling> k

[12:55:16 CDT(-0500)] <sjungling> so… LDAP pooling

[12:55:27 CDT(-0500)] <EricDalquist> I'm going to file a bug about the ALL/ALL_PORTLETS queries

[12:55:31 CDT(-0500)] <EricDalquist> since those shouldn't happen

[12:55:35 CDT(-0500)] <sjungling> kk

[12:55:43 CDT(-0500)] <EricDalquist> but I'm still wondering why you're seeing so much LDAP load

[12:56:03 CDT(-0500)] <EricDalquist> is there any hope of figuring out logins/minute vs ldap queries/minute?

[12:56:23 CDT(-0500)] <sjungling> we've hit a point where there's like 3 major services that consume our LDAP… and none of us are sure if we're the cause or victim of the other

[12:56:37 CDT(-0500)] <sjungling> lemme ping our OSA again

[12:56:37 CDT(-0500)] <EricDalquist> heh

[12:58:02 CDT(-0500)] <sjungling> but, if i'm reading the docs right… spring-ldap pooling maxTotal has a default of unlimited active connections?

[12:58:15 CDT(-0500)] <EricDalquist> I think so

[12:58:23 CDT(-0500)] <EricDalquist> it is just using commons-dbcp under the hood

[12:58:24 CDT(-0500)] <sjungling> that seems… a bit presumptive

[13:01:07 CDT(-0500)] <EricDalquist> hrm

[13:01:07 CDT(-0500)] <EricDalquist> http://static.springsource.org/spring-ldap/site/reference/html/pooling.html

[13:01:12 CDT(-0500)] <EricDalquist> that shows a default of 8

[13:01:33 CDT(-0500)] <EricDalquist> for max active

[13:01:40 CDT(-0500)] <EricDalquist> which essentially limits maxTotal

[13:01:56 CDT(-0500)] <sjungling> but maxTotal also has a property of -1, right?

[13:02:00 CDT(-0500)] <EricDalquist> right

[13:02:06 CDT(-0500)] <EricDalquist> but you'll never get beyond ~8

[13:02:10 CDT(-0500)] <EricDalquist> since once you have 8 active

[13:02:20 CDT(-0500)] <EricDalquist> a new connection request just waits for an active slot

[13:02:27 CDT(-0500)] <EricDalquist> or fails

[13:02:33 CDT(-0500)] <EricDalquist> depending on whenExhaustedAction

[13:02:34 CDT(-0500)] <sjungling> hrm

[13:03:56 CDT(-0500)] <sjungling> so, disabling pooling removes that maxActive limit, correct?

[13:05:29 CDT(-0500)] <EricDalquist> so looking at https://gist.github.com/b70101a55e5bc66e6333

[13:08:53 CDT(-0500)] <EricDalquist> so looking at https://gist.github.com/b12feca69bb37fcf44e5

[13:09:07 CDT(-0500)] <EricDalquist> it looks like you'd peak out at 50 ldap connections from each server

[13:09:47 CDT(-0500)] <sjungling> so maxActive would supersede maxTotal?

[13:09:58 CDT(-0500)] <EricDalquist> yes

[13:10:03 CDT(-0500)] <sjungling> and… that would be better than default 8 connections (if that's the bottleneck)

[13:10:23 CDT(-0500)] <EricDalquist> here is our config: https://gist.github.com/3501568

[13:11:46 CDT(-0500)] <sjungling> k, lemme give that a whirl

[13:16:39 CDT(-0500)] <sjungling> ldapContext.xml] is invalid; nested exception is org.xml.sax.SAXParseException: The prefix "util" for element "util:constant" is not bound

[13:17:08 CDT(-0500)] <sjungling> searchControl

[13:17:38 CDT(-0500)] <EricDalquist> updated the gist

[13:17:48 CDT(-0500)] <EricDalquist> you need to add the util namespace to the root element

[13:20:11 CDT(-0500)] <sjungling> got it

[13:20:13 CDT(-0500)] <sjungling> testing

[13:21:33 CDT(-0500)] <athena> anyone have any favorite tomcat/jvm/jmx/whatever monitoring utilities for uportal?

[13:22:14 CDT(-0500)] <EricDalquist> uhg no

[13:22:22 CDT(-0500)] <EricDalquist> I would love to find a decent JMX monitoring tool

[13:22:38 CDT(-0500)] <athena> anything else you guys use for monitoring JVMs in general?

[13:22:44 CDT(-0500)] <EricDalquist> watchdog

[13:22:49 CDT(-0500)] <EricDalquist> to watch the process

[13:22:50 CDT(-0500)] <EricDalquist> that is it

[13:22:56 CDT(-0500)] <EricDalquist> we really want jvm internals monitoring

[13:23:04 CDT(-0500)] <EricDalquist> but just haven't had the resources to really look into it

[13:23:14 CDT(-0500)] <EricDalquist> sjungling: so one of our 4 servers has 639 active users

[13:23:25 CDT(-0500)] <EricDalquist> and 10 LDAP connections

[13:24:36 CDT(-0500)] <sjungling> dang

[13:25:07 CDT(-0500)] <EricDalquist> we get a lot of login turn over

[13:25:22 CDT(-0500)] <sjungling> hows that? login/logout/login agains?

[13:25:23 CDT(-0500)] <EricDalquist> PD ldap cache is at 4746/22476 hits/misses

[13:25:38 CDT(-0500)] <EricDalquist> login view one/two pages and leave

[13:25:44 CDT(-0500)] <sjungling> ah, yeah

[13:25:46 CDT(-0500)] <sjungling> we get that

[13:25:52 CDT(-0500)] <sjungling> we're a springboard to "other stuff"

[13:26:01 CDT(-0500)] <EricDalquist> PD merged cache: 378388/36466 hits/misses

[13:26:13 CDT(-0500)] <sjungling> so your LDAP cache has a similar caching percentage…

[13:26:26 CDT(-0500)] <EricDalquist> so 91% hit rate for merged ... 16% hit rate for ldap

[13:26:29 CDT(-0500)] <EricDalquist> which is what we would expect

[13:26:33 CDT(-0500)] <sjungling> kk

[13:26:39 CDT(-0500)] <EricDalquist> the merging layer should catch most of the caching hits

[13:26:41 CDT(-0500)] <sjungling> so our 70-75% on merged isn't too far off

[13:26:49 CDT(-0500)] <EricDalquist> yeah

[13:26:56 CDT(-0500)] <sjungling> kk

[13:31:07 CDT(-0500)] <sjungling> ok… looks like i finally got a successful login using your ldap config

[13:39:11 CDT(-0500)] <EricDalquist> HAHA: http://www.youtube.com/watch?v=dYBjVTMUQY0

[13:58:52 CDT(-0500)] <athena> omg.

[13:58:52 CDT(-0500)] <athena> lol

[14:19:06 CDT(-0500)] <drewwills> lol

[14:32:06 CDT(-0500)] <sjungling> @ericdalquist seeing bunch of these https://raw.github.com/gist/f706e6c0069725189c01/97c01b481411cc812b9802b41dc90423a66f7395/uPortal.log

[14:32:44 CDT(-0500)] <sjungling> by your estimations do those look like a "it's not uP but LDAP" kinda thing?

[14:50:07 CDT(-0500)] <EricDalquist> huh

[14:50:08 CDT(-0500)] <EricDalquist> not sure

[14:51:18 CDT(-0500)] <EricDalquist> what does your target org.springframework.ldap.core.support.LdapContextSource bean look like again?

[14:51:59 CDT(-0500)] <EricDalquist> also do you have a snippet of your ehcache config handy for the caches used by PD?

[14:52:17 CDT(-0500)] <sjungling> yep, one moment

[14:53:30 CDT(-0500)] <sjungling> https://gist.github.com/70950a606b5fe4a0de49

[14:54:16 CDT(-0500)] <EricDalquist> so one thing I'd look at is cache sizes

[14:54:20 CDT(-0500)] <EricDalquist> either via the cache manager portlet

[14:54:21 CDT(-0500)] <EricDalquist> or jmx

[14:54:30 CDT(-0500)] <EricDalquist> see how full your persondir caches are

[14:54:39 CDT(-0500)] <EricDalquist> you might be running out of space with only 1000 elements

[14:54:43 CDT(-0500)] <EricDalquist> resulting in more queries

[14:54:55 CDT(-0500)] <EricDalquist> same with 500 for ldap queries

[14:55:22 CDT(-0500)] <sjungling> one of our boxes… (21 / 500) for merged

[14:55:25 CDT(-0500)] <EricDalquist> if you get more than 500 unique, active, users over 10 minutes you'll end up with extra queries being run

[14:56:29 CDT(-0500)] <sjungling> groups.IEntity, StylesheetUserPreferencesImpl.Query, and FragmentDefinition.Query are the only ones with full objects

[14:56:46 CDT(-0500)] <EricDalquist> hrm ok

[14:57:27 CDT(-0500)] <EricDalquist> and can you add your ldapContext.xml to that gist?

[15:03:54 CDT(-0500)] <sjungling> added

[15:04:46 CDT(-0500)] <EricDalquist> hrm

[15:04:55 CDT(-0500)] <EricDalquist> so I think that is the LDAP server returning an error

[15:05:34 CDT(-0500)] <EricDalquist> or ... well it could be the csuc-student-announcements portlet timing out

[15:05:44 CDT(-0500)] <EricDalquist> since when a portlet times out thread.interrupt is called on it

[15:06:15 CDT(-0500)] <EricDalquist> yeah that is the case

[15:06:23 CDT(-0500)] <EricDalquist> so the portal is trying to execute csuc-student-announcements

[15:06:33 CDT(-0500)] <EricDalquist> which takes longer than the configured timeout

[15:06:52 CDT(-0500)] <EricDalquist> and when the portlet times out the portal sticks that worker thread on a recovery queue

[15:07:00 CDT(-0500)] <EricDalquist> and periodically calls Thread.interrupt on it

[15:07:04 CDT(-0500)] <EricDalquist> until the worker returns

[15:07:12 CDT(-0500)] <EricDalquist> if you're seeing a lot of thse

[15:07:25 CDT(-0500)] <EricDalquist> I'd say it implies that the LDAP searches are taking a long time

[15:07:35 CDT(-0500)] <EricDalquist> looking at the bottom of the stack trace

[15:07:41 CDT(-0500)] <EricDalquist> the LDAP connection is established

[15:07:59 CDT(-0500)] <EricDalquist> and the code is waiting in readReply to read the response from the LDAP server

[15:14:11 CDT(-0500)] <EricDalquist> what is the timeout for that portlet?>

[15:14:48 CDT(-0500)] <sjungling> checking…

[15:14:52 CDT(-0500)] <sjungling> probably the default for webproxy