[11:56:10 CDT(-0500)] <sjungling> morning all… it looks our uPortal 4 cluster is hammering our LDAP servers with queries to the point where LDAP stops accepting connections
[11:56:17 CDT(-0500)] <sjungling> i have LDAP pooling enabled
[11:56:39 CDT(-0500)] <sjungling> are there other properties that can be configured for LDAP to fine-tune the pooling?
[11:58:12 CDT(-0500)] <EricDalquist> sjungling: where did we get with caching?
[12:01:31 CDT(-0500)] <sjungling> got it working
[12:01:47 CDT(-0500)] <sjungling> albiet the 3 servers use two different flavors of caching
[12:01:52 CDT(-0500)] <sjungling> one has caching only on merged attribuets
[12:02:05 CDT(-0500)] <sjungling> the other has caching on individual data sources AND on merged attributes
[12:03:21 CDT(-0500)] <sjungling> here's the latest with individual caching that's deployed to 1/3 servers https://gist.github.com/b70101a55e5bc66e6333
[12:04:03 CDT(-0500)] <sjungling> it could very well be that this is just our LDAP server, but as one of the largest consumers of our LDAP… uPortal is where we're starting to troubleshoot
[12:06:34 CDT(-0500)] <sjungling> even got Regex working to not search LDAP/Secondary Oracle for "guest" user… since they won't exist
[12:15:36 CDT(-0500)] <sjungling> thinking it might be worth going above and beyond with the LDAP pooling a la http://static.springsource.org/spring-ldap/site/reference/html/pooling.html
[12:17:15 CDT(-0500)] <sjungling> kinda like this https://gist.github.com/b70101a55e5bc66e6333#file_ldap_context.xml
[12:18:06 CDT(-0500)] <EricDalquist> well I wrote the ldap pooling in spring-ldap
[12:18:10 CDT(-0500)] <EricDalquist> and while it depends on your ldap server
[12:18:17 CDT(-0500)] <EricDalquist> we actually found pooling connections reduced performance
[12:18:24 CDT(-0500)] <EricDalquist> though it also reduced load on the ldap server
[12:18:33 CDT(-0500)] <EricDalquist> since with pooling you need connection validation
[12:18:45 CDT(-0500)] <EricDalquist> and dbcp uses a lot of synchronization
[12:18:57 CDT(-0500)] <EricDalquist> I'm grabbing a bite but I'll look at your configs in a few minutes
[12:19:04 CDT(-0500)] <sjungling> kk
[12:19:09 CDT(-0500)] <EricDalquist> in theory with caching you should only see 1 query per user login
[12:19:20 CDT(-0500)] <EricDalquist> any idea how many logins you see/minute?
[12:19:34 CDT(-0500)] <sjungling> not off hand
[12:19:37 CDT(-0500)] <sjungling> too much text scrolling
[12:20:02 CDT(-0500)] <EricDalquist> like 10/minute ... 100/minute ... etc
[12:20:29 CDT(-0500)] <sjungling> probably more toward the 100/min
[12:20:37 CDT(-0500)] <EricDalquist> ok
[12:20:38 CDT(-0500)] <sjungling> Google Analytics Real Time shows about 480 active users
[12:20:54 CDT(-0500)] <EricDalquist> do you have an idea of how many LDAP queries/minute your server is seeing?
[12:21:13 CDT(-0500)] <sjungling> sigh
[12:21:22 CDT(-0500)] <sjungling> it would be super helpful if I had that kind of info
[12:21:28 CDT(-0500)] <sjungling> at present, that cannot be supplied to me
[12:21:33 CDT(-0500)] <EricDalquist> :/
[12:21:46 CDT(-0500)] <sjungling> so… disabling pooling would be your recommendation for a first step?
[12:22:10 CDT(-0500)] <EricDalquist> well that would incrase ldap server load
[12:22:16 CDT(-0500)] <EricDalquist> but in our limited testing case
[12:22:21 CDT(-0500)] <EricDalquist> it performed a little better
[12:22:27 CDT(-0500)] <EricDalquist> but our ldap servers were not overloaded
[12:22:48 CDT(-0500)] <sjungling> our admin says we're experiencing load spikes which in turn starts refusing connections to everyone
[12:23:05 CDT(-0500)] <sjungling> that's why I was thinking maxTotal property from pooling might help curb the spikes
[12:23:21 CDT(-0500)] <EricDalquist> yeah
[12:23:26 CDT(-0500)] <EricDalquist> it would just shift the bottleneck
[12:23:27 CDT(-0500)] <sjungling> this is old skool uP 2.6, but this is how we had LDAP configured before
[12:23:28 CDT(-0500)] <sjungling> https://gist.github.com/b70101a55e5bc66e6333#file_ldap.xml
[12:23:46 CDT(-0500)] <EricDalquist> and depending on your config either throw errors or cause long wait times
[12:27:29 CDT(-0500)] <sjungling> long wait times i could live with… errors… less than optimal
[12:27:48 CDT(-0500)] <sjungling> but it beats the heck outta being the person that's responsible for the app bringing down all network authentication
[12:32:16 CDT(-0500)] <sjungling> so… additional pooling options… got to work locally… signed in. checked caching info for kicks:
[12:32:16 CDT(-0500)] <sjungling> LDAP: 25% (3 hits, 9 misses)
[12:32:34 CDT(-0500)] <sjungling> are misses == queries?
[12:32:49 CDT(-0500)] <EricDalquist> that would be 12 queries
[12:32:52 CDT(-0500)] <EricDalquist> 3 hit
[12:32:54 CDT(-0500)] <EricDalquist> 9 missed
[12:33:20 CDT(-0500)] <sjungling> k. so why does one login generate 12 queries against LDAP??
[12:33:34 CDT(-0500)] <sjungling> gah, didn't mean that extra '?'
[12:33:38 CDT(-0500)] <EricDalquist>
[12:33:53 CDT(-0500)] <EricDalquist> so uPortal is generally "naive" in data access
[12:34:01 CDT(-0500)] <EricDalquist> when it needs a piece of data it requests it from the source
[12:34:07 CDT(-0500)] <EricDalquist> the intent is that there is a caching layouer
[12:34:25 CDT(-0500)] <EricDalquist> which means that the data layer only gets asked once in a while
[12:35:29 CDT(-0500)] <EricDalquist> but one login by one user should be 1 miss and X hits
[12:36:08 CDT(-0500)] <sjungling> https://gist.github.com/b70101a55e5bc66e6333#file_gistfile1.txt
[12:36:17 CDT(-0500)] <sjungling> that's the cache stats from my local box after one login
[12:36:48 CDT(-0500)] <sjungling> maybe that looks a little more correct with 'merged' getting more hits and misses
[12:36:55 CDT(-0500)] <EricDalquist> yes
[12:36:58 CDT(-0500)] <EricDalquist> that does look correct
[12:37:00 CDT(-0500)] <EricDalquist> also
[12:37:10 CDT(-0500)] <EricDalquist> can you try login as a different local user
[12:37:14 CDT(-0500)] <sjungling> sure
[12:37:14 CDT(-0500)] <EricDalquist> right after the first login
[12:37:19 CDT(-0500)] <sjungling> one moment
[12:37:25 CDT(-0500)] <EricDalquist> I'm guessing that the higher numbers are queries for related user data
[12:37:29 CDT(-0500)] <EricDalquist> for things like the fragment owners
[12:37:41 CDT(-0500)] <sjungling> :-/
[12:37:48 CDT(-0500)] <sjungling> trying to query LDAP for fragment owners?
[12:37:53 CDT(-0500)] <EricDalquist> but those should only happen once as well
[12:37:58 CDT(-0500)] <sjungling> that would account for a lot of those misses
[12:38:01 CDT(-0500)] <EricDalquist> well for uPortal a user is a user
[12:40:03 CDT(-0500)] <sjungling> right;
[12:40:11 CDT(-0500)] <EricDalquist> but again those misses should get cached
[12:40:13 CDT(-0500)] <sjungling> but again, in theory, that only happens once
[12:41:02 CDT(-0500)] <sjungling> ok, had another user login and then i logged in again
[12:41:02 CDT(-0500)] <sjungling> USER_INFO.csuc_ldap 00% (1 / 500) 30% (8 hits, 19 misses)
[12:41:02 CDT(-0500)] <sjungling> USER_INFO.csuc_rds 00% (1 / 500) 33% (8 hits, 16 misses)
[12:41:02 CDT(-0500)] <sjungling> USER_INFO.merged 00% (1 / 1000) 67% (57 hits, 28 misses)
[12:41:02 CDT(-0500)] <sjungling> USER_INFO.up_user 00% (1 / 500) 33% (8 hits, 16 misses)
[12:42:55 CDT(-0500)] <EricDalquist> hrm
[12:43:12 CDT(-0500)] <EricDalquist> so one thing you could do is turn on debug logging for org.jasig.services.persondir.support.CachingPersonAttributeDaoImpl
[12:43:53 CDT(-0500)] <EricDalquist> that should log each hit/miss
[12:43:56 CDT(-0500)] <EricDalquist> and the associated key
[12:44:04 CDT(-0500)] <EricDalquist> which might give us some hints
[12:45:18 CDT(-0500)] <sjungling> like this?
[12:45:18 CDT(-0500)] <sjungling> log4j.logger.org.jasig.services.persondir.support.CachingPersonAttributeDaoImpl=DEBUG, R
[12:45:18 CDT(-0500)] <sjungling> log4j.additivity.org.jasig.services.persondir.support.CachingPersonAttributeDaoImpl.type=false
[12:45:37 CDT(-0500)] <EricDalquist> drop the .type off the second line
[12:45:42 CDT(-0500)] <EricDalquist> then it is right
[12:46:06 CDT(-0500)] <sjungling> kewl… restarting
[12:46:48 CDT(-0500)] <EricDalquist> warning ... it will likely be a lot of stuff to sift through in your logs
[12:46:49 CDT(-0500)] <sjungling> preppin' for the wall-o-log msgs
[12:47:59 CDT(-0500)] <sjungling> here's the welcome page: https://gist.github.com/b8a08ce3480f5e46c8b8
[12:48:54 CDT(-0500)] <sjungling> and here's after I logged in https://gist.github.com/b8a08ce3480f5e46c8b8#file_u_portal_user_login.log
[12:49:01 CDT(-0500)] <EricDalquist> uh ... hrm
[12:49:08 CDT(-0500)] <EricDalquist> athena: you there?
[12:49:12 CDT(-0500)] <athena> yep
[12:49:13 CDT(-0500)] <athena> what's up?
[12:49:23 CDT(-0500)] <EricDalquist> take a look at https://gist.github.com/b8a08ce3480f5e46c8b8
[12:49:27 CDT(-0500)] <EricDalquist> starting at line 124
[12:49:36 CDT(-0500)] <EricDalquist> why would uPortal be querying personDir for username=ALL
[12:49:47 CDT(-0500)] <EricDalquist> username=ALL_PORTLETS
[12:50:13 CDT(-0500)] <athena> hmmm . . . it seems like that has to be permission-related
[12:50:19 CDT(-0500)] <EricDalquist> yeah
[12:50:19 CDT(-0500)] <athena> i think those are permission targets, right?
[12:50:49 CDT(-0500)] <EricDalquist> thats what I thought
[12:50:50 CDT(-0500)] <athena> i think we'd had some notion of trying to have something that represented everything and we'd used those to replace some of the non-performant regex calls that had gotten added into the logic
[12:50:57 CDT(-0500)] <athena> that's what i'm remembering offhand, at least
[12:51:08 CDT(-0500)] <athena> to be honest, i'm not quite sure how much and where those targets are being used
[12:51:13 CDT(-0500)] <athena> we still don't have good UI support for them
[12:51:29 CDT(-0500)] <EricDalquist> yeah
[12:51:34 CDT(-0500)] <EricDalquist> I see them in the code ...
[12:52:01 CDT(-0500)] <athena> not sure why they're being called there though
[12:52:05 CDT(-0500)] <EricDalquist> yeah
[12:52:12 CDT(-0500)] <EricDalquist> so that log looks pretty good sjungling
[12:52:39 CDT(-0500)] <sjungling> k
[12:55:16 CDT(-0500)] <sjungling> so… LDAP pooling
[12:55:27 CDT(-0500)] <EricDalquist> I'm going to file a bug about the ALL/ALL_PORTLETS queries
[12:55:31 CDT(-0500)] <EricDalquist> since those shouldn't happen
[12:55:35 CDT(-0500)] <sjungling> kk
[12:55:43 CDT(-0500)] <EricDalquist> but I'm still wondering why you're seeing so much LDAP load
[12:56:03 CDT(-0500)] <EricDalquist> is there any hope of figuring out logins/minute vs ldap queries/minute?
[12:56:23 CDT(-0500)] <sjungling> we've hit a point where there's like 3 major services that consume our LDAP… and none of us are sure if we're the cause or victim of the other
[12:56:37 CDT(-0500)] <sjungling> lemme ping our OSA again
[12:56:37 CDT(-0500)] <EricDalquist> heh
[12:58:02 CDT(-0500)] <sjungling> but, if i'm reading the docs right… spring-ldap pooling maxTotal has a default of unlimited active connections?
[12:58:15 CDT(-0500)] <EricDalquist> I think so
[12:58:23 CDT(-0500)] <EricDalquist> it is just using commons-dbcp under the hood
[12:58:24 CDT(-0500)] <sjungling> that seems… a bit presumptive
[13:01:07 CDT(-0500)] <EricDalquist> hrm
[13:01:07 CDT(-0500)] <EricDalquist> http://static.springsource.org/spring-ldap/site/reference/html/pooling.html
[13:01:12 CDT(-0500)] <EricDalquist> that shows a default of 8
[13:01:33 CDT(-0500)] <EricDalquist> for max active
[13:01:40 CDT(-0500)] <EricDalquist> which essentially limits maxTotal
[13:01:56 CDT(-0500)] <sjungling> but maxTotal also has a property of -1, right?
[13:02:00 CDT(-0500)] <EricDalquist> right
[13:02:06 CDT(-0500)] <EricDalquist> but you'll never get beyond ~8
[13:02:10 CDT(-0500)] <EricDalquist> since once you have 8 active
[13:02:20 CDT(-0500)] <EricDalquist> a new connection request just waits for an active slot
[13:02:27 CDT(-0500)] <EricDalquist> or fails
[13:02:33 CDT(-0500)] <EricDalquist> depending on whenExhaustedAction
[13:02:34 CDT(-0500)] <sjungling> hrm
[13:03:56 CDT(-0500)] <sjungling> so, disabling pooling removes that maxActive limit, correct?
[13:05:29 CDT(-0500)] <EricDalquist> so looking at https://gist.github.com/b70101a55e5bc66e6333
[13:08:53 CDT(-0500)] <EricDalquist> so looking at https://gist.github.com/b12feca69bb37fcf44e5
[13:09:07 CDT(-0500)] <EricDalquist> it looks like you'd peak out at 50 ldap connections from each server
[13:09:47 CDT(-0500)] <sjungling> so maxActive would supersede maxTotal?
[13:09:58 CDT(-0500)] <EricDalquist> yes
[13:10:03 CDT(-0500)] <sjungling> and… that would be better than default 8 connections (if that's the bottleneck)
[13:10:23 CDT(-0500)] <EricDalquist> here is our config: https://gist.github.com/3501568
[13:11:46 CDT(-0500)] <sjungling> k, lemme give that a whirl
[13:16:39 CDT(-0500)] <sjungling> ldapContext.xml] is invalid; nested exception is org.xml.sax.SAXParseException: The prefix "util" for element "util:constant" is not bound
[13:17:08 CDT(-0500)] <sjungling> searchControl
[13:17:38 CDT(-0500)] <EricDalquist> updated the gist
[13:17:48 CDT(-0500)] <EricDalquist> you need to add the util namespace to the root element
[13:20:11 CDT(-0500)] <sjungling> got it
[13:20:13 CDT(-0500)] <sjungling> testing
[13:21:33 CDT(-0500)] <athena> anyone have any favorite tomcat/jvm/jmx/whatever monitoring utilities for uportal?
[13:22:14 CDT(-0500)] <EricDalquist> uhg no
[13:22:22 CDT(-0500)] <EricDalquist> I would love to find a decent JMX monitoring tool
[13:22:38 CDT(-0500)] <athena> anything else you guys use for monitoring JVMs in general?
[13:22:44 CDT(-0500)] <EricDalquist> watchdog
[13:22:49 CDT(-0500)] <EricDalquist> to watch the process
[13:22:50 CDT(-0500)] <EricDalquist> that is it
[13:22:56 CDT(-0500)] <EricDalquist> we really want jvm internals monitoring
[13:23:04 CDT(-0500)] <EricDalquist> but just haven't had the resources to really look into it
[13:23:14 CDT(-0500)] <EricDalquist> sjungling: so one of our 4 servers has 639 active users
[13:23:25 CDT(-0500)] <EricDalquist> and 10 LDAP connections
[13:24:36 CDT(-0500)] <sjungling> dang
[13:25:07 CDT(-0500)] <EricDalquist> we get a lot of login turn over
[13:25:22 CDT(-0500)] <sjungling> hows that? login/logout/login agains?
[13:25:23 CDT(-0500)] <EricDalquist> PD ldap cache is at 4746/22476 hits/misses
[13:25:38 CDT(-0500)] <EricDalquist> login view one/two pages and leave
[13:25:44 CDT(-0500)] <sjungling> ah, yeah
[13:25:46 CDT(-0500)] <sjungling> we get that
[13:25:52 CDT(-0500)] <sjungling> we're a springboard to "other stuff"
[13:26:01 CDT(-0500)] <EricDalquist> PD merged cache: 378388/36466 hits/misses
[13:26:13 CDT(-0500)] <sjungling> so your LDAP cache has a similar caching percentage…
[13:26:26 CDT(-0500)] <EricDalquist> so 91% hit rate for merged ... 16% hit rate for ldap
[13:26:29 CDT(-0500)] <EricDalquist> which is what we would expect
[13:26:33 CDT(-0500)] <sjungling> kk
[13:26:39 CDT(-0500)] <EricDalquist> the merging layer should catch most of the caching hits
[13:26:41 CDT(-0500)] <sjungling> so our 70-75% on merged isn't too far off
[13:26:49 CDT(-0500)] <EricDalquist> yeah
[13:26:56 CDT(-0500)] <sjungling> kk
[13:31:07 CDT(-0500)] <sjungling> ok… looks like i finally got a successful login using your ldap config
[13:39:11 CDT(-0500)] <EricDalquist> HAHA: http://www.youtube.com/watch?v=dYBjVTMUQY0
[13:58:52 CDT(-0500)] <athena> omg.
[13:58:52 CDT(-0500)] <athena> lol
[14:19:06 CDT(-0500)] <drewwills> lol
[14:32:06 CDT(-0500)] <sjungling> @ericdalquist seeing bunch of these https://raw.github.com/gist/f706e6c0069725189c01/97c01b481411cc812b9802b41dc90423a66f7395/uPortal.log
[14:32:44 CDT(-0500)] <sjungling> by your estimations do those look like a "it's not uP but LDAP" kinda thing?
[14:50:07 CDT(-0500)] <EricDalquist> huh
[14:50:08 CDT(-0500)] <EricDalquist> not sure
[14:51:18 CDT(-0500)] <EricDalquist> what does your target org.springframework.ldap.core.support.LdapContextSource bean look like again?
[14:51:59 CDT(-0500)] <EricDalquist> also do you have a snippet of your ehcache config handy for the caches used by PD?
[14:52:17 CDT(-0500)] <sjungling> yep, one moment
[14:53:30 CDT(-0500)] <sjungling> https://gist.github.com/70950a606b5fe4a0de49
[14:54:16 CDT(-0500)] <EricDalquist> so one thing I'd look at is cache sizes
[14:54:20 CDT(-0500)] <EricDalquist> either via the cache manager portlet
[14:54:21 CDT(-0500)] <EricDalquist> or jmx
[14:54:30 CDT(-0500)] <EricDalquist> see how full your persondir caches are
[14:54:39 CDT(-0500)] <EricDalquist> you might be running out of space with only 1000 elements
[14:54:43 CDT(-0500)] <EricDalquist> resulting in more queries
[14:54:55 CDT(-0500)] <EricDalquist> same with 500 for ldap queries
[14:55:22 CDT(-0500)] <sjungling> one of our boxes… (21 / 500) for merged
[14:55:25 CDT(-0500)] <EricDalquist> if you get more than 500 unique, active, users over 10 minutes you'll end up with extra queries being run
[14:56:29 CDT(-0500)] <sjungling> groups.IEntity, StylesheetUserPreferencesImpl.Query, and FragmentDefinition.Query are the only ones with full objects
[14:56:46 CDT(-0500)] <EricDalquist> hrm ok
[14:57:27 CDT(-0500)] <EricDalquist> and can you add your ldapContext.xml to that gist?
[15:03:54 CDT(-0500)] <sjungling> added
[15:04:46 CDT(-0500)] <EricDalquist> hrm
[15:04:55 CDT(-0500)] <EricDalquist> so I think that is the LDAP server returning an error
[15:05:34 CDT(-0500)] <EricDalquist> or ... well it could be the csuc-student-announcements portlet timing out
[15:05:44 CDT(-0500)] <EricDalquist> since when a portlet times out thread.interrupt is called on it
[15:06:15 CDT(-0500)] <EricDalquist> yeah that is the case
[15:06:23 CDT(-0500)] <EricDalquist> so the portal is trying to execute csuc-student-announcements
[15:06:33 CDT(-0500)] <EricDalquist> which takes longer than the configured timeout
[15:06:52 CDT(-0500)] <EricDalquist> and when the portlet times out the portal sticks that worker thread on a recovery queue
[15:07:00 CDT(-0500)] <EricDalquist> and periodically calls Thread.interrupt on it
[15:07:04 CDT(-0500)] <EricDalquist> until the worker returns
[15:07:12 CDT(-0500)] <EricDalquist> if you're seeing a lot of thse
[15:07:25 CDT(-0500)] <EricDalquist> I'd say it implies that the LDAP searches are taking a long time
[15:07:35 CDT(-0500)] <EricDalquist> looking at the bottom of the stack trace
[15:07:41 CDT(-0500)] <EricDalquist> the LDAP connection is established
[15:07:59 CDT(-0500)] <EricDalquist> and the code is waiting in readReply to read the response from the LDAP server