pgtInit returns null pgtIou due to pgtUrl readTimeout (less than 1% occurrence)

Description

This occurs about once out of perhaps every 500-600 logins (from what we've seen) and is probably the last bug that needs to be addressed in order to achieve 100% service availability. Please see the following ClearPass transaction (includes proxy ticket request, etc).

Service Ticket: ST-1234

pgtInit Request
https://cas.example.com/serviceValidate?ticket=ST-1234&pgtUrl=https%3A%2F%2Fwww%2Eexample%2Ecom%2FpgtUrl%2Easp%3FproxyResponse%3DTrue&service=https%3A%2F%2Fwww%2Eexample%2Ecom%2Fclearpass.asp

pgtInit Response
<cas:serviceResponse xmlns:cas='http://www.yale.edu/tp/cas'> <cas:authenticationSuccess> <cas:user>jdoe</cas:user> </cas:authenticationSuccess> </cas:serviceResponse>

pgtIou
[null]

pgtId
0 chars

Proxy Ticket Response
<cas:serviceResponse xmlns:cas='http://www.yale.edu/tp/cas'> <cas😛roxyFailure code='INVALID_REQUEST'> 'pgt' and 'targetService' parameters are both required </cas😛roxyFailure> </cas:serviceResponse>

Request:
https://cas.example.com/clearPass?ticket= (proxy ticket)

casUsername
jdoe

casPassword
0 chars

ClearPass Response
<cas:clearPassResponse xmlns:cas='http://www.yale.edu/tp/cas'> <cas:clearPassFailure>No authentication information provided.</cas:clearPassFailure> </cas:clearPassResponse>

This doesn't happen often, but it has happened to 7 people in the past 5 hours, one of which I've found to be associated with a SocketTimeoutException which led to additional exceptions resulting in error.authentication.credentials.bad. Others, however, show no signs of such exceptions.

Granted, it's a less than 1% failure rate, but fails nonetheless for some users under certain circumstances. The observed result in the pgtInit response is that the pgtInit response doesn't return a PGT-IOU, even though we are still getting a cas:authenticationSuccess XML response.

Could this be caused due to replication latency as we are currently storing tickets in a replicated memcached environment?

Environment

CAS 3.5.2 with ClearPass / Couchbase with replication

Activity

Show:

Matt BorjaDecember 4, 2013 at 6:11 PM

It appears we've gotten a couple null pgtIou responses, and I was able to reproduce the same behavior one time and noticed it took upwards of 5 seconds (around 8 seconds), causing another readTimeout. However, with our Couchbase implementation, I believe this would imply one of two things (or both):

  • Audit and update/replace the behavior of HttpClient to, as you pointed out before, utilize apache httpclient to support retries.

  • Handle pgtRetry during a PGT session within CAS-ification (Classic ASP example shown below)

Cookie-based pgtRety implementation

<% pgtIou = CASRequestCustom(pgtInitResponse, "cas:proxyGrantingTicket") 'Handle pgtRetry if readTimeout is thrown by HttpClient If pgtIou = False Then If Request.Cookies("pgtRetry") = "1" Then Response.Cookies("pgtRetry") = "" Response.Cookies("pgtRetry").Expires = dateadd("h",-1,now()) Response.Cookies("pgtRetry").Path = "/" Response.Cookies("pgtRetry").Domain = ".example.com" 'Failed after retry pgtIou = "" Else Response.Cookies("pgtRetry") = "1" Response.Cookies("pgtRetry").Expires = dateadd("s",30,now()) Response.Cookies("pgtRetry").Path = "/" Response.Cookies("pgtRetry").Domain = ".example.com" 'Retry one time Response.Redirect("https://cas.example.com/login?service=" & service_url End If End If %>

Matt BorjaDecember 4, 2013 at 5:00 PM
Edited

Just a quick update. We have deployed Couchbase in production to front-load our proxy callback URL and have begun monitoring for read timeouts.

In a separate test case, though, I can confirm that our proxy callback response time has dropped from 6-7 seconds down to 600ms. Here is some sample code written in Classic ASP with support for database fallback in the event Couchbase is unreachable.

<% Response.CacheControl = "no-cache" Const STORAGE_METHOD = "couchbase" 'Valid options: couchbase, database Function StorePgt(ByVal pgtIou, ByVal pgtId, ByVal storage_method) Select Case storage_method Case "couchbase" Dim mc: If couchbase_connect(mc, 14400) Then On Error Resume Next mc.Set CStr(Sanitize(pgtIou)), Sanitize(pgtId), 14400 If Err.Number <> 0 Then Call StorePgt(pgtIou, pgtId, "database") 'Fallback End If On Error Goto 0 Else Call StorePgt(pgtIou, pgtId, "database") 'Fallback End If Case "database" Dim conn: If db_connect(conn, hostname, username, password, database, library) Then conn.Execute("INSERT INTO pgtLookups (pgtIou, pgtId) VALUES('" & Sanitize(pgtIou) & "', '" & Sanitize(pgtId) & "')") Else 'Database is unavailable. Nothing else to fallback on. End If End Select End Function Function couchbase_connect(ByRef mc, ByVal default_expiry) On Error Resume Next Const max_retries = 5 Dim mcServers(1) mcServers(0) = "server1" mcServers(1) = "server2" Set mc = Server.CreateObject("memcachedCOM") For t = 1 To max_retries If Not mc_ready Then For i = LBound(mcServers) to UBound(mcServers) mc_server = mcServers(i) If Not mc_ready Then On Error Resume Next mc.Open(mc_server) mc.Set "check_server", "OK" mc.Get("check_server") If Err.Number <> 0 Then mc_ready = False Else mc_ready = True mc_use_server = mc_server result = "OK" End If mc.Close() On Error Goto 0 End If Next End If Next If mc_ready Then On Error Resume Next mc.tryOpen(mc_use_server) mc.Set "check_server", "OK", 1 mc.Delete "check_server" On Error Resume Next couchbase_connect = True Else couchbase_connect = False End If On Error Goto 0 End Function Function db_connect(ByRef conn, ByVal hostname, ByVal username, ByVal password, ByVal database, ByVal library) On Error Resume Next Set conn = server.createobject("ADODB.Connection") conn.open "PROVIDER=SQLOLEDB;DATA SOURCE=" & hostname & ";UID=" & username & ";PWD=" & password & ";DATABASE=" & database & ";Network Library=" & library If Err.Number = 0 Then db_connect = True Else db_connect = False End If On Error Goto 0 End Function 'Proxy callback Dim pgtIou, pgtId pgtIou = Request.QueryString("pgtIou") pgtId = Request.QueryString("pgtId") If Request.QueryString("proxyResponse") = "True" Then If pgtIou <> "" And pgtId <> "" Then Call StorePgt(pgtIou, pgtId, STORAGE_METHOD) End If 'Prevent duplication Response.Redirect( Request.ServerVariables("URL") ) End If %>

To lookup the pgtId, you could use something like this (slightly backwards in that this looks for the pgtIou in the database first before querying Couchbase):

<% Function pgtIdLookup(pgtIou) Dim dbconn Set dbconn = Server.CreateObject("ADODB.Connection") dbconn.open "PROVIDER=SQLOLEDB..." Dim rs: Set rs = dbconn.Execute("SELECT TOP 1 * FROM pgtLookups WHERE pgtIou = '" & Sanitize(pgtIou) & "'") If Not rs.EOF Then pgtIdLookup = Trim(rs("pgtId")) Else Dim mc: If couchbase_connect(mc, DEFAULT_EXPIRY) Then On Error Resume Next pgtIdLookup = mc.Get(CStr(Sanitize(pgtIou))) If Err.Number <> 0 Then pgtIdLookup = False End If On Error Goto 0 Else pgtIdLookup = False End If End If dbconn.Close() Set dbconn = Nothing End Function %>

Matt BorjaDecember 3, 2013 at 7:02 PM
Edited

So after doing some deeper tracing, it appears that the read timeout being thrown is as a result of the proxy callback URL as indicated in ServiceValidateController#getServiceCredentialsFromRequest

In our case specifically, our proxy callback utilizes a SQL Server database (under load from our website) and I was able to see the connection take anywhere from 6-7 seconds. Even though HttpClient is able to connect just fine (satisfying the connection timeout threshold), our page won't actually return a 302 response until database connection has been established.

Our initial approach was to increase the readTimeout in applicationContext.xml from the 5 second default to 10 seconds. However, it is obvious at this point that not all production and database environments are configured equally.

My recommendation at this point would be for a documentation update outlining the possibility of read timeouts caused by pgtUrl and suggest a front-loaded memcached implementation whereby pgtIou/pgtId pairs are stored to Couchbase with possibly a database as a fallback or secondary option.

For reference, the following is a snippet of our current implementation in Classic ASP that takes 6-7 seconds somewhat frequently:

Set conn = Server.CreateObject("ADODB.Connection") conn.Open "PROVIDER=SQLOLEDB..." If Request.QueryString("proxyResponse") = "True" Then If pgtIou <> "" And pgtId <> "" Then conn.Execute("INSERT INTO pgtLookups (pgtIou, pgtId) VALUES('" & Sanitize(pgtIou) & "', '" & Sanitize(pgtId) & "')") End If 'Prevent duplication Response.Redirect( CStr(Request.ServerVariables("URL")) ) End If conn.Close Set conn = Nothing

Misagh MoayyedDecember 3, 2013 at 5:21 PM

Great. Please post back the results when you do get the chance.

HttpClient is the place to review, specially if we are to implement some sort of re-try logic. It's worth noting that if this in fact turns out to be the issue, it might prove more useful if we revamp the implementation to use something like apache httpclient where it does have support for retrying connection attempts. But, first thing first, lets make sure this in fact is the underlying problem.

Appreciate you trying this out.

Matt BorjaDecember 3, 2013 at 4:54 PM
Edited

This might actually solve our problem. Looks like applicationContext.xml within spring-configuration has the closest match. I'll have our server admin increase this timeout and see if this fixes our problem.

<bean id="httpClient" class="org.jasig.cas.util.HttpClient" p:readTimeout="5000" p:connectionTimeout="5000"/>

Thanks!

Fixed

Details

Assignee

Reporter

Components

Fix versions

Affects versions

Priority

Created December 3, 2013 at 2:30 AM
Updated April 4, 2014 at 7:49 AM
Resolved April 4, 2014 at 7:49 AM

Flag notifications