pgtInit returns null pgtIou due to pgtUrl readTimeout (less than 1% occurrence)
Description
Environment
Activity
Matt BorjaDecember 4, 2013 at 6:11 PM
It appears we've gotten a couple null pgtIou responses, and I was able to reproduce the same behavior one time and noticed it took upwards of 5 seconds (around 8 seconds), causing another readTimeout. However, with our Couchbase implementation, I believe this would imply one of two things (or both):
Audit and update/replace the behavior of HttpClient to, as you pointed out before, utilize apache httpclient to support retries.
Handle pgtRetry during a PGT session within CAS-ification (Classic ASP example shown below)
Cookie-based pgtRety implementation
<%
pgtIou = CASRequestCustom(pgtInitResponse, "cas:proxyGrantingTicket")
'Handle pgtRetry if readTimeout is thrown by HttpClient
If pgtIou = False Then
If Request.Cookies("pgtRetry") = "1" Then
Response.Cookies("pgtRetry") = ""
Response.Cookies("pgtRetry").Expires = dateadd("h",-1,now())
Response.Cookies("pgtRetry").Path = "/"
Response.Cookies("pgtRetry").Domain = ".example.com"
'Failed after retry
pgtIou = ""
Else
Response.Cookies("pgtRetry") = "1"
Response.Cookies("pgtRetry").Expires = dateadd("s",30,now())
Response.Cookies("pgtRetry").Path = "/"
Response.Cookies("pgtRetry").Domain = ".example.com"
'Retry one time
Response.Redirect("https://cas.example.com/login?service=" & service_url
End If
End If
%>
Matt BorjaDecember 4, 2013 at 5:00 PMEdited
Just a quick update. We have deployed Couchbase in production to front-load our proxy callback URL and have begun monitoring for read timeouts.
In a separate test case, though, I can confirm that our proxy callback response time has dropped from 6-7 seconds down to 600ms. Here is some sample code written in Classic ASP with support for database fallback in the event Couchbase is unreachable.
<%
Response.CacheControl = "no-cache"
Const STORAGE_METHOD = "couchbase" 'Valid options: couchbase, database
Function StorePgt(ByVal pgtIou, ByVal pgtId, ByVal storage_method)
Select Case storage_method
Case "couchbase"
Dim mc: If couchbase_connect(mc, 14400) Then
On Error Resume Next
mc.Set CStr(Sanitize(pgtIou)), Sanitize(pgtId), 14400
If Err.Number <> 0 Then
Call StorePgt(pgtIou, pgtId, "database") 'Fallback
End If
On Error Goto 0
Else
Call StorePgt(pgtIou, pgtId, "database") 'Fallback
End If
Case "database"
Dim conn: If db_connect(conn, hostname, username, password, database, library) Then
conn.Execute("INSERT INTO pgtLookups (pgtIou, pgtId) VALUES('" & Sanitize(pgtIou) & "', '" & Sanitize(pgtId) & "')")
Else
'Database is unavailable. Nothing else to fallback on.
End If
End Select
End Function
Function couchbase_connect(ByRef mc, ByVal default_expiry)
On Error Resume Next
Const max_retries = 5
Dim mcServers(1)
mcServers(0) = "server1"
mcServers(1) = "server2"
Set mc = Server.CreateObject("memcachedCOM")
For t = 1 To max_retries
If Not mc_ready Then
For i = LBound(mcServers) to UBound(mcServers)
mc_server = mcServers(i)
If Not mc_ready Then
On Error Resume Next
mc.Open(mc_server)
mc.Set "check_server", "OK"
mc.Get("check_server")
If Err.Number <> 0 Then
mc_ready = False
Else
mc_ready = True
mc_use_server = mc_server
result = "OK"
End If
mc.Close()
On Error Goto 0
End If
Next
End If
Next
If mc_ready Then
On Error Resume Next
mc.tryOpen(mc_use_server)
mc.Set "check_server", "OK", 1
mc.Delete "check_server"
On Error Resume Next
couchbase_connect = True
Else
couchbase_connect = False
End If
On Error Goto 0
End Function
Function db_connect(ByRef conn, ByVal hostname, ByVal username, ByVal password, ByVal database, ByVal library)
On Error Resume Next
Set conn = server.createobject("ADODB.Connection")
conn.open "PROVIDER=SQLOLEDB;DATA SOURCE=" & hostname & ";UID=" & username & ";PWD=" & password & ";DATABASE=" & database & ";Network Library=" & library
If Err.Number = 0 Then
db_connect = True
Else
db_connect = False
End If
On Error Goto 0
End Function
'Proxy callback
Dim pgtIou, pgtId
pgtIou = Request.QueryString("pgtIou")
pgtId = Request.QueryString("pgtId")
If Request.QueryString("proxyResponse") = "True" Then
If pgtIou <> "" And pgtId <> "" Then
Call StorePgt(pgtIou, pgtId, STORAGE_METHOD)
End If
'Prevent duplication
Response.Redirect( Request.ServerVariables("URL") )
End If
%>
To lookup the pgtId, you could use something like this (slightly backwards in that this looks for the pgtIou in the database first before querying Couchbase):
<%
Function pgtIdLookup(pgtIou)
Dim dbconn
Set dbconn = Server.CreateObject("ADODB.Connection")
dbconn.open "PROVIDER=SQLOLEDB..."
Dim rs: Set rs = dbconn.Execute("SELECT TOP 1 * FROM pgtLookups WHERE pgtIou = '" & Sanitize(pgtIou) & "'")
If Not rs.EOF Then
pgtIdLookup = Trim(rs("pgtId"))
Else
Dim mc: If couchbase_connect(mc, DEFAULT_EXPIRY) Then
On Error Resume Next
pgtIdLookup = mc.Get(CStr(Sanitize(pgtIou)))
If Err.Number <> 0 Then
pgtIdLookup = False
End If
On Error Goto 0
Else
pgtIdLookup = False
End If
End If
dbconn.Close()
Set dbconn = Nothing
End Function
%>
Matt BorjaDecember 3, 2013 at 7:02 PMEdited
So after doing some deeper tracing, it appears that the read timeout being thrown is as a result of the proxy callback URL as indicated in ServiceValidateController#getServiceCredentialsFromRequest
In our case specifically, our proxy callback utilizes a SQL Server database (under load from our website) and I was able to see the connection take anywhere from 6-7 seconds. Even though HttpClient is able to connect just fine (satisfying the connection timeout threshold), our page won't actually return a 302 response until database connection has been established.
Our initial approach was to increase the readTimeout in applicationContext.xml from the 5 second default to 10 seconds. However, it is obvious at this point that not all production and database environments are configured equally.
My recommendation at this point would be for a documentation update outlining the possibility of read timeouts caused by pgtUrl and suggest a front-loaded memcached implementation whereby pgtIou/pgtId pairs are stored to Couchbase with possibly a database as a fallback or secondary option.
For reference, the following is a snippet of our current implementation in Classic ASP that takes 6-7 seconds somewhat frequently:
Set conn = Server.CreateObject("ADODB.Connection")
conn.Open "PROVIDER=SQLOLEDB..."
If Request.QueryString("proxyResponse") = "True" Then
If pgtIou <> "" And pgtId <> "" Then
conn.Execute("INSERT INTO pgtLookups (pgtIou, pgtId) VALUES('" & Sanitize(pgtIou) & "', '" & Sanitize(pgtId) & "')")
End If
'Prevent duplication
Response.Redirect( CStr(Request.ServerVariables("URL")) )
End If
conn.Close
Set conn = Nothing
Misagh MoayyedDecember 3, 2013 at 5:21 PM
Great. Please post back the results when you do get the chance.
HttpClient is the place to review, specially if we are to implement some sort of re-try logic. It's worth noting that if this in fact turns out to be the issue, it might prove more useful if we revamp the implementation to use something like apache httpclient where it does have support for retrying connection attempts. But, first thing first, lets make sure this in fact is the underlying problem.
Appreciate you trying this out.
Matt BorjaDecember 3, 2013 at 4:54 PMEdited
This might actually solve our problem. Looks like applicationContext.xml within spring-configuration has the closest match. I'll have our server admin increase this timeout and see if this fixes our problem.
<bean id="httpClient" class="org.jasig.cas.util.HttpClient"
p:readTimeout="5000"
p:connectionTimeout="5000"/>
Thanks!
This occurs about once out of perhaps every 500-600 logins (from what we've seen) and is probably the last bug that needs to be addressed in order to achieve 100% service availability. Please see the following ClearPass transaction (includes proxy ticket request, etc).
Service Ticket: ST-1234
pgtInit Request
https://cas.example.com/serviceValidate?ticket=ST-1234&pgtUrl=https%3A%2F%2Fwww%2Eexample%2Ecom%2FpgtUrl%2Easp%3FproxyResponse%3DTrue&service=https%3A%2F%2Fwww%2Eexample%2Ecom%2Fclearpass.asp
pgtInit Response
<cas:serviceResponse xmlns:cas='http://www.yale.edu/tp/cas'> <cas:authenticationSuccess> <cas:user>jdoe</cas:user> </cas:authenticationSuccess> </cas:serviceResponse>
pgtIou
[null]
pgtId
0 chars
Proxy Ticket Response
<cas:serviceResponse xmlns:cas='http://www.yale.edu/tp/cas'> <cas😛roxyFailure code='INVALID_REQUEST'> 'pgt' and 'targetService' parameters are both required </cas😛roxyFailure> </cas:serviceResponse>
Request:
https://cas.example.com/clearPass?ticket= (proxy ticket)
casUsername
jdoe
casPassword
0 chars
ClearPass Response
<cas:clearPassResponse xmlns:cas='http://www.yale.edu/tp/cas'> <cas:clearPassFailure>No authentication information provided.</cas:clearPassFailure> </cas:clearPassResponse>
This doesn't happen often, but it has happened to 7 people in the past 5 hours, one of which I've found to be associated with a SocketTimeoutException which led to additional exceptions resulting in error.authentication.credentials.bad. Others, however, show no signs of such exceptions.
Granted, it's a less than 1% failure rate, but fails nonetheless for some users under certain circumstances. The observed result in the pgtInit response is that the pgtInit response doesn't return a PGT-IOU, even though we are still getting a cas:authenticationSuccess XML response.
Could this be caused due to replication latency as we are currently storing tickets in a replicated memcached environment?