Clustering the CAS server

Overview

To run CAS servers in a load-balanced cluster, the various ticket caches must be replicated across all of the servers in the cluster.

Even if sticky sessions are used to cause an individual user to always be directed to a single CAS server, this alone would not solve the problem. When a CAS "target" website (i.e. the CAS client) attempts to validate the ticket, there is currently no way to ensure that the validation request gets load-balanced to the same server in the cluster which vended the ticket. By replicating the ticket caches, this problem is avoided, because the ticket will be available to all servers in the cluster.

By using JGroups, we were able to easily add replication to the hashtables that underlie the ticket caches.

Important Note

This implementation assumes that tickets are generated by a robust GUID generator, so that two servers in a cluster will never create duplicate tickets. I'm not sure whether this holds; I haven't looked closely at CAS's current ticket generator.
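
For illustration only (this is an assumption, not CAS's actual ticket generator), a generator along the following lines would make cross-server collisions vanishingly unlikely; the class name and ID format are invented:

import java.security.SecureRandom;

// Hypothetical sketch of a collision-resistant ticket ID generator.
// This is NOT CAS's real implementation; it only illustrates the assumption above.
public class RandomTicketIdGenerator
{
    private static final char[] ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789".toCharArray();

    private final SecureRandom random = new SecureRandom();

    public String newTicketId(String prefix)
    {
        // 20 characters drawn from a 62-character alphabet is far more entropy
        // than two servers in a cluster could plausibly collide on.
        StringBuffer sb = new StringBuffer(prefix).append('-');
        for (int i = 0; i < 20; i++)
            sb.append(ALPHABET[random.nextInt(ALPHABET.length)]);
        return sb.toString();
    }
}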

The Ticket Must Be Serializable

Self-explanatory: the ticket superclass must implement java.io.Serializable so that tickets can be marshalled across the JGroups channel.
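
A minimal sketch of what that looks like (the class and field here are placeholders, not CAS's actual ticket superclass):

import java.io.Serializable;

// Placeholder sketch: the real ticket superclass just needs to declare
// java.io.Serializable so that its subclasses can be replicated.
public abstract class Ticket implements Serializable
{
    private final String id;

    protected Ticket(String id)
    {
        this.id = id;
    }

    public String getId()
    {
        return id;
    }
}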

Changing the Cache Constructors

The primary changes were made to the constructors of each ticket cache object. Here is an example from ServiceTicketCache:

The original constructor:

public ServiceTicketCache(Class ticketType, int tolerance)
{
    super(tolerance);
    if (!ServiceTicket.class.isAssignableFrom(ticketType))
        throw new IllegalArgumentException("...");
    this.ticketType = ticketType;
    this.ticketCache = Collections.synchronizedMap(new HashMap());
}

The new constructor:

public ServiceTicketCache(Class ticketType, int tolerance, String clusterProperties, String groupName)
{
    super(tolerance);
    if (!ServiceTicket.class.isAssignableFrom(ticketType))
        throw new IllegalArgumentException("...");
    this.ticketType = ticketType;
    
    // TODO this code should be refactored to a superclass
    if(clusterProperties==null || clusterProperties.trim().length()==0)
    {
        // not clustered - just use a synchronized map
        this.ticketCache = Collections.synchronizedMap(new HashMap());
    }
    else
    {
        try
        {
            // connect to a JGroups channel for clustered operation
            JChannel channel=new JChannel(clusterProperties);
            channel.setOpt(Channel.AUTO_RECONNECT, Boolean.TRUE);
            channel.connect(groupName);
            // TODO timeout should be configurable
            this.ticketCache = new DistributedHashtable(channel, 2000);
        }
        catch (ChannelException e)
        {
            // Clustering failed!  Using non-clustered mode.
            this.ticketCache = Collections.synchronizedMap(new HashMap());
        }
    }
}

You'll notice two new parameters to the constructor: clusterProperties and groupName.

clusterProperties is a string containing the JGroups "protocol stack" for the cluster. See the JGroups documentation for more information.

groupName is a name that identifies the group communicating on the channel.
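
As a rough illustration of how the two settings relate (this standalone snippet is not part of CAS; the class name and group name are invented), two channels built from the same protocol stack and connected with the same group name end up in the same group view:

import org.jgroups.JChannel;

// Hypothetical demonstration: two channels sharing a protocol stack string and a
// group name join the same group, so both members appear in the view.
public class GroupNameDemo
{
    public static void main(String[] args) throws Exception
    {
        String stack = args[0];      // the clusterProperties string
        JChannel a = new JChannel(stack);
        JChannel b = new JChannel(stack);
        a.connect("stCache");        // same group name on both channels...
        b.connect("stCache");        // ...so they form one two-member group
        System.out.println("members: " + a.getView().getMembers());
        b.close();
        a.close();
    }
}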

The Cluster Properties (Protocol Stack)

We chose to use the same clusterProperties but a different groupName for each ticket cache. The clusterProperties are configured with a range of ports so that, when the second cache to initialize discovers that the first port is already in use by a differently-named group, it moves on to the next port in the range.

Here's an example of the type of protocol stack that has worked for us.

UDP(mcast_addr=228.8.8.8;mcast_port=45566;ip_ttl=32;mcast_send_buf_size=150000;mcast_recv_buf_size=80000):
PING(timeout=5000;num_initial_members=2):
MERGE2(min_interval=5000;max_interval=10000):
FD_SOCK:
VERIFY_SUSPECT(timeout=1500):
pbcast.NAKACK(gc_lag=50;retransmit_timeout=300,600,1200,2400,4800):
UNICAST(timeout=5000):
pbcast.STABLE(desired_avg_gossip=20000):
FRAG(frag_size=4096;down_thread=false;up_thread=false):
ENCRYPT(key_store_name=defaultStore.keystore;store_password=changeit;alias=myKey):
pbcast.GMS(join_timeout=5000;join_retry_timeout=5000;shun=false;print_local_addr=true):
pbcast.STATE_TRANSFER

Here's another example protocol stack, this time using TCP & TCPPING instead of UDP multicast.

TCP(start_port=7800):
TCPPING(initial_hosts=host1[7800],host2[7800];port_range=5;timeout=2000;num_initial_members=3;up_thread=true;down_thread=true):
MERGE2(min_interval=5000;max_interval=10000):
FD_SOCK:
VERIFY_SUSPECT(timeout=1500):
pbcast.NAKACK(gc_lag=50;retransmit_timeout=300,600,1200,2400,4800):
pbcast.STABLE(desired_avg_gossip=20000):
ENCRYPT(key_store_name=defaultStore.keystore;store_password=changeit;alias=myKey):
pbcast.GMS(join_timeout=800;join_retry_timeout=800;shun=false;print_local_addr=true):
pbcast.STATE_TRANSFER

It took a little bit of extra work to get the ENCRYPT protocol working, but I think it was worth the effort.

Initializing and Closing the Caches

The clusterProperties setting is configured in the "web.xml" file. CacheInit.java was altered to read this value from the application context, strip out whitespace (which JGroups dislikes) with a regular expression, and pass it to the various cache constructors.

Here's the snippet of code that performs this initialization:

String clusterProperties = app.getInitParameter("edu.yale.its.tp.cas.clusterProperties");
// clean all whitespace from the properties
if(clusterProperties!=null) clusterProperties = clusterProperties.replaceAll("[ \\n\\t]","");

// set up the caches...
GrantorCache tgcCache = new GrantorCache(TicketGrantingTicket.class, grantingTimeout, clusterProperties, "tgcCache");
GrantorCache pgtCache = new GrantorCache(ProxyGrantingTicket.class, grantingTimeout, clusterProperties, "pgtCache");
ServiceTicketCache stCache = new ServiceTicketCache(ServiceTicket.class, serviceTimeout, clusterProperties, "stCache");
ServiceTicketCache ptCache = new ServiceTicketCache(ProxyTicket.class, serviceTimeout, clusterProperties, "ptCache");
LoginTicketCache ltCache = new LoginTicketCache(loginTimeout, clusterProperties, "ltCache");
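
The contextDestroyed() method shown below looks these caches up in the ServletContext by name, so the initialization presumably also registers them there; a sketch of what that registration might look like (the surrounding CacheInit code is an assumption):

// Assumed registration step: store each cache in the ServletContext under the
// attribute names that contextDestroyed() later retrieves.
app.setAttribute("tgcCache", tgcCache);
app.setAttribute("pgtCache", pgtCache);
app.setAttribute("stCache", stCache);
app.setAttribute("ptCache", ptCache);
app.setAttribute("ltCache", ltCache);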

Also, CacheInit.java was altered to close the channels on all of the caches when the context was destroyed. This is necessary to free up the ports whenever the web application is restarted. Here's the corresponding code:

CacheInit.java:

public void contextDestroyed(ServletContextEvent sce)
{
    // close all of the channels (in case we're using clustering)
    ServletContext app = sce.getServletContext();
    ((GrantorCache)app.getAttribute("tgcCache")).closeChannel();
    ((GrantorCache)app.getAttribute("pgtCache")).closeChannel();
    ((ServiceTicketCache)app.getAttribute("stCache")).closeChannel();
    ((ServiceTicketCache)app.getAttribute("ptCache")).closeChannel();
    ((LoginTicketCache)app.getAttribute("ltCache")).closeChannel();
}

ServiceTicketCache.java:

public void closeChannel()
{
    if(ticketCache instanceof DistributedHashtable)
    {
        ((DistributedHashtable)ticketCache).getChannel().close();
    }
}

Note that we use JGroups' DistributedHashtable instead of ReplicatedHashtable, because the former is synchronous while the latter is asynchronous.
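
Since DistributedHashtable is itself a java.util.Map (it extends Hashtable), both backings look the same to the rest of the cache code, which issues identical put/get calls either way. A rough sketch of the idea (the method and variable names here are invented):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.jgroups.Channel;
import org.jgroups.blocks.DistributedHashtable;

// Illustrative only: the ticket cache sees a plain Map regardless of which
// backing is chosen; with DistributedHashtable, updates are propagated to the
// other group members synchronously (the behavior noted above).
public class TicketMapSketch
{
    static Map newTicketMap(boolean clustered, Channel channel) throws Exception
    {
        if (clustered)
            return new DistributedHashtable(channel, 2000);
        return Collections.synchronizedMap(new HashMap());
    }
}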

Clustering Tomcat

We have not yet configured Tomcat to be load-balanced. I guess I'll leave that as an exercise for the reader. Note that Tomcat supports replicated session data, so any CAS enhancements that rely on the session can also be made to work in a clustered environment.

Alternatives to Clustering

If the ticket contained some sort of tag identifying which server vended it, then this clustering approach would not be necessary: the CAS client running on the target (service) would simply contact the correct CAS server directly, bypassing the load-balancing mechanism. This is not quite as robust, but it is less complex. Note that the SAML Browser/Artifact binding includes a mechanism to make this possible, since the artifact includes a 20-byte server identifier.
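
A hedged sketch of the idea (the ticket format, names, and URL here are invented for illustration; CAS tickets do not actually carry such a tag): the vending server appends its own identifier to the ticket, and the CAS client parses it back out to choose the validation URL.

// Illustrative only: a made-up ticket format "<ticketId>@<serverId>" plus helpers
// showing how a server tag could route validation around the load balancer.
public class TaggedTicketSketch
{
    // On the CAS server: tag the ticket with the identifier of the vending server.
    static String tagTicket(String ticketId, String serverId)
    {
        return ticketId + "@" + serverId;
    }

    // On the CAS client: recover the vending server and validate against it directly,
    // bypassing the load balancer's virtual address.
    static String validationUrlFor(String taggedTicket)
    {
        String serverId = taggedTicket.substring(taggedTicket.lastIndexOf('@') + 1);
        return "https://" + serverId + "/cas/serviceValidate?ticket=" + taggedTicket;
    }
}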