Document Cacher service

The problems

Unavailable or slow backing services

At the September 2005 developers' meeting we discussed the issue of channel failure when a remote resource is unavailable or extremely slow. Channels often depend for their content on resources retrieved over the network via HTTP. Examples include all RSS channels, most web proxy channels and most generic XSLT channels. When a resource is slow, broken or unavailable, the only strategy available is to time out the channel. This makes portal responsiveness dependent on the behavior of external services.

Re-use of cached resources across channels

uPortal provides a rich caching API for channels: channels can generate scoped channel cache keys, and the framework (really, the dynamic channel wrappers) will cache rendered content, even across users where applicable, for the same published channel in the same state. This can be a great efficiency for RSS reader channels, news channels, and any other shared content in the portal.

However, sometimes channels with different state, which render differently, nonetheless rely upon the same backing resources. They provide different views on the resource. They cannot utilize uPortal ICacheable channel output caching since their output is different. By relying upon a Document Caching service, however, they can use the same cached copy of backing resources. If you give users the ability to configure their RSS reader channel, they will configure it differently with regard to how many items to display, whether to include article summaries, etc. Nonetheless your portal can realize the efficiency of retrieving SlashDot's RSS content in only one place, sharing the latest content across the differently configured channel instances.

This isn't just for RSS. Some of the menu channels in YaleInfo are different views on the same backing XML.
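As an illustration, a hypothetical document cacher interface (the names here are assumptions for this discussion, not existing uPortal API) would let differently configured channel instances share one cached copy of a backing resource while still rendering different output:

```java
import org.w3c.dom.Document;

// Hypothetical service interface -- names are illustrative only.
public interface DocumentCacher {

    /** Return the most recently cached DOM for the URL, fetching it first if necessary. */
    Document getDocument(String url, long refreshIntervalMillis);
}

// Two RSS channel instances configured differently (say, 5 items with summaries
// versus 10 items without) can both call the service with the same URL and so
// share one cached copy of the feed:
//
//   Document feed = documentCacher.getDocument("http://slashdot.org/slashdot.rss", 15 * 60 * 1000L);
//   // ...each channel then applies its own XSLT / item limit to the shared DOM.
```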

Yale's Solution

To reduce this dependency for YaleInfo we wrote a document caching service. You can take a look at it in our wiki at http://tp.its.yale.edu/confluence/display/YIP/Document%2BCacher?showAttachments=true#attachments. We also modified CGenericXSLT to accept and pass caching parameters and to retrieve resources through the caching service if requested. When a channel first calls it, the caching service starts a background thread. The thread uses Commons HttpClient, which has a timeout value for waiting on a response to an HTTP request. It maintains a queue of resources to retrieve and fetches them in the background at the interval specified by the channel. The first request for a resource will attempt the retrieval on the requesting thread but will also put the document parameters onto the background queue. Subsequent requests will be fulfilled immediately from the most recent version in the cache. So channels never wait for external resources after the first try. We give the channels very long timeouts in hopes of getting the document cached on the first attempt. (Check to see whether we disabled the killing of threads.)
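A minimal sketch of that pattern, assuming a map-based cache and a java.util.concurrent scheduler in place of Yale's hand-rolled queue, background thread, and Commons HttpClient retrieval (everything named here is illustrative, not the actual YaleInfo code):

```java
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;

public class BackgroundRefreshingDocumentCache {

    private final ConcurrentHashMap<String, Document> cache =
            new ConcurrentHashMap<String, Document>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /**
     * The first call for a URL fetches on the calling thread (with a generous timeout)
     * and schedules periodic background refreshes; later calls return whatever version
     * is currently cached and never block on the remote resource.
     * (A real implementation would also guard against scheduling the same URL twice.)
     */
    public Document getDocument(final String url, long refreshIntervalMillis) throws Exception {
        Document cached = cache.get(url);
        if (cached != null) {
            return cached;
        }
        // First request: fetch on the requesting thread, then keep the entry fresh in the background.
        Document fetched = fetch(url);
        cache.put(url, fetched);
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                try {
                    cache.put(url, fetch(url));
                } catch (Exception e) {
                    // Keep serving the stale copy if a background refresh fails.
                }
            }
        }, refreshIntervalMillis, refreshIntervalMillis, TimeUnit.MILLISECONDS);
        return fetched;
    }

    private Document fetch(String url) throws Exception {
        URLConnection connection = new URL(url).openConnection();
        connection.setConnectTimeout(30000);   // generous timeouts, hoping to cache on the first attempt
        connection.setReadTimeout(30000);
        InputStream in = connection.getInputStream();
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } finally {
            in.close();
        }
    }
}
```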

This service has worked very well for us - in fact I don't think our portal would behave acceptably without it. However, the implementation could be greatly improved.

Redesign for inclusion in uPortal Base

This page is intended to start a discussion of how to reimplement this function for the base portal code so it can be shared by all. One past redesign was based on using Quartz to schedule the retrieval of resources. The details of how the channel would request the document and avoid deadlocks with jobs run by Quartz were never worked out. Andrew recently discovered an implementation of a read-through cache using ehcache. The read-through cache already includes a callout to a handler to refresh an entry when it expires. So the current plan is to use this to reimplement the function.
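A rough sketch of how ehcache's read-through support might be used here, following the SelfPopulatingCache/CacheEntryFactory form (the exact ehcache API varies by version; the "documentCache" region and the fetchDocument helper are assumptions for illustration):

```java
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Ehcache;
import net.sf.ehcache.Element;
import net.sf.ehcache.constructs.blocking.CacheEntryFactory;
import net.sf.ehcache.constructs.blocking.SelfPopulatingCache;

import org.w3c.dom.Document;

public class ReadThroughDocumentCache {

    private final SelfPopulatingCache cache;

    public ReadThroughDocumentCache(CacheManager cacheManager) {
        // Decorate a plain cache (configured in ehcache.xml) so that misses and
        // expired entries are filled by calling back into the factory below.
        Ehcache backing = cacheManager.getEhcache("documentCache");
        this.cache = new SelfPopulatingCache(backing, new CacheEntryFactory() {
            public Object createEntry(Object key) throws Exception {
                return fetchDocument((String) key);   // hypothetical HTTP retrieval and parse
            }
        });
    }

    public Document getDocument(String url) {
        Element element = cache.get(url);
        return (Document) element.getObjectValue();
    }

    private Document fetchDocument(String url) throws Exception {
        // Retrieval and parsing as in the current Yale implementation; omitted here.
        throw new UnsupportedOperationException("sketch only");
    }
}
```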

Here are some questions about the best design. Please comment!

What to cache

Q: We were caching DOM objects so they did not have to be reparsed by the channels. Is this broad enough or are there other objects that it should cache?

Andrew: I think there should be two parallel caches, one of DOM objects (thereby realizing the efficiency of caching pre-parsed content) and one of Strings or otherwise of raw character content (enabling caching of content that isn't a well-formed DOM, such as METAR data in support of weather channels and other non-XML data formats). Also, there seems to be a trend towards markup that isn't well-formed XML – ICharacterChannel opts channels out of producing necessarily well-formed XHTML, JSR-168 provides a character stream output abstraction, and the new UW-Wisconsin WebProxy JSR-168 portlet allows proxying of webpages without running them through Tidy to make them well-formed XHTML.
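A sketch of what those two parallel caches might look like behind one service facade (names are illustrative, not an agreed API):

```java
import org.w3c.dom.Document;

// Hypothetical facade over two parallel caches: one of pre-parsed DOMs,
// one of raw character content.
public interface DocumentCacheService {

    /** Cached, pre-parsed DOM for well-formed XML resources (RSS, backing menu XML, ...). */
    Document getDocument(String url);

    /** Cached raw character content for everything else (METAR data, non-XHTML markup, ...). */
    String getCharacterContent(String url);
}
```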

Exposing the caching service via a URL

Q: Should the cached resource be available via a URL (no change required to existing channels)? Or from a call to a service method? If the cached resources are in turn available via HTTP then the service is essentially a proxy and we have all the problems of controlling access to the proxy. And we can't return a DOM over HTTP.

Andrew: I don't think the only or main way for clients of the caching service to access it should be via a servlet mapped to answer over HTTP. Such a servlet would be a natural client of a caching service – and might be especially appropriate in the case where the service is configured to handle only URLs specified at service initialization.

Accessing the service over HTTP pays re-parsing costs in the case of XML and consumes servlet container threads and resources in all cases. Accessing the service via an injected service bean (more ideal) or via a static cover (perhaps less ideal) allows clients within the portal to use the service without exposing it outside the portal.
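For instance, a channel or other portal component might reach the service in either style; both of these are sketches, assuming the hypothetical DocumentCacheService interface above and a Spring-managed bean named "documentCacheService":

```java
import org.springframework.context.ApplicationContext;

// Injected-bean style: the component holds a reference supplied by the Spring context.
public class FeedRenderingComponent {

    private DocumentCacheService documentCacheService;

    public void setDocumentCacheService(DocumentCacheService documentCacheService) {
        this.documentCacheService = documentCacheService;
    }
    // rendering code then calls documentCacheService.getDocument(feedUrl)
}

// Static-cover style: a lookup helper for code that is not itself Spring-managed.
// How the ApplicationContext is obtained is portal-specific; this is a placeholder.
final class DocumentCacheLocator {

    private static ApplicationContext context;   // assumed to be set at portal startup

    private DocumentCacheLocator() {}

    public static void setApplicationContext(ApplicationContext ctx) {
        context = ctx;
    }

    public static DocumentCacheService getService() {
        return (DocumentCacheService) context.getBean("documentCacheService");
    }
}
```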

How configured

Q: It seems cleaner to have the service start up at portal init time and start retrieving resources in the background. At least it seems overly complicated to retrieve sometimes on the channel rendering thread and other times on the background thread. It would be nice to avoid the issues raised when a rendering channel thread is killed (apart from whether killing such threads is a good idea at all). So where does the list of resources to cache come from? Should it be built by examining channel parameters in advance? Should we cache every RSS or blog that a portal user requests? Should there be a configuration file listing the resources? It would be best if the resources were not entered twice - once as channel parameters and again in a config file.

Andrew: I think there should be an interface for a plugin that provides the initial cache entries. I'd expect that the most commonly desired implementation of this plugin would read a configuration file to find the resources the service should cache. Having the plugin would also make it possible to use an implementation that mines the channel parameters for URLs.

I think the cache behavior in the case of a cache miss (when a client requests a URL that isn't currently cached) should be configurable. It would be possible to cache dynamically configured URLs (user-specified RSSes) but it would also be possible not to cache these.
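One way the initial-entries plugin and the configurable miss behavior might be expressed (purely a sketch; none of these names exist in uPortal today):

```java
import java.util.List;

// Hypothetical plugin: supplies the URLs the service should pre-cache at portal startup.
// One implementation could read a configuration file; another could mine channel
// parameters for URLs.
public interface InitialCacheEntryProvider {

    List<String> getInitialUrls();
}

// Hypothetical, separately configurable policy for URLs that were not pre-configured,
// such as user-specified RSS feeds.
interface DocumentCacheMissPolicy {

    /** Whether a URL requested at runtime but not pre-configured should be added to the cache. */
    boolean cacheOnMiss(String url);
}
```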

Making channels use the service

Q: If we change CGenericXSLT to use the service, this still leaves other custom channels to adopt the change if desired. We will need to change the CPD files to include new parameters about whether to use the caching service for a particular resource. For example, we would not want to cache resources retrieved using a privileged connection context (such as proxy CAS).

Andrew: In a way, we're re-inventing LocalConnectionContext. LocalConnectionContext takes the approach of providing an API that munges URLs. By default it does no munging. Implementations exist to slap request parameters on, such as a proxy CAS ticket.

But what if the LocalConnectionContext API had instead been to actually retrieve the URL and return its content, rather than just munge the URL? Arguably, this has some advantages in preventing the channel from ever seeing what the context did to the URL (the channel doesn't need to see a proxy ticket, or the user's proxied username and password; it just needs to use them to get at some content). This also has some advantages in terms of the context's capabilities – such a context could do things like post a signed SAML assertion, use a client-side SSL certificate, etc. to authenticate to backing sources.
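In code, that alternative might look something like the following. LocalConnectionContext itself is real uPortal API, but the retrieving variant and every name in this sketch are hypothetical:

```java
import java.io.InputStream;

// Hypothetical alternative: the context retrieves the content itself, so the channel
// never sees what was added to the URL (proxy tickets, credentials, SAML assertions, ...).
public interface ResourceRetrievalContext {

    /**
     * Retrieve the named resource on behalf of the channel, applying whatever
     * authentication, timeout, or access-control policy this context encapsulates.
     */
    InputStream getResource(String uri) throws ResourceRetrievalException;
}

// Hypothetical checked exception used above.
class ResourceRetrievalException extends Exception {

    public ResourceRetrievalException(String message, Throwable cause) {
        super(message, cause);
    }
}
```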

This enables the API to do more than just authentication. There are other pieces of context to be had. For example, in the context of YaleInfo, the standard HTTP connection timeout might be 30 seconds, whereas in the context of another portal, it might be 10 seconds. Or whatever.

We've recently seen a need for access control policies that control which URIs channels can retrieve. One way to implement this would be by means of a LocalConnectionContext that fails to retrieve URIs not meeting the access control policy.

And, of course, another piece of local context is caching implementations and policy. One school might use a local connection context to run requests through a dedicated proxy server. And another implementation might be to pass requests to the Document Cacher service we're re-inventing here.
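Continuing the sketch above, a caching implementation of that hypothetical retrieving context could simply delegate to the Document Cacher service this page proposes (again, every name here is illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

// Hypothetical implementation that satisfies requests from the document cache
// rather than hitting the remote source on the rendering thread.
public class CachingResourceRetrievalContext implements ResourceRetrievalContext {

    private final DocumentCacheService documentCacheService;

    public CachingResourceRetrievalContext(DocumentCacheService documentCacheService) {
        this.documentCacheService = documentCacheService;
    }

    public InputStream getResource(String uri) throws ResourceRetrievalException {
        try {
            // serve the most recently cached copy of the resource
            String content = documentCacheService.getCharacterContent(uri);
            return new ByteArrayInputStream(content.getBytes("UTF-8"));
        } catch (Exception e) {
            throw new ResourceRetrievalException("Unable to retrieve " + uri, e);
        }
    }
}
```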