05 Web Proxy

Purpose of Web Proxy Channel

CWebProxy allows incorporation of web-based services as channels, regardless of what technology is used to implement them. It provides mechanisms for connecting to and rendering HTML and XML services. Pages are refreshed and kept in-channel when they change. HTTP standards are followed, allowing communication between the browser and dynamic back-end applications. Mechanisms are provided for passing user-specific information to the back-end application, as well as ways to support local interface technologies on a per-channel basis, such as encryption, shared secrets, single sign-on, and modification of HTTP request headers.

How the Web Proxy Channel Works

This section describes the functionality of CWebProxy in general terms. Specifics of configuration and use are covered in the sections following.

Web applications are written to interact directly with users through their browsers. When a portal incorporates such an application, it must intercept this communication to tailor it for the portal environment. This is done by rewriting the application's output, in particular by rewriting URLs so that, where appropriate, they go through the portal rather than directly to the back-end application or elsewhere. Other mechanisms allow sharing of information between the portal and the application, to aid application functionality. Caching is available to improve performance.

Rewriting Application Output

The key mechanism is "pass-through". It is the means for passing request parameters through the portal to the application. There are currently four levels of pass-through supported:

  • all: all links stay in the channel.
  • application: references to the application stay in the channel. Other links go outside the portal framework; that is, they are normal links that replace the portal in the browser window.
  • marked: the application indicates which links should stay in channel by adding a tag to those URLs.
  • none: all links will leave the portal framework.

Note that it is possible to change the pass-through type at any point, so if a link is followed that would best be served by another pass-through type, it is a trivial matter to change it at that time.
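
The four pass-through levels can be read as a decision about each link. The following is a minimal sketch of that decision, not CWebProxy's actual code. It assumes that "references to the application" means the link starts with the application's base URL, and that the "marked" tag is the cw_inChannelLink parameter; the function name stays_in_channel and the example hosts are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def stays_in_channel(link: str, app_base: str, mode: str) -> bool:
    """Decide whether a link is kept in the channel for a pass-through mode."""
    if mode == "all":
        return True
    if mode == "none":
        return False
    if mode == "application":
        # Assumption: a link refers to the application if it starts with
        # the application's base URL.
        return link.startswith(app_base)
    if mode == "marked":
        # Assumption: the application marks a link by adding the
        # cw_inChannelLink parameter to its query string.
        query = urlsplit(link).query
        return "cw_inChannelLink" in parse_qs(query, keep_blank_values=True)
    raise ValueError(f"unknown pass-through mode: {mode!r}")
```

Under these assumptions, a link to another host stays in the channel only in "all" mode, or in "marked" mode when it carries the marker.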

The output from applications is rewritten in four stages. HTML, XHTML, and WML are supported in the code as distributed.

  1. JTidy: If the cw_tidy attribute is on, the application's output is run through JTidy to convert it from HTML to well-formed XHTML.
  2. AbsoluteURLFilter: This converts relative URLs to absolute ones.
  3. CWebProxyURLFilter: Rewrites URLs according to cw_passThrough or cw_download.
  4. XSLT: The XML is passed through a stylesheet according to the channel parameters and the media type. Static and runtime data parameters are passed to the stylesheet. The distributed stylesheets do not use these parameters, but they may prove useful to custom-written stylesheets, particularly those with no URL Filters.
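
Stages 2 and 3 can be illustrated on a single link. This is only a sketch under stated assumptions: the portal URL in PORTAL_BASE is hypothetical, and the idea that an in-channel link carries its target as a cw_xml request parameter is an assumption for illustration, not a description of what CWebProxyURLFilter actually emits.

```python
from urllib.parse import urlencode, urljoin

# Hypothetical portal address used only for this illustration.
PORTAL_BASE = "http://portal.example.edu/uPortal/render"


def absolutize(page_url: str, href: str) -> str:
    """Stage 2: resolve a relative link against the page it appeared on."""
    return urljoin(page_url, href)


def through_portal(absolute_url: str) -> str:
    """Stage 3 (assumed form): point the link at the portal and carry the
    real target along as a cw_xml request parameter."""
    return PORTAL_BASE + "?" + urlencode({"cw_xml": absolute_url})


target = absolutize("http://app.example.edu/a/index.html", "b.html")
rewritten = through_portal(target)
```

Here target is "http://app.example.edu/a/b.html", and rewritten is the portal URL with that target percent-encoded in its query string.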

CWebProxy will use the same method (GET or POST) to call the application as was used to call the portal. Since the portal is intercepting HTTP requests aimed at the application, this will result in the correct method being used, according to what the application expects.

CWebProxy filters out any attribute/value pairs in the query string that are portal-specific, and hands on any others that were in the URL to the application. Note that query strings may also contain keywords that do not use the attribute/value format; these are also passed on to the application. You should not mix this type of query string with mechanisms that generate attribute/value pairs, but if CWebProxy sees this happening, it will pass the keywords by adding a "keywords" attribute with the appropriate value.
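
A minimal sketch of this filtering follows. It assumes that portal-specific parameters can be recognized by a name prefix (the prefixes listed in PORTAL_PREFIXES are a guess for illustration) and that a keyword is any "&"-separated part without an "=" sign; the function name forward_query is hypothetical.

```python
# Assumption: portal-specific parameters are recognizable by these prefixes.
PORTAL_PREFIXES = ("cw_", "upc_", "uP_")


def forward_query(query: str) -> str:
    """Build the query string that would be handed on to the application."""
    pairs = []     # attribute=value parts that are not portal-specific
    keywords = []  # parts with no "=", i.e. keyword-style queries
    for part in query.split("&"):
        if not part:
            continue
        if "=" in part:
            name = part.split("=", 1)[0]
            if not name.startswith(PORTAL_PREFIXES):
                pairs.append(part)
        else:
            keywords.append(part)
    if keywords and pairs:
        # Mixed case: carry the keywords in a "keywords" attribute.
        pairs.append("keywords=" + "+".join(keywords))
        return "&".join(pairs)
    if keywords:
        return "&".join(keywords)
    return "&".join(pairs)
```

With these assumptions, "id=7&cw_tidy=on&page=2" is forwarded as "id=7&page=2", a pure keyword query such as "apples+pears" is forwarded unchanged, and the mixed query "apples+pears&id=7" becomes "id=7&keywords=apples+pears".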

In the case where CWebProxy needs the channel to provide refreshed output, it will call the application with no parameters, as is usual for HTTP.

In some cases, you don't want a link to go through the normal mechanisms, but instead wish it to be handled as a download for an object with its own MIME type. CWebProxy includes a way to indicate this.

Information Sharing

Applications can benefit by getting information from the portal. CWebProxy supports the standard HTTP Cookie mechanism for keeping session and state information between page requests. Additionally, it can pass information on the user from the uPortal IPerson object. There is also support for customizing the communications to fit local policy and technical infrastructure.

Session Support

Support is provided for cookies as specified in the original Netscape specification, as well as RFC 2109 and RFC 2965. Only the Cookie, Cookie2, Set-Cookie, and Set-Cookie2 headers are currently processed.

Cookies are not maintained between portal logins. Once you log out of the portal, your cookies are discarded.
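
The lifetime rule above can be sketched as a cookie store keyed by portal session. This is an illustration only: the class ChannelCookieStore and its methods are hypothetical, and the sketch ignores cookie path, domain, and expiry, which a real implementation of the cookie specifications would have to honour.

```python
from http.cookies import SimpleCookie


class ChannelCookieStore:
    """Keeps application cookies per portal session; dropped at logout."""

    def __init__(self):
        self._jars = {}  # portal session id -> {cookie name: value}

    def remember(self, session_id: str, set_cookie_header: str) -> None:
        """Record the cookies from one Set-Cookie header value."""
        parsed = SimpleCookie()
        parsed.load(set_cookie_header)
        jar = self._jars.setdefault(session_id, {})
        for name, morsel in parsed.items():
            jar[name] = morsel.value

    def cookie_header(self, session_id: str) -> str:
        """Build the Cookie header value for the next application request."""
        jar = self._jars.get(session_id, {})
        return "; ".join(f"{name}={value}" for name, value in jar.items())

    def logout(self, session_id: str) -> None:
        """Discard every cookie held for this portal session."""
        self._jars.pop(session_id, None)
```

After remember("s1", "JSESSIONID=abc123; Path=/app"), the store returns "JSESSIONID=abc123" for session s1 until logout("s1") is called, after which it returns an empty string.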

Applications maintaining sessions via URL rewriting in HTTP query strings should also work. Other forms of URL rewriting to maintain state probably will not work. Most applications use cookies by preference if available, which they are.

IPerson Attribute Passing

uPortal maintains certain user information (called the IPerson object), possibly aggregated from several sources, for each active user session. CWebProxy allows the application to request this information. At publish time, you may configure which information is allowed to be passed to applications, and you may also choose to always pass particular attributes by default. These are sent as additional HTTP request parameters. A uPortal configuration property sets the defaults for channels that do not specify publish-time values.

Local Connection Context

Connecting to, and communicating with, certain applications may require custom modifications to the communications. For example, you might need to add special encryption, shared secret passing, hooks into authentication mechanisms, addition or filtering of headers or cookies, etc. CWebProxy uses a mechanism called Local Connection Context to simplify and modularize this process on a per-channel basis.

Caching

Channel output may be cached to improve efficiency. The three aspects of caching are:

  • Mode: To cache or not to cache. May be none (normally don't cache), or all (cache by default).
  • Scope: Affects which instances of this channel share cached data. At this time only instance caching is available.
  • Timeout: How long cached output is valid in "all" mode.

For each of these, defaults are set in uPortal properties which may be overridden for each channel at publish time. Applications may later change the defaults for a channel instance, or override them for a single page request. Cache scope and mode can only be made more restrictive, not less.

According to the HTML specification, "If the processing of a form is idempotent (i.e. it has no lasting observable effect on the state of the world), then the form method should be GET. Many database searches have no visible side-effects and make ideal applications of query forms ... If the service associated with the processing of a form has side effects (for example, modification of a database or subscription to a service), the method should be POST." For this reason, POST requests are not cached.
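
The interaction of mode, timeout, and request method can be sketched as follows. This is a simplified illustration, not the channel's implementation: the class ChannelCache is hypothetical, entries are keyed by URL only, and the clock is injectable purely so the expiry behaviour can be demonstrated.

```python
import time


class ChannelCache:
    """Caches channel output by URL in "all" mode; POST is never cached."""

    def __init__(self, mode="none", timeout_seconds=300, clock=time.monotonic):
        self.mode = mode                    # "none" or "all"
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._entries = {}                  # url -> (stored_at, output)

    def cacheable(self, method: str) -> bool:
        return self.mode == "all" and method.upper() != "POST"

    def put(self, method: str, url: str, output: str) -> None:
        if self.cacheable(method):
            self._entries[url] = (self._clock(), output)

    def get(self, method: str, url: str):
        """Return cached output, or None if absent, expired, or uncacheable."""
        if not self.cacheable(method):
            return None
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, output = entry
        if self._clock() - stored_at >= self.timeout_seconds:
            del self._entries[url]
            return None
        return output
```

In this sketch a GET response stored in "all" mode is returned until the timeout elapses, while output produced by a POST is never stored at all.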

Configuration

Static Data

Except as noted, parameters are identical for both static and runtime data. The channel state variables are initially set according to static data, or defaults. Runtime data modifies the equivalent channel state variables. All parameters are then passed through to the stylesheets based on the current state. The parameters are:

  • cw_xml: a URI representing the source XML or HTML document, that is, the back-end application.
  • cw_ssl: a URI representing the corresponding .ssl (stylesheet list) file.
  • cw_xslTitle: a title representing the stylesheet (optional). If no title parameter is specified, a default stylesheet will be chosen according to the media.
  • cw_xsl: a URI representing the stylesheet to use. If cw_xsl is supplied, cw_ssl and cw_xslTitle will be ignored.
  • cw_info: a URI to be called for the info event.
  • cw_help: a URI to be called for the help event.
  • cw_edit: a URI to be called for the edit event.
  • cw_tidy: if set to on, filter the source document through JTidy, converting HTML to XHTML.
  • cw_passThrough: indicates that runtime data not specific to the portal is to be passed through. If passThrough is supplied, and not set to "none", additional runtime data parameters and values will be passed as request parameters to the cw_xml.

    cw_passThrough values

    • none: (default). Don't do anything.
    • marked: If runtime data includes cw_inChannelLink, pass through other runtime data as request parameters.
    • application: Only URLs referring to the application are kept in-channel.
    • all: All link URLs are rewritten to go through the channel.
  • cw_cacheDefaultTimeout - Default timeout in seconds.
  • cw_cacheDefaultMode - Default caching mode. May be none (normally don't cache), or all (cache everything).
  • cw_cacheTimeout - override default for this request only. Primarily intended as a runtime parameter, but can be used statically to override the first instance.
  • cw_cacheMode - override default for this request only. Primarily intended as a runtime parameter, but can be used statically to override the first instance.
  • cw_person - IPerson attributes to pass. A comma-separated list of IPerson attributes to pass to the back end application. The static data value will be passed on all requests not overridden by a runtime data cw_person.
  • cw_personAllow - Restrict IPerson attribute passing to this list. A comma-separated list of IPerson attributes that may be passed via cw_person. An empty or non-existent value means use the default value from the corresponding property. The special value "*" means all attributes are allowed. The value "!" means none are allowed. Static data only.
  • upc_localConnContext - LocalConnectionContext implementation class. The name of a class to use when data sent to the backend application needs to be modified or added to suit local needs. Static data only.
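
The relationship between cw_person and cw_personAllow can be sketched as a simple filter. This is an illustration under assumptions: it takes "*" as the marker for "all attributes allowed" (as described for the person_allow property), treats an empty effective allow list as "nothing allowed", and uses hypothetical attribute names; the function names are not part of CWebProxy.

```python
def split_list(value: str) -> list:
    """Split a comma-separated parameter value into trimmed names."""
    return [item.strip() for item in value.split(",") if item.strip()]


def person_attributes_to_pass(cw_person: str, cw_person_allow: str,
                              property_default: str = "") -> list:
    """Return the requested IPerson attribute names that may be passed."""
    # An empty cw_personAllow falls back to the portal-wide property.
    allow = cw_person_allow.strip() or property_default.strip()
    requested = split_list(cw_person)
    if allow == "*":           # assumed marker: no restrictions
        return requested
    if allow in ("", "!"):     # nothing allowed
        return []
    permitted = set(split_list(allow))
    return [name for name in requested if name in permitted]
```

For example, requesting "mail,sn" against an allow list of "mail,givenName" passes only mail in this sketch.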

Properties

CWebProxy has a few properties which act as portal-wide defaults for equivalent static data (channel publish-time parameters). These are set in the properties/portal.properties configuration file.

  • cache_default_timeout: Default value for cw_cacheDefaultTimeout.
  • cache_default_mode: Default value for cw_cacheDefaultMode. If you're not sure what to use, disable caching by default with the value none.
  • person_allow: Default value for cw_personAllow. An empty value means everything is restricted, disabling this mechanism. A special value of * means no restrictions.

Controlling CWebProxy Channels

Once it is running, you can control the behaviour of a CWebProxy channel instance in two ways. First, an HTTP application may pass instructions to the channel via portal runtime data, which means passing special attributes in the request query string. Second, the channel reacts to certain portal events. These are generally triggered by user actions, such as clicking on a channel button.

Runtime Data

Most static data parameters can be changed via the equivalent runtime data parameter, and have the same semantics. The exceptions are cw_personAllow and upc_localConnContext. The following are runtime-only parameters:

  • cw_reset: An instruction to reset internal variables. The value "return" resets cw_xml to the value it had before it was changed by button events.
  • cw_download: Use the download worker for this link or form. Any link or form that contains this parameter will be handled by the download worker, if the pass-through mode is set to rewrite the link or form. This allows downloads from the proxied site to be delivered via the portal, which is primarily useful if the download requires verification of a session referenced by a proxied cookie.

Portal Events

CWebProxy supports the button events for help, about, and edit. A channel instance can specify URIs for any of these via static or runtime data. A button event will then redirect the channel to the appropriate URI. Note that these URIs are subject to the same filtering and stylesheets as the normal URI. The event URI should return control to the original application via the runtime attribute cw_reset=return.
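
A help or edit page therefore needs a link that carries cw_reset=return. The following sketch shows one way to append runtime attributes to an existing link; the helper name with_runtime_params and the example URL are hypothetical, and the exact link form a given help page should use is an assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_runtime_params(url: str, **params: str) -> str:
    """Append runtime attributes to a URL, keeping its existing query."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.extend(params.items())
    return urlunsplit(parts._replace(query=urlencode(query)))


back_link = with_runtime_params(
    "http://app.example.edu/help.html?topic=intro", cw_reset="return"
)
```

Here back_link is "http://app.example.edu/help.html?topic=intro&cw_reset=return".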

Issues and Limitations

  • JTidy must be recompiled to work with some servlet containers. See Software Dependencies.
  • HTTP-style caching is not yet implemented.
  • HTML and XHTML <body> background colours and images are not reflected in the output.
  • Any <link> elements from the <head> of an HTML document, including those that reference CSS stylesheets, are not reflected in the output. This would require access to the <head> element generated by the portal.
  • URLs that use frames cannot be incorporated as channels.
  • Suppression of JTidy diagnostic output has not been tested on non-UNIX platforms.
  • UTF-8 is the default character encoding for JTidy. If the character encoding is specified in the HTTP headers, CWebProxy will set it accordingly. Note that JTidy does not yet recognize the use of the HTML meta element for specifying the character encoding.
  • The cw_reset=reset runtime command is not implemented yet.
  • If cw_xml is changed before cw_reset=return is called, you are returned to the last cw_xml used, not necessarily the one that was in use before a button event.
  • The source HTML cannot specify a namespace if cw_tidy is on.
  • Caution must be used with URLs leaving out a required trailing slash ("/").
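
The trailing-slash caution in the list above is most likely a consequence of standard relative URL resolution, which treats the last path segment as a file name unless the path ends in a slash. The sketch below demonstrates that general behaviour with Python's standard library; it is not CWebProxy code, and the host and paths are hypothetical.

```python
from urllib.parse import urljoin

# Without the trailing slash, "reports" is treated as a file and replaced.
without_slash = urljoin("http://app.example.edu/reports", "summary.html")

# With the trailing slash, "reports/" is a directory and is preserved.
with_slash = urljoin("http://app.example.edu/reports/", "summary.html")
```

The first call yields "http://app.example.edu/summary.html" and the second yields "http://app.example.edu/reports/summary.html", so relative links on a page published without its trailing slash can resolve to the wrong location.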

Scripts

Limited support is provided for included scripts, but they may not work exactly as they would when viewed directly. Note in particular that if your script generates URLs, they will probably need to be absolute URLs, not relative, to work through a portal.

XHTML 1.0 indicates that an XHTML document must be valid XML, so if embedded JavaScript code contains <, &, > or --, it must be wrapped in a CDATA section. However, CDATA sections are recognized by XML processors but not by browsers. According to XHTML 1.0, external scripts should be used if your script uses those character sequences. CWebProxy supports embedded JavaScript containing these characters only if it is wrapped in a CDATA section and sent through JTidy. Note that the output will not be valid XML, and thus not valid XHTML.

Software Dependencies

CWebProxy uses the JTidy package to convert HTML to XHTML.

Under some servlet containers, JTidy r6 has a compatibility problem with the classloader. This is an issue for some versions of Tomcat, for example. You will know you've got this problem if CWebProxy isn't working and you see a line in your logfile that says:

Registered an uncaughted exception java.lang.IllegalAccessError: try to access field org.w3c.tidy.ParserImpl._parseHead from class org.w3c.tidy.ParserImpl$ParseHTML

The JTidy developers have noted the problem, and hopefully will fix it in the next release. In the meantime, it can be worked around by recompiling for your platform. To do this, get the source distribution, unzip it, build it with "ant jar", and replace your old tidy.jar with build/Tidy.jar. I've put a Tidy.jar rebuilt on Linux here, if you'd prefer to try that first.

Examples

The unmodified Tomcat numguess.jsp and servlet examples are set up as examples in uPortal as distributed. They can be seen in the CWebProxy Examples tab of the demo user, and are available for subscription via Preferences. See the webpages/media/org/jasig/portal/channels/webproxy/examples directory for the info and help files.

Further examples can be found in the tutorial on the CWebProxy home page.

For more discussion of CWebProxy, see Andrew Petro's blog post about this channel.