I have blogged numerous times about session management in ColdFusion; more specifically, about Michael Dinowitz's suggestion of turning off session management for web spiders so as to cut down on the number of client variables that we end up storing. Web spiders do not accept cookies, and therefore (on my site) they create a new ColdFusion session for each page they hit. Spidering a few hundred pages in a matter of minutes can cause variable creation to skyrocket.

Previously, I was testing for the type of user based on CGI.http_user_agent. If it was a suspect user agent, I turned off session management. The code looked something like this:

  • // Define the application. To stop unnecessary memory usage, we are going to give
  • // web crawlers no session management. This way, they don't have to worry about cookie
  • // acceptance and object persistence (except for APPLICATION scope). Here, we are
  • // using short-circuit evaluation on the IF statement with the most popular search
  • // engines at the top of the list. This will help us minimize the amount of time that
  • // it takes to evaluate the list.
  • // Create a lowercase version of the user agent so we can run without NoCase checks.
  • strTempUserAgent = LCase( CGI.http_user_agent );
  •  
  • // Check user agent.
  • if (
  • (NOT Len(strTempUserAgent)) OR
  •  
  • // We are gonna try to optimize even a little bit more. A good number of the spider
  • // names end in "bot". If we check for names that have BOT ending on a word boundary,
  • // we can eliminate several of the other spider checks. The bot\b search here takes
  • // care of the spiders that are now commented out below. As you can see, it takes
  • // the place of 18 different spider Find()'s.
  • REFind( "bot\b", strTempUserAgent ) OR
  •  
  • // This will try to get any RSS feed readers.
  • REFind( "\brss", strTempUserAgent ) OR
  • Find( "slurp", strTempUserAgent ) OR
  • Find( "mediapartners-google", strTempUserAgent ) OR
  • Find( "zyborg", strTempUserAgent ) OR
  • Find( "emonitor", strTempUserAgent ) OR
  • Find( "jeeves", strTempUserAgent ) OR
  • Find( "sbider", strTempUserAgent ) OR
  • Find( "findlinks", strTempUserAgent ) OR
  • Find( "yahooseeker", strTempUserAgent ) OR
  • Find( "mmcrawler", strTempUserAgent ) OR
  • Find( "jbrowser", strTempUserAgent ) OR
  • Find( "java", strTempUserAgent ) OR
  • Find( "pmafind", strTempUserAgent ) OR
  • Find( "blogbeat", strTempUserAgent ) OR
  • Find( "converacrawler", strTempUserAgent ) OR
  • Find( "ocelli", strTempUserAgent ) OR
  • Find( "labhoo", strTempUserAgent ) OR
  • Find( "validator", strTempUserAgent ) OR
  • Find( "sproose", strTempUserAgent ) OR
  • Find( "ia_archiver", strTempUserAgent ) OR
  • Find( "larbin", strTempUserAgent ) OR
  • Find( "psycheclone", strTempUserAgent ) OR
  • Find( "arachmo", strTempUserAgent )
  •  
  • // I am no longer checking for the following as they are being
  • // checked in the regular expression at the top.
  •  
  • // Find( "turnitinbot", strTempUserAgent ) OR
  • // Find( "ziggsbot", strTempUserAgent ) OR
  • // Find( "rufusbot", strTempUserAgent ) OR
  • // Find( "researchbot", strTempUserAgent ) OR
  • // Find( "ip2mapbot", strTempUserAgent ) OR
  • // Find( "gigabot", strTempUserAgent ) OR
  • // Find( "exabot", strTempUserAgent ) OR
  • // Find( "mj12bot", strTempUserAgent ) OR
  • // Find( "outfoxbot", strTempUserAgent ) OR
  • // Find( "obot", strTempUserAgent ) OR
  • // Find( "snapbot", strTempUserAgent ) OR
  • // Find( "myfamilybot", strTempUserAgent ) OR
  • // Find( "girafabot", strTempUserAgent ) OR
  • // Find( "aipbot", strTempUserAgent ) OR
  • // Find( "googlebot", strTempUserAgent ) OR
  • // Find( "becomebot", strTempUserAgent ) OR
  • // Find( "msnbot", strTempUserAgent ) OR
  • // Find( "irlbot", strTempUserAgent ) OR
  • // Find( "baiduspider", strTempUserAgent )
  • ){
  •  
  • // This application definition is for robots that do NOT need sessions.
  • THIS.Name = "KinkySolutions v.1 {dev}";
  • THIS.SessionManagement = false;
  • THIS.SetClientCookies = false;
  • THIS.ClientManagement = false;
  • THIS.SetDomainCookies = false;
  •  
  • // Set the flag for session use.
  • REQUEST.HasSessionScope = false;
  •  
  • } else {
  •  
  • // This application is for the standard user.
  • THIS.Name = "KinkySolutions v.1 {dev}";
  • THIS.SessionManagement = true;
  • THIS.SetClientCookies = true;
  • THIS.SessionTimeout = CreateTimeSpan(0, 0, 20, 0);
  • THIS.LoginStorage = "SESSION";
  •  
  • // Set the flag for session use.
  • REQUEST.HasSessionScope = true;
  •  
  • }

This has been working great. The problem is that as I add different web crawlers, the number of pre-page-processing steps I have to do increases. Not only that, I am finding that some spiders are slipping through the user agent test; I need to start blacklisting certain IP addresses as spiders even though they present normal user agents. That means the first IF() statement can get bigger and bigger. I don't care about the web crawler experience (how long the pages take to load), but I do care about the user's experience, and right now, the user is getting a raw deal: a normal page load is the worst-case scenario for user agent checking, since no user agent gets matched and every condition has to be evaluated.
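To make the word-boundary trick concrete, here is the same classification idea sketched in Python (the function name and the shortened spider list are just for illustration, not the full production list): a single `bot\b` regex stands in for a long run of substring checks against names ending in "bot".

```python
import re

# A few of the substring checks that the regex alone cannot cover
# (illustrative subset of the full list in the CFML above).
SPIDER_SUBSTRINGS = ["slurp", "jeeves", "ia_archiver", "validator"]

def is_spider(user_agent):
    """Classify a client as a spider based on its user agent string."""
    ua = user_agent.lower()
    if not ua:
        return True  # No user agent at all gets treated as a spider.
    # "bot" on a word boundary catches googlebot, msnbot, exabot, etc.
    # in one pass; "\brss" catches most RSS feed readers.
    if re.search(r"bot\b", ua) or re.search(r"\brss", ua):
        return True
    # Fall back to plain substring checks for the stragglers.
    return any(name in ua for name in SPIDER_SUBSTRINGS)

print(is_spider("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True
print(is_spider("Mozilla/5.0 (Windows NT 10.0) Firefox/115.0"))      # False
```

The ordering mirrors the CFML: the cheap, high-hit-rate regex runs first so that most spiders are caught before any of the substring checks execute.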

To help overcome this problem, I am using cookies to cache the result of the session management test so that the expensive user agent matching does not have to be repeated for every request. I start out by CFParam'ing these cookie values at the top of Application.cfc:

  • <!---
  • Param the cookie flags inside a CFTry: if a client sends malformed
  • (non-numeric) cookie values, the typed CFParam will throw, and we
  • simply reset the flags to zero in the CFCatch.
  • --->
  • <cftry>
  • <cfparam name="COOKIE.SessionScopeTested" type="numeric" default="0" />
  • <cfparam name="COOKIE.HasSessionScope" type="numeric" default="0" />
  • <cfcatch>
  • <cfset COOKIE.SessionScopeTested = 0 />
  • <cfset COOKIE.HasSessionScope = 0 />
  • </cfcatch>
  • </cftry>

COOKIE.SessionScopeTested flags whether or not we have already made the check for session management. COOKIE.HasSessionScope flags whether or not the user's page requests get session management. Setting cookie values this way does NOT commit them to the user's browser; it only creates temporary cookie values for the duration of the page request.
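The CFParam behavior — supplying a default only when the client did not send a value — can be sketched in Python (illustrative only; the `param` helper is a stand-in for what ColdFusion does for you):

```python
def param(scope, name, default):
    """Mimic <cfparam>: create the variable only if it does not exist yet."""
    if name not in scope:
        scope[name] = default
    return scope[name]

# The COOKIE scope as one request sees it. A first-time visitor sends no
# cookies, so both flags fall back to their defaults of 0.
request_cookies = {}
param(request_cookies, "SessionScopeTested", 0)
param(request_cookies, "HasSessionScope", 0)
print(request_cookies)  # {'SessionScopeTested': 0, 'HasSessionScope': 0}

# A returning user sends the committed values; the defaults do not
# clobber them.
returning_cookies = {"SessionScopeTested": 1, "HasSessionScope": 1}
param(returning_cookies, "SessionScopeTested", 0)
print(returning_cookies["SessionScopeTested"])  # 1
```

This is why the CFParam'ed values are safe to read immediately: they are always defined, whether the browser sent them or not.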

After CFParam'ing the cookie values, I also set a flag for whether or not the session management test was done (the big IF statement):

  • <!---
  • This is a flag to see if the cookie test was performed. It will help us
  • determine if we need to commit a cookie value later on in processing.
  • --->
  • <cfset blnCookieTestPerformed = false />

The updated IF statement for session management checking now looks like:

  • // Define the application. To stop unnecessary memory usage, we are going
  • // to give web crawlers no session management. This way, they don't have
  • // to worry about cookie acceptance and object persistence (except for
  • // APPLICATION scope). Here, we are using short-circuit evaluation on the
  • // IF statement with the most popular search engines at the top of the
  • // list. This will help us minimize the amount of time that it takes to
  • // evaluate the list.
  •  
  • // Create a lowercase version of the user agent so we can run without
  • // NoCase checks.
  • strTempUserAgent = LCase( CGI.http_user_agent );
  •  
  • // Check user agent.
  • if (
  • (NOT Len(strTempUserAgent)) OR
  •  
  • // We are testing the cookie values so that we are not duplicating
  • // logic. This should provide a performance increase for anyone
  • // accepting cookies.
  • (
  • COOKIE.SessionScopeTested AND
  • (NOT COOKIE.HasSessionScope)
  • ) OR
  •  
  • // We are gonna try to optimize even a little bit more. A good number
  • // of the spider names end in "bot". If we check for names that have
  • // BOT ending on a word boundary, we can eliminate several of the other
  • // spider checks. The bot\b search here takes care of the spiders that
  • // are now commented out below. As you can see, it takes the place of
  • // 18 different spider Find()'s.
  • REFind( "bot\b", strTempUserAgent ) OR
  •  
  • // This will try to get any RSS feed readers.
  • REFind( "\brss", strTempUserAgent ) OR
  • Find( "slurp", strTempUserAgent ) OR
  • Find( "mediapartners-google", strTempUserAgent ) OR
  • Find( "zyborg", strTempUserAgent ) OR
  • Find( "emonitor", strTempUserAgent ) OR
  • Find( "jeeves", strTempUserAgent ) OR
  • Find( "sbider", strTempUserAgent ) OR
  • Find( "findlinks", strTempUserAgent ) OR
  • Find( "yahooseeker", strTempUserAgent ) OR
  • Find( "mmcrawler", strTempUserAgent ) OR
  • Find( "jbrowser", strTempUserAgent ) OR
  • Find( "java", strTempUserAgent ) OR
  • Find( "pmafind", strTempUserAgent ) OR
  • Find( "blogbeat", strTempUserAgent ) OR
  • Find( "converacrawler", strTempUserAgent ) OR
  • Find( "ocelli", strTempUserAgent ) OR
  • Find( "labhoo", strTempUserAgent ) OR
  • Find( "validator", strTempUserAgent ) OR
  • Find( "sproose", strTempUserAgent ) OR
  • Find( "ia_archiver", strTempUserAgent ) OR
  • Find( "larbin", strTempUserAgent ) OR
  • Find( "psycheclone", strTempUserAgent ) OR
  • Find( "arachmo", strTempUserAgent ) OR
  •  
  • // These IP addresses are being black listed as being spiders
  • // even though they are not advertising themselves as being
  • // spiders via the CGI.http_user_agent.
  • (NOT Compare( CGI.remote_addr, "208.66.195.5" ))
  •  
  • // I am no longer checking for the following as they are being
  • // checked in the regular expression at the top.
  •  
  • // Find( "turnitinbot", strTempUserAgent ) OR
  • // Find( "ziggsbot", strTempUserAgent ) OR
  • // Find( "rufusbot", strTempUserAgent ) OR
  • // Find( "researchbot", strTempUserAgent ) OR
  • // Find( "ip2mapbot", strTempUserAgent ) OR
  • // Find( "gigabot", strTempUserAgent ) OR
  • // Find( "exabot", strTempUserAgent ) OR
  • // Find( "mj12bot", strTempUserAgent ) OR
  • // Find( "outfoxbot", strTempUserAgent ) OR
  • // Find( "obot", strTempUserAgent ) OR
  • // Find( "snapbot", strTempUserAgent ) OR
  • // Find( "myfamilybot", strTempUserAgent ) OR
  • // Find( "girafabot", strTempUserAgent ) OR
  • // Find( "aipbot", strTempUserAgent ) OR
  • // Find( "googlebot", strTempUserAgent ) OR
  • // Find( "becomebot", strTempUserAgent ) OR
  • // Find( "msnbot", strTempUserAgent ) OR
  • // Find( "irlbot", strTempUserAgent ) OR
  • // Find( "baiduspider", strTempUserAgent )
  • ){
  •  
  • // This application definition is for robots that do NOT need sessions.
  • THIS.Name = "KinkySolutions v.1 {dev}";
  • THIS.SessionManagement = false;
  • THIS.SetClientCookies = false;
  • THIS.ClientManagement = false;
  • THIS.SetDomainCookies = false;
  •  
  • // Set the flag for session use.
  • REQUEST.HasSessionScope = false;
  •  
  • // Only set the cookie values if we have not already done this. Most
  • // of the clients who get this far are spiders and cannot accept
  • // cookies... which is why I am turning off session management.
  • // However, since they are spiders, I am not too concerned about
  • // performance, so I set the values anyway for good practice.
  • if (NOT COOKIE.SessionScopeTested){
  •  
  • // Set the client cookie for testing so that this doesn't get
  • // tested again.
  • COOKIE.SessionScopeTested = 1;
  •  
  • // Set the client cookie for session availability. This user has
  • // been determined to not need sessions.
  • COOKIE.HasSessionScope = 0;
  •  
  • // Flag the cookie test as being performed.
  • blnCookieTestPerformed = true;
  •  
  • }
  •  
  • } else {
  •  
  • // This application is for the standard user.
  • THIS.Name = "KinkySolutions v.1 {dev}";
  • THIS.SessionManagement = true;
  • THIS.SetClientCookies = true;
  • THIS.SessionTimeout = CreateTimeSpan(0, 0, 20, 0);
  • THIS.LoginStorage = "SESSION";
  •  
  • // Set the flag for session use.
  • REQUEST.HasSessionScope = true;
  •  
  • // Only set the cookie values if we have not already done this.
  • if (NOT COOKIE.SessionScopeTested){
  •  
  • // Set the client cookie for testing so that this doesn't get
  • // tested again.
  • COOKIE.SessionScopeTested = 1;
  •  
  • // Set the client cookie for session availability. This user
  • // has been determined to allow session management.
  • COOKIE.HasSessionScope = 1;
  •  
  • // Flag the cookie test as being performed.
  • blnCookieTestPerformed = true;
  •  
  • }
  •  
  • }

Things to notice in the above code:

1. As the SECOND test of the IF() statement, I am checking the user's cookie values with "COOKIE.SessionScopeTested AND (NOT COOKIE.HasSessionScope)". If the user is a regular user that accepts cookies, then these values will evaluate to (1 AND (NOT 1)), which returns FALSE; none of the spider checks will match a normal browser either, so the user ends up in the ELSE statement. For a cookie-accepting client that has already been flagged as having no session scope, the test evaluates to (1 AND (NOT 0)), which returns TRUE; and since ColdFusion uses short-circuit evaluation for conditional statements, none of the remaining user agent checks have to be evaluated.

This means that a cookie-accepting client that has been flagged as a spider never has to repeat the full user agent test. Most spiders, however, cannot accept cookies, so for them the cookie logic is uselessly performed on every page request, and they still have to execute at least part of the IF() statement. That is an experience I am willing to give them; web spiders are more patient than users. Note that a regular user, whose cookie test returns FALSE, still falls through to the remaining checks, since a false term does not short-circuit an OR chain; for regular users, the savings come from skipping the cookie-commit work on subsequent requests.
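Short-circuit evaluation is what makes the ordering of these checks matter. Here is a small Python illustration (Python's `or` behaves like ColdFusion's here; the `check` helper and its names are purely for demonstration): once one term of an OR chain is true, the remaining terms never run, but a false term buys nothing and evaluation continues.

```python
calls = []

def check(name, result):
    """Record that this test ran, then return its result."""
    calls.append(name)
    return result

# A term that returns True stops the OR chain right there...
matched = check("cookie_flag", True) or check("regex", False) or check("find", False)
print(calls)  # ['cookie_flag'] -- the later checks never ran

# ...but a False term does not: everything after it still executes.
calls.clear()
matched = check("cookie_flag", False) or check("regex", False) or check("find", False)
print(calls)  # ['cookie_flag', 'regex', 'find']
```

This is why the cookie test is placed near the top of the chain: it pays off only when it can return TRUE early.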

Up until now, we have not actually committed any COOKIE values to the browser; we have only set temporary values and referred to them. That is why, AFTER the IF() statement, we have this bit of code:

  • <!---
  • Check to see if the session management cookie values were updated (the
  • test was run). If so, we need to write the cookie value to the user's
  • browser. We could have done this above, but I didn't want to break out
  • of the CFScript block as it would not have looked nice.
  • --->
  • <cfif blnCookieTestPerformed>
  •  
  • <!--- Write the cookie value for the test. --->
  • <cfcookie
  • name="SessionScopeTested"
  • value="#COOKIE.SessionScopeTested#"
  • expires="NEVER"
  • />
  •  
  • <!--- Write the cookie value for the test outcome. --->
  • <cfcookie
  • name="HasSessionScope"
  • value="#COOKIE.HasSessionScope#"
  • expires="NEVER"
  • />
  •  
  • </cfif>

We only want to write the cookie values once, since they never expire, so we only write them if the session management test was actually performed and defined a cookie value. If you trace through the code above, you will see that the cookie values can only ever be set if they have not been set yet.

So what has this accomplished? For the web crawlers, there is no change: they have a few more lines of code in their page pre-processing, but overall nothing has changed except for an attempt to set cookie values, which will fail. For the end user, though, this should show a performance increase, as we are cutting down on the session management logic that has to be repeated.

The page does set more values, both COOKIE and temporary, but setting simple values in ColdFusion takes virtually no time.