AADL “Search Cloud”

Life’s had me buried, so I’ve neglected my blog. Be kind and cut me some slack!

At any rate, my little holiday treat this year is a catalog search cloud for the AADL catalog. For about four months now, I’ve been collecting search statistics on queries done against our catalog with the sole intention of creating this little proof-of-concept app.

Basically, it looks at the {x} most popular searches in the past {x} days and generates what we all recognize as a tag cloud of those searches, with links into the catalog for each.
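For the curious, here is a rough sketch of that idea in Python. It is not the code running on the site: the log format (a list of timestamp/term pairs), the font-size range, the lowercasing of terms, and the catalog URL are all assumptions made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta
from urllib.parse import quote_plus

# Hypothetical search URL; the real catalog link format is not shown in this post.
CATALOG_SEARCH_URL = "https://catalog.example.org/search?q="


def build_search_cloud(log, now, days=30, top_n=50, min_px=12, max_px=36):
    """Return (term, font_size_px, url) tuples for the top_n searches.

    `log` is an iterable of (datetime, search_term) pairs. Only searches
    from the last `days` days before `now` are counted.
    """
    cutoff = now - timedelta(days=days)
    # Assumption: terms are normalized by trimming and lowercasing.
    counts = Counter(
        term.strip().lower() for when, term in log if when >= cutoff
    )
    popular = counts.most_common(top_n)
    if not popular:
        return []

    high = popular[0][1]           # hits for the most popular term
    low = popular[-1][1]           # hits for the least popular term kept
    spread = max(high - low, 1)    # avoid dividing by zero when all are equal

    cloud = []
    for term, hits in sorted(popular):  # alphabetical, like most tag clouds
        size = min_px + round((hits - low) * (max_px - min_px) / spread)
        cloud.append((term, size, CATALOG_SEARCH_URL + quote_plus(term)))
    return cloud
```

Scaling the font size linearly between the least and most popular term keeps things simple; a logarithmic scale would be a reasonable swap if a handful of searches dwarf the rest.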

I say this is a “proof-of-concept” feature because in the coming months we’ll be launching into our AADL.ORG 3.3 development push, which will, hopefully, include a whole lot more of this type of stuff.

Have a safe, happy holiday. I won’t take 2 months to do another post, I promise!

PatREST to Include OCLC Audience Level Data

I’ve updated AADL’s PatREST interface to reflect an addition I’ve made to the PatREST specification (now 1.3). This addition takes advantage of OCLC’s Audience Level indicator. OCLC makes this information available via an XML web service. From their service page:

There are a variety of ways to characterize library materials. The type of reader believed to be interested in a particular item is one. Such an indicator, generally known as the audience level, is potentially useful for a variety of activities, including the development of new ways to improve information relevance for retrieval, reference services (including readers advisory) and collection development. Audience-level filters could be implemented in existing retrieval systems to assist users in finding content based on their information needs.

This is not the first OCLC service PatREST has taken advantage of. PatREST has been incorporating data from OCLC’s xISBN service for quite some time. Pulling in the data OCLC makes available makes the data PatREST returns significantly more valuable.

Because AADL’s PatREST implementation relies heavily upon III’s XML server, I’ve added OCLC’s Audience Level functionality to the PHP XMLOPAC class code, which is freely available from my files page, or you can grab it directly right here.

AADL.org upgrades to Drupal 4.7

A little over a year after launch, AADL's Drupal-powered site has been upgraded to 4.7 from 4.6. Those familiar with Drupal's release schedule and changelog will know that this is a substantial upgrade that puts us in a good position to be ready for the touted and forthcoming 5.0 release (for which there is now a code freeze).

Drupal 4.7 sports a number of great new features. I'm most excited about the new search engine which does a much better job of indexing the site and allows users to do an advanced search. Searches now actually return meaningful results. Other features include a new Ajax-enabled content creation system with nifty improvements such as re-sizable text fields, collapsible elements, a file upload system that doesn't require authors to leave their work, and live menu updates. On the development side, these new features are accessible via the new form-handling system. In other words, coders can easily incorporate these new Ajax elements in their own work. Theme developers will be happy with the ability to create an infinite number of regions--nice to achieve that highly-polished CSS look. I think a couple new block types were added as well.

Another great feature is the wiki-style revision system that allows editors to roll back their work and leave editorial log messages (a very useful feature in large, collaborative environments). Commenting benefits as well: site administrators can now manage and moderate multiple entries at once. Finally, Drupal 4.7 supports free tagging. Not something we're using at this point, but, from my point of view, it means that the engine is there for future module work. I have a feeling I'll be using those hooks for some forthcoming feature upgrades on the website itself...

The upgrade was fairly smooth. Drupal ships with an update script which ran flawlessly, but that's the easy part. A fair amount of prep work was done ahead of time to ensure that all of our custom modules were 4.7-compatible. Basically, this meant updating all of our form-handling code to handle the new system. We also segregated all of our own code and theme information from Drupal's using the multi-site capability. This means that we can easily keep track of our own work without it getting mixed up with the vanilla code base. This wasn't completely necessary, but it was worth the work because it'll make all future upgrades much easier to do. Doing things this way is also in line with my philosophy of never touching stock code unless you absolutely have to.

The long and the short of this whole upgrade is that our users will probably not notice a lot of difference, but we're now in a good position to work on AADL 3.2. And that they will notice.

For more info, check out these Drupal videocasts:

[tags] AADL, AADL.ORG, Drupal, CMS, PHP, Library, Web [/tags]

Incorporating Google Books into the Hit-list

So the folks over at Google Books think they can go ahead and incorporate our catalogs into their search, do they?

Actually, that's fine, I have no problem with that, which means... They should have no problem with me incorporating Google Books into our hit-list. Right?

Now when users search the AADL catalog, they will be given the option to peek inside the books on the hit-list--that is, if there is a record over at Google Books. Basically, the first time that record is displayed in the list, the middleware queries Google Books to see if it has that item in its database. If it does, the middleware makes note of that in a MySQL table so that the remote query doesn't need to be run again. That way, future queries save time and bandwidth.
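A minimal sketch of that check-once-then-cache pattern, in Python with SQLite standing in for the MySQL table. The table name, column names, and the `remote_lookup` callable are hypothetical, not the actual middleware. This version also remembers misses, which is an assumption on my part; the description above only mentions noting the hits.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the lookup cache. Table and column names are made up."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS gbooks_cache ("
        "isbn TEXT PRIMARY KEY, has_record INTEGER NOT NULL)"
    )
    return db


def has_google_books_record(db, isbn, remote_lookup):
    """Return True if Google Books has the ISBN, asking remotely only once.

    `remote_lookup` is any callable taking an ISBN and returning a truthy
    value when a record exists (a stand-in for the real remote query).
    """
    row = db.execute(
        "SELECT has_record FROM gbooks_cache WHERE isbn = ?", (isbn,)
    ).fetchone()
    if row is not None:
        return bool(row[0])  # answered from the cache, no remote query

    found = bool(remote_lookup(isbn))
    db.execute(
        "INSERT INTO gbooks_cache (isbn, has_record) VALUES (?, ?)",
        (isbn, int(found)),
    )
    db.commit()
    return found
```

The first time a record shows up in a hit-list the remote query runs; every later display of the same ISBN is a single local lookup.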

Looking at the Syndetics offerings next to it, this seems like a much richer and more useful resource. Enjoy!

** Update 1: 8/24/06 9:45 PM **

Ha! It looks like that was short-lived! Google apparently doesn't return the favor (thanks to Ryan for giving me the heads-up):

We're sorry...

... but your query looks similar
to automated requests from a computer virus or spyware
application. To protect our users, we can't process your request
right now.

We'll restore your access as quickly as possible, so try again soon. In the meantime, if you suspect that your computer or network has been infected,
you might want to run a virus checker or spyware remover to make sure that your systems are free of viruses and other spurious software.

We apologize for the inconvenience, and hope we'll see you again on Google.

And here I was, trying to be nice by caching the results... Guess we'll have to wait for the API.

** Update 2: 8/25/06 8:50 AM **

So, I think I found a way to fix this. Essentially, the way I was previously determining if Google Books has a record for an ISBN was by using this URL template:

http://books.google.com/books?vid=ISBN$isbn&printsec=frontcover&dq=isbn:$isbn

Now I'm using a different URL that does not return 404:

http://books.google.com/books?as_isbn=$isbn

With the old URL, if there was no record for that ISBN, Google would throw a 404. I think the fact that one IP was requesting so many 404s is what spooked Google, not the retrieval rate. Also, I noticed that I could no longer use wget on the command line to grab the data--Google would return a 403 (Forbidden). So, my thought was to ditch PHP's file_get_contents for cURL, which allows you to spoof a user agent. I took a peek at our Apache logs and chose:

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6

So, instead of looking like a "virus or spyware", the script now appears, to Google, as an extremely zealous Google Books user. We'll see how long it lasts, but it seems to be holding...
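Roughly, the request looks like the sketch below. This is a Python equivalent using the standard library, not the PHP/cURL code itself; the function names are made up, and how the script decides from the response whether a record exists is not covered here.

```python
import urllib.error
import urllib.parse
import urllib.request

# The user agent string picked from the Apache logs, as quoted above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6) "
    "Gecko/20060728 Firefox/1.5.0.6"
)


def build_isbn_request(isbn):
    """Build a request for the as_isbn URL with a browser-like User-Agent."""
    query = urllib.parse.urlencode({"as_isbn": isbn})
    url = "http://books.google.com/books?" + query
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})


def fetch_isbn_page(isbn, timeout=10):
    """Return (status_code, body_bytes); HTTP errors are reported, not raised."""
    try:
        with urllib.request.urlopen(build_isbn_request(isbn), timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # For example, the 403 (Forbidden) that wget was getting.
        return err.code, b""
```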

** Update 3: 8/25/06 11:40 AM **

No go, they've blocked us again. I'm sending an email to the kind folks at Google Books, and we'll see if they reply. Until then, I've got a few more tricks up my sleeve... In the meantime, I'll leave the cached information active...

** Update 4: 8/25/06 4:07 PM **

Google scores major points in my book! One of the managers over at Google Books just emailed me to say that he likes the idea of the hit-list links and that he is going to see if they can accommodate these types of queries.

[tags] Google, GoogleBooks, Sneaky, AADL, Library, OPAC, Catalog [/tags]

PatREST Enhancements & Documentation

A couple of notes regarding the status of PatREST.

First, two new functions have been added to the service. One provides access to tops (or popular items) lists. The other provides access to the new materials lists. I believe these are significant enough additions to the service that they merit the 1.1 version number.

The top items query is scalable by result size and can be paginated, just like the search results. In addition, it can be scoped by material type: books, cd, dvd, or bocd (books on CD). When applied to the AADL XSLT, it looks something like this:

The new items query is similar to the top query in that it can be scoped by material type, scaled by result size, and paginated. You can also search new items using their subject headings--useful for querying those new knitting books, Ed... It looks something like this:

Second, and probably more importantly, I've finally drafted a specification for PatREST which includes an explanation of its XML schema and some documentation for its various functions. It's about time, I know. It can be found in the files section, or downloaded here. (PDF)

I'm not going to go into too much detail about the two new functions in this post because the new documentation contains everything you need to know to get started with them.