Incorporating Google Books into the Hit-list

So the folks over at Google Books think they can go ahead and incorporate our catalogs into their search, do they?

Actually, that’s fine, I have no problem with that, which means… They should have no problem with me incorporating Google Books into our hit-list. Right?

Now when users search the AADL catalog, they are given the option to peek inside the books on the hit-list, provided there is a record over at Google Books. The first time a record is displayed in the list, the middleware queries Google Books to see if it has that item in its database. If it does, the middleware notes that in a MySQL table so that the remote query doesn’t need to be run again, which saves time and bandwidth on future queries.
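For the curious, the check-once-then-cache idea can be sketched roughly like this. This is only an illustration, not the actual middleware: it is written in Python rather than PHP, it uses an in-process SQLite table as a stand-in for the MySQL table, and the table name `gbooks_cache`, the function names, and the `remote_lookup` callback are all hypothetical. It also assumes that "no record" answers are cached alongside the hits.

```python
import sqlite3
from typing import Callable


def make_cache(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) cache table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gbooks_cache ("
        "isbn TEXT PRIMARY KEY, has_record INTEGER NOT NULL)"
    )


def has_google_books_record(
    conn: sqlite3.Connection,
    isbn: str,
    remote_lookup: Callable[[str], bool],
) -> bool:
    """Return whether a record exists, asking the remote side at most once per ISBN."""
    row = conn.execute(
        "SELECT has_record FROM gbooks_cache WHERE isbn = ?", (isbn,)
    ).fetchone()
    if row is not None:
        # Cache hit: no remote query needed.
        return bool(row[0])

    # Cache miss: run the remote query once and remember the answer.
    found = bool(remote_lookup(isbn))
    conn.execute(
        "INSERT OR REPLACE INTO gbooks_cache (isbn, has_record) VALUES (?, ?)",
        (isbn, int(found)),
    )
    conn.commit()
    return found
```

Passing the remote check in as a callback keeps the caching logic separate from however the lookup itself is done, so the lookup can change without touching the cache.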

Compared with the Syndetics offerings next to it, this seems like a much richer and more useful resource. Enjoy!

** Update 1: 8/24/06 9:45 PM **

Ha! It looks like that was short-lived! (Thanks to Ryan for giving me the heads-up.) Google apparently doesn’t return the favor:

We’re sorry…

… but your query looks similar to automated requests from a computer virus or spyware application. To protect our users, we can’t process your request right now.

We’ll restore your access as quickly as possible, so try again soon. In the meantime, if you suspect that your computer or network has been infected, you might want to run a virus checker or spyware remover to make sure that your systems are free of viruses and other spurious software.

We apologize for the inconvenience, and hope we’ll see you again on Google.

And here I was, trying to be nice by caching the results… Guess we’ll have to wait for the API.

** Update 2: 8/25/06 8:50 AM **

So, I think I found a way to fix this. Previously, I determined whether Google Books has a record for an ISBN by using this URL template: $isbn&printsec=frontcover&dq=isbn:$isbn

If there was no record for that ISBN, Google would throw a 404. I think the fact that one IP was requesting so many 404s is what spooked Google, not the retrieval rate. Now I’m using a different URL that does not return a 404: $isbn

Also, I noticed that I could no longer use wget on the command line to grab the data; Google would return a 403 (Forbidden). So my thought was to ditch PHP’s file_get_contents for cURL, which allows you to spoof a user agent. I took a peek at our Apache logs and chose:

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20060728 Firefox/

So, instead of looking like a “virus or spyware”, the script now appears, to Google, as an extremely zealous Google Books user. We’ll see how long it lasts, but it seems to be holding…
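Here is a rough sketch of the kind of request described above: an HTTP lookup that sends a browser-style User-Agent header and reads the status code. It is not the actual script (that one is PHP with cURL); this is Python's standard library, and the URL template, the user-agent string, and the function names are placeholders invented for the example. It follows the first approach, where a 404 means "no record", and lets anything else (such as a 403) surface as an error.

```python
import urllib.error
import urllib.request

# Placeholder values for illustration only; not the real endpoint or browser string.
LOOKUP_URL_TEMPLATE = "https://books.example.org/lookup?isbn={isbn}"
USER_AGENT = "Mozilla/5.0 (placeholder browser string)"


def build_request(isbn: str) -> urllib.request.Request:
    """Build a GET request that carries a custom User-Agent header."""
    url = LOOKUP_URL_TEMPLATE.format(isbn=isbn)
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def record_exists(isbn: str, opener=urllib.request.urlopen) -> bool:
    """Return True on a 200 response, False on a 404; re-raise other HTTP errors."""
    try:
        with opener(build_request(isbn)) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # Under the first URL scheme, a 404 meant "no record for this ISBN".
            return False
        # A 403 (blocked) or anything else is a real problem, so let it propagate.
        raise
```

The `opener` argument exists only so the function can be exercised without touching the network.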

** Update 3: 8/25/06 11:40 AM **

No go, they’ve blocked us again. I’m sending an email to the kind folks at Google Books, and we’ll see if they reply. Until then, I’ve got a few more tricks up my sleeve… In the meantime, I’ll leave the cached information active…

** Update 4: 8/25/06 4:07 PM **

Google scores major points in my book! One of the managers over at Google Books just emailed me to say that he likes the idea of the hit-list links and that he is going to see if they can accommodate these types of queries.
