John Wilkin’s blog » Our hidden digital libraries

Two of my very talented colleagues, Kat Hagedorn and Josh Santelli, just published a nice piece in D-Lib entitled "Google Still Not Indexing Hidden Web URLs." Kat and Josh and I have discussed this problem off and on, stimulated in part by our frustrations in getting the OAI data collected by OAIster into search services like Google.

In preparation for a recent talk in China (the challenges and opportunities for digital libraries), I talked to Kat and Josh about the extent of OAIster data not findable through standard web searches. That so much of our digital library content is not findable through standard search engines has always been a troublesome issue, and I would have expected that with the passage of time, this particular problem would have been solved. It hasn't, and that has made me wonder about what we do in digital libraries and how we do it.

Kat's and Josh's numbers are compelling. OAIster focuses on the hidden web--resources not typically stored as files in a crawlable web directory--and so OAIster, with its 16 million records, is a particularly good resource for finding digital library resources. Kat and Josh conclude that more than 55% of the content in OAIster can't be found in Google.


As much as I like Kat's and Josh's analysis, I draw a different conclusion from the data. They write that, "[g]iven the resulting numbers from our investigation, it seems that Google needs to do much more to gather hidden resources." This perspective is one many of us share. We're inclined to point a finger at Google (or other search engines) and wish they tried harder to look into our arcane systems. We believe that if only Google and others had a deeper appreciation of our content or tried harder, this problem would go away. I've been fortunate enough to be able to try to advance this argument one-on-one with the heads of Google and Google Scholar, and their responses are similar--too much trouble for the value of the content. As time has passed, I've come to agree.


Complexity in digital library resources is at the heart of our work, and is frankly one reason why many of us find the work so interesting. Anyone who thinks that the best way to store the 400,000 pages (140+ million words) of the texts in the Patrologia Latina is as a bunch of static web pages knows nothing of the uses or users of that resource or what's involved in managing it. Similarly, to effectively manage the tens of thousands of records for a run-of-the-mill image collection, you can't store them as individual HTML pages lacking well-defined fields and relationships. These things are obvious to people in our profession.

We often go wrong, however, when we try to share our love of complexity with the consumers. We've come to understand that success in building our systems involves making complicated uses possible without at the same time requiring the user to have a complicated understanding of the resource. What we must also learn is that a simplified rendering of the content, so that it can be easily found by the search engines, is not an unfortunate compromise, but rather a necessary part of our work.

Will it be possible in all cases to break down the walls between the complex resources and the simple ways that web crawlers need to understand them? Absolutely not. The growing sophistication of the search engines does ensure that it gets easier with time, however. About a decade ago, we tried populating directories with tiny HTML files created from records in image databases. The crawlers gave up after picking up a few thousand records, apparently daunted by the vastness of this content. Now, however, this sort of approach works and only requires patience as the crawlers make repeated passes over the content. Large and complex text collections can be modeled as simplified text files, and the search engines can be tricked into pointing the user to the appropriate entry point for the work from which the text is drawn.
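
To make the "tiny HTML files" idea concrete, here is a minimal sketch in Python. It is an illustration, not the code we actually ran: the record fields (id, title, description, entry_url) and the function names are hypothetical, and it assumes each record's id is already safe to use as a file name. Each stub page carries a little descriptive text and a link back to the item's entry point in the full system, and an index page links to every stub so a crawler can find them.

```python
import html
from pathlib import Path


def render_stub(record):
    """Return a tiny, crawlable HTML page for one database record.

    The field names (title, description, entry_url) are assumptions
    made for this sketch, not a real schema.
    """
    title = html.escape(record["title"])
    description = html.escape(record.get("description", ""))
    # entry_url points at the item's page in the full collection system.
    entry_url = html.escape(record["entry_url"], quote=True)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{title}</title></head>\n'
        f"<body><h1>{title}</h1>\n"
        f"<p>{description}</p>\n"
        f'<p><a href="{entry_url}">View this item in the full collection</a></p>\n'
        "</body></html>\n"
    )


def write_stubs(records, out_dir):
    """Write one stub per record plus an index.html linking to all of them."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    links = []
    for record in records:
        # Assumes record["id"] is unique and safe as a file name.
        name = f"{record['id']}.html"
        path = out / name
        path.write_text(render_stub(record), encoding="utf-8")
        paths.append(path)
        label = html.escape(record["title"])
        links.append(f'<li><a href="{name}">{label}</a></li>')
    index = (
        "<!DOCTYPE html>\n<html><body><ul>\n"
        + "\n".join(links)
        + "\n</ul></body></html>\n"
    )
    (out / "index.html").write_text(index, encoding="utf-8")
    return paths
```

Pointing write_stubs at a web-accessible directory is the whole trick; the complex system stays as it is, and the stubs exist only to be found.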


One thing the analysis of the OAIster data shows is that, as a community, we have not availed ourselves of these relatively simple solutions to making our resources more widely discoverable. Not all of the challenges of modeling digital library resources are this easy. There are bigger challenges that require more creative solutions, but creating these solutions is part of the job of putting the resources online, not a nuisance or distraction from that job.

Posted by jpwilkin on July 27, 2008
Tags: library technology, google scholar

Comments


Ryan Shaw on paragraph 4:

“…too much trouble for the value of the content.”

This is precisely why we must avoid putting corporations in charge of our cultural heritage: despite claims to be “organizing the world’s information,” they really are only interested in organizing the subset of information that will bring them advertising revenues. Unless one believes that this subset is all that’s worth organizing, one ought to firmly reject the idea that Google and its ilk are anything more than advertising companies that provide useful tools. Adopt SEO techniques and manipulate them to draw traffic to our resources? Certainly. Trust them as stewards of those resources? Hell no.

July 27, 2008 11:50 am
Matthew Theobald on whole page:

John,

You might investigate the ISEN initiative. We’re pretty quiet now, but building up a voice toward revealing the entire hidden, invisible and deep WWW.

-m@

July 27, 2008 5:30 pm
ignorance on paragraph 5:

Isn’t there a difference between *storing* digital library as HTML and representing it that way for users to find?

July 27, 2008 6:39 pm
jpwilkin on paragraph 5:

You betcha. The difference is really about maintaining the objects (there’s much that you can do with them in formats that aren’t native to the Web) and flexibility (e.g., you can ‘express’ them in a variety of different ways). I’m sure there’s a large body of literature on this by now. I did a few pieces related to this in the early/mid-90’s, including a D-Lib article (“Just-in-time Conversion, Just-in-case Collections”) and a couple of things for the journal Public-Access Computer Systems Review (including, for example, “Using the World-Wide Web to Deliver Complex Electronic Documents: Implications for Libraries”).

July 28, 2008 5:05 am
jpwilkin on whole page:

I should have noticed Roy Tennant’s piece on this, http://hangingtogether.org/?p=475 (A Map to Destinations Uncrawled). A great analysis of the problem and of strategies.

July 28, 2008 5:11 am

[…] indexing of repositories? Hagendorn and Santell point the finger to Google indeed. However, John Wilkin, a colleague of them, doesn’t agree. Just as Lorcan Dempsey didn’t. And neither do […]

July 28, 2008 6:18 am
James Weinheimer on paragraph 8:

I don’t know if I completely agree with you that simplified text files are “good enough” (that may have to wait for practical tests) but I do agree that our materials must be in Google–for better or worse. Google has decided that they want XML sitemaps, and to be fair, it looks as if Google at least tried our way (OAI-PMH).

So, why don’t we give them their XML sitemaps? It seems to me easier than OAI-PMH, and we may find out that they, and you, are correct. If it doesn’t work, that could be interesting as well.

In either case, it would certainly be better than our materials being left out of Google searches.

July 30, 2008 2:27 am

[…] John Wilkin’s blog » Our hidden digital libraries (tags: google webinvisible webdesign digitallibraries bibliothèques library DocumentNumerique crawler interface moteur searchengine) […]

July 31, 2008 5:31 pm

[…] of Michigan library maven John Wilkin has some very interesting thoughts about “hidden” digital collections and making them more accessible via Google. Intriguing that we’ve so quickly come to this […]

August 16, 2008 10:02 pm