<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>John Wilkin’s blog &#187; google scholar</title>
	<atom:link href="http://scholarlypublishing.org/jpwilkin/archives/category/google-scholar/feed" rel="self" type="application/rss+xml" />
	<link>http://scholarlypublishing.org/jpwilkin</link>
	<description>John's blog on libraries, library technology, and pizza</description>
	<lastBuildDate>Wed, 17 Feb 2010 01:28:30 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Our hidden digital libraries</title>
		<link>http://scholarlypublishing.org/jpwilkin/archives/14</link>
		<comments>http://scholarlypublishing.org/jpwilkin/archives/14#comments</comments>
		<pubDate>Sun, 27 Jul 2008 17:36:29 +0000</pubDate>
		<dc:creator>jpwilkin</dc:creator>
				<category><![CDATA[google scholar]]></category>
		<category><![CDATA[library technology]]></category>

		<guid isPermaLink="false">http://scholarlypublishing.org/jpwilkin/archives/14</guid>
		<description><![CDATA[Two of my very talented colleagues, Kat Hagedorn and Josh Santelli, just published a nice piece in D-Lib entitled &#8220;Google Still Not Indexing Hidden Web URLs.&#8221; Kat and Josh and I have discussed this problem off and on, stimulated in part by our frustrations in getting the OAI data collected by OAIster into search services [...]]]></description>
			<content:encoded><![CDATA[<p>Two of my very talented colleagues, Kat Hagedorn and Josh Santelli, just published <a href="http://www.dlib.org/dlib/july08/hagedorn/07hagedorn.html" target="_blank">a nice piece</a> in D-Lib entitled &#8220;Google Still Not Indexing Hidden Web URLs.&#8221; Kat and Josh and I have discussed this problem off and on, stimulated in part by our frustrations in getting the OAI data collected by OAIster into search services like Google.</p>
<p>In preparation for a recent talk in China (on the challenges and opportunities for digital libraries), I talked to Kat and Josh about the extent of OAIster data not findable through standard web searches.  That so much of our digital library content is not findable through standard search engines has always been a troublesome issue, and I would have expected that with the passage of time, this particular problem would have been solved.  It hasn&#8217;t, and that has made me wonder about what we do in digital libraries and how we do it.</p>
<p>Kat&#8217;s and Josh&#8217;s numbers are compelling.  OAIster focuses on the hidden web&#8211;resources not typically stored as files in a crawlable web directory&#8211;and so OAIster, with its 16 million records, is a particularly good resource for finding digital library resources.  Kat and Josh conclude that more than 55% of the content in OAIster can&#8217;t be found in Google.</p>
<p>As much as I like Kat&#8217;s and Josh&#8217;s analysis, I draw a different conclusion from the data.  They write that, &#8220;[g]iven the resulting numbers from our investigation, it seems that Google needs to do much more to gather hidden resources.&#8221;  This perspective is one many of us share.  We&#8217;re inclined to point a finger at Google (or other search engines) and wish they tried harder to look into our arcane systems.   We believe that if only Google and others had a deeper appreciation of our content or tried harder, this problem would go away.  I&#8217;ve been fortunate enough to be able to try to advance this argument one-on-one with the heads of Google and Google Scholar, and their responses are similar&#8211;too much trouble for the value of the content.  As time has passed, I&#8217;ve come to agree.</p>
<p>Complexity in digital library resources is at the heart of our work, and is frankly one reason why many of us find the work so interesting.  Anyone who thinks that the best way to store the 400,000 pages (140+ million words) of the texts in the Patrologia Latina is as a bunch of static web pages knows nothing of the uses or users of that resource or what&#8217;s involved in managing it.  Similarly, to effectively manage the tens of thousands of records for a run-of-the-mill image collection, you can&#8217;t store them as individual HTML pages lacking well-defined fields and relationships.  These things are obvious to people in our profession.</p>
<p>We often go wrong, however, when we try to share our love of complexity with the consumers.  We&#8217;ve come to understand that success in building our systems involves making complicated <em>uses</em> possible without at the same time requiring the user to have a complicated <em>understanding</em> of the resource.  What we must also learn is that a simplified rendering of the content, so that it can be easily found by the search engines, is not an unfortunate compromise, but rather a necessary part of our work.</p>
<p>Will it be possible in all cases to break down the walls between the complex resources and the simple ways that web crawlers need to understand them?  Absolutely not. The growing sophistication of the search engines does ensure that it gets easier with time, however.  About a decade ago, we tried populating directories with tiny HTML files created from records in image databases.  The crawlers gave up after picking up a few thousand records, apparently daunted by the vastness of this content.  Now, however, this sort of approach works and only requires patience as the crawlers make repeated passes over the content.  Large and complex text collections <em>can</em> be modeled as simplified text files, and the search engines can be tricked into pointing the user to the appropriate entry point for the work from which the text is drawn.</p>
<p>One thing the analysis of the OAIster data shows is that, as a community, we have not availed ourselves of these relatively simple solutions to making our resources more widely discoverable.  Not all of the challenges of modeling digital library resources are this easy.  There are bigger challenges that require more creative solutions, but creating these solutions is part of the job of putting the resources online, not a nuisance or distraction from that job.</p>
]]></content:encoded>
			<wfw:commentRss>http://scholarlypublishing.org/jpwilkin/archives/14/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Metasearch vs. Google Scholar</title>
		<link>http://scholarlypublishing.org/jpwilkin/archives/6</link>
		<comments>http://scholarlypublishing.org/jpwilkin/archives/6#comments</comments>
		<pubDate>Tue, 06 Nov 2007 01:59:04 +0000</pubDate>
		<dc:creator>jpw</dc:creator>
				<category><![CDATA[google scholar]]></category>
		<category><![CDATA[metasearch]]></category>

		<guid isPermaLink="false">http://scholarlypublishing.org/jpwilkin/2007/11/05/metasearch-vs-google-scholar/</guid>
		<description><![CDATA[What the world needs now is not another metasearch engine.  Mind you, having more and better and even free metasearch engines is a good thing, but there are already many metasearch engines, each with different strengths and weaknesses, and even some that are free and open source (e.g., see Oregon State’s LibraryFind).  Metasearch [...]]]></description>
			<content:encoded><![CDATA[<p>What the world needs now is not another metasearch engine.  Mind you, having more and better and even free metasearch engines is a good thing, but there are already many metasearch engines, each with different strengths and weaknesses, and even some that are free and open source (e.g., see <a href="http://libraryfind.org/" target="_blank">Oregon State’s LibraryFind</a>).  Metasearch isn’t an effective solution for the problem at hand.</p>
<p>Let’s start with the problem:  each of our libraries invests millions of dollars each year in a wide array of electronic resources for the campus, and we’d like to make it possible for our users to get the best possible information from these electronic resources in the easiest possible way.  When presented with this problem over the years, libraries have tacitly posed two possible solutions:  (1) bring all of the information together into a single database, or (2) find some way to search across all of these resources with a single search.  I suspect no one in our community has the audacity to suggest the first option as a solution because it’s <em>crazy talk</em>.  On the other hand, though, for more than a decade we’ve held out the hope of being able to search across many databases as a solution.  Wikipedia perhaps says it best in defining the term <a href="http://en.wikipedia.org/wiki/Metasearch" target="_blank"><em>metasearch</em></a>:  &#8220;Metasearch engines create what is known as a virtual database. They do not compile a physical database or catalogue of [all of their sources]. Instead, they take a user&#8217;s request, pass it to several other heterogeneous databases and then compile the results in a homogeneous manner based on a specific algorithm.&#8221;  Elsewhere, in the more polished <a href="http://en.wikipedia.org/wiki/Federated_search" target="_blank">entry for <em>federated search</em></a> (a more old-fashioned reference to the same concept), the author notes that federated searching solves the problem of scatter and lack of centralization, making a wide variety of documents “searchable without having to visit each database individually.”</p>
<p>Metasearch is a librarian’s idealistic solution to an intractable problem.<a href="http://scholarlypublishing.org/jpwilkin/archives/6#foot1">[1]</a>   Metasearching works, and there are standards that help ensure that it does.  So why doesn’t metasearch work to solve the larger problem I laid out at the beginning?  There are many reasons:  small variability in network performance, vast variations in the ways that different vendors’ database systems work, even greater variation in the information found in those different databases, and an overwhelming number of sources.  We complain at Michigan that our vendor product, MetaLib, is only able to search eight databases at once, but if there were no limits would we ask it to search the roughly 800 resources we currently list for our users?  Surely these problems are tractable.  Networks get more robust, standards are designed to iron out differences in systems, and 800 hardly seems like a large number.  Nevertheless, networks are in fact very robust right now and those standards only persist in trying to hamstring vendors who are trying to distinguish themselves from their competitors, and 800 <em>is</em> a very large number.  Despite all we do, even in the simplest metasearch applications today, when we repeat the <em>same</em> query against the <em>same</em> set of databases, we retrieve <em>different</em> results (IMHO, one of the greatest sins imaginable in a library finding tool).  We toss out important pieces of functionality in some of the resources in order to find the right lowest common denominator.  (Think about the plight of our hapless user when one database consists of fulltext and another is only bibliographic information:  a search of the first resource needs to be crafted carefully to avoid too-great recall, and the search of the second needs the broadest set of possible terms to avoid too high a level of precision.)   
This is not to say that it doesn’t make perfect sense to use metasearch to attack, say, a small group of similarly constructed and perhaps overlapping engineering databases rather than submitting the same search against each in some serial fashion.</p>
<p>Although metasearch doesn’t work to conduct discovery over the great big world of licensed content, creating a comprehensive database does work to conduct discovery over a vast array of resources.  Recent years have seen several presumptive dominant comprehensive databases.  Elsevier’s Scopus (focusing on STM and social science content) <a href="http://www.info.scopus.com/news/press/pr_050621.asp" target="_blank">claims</a> that its “[d]irect links to full-text articles, library resources and other applications like reference management software, make Scopus quicker, easier and more comprehensive to use than any other literature research tool.”  Scopus is just one of the most recent entrants in an arena where California’s <a href="http://oedb.org/" target="_blank">Online Education Database</a>, with its slogan of “Research Beyond Google,” can <a href="http://oedb.org/library/college-basics/research-beyond-google" target="_blank">claim</a> to present “119 Authoritative, Invisible, and Comprehensive Resources.”  Ironically, in describing the problem of getting at an “invisible web” estimated to be 500 times the size of the visible web, the OEDB poses itself as <a href="http://oedb.org/library/college-basics/research-beyond-google" target="_blank">going <em>beyond</em> Google</a>, when the obvious place to turn in all of this is <em>Google Scholar</em>.</p>
<p>Google Scholar (GS) is absolutely <em>not</em> a replacement for the vast array of resources we license for our users.  Criticisms of Google Scholar abound.  Perhaps most troubling to an academic audience, GS is secretive about its coverage:  no information exists, either from GS itself or from any watchdog group, on the extent of its coverage in any area or for any publisher.  Moreover, it will probably always be the case that some enterprises in our sphere fund the work of finding and indexing the literature of a discipline, online and offline, by charging for subscriptions, thus putting them in direct opposition to GS and keeping their indexes out of GS.  (Consider, for example, the Association for Asian Studies with its <a href="http://quod.lib.umich.edu/b/bas/" target="_blank">Bibliography of Asian Studies</a> or the Modern Language Association and the <a href="http://www.mla.org/bibliography" target="_blank">MLA Bibliography</a>, each funding its bibliographic sleuthing by selling access to the resulting indexes.  To give their information to GS is to destroy the same funding that makes it possible for them to collect the information.)  And yet, as we learned in the recent article “Metalib and Google Scholar: a User Study,” undergraduates are more effective in finding needed information through Google Scholar than through our metasearch tools.<a href="http://scholarlypublishing.org/jpwilkin/archives/6#foot2">[2]</a></p>
<p>If metasearch is an ineffective tool for comprehensive “discovery” and Google Scholar has its own shortcomings, the need and the opportunity in this space is <em>not</em> creating a more effective metasearch tool; rather, the challenge is to bring these two strategies together in a way that best serves the interests of an insatiable academic audience, whether undergraduate, graduate or faculty.</p>
<p>Recently, Ken Varnum (our head of Web Systems) and I brainstormed about a few approaches and followed this with a conversation with Anurag Acharya, who developed Google Scholar.  I toss out the strategies that follow to seed this conversation space with a few ideas, not to pretend to be exhaustive or to point to the best possible solution.  These need to be developed and tested before we pursue them further.  In each of these, the scenario begins with an authenticated user querying Google Scholar.  While the GS results are coming back and are presented to the user, into either a separate frame (Anurag’s recommendation, based on usability work at Google) or into a separate pop-up window, we present information about other sources that might prove useful.</p>
<p><strong>1. Capitalize on user information to augment GS searches</strong>:  When a user authenticates, we have at our disposal a number of attributes about the user such as status, currently enrolled courses, and degree programs.  With this, we initiate a metasearch of databases we deem to be relevant and either return, in that frame or window, ranked results or links to hit counts and databases.  One advantage of this approach is that it’s fairly straightforward with few significant challenges.  We would probably want to capitalize on work done by Groningen in their <a href="http://livetrix.ub.rug.nl/" target="_blank">Livetrix</a> implementation, where they eschew the standard MetaLib interface for a connection to the MetaLib X-Server so that they can better tailor interaction with the remote databases and present results.  The obvious disadvantage to this approach is that we make an assumption about a user based on his or her subject focus:  when a faculty member in English searches Google Scholar for information on mortality statistics in 16c England, we’re likely to have missed the mark by searching <em>MLA Bibliography</em>.</p>
<p><strong>2. Capitalize on query information to augment GS searches</strong>:  In this scenario, we find some way to intercept query terms to try to map concepts to possible databases.  We would use the same basic technical approach described above (i.e., GS in the main frame or window; other results in a separate frame or window) to ensure that the user immediately gets on-target results, but through sophisticated linguistic analysis we find and introduce the user to other databases that might bear fruit.  This approach avoids the deficiency of the first by making no assumptions about a user’s interest based on his or her degree/departmental affiliation.  It does, however, create great challenges for us in creating quick and accurate mapping relationships between brief (one- or two-word) query terms and databases.  Although a library might be able to undertake the first strategy with only modest resources, this second approach requires partnership with researchers in areas such as computational linguistics.</p>
<p><strong>3. Introduce the user to the possibility of other resources</strong>:  This more modest approach only requires the library interface to figuratively tap the user on the shoulder and point out that, in addition to GS, other resources may be helpful.  So, for example, we might submit the user’s query to GS <em>while</em> we submit the same query to Scopus and Web of Science, two other fairly comprehensive resources, produce hit counts, and suggest to the user that s/he look at results from these two databases or some of our other 800 resources.</p>
<p><strong>4. Use GS results to augment GS</strong>: Use the results from GS, rather than queries to GS, to derive the content of the “you could also look at…” pane.  By clustering things that come back, we could provide some subject areas that might be useful.  Clustering is tricky, of course, for the same reason that metasearch is tricky—we’re working with very little text, and with texts of dissimilar lengths—but if we could pull back the full text of documents via the OpenURL links GS provides, and then cluster that, we might have some useful information.  Again, a library might benefit from collaboration with some area of information science research, particularly on the semantic aspects.  The biggest challenge here would be in doing something that doesn’t introduce significant delay (and thus annoyance); however, we might accomplish this by offering it as an <em>option</em> to users (i.e., as in “good stuff here, but think you might want more and better?”).</p>
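<p>To make the third approach a little more concrete, here is a minimal sketch of the “tap on the shoulder”: send the same query to several sources at once, collect hit counts, and suggest only the sources that returned something.  Everything in it is an assumption for illustration.  The sample records, <code>make_counter</code>, and <code>suggest_other_sources</code> are hypothetical stand-ins, not the interfaces of Scopus, Web of Science, or MetaLib; a real version would replace the in-memory counters with calls to each vendor’s search service.</p>

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real database connectors. Each source is just a
# small in-memory list of titles; a real connector would call the vendor's
# search interface and read back a hit count.
SAMPLE_RECORDS = {
    "Scopus": ["mortality statistics in England", "plague mortality"],
    "Web of Science": ["mortality and parish registers"],
}


def make_counter(records):
    """Build a hit-count function over an in-memory list of titles."""
    def count_hits(query):
        term = query.lower()
        return sum(1 for title in records if term in title.lower())
    return count_hits


COUNTERS = {name: make_counter(recs) for name, recs in SAMPLE_RECORDS.items()}


def suggest_other_sources(query, counters=COUNTERS):
    """Query every source in parallel and return (name, hits) pairs,
    highest count first, dropping sources with no hits."""
    with ThreadPoolExecutor(max_workers=len(counters)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in counters.items()}
        counts = {name: fut.result() for name, fut in futures.items()}
    ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return [(name, hits) for name, hits in ranked if hits != 0]


print(suggest_other_sources("mortality"))
```

<p>With the sample data above, a query for “mortality” ranks Scopus (two hits) ahead of Web of Science (one hit).  Running the lookups in parallel is the point of the sketch: the suggestions should arrive alongside the GS results rather than delay them.</p>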
<p>Our challenge is to help our users through the maze of e-resources without interrupting their journey, getting them to results as quickly as possible; by combining results from Google Scholar with licensed resources we can help them get fast results <em>and</em> become more aware of the wealth of resources available to them.  All of these ideas are off-the-cuff and purposely sketchy.  Ken and I have spent little time exploring the opportunities or pitfalls. Some approaches will lend themselves to collaboration more than others (e.g., collaboration with HCI and linguistics researchers), but all benefit from further study (How much more effective is this approach than traditional metasearch?  Than Google Scholar alone?  How satisfied is the user with the experience compared to those other approaches?).</p>
<p><strong>Notes</strong><br />
<a title="foot1" name="foot1"></a>[1] Note the interestingly self-serving article by Tamar Sadeh, from Ex Libris, where she concludes, “Metasearch systems have several advantages over Google Scholar. We anticipate that in the foreseeable future, libraries will continue to provide access to their electronic collections via their branded, controlled metasearch system” (<em>HEP Libraries Webzine</em>, Issue 12 / March 2006, <a href="http://library.cern.ch/heplw/12/papers/1/" target="_blank">http://library.cern.ch/heplw/12/papers/1/</a>).</p>
<p><a title="foot2" name="foot2"></a>[2] Haya, Glenn, Else Nygren, and Wilhelm Widmark. “<a href="http://www.emeraldinsight.com/10.1108/14684520710764122" target="_blank">Metalib and Google Scholar: a User Study</a>” <em>Online Information Review</em> 31(3)(2007): 365-375.  I found one review of the article by an enlightened librarian where he concludes that the moral of the study is that we need to do a better job training our users to use metasearch.<span id="more-6"></span><!--more--></p>
]]></content:encoded>
			<wfw:commentRss>http://scholarlypublishing.org/jpwilkin/archives/6/feed</wfw:commentRss>
		<slash:comments>29</slash:comments>
		</item>
	</channel>
</rss>
