<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>John Wilkin’s blog &#187; library technology</title>
	<atom:link href="http://scholarlypublishing.org/jpwilkin/archives/category/library-technology/feed" rel="self" type="application/rss+xml" />
	<link>http://scholarlypublishing.org/jpwilkin</link>
	<description>John's blog on libraries, library technology, and pizza</description>
	<lastBuildDate>Wed, 17 Feb 2010 01:28:30 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>The launch of HathiTrust</title>
		<link>http://scholarlypublishing.org/jpwilkin/archives/16</link>
		<comments>http://scholarlypublishing.org/jpwilkin/archives/16#comments</comments>
		<pubDate>Mon, 13 Oct 2008 13:04:02 +0000</pubDate>
		<dc:creator>jpwilkin</dc:creator>
				<category><![CDATA[digitization]]></category>
		<category><![CDATA[library technology]]></category>

		<guid isPermaLink="false">http://scholarlypublishing.org/jpwilkin/archives/16</guid>
		<description><![CDATA[Today, we officially launched HathiTrust, a multi-institutional effort to create the universal library&#8211;to bring together as comprehensive a body of works as possible and to do it in a way that ensures access, permanence, content preservation, and an advanced environment for research.  See the press release here:  http://www.hathitrust.org/press.  In short, HathiTrust is an effort born [...]]]></description>
			<content:encoded><![CDATA[<p>Today, we officially launched HathiTrust, a multi-institutional effort to create the universal library&#8211;to bring together as comprehensive a body of works as possible and to do it in a way that ensures access, permanence, content preservation, and an advanced environment for research.  See the press release here:  <a href="http://www.hathitrust.org/press" title="HathiTrust press release(s)" target="_blank">http://www.hathitrust.org/press</a>.  In short, HathiTrust is an effort born of libraries, working to bring the lasting contributions of libraries to bear on the growing body of digital materials available to students and researchers. Much has been said and written about the silo effect of digital libraries, the way that our early technological efforts balkanized content and failed to capitalize on economies of scale.  With the creation of HathiTrust, many of the world&#8217;s great research libraries will work together to create a single, comprehensive library without walls.  Our partners will work to coordinate their investments both in curating content and in building services, to create a whole greater than the sum of its parts.</p>
<p>In doing this, of course, we raise many questions:</p>
<p><strong>Is this an effort that will compete with Google Book Search?</strong><br />
We believe in the value the private sector can bring to great challenges like discovery, but we also believe that our commitment to permanence sets us apart from private sector efforts.  Should Google or Microsoft lose interest or should their stockholders question the corporate commitment to these large bodies of information, these companies will move on to other problems.  The libraries that have initiated this effort are committed to the long-term preservation and availability of their content; doing so is part of their fundamental identity as research libraries.  Moreover, it is always likely that research libraries will support uses that the private sector does not value.  Consider, for example, data mining and other types of analysis.  We will be working to support this type of activity for the researchers of our institutions, and for the public more broadly.  You can be sure that when something in the HathiTrust is cited, you can always return to that source, to confirm, refute or build on previous work.</p>
<p><strong>If HathiTrust strives to support <em>access</em>, what about access to its in-copyright materials?</strong><br />
The member institutions of HathiTrust obey the law and do not believe that, for example, &#8220;fair use&#8221; can be construed to mean authenticated access to this entire body of material for all of our users.  However, we do and will support many lawful uses of the in-copyright materials.  Under the terms of Section 108 of US copyright law, we may provide limited access to works that are in jeopardy and that are not readily available on the market.  In addition, for the first time ever, through the use of appropriate technologies we will be able to provide broad library access to many disabled users.  We also hope to work with rights holders to broaden access, not only to our constituencies, but to the world.  And, at the very least, one basic appropriate use is the preservation of this content.</p>
<p><strong>Is HathiTrust a digital archiving effort to end all digital archiving efforts?</strong><br />
We believe that HathiTrust occupies an important space in a valuable and growing area of work by our community.  Where Portico works with publishers to curate actively published journal content, HathiTrust will serve as the vehicle for preserving books and many journals (particularly journals that have ceased publication).  We intend to grow HathiTrust in many ways, but we will also work actively with organizations like Portico, OCLC and CLOCKSS to strengthen the support our community gives to preserving digital content.</p>
<p><strong>Is the content of HathiTrust &#8220;open&#8221;? </strong><br />
The library partners who have created HathiTrust are committed to broad access to the content in this digital library.  Hundreds of thousands of public domain works are already available in HathiTrust, and not simply to the communities immediately served by our libraries.  We understand that many would like to copy large numbers of digitized works from HathiTrust, and where we have appropriate rights (for tens of thousands of volumes already), we will make that possible.  We know that this openness provides the greatest benefit to our users, and we will work to make the content in the HathiTrust more accessible as time goes on.</p>
<p>HathiTrust faces many issues going forward&#8211;the quality of the content deposited, challenges to digital preservation, governance and cost models&#8211;but HathiTrust has demonstrated success and efficiency in overcoming significant challenges it has faced thus far. By leveraging the capabilities of large-scale digitization and bringing together key partners, HathiTrust will create a new way for libraries to work together to ensure that the great values we have always stood for are supported well into the future.</p>
]]></content:encoded>
			<wfw:commentRss>http://scholarlypublishing.org/jpwilkin/archives/16/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Our hidden digital libraries</title>
		<link>http://scholarlypublishing.org/jpwilkin/archives/14</link>
		<comments>http://scholarlypublishing.org/jpwilkin/archives/14#comments</comments>
		<pubDate>Sun, 27 Jul 2008 17:36:29 +0000</pubDate>
		<dc:creator>jpwilkin</dc:creator>
				<category><![CDATA[google scholar]]></category>
		<category><![CDATA[library technology]]></category>

		<guid isPermaLink="false">http://scholarlypublishing.org/jpwilkin/archives/14</guid>
		<description><![CDATA[Two of my very talented colleagues, Kat Hagedorn and Josh Santelli, just published a nice piece in D-Lib entitled &#8220;Google Still Not Indexing Hidden Web URLs.&#8221; Kat and Josh and I have discussed this problem off and on, stimulated in part by our frustrations in getting the OAI data collected by OAIster into search services [...]]]></description>
			<content:encoded><![CDATA[<p>Two of my very talented colleagues, Kat Hagedorn and Josh Santelli, just published <a href="http://www.dlib.org/dlib/july08/hagedorn/07hagedorn.html" target="_blank">a nice piece</a> in D-Lib entitled &#8220;Google Still Not Indexing Hidden Web URLs.&#8221; Kat and Josh and I have discussed this problem off and on, stimulated in part by our frustrations in getting the OAI data collected by OAIster into search services like Google.</p>
<p>In preparation for a recent talk in China (the challenges and opportunities for digital libraries), I talked to Kat and Josh about the extent of OAIster data not findable through standard web searches.  That so much of our digital library content is not findable through standard search engines has always been a troublesome issue, and I would have expected that with the passage of time, this particular problem would have been solved.  It hasn&#8217;t, and that has made me wonder about what we do in digital libraries and how we do it.</p>
<p>Kat&#8217;s and Josh&#8217;s numbers are compelling.  OAIster focuses on the hidden web&#8211;resources not typically stored as files in a crawlable web directory&#8211;and so OAIster, with its 16 million records, is a particularly good resource for finding digital library resources.  Kat and Josh conclude that more than 55% of the content in OAIster can&#8217;t be found in Google.</p>
<p>As much as I like Kat&#8217;s and Josh&#8217;s analysis, I draw a different conclusion from the data.  They write that, &#8220;[g]iven the resulting numbers from our investigation, it seems that Google needs to do much more to gather hidden resources.&#8221;  This perspective is one many of us share.  We&#8217;re inclined to point a finger at Google (or other search engines) and wish they tried harder to look into our arcane systems.   We believe that if only Google and others had a deeper appreciation of our content or tried harder, this problem would go away.  I&#8217;ve been fortunate enough to be able to try to advance this argument one-on-one with the heads of Google and Google Scholar, and their responses are similar&#8211;too much trouble for the value of the content.  As time has passed, I&#8217;ve come to agree.</p>
<p>Complexity in digital library resources is at the heart of our work, and is frankly one reason why many of us find the work so interesting.  Anyone who thinks that the best way to store the 400,000 pages (140+ million words) of the texts in the Patrologia Latina is as a bunch of static web pages knows nothing of the uses or users of that resource or what&#8217;s involved in managing it.  Similarly, to effectively manage the tens of thousands of records for a run-of-the-mill image collection, you can&#8217;t store them as individual HTML pages lacking well-defined fields and relationships.  These things are obvious to people in our profession.</p>
<p>We often go wrong, however, when we try to share our love of complexity with the consumers.  We&#8217;ve come to understand that success in building our systems involves making complicated <em>uses</em> possible without at the same time requiring the user to have a complicated <em>understanding</em> of the resource.  What we must also learn is that a simplified rendering of the content, so that it can be easily found by the search engines, is not an unfortunate compromise, but rather a necessary part of our work.</p>
<p>Will it be possible in all cases to break down the walls between the complex resources and the simple ways that web crawlers need to understand them?  Absolutely not. The growing sophistication of the search engines does ensure that it gets easier with time, however.  About a decade ago, we tried populating directories with tiny HTML files created from records in image databases.  The crawlers gave up after picking up a few thousand records, apparently daunted by the vastness of this content.  Now, however, this sort of approach works and only requires patience as the crawlers make repeated passes over the content.  Large and complex text collections <em>can</em> be modeled as simplified text files, and the search engines can be tricked into pointing the user to the appropriate entry point for the work from which the text is drawn.</p>
<p>One thing the analysis of the OAIster data shows is that, as a community, we have not availed ourselves of these relatively simple solutions to making our resources more widely discoverable.  Not all of the challenges of modeling digital library resources are this easy.  There are bigger challenges that require more creative solutions, but creating these solutions is part of the job of putting the resources online, not a nuisance or distraction from that job.</p>
]]></content:encoded>
			<wfw:commentRss>http://scholarlypublishing.org/jpwilkin/archives/14/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Next Generation Library Systems</title>
		<link>http://scholarlypublishing.org/jpwilkin/archives/7</link>
		<comments>http://scholarlypublishing.org/jpwilkin/archives/7#comments</comments>
		<pubDate>Fri, 16 Nov 2007 15:56:11 +0000</pubDate>
		<dc:creator>jpwilkin</dc:creator>
				<category><![CDATA[library technology]]></category>

		<guid isPermaLink="false">http://scholarlypublishing.org/jpwilkin/archives/7</guid>
		<description><![CDATA[The problem
With the backdrop of the widely touted lessons of Amazoogle—an expression I can barely stand to write—three of the more interesting emerging developments of late have been OCLC’s WorldCat Local, Google Book Search, and Google Scholar. As Lorcan Dempsey argued, the &#8220;massive computational and data platforms [of Google, Amazon and EBay] exercise [a] strong [...]]]></description>
			<content:encoded><![CDATA[<p><strong>The problem</strong><br />
With the backdrop of the widely touted lessons of Amazoogle—an expression I can barely stand to write—three of the more interesting emerging developments of late have been OCLC’s WorldCat Local, Google Book Search, and Google Scholar. As Lorcan Dempsey <a href="http://orweblog.oclc.org/archives/000562.html" target="_blank">argued</a>, the &#8220;massive computational and data platforms [of Google, Amazon and EBay] exercise [a] strong gravitational web attraction,&#8221; a sort of undeniable central force in the solar system of our users’ web experience. What has happened with WorldCat Local, Google Book Search and Google Scholar has extended that same sort of pull to key scholarly discovery resources. No one needed the OCLC environmental scans to be reminded that our users look to Google before they turn to the multi-million dollar scholarly resources that we purchase for them, and everyone was aware that Amazon satisfied a broad range of discovery needs more effectively than the local catalog. Now, however, mainstream “network services” like Amazon and Google web search, deficient in their ability to satisfy scholarly discovery, are complemented by similarly “massive computational and data platforms” that specialize in just that—finding resources in the scholarly sphere. These forces, and perhaps more like them in the future, should influence the way that we design and build our library systems. If we ignore these types of developments, choosing instead to build systems with ostensibly superior characteristics, <em>systems that sit on the margins</em>, we effectively ensure our irrelevance, building systems for an idealized user who is practically non-existent.</p>
<p>Our resources, skills and investments have helped to create an opportunity for us to shape a next generation of library systems, simultaneously cognizant of the strong network layer <strong>and</strong> our needs and responsibilities as a preeminent research library. At Michigan, we have designed and built our past systems, each in partial isolation from the other system, reflecting the state of library technology and our response to user needs. We were not <em>wrong</em> in the way that we developed our systems, but rather we were right for those times. In building things in this way, we have developed an LMS support team with extraordinary talent and responsiveness, a digital library systems development effort that blazed trails and continues to be valued for the solidity of its product, and base-funded IT infrastructure that is utterly rock-solid&#8211;all great, but generally as independently conceived efforts.<a href="http://scholarlypublishing.org/jpwilkin/archives/7#foot1">[1]</a> What libraries like ours must do now is reconceive our efforts in light of the changed environment. The reconceptualization should, as mentioned, not only be built with an awareness of the new destinations our users choose, but also with a recognition that we have a special responsibility for the long-term curation of library assets. Even at its most successful, Google Scholar does not include all of the roughly $8m in electronic resources that we purchase for the campus, and Google Book Search is not designed to support the array of activities that we associate with scholarship.</p>
<p>Knowing that we must change where we invest our resources is one thing; knowing where we must invest is another. I don’t believe I should (or could) paint an accurate picture of the sorts of shifts we should make. On the other hand, I can lay out here a number of key principles that should guide our work.</p>
<p><strong>Principles<br />
1.    Balanced against network services</strong>: I believe this is probably the most important principle in the design of what we must build. We must not try to do what the network can do for us. We must find ways to facilitate integration with network services and ensure that our investment is where our role is most important (e.g., not trying to compete with the network services <em>unless</em> we think we can and should displace them in a key area). For example, we have recognized that Google will be a point of discovery, and so rather than trying to duplicate what they do well for the broad masses of people, we should (1) put all things online in a way that Google can discover; and (2) because we recognize that Google won’t build services in ways that serve all scholarly needs, work to strategically complement what they do. In the first instance (i.e., making sure that Google can discover resources), there will always be cases where we need to block them, for legal or other reasons, from discovering content.<a href="http://scholarlypublishing.org/jpwilkin/archives/7#foot2">[2]</a> These types of exceptions should add nuance to what we do in exposing content. In the second instance, when it comes to building complementary services, we’ll need to be both smart (and well-informed) and strategic.</p>
<p><strong>2.    Openness</strong>:  What we develop should easily support our building services <em>and</em>, even more importantly, should allow others to build them. It should take advantage of existing protocols, tools and services. Throughout this document, I want to be very clear that these principles or criteria don’t necessarily point to a specific tool or a specific way of doing things. Here, I would like to note that the importance of openness, though great, does not necessarily point to the need to do things as <em>open source</em>.  As O’Reilly <a href="http://www.oreilly.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html" target="_blank">has written</a> in his analysis of the emergence of Web 2.0, this is what we see in Amazon’s and Google’s architectures, where the mechanisms for building services are clearly articulated, but no one sees the code for their basic services: the investment shifts from shareable software to services. Similarly, our being open to having external services built on top of our own should not imply that our best or only route is open source software. What is particularly important is the need to have data around which others would like to build tools and services: openness in resources that few wish to include is really only beautifying a backwater destination.</p>
<p><strong>3.    Open source</strong>: Despite what I noted above about openness, we should try, wherever possible, to do our work with open source licensing models and we should try to leverage existing open source activities. In part, this is simply because, in doing so, we’ll be able to leverage the development efforts of others. We should also aim for this because of the increasing cost of poorly functioning commercial products in the library marketplace. Note, though, that when we choose to use open source software, it’s important to pick the right open source development effort—one that is indeed open and around which others are developing. Much open source software is isolated, with few contributions. We should aim for openness in our services over slavish devotion to open source. We should also choose this route when we can simply because it&#8217;s the best economic model for software in our sphere.</p>
<p><strong>4.    Integration</strong>: Tight integration is not the most important characteristic of the systems we should build, nor should this sort of integration be an end in itself; however, we have an opportunity to optimize integration across all or most of our systems, making an investment in one area count for others. In Michigan’s MBooks repository, we have already begun to demonstrate some of the value in this type of integration by relying on the Aleph X-Server for access to bibliographic information, and we should continue to make exceptions to tighter integration only after careful deliberation. A key example is <a href="http://scholarlypublishing.org/jpwilkin/archives/6" target="_blank">the use of metasearch for discovery of remote and local resources</a>: we should need to address only a single physical or virtual repository for locally-hosted content. We should give due consideration to the value of “loose” integration (e.g., automatically copying information out of sources and into target systems), but the example of the Aleph X-Server has been instructive and shows the way this sort of integration can provide both increased efficiency and greater reliability in results.</p>
<p><strong>5.    Rapid development</strong>: If we take a long time to develop our next generation architecture, it will be irrelevant before we deploy it. I know this pressure is a classic tension point between Management and Developers: one perspective holds that we’re spending our time on fine-looking code rather than getting a product to the user, and the other argues that work done rapidly will be done poorly. This dichotomy is false. The last few years of Google’s “perpetual beta” and a rapidly changing landscape have underlined the need to build services quickly, while the importance of reliability and unforgiving user expectations have helped to emphasize the value of a quality product. We can’t do one without the other, and I think the issue will be scaling our efforts to the available resources, picking the right battles, and not being overambitious.</p>
<p><strong>Directions</strong><br />
These sorts of defining principles are familiar and perhaps obvious, but what is less obvious is where all of this points. Although there are some clear indications that these sorts of principles are at play in, for example, the adoption of WorldCat Local or the integration of Fedora in VTLS’s library management system, there are also contradictory examples (e.g., the rush to enhance the local catalog, and many more silo-like systems like DSpace), and I’ve heard no articulations of an overarching integrated environment. If we undertake a massive restructuring of our IT infrastructure rather than strategic changes in some specific areas, or tweaking in many areas, it may appear to be an idiosyncratic and expensive development effort that robs one&#8217;s larger library organization of limited cycles for enhancements to existing systems. On the other hand, if we don’t position ourselves to take advantage of the types of changes I mentioned at the outset, we will polish the chrome on our existing investments for a few years until someone else gets this right or libraries are entirely irrelevant. Moreover, if we make the right sorts of choices in the current environment, we should also be able to capitalize on the efforts of others, thus compounding the return on each library’s investment. And of course, situating this discussion in a multi-institutional, cooperative effort minimizes the possibility that building the new architecture robs our institutions of scarce cycles.</p>
<p>It’s important, also, to keep in mind that this kind of perspective (i.e., the one I’m positing here) doesn’t presume to replace our existing technologies with something different. Many libraries have made many good choices on technologies that are serving their institutions well, and to the extent that they are the best or most effective tool for aligning with the principles I’ve laid out, we should use them. The X-Servers of Aleph and MetaLib are excellent examples of tools that allow the sort of integration we imagine. At UM, our own <a href="http://www.dlxs.org/" target="_blank">DLXS</a> and the new repository software we developed are powerful and flexible tools without the overhead of some existing DL tools. But in each case, it may make more sense to migrate to a new technology because we are elaborating a model of broader integration (both locally and with the ‘net) that others may also use. Where there is a shared development community (e.g., <a href="http://www.fedora-commons.org/" target="_blank">Fedora</a>, <a href="http://www.open-ils.org/" target="_blank">Evergreen</a> or <a href="http://libraryfind.org/" target="_blank">LibraryFind</a>), we can benefit from a community of developers. In all of this, we’ll need a strategy, and a strategy that remains flexible as the landscape changes.</p>
<p>It’s time to see our environment as being comprised of a set of inventory management responsibilities (both print and digital, both local and remote) that leverages a growing and maturing array of network services so that our users can effectively discover and use the resources available to them. I think that requires a change in the way we think about our technologies and a much more strategic arrangement of those technologies in relation to each other. We may be stuck with a bunch of local print “repositories” because of the nature of print and the history of library development. That’s not the case for our digital repository, however. On top of this, we need to conceptualize the sorts of services we need (e.g., ingest, exposure, other types of dissemination, archiving, etc.) and the tools that can best accomplish these things.</p>
<p><strong>Notes</strong><br />
<a title="foot1" name="foot1"></a>[1] Incidentally, I also believe that <a href="http://www.lib.umich.edu/lit/" target="_blank">Michigan’s organizational model</a>, comprised as it is of five distinct IT departments, is ideally suited to building the next generation of access and management technologies. Core Services should continue to provide a foundation of technology relevant to all of our activities, and should continue to develop and maintain system integration services used by all of the Library’s IT units. Library Systems will need to continue to support operational activities such as circulation and cataloging at the same time that it manages our most important database of descriptive metadata. DLPS should continue to focus on technologies that manage and provide access to the digital objects themselves—the data described by those metadata. Web Systems is ideally suited to provide a top layer of discovery and “use” tools that tap into both local data resources and those things we license remotely. I believe that our current organizational model shares out responsibility effectively and allows for a sort of specialization that is complementary; however, I wouldn’t rule out different organizational models if they made sense in the course of this process. For those readers outside the UM Library, the fifth department is Desktop Support Services, responsible not only for the desktop platform but also for the infrastructure supporting it.</p>
<p><a title="foot2" name="foot2"></a>[2] For example, with regard to <a href="http://deepblue.lib.umich.edu/" target="_blank">Deep Blue</a>, our institutional repository, in Michigan’s agreement with Wiley, approximately 33% of the Wiley-published/UM-authored content is restricted to UM users; and in our agreement with Elsevier, we may make it possible for Google to discover metadata but not fulltext. Similar things are bound to occur in the materials we put online in services other than Deep Blue.</p>
]]></content:encoded>
			<wfw:commentRss>http://scholarlypublishing.org/jpwilkin/archives/7/feed</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>
