<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>John Wilkin’s blog &#187; digitization</title>
	<atom:link href="http://scholarlypublishing.org/jpwilkin/archives/category/digitization/feed" rel="self" type="application/rss+xml" />
	<link>http://scholarlypublishing.org/jpwilkin</link>
	<description>John's blog on libraries, library technology, and pizza</description>
	<lastBuildDate>Wed, 17 Feb 2010 01:28:30 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>The launch of HathiTrust</title>
		<link>http://scholarlypublishing.org/jpwilkin/archives/16</link>
		<comments>http://scholarlypublishing.org/jpwilkin/archives/16#comments</comments>
		<pubDate>Mon, 13 Oct 2008 13:04:02 +0000</pubDate>
		<dc:creator>jpwilkin</dc:creator>
				<category><![CDATA[digitization]]></category>
		<category><![CDATA[library technology]]></category>

		<guid isPermaLink="false">http://scholarlypublishing.org/jpwilkin/archives/16</guid>
		<description><![CDATA[Today, we officially launched HathiTrust, a multi-institutional effort to create the universal library&#8211;to bring together as comprehensive a body of works as possible and to do it in a way that ensures access, permanence, content preservation, and an advanced environment for research.  See the press release here:  http://www.hathitrust.org/press.  In short, HathiTrust is an effort born [...]]]></description>
			<content:encoded><![CDATA[<p>Today, we officially launched HathiTrust, a multi-institutional effort to create the universal library&#8211;to bring together as comprehensive a body of works as possible and to do it in a way that ensures access, permanence, content preservation, and an advanced environment for research.  See the press release here:  <a href="http://www.hathitrust.org/press" title="HathiTrust press release(s)" target="_blank">http://www.hathitrust.org/press</a>.  In short, HathiTrust is an effort born of libraries, working to bring the lasting contributions of libraries to bear on the growing body of digital materials available to students and researchers. Much has been said and written about the silo effect of digital libraries, the way that our early technological efforts balkanized content and failed to capitalize on economies of scale.  With the creation of HathiTrust, many of the world&#8217;s great research libraries will work together to create a single, comprehensive library without walls.  Our partners will work to coordinate their investments both in curating content and in building services, to create a whole greater than the sum of its parts.</p>
<p>In doing this, of course, we raise many questions:</p>
<p><strong>Is this an effort that will compete with Google Book Search?</strong><br />
We believe in the value the private sector can bring to great challenges like discovery, but we also believe that our commitment to permanence sets us apart from private sector efforts.  Should Google or Microsoft lose interest or should their stockholders question the corporate commitment to these large bodies of information, these companies will move on to other problems.  The libraries that have initiated this effort are committed to the long-term preservation and availability of their content; doing so is part of their fundamental identity as research libraries.  Moreover, it is always likely that research libraries will support uses that the private sector does not value.  Consider, for example, data mining and other types of analysis.  We will be working to support this type of activity for the researchers of our institutions, and for the public more broadly.  You can be sure that when something in the HathiTrust is cited, you can always return to that source, to confirm, refute or build on previous work.</p>
<p><strong>If HathiTrust strives to support <em>access</em>, what about access to its in-copyright materials?</strong><br />
The member institutions of HathiTrust obey the law and do not believe that, for example, &#8220;fair use&#8221; can be construed to mean authenticated access to this entire body of material for all of our users.  However, we do and will support many lawful uses of the in-copyright materials.  Under the terms of Section 108 of US copyright law, we may provide limited access to works that are in jeopardy and that are not readily available on the market.  In addition, for the first time ever, through the use of appropriate technologies we will be able to provide broad library access to many disabled users.  We also hope to work with rights holders to broaden access, not only to our constituencies, but to the world.  And, at the very least, one basic appropriate use is the preservation of this content.</p>
<p><strong>Is HathiTrust a digital archiving effort to end all digital archiving efforts?</strong><br />
We believe that HathiTrust occupies an important space in a valuable and growing area of work by our community.  Where Portico works with publishers to curate actively published journal content, HathiTrust will serve as the vehicle for preserving books and many journals (particularly journals that have ceased publication).  We intend to grow HathiTrust in many ways, but we will also work actively with organizations like Portico, OCLC and CLOCKSS to strengthen the support our community gives to preserving digital content.</p>
<p><strong>Is the content of HathiTrust &#8220;open&#8221;? </strong><br />
The library partners who have created HathiTrust are committed to broad access to the content in this digital library.  Hundreds of thousands of public domain works are already available in HathiTrust, and not simply to the communities immediately served by our libraries.  We understand that many would like to copy large numbers of digitized works from HathiTrust, and where we have appropriate rights (for tens of thousands of volumes already), we will make that possible.  We know that this openness provides the greatest benefit to our users, and we will work to make the content in the HathiTrust more accessible as time goes on.</p>
<p>HathiTrust faces many issues going forward&#8211;the quality of the content deposited, challenges to digital preservation, governance and cost models&#8211;but HathiTrust has demonstrated success and efficiency in overcoming the significant challenges it has faced thus far. By leveraging the capabilities of large-scale digitization and bringing together key partners, HathiTrust will create a new way for libraries to work together to ensure that the great values we have always stood for are supported well into the future.</p>
]]></content:encoded>
			<wfw:commentRss>http://scholarlypublishing.org/jpwilkin/archives/16/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Discovering the Undiscovered Public Domain</title>
		<link>http://scholarlypublishing.org/jpwilkin/archives/13</link>
		<comments>http://scholarlypublishing.org/jpwilkin/archives/13#comments</comments>
		<pubDate>Tue, 20 May 2008 00:23:57 +0000</pubDate>
		<dc:creator>jpwilkin</dc:creator>
				<category><![CDATA[copyright]]></category>
		<category><![CDATA[digitization]]></category>

		<guid isPermaLink="false">http://scholarlypublishing.org/jpwilkin/archives/13</guid>
		<description><![CDATA[At Michigan we’re engaged in an activity that I hope will one day seem ordinary and a routine part of library work. Resources from several departments are devoted to determining the copyright status of works typically presumed to be in copyright. For now, we’re focusing on US monographic imprints (books, that is) published between 1923 [...]]]></description>
			<content:encoded><![CDATA[<p>At Michigan we’re engaged in an activity that I hope will one day seem ordinary and a routine part of library work. Resources from several departments are devoted to determining the copyright status of works typically presumed to be in copyright. For now, we’re focusing on US monographic imprints (books, that is) published between 1923 and 1963, but plan to turn our attention to non-US publications in the future. I wouldn’t want to give anyone the impression that this is easy work or work without its share of legal perils, but it does feel distinctly like “library work” and, as might be obvious, has a number of very significant positive benefits for library users in an increasingly digital environment. In this post, I won’t be describing our procedures for doing our work here or detailing the pitfalls, but I’d like to use this installment to muse on a few things that have seemed remarkable or interesting. We have, on the other hand, made a proposal to IMLS to ramp up this activity and make it more reliable through collaboration with several other institutions, and before I finish I’ll say a bit more about that.</p>
<p><strong>Impact</strong><br />
One of the most interesting things about all of this is the great impact that’s possible with relatively modest resources. Experienced library staff in technical services—staff with considerable experience in various sorts of bibliographic work like copy- and original cataloging—review publications and use a variety of tools to make determinations about a book’s copyright status. In most cases, those staff can reliably confirm that the work is in the public domain or in copyright. Because their work is driven by content that has been digitized and is online, if the work has been determined to be in the public domain, we update records that control access to the materials and permit access.</p>
<p>I say “modest” and I know that this will seem odd to some colleagues at smaller libraries or other types of libraries, but in terms of research library staffing, the 1.5 FTE of professional staffing we devote to this work has a profound effect. Yes, of course more infrastructure is necessary than just these technical services staff members, but the processes in our library IT organization handle input from their work in overwhelmingly automated ways, creating lists of works to be reviewed and updating records about the rights status of the works. Compare this number to any other area of materials processing in a research library and the numbers seem modest. Consider, moreover, the fact that these staff process more than 2,000 titles each month, and that the majority of these works are found to be in the public domain. At our current rate, we’ll open access to over 15,000 titles in about one year of work. That would have been a phenomenal number in the pre-Google days of our digitization (I believe we digitized about 9,000 volumes in our peak year of preservation-related digitization work), and it focuses on more current publications than we typically see in digitization. For a relatively small sum, we’re benefiting our own constituency as well as readership throughout the Internet. No matter how you cut it, this feels like a good investment of library funds.</p>
<p><strong>What is “library work”?</strong><br />
Once you get past the question of whether digitization is “library work,” it seems frivolous to ask whether this kind of copyright-determination work is also the work of libraries, but I think it’s very clearly a question for some of my colleagues. The copyright determination work involves several steps where our staff are particularly skilled, and the outcomes seem right up our alley. What sort of outcomes? Clearly <em>access</em> is a big one, but one doesn’t need a lot of imagination to understand that our preservation work (particularly of these digital files) is benefited by the increased access. The skills piece is particularly germane, though. Doing this work depends on knowledge of bibliographic description, of the sorts of variability in practice that one sees between a book as it leaves the printer and the ways that it’s described in everything from library catalogs (including OCLC WorldCat) to the Library of Congress’s copyright registration records. It’s also work that depends on recognizing the traps that come along with copyright, like the possibility that a US work was previously published abroad and thus may still be eligible for copyright protection.  These things come naturally to a person well versed in bibliographic description: people who have been employed in processes that create the same sorts of records they are now being asked to review.</p>
<p>I’m a big fan of mainstreaming. In a conversation with a library director at another institution, a director who was once director of technical services, I found myself arguing that this work is the work of technical services, much to her dismay. I’m certain some institutions would be tempted to build out a separate unit devoted to this new activity, populated by bright young staff who have never worked their way through descriptive work using everything from the pre-1956 National Union Catalog to ancient card records to the once great variety of “bibliographic utilities” (RLIN, OCLC, WLN and the rest), but what we would lose in that sort of staffing model is a sort of skill that comes along with recognizing variations in descriptive practice and the great variety of publishing practices that we see in the materials themselves. By relying on existing technical services staff members, we have those skills and a sensitivity to the need to create sustainable, routinized activities.</p>
<p><strong>Not the common wisdom</strong><br />
Some readers will have noticed that my numbers don’t add up to what we have generally considered to be the distribution between in-copyright and public domain for US 1923-1963 publications. When we talk about US renewals, lots of numbers are bandied about; some numbers are based on very early analyses, and some numbers are based on reasonable sample-based analyses. The most common estimate is that only 15% of US books published between 1923 and 1963 had their copyright renewed. Our fairly random selection of titles has generated very different numbers: we’re finding renewals for about 30% of the works in our queue, with another 10% having problems that are complex enough to prevent a ready determination (e.g., possible previous foreign publication or the inclusion of works such as short stories or poems by multiple authors).</p>
<p>I should say a little bit about our processes and the way that our queue may be influencing these numbers. As either Michigan or Google digitizes volumes and they flow back into our repository, we cull candidate titles for review. We rely on fixed fields in the MARC records to find candidate titles (i.e., monographs published in the US between 1923 and 1963). We tend to prefer those materials that have been digitized or processed more recently, as the quality of processing improves over time and we’d like to optimize our impact. To date, most of the candidate volumes have been drawn from our storage facility (Buhr), and so tend to be lower circulation titles in poorer condition. These facts, by themselves, shouldn’t skew our numbers or, if they do, should skew the selection toward volumes where the copyright wasn’t renewed because the title was less popular. In any case, it’s hard to imagine that our collection, which is fairly comprehensive for the period, would tend to be anything but representative, and yet the numbers are running very high for renewals. Of course I’d be happier to find that 85% of the works are in the public domain, but I continue to be encouraged to find that 60% of the works are in the public domain.</p>
<p>It’s far too soon to say if these numbers will hold or if there are other factors at work, but after having reviewed more than 25,000 volumes, the fairly constant pattern emerging should interest many who watch this space.</p>
<p><strong>More to say, more to do</strong><br />
I didn’t intend to address the wealth of related issues in this space, but wanted to get these few thoughts down to start the conversation. Someone from Michigan should discuss how the work gets done, what the liabilities are for making mistakes, the various ways that we could make mistakes (did I hear the word “restoration”?), whether we should be doing the work in isolation, and many other topics. I will close by beginning to address that last question, however. We have long advocated sharing this work among many institutions and saw the Google digitization effort as one tremendous stimulus to creating some thoughtful, reliable <strong>group sourcing</strong>. In discussions with a very sympathetic General Counsel’s office, we concluded that the work of making these determinations would be strengthened by collaboration—double-blind or triple-blind tests of status, if you will. This winter, Anne Karle-Zenith on our staff wrote a proposal to IMLS for the creation of a multi-institutional queuing and vetting mechanism, and our friends at Indiana, Minnesota and Wisconsin wrote letters offering their enthusiastic support. I hope we will one day be doing this work in a well-documented and open group space, with contributions by many institutions. After all, while this really <em>is</em> library work, when it comes to US publications, there’s a bounded body of candidates, and by sharing this work our community can add several thousand titles to the <em>known</em> public domain.</p>
]]></content:encoded>
			<wfw:commentRss>http://scholarlypublishing.org/jpwilkin/archives/13/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Did I say &#8220;theoretical&#8221;?  Openness and Google Books digitization</title>
		<link>http://scholarlypublishing.org/jpwilkin/archives/12</link>
		<comments>http://scholarlypublishing.org/jpwilkin/archives/12#comments</comments>
		<pubDate>Fri, 25 Apr 2008 18:01:17 +0000</pubDate>
		<dc:creator>jpwilkin</dc:creator>
				<category><![CDATA[digitization]]></category>

		<guid isPermaLink="false">http://scholarlypublishing.org/jpwilkin/archives/12</guid>
		<description><![CDATA[I was recently quoted in an AP article (published here in Salon) as saying that Brewster Kahle&#8217;s position with regard to the openness of Google-digitized public domain content is &#8220;theoretical.&#8221; Well, I sure thought I said &#8220;polemical,&#8221; but them&#8217;s the breaks.  Brewster argues that Google&#8217;s work in digitizing the public domain essentially locks it [...]]]></description>
			<content:encoded><![CDATA[<p>I was recently quoted in an AP article (published <a href="http://www.salon.com/wires/ap/scitech/2008/04/24/D908LLMO0_google_book_search/index.html">here</a> in Salon) as saying that Brewster Kahle&#8217;s position with regard to the openness of Google-digitized public domain content is &#8220;theoretical.&#8221; Well, I sure thought I said &#8220;polemical,&#8221; but them&#8217;s the breaks.  Brewster argues that Google&#8217;s work in digitizing the public domain essentially locks it up&#8211;puts it behind a wall and makes it their own&#8211;and that this is a loss in a world that loves openness.  The contrast here is meant to be with the work of the Open Content Alliance, where the same public domain work might be shared freely, transferred to anyone, anywhere, and used for any purpose.   I don&#8217;t want to get into the quibble here about the constraints on that apparently open-ended set of permissions (i.e., that an OCA contributor may end up putting constraints on materials that look worse than Google&#8217;s constraints).  What&#8217;s key here for me, though, is the real practical part of openness&#8211;what most people want and what&#8217;s possible through <a href="http://www.lib.umich.edu/mdp/">what Michigan puts online</a>.</p>
<p>I think all of this debate begs us to ask the question &#8220;what is open&#8221;?  For the longest time (since the mid-1990s), Michigan digitized public domain content and made it freely viewable, searchable and printable.  <em>Anyone</em>, anywhere could come to a collection like <a href="http://moa.umdl.umich.edu/">Making of America</a> and read, search and print to his heart&#8217;s delight.  If the same user wanted to download the OCR, that too was made possible and, in fact, the <a href="http://www.pgdp.net/c/">Distributed Proofreaders</a> project has made good use of this and other MOA functionality.  We didn&#8217;t make it possible for anyone to get a collection of our source files because we were actively involved in setting up Print-on-Demand (POD).  POD typically has up-front, per-title costs, and making the source files available would have cost us some sales that might otherwise pay for that initial investment.   As we moved into the <a href="http://www.lib.umich.edu/mdp/umgooglecooperativeagreement.html">agreement with Google</a>, we made clear our intention to do the same &#8220;open&#8221; thing with the Google-digitized content, and to throw in our lot with a (then) yet-to-be-defined multi-institutional &#8220;Shared Digital Repository.&#8221;  In fact, now we have hundreds of thousands of public domain works online, all of which are readable, searchable and printable by anyone in the world in much the same way.</p>
<p>So, what&#8217;s the beef?  The <a href="http://www.opencontentalliance.org/faq.html">OCA FAQ</a> states that for them this openness means  that &#8220;textual material will be free to read, and in most cases, available for saving or printing using formats such as PDF.&#8221;  By all means!   I hope it&#8217;s clear by what I wrote above that this is an utterly accurate description of what happens when Google digitizes a volume from Michigan&#8217;s collection and Michigan puts it online.  It&#8217;s also, incidentally, what Google makes possible, but even if Google didn&#8217;t, Michigan could and would be rushing in to fill that breach.  The challenges to Google&#8217;s openness always seem to ignore what&#8217;s actually possible through our copies at Michigan.   This sort of polarizing rhetoric seems to be about making a point that&#8217;s not accurate in the service of an attack on Google&#8217;s primacy in this space:  we don&#8217;t want them to dominate the landscape, so let&#8217;s characterize their Bad version as being the opposite of our Good version.   This notion that what Google does is closed is not an accurate description of Google&#8217;s version of these books, and even less so a description of Michigan&#8217;s.</p>
<p>Could the Google books be <em>more</em> open?  Absolutely.   Along with <a href="http://freegovinfo.info/node/1541">Carl Malamud</a>, for example, I would love to see all of the government documents that have been digitized by Google available for transfer to other entities so that the content could be improved and integrated into a wide variety of systems, thus opening up our government as well as our libraries.   I believe that will happen, in fact, and that Google will one day (after they&#8217;ve had a chance to gain some competitive advantage) open up far more.  In the meantime, however, when we talk about &#8220;open,&#8221; let&#8217;s mean it the way that the OCA FAQ means it.  Let&#8217;s mean it in the same way that the bulk of our audience means it.  Let&#8217;s talk about the ability to read, cite and search the contents of these books, and let&#8217;s call the Google Books project and particularly Michigan&#8217;s copies Open.  Let&#8217;s stop being theoretical, er, I mean polemical.</p>
]]></content:encoded>
			<wfw:commentRss>http://scholarlypublishing.org/jpwilkin/archives/12/feed</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
	</channel>
</rss>
