I was recently quoted in an AP article (published here in Salon) as saying that Brewster Kahle's position with regard to the openness of Google-digitized public domain content is "theoretical." Well, I sure thought I said "polemical," but them's the breaks. Brewster argues that Google's work in digitizing the public domain essentially locks it up--puts it behind a wall and makes it their own--and that this is a loss in a world that loves openness. The contrast here is meant to be with the work of the Open Content Alliance, where the same public domain work might be shared freely, transferred to anyone, anywhere, and used for any purpose. I don't want to get into the quibble here about the constraints on that apparently open-ended set of permissions (i.e., that an OCA contributor may end up putting constraints on materials that look worse than Google's constraints). What's key here for me, though, is the real practical part of openness--what most people want and what's possible through what Michigan puts online.


I think all of this debate begs us to ask the question "what is open"? For the longest time (since the mid-1990s), Michigan digitized public domain content and made it freely viewable, searchable and printable. Anyone, anywhere could come to a collection like Making of America and read, search and print to his heart's delight. If the same user wanted to download the OCR, that too was made possible and, in fact, the Distributed Proofreaders project has made good use of this and other MOA functionality. We didn't make it possible for anyone to get a collection of our source files because we were actively involved in setting up Print-on-Demand (POD): POD typically has up-front, per-title costs, and making the source files available would have cost us some sales that might otherwise pay for that initial investment. As we moved into the agreement with Google, we made clear our intention to do the same "open" thing with the Google-digitized content, and to throw in our lot with a (then) yet-to-be-defined multi-institutional "Shared Digital Repository." In fact, now we have hundreds of thousands of public domain works online, all of which are readable, searchable and printable by anyone in the world in much the same way.


So, what's the beef? The OCA FAQ states that for them this openness means that "textual material will be free to read, and in most cases, available for saving or printing using formats such as PDF." By all means! I hope it's clear by what I wrote above that this is an utterly accurate description of what happens when Google digitizes a volume from Michigan's collection and Michigan puts it online. It's also, incidentally, what Google makes possible, but even if Google didn't, Michigan could and would be rushing in to fill that breach. The challenges to Google's openness always seem to ignore what's actually possible through our copies at Michigan. This sort of polarizing rhetoric seems to be about making a point that's not accurate in the service of an attack on Google's primacy in this space: we don't want them to dominate the landscape, so let's characterize their Bad version as being the opposite of our Good version. This notion that what Google does is closed is not an accurate description of Google's version of these books, and even less so a description of Michigan's.


Could the Google books be more open? Absolutely. Along with Carl Malamud, for example, I would love to see all of the government documents that have been digitized by Google available for transfer to other entities so that the content could be improved and integrated into a wide variety of systems, thus opening up our government as well as our libraries. I believe that will happen, in fact, and that Google will one day (after they've had a chance to gain some competitive advantage) open up far more. In the meantime, however, when we talk about "open," let's mean it the way that the OCA FAQ means it. Let's mean it in the same way that the bulk of our audience means it. Let's talk about the ability to read, cite and search the contents of these books, and let's call the Google Books project and particularly Michigan's copies Open. Let's stop being theoretical, er, I mean polemical.

Posted by jpwilkin on April 25, 2008
Tags: digitization

Comments

Kathleen on paragraph 1:

This article from CNN says you said “theoretical” so it must be true. Of course, the article also says we’re doing scanning on the 2nd floor of our book-shelving department, so what do they know!

April 25, 2008 12:51 pm
Carl Malamud on paragraph 3:

> for saving or printing using formats such as PDF

John, pardon me if I don’t grok Mirlyn, but I pulled up a public domain document (a congressional hearing). I was able to pull the text up and page through, but there didn’t appear to be an easy way to save a single page, let alone the entire hearing. Perhaps that function is available to Michigan students, but I suspect the rest of the citizens of Michigan are in the same boat as the rest of us using the crippled interface.

I think it is fine that Michigan and Google have their arrangement, but it is disturbing when we see a state-funded institution like U. of Michigan putting up artificial barriers to access.

Your Mirlyn site is ok as far as web sites go, but letting a thousand flowers bloom always leads to more innovation. It would be great if any grad student in Ann Arbor (or anyplace else) could download your govdocs docs and come up with a better user interface.

(In addition to more innovation, that policy would lead to a more informed citizenry, which is generally considered an important part of democracy and I suspect is part of your state-sponsored mandate.)

Carl

April 25, 2008 2:10 pm
jpwilkin on paragraph 3:

Gosh, Carl, I think the best way I can respond is not only to say that I whole-heartedly agree with your call for more vigorous sharing, but to point to my fourth paragraph, where I point to your work and urge the same thing. Look, my point is that while this is good, and we are fighting for deeper sharing, this sort of thing is a fairly narrow piece of the openness issue.

On your point about the functionality and the opaqueness of getting PDFs, we’ll take that into account in our usability. It’s there, and we can do better. I should note that for us larger PDF chunks are also a resource issue, but that we’re very close to releasing a new version that gives you 10 pages at a time. Personally, I like the screen resolution PNG files and very much dislike PDF as a format, but that’s a usability position and not a philosophical one.

April 25, 2008 3:38 pm
Carl Malamud on paragraph 3:

At the risk of having too many positions dancing on the head of a comment, being able to download/save a copy of the full doc is a pretty key usability concern.

April 25, 2008 3:50 pm
Brewster Kahle on paragraph 2:

John, while it may not be appropriate to start this in a comment, I am quite taken aback by your seeming implication that “open” includes what Google is doing and what UMich is doing.

“Open” started to be widely used in the Internet community in association with certain software. Richard Stallman calls it “free”, but “open” has also come to be used. Let’s start with that.

“Open Source” in that community means the source code can be downloaded in bulk, read, analyzed, modified, and reused.

“Open Content” has followed much the same trajectory. Creative Commons evolved a set of licenses to help the widespread downloading of creative works, or “content”. Downloading, and downloading in bulk, is part of this overall approach as we see it at the Internet Archive.

Researchers (and more general users, but we can stick with researchers because they are a community that research libraries are supposed to serve) require downloadability of materials so they can be read, compared, analyzed, and recontextualized.

Page at a time interfaces, therefore, would not be “open” in this sense. Downloadable crippled versions would not be open in the Open Source or Open Content sense either.

As a library community, we can build on the traditions from the analog world of sharing widely even as we move into the digital world. We see this as why we get public support.

Let’s build that open world.

We would be happy to work with UMich to support its open activities.

-brewster

April 25, 2008 4:59 pm
jpwilkin on paragraph 2:

I think this is precisely the sort of rhetoric that’s muddying the waters right now, Brewster. There is no uniformly defined constituency called “researchers” who “require downloadability.” I know ‘em, I work with ‘em, and I know that’s not true. Access (and openness) is defined on a continuum. What we do is extraordinarily open and has made a tremendous difference for research and in the lives of ordinary users. This sort of differentiation in the full accessibility of source materials is one of the key incentives that has brought organizations like Google and Microsoft to the table, and if it didn’t make sense, the OCA wouldn’t go to pains to stipulate that “all contributors of collections can specify use restrictions on material that they contribute.” Is more open better? Damned right. That’s one reason why for two years we’ve been offering OCA the texts Michigan digitizes as part of its own in-house work. But is what we’re doing with Google texts open? Absolutely.

April 26, 2008 8:43 am

I’m not sure I get all these degrees of open … let me add a hypothetical if that helps clear this up.

What if a bunch of students in Ann Arbor organized themselves into a Democracy Club and started grabbing all the public domain documents they can find on MBooks and uploading them to some site such as scribd.com or pacer.resource.org for recycling? If the docs are open (and we’re just talking “works of the government” which are clearly in the public domain), would you consider that a misuse of your system and try to stop it, or would that fall inside of the open side of the open continuum we’re all trying to mutually understand in this dialogue?

Hypothetically speaking, of course. I’m not advocating that students form a Democracy Club and crawl your site to recycle public domain materials, I’m just trying to understand if the restrictions on reuse are passive ones like obscuring how to download files or if these are active restraints where the library is involved in enforcing restrictions on access to public domain materials.

Again, I’m not at all suggesting that students interested in furthering the public domain form Democracy Clubs and start harvesting documents from the public taxpayer-financed web sites at UMich and re-injecting them into the public domain.

April 26, 2008 7:13 pm

[…] Kahle’s as “theoretical,” when John meant polemical.” John has a nice blog post on the subject, with responses and rejoinders from both Brewster and from Carl Malamud. The […]

April 26, 2008 4:01 pm
jpwilkin on paragraph 2:

What if? If there really were that sort of interest, I’d hope that we’d have a chance to talk to the students and make sure they were aware of powerful options to make “in situ” use of the openly accessible government documents that they find in MBooks. I’d want to make sure they knew that in late June we’re releasing a “collection builder” application that will allow them to leverage our investment in permanent (did I say permanent?) curation of these materials so that the materials could be found and used after the current crop of students comes and goes, that the students could add to the body of works as more get digitized from our collection and the collections of other partner libraries (e.g., Wisconsin’s are coming in soon) and that we would want to hear what sorts of services (an RSS feed of newly added gov docs?) might aid them in their work. I’d want to talk to them about the issue of authority and quality, and would see if there were ways that their efforts could help improve the works in MBooks rather than dispersing the effort to copies in multiple places. And if they needed computational resources to do things like data mining, I’d let them know that we’re glad to help. But if none of this satisfied them, would we try to stop them? Assuming Google digitized the works, according to our agreement (4.4.1) we would make “reasonable efforts … to prevent [them] from … automated and systematic downloading” of the content, something we currently do and which does not undermine the ability of those same students to read, search and print the documents. Lots of openness there.

April 27, 2008 12:13 pm
Erik Hetzner on whole page :

If you are going to quote from the OCA FAQ, please do quote the entire answer.


What can people do with materials contained in the OCA archive?

The OCA will encourage the greatest possible degree of access to and reuse of collections in the archive, while respecting the rights of content owners and contributors. Generally, textual material will be free to read, and in most cases, available for saving or printing using formats such as PDF. Contributors to the OCA will determine the appropriate level of access to their content.

For example, the collection of American literature contributed by the Internet Archive, the University of California, and Yahoo! carries no restrictions and may be downloaded and reused for any purpose.

Additionally, the OCA welcomes all efforts to create and offer tools (including finding aids, catalogs, and indexes) that will enhance the usability of the materials in the archive.

Note the second two paragraphs, which define openness considerably differently than you do.

Your response to the question about the UM students suggests that you are not particularly interested in allowing researchers to use the scanned material for purposes beyond what you imagine.

Yes, it is better for a document to be online for anybody to read, search, print, & even save than it is for the document to be unavailable. But why are we allowing Google to place restrictions on books which are now in the public domain? Mass digitization opens up possibilities for data-mining, etc. that were not previously available. And by keeping documents locked up for use as a whole by researchers, you are restricting their use of it. As far as I am concerned, this is not “open”, no matter how much you might call it that.

April 27, 2008 2:54 pm
jpwilkin on whole page :

Suz Chapman reminds me that all too often discussions overlook or may not be clear on the fact that Michigan has secured the right to bring back a copy of all of the digitized content and to put it online ourselves. In fact, stories like the AP article that stimulated this discussion depict the state of things as being about access only being possible through Google. At Michigan, we have a fairly continuous flow of files coming back to us and the process of getting those links in the catalog is happening just as continuously (and automatically). One thing I’m pleased to add is that we’re collaborating with other institutions to expand the body of content, i.e., to create a “Shared Digital Repository,” something we intended from the outset of our agreement with Google. More on this later.

April 27, 2008 5:55 pm
bowerbird on paragraph 3:

i have cheered your stance
to make your scanned books
available to the public from
the very beginning…

what i see here — as well as
in your refusal to allow the
downloading of all the text
from a book in one operation,
as opposed to page-by-page
– seems to be a backtrack
on your earlier promises…

don’t know if i misunderstood
your early statements or if
i misunderstand you now,
but there seems — to me –
to be a significant difference.

-bowerbird

April 27, 2008 6:11 pm
Anon on paragraph 4:

Why give Google competitive advantage? How is this the role of a library vis-à-vis a vendor?

April 28, 2008 2:33 am
jpwilkin on paragraph 4:

An interesting question. Whether it *should* be (the role) or not isn’t really a question at this point, with decades of examples of libraries working with “vendors” in ways that leveraged library collections for what you might view as a vendor’s competitive advantage. Obviously, some deals were more selfless than others, and many have involved royalties or discounts provided to the libraries. In fact, *some* sort of relationship is absolutely necessary because of the inaccessibility of the materials in our collections (publishers, for example, don’t have the titles and frequently don’t know what they’ve published). On the other hand, the nature of the deal is what’s at issue, and we believe that having a copy of the files for long-term preservation and meaningfully open access, and fairly liberal terms for the way that we use the materials, is a good exchange for the hundreds of millions of dollars worth of work entailed in doing the scanning.

April 28, 2008 5:44 am

[…] Vaidhyanathan weighs in on the recent comments of John Wilkin’s of the University of Michigan libraries, wherein he described U of M’s collaboration with […]

April 29, 2008 2:30 pm

[…] Kahle and John Wilkin of the University of Michigan recently debated what “open” means. According to Kahle, open content, like open source code, can be […]

May 9, 2008 10:33 am

[…] and we should move in that direction. [I didn’t find a quote saying exactly that but Wilkin is a strong booster of Google […]

May 17, 2008 7:58 pm
jpwilkin :

Learned fangirl writes that “He quotes John Wilkin of University of Michigan libraries, as saying that kids today are only interested in looking at digital books….” Ain’t the sort of thing I weigh in on. I wouldn’t pretend to have thought much about digital natives or even just kids today and their use of systems.

May 18, 2008 4:21 am
R on whole page :

As I posted on my blog: Thanks for the comment. I was trying to summarize a quoted quote in a lecture, like an academic game of telephone. Considering the number of steps away from your own words I apologize for *all* of my inaccuracies!

May 19, 2008 7:15 am
bowerbird on paragraph 2:

i couldn’t post here (don’t know why),
so i posted elsewhere:
> http://paulcourant.net/2008/04/26/john-wilkin-and-others-on-openness-and-its-opposites/#comment-262

May 27, 2008 10:55 am