I few months ago I wrote a program called get-mbooks.pl, and it is was used to harvest MARC data from the University of Michigan’s OAI repository of public domain Google Books. You can download the program here, and what follows is the distribution’s README file:
This is the README file for script called get-mbooks.pl
This script — get-mbooks.pl — is an OAI harvester. It makes a connection to the OAI data provider at the University of Michigan. [1] It then requests the set of public domain Google Books (mbooks:pd) using the marc21 (MARCXML) metadata schema. As the metadata data is downloaded it gets converted into MARC records in communications format through the use of the MARC::File::SAX handler.
The magic of this script lies in MARC::File::SAX. Is a hack written by Ed Summers against MARC::File::SAX found on CPAN. It converts the metadata sent from the provider into “real” MARC. You will need this hacked version of the module in your Perl path, and it has been saved in the lib directory of this distribution.
To get get-mbooks.pl to work you will first need Perl. Describing how to install Perl is beyond the scope of this README. Next you will need the necessary modules. Installing them is best accomplished through the use of cpan but you will need to be root. As root, run cpan and when prompted, install Net::OAI::Harvester:
$ sudo cpan
cpan> install Net::OAI::HarvesterYou will also need the various MARC::Record modules:
$ sudo cpan
cpan> install MARC::RecordWhen you get this far, and assuming the hacked version of MARC::File::SAX is saved in the distribution’s lib directory, all you need to do next is run the program.
$ ./get-mbooks.pl
Downloading the data is not a quick process, and progress will be echoed in the terminal. At any time after you have gotten some records you can quit the program (ctrl-c) and use the Perl script marcdump to see what you have gotten (marcdump <file>).
Fun with OAI, Google Books, and MARC.
[1] http://quod.lib.umich.edu/cgi/o/oai/oai