A digitisation system for entomological notebooks
The following is a short, technically oriented description of the digitisation system for entomological notebooks currently used by the Finnish Museum of Natural History.
Data entry in the project is largely done as remote work over the network, so a web-based platform is needed. The platform runs on an instance of CentOS Linux on a virtual machine server provided by the Museum. The digitisation project uses Drupal, an open source content management framework written in PHP. In addition, several Drupal extension modules have been installed to facilitate the work, including Workflow, Faceted search, Book, OAI-PMH and Forums.
The workflow has the following stages:
- imaging
- cataloguing of book information
- entering the text content of each page into a text field in Drupal
- proofreading
- structured data entry
- XML conversion
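The stages above can be sketched as an ordered enumeration. This is only an illustration in Python, not the Drupal Workflow configuration used in the project. It assumes that every page passes through the stages strictly in the listed order, which the description does not state explicitly. The names `Stage` and `next_stage` are hypothetical.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Workflow stages, numbered in the order they are listed above."""
    IMAGING = 1
    CATALOGUING = 2
    TEXT_ENTRY = 3
    PROOFREADING = 4
    STRUCTURED_DATA_ENTRY = 5
    XML_CONVERSION = 6


def next_stage(stage):
    """Return the stage that follows `stage`, or None after the last one.

    Assumption: stages are passed strictly in order, with no loops back.
    """
    if stage is Stage.XML_CONVERSION:
        return None
    return Stage(stage + 1)
```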
Each entomological notebook is given a distinct number, such as n1, n2 etc. The images of each double page also have a running number, which is appended to the notebook number to form the image file name, such as 'n1-001.JPG'. A small program written in Perl converts the image numbering produced by the Canon camera software into this format.
The imaging station consists of a Kaiser R1 RSX camera stand with lighting units and a Canon EOS 7 camera connected to a computer.
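The renaming step can be illustrated with a short sketch. The project's own program is written in Perl and is not reproduced here; the Python version below only shows the idea. It assumes that the camera software names files like 'IMG_0101.JPG' and that the running number follows the shooting order. Both are assumptions, as is the function name `plan_renames`.

```python
import re

# Assumption: camera files are named like "IMG_0101.JPG".
CAMERA_NAME = re.compile(r"^IMG_(\d+)\.jpe?g$", re.IGNORECASE)


def plan_renames(camera_names, notebook_id):
    """Map camera file names to names such as 'n1-001.JPG'.

    Files are numbered from 001 in the order of their camera numbers.
    Names that do not match the assumed camera pattern are skipped.
    """
    numbered = []
    for name in camera_names:
        match = CAMERA_NAME.match(name)
        if match is None:
            continue  # not a camera image, leave it alone
        numbered.append((int(match.group(1)), name))
    numbered.sort()
    return {
        name: f"{notebook_id}-{index:03d}.JPG"
        for index, (_, name) in enumerate(numbered, start=1)
    }
```

The function only returns the planned mapping and renames nothing on disk, so the result can be checked before any file is touched.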
Smaller versions of the images are produced and shown on the Drupal page, and a thumbnail image of the cover is also produced and shown on the book page. A freeware program called XnView is used to produce the smaller images from the large JPG files or from files in the Canon RAW image format.
The large JPG images are moved to a separate web server, and the smaller images, including the thumbnails, are moved to server folders directly accessible to Drupal.
The image file names are also used as Drupal page titles. This allows the tags to be generated automatically, so that the images show up as soon as they are moved to the appropriate folders on the server.
With the help of the Book module, the pages are structured into a two-level hierarchy: book metadata on the top level and the notebook page content under each book.
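How a page title could be turned into an image tag is sketched below. This is not the project's Drupal code. It assumes that the tags are HTML img elements, that the title is the file name with or without the '.JPG' extension, and that images are stored in one folder per notebook under a base URL. The folder layout, the default `base_url` and the function name `image_tag` are all hypothetical.

```python
import re
from html import escape

# Title such as "n1-001" or "n1-001.JPG": notebook number, dash, running number.
TITLE = re.compile(r"^(n\d+)-(\d{3})(?:\.JPG)?$")


def image_tag(page_title, base_url="/files/notebooks"):
    """Build an <img> tag for a page title, or raise ValueError.

    Assumption: images live at <base_url>/<notebook>/<notebook>-<number>.JPG.
    """
    match = TITLE.match(page_title)
    if match is None:
        raise ValueError(f"not a notebook page title: {page_title!r}")
    notebook, number = match.groups()
    stem = f"{notebook}-{number}"
    src = f"{base_url}/{notebook}/{stem}.JPG"
    return f'<img src="{escape(src, quote=True)}" alt="{escape(stem, quote=True)}">'
```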
The data entry is done in two phases. In the first phase the text content of each notebook page is entered in a way that reflects the original. The page layout is not necessarily maintained, and parts of the page, like arrows and special underscore lines, may not be reproduced as HTML.
Some of the notebooks contain text typed with a typewriter. An example of this is the Stockmann collection of entomological notebooks. For this material, the optical character recognition (OCR) software ABBYY FineReader has been used to read text from the images into text files. The raw text produced by OCR always needs some manual editing.
Proofreading is also part of the process: each person doing data entry assigns the page they have entered to another member of the data entry team for proofreading. Drupal's Workflow module supports this, and each team member gets a list of the pages assigned to them for proofreading.
In the second phase, structured data entry is done with an Excel spreadsheet specially tailored for the purpose. Its fields correspond to elements of the ABCD schema and of its local variant, the FMNH2008 schema. The completed spreadsheet files are uploaded to the server as file attachments to the corresponding pages. A conversion program produces XML files from them, and the relevant XML files can be viewed as linked files attached to the pages. These XML files are also transferred to the Museum's SVN-based repository.
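The rule that a page is never proofread by the person who entered it can be illustrated as follows. In the project the choice of proofreader is made by the person doing the data entry, and the lists are kept by the Workflow module. The sketch below is not that module's API. It adds one assumption of its own, namely that the page goes to the other team member with the fewest pending pages. The names `pick_proofreader` and `assign` are hypothetical.

```python
def pick_proofreader(author, team, pending):
    """Pick a team member other than `author` with the fewest pending pages.

    `pending` maps a member to the list of pages waiting for them.
    Ties go to whoever comes first in `team`, since min() returns
    the first minimum it finds.
    """
    candidates = [member for member in team if member != author]
    if not candidates:
        raise ValueError("proofreading needs at least one other team member")
    return min(candidates, key=lambda member: len(pending.get(member, [])))


def assign(page, author, team, pending):
    """Add `page` to the chosen proofreader's list and return their name."""
    proofreader = pick_proofreader(author, team, pending)
    pending.setdefault(proofreader, []).append(page)
    return proofreader
```

The `pending` dictionary plays the role of the per-person proofreading lists described above.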
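The conversion from spreadsheet rows to XML can be sketched in a few lines. This is not the project's conversion program. It assumes that the spreadsheet has been exported as CSV with one record per row and that the column headers are valid XML element names. The element names `Records` and `Record`, and the column names used in the example, are placeholders and not actual ABCD or FMNH2008 element names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_xml(csv_text, page_id):
    """Convert CSV text into a small XML document, one <Record> per row.

    Assumption: each column header is usable as an XML element name.
    Empty cells are left out of the output.
    """
    root = ET.Element("Records", {"page": page_id})
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, "Record")
        for column, value in row.items():
            if column is None or not value:
                continue  # skip surplus cells and empty cells
            ET.SubElement(record, column).text = value
    return ET.tostring(root, encoding="unicode")
```

For example, `rows_to_xml("ScientificName,Locality\nCarabus nitens,Helsinki\n", "n1-001")` returns a `Records` element containing one `Record` with two child elements.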