Home Straight for ANC Archives Digitization Project
By the end of March 2018 the 14-member Africa Media Online team, resident in Alice, Eastern Cape, and working in the National Heritage and Cultural Studies Centre (NAHECS) at the University of Fort Hare, had completed the digital capture of all the material assigned to them in the current phases of the ANC Archives digitisation project.
While the overall project actually started in the last quarter of 2011 and continued into 2012, the current phases of the project kicked off in November 2015 and are due to end at the end of April 2018. By then the team will have digitised the contents of 3,576 archival boxes as well as museum objects, posters, banners and other materials. At one time we had five different workflows running concurrently: one to digitise bound manuscripts, another to digitise fragile manuscripts, another to digitise plain paper, another to digitise photographic prints and the final one to digitise slides and negatives. In all, we will have captured close to 2 million pages, with over 1.3 million being plain paper, over 300,000 being fragile papers and also over 300,000 being bound manuscripts. We also captured almost 82,000 photographic images and over 3,000 museum objects. The team worked exceptionally hard, often operating in two shifts and at times in three shifts around the clock to keep to targets. They not only digitised the archive but ordered and documented it too.
Before we could even start digitising the collection, we needed to order the collection down to the item level. The physical archive was well structured, with various subcollections and, within those subcollections, various series. The series are made up of containers (mostly archival boxes) and within the containers are usually folders. Inside each folder are items. The challenge we faced when we arrived to do the digitisation was that the archive had been ordered down to the folder level, but not to the item level. So began the massive task of itemizing the entire collection, which meant we did not actually start digitising the collection until nine months after the start of the project. That itemizing process continued in parallel with digital capture until the end of 2017. To make that possible, we had to grow the team after the first nine months from six to 14 members.
The manuscripts workflow started with the Itemizing Team, where four or five team members sat assigning a number to each and every item in the archive. Next, the boxes were shifted to the Dividing Team, who went through every folder and assigned each item to one of three workflows: the bound manuscripts workflow, where the items were captured on a v-cradle capture device; the fragile papers workflow, captured using an overhead camera; and the plain paper workflow, captured on a form-feed scanner. The boxes then moved across to the Inventory Team, who recorded on a spreadsheet, for each item, its place in the archive, the number of pages that needed to be captured, and the workflow it had been assigned to. In this way, we built up an inventory against which we could check at each subsequent stage of the process. From the Inventory Team, the boxes moved to the Capture Team and the relevant workflow, where the items assigned to that particular workflow were captured and returned to the box before the box moved on to the next workflow. Mostly the boxes were returned to their place in the archival storage rooms at NAHECS in between their capture at each workflow station. When all items in a box had been captured, the items were “de-divided” from the three workflows and recompiled in the right order in their folders and within the box, and the box was returned to the archival store.
To ensure that the digital archive reflected the physical archive in its structure, in Phases 1, 2 and 3 back in 2011 and 2012, we had developed a system of digital folders that could represent the arrangement of the physical archive. So when pages were captured using one of the capture devices, they were saved into this folder structure on an external hard drive. These hard drives were then sent up to our head office in Pietermaritzburg (we ended up with close to 70 hard drives of 1, 2, 3 or 4 TB in size that were rotated back and forth between Alice and Pietermaritzburg). There the Processing Team went to work ensuring that each and every digital file was up to standard, and was cropped and colour corrected in line with colour targets that were captured with each batch. Files that were rejected were recorded and the information sent back to the Digitisation Team for recapture. Maintaining the same folder structure, these processed files were then saved out from RAW to TIFF format, at which point the Quality Control Team checked each file and checked the folder path of each file against the inventory of each collection that had been compiled by the Inventory Team.
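The check of folder paths against the inventory can be pictured with a short sketch. The Python below is illustrative only and is not the team's actual tooling: the function name `check_against_inventory`, the inventory layout (item folder path mapped to expected page count), the one-file-per-page assumption and the example paths are all hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath


def check_against_inventory(inventory, file_paths):
    """Compare captured files with an inventory.

    inventory  -- dict mapping an item's folder path to its expected page count
    file_paths -- iterable of captured file paths, one file per page (assumed)
    Returns a dict of item path -> (expected, found) for every mismatch.
    """
    # Count how many files landed in each item folder.
    found = Counter(str(PurePosixPath(p).parent) for p in file_paths)
    mismatches = {}
    # Look at every folder named in either the inventory or the captured files.
    for item in sorted(set(inventory) | set(found)):
        expected = inventory.get(item, 0)
        actual = found.get(item, 0)
        if expected != actual:
            mismatches[item] = (expected, actual)
    return mismatches


# Hypothetical example data: subcollection/series/box/folder/item
inventory = {
    "subcollection_a/series_1/box_001/folder_01/item_0001": 2,
    "subcollection_a/series_1/box_001/folder_01/item_0002": 1,
}
files = [
    "subcollection_a/series_1/box_001/folder_01/item_0001/page_001.tif",
    "subcollection_a/series_1/box_001/folder_01/item_0001/page_002.tif",
    "subcollection_a/series_1/box_001/folder_02/item_0003/page_001.tif",
]
result = check_against_inventory(inventory, files)
for item, (expected, actual) in result.items():
    print(f"{item}: expected {expected} pages, found {actual}")
```

In this made-up example, one item is flagged because no pages were found for it, and another because it appears in the files but not in the inventory. Either case would be the kind of discrepancy sent back for recapture or correction.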
Currently, the Quality Control Team is working long hours, sub-collection by sub-collection, to get the entire collection checked, compiled and submitted to the MEMAT Digital Vault. From there the files pass to the domain of the IT Team, who ingest them into the Digital Vault. As part of that, they are processed to a JPEG 2000 format that meets specific archival standards for long-term preservation. Once each page of a manuscript is ingested into the Vault, it is run through an Optical Character Recognition (OCR) engine to make it searchable, and all pages are gathered into a PDF/A. Then the manuscript is made available on the web interface of the ANC Archives Research Website.
The work of aligning the folder path of the digital file with the structure of the physical archive and the record of that structure in the inventory spreadsheet has also been supplemented by the team going beyond the call of duty to update the ANC Archive finding aids, so that all four match each other.
The one aspect of the project that will be continuing in the coming months is the capturing of metadata, particularly against photographic images. That involves both a local team capturing information off the back of photographs and an experienced remote team that takes that information and fills out various metadata fields.
The completed material is already in the process of being ingested into the Digital Vault, passing through the OCR process and starting to show online. One step of that process takes 5 seconds per file, and it is the current bottleneck. The amount of data is so great that at that rate, running 24 hours a day non-stop, it will take between 4 and 5 months for all the material to appear online!
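The arithmetic behind that estimate can be checked roughly. The sketch below uses the approximate totals given earlier in this article and assumes one file per page, image or object; the real file count may be somewhat higher, so treat the result as a lower bound rather than a precise figure.

```python
SECONDS_PER_FILE = 5             # bottleneck step, from the article
SECONDS_PER_DAY = 24 * 60 * 60   # running non-stop

# Approximate counts from the article; one file per page is an assumption.
pages = 2_000_000
photographs = 82_000
museum_objects = 3_000


def days_to_process(file_count, seconds_per_file=SECONDS_PER_FILE):
    """Days needed to push file_count files through the bottleneck step."""
    return file_count * seconds_per_file / SECONDS_PER_DAY


total_files = pages + photographs + museum_objects
days = days_to_process(total_files)
months = days / 30  # a month is taken as 30 days for simplicity
print(f"{total_files:,} files: about {days:.0f} days ({months:.1f} months)")
```

With these round numbers the estimate comes to roughly 120 days, or about four months, which sits at the lower end of the range quoted above.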