Digital Preservation

I’d like to focus on the challenge posed by the constantly changing hardware and programming languages used in digital preservation. This is something I think a lot about for my digital archive project. It’s hard to sink a lot of time and effort into building a digital collection on a WordPress site, when I know nothing about coding, content management systems like WordPress, servers, hosting services, etc. But I’m moving forward, taking precautions as best I can with the expectation that I’m going to need to start over from scratch several times. Using WordPress seems like a safe bet now, for a digital novice like myself. But who knows what will happen five years down the line. 

Without knowing what kinds of technological changes I should anticipate, my strategy has been to store my collection offline and on the Internet Archive, and then link each item to my website. Most of the materials that I want to include in my collection were born digital, like news articles and Amnesty International reports. So they can easily be moved from one page/site to another, in case I need to start over and build a new website from scratch. I’ve created a library with the Internet Archive where I’m collecting permalinks and uploading videos, PDFs, and images. I can then link the individual items from my library to my website. And if something happens with my website (if the hosting service goes bankrupt, or if WordPress jacks up the subscription fees), the really crucial material should survive, and I should be able to link it to a new website.

I also keep the same library offline using Zotero. So even if something happens with my IA library, the permalinks should still work and I should be able to upload the library again. I believe that Zotero and the IA have similar metadata fields, so I won’t be in danger of losing that either. 
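One way to check that an offline copy is still intact is to record a checksum for every file and compare the list again later. The following is only a minimal sketch of that idea: the function names are hypothetical, and it is not a feature of Zotero or the Internet Archive.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List files from the old manifest that are now missing or different."""
    return sorted(name for name, checksum in old.items() if new.get(name) != checksum)
```

Building a manifest when the files are first saved, and again after copying them to a new drive, would show which items (if any) need to be restored from the other copy.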

My videos pose a much bigger challenge, however. I currently use the H.264 codec for my videos, which I believe is what YouTube uses. But I’ve read that H.265 (which is better for reasons I don’t fully understand) could soon become the new standard, and YouTube might adopt that standard. It is already getting very expensive for me to buy external hard drives large enough to store all of my videos, including the originals, the Adobe Premiere files, and the edited finals converted into H.264. I don’t know if I can realistically continue to store all these files from every interview I do, so that I can convert them to new codecs as they come out. Alternatively, I don’t know if I can just leave them up on YouTube either. How long will it be before older video formats are no longer supported?
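For reference, re-encoding a file to H.265 can be done with the free ffmpeg tool. This sketch assumes ffmpeg is installed with the libx265 encoder; the function names and file names are hypothetical. Converting an already compressed H.264 file loses a little quality each time, so this is an argument for keeping at least one high-quality master of each interview.

```python
import subprocess
from pathlib import Path


def h265_command(source: Path, target: Path, crf: int = 28) -> list[str]:
    """Build an ffmpeg command that re-encodes video to H.265 and copies audio."""
    return [
        "ffmpeg",
        "-i", str(source),
        "-c:v", "libx265",  # H.265/HEVC software encoder
        "-crf", str(crf),   # quality setting: lower means better quality, larger file
        "-c:a", "copy",     # leave the audio stream untouched
        str(target),
    ]


def convert(source: Path, target: Path) -> None:
    """Run the conversion; requires ffmpeg with libx265 on the PATH."""
    subprocess.run(h265_command(source, target), check=True)
```

Calling `convert(Path("interview.mp4"), Path("interview_h265.mp4"))` would write a new file and leave the original alone.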

Of course, the biggest challenge is my own tech-illiteracy and my inability to understand all the relevant factors and anticipate new trends in technology. 

Born Digital

For this week’s assignment, I’m trying to think about the authenticity, accessibility, and reliability of the materials in these digital archives, and the archives themselves. As I understand the challenges that go along with digital collections, we’re mostly concerned with the reliability of the hardware (which could fail and lose data), the accessibility of the materials in the collection (either because there is too much information, or because the proper technology isn’t available to everyone), and the authenticity of the materials themselves, which often aren’t traceable or whose origins aren’t transparent. These are the concerns that stuck with me from this week’s readings.

That said, I don’t share the caution that Cohen and Rosenzweig seem to have towards digital history. I don’t think the problems that come with web-based preservation and research are fundamentally different from those of older ways of doing history.

For example, my first observation, after looking at the 9/11 digital archive, was that the inclusion of emails and anonymous submissions gave a forum to the most casual of impressions, memories, and reflections. The first email I looked at described a dream that the person had and his interpretation of it and how it related to 9/11. A good amount of these materials might be more useful to literary scholars who want to say something about the cultural imagination than to historians. In this case, digital tools made it very easy to collect lots of material, even stuff that wasn’t exactly choice historical documentation. But I don’t think this is a fundamentally new problem with archives.

My second observation, in regard to the Hurricanes Katrina and Rita archive, is that the volume of information available can make it difficult to browse materials or find what you’re looking for, especially if the materials are not organized by an intuitive schema. I found the Katrina and Rita archive very hard to use. With 8,462 items organized into four very big categories (“stories,” “images,” “oral histories,” and “video”), it’s not easy to browse or search for something specific. Here digital tools allowed the archivists to source an incredible amount of material, but they failed to build an intuitive system that would allow visitors to easily browse, search, and retrieve items.

The April 16 Archive suffers from both of the problems I discussed above. It appears to be the most casual collection of materials ever, sorted only by several hundred tags. While this website is an extreme case, my impression is that these challenges are not specific to digital collections. As we saw at the UMass Special Collections last week, they collect a great many things. But they have an excellent system for dealing with the volume of the material and making it easily searchable. To me, this is the most important feature of a digital collection: that the information is organized in an intuitive manner and made easily searchable. Doing this well will mitigate the challenges discussed above.

Born Digital

I have a very practical interest in digital archives and digitization projects. Besides my People’s History of Fallujah digital archive, which is a pretty straightforward collection of materials that were already on the web (I’m just collecting them in a single location), I also want to start a digitization project for the Carlo Danio Library in Grumento Nova, where I’m doing my other project, Lingua e Memoria Grumentina.

If you followed the link, then you’ve seen that this library is a hidden treasure. The translation on the webpage is a bit difficult to follow, but you get the picture. It’s a well preserved library filled with books from the 17th century on, some in Latin, some in Vulgar Latin, some even in Italo-Romance languages other than Italian (like Grumentino). Scholars of the classics and Italian history should be traveling from all over the world to visit this collection, but no one knows it exists. And it’s not easy to get to Grumento Nova (you need to figure out the chaotic southern Italian bus system to get there).

This is the paradox of southern Italy: it’s rich in natural resources and cultural heritage, and yet poor. Grumento Nova is in some ways better off, and in some ways worse off, than the rest of the south. The single largest problem that Grumento Nova is facing today is pollution from the Eni COVA oil extraction plant within its city limits. It’s the largest oil extraction site in all of Italy, and it’s ruining the entire Agri Valley area.

This library is one potential source of income for Grumento Nova. This, combined with a tourism industry, could help make Grumento less economically dependent on its petroleum resources. And I think the best way to utilize this library and start building an alternative economy in Grumento will be to digitize this collection and charge subscription fees to libraries around the world. An alternative approach would be to better advertise the contents of the library and hope that scholars come to use it. But my hunch is that more money could be made through selling online subscriptions. I’m not sure exactly how to predict how much more money could come in through online subscriptions. This is really just a hunch. But it seems like common sense to me.

At this point I’m unsure of the benefits of using some sort of mark-up language to digitize the books. I think it will be hard enough to get a grant to bring a scanner to Grumento, let alone find someone who will put the labor into doing the mark-up (I’m definitely not doing it). I suspect that using a scanner with OCR technology will be sufficient for the vast majority of the books. However, I believe there might be a few handwritten manuscripts in the collection, too. For those, my intuition is that simple image scans will be sufficient.

Second, I’d like to discuss the Atlante Linguistico della Sicilia (the Linguistic Atlas of Sicily). This is a very interesting kind of digitization project because language, in many ways, is intangible. It’s a system of signs based on social conventions. You can begin to document a language and create a digital record by recording analogue sound waves as audio files. These files can then be visualized by transcribing the sounds using a phonetic alphabet. But there are difficulties even with this. The perception of linguistic sounds is not straightforward; the languages we speak can shape the way we perceive sounds from another language. So another way that audio files can be visualized is with a spectrogram, which shows how the amplitude at each frequency changes over time. Spectrograms can help us see what we can’t perceive by ear.
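To make the idea of a spectrogram concrete, here is a toy sketch. It cuts a signal into short frames and measures how strong each frequency is within each frame. The function names, the test tones, and the sample rate are all assumptions chosen for illustration; real phonetics software uses overlapping, windowed frames and a much faster transform.

```python
import cmath
import math


def tone(freq_hz: float, n_samples: int, rate: int) -> list[float]:
    """Generate n_samples of a pure sine tone at the given sample rate."""
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n_samples)]


def spectrogram(samples: list[float], frame_size: int) -> list[list[float]]:
    """Return one row per non-overlapping frame; each row holds the
    magnitude of frequency bins 0 .. frame_size // 2 (a plain DFT)."""
    rows = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        row = []
        for k in range(frame_size // 2 + 1):
            total = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(frame)
            )
            row.append(abs(total))
        rows.append(row)
    return rows


def loudest_frequency(row: list[float], rate: int, frame_size: int) -> float:
    """Convert the strongest bin in a row back to a frequency in Hz."""
    k = max(range(len(row)), key=row.__getitem__)
    return k * rate / frame_size


RATE, FRAME = 800, 64  # hypothetical values; each bin is 12.5 Hz wide
signal = tone(100, 128, RATE) + tone(200, 128, RATE)
peaks = [loudest_frequency(r, RATE, FRAME) for r in spectrogram(signal, FRAME)]
```

Each row is one slice of time and each column is one frequency band, which is what a spectrogram image draws as a grid of light and dark cells.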

And yet this kind of analysis still only goes so far. Transcription and spectrograms can only really tell you about an individual speaker, not the language itself, or the sociolinguistic context in which this language, other languages, and varieties of them exist together. That’s what I love about the Atlante Linguistico and its “geolinguistic” approach. It uses the traditional tools of language documentation and adds a geospatial dimension to them. The “carta sonora” (sound map) tab is an interesting feature of this site, because it uses a mapping program to show how the same word is pronounced differently in various Sicilian locales, using both transcription and audio files. There’s lots of analysis to go with this on the site’s other pages, which paints a complex picture of the linguistic situation on the island, in which hundreds of distinct (though mutually intelligible) languages exist together in a sociolinguistic environment. I think it’s a brilliant way of taking something so complicated, ephemeral, and intangible as spoken language and preserving it and making it available with digital tools.

Lastly, I’d like to discuss a personal website, managed by Dr. Phil Taylor, on the University of Leeds website. It’s called Phil Taylor’s Papers, and it’s a collection of articles, essays, doctrinal writings, and reports on the topic of information warfare and strategic communications. I wish more people were aware of how governments use information and the news media to advance their policy goals. So I’m glad to see someone collecting this information under one roof. However, it’s a very casual effort at archiving. It almost seems like Taylor just wanted all these resources in one place for the sake of organizing his own research materials. So I thought it might be worth discussing what went wrong here, when there was so much potential for this to be such a useful public resource.

First, Taylor just cut and pasted the materials he liked onto webpages and organized the many, many links under menu tabs. It doesn’t seem like he logged any metadata at all, so titles and even keywords aren’t searchable. Also, there are no permalinks provided to the original sources, and none of the hyperlinks in the original texts were preserved. And the themes according to which the materials are organized are really broad. One needs to understand the difference between PSYOP and Strategic Communications to understand what they’re looking at.
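To illustrate what even minimal metadata would make possible, here is a small sketch of a catalogue record and a keyword search. The field names are hypothetical, loosely modeled on common Dublin Core elements, and the items in the usage example are placeholders, not entries from Taylor’s site.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One catalogued document (field names are illustrative only)."""
    title: str
    creator: str
    date: str
    source_url: str  # permalink back to the original source
    keywords: list[str] = field(default_factory=list)


def search(items: list[Item], term: str) -> list[Item]:
    """Return items whose title or keywords contain the term, ignoring case."""
    needle = term.casefold()
    return [
        item for item in items
        if needle in item.title.casefold()
        or any(needle in kw.casefold() for kw in item.keywords)
    ]


# Placeholder records for demonstration only.
catalogue = [
    Item("Example report A", "Unknown", "2004", "https://example.org/a", ["PSYOP"]),
    Item("Example essay B", "Unknown", "2006", "https://example.org/b",
         ["strategic communications"]),
]
matches = search(catalogue, "psyop")
```

With records like these, a visitor could find every item tagged with a term without already knowing which menu tab it was filed under.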

There are over 1,000 items in this collection, and it would have been an enormously time-consuming effort for this one man to catalogue each item, provide metadata and hyperlinks, and create an intuitive schema for organizing all the materials. As it is, it’s a great resource for researchers familiar with the topic, but not much more.