Briefings from the Library of Congress Designing Digital Storage Architectures Meeting 2018 (DSA)
You are sitting at your desk pushing files to the cloud from a recent digitization project. As the files are uploaded, you know the system is working its magic. You know quite a bit about what happens behind the scenes: you work for a large archive or a department with digital preservation as one of its goals, so this is what you do. You know there are internal servers and storage appliances (ok, maybe IT knows more about the actual goings-on of storage management), files are stored on tape (one backup), and they are also pushed to the cloud (as an additional backup; in the best-case scenario there’s some sort of geographic replication in the whole setup). And as you sit there watching the progress bar move, you wonder what the future holds for the digital assets you manage and whether you’re making the right decisions. If you found yourself nodding, then you will enjoy this post!
Experts in the field of digital storage meet annually to talk about these issues, advances in technology, and their impact on digital preservation at the Digital Storage Architectures meeting (DSA), organized by the Library of Congress. AVP had the opportunity to attend DSA this year, and here’s a brief recap with some of our main takeaways.
For those of you who are not familiar with this event, the DSA meeting gathers technical experts and experienced practitioners who are interested in digital preservation to discuss storage infrastructures, practical approaches, and current and new technologies, and to map out the landscape for the future of the discipline. It is a unique opportunity for service providers, vendors and consultants to learn about the needs and requirements of organizations dealing with the preservation of digital assets, and a chance for these institutions to hear about the latest developments in digital storage technologies and market trends. This year’s DSA, which took place on September 18-19, 2018, included case studies from organizations, conversations about trends in the storage media market, current research on digital storage, and alternative storage media.
From the community
The morning of the first day started with an overview of the Library of Congress’ current practices, projects and challenges. Managing over 9 PB of storage coming from both digitization and born-digital content, the Library recently went through a huge migration process from 4 datacenters to 2, which also included migration of tape drives and development of abstraction layers for overall faster data movement. Although they have been giving some consideration to cloud services for backup copies, internal requirements and costs (including retrieval and ongoing management) are the main reasons why this has not occurred.
Over the two days, other organizations presented on their current projects and/or infrastructures. Sharing their challenges and successes, these presentations opened up interesting conversations about practical approaches and offered colleagues down-to-earth thoughts about implementations in different contexts. Questions about retrieval, search and access came up from these presentations; here are just a few notes that you might find interesting:
Sally Vermaaten (Gates Archive) talked about their current redesign of digital infrastructure, which started with an internal assessment, a comparison of common practices in other organizations, an analysis of cloud vendors, and an audit of their staging storage. Their experience seems relevant, as it could resonate with what other similar-sized archives are dealing with. Their careful assessment might be an example of a good place to start.
Karen Cariani and Rebecca Fraimow (WGBH) presented their approaches to storage environments from the perspective of an archive within a production environment. WGBH has established a model in which description during production, before files are transferred to the archives, is required in order to bill projects, which makes metadata collection more efficient (in other words, it actually happens!!). They are currently striving to create a more integrated environment for both archives and content management, which also seems to be somewhat of an issue for Gates, as automated metadata is generated by separate systems.
From the perspective of a consultant, Ben Fino-Radin (Small Data Industries) presented the results of a survey about digital storage practices in organizations that own media art collections. This study revealed huge gaps and wide differences in the way digital storage and digital preservation are achieved by these organizations, which made evident their growing need for support in this area. Beyond what art organizations are doing (or not doing) to store and protect their assets, the picture that Small Data Industries showed may look familiar to many cultural and/or small organizations that are suddenly faced with the long-term preservation of digital assets. As Fino-Radin pointed out: attention storage vendors, there’s a huge group of underserved customers with lots of valuable data to store.
Two other presentations were very interesting as they dealt with large, scalable approaches to data retrieval and storage. Leslie Johnston (NARA) introduced the new Electronic Records Archive systems (ERA), developed as a cloud-based suite of tools that allows more than 200 submitting government agencies to process and deliver electronic records. Also, Brian Wheeler (Indiana University) talked about their current film digitization and preservation workflows and the optimization of processes via HPSS (High Performance Storage System).
Storage media manufacturers
Representatives from storage media manufacturers were also present at this meeting. Here is a very quick overview of some of their presentations.
Henry Newman (Seagate) talked about the benefits of quantum computing for processes such as checksum calculation and verification, optimization and encryption to achieve improved security.
Robert Fontana (IBM Research) painted the landscape of the storage media market - LTO (Linear Tape-Open) data tape, HDD (Hard Disk Drive) and NAND (i.e., solid state or flash) - based on areal density, revenue and cost per GB. Although HDD continues to dominate the market in terms of the amount of data stored, NAND keeps growing as an important source of revenue, mostly based on its usage in consumer devices such as smartphones. Generally speaking, value per unit of data stored is dropping at about 20-25% per year, and manufactured storage is growing linearly for HDD and tape, whereas for NAND it is growing exponentially. Additionally, tape manufacturing is limited by the number of companies producing it (only 2: Sony and Fuji). Cost per GB continues going down for HDD and tape, with almost no change for NAND (the decrease was stopped by a market imbalance). From the power consumption perspective, usage is generally being optimized and reduced, but HDD continues to be more expensive than NAND.
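To get a feel for what a 20-25% yearly drop means over time, here is a minimal sketch of the compounding arithmetic. It assumes the figure can be read as a constant yearly percentage decline in cost per GB, which is our reading and not something stated in the presentation; the starting price of 0.02 is purely hypothetical.

```python
import math


def projected_cost(cost_now: float, annual_decline: float, years: int) -> float:
    """Cost per GB after `years`, assuming a constant yearly percentage decline."""
    return cost_now * (1.0 - annual_decline) ** years


def years_to_halve(annual_decline: float) -> float:
    """Years needed for the cost to fall to half under the same assumption."""
    return math.log(0.5) / math.log(1.0 - annual_decline)


if __name__ == "__main__":
    start_cost = 0.02  # hypothetical cost per GB today, for illustration only
    for rate in (0.20, 0.25):
        in_five_years = projected_cost(start_cost, rate, 5)
        print(f"decline {rate:.0%}: {in_five_years:.4f} per GB after 5 years, "
              f"halves every {years_to_halve(rate):.1f} years")
```

Under that assumption, cost halves roughly every 3.1 years at a 20% decline and every 2.4 years at 25%, which is worth keeping in mind when budgeting for the next migration.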
Jon Trantham (Seagate - Industry Review) talked about the introduction of HAMR (Heat Assisted Magnetic Recording) and dual-actuator technologies as an approach to improving data storage capacity and read/write speed for HDD. From their perspective, HDD is the main market because these devices are primarily used in cloud solutions, which are, as we all would expect, a growing market. They envision that with this technology storage capacity will go up to 30-40 TB per unit.
An interesting conversation sprang up after hearing about the advances made by the NAND industry: is there a market shift? Attendees pointed out the importance of writing interfaces in the advance and adoption of these technologies, and commented on the special needs of archives (for example, long-term maintenance costs, the high cost of SSD, reliability and data retention issues with SSD, and the stability of magnetic media compared to SSD).
Xiaodong Che (Western Digital) presented on the development of EAMR (Energy-Assisted Magnetic Recording) to improve read/write latency.
As you may have guessed by now, most research conducted by manufacturers is focused on developing faster and more efficient hardware for use in the cloud, and specifically on HDD.
Service providers also joined the conversation. Kevin Miller (AWS) talked about their development focus for Glacier, which is now on improving ecosystems, not only bit storage. He acknowledged that retrieval speed was an issue in the past and described how they have been improving it. Their vision is now to focus on specific archival needs, with the aim of taking the burden away from users. They have been open to listening to the audience’s requests regarding durability, checksum transparency, confidence and trust, and they want to keep communication channels open in order to understand the community’s needs. A big concern raised by attendees was that small organizations do not necessarily have the right expertise to manage cloud storage technologies themselves, and that the barriers to access and technical understanding are high, an issue that was also evident in the survey results shown by Fino-Radin.
David Friend (Wasabi) presented their cheap, integrated, flexible, transparent, user-friendly cloud storage. This service offers a user interface that allows easy access to and control over assets in the cloud. However, Wasabi currently offers only two different geographical locations for storage.
Pashupati Kumar (Microsoft - Project Pelican) presented the advances made in cold-tier storage rack systems based on HDD and tape, with the purpose of lowering costs by reducing power consumption, increasing drives per rack, disaggregating storage, offering flexible performance and using commodity components.
From the comments, questions and conversations, it is interesting to note that although some organizations have taken the path of cloud storage for at least one of their backups, there’s still some discomfort in the room with cloud services. Transparency in the processes and management of the files is a big concern for archives; it isn’t enough to have 99.99% data durability if providers do not disclose the details of how files are managed and provide at least some basic administrative metadata (some cloud services already offer this feature). Security, as you would imagine, is also a sensitive subject, as most colleagues feel that cloud systems have not yet been put to the test of a huge failure.
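One way archives keep some independent confidence in files held by a third party is to record a checksum before upload and compare it again whenever a copy is retrieved. The following is a minimal sketch of that kind of fixity check; it is not tied to any provider mentioned above, and the function names `sha256_of` and `verify_fixity` are our own, chosen for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large preservation masters fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: Path, expected_hex: str) -> bool:
    """Compare a file's current digest with the one recorded before upload."""
    return sha256_of(path) == expected_hex.strip().lower()
```

The recorded digest would live in your own inventory or metadata system, so the comparison does not depend on what the storage provider chooses to report.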
What’s on the horizon?
At DSA we get to see the present, the near future, and the far future. Research on storage is of course not limited to improvement of current storage media. Looking for faster, denser, more reliable options is a permanent goal of the industry, and the results of that search could take us to places, materials, components, and systems we haven't even imagined. Isn’t it fun to think about all the hundreds of TB you’ll be able to store? Here’s some food for your imagination.
Rob Hummel (Group 47) talked about a more stable, low-cost, environmentally friendly, “visually readable” storage technology: DOTS. Contrary to what you might think, this technology is not new; it was developed and tested by Eastman Kodak. DOTS stores information on a polyester base covered with a metallic alloy that reacts to the heat of a laser. Information can be read with a camera, and the medium can contain digital information, but also text and images if desired.
Peter Kazansky (U. of Southampton/Microsoft) introduced a new method of optical storage on quartz glass which will be able to store about 1TB of data in a volume equivalent to a DVD disc. Project Silica has developed this technology with the cloud in mind and the great advantages of this medium are durability and writing speed.
You have heard about DNA storage, for sure. The main benefits of this technique are its high density and stability; however, its cost and writing speed are still issues to tackle. Devin Leake (Catalog) spoke about the advances they have made in this area; they claimed the possibility of storing up to 10 exabits/cm³ of data, although storing 200 MB currently costs about USD 1 million at a write speed of about 3.7 kb/s. Karin Strauss (University of Washington - Microsoft) showed advances in the creation of a prototype to read back information using microfluidics in order to build “DNA Libraries”. Search and retrieval could take about 1.5 hours and, although the prototype fully supports read-write, automation of the processes is still in development. You can also be part of this project! The #MemoriesInDNA project is looking to collect 10,000 images to store on DNA.
So, with DNA, we are not quite there yet. One interesting question that came up was about data interoperability with DNA: if many organizations are working on these prototypes, how viable is interoperability? According to Devin and Karin, this is an encoder/decoder issue; as long as the system knows how the data was stored, it will be able to read it. So, we can only hope that these encoding schemes remain open, for the benefit of the users.
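To put the DNA write speed quoted above in perspective, here is a rough back-of-the-envelope sketch of how long writing 200 MB would take at 3.7 kb/s. The notes do not say whether "kb" means kilobits or kilobytes, so both readings are shown; decimal units (1 MB = 10^6 bytes, 1 k = 10^3) are also an assumption.

```python
def write_time_hours(payload_bytes: float, rate_per_second: float,
                     bits_per_rate_unit: int) -> float:
    """Hours needed to write `payload_bytes` at the given rate.

    `bits_per_rate_unit` is 1 if the rate is in bits/s, 8 if it is in bytes/s.
    """
    payload_bits = payload_bytes * 8
    rate_bits_per_second = rate_per_second * bits_per_rate_unit
    return payload_bits / rate_bits_per_second / 3600


if __name__ == "__main__":
    payload = 200 * 10**6  # 200 MB, decimal megabytes assumed
    rate = 3.7 * 10**3     # 3.7 k(units) per second, as quoted
    print(f"if kb = kilobits:  {write_time_hours(payload, rate, 1):.1f} hours")
    print(f"if kb = kilobytes: {write_time_hours(payload, rate, 8):.1f} hours")
```

Read as kilobits, that is roughly 120 hours (about five days) for 200 MB; read as kilobytes, about 15 hours. Either way, it is a long wait compared with the progress bar you were watching at the start of this post.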
Back to planet earth - a useful resource
The meeting concluded with a review of the Digital Preservation Storage Criteria, a list of topics and relevant considerations that can help any organization in the initial planning of digital storage for preservation. It was highlighted that these are by no means a set of requirements, but just a reference list to help in the process of thinking about digital storage.
Now that you are back from this reverie, you can go back and continue uploading files to the cloud. I hope I was able to answer some of your questions and also plant some more thoughts and questions in your mind!