The Newton Project's Collaborative Projects
Cultural Heritage Language Technologies (CHLT) is an international collaboration which will address important questions, and offer practical solutions, regarding the creation of International Digital Library Technology (IDLT). Its work includes: (i) developing an infrastructure for IDLT; (ii) building powerful IT tools for end users, designed to respond to the ways that different individuals and researchers use the system; (iii) integrating advanced tools for working with and visualising digital documents; (iv) establishing a framework for sharing metadata, data, and tools across multiple digital libraries; and (v) providing a stable, distributed archive that allows long-term preservation of, and easy access to, digital data.
The objectives of this collaboration are to integrate computational linguistic tools and techniques within an infrastructure that will enable distribution into an international digital library environment. We will develop pioneering International Digital Library Technology and create a range of IT tools and applications for use within digital collections, adapted to aid the research of humanities users. CHLT will provide generic tools for multilingual information retrieval, concept identification, visualisation and display, and vocabulary profile analysis, together with a syntactic parsing toolbox. Key objectives are interoperability and the sharing of metadata in an open-source environment, so that data, metadata, and tools may be shared among partners and affiliated digital libraries in order to test prototypes, refine programs, and reflect user studies at the research and development stage. The infrastructure will allow partner libraries to generate hypertexts that link similar resources in different collections, unify search and retrieval facilities, and share resource-intensive programs. We will also create and integrate new corpora as testbeds for these applications, many of which will be linked to pre-existing digital facsimile images of original manuscripts. The data produced by the project will conform to the Open Archives Initiative (OAI) metadata-sharing standard and will be made accessible through OAI services. The project also aims to extend the infrastructure of the OAI metadata standard to the granularity of individual words within texts. We will provide long-term preservation in data storage facilities at the Oxford Text Archive to the highest standards available.
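As an illustration of what access "through OAI services" can look like in practice, the following is a minimal sketch in Python of a harvester for the OAI Protocol for Metadata Harvesting (OAI-PMH). It builds a `ListRecords` request URL and extracts identifiers and Dublin Core titles from a response. The base URL, the set name, the record identifier, and the sample response are hypothetical placeholders, not actual CHLT endpoints or data, and the sketch does not attempt the word-level granularity that the project aims for.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and by Dublin Core.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Return an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for record in root.iter("{%s}record" % OAI_NS):
        identifier = record.findtext(
            "{%s}header/{%s}identifier" % (OAI_NS, OAI_NS)
        )
        title = record.findtext(".//{%s}title" % DC_NS)
        records.append((identifier, title))
    return records


# Hypothetical response, shaped like an OAI-PMH ListRecords reply.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:text-0001</identifier>
      </header>
      <metadata>
        <oai_dc:dc
            xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample Latin text</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

if __name__ == "__main__":
    # "https://example.org/oai" and the set name "latin" are placeholders.
    print(build_list_records_url("https://example.org/oai", set_spec="latin"))
    print(parse_records(SAMPLE_RESPONSE))
```

A real harvester would fetch the URL over HTTP and follow the protocol's resumption tokens for large result sets; both are omitted here to keep the sketch self-contained.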
Our consortium involves eight participants (four US and four European) who are committed to taking a leading role in the development of international digital library technology (IDLT). Each participant brings a digital collection to the project; linked together, these collections will form a miniature international digital library that will act as a testbed for the creation of an infrastructure model and of IT tools and applications. The workpackages are designed to contribute to the overall development of an IDLT Infrastructure Model, which will be our first deliverable and the foundation for all further work. A report justifying this model will be produced to show the logic of our decision-making procedures. Once the infrastructure model is in place and operational at each of the partner sites, individual workpackages can proceed in tandem within this single working environment. The strategy and management policies of this project work consistently toward integration of technology and resources.
Our workpackages are organised around a series of advanced digital library applications that will be developed by different workgroups in our consortium and integrated into a single Digital Library System (DLS). The methodology of tool creation relies upon the development of a robust indexing architecture that scales across systems in multiple languages. This will allow our programming teams to focus upon developing tools instead of confronting problems relating to disparate tagging systems. It will also allow us to apply these tools to every text in the DLS without custom programming for every new set of texts. The aim is to retain the independence of each digital collection while at the same time offering it a place in the overall system.
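The idea of a shared index that spares tools from collection-specific tagging can be sketched as follows. This is a simplified illustration in Python, not the project's actual architecture: the `TokenIndex` class, its method names, and the sample collections are all hypothetical. Each collection contributes plain tokens to one common index, and every token occurrence is addressed by a (collection, document, word position) triple, so a tool written once against the index works on any text added later.

```python
from collections import defaultdict
import unicodedata


def normalise(token):
    """Normalise a token so lookups behave the same across languages."""
    return unicodedata.normalize("NFC", token).casefold()


class TokenIndex:
    """Hypothetical common index shared by all collections."""

    def __init__(self):
        # token -> list of (collection, doc_id, word position)
        self._postings = defaultdict(list)

    def add_text(self, collection, doc_id, text):
        """Index a text by whitespace-separated words."""
        for position, raw in enumerate(text.split()):
            token = normalise(raw.strip(".,;:!?()[]\"'"))
            if token:
                self._postings[token].append((collection, doc_id, position))

    def lookup(self, word):
        """Return every indexed occurrence of a word."""
        return list(self._postings.get(normalise(word), []))


if __name__ == "__main__":
    index = TokenIndex()
    # Collection names and document ids below are placeholders.
    index.add_text("latin", "doc1", "Hypotheses non fingo.")
    index.add_text("norse", "saga1", "Maðr hét Ketill")
    print(index.lookup("FINGO"))
    print(index.lookup("maðr"))
```

Whitespace tokenisation and case folding are stand-ins for the language-specific processing a real system would need; the point of the sketch is only that the lookup code never inspects how a given collection was originally tagged.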
Three workpackages involve the creation of corpora as testbeds for new applications. Although many of our collaborators bring existing corpora to the project, the infrastructure that we are developing will allow us to integrate other texts easily and at a very low cost per megabyte. Ultimately the corpora that we create and integrate into our system will be a substantial contribution to European Cultural Heritage. At the end of three years, we will have added to our system 300 MB of Early Modern Latin texts, including many of Isaac Newton's papers, and 12 MB of Old Norse literature.
At the end of Year 1 we will have produced an infrastructure model that will be distributed and operational in each of the partner sites, as well as tools and applications built around a common architecture. The milestones of Year 2 will concentrate on the fruits of sharing data, metadata, and tools within the infrastructure model to create a Digital Library System (DLS). In Year 3 we will be in a position to test our IDLT and DLS in end-user studies between US and EU partners and to refine our programs to reflect those studies. The final product will be a set of thematically coherent digital library collections in a single Digital Library System, based within a Digital Library Infrastructure Model that employs advanced IT tools and applications.