Early Saturday morning, I attended a four-hour panel discussion on linked data (LD) and next-generation catalogs. I wanted to gain a better understanding of what exactly linked data is, since the term is batted about frequently in the literature. I will try to explain it to the best of my ability, but I still have much to learn. So here goes.
A uniform resource identifier (URI) is a string of characters used to name a “thing”. Specifically, HTTP URIs should be used so that people are able to look up those names. Looking up a URI should return useful information, as well as links to other URIs, so that individuals can discover even more useful things.
Per Corey Harper, NYU’s Metadata Services Librarian, we need to start thinking about metadata as a graph rather than as strings, which is how most of our data is currently stored. Typed “things” are named by URIs, and relationships between “things” are also built on URIs. LD allows users to move back and forth between information sources, with the focus on identification rather than description.
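To make the "graph instead of strings" idea concrete, here is a minimal sketch in plain Python. It assumes nothing beyond the description above: every statement is a (subject, predicate, object) triple, and things and relationships are both named by URIs. All of the `example.org` URIs and the helper name `objects` are hypothetical placeholders, not identifiers from any real LD service.

```python
# A tiny in-memory graph of triples: (subject, predicate, object).
# Every example.org URI below is a made-up placeholder for illustration.
EX = "http://example.org/"

triples = {
    (EX + "person/mark-twain", EX + "vocab/type", EX + "vocab/Person"),
    (EX + "person/mark-twain", EX + "vocab/name", "Mark Twain"),
    (EX + "work/huckleberry-finn", EX + "vocab/type", EX + "vocab/Work"),
    (EX + "work/huckleberry-finn", EX + "vocab/creator", EX + "person/mark-twain"),
}


def objects(subject, predicate):
    """Return every object linked from `subject` by `predicate`, sorted."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


# Follow a link from the work to its creator, then look up the creator's name.
creator_uri = objects(EX + "work/huckleberry-finn", EX + "vocab/creator")[0]
creator_name = objects(creator_uri, EX + "vocab/name")[0]
print(creator_uri, creator_name)
```

The point of the sketch is that the work record does not store the string "Mark Twain". It stores a link to the URI that identifies the person, and the name is one more statement about that URI.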
Mr. Harper provided several examples of LD sites available on the Web, to some of which individuals and institutions may contribute data. Google-owned Freebase is a community-curated collection of RDF data about some 21 million “things”. Freebase provides a link to Google Refine, which allows individuals to load their metadata, clean it up, and then link it back to Freebase. Thinkbase displays the contents of Freebase as a mind map for exploring millions of interconnected topics.
Phil Schreur, who is the head of the Metadata Department for Stanford University libraries, talked about shattering the catalog, freeing the data, and linking the pieces. Today’s library catalogs are experiencing increased stressors such as:
- Pressure to be inclusive–the “more is better” approach, as seen with Google
- Loss of cataloging–the acceptance and use of vendor bulk records; by genericizing our catalogs, we are weakening our ties to our user/collection community
- Variations in metadata quality
- Supplementary data–should the catalog just be an endless supply of links?
- Bibliographic records–catalogers spend lots of time tinkering with them
- Need for a relational database for discovery–catalogs are domain silos that are unlinked to anything else
- Missing or hidden metadata–universities are data creation powerhouses (e.g. reading lists, course descriptions, student research/data sets, faculty collaborations/lectures); these are often left out of the catalog, and it would be costly to include them
Linked open data is the solution, he argued, for several reasons:
- It puts information on the Web and eliminates Google as our users’ first choice
- Expands discoverability
- Opens opportunities for creative innovation
- Continuous improvement of data
- Creates a store of machine-actionable data–semantic meaning in a MARC record is unintelligible to machines
- Breaks down silos
- Provides direct access to data based on statements and not on records–less maintenance of catalog records
- Frees ourselves from a parochial metadata model to a more universal one
Schreur proceeded to discuss four paradigm shifts involving data.
- Data is something that is shared and is built upon, not commodified. Move to open data, not restricted records.
- Move from bibliographic records to statements linked by RDF. One can reach into documents at chapter and document level.
- Capture data at point of creation. The model of creating individual bibliographic records cannot stand. New means of automated data will need to be developed.
- Manage triplestores rather than adding more records to the catalog. The amount of data is overwhelming. Applications will need to be developed to bring in data.
He closed by stating that the notion of “authoritative” is going to get turned on its head. The Web is already doing that. Sometimes Joe Blow knows more than the national library. This may prove difficult for librarians and catalogers to accept since our work has revolved around authoritative sources and data.
OCLC’s Ted Fons spoke about WorldCat.org’s June 20, 2012 addition of schema.org descriptive markup to its database. Schema.org is a collaboration between Bing, Google, Yahoo, and the Russian search engine Yandex, and is an agreed-upon ontology for harvesting structured data from the Web. The reasons behind doing this include:
- Make library data appear more relevant in search engine results
- Gain a position of authority in data modeling in a post-MARC era
- Promote internal efficiency and new services
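As a rough illustration of what schema.org descriptive markup looks like, the sketch below builds a description of a book using the schema.org `Book` and `Person` types and their `name` and `author` properties. It is serialized as JSON-LD, which is one of the syntaxes schema.org supports; this is an assumption made for brevity and not necessarily the syntax WorldCat.org chose. The function name `book_markup` is hypothetical.

```python
import json


def book_markup(title, author_name):
    """Build a schema.org description of a book as a JSON-LD string.

    Hypothetical helper for illustration; not WorldCat.org's implementation.
    """
    description = {
        "@context": "https://schema.org",
        "@type": "Book",
        "name": title,
        "author": {"@type": "Person", "name": author_name},
    }
    return json.dumps(description, indent=2)


markup = book_markup("Adventures of Huckleberry Finn", "Mark Twain")
print(markup)
```

A search engine that understands the shared vocabulary can read the type and properties directly instead of guessing at the meaning of text on the page.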
Jennifer Bowen, Chair of the eXtensible Catalog Organization, believes LD can help libraries take on and fulfill new roles in meeting the information needs of our users. Scholars want their research to be findable by others, and they want to connect with others. Libraries are being bypassed not only in favor of Google and the Web; users are also turning to tailored desktop, mobile, and Web apps. Libraries need to push their collections to mobile apps, and LD allows us to do just that. Hands-on experience with LD is needed to understand its potential and to develop LD best practices. We need to create LD for our local resources (e.g. the institutional repository) to showcase special collections. Vendors need to be encouraged to implement LD now! Opportunities for creative innovation in digital scholarship and participation can be fostered by using LD.
A tool that will enable libraries to move from their legacy data to LD is needed. The eXtensible Catalog (XC) is open source software for libraries that provides a discovery system and a set of tools available for download. It provides a platform for risk-free experimentation with metadata transformation and reuse. RDF/XML and RDFa (two ways of serializing RDF) and SPARQL (the RDF query language) were named as three methods for creating metadata in bulk. XC converts MARC data to FRBR entities and enables us to produce more meaningful LD. Reasons to use FRBR for LD include:
- User research shows that users want to see the relationships between resources. Users care about relationships.
- Allows scholars to create LD statements as part of the scholarly process. Vocabularies are created and managed. Scholars’ works become more discoverable.
- Augments metadata.
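For readers new to FRBR, the relationships in question can be pictured as a hierarchy: a work is realized through expressions, and each expression is embodied in manifestations. The sketch below is a deliberately simplified model of that hierarchy and is not how XC performs its conversion. It assumes flat input records with hypothetical keys (`author`, `uniform_title`, `language`, `publisher`, `year`) and treats each language as a separate expression, which is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class Manifestation:
    publisher: str
    year: str


@dataclass
class Expression:
    language: str
    manifestations: list = field(default_factory=list)


@dataclass
class Work:
    author: str
    title: str
    expressions: dict = field(default_factory=dict)  # language -> Expression


def frbrize(records):
    """Group flat records into Work > Expression > Manifestation."""
    works = {}
    for rec in records:
        key = (rec["author"], rec["uniform_title"])
        work = works.setdefault(key, Work(rec["author"], rec["uniform_title"]))
        expr = work.expressions.setdefault(
            rec["language"], Expression(rec["language"])
        )
        expr.manifestations.append(Manifestation(rec["publisher"], rec["year"]))
    return list(works.values())


# Made-up sample records; publishers and years are placeholders.
sample = [
    {"author": "Twain, Mark", "uniform_title": "Adventures of Huckleberry Finn",
     "language": "eng", "publisher": "Publisher A", "year": "1950"},
    {"author": "Twain, Mark", "uniform_title": "Adventures of Huckleberry Finn",
     "language": "eng", "publisher": "Publisher B", "year": "1985"},
    {"author": "Twain, Mark", "uniform_title": "Adventures of Huckleberry Finn",
     "language": "ger", "publisher": "Publisher C", "year": "1990"},
]
works = frbrize(sample)
print(len(works), sorted(works[0].expressions))
```

Three flat records collapse into one work with two expressions, so a discovery interface could show a reader every edition and translation of the same work together.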
The old model of bibliographic data creation will continue for some time. We are at the beginning of the age of data, and the amount of work is crushing. Cataloging skills are exactly what is needed in this new age, but a recasting of what we do and use is required. We are no longer the Cataloging Department but the Metadata Department. The tools needed to create data and make libraries’ unique collections available on the Web will change, and catalogers should start caring more about the context and curation of metadata and start learning LD vocabulary.
While this was my second visit to Anaheim, CA to attend ALA’s Annual Conference, it was my first time ever presenting at a national conference. On Sunday morning starting at 8 am, Erik Mitchell and I hosted and convened the panel discussion, Current Research on and Use of FRBR in Libraries. The title of our individual presentation was FRBRizing Mark Twain.
We began the session with a quick exploration of some of the metadata issues that libraries are encountering as we explore new models including FRBR and linked open data. Erik and I discussed our research, which explored metadata quality issues that arose when we applied the FRBR model to a selected set of records in ZSR’s catalog. Our research questions were twofold:
- What metadata quality problems arise in the application of FRBRization algorithms?
- How do computational and expert approaches compare with regard to FRBRization?
So in a nutshell, this is how we did it:
- Erik extracted 848 catalog records on books either by or about Mark Twain.
- He extracted data from the record set and normalized text keys from elements of the metadata.
- Data was written to a spreadsheet and loaded into Google Refine to assist with analysis.
- Carolyn grouped records into work-sets and created a matrix of unique identifiers.
- Because of metadata variation, Carolyn performed a secondary analysis using a book-in-hand approach for 5 titles (approx. 100 books).
- Expert review found 410 records grouped into 147 work-sets with 2 or more expressions and 420 records grouped into 420 single-expression work-sets. Lost, missing, or checked-out books were not examined, which is why these numbers do not add up to the 848 records in the record set.
- Metadata issues encountered included the need to represent whole/part or manifestation to multiple work relationships, metadata inconsistency (i.e. differences in record length, composition, invalid unique identifiers), and determining work boundaries.
- Utilizing algorithms, Erik performed a computational assessment to identify and group work-sets.
- Computational and expert assessments were compared to each other.
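The general idea of grouping records into work-sets by normalized text keys can be sketched as follows. This is an illustrative example only and not the algorithm used in the study: the normalization rules (lowercasing, stripping punctuation and diacritics, dropping dates from the author, dropping a leading article from the title), the function names, and the sample records are all assumptions.

```python
import re
import unicodedata
from collections import defaultdict

LEADING_ARTICLES = ("the ", "a ", "an ")


def normalize(text):
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def work_key(author, title):
    """Build an author/title key; illustrative rules, not the study's."""
    # Drop purely numeric tokens such as birth and death years.
    author_part = " ".join(
        tok for tok in normalize(author).split() if not tok.isdigit()
    )
    title_part = normalize(title)
    for article in LEADING_ARTICLES:
        if title_part.startswith(article):
            title_part = title_part[len(article):]
            break
    return f"{author_part}/{title_part}"


def group_work_sets(records):
    """Map each key to the list of record ids that share it."""
    work_sets = defaultdict(list)
    for rec in records:
        work_sets[work_key(rec["author"], rec["title"])].append(rec["id"])
    return dict(work_sets)


# Made-up records showing typical variation in punctuation and dates.
records = [
    {"id": "r1", "author": "Twain, Mark, 1835-1910.",
     "title": "The adventures of Tom Sawyer /"},
    {"id": "r2", "author": "Twain, Mark", "title": "Adventures of Tom Sawyer"},
    {"id": "r3", "author": "Twain, Mark, 1835-1910.",
     "title": "Life on the Mississippi."},
]
work_sets = group_work_sets(records)
print(work_sets)
```

Here the first two records land in the same work-set despite their differing punctuation, article, and dates, while the third forms a work-set of its own. A key like this can only be as good as the metadata it is built from, which is where the inconsistencies listed above come into play.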
Erik and I were really excited to see that computational techniques were largely as successful as expert techniques. We found, for example, that normalized author/title strings created highly accurate keys for identifying unique works. On the other hand, we also found that MARC records did not always contain the metadata needed to fully identify works. Our detailed findings will be presented at the ASIS&T conference in October. Here are our slides:
Our other invited speakers included:
- OCLC’s Chief Scientist Thom Hickey, who spoke about clustering OCLC’s database, which is under 300 million records, at the work level of the FRBR Group 1 entities, and clustering within work-sets by expression using algorithm keys; FRBR algorithm creation and development; and the fall release of GLIMIR, which attempts to cluster WorldCat’s records and holdings for the same work at the manifestation level.
- Kent State’s School of Information and Library Science professors Drs. Athena Salaba and Yin Zhang, who discussed their IMLS (Institute of Museum and Library Services) funded project, a FRBR prototype catalog. Library of Congress cataloging records were extracted from WorldCat to create a FRBRized catalog. Users were tested to see if they could complete a set of user tasks in the library’s current catalog and in the prototype.
- Jennifer Bowen, Chair of the XC organization and Assistant Dean for Information Management Services at the University of Rochester, who demonstrated the XC catalog to the audience. The XC project didn’t set out to see if people liked FRBR, but to learn what our users are trying to do with the catalog’s data. According to Ms. Bowen, libraries are, or should be, moving away from assuming we know what users need and toward asking what users need to do in their research. How do users keep current in their field? With regard to library data, we need to ask our users, “What would they do with a magic wand?” and continue to ponder, “What will the user needs of the future be?”
Following our session, I joined a packed room of librarians eager to hear more about the Library of Congress’s (LC) Bibliographic Framework Transition Initiative (BFI), which is looking to translate the MARC21 format, a 40-year-old standard, to an LD model. LC has contracted with Zepheira to help accelerate the launch of BFI. By August/September, an LD working draft will hopefully be ready to present to the broader library community.