Tags and Subject Headings in LibraryThing

When I finished library school in August, I put all of my papers in some boxes and haven’t looked at them since, because I really needed to recover. Happily, yesterday’s post about folksonomy finally forced me to dredge them out and bring to light a paper I did for my indexing and abstracting class on the use of a folksonomy alongside a controlled vocabulary in LibraryThing.

The first part is a sort of “literature review” on folksonomy (such as it is) and an overview of the concepts involved. The second part takes a look at LibraryThing and compares tags and subject headings.

The full text of the paper follows, or you can download a pdf for printing (warning: it’s in ugly, formatted-to-turn-in-for-class format; one of these days I’ll get it prettied up). Some discussion of the features of LibraryThing is slightly out of date (Tim & co. move fast!), and the statistics about popular tags and subject headings certainly are, but I think the main points are still relevant. I’d like to do some more in-depth analysis, especially of the statistical data, at some point in the future.

Introduction

“Folksonomy” is a term coined by Thomas Vander Wal from “folk” and “taxonomy” (Smith, 2004) to describe systems in which people (not professional indexers) assign natural language descriptors to resources. Many definitions of varying precision and breadth have been used for “folksonomy” (Vander Wal, 2005), and a variety of similar terms have been proposed (Merholz, 2005), but the term folksonomy appears to have become the most accepted. Because of the echo of “taxonomy”, folksonomy is often incorrectly described as classification, but it is actually more correctly labeled categorization or indexing (Mathes, 2004).

Folksonomies appear in many current web applications, most famously in the “social bookmarking” service del.icio.us, in which users bookmark and tag web pages, and the photo sharing site flickr, where users can upload and tag photographs and other images.

Folksonomies have also begun to appear in more “scholarly” services, such as Connotea and CiteULike, which are similar to del.icio.us but oriented toward article citations. Folksonomies are also being applied to books, in the University of Pennsylvania’s library catalog (a project called PennTags), and in LibraryThing, a social site for cataloging personal libraries. These examples are especially interesting, since the resources involved (articles, books) are already indexed using controlled vocabularies, providing a fertile ground for exploring the similarities and differences in the functions of folksonomies and controlled vocabularies. This article examines properties of folksonomies in general and then looks at the example of LibraryThing in particular.

Definitions

Vander Wal (2005) asserts that the three important components of folksonomy are the following:

  1. The users;
  2. The resources being described, with a unique identifier such as a URL or ISBN; and,
  3. The descriptors or “tags” used to describe the information resource.

The act of assigning descriptors is often referred to as “tagging.”

In addition, the folksonomy has two important aspects:

  1. It is performed for personal organization and retrieval.
  2. It is performed in a social environment, and greater numbers of users improve the system.

The personal aspect makes participation in a folksonomic system worthwhile for users to pursue, since the system directly benefits their own organization and retrieval (Mathes, 2004).

The social aspect makes folksonomies useful not only for retrieval, but also for discovery. With any two of the three components (users, resources, or tags), one can find more of the third—other users with similar interests, resources similar to those already found, or descriptors similar to those already used (Vander Wal, 2005).
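As a rough illustration of this three-part model, a folksonomy can be sketched as a set of (user, resource, tag) triples, where fixing any two components retrieves the third. This is only a sketch, not the data structure of any actual service; the user names, ISBN-like identifiers, and function names are all hypothetical.

```python
from collections import namedtuple

# One tagging act: a user applies a tag to a resource.
Tagging = namedtuple("Tagging", ["user", "resource", "tag"])

# Hypothetical sample data; the identifiers are made up for illustration.
TAGGINGS = {
    Tagging("alice", "isbn:111", "fiction"),
    Tagging("alice", "isbn:222", "history"),
    Tagging("bob", "isbn:111", "fiction"),
    Tagging("bob", "isbn:333", "fiction"),
    Tagging("carol", "isbn:222", "history"),
}

def users_for(taggings, resource, tag):
    """Given a resource and a tag, find the users who paired them."""
    return {t.user for t in taggings if t.resource == resource and t.tag == tag}

def resources_for(taggings, user, tag):
    """Given a user and a tag, find the resources that user tagged that way."""
    return {t.resource for t in taggings if t.user == user and t.tag == tag}

def tags_for(taggings, user, resource):
    """Given a user and a resource, find the tags that user applied to it."""
    return {t.tag for t in taggings if t.user == user and t.resource == resource}
```

Here, starting from the resource "isbn:111" and the tag "fiction" leads to both alice and bob, which is the kind of discovery of like-minded users described above.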

Properties of Folksonomy

Folksonomy uses an uncontrolled vocabulary, so it has much in common with any other system using natural language for descriptors. However, folksonomy also employs a social system absent from other systems of indexing that sets it apart.

Folksonomy as Uncontrolled Vocabulary

As uncontrolled vocabularies, folksonomies suffer from many difficulties such as ambiguity in the meaning of and differences between terms, a proliferation of synonyms, varying levels of specificity, and lack of guidance on syntax and slight variations of spelling and phrasing (Spiteri, 2005; Mathes, 2004). Syntax is also system-dependent: while some systems allow spaces in tags or differentiate between capital and lowercase letters, others do not (Mathes, 2004). Additionally, indexer error may be higher in folksonomies because of incorrect usage; Spiteri (2005) observes the application of “archaeology” as a tag to materials about dinosaurs and prehistoric microbes. And, while controlled vocabularies are used by professional indexers with some claim to objectivity, the interests of the user of the folksonomy are explicitly subjective.

Uncontrolled vocabularies do have several significant advantages over controlled vocabularies: the barriers to entry in terms of effort, time, and cognitive burden are much lower than in employing a controlled vocabulary (Mathes, 2004). This property allows much more widespread adoption of tagging systems.

Folksonomy as Social System

Large numbers of users in a folksonomy present opportunities not present in traditional indexing, which has a limited number of indexers and little overlap in the materials indexed. The large number of indexers, combined with the fact that indexers and the audience are the same, allows a tight feedback loop between indexing and use involving repetition and imitation (Udell, 2004; Spiteri, 2005). Through these processes, users negotiate the meanings of terms to reach a consensus, much as markets negotiate prices. Individual meanings need not be given up (this market process does not necessarily result in homogenization), but community meanings emerge from the aggregation of individual effort (Shirky, 2005).

Folksonomy systems have a number of mechanisms to facilitate this shared meaning. These include suggesting popular tags used by others for the same resource, exposing the statistical relationships between tags that are often used together, and allowing users to collaborate to combine tags they think are equivalent. These replace some of the functions of the thesaurus in a controlled vocabulary.

However, Shirky (2005) points out that tags that seem equivalent, which might be collapsed to a single preferred term in a controlled vocabulary, may exhibit significant differences in the consensus meanings achieved through folksonomies, e.g. movies v. cinema or queer v. homosexual. These subtle differences are achievable through a social system, where wide subjective usage results in an emergent consensus that does not occur easily by deliberation alone. In addition, Shirky has pointed out that tags can be probabilistic; a resource can be considered “partially” something, unlike in a single-indexer controlled vocabulary, where the resource either fits the category or doesn’t.

The Example of LibraryThing

LibraryThing is a website created by Tim Spalding in which users can catalog their personal collections of books, leveraging bibliographic data from Amazon.com and from libraries. Users can categorize books using tags and discover other users with similar books and interests. To date, LibraryThing has over 4 million books cataloged (LibraryThing Zeitgeist, 2006).

Since LibraryThing operates on books and makes use of library data, many of its books have subject headings assigned to them. The majority of these are likely to be Library of Congress Subject Headings (LCSH), but because the data is from many libraries, there are likely to be other English-language subject heading systems in use (such as the Sears subject headings), and there are subject headings from non-U.S. libraries including some not in English. Since the interface of LibraryThing is English, however, the vast majority of users are English-speaking and the most popular tags are English.

LibraryThing’s tagging system is an example of a folksonomy, meeting the definition and having the properties discussed previously. Subject headings comprise a thesaurus of controlled terms, assigned by cataloging librarians to describe books using a relatively objective protocol.

The next section analyzes the similarities and differences in tags and subject headings in LibraryThing.

Analysis

As of July 26, 2006, LibraryThing has 213,862 unique tags and 174,072 unique subject headings, but these numbers include minor variations of spelling and punctuation (LibraryThing Zeitgeist, 2006; Spalding, personal communication). Subject headings are available only for books for which the data is derived from library catalogs (excluding the Amazon.com data, which does not include subject headings). As a result, the coverage of subject headings in LibraryThing is narrower than that for tags (Spalding, personal communication).

The terms in subject headings include topical subjects, geographical locations, time periods, forms, and other categorizations; similar concepts appear in tags. Tags also include information that applies to a particular user’s copy, such as “signed”, or location tags such as “living room shelf”, which would be handled by bibliographic data other than the subject headings in a library record. However, since such tags are personal and context-dependent, they have small numbers of users compared to topical tags, and so are easily excluded from popularity-based information. Other context-dependent tags that are more universal include “read” and “unread”, indicating whether or not a user has yet read the book; these tend to be more common—“read” is the eighth most popular tag on LibraryThing (LibraryThing Zeitgeist, 2006).

Subject headings are pre-coordinated, meaning that conjoint concepts are coordinated into a single heading at the time of indexing; e.g. “England > Fiction” (Jacob, 2004). In contrast, tags tend to be post-coordinated, embodying a single idea; e.g., the item with subject heading “England > Fiction” might have the tags “England” and “fiction” both applied to it (and can be retrieved in a search using a Boolean AND). The pre- and post-coordinate systems have various merits. In general, post-coordinate systems are simpler because the coordination of terms does not have to be anticipated. This works especially well for de-coupled term combinations, such as a topical subject and a geographic locale or time period. However, it has been observed that in combinations of topical subjects, pre-coordinated terms can be more expressive: compare the subject headings “History > Philosophy” and “Philosophy > History” (Blachly, 2006).
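The difference between the two approaches can be sketched in a few lines. The catalog below is hypothetical, and the function names are illustrative only: a post-coordinate search intersects single-idea tags at query time, while a pre-coordinate search looks for a heading that was already combined at indexing time.

```python
# Hypothetical catalog: each book has post-coordinated tags (single ideas)
# and pre-coordinated subject headings (concepts joined at indexing time).
BOOKS = {
    "Book A": {"tags": {"england", "fiction"}, "headings": {"England > Fiction"}},
    "Book B": {"tags": {"england", "history"}, "headings": {"England > History"}},
    "Book C": {"tags": {"fiction", "france"}, "headings": {"France > Fiction"}},
}

def search_tags_and(books, *wanted):
    """Post-coordinate search: combine single-idea tags with a Boolean AND."""
    wanted = set(wanted)
    return sorted(title for title, rec in books.items() if wanted <= rec["tags"])

def search_heading(books, heading):
    """Pre-coordinate search: the combination must already exist as one heading."""
    return sorted(title for title, rec in books.items() if heading in rec["headings"])
```

Note that the tag sets {"history", "philosophy"} and {"philosophy", "history"} are identical, whereas the heading strings “History > Philosophy” and “Philosophy > History” are not; that ordering is where the extra expressiveness of pre-coordination comes from.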

A look at the top 75 tags and subject headings by popularity reveals some differences. It has been shown that tag distributions for del.icio.us, flickr, and millionsofgames (a video-game tagging site) obey a power law distribution with lambda < 0.6, while more structured vocabularies and classifications (the Wikipedia category system and the Dewey Decimal Classification) are distributed according to a power law with lambda > 0.9 (Voss, 2006). Figure 1 shows these distributions.

The distributions of tags and subject headings in LibraryThing exhibit somewhat different behavior. The tails of both distributions (for terms with rank ≥ 10) follow power laws, but with lambda ≈ 0.9 for subject headings and tags alike, suggesting that the differences in the power law distributions found by Voss may not be the result of differences in vocabulary structure (see Figure 2).

Figure 1. Distribution of tags in various systems. Note logarithmic scales. (from Voss, 2006)

Figure 2. Tag and subject heading distribution in LibraryThing.

Additionally, the most popular tags and subject headings (with rank < 10) are distributed anomalously. The top tags are more popular than the distribution for the tail would suggest, while the top subject headings are less popular (see Figure 2). This may be an artifact of the immense popularity of the number one tag, “fiction”. (It is notable that the subject headings have no term so broad; “Fiction” is coordinated with other terms or appears as a partial concept in terms such as “Science fiction”, “Mystery fiction”, and so on.) Further quantitative analysis of folksonomies and subject heading systems is needed to understand their statistical structures.
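For readers who want to try this kind of analysis, here is a minimal sketch of estimating lambda from a list of usage counts. It assumes the rank-frequency form f(r) ≈ C · r^(-lambda) and uses ordinary least squares on the log-log data, which is a crude estimator; it is not necessarily the parameterization or method used by Voss, and the function name and synthetic data are hypothetical. The min_rank parameter allows the fit to be restricted to the tail (rank ≥ 10), as in the discussion above.

```python
import math

def fit_power_law(counts, min_rank=1):
    """Estimate lambda in f(r) ~ C * r**(-lambda) by least squares on log-log data.

    counts: usage counts, one per term; they are sorted into rank order here.
    min_rank: ignore ranks below this (e.g. 10 to fit only the tail).
    """
    ranked = sorted(counts, reverse=True)
    points = [(math.log(rank), math.log(count))
              for rank, count in enumerate(ranked, start=1)
              if rank >= min_rank and count > 0]
    if len(points) < 2:
        raise ValueError("need at least two points to fit")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return -slope  # the slope in log-log space is -lambda

# Synthetic counts generated from an exact power law with lambda = 0.9;
# real tag counts would be noisier and would not fit this cleanly.
synthetic = [10000 * rank ** -0.9 for rank in range(1, 201)]
lam_tail = fit_power_law(synthetic, min_rank=10)
```

Comparing the fit over all ranks with the fit over the tail alone is one simple way to see whether the top few terms depart from the trend, as they appear to in Figure 2.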

A qualitative evaluation of the top 75 tags and subject headings yields several observations. First, it is clear that both tags and subject headings exist in duplications with minor differences in spelling or punctuation. This is to be expected with tags since there is no vocabulary control among users, but for subject headings it is the result of multiple vocabularies (besides LCSH) or, possibly, of human error. LibraryThing works around these difficulties in several ways. First, it assumes terms with differences only in capitalization are the same. Second, it supplies links to statistically related tags and subject headings, by calculating the frequency with which they are used in conjunction on items (see Figure 3). Finally, LibraryThing offers a facility for users to combine tags they consider to be equivalent, harnessing human intelligence to gloss over variations of spelling and phrase. For example, the tag “science fiction” is combined with tags such as “scifi” and “sci-fi”. To this author’s knowledge, it is the only folksonomy currently using such a feature.

Combining tags essentially makes up for one of the functions of an absent controlled vocabulary. It makes tags more useful, but caution is also needed. Spalding (2005a) has observed that Shirky’s remarks about the differences in “film” versus “cinema” are borne out in LibraryThing, as well as even more subtle differences, such as between “humor” and “humour”.
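The three workarounds can be sketched together as follows. This is an illustration under stated assumptions, not LibraryThing’s actual implementation: the equivalence table, the sample items, and the function names are hypothetical, and relatedness is measured here by raw co-occurrence counts.

```python
from collections import Counter

# Hypothetical equivalence table of the kind users might build by combining tags.
COMBINED = {"scifi": "science fiction", "sci-fi": "science fiction"}

def normalize(tag, combined=COMBINED):
    """Fold case, then map a variant to its combined form if one exists."""
    folded = tag.strip().lower()
    return combined.get(folded, folded)

def related_tags(items, tag, top=3):
    """Rank other tags by how many items they share with `tag`."""
    target = normalize(tag)
    pair_counts = Counter()
    for tags in items:
        canon = {normalize(t) for t in tags}
        if target in canon:
            pair_counts.update(canon - {target})
    return pair_counts.most_common(top)

# Hypothetical items, each a list of tags as users typed them.
ITEMS = [
    ["SciFi", "space", "fiction"],
    ["sci-fi", "fiction"],
    ["Science Fiction", "space", "fiction"],
    ["history", "england"],
]
```

Because “humor” and “humour” are not in the equivalence table, normalize leaves them as separate tags; whether to merge such pairs is exactly the judgment call that calls for caution.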

[see http://www.librarything.com/tag/childrens]

Figure 3. LibraryThing tag info page for tag “childrens”, showing (1) tag combinations, (2) related tags, (3) related subjects.

In the top 75, tags and subject headings exhibit a reasonable degree of overlap of topics, although the differences between pre- and post-coordinated terms must be taken into account. The top 75 subject headings appear to exhibit a slant because of certain highly popular works, including a heavy dose of headings such as “Wizards > Fiction”, “Magic > Juvenile Fiction”, “Potter, Harry (Fictitious character) > Juvenile fiction”, “Middle Earth (Imaginary place) > Fiction”, and so on. Top tags include information about form that would be characterized by different bibliographic data in a library catalog, such as “series”, “literature”, or “paperback”, so there are no equivalent subject headings. Subject headings appear to make finer distinctions in topics than do tags, with a litany of headings depicting characters and their relationships, such as “Fathers and daughters > Fiction”, “Orphans > Fiction”, and even “Triangles (Interpersonal relations) > Fiction”, concepts of a kind largely absent from the highly popular tags.

Blachly (2006) has observed that tags in LibraryThing can exhibit the probabilistic behavior described by Shirky by comparing the tags and subject headings for “Dystopia”: where only a limited number of works have the subject heading, many works have the tag, but applied only a few times. Blachly also notes several instances in which tags provide a more natural discovery path for items because of their more direct language; e.g. the tags “queer” and “gay fiction” for Maupin’s Tales of the City, where the top subject headings are “City and town life > Fiction” and “Humorous stories”.
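One simple way to make the “partially something” idea concrete is to weight each tag by the share of a work’s holders who applied it. This weighting is an assumption made for illustration, not a measure proposed by Shirky or Blachly, and the counts below are invented.

```python
def tag_weights(tag_counts, holders):
    """Share of a work's holders who applied each tag (0.0 to 1.0)."""
    if holders <= 0:
        raise ValueError("holders must be positive")
    return {tag: count / holders for tag, count in tag_counts.items()}

# Hypothetical counts for one work held by 200 users.
weights = tag_weights({"fiction": 120, "dystopia": 6}, holders=200)
```

Under this sketch the work is strongly “fiction” and only slightly “dystopia”, whereas a subject heading would record each concept as simply present or absent.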

Conclusions

LibraryThing shows that folksonomies and controlled vocabularies can harmoniously coexist, and that correlations between the two can be useful. To this author’s knowledge, it is the only example where a folksonomy and a controlled vocabulary are being used in conjunction in this way. Spiteri (2005) has suggested that folksonomies could especially benefit from alignment with focused vocabularies, such as the Getty Thesaurus of Geographic Names.

Clearly, tags and subject headings derive from similar principles and work toward similar ends. Equally clearly, they have radically different models of application and use that cannot be completely reconciled. So, what can systems of subject headings learn from tags? What can folksonomic systems learn from subject headings?

Folksonomic systems generally make excellent use of the wealth of data in exposing related tags. In controlled vocabularies like subject headings, cross-references are generally used to the same end. However, subject heading lists could employ the same statistical techniques as folksonomies to find related subjects (which LibraryThing does), and useful differences from the anticipated cross-references might appear. To date, this type of data has not been leveraged in library catalogs. Spiteri (2005) has also suggested that folksonomies could be used as a basis to develop controlled vocabularies that match the language of users. This might be an especially useful technique in a controlled system, such as a corporate intranet.

Folksonomic systems may benefit from alignment with controlled vocabularies, especially where there is a strong ontological basis for the vocabulary—in geographic names or time periods, for example. However, Shirky (2005) points out that the classification of even something so concrete as geographic locations may be seen very differently by users, and so tagging practices may not line up precisely with a chosen controlled vocabulary.

A system such as LibraryThing, focusing on books, can benefit from observations of the practices of bibliographic description, especially in how cataloging data is shared and customized for local use. It might be beneficial, for example, to segregate personal tags and social tags (or perhaps container tags and content tags are clearer terms), keeping such tags as “read” and “signed” separate from “science fiction” and “politics”.

Further comparison and contrast of folksonomies and controlled vocabularies will no doubt be instructive as folksonomies continue to increase in popularity.

References

Blachly, Abby (2006). “Tagging meets Subject Headings” in Thing-ology Blog, May 14, 2006.

Jacob, Elin K. (2004). “Classification and categorization: a difference that makes a difference.” Library Trends 52(3): 515–540.

LibraryThing Zeitgeist (2006). Retrieved July 26, 2006.

Mathes, Adam (2004). “Folksonomies—Cooperative Classification and Communication Through Shared Metadata.”

Merholz, Peter (2005). “Mob indexing? Folk categorization? Social tagging?” in Peterme.com, January 3, 2005.

Shirky, Clay (2005). “Ontology is Overrated: Categories, Links, and Tags.”

Smith, Gene (2004). “Folksonomy: social classification” in Atomiq, August 3, 2004.

Spalding, Tim (2005a). “New: Tag pages and related tags” in LibraryThing Blog, November 6, 2005.

Spalding, Tim (2005b). “Combining tags (heresy!)” in LibraryThing Blog, December 18, 2005.

Spiteri, Louise (2005). “Controlled Vocabularies and Folksonomies.” (pdf) Presentation at Canadian Metadata Forum, Ottawa, ON, September 27, 2005.

Udell, Jon (2004). “Collaborative Knowledge Gardening” in InfoWorld, August 20, 2004.

Vander Wal, Thomas (2005). “Folksonomy Definition and Wikipedia” in Off the Top, November 2, 2005.

Voss, Jakob (2006). “Collaborative Thesaurus Tagging the Wikipedia Way.”
