A Metadata Primer
By Ann Rockley, 2003-02-13
Excerpted from Managing Enterprise Content: A Unified Content Strategy, Ann Rockley et al., ISBN: 0735713065; Chapter 5. Copyright 2002. Reproduced by permission of New Riders.
As you have no doubt noticed, more information is available than ever before -- on the Web, your company intranet, in your content management repository, and elsewhere. This is both exciting and problematic, and extremely frustrating when you can't find what you're looking for.
What is missing is information about the information -- that is, labeling, cataloging and descriptive information -- that enables a computer to properly process and search the content elements. This information about information is known as metadata.
Although metadata has been a buzzword in the information technology and data warehousing business for some time, it has recently emerged as an important concept for developers of enterprise content management and Web publishing solutions. With more complex authoring processes and information delivery requirements, you need some way of classifying and identifying all of the information or content "bits" so that they can be retrieved and combined in meaningful ways for users. Well-designed metadata can provide the classification and identification you need.
In a unified content strategy, metadata enables content to be retrieved, tracked, and assembled automatically. Metadata enables:
- Effective retrieval
- Systematic reuse
- Automatic routing based on workflow status
- Tracking of status
Unified content requires two types of metadata: categorization and element. Users tend to find information based on categorization metadata, whereas authors tend to retrieve information based on element metadata.
For years, libraries have used metadata to catalog and categorize documents. Without the card catalog and the Dewey Decimal system it would have been impossible to find content in a library. Without metadata it is nearly impossible to find content online.
The increasing use of portals has encouraged organizations to make the portal the central location for access to organizational content. However, as each new piece of content is added, users' ability to find content decreases. Corporate information needs to be just as accessible as library content, which means organizing content in a logical structure, categorizing it, and using the categories to add metadata to the information.
To begin the process of creating categorization metadata you need to understand your users. Understanding your users helps you to define the ways in which they will retrieve information. Ask the following questions:
- Who is going to retrieve the content?
For example, customers will want to retrieve product information; marketing will want to retrieve reports, product information, and industry materials; and employees will want to retrieve policies and procedures.
- What tasks are they trying to accomplish with the content?
For example, are they trying to complete a task or make a decision? You need to categorize content for tasks in areas such as procedures, while policies may be categorized separately.
- What terms will they use when retrieving the content?
Anticipating the terms that people will use is always difficult. Everyone uses different terms and thinks about information in different ways. You will never be able to ensure that content will be accessible under everyone's terms, but understanding how people will refer to content helps you to determine your taxonomy. After you develop a taxonomy you can educate your users to use the available terms.
Now you need to categorize your content and create a taxonomy. This involves:
- Grouping or clustering related content
As you start to categorize your content, you need to start grouping like or similar content together. These groups create categories, which are then refined by individual items in the category. For example:
- Company benefits
- Benefit policies
- Benefit forms
- Benefit frequently asked questions
Which can be simplified to:
- Company benefits
  - Policies
  - Forms
  - Frequently asked questions
- Developing your taxonomy
As you group content, categorize it, and define the terms to be used to identify your content, you are automatically creating your taxonomy. Each term in your taxonomy becomes metadata.
- Testing your taxonomy to ensure that it is appropriate and comprehensive.
You need to ensure that the metadata you have created is appropriate and usable by your audience. Before even using the metadata electronically, categorize some sample content and ask users to perform a usability test to ensure your taxonomy is appropriate.
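The grouping and taxonomy steps above can be sketched in a few lines of Python. This is only an illustration, assuming a two-level taxonomy held in a dictionary; the names `TAXONOMY`, `taxonomy_terms`, and `categorize` are hypothetical and not part of any content management system.

```python
# Hypothetical two-level taxonomy built from the company benefits example.
TAXONOMY = {
    "Company benefits": [
        "Benefit policies",
        "Benefit forms",
        "Benefit frequently asked questions",
    ],
}


def taxonomy_terms(taxonomy):
    """Flatten the taxonomy into the set of terms usable as metadata."""
    terms = set()
    for category, subcategories in taxonomy.items():
        terms.add(category)
        terms.update(subcategories)
    return terms


def categorize(title, category, subcategory, taxonomy=TAXONOMY):
    """Attach categorization metadata, rejecting terms outside the taxonomy."""
    if subcategory not in taxonomy.get(category, []):
        raise ValueError(f"{subcategory!r} is not a term under {category!r}")
    return {"title": title, "category": category, "subcategory": subcategory}


# Sample content categorized before a usability test (title is made up).
sample = categorize("Dental plan claim form", "Company benefits", "Benefit forms")
print(sample)
```

Rejecting unknown terms at categorization time is one possible design choice; it keeps the sample content you hand to test users consistent with the taxonomy you are actually testing.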
Element metadata identifies your content at the element level, based on the elements defined in your information model (see Chapter 8, "Information modeling"). Authors use element metadata to help them manage content throughout the authoring process. There are three main types of element metadata:
- Reuse metadata
- Retrieval metadata
- Tracking metadata (status)
Metadata for reuse
Metadata for reuse identifies the components of content that can be reused in multiple areas. For example, if an overview already exists for the ABC product, you can use metadata like "content type = overview, product = ABC" to help you find the correct content to reuse.
Before even beginning to write, authors can search the content management system by metadata for reusable content. Alternatively, the content management system can automatically search for appropriate reusable content (based on models and metadata) and deliver it (systematic reuse) to authors. In both cases, metadata is very important to correctly identify the elements of content.
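As a rough illustration of searching by reuse metadata, the sketch below filters an in-memory list of elements on the "content type = overview, product = ABC" example. The repository structure, the element ids, and the `find_reusable` helper are assumptions made for this example; a real content management system would expose its own query interface.

```python
def find_reusable(elements, **criteria):
    """Return elements whose metadata matches every given key/value pair."""
    return [
        element
        for element in elements
        if all(element["metadata"].get(key) == value
               for key, value in criteria.items())
    ]


# Hypothetical repository contents.
repository = [
    {"id": "e1", "metadata": {"content_type": "overview", "product": "ABC"}},
    {"id": "e2", "metadata": {"content_type": "procedure", "product": "ABC"}},
    {"id": "e3", "metadata": {"content_type": "overview", "product": "XYZ"}},
]

matches = find_reusable(repository, content_type="overview", product="ABC")
print([element["id"] for element in matches])  # ['e1']
```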
To determine what metadata you need to enable reuse, you need to determine the business result you are trying to achieve and build your metadata backward to achieve that result. Think about the following:
- Where is content going to be reused?
Across product? Across information product? If you answered yes to any of these then you need to create metadata to identify each reuse. For example:
- Product, such as:
- Information product, such as:
- What type of content is it?
You also need to know the element content type for which the content is valid. Your metadata might include:
Content type, for example:
- What else do you need to know about the content to ensure that the correct piece of content is reused?
For example, you might also need to know to which version of the product the content applies:
Version, for example:
Furthermore, you may need to know the region or location where the product is being sold or used, so that you can identify content such as safety regulations, language, and configuration. In this case, your metadata might include:
- Region, for example:
Language, for example:
Finally, you may need to know the audience so that appropriate content is provided for each audience.
Audience, for example:
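One way to picture the reuse metadata that these questions produce is as a single record per content element. The sketch below is an assumption-laden illustration: the field names follow the categories discussed above, while the class `ReuseMetadata`, the helper `is_valid_for`, and all of the sample values are hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ReuseMetadata:
    product: str
    information_product: str
    content_type: str
    version: str
    region: str
    language: str
    audience: str


def is_valid_for(meta, **context):
    """True when every context value equals the matching metadata field."""
    fields = asdict(meta)
    return all(fields[name] == value for name, value in context.items())


# Sample values are made up for illustration only.
overview = ReuseMetadata(
    product="ABC",
    information_product="user guide",
    content_type="overview",
    version="2.0",
    region="North America",
    language="English",
    audience="administrator",
)

print(is_valid_for(overview, product="ABC", content_type="overview"))  # True
print(is_valid_for(overview, region="Europe"))  # False
```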
Metadata for retrieval
Metadata for retrieval is used to help authors retrieve content and may include much or all of the metadata used for reuse. However, metadata for retrieval is more extensive than metadata for reuse, providing additional information about an element that facilitates retrieval. Think about what other information would help you retrieve content more effectively. For example, your retrieval metadata might include:
- Title
This type of metadata can be entered by the author, or the system can use the title that appears in the content to create this metadata.
- Author
The system usually automatically generates this type of metadata, based on the author information.
- Date (creation, completion, modification)
The system usually automatically generates this metadata as it is checked into the content management system.
- Keywords
This metadata can be entered by the author; however, it is preferable to provide the author with a list of keywords from which to choose. This way keywords will be used consistently.
- Security level (who can view the content)
This type of metadata is usually applied by the author from a selected list of options.
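The split between system-generated and author-selected retrieval metadata might look something like the following at check-in. This is a minimal sketch, not a description of any particular product: the `check_in` function, the keyword list, and the security levels are all hypothetical, and the title is simply assumed to be the first line of the content.

```python
from datetime import datetime, timezone

# Hypothetical pick lists offered to the author.
ALLOWED_KEYWORDS = {"installation", "safety", "warranty"}
SECURITY_LEVELS = ("public", "internal", "restricted")


def check_in(content, user, keywords, security_level, now=None):
    """Build retrieval metadata for an element as it is checked in."""
    unknown = set(keywords) - ALLOWED_KEYWORDS
    if unknown:
        raise ValueError(f"keywords not in the list: {sorted(unknown)}")
    if security_level not in SECURITY_LEVELS:
        raise ValueError(f"unknown security level: {security_level!r}")

    lines = content.strip().splitlines()
    timestamp = now or datetime.now(timezone.utc)
    return {
        "title": lines[0] if lines else "",  # taken from the content itself
        "author": user,                      # taken from the logged-in user
        "modified": timestamp.isoformat(),   # generated by the system
        "keywords": sorted(keywords),        # chosen by the author from a list
        "security_level": security_level,    # chosen by the author from a list
    }


meta = check_in(
    "Installing ABC\nStep 1: Unpack the unit.",
    user="jsmith",
    keywords=["safety", "installation"],
    security_level="internal",
)
print(meta["title"], meta["keywords"])
```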
Metadata for tracking (status)
Metadata for tracking is particularly useful when you are implementing workflow as part of your unified content strategy. By assigning status metadata to each content element, you can determine which elements are active. You can also control what can be done to an element and who can do it. Generally, status changes based on the metadata are controlled through workflow automation, not by end users. Sometimes an author will identify a status change such as "ready for review" because the system cannot automate this type of information. Status metadata can include such tracking items as:
- Draft (under development by the author)
- Ready for review
Again, like your other metadata, you identify tracking metadata by determining the business result you are trying to achieve, and then build your metadata backward to achieve that result. Design your metadata for tracking after you have designed your workflow. This enables you to identify what metadata needs to be applied to the content at each stage of workflow to enable the workflow system to manage it.
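To make the link between workflow design and status metadata concrete, the sketch below models status changes as a table of allowed transitions, each with the roles permitted to make it. The status names come from the examples in this section; the transition table itself, the role names, and the `change_status` function are assumptions for illustration, and your own workflow will differ.

```python
# Hypothetical transition table: (current status, new status) -> allowed roles.
TRANSITIONS = {
    ("Draft", "Ready for review"): {"author"},
    ("Ready for review", "In review"): {"workflow"},
    ("In review", "In approval"): {"workflow"},
    ("In review", "Draft"): {"workflow"},
}


def change_status(element, new_status, actor_role):
    """Return a copy of the element with its status metadata updated."""
    allowed_roles = TRANSITIONS.get((element["status"], new_status))
    if allowed_roles is None:
        raise ValueError(
            f"no transition from {element['status']!r} to {new_status!r}"
        )
    if actor_role not in allowed_roles:
        raise PermissionError(f"{actor_role!r} cannot make this change")
    return {**element, "status": new_status}


step = {"id": "step-1", "status": "Draft"}
step = change_status(step, "Ready for review", "author")
step = change_status(step, "In review", "workflow")
print(step["status"])  # In review
```

Here the author flags only "Ready for review" and the workflow system drives every other change, which mirrors the division of responsibility described above.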
After you have designed your metadata to support your workflow, you need to identify other metadata that can help you to track your content. For example:
- Who created the content (author)?
- When was it created/modified (date)?
- Who modified the content (editor)?
- Who reviewed/approved the content (reviewer/approver)?
- How long did it take to create/modify/review (time)?
- Where has it been reused (information product, product)?
- Has it been translated (content status)?
Most content management systems create some of this metadata automatically (for example, author and date), and other items may already be defined in your retrieval and reuse metadata. Even so, go through this exercise to make sure that you have identified all the metadata you require for tracking and reports.
Creating a controlled vocabulary
Metadata needs to be consistent to facilitate reuse, retrieval, and tracking. This requires a controlled vocabulary. A controlled vocabulary reconciles the various words that can be used to identify content and differentiates among the possible meanings that can be attached to certain words. Using an unlimited or uncontrolled set of metadata terms creates additional work for authors (they have to figure out the metadata each time they apply it) and reduces the percentage of content that can be effectively retrieved (different terms mean either searching with multiple terms or missing some content because retrievers are unaware of alternate terms). If authors can create their own metadata tags, there is a high probability that they will each create different metadata.
To create a controlled vocabulary:
- Identify your metadata categories (for example, Content type, Product).
- Identify the terms that make up that metadata category.
- Content status
  - Draft
  - Ready for review
  - In review
  - In approval
In this example, "content status" is the metadata category and the controlled terms are "Draft," "Ready for review," and so on.
Uncontrolled metadata terms should be the exception to the rule. If possible, do not provide any metadata that can be defined by the author. If that is not possible, monitor the uncontrolled metadata terms to see whether patterns are emerging that could then be used to create a controlled vocabulary.
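A controlled vocabulary can be enforced with a simple lookup, and uncontrolled terms can be counted to reveal the patterns mentioned above. The following is a minimal sketch under stated assumptions: the vocabulary holds only the content status example, and `is_controlled`, `emerging_terms`, and the sample free-text terms are hypothetical.

```python
from collections import Counter

# Metadata category -> controlled terms (content status example only).
CONTROLLED_VOCABULARY = {
    "content status": {"Draft", "Ready for review", "In review", "In approval"},
}


def is_controlled(category, term):
    """True when the term belongs to the category's controlled vocabulary."""
    return term in CONTROLLED_VOCABULARY.get(category, set())


def emerging_terms(free_text_terms, min_count=2):
    """List author-defined terms used at least min_count times.

    Terms are normalized by trimming whitespace and lowercasing, so that
    near-duplicates count as the same candidate term.
    """
    counts = Counter(term.strip().lower() for term in free_text_terms)
    return [term for term, n in counts.most_common() if n >= min_count]


print(is_controlled("content status", "Draft"))     # True
print(is_controlled("content status", "Finished"))  # False
print(emerging_terms(["How-to", "how-to ", "FAQ", "how-to"]))  # ['how-to']
```

Terms that recur often enough are candidates for promotion into the controlled vocabulary; the threshold of two is arbitrary.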
Ensuring metadata gets used
Metadata can be very valuable and useful; however, it is only valuable if it gets used. Wherever possible, automate the application of metadata. Leaving the application of metadata up to authors adds yet another burden to the authoring process and leads to inconsistency. Some authors diligently apply the correct metadata, some apply some of it correctly and some of it incorrectly, and some don't apply it at all. Unless it is applied appropriately all the time, your metadata could become useless.
Wherever possible, have the system apply the metadata. This can include automatic:
- Categorization metadata based on the content
- Metadata based on the template and model
- Inheritance of metadata based on the parent (for example, if a container element is given a restricted security, all the elements within the container automatically have the same security metadata applied to them)
- Metadata based on position in the workflow
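Of the automatic options listed, inheritance is perhaps the easiest to sketch. The example below assumes content is held as a tree of dictionaries with a `children` list, and that a container's security value, once set, is applied to everything inside it. Both the structure and the `inherit_security` function are hypothetical; a real system may allow children to override the inherited value.

```python
def inherit_security(element, inherited=None):
    """Return a copy of the tree with security metadata pushed down.

    A value inherited from a container wins over any value set lower down
    (an assumption made for this sketch).
    """
    security = inherited if inherited is not None else element.get("security")
    children = [
        inherit_security(child, security)
        for child in element.get("children", [])
    ]
    result = {**element, "children": children}
    if security is not None:
        result["security"] = security
    return result


container = {
    "id": "procedure",
    "security": "restricted",
    "children": [
        {"id": "step-1"},
        {"id": "step-2", "security": "public"},
    ],
}

secured = inherit_security(container)
print([child["security"] for child in secured["children"]])
# ['restricted', 'restricted']
```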
If it is necessary for authors to add metadata, make it possible for them to add the metadata as they are authoring so that they don't have to wait until the content is checked into the content management system. For example, if a step varies based on role, let them add the role metadata as soon as they finish writing the step. If it has to wait until the content is checked into the content management system, the system either has to prompt them to add the metadata for every single element (a very tedious process), or it may be up to the author to remember to add the metadata in all the relevant places (a recipe for missed metadata).