Searching Instead Of Browsing: Organizing Information Using Labels as Meta-Data

Being able to assign labels to content, and to organize information for searching, is superior to placing content in folders for manual browsing. The folder concept may be suitable for physical documents on paper, but it does not lend itself well to digital information. The labels concept, combined with an effective search capability, is a faster way to organize content and find information.

Organizing content is a means to the end goal of finding information. Since organizing content is not a goal in itself, it should require as little effort as possible while still meeting the goal of finding information.

The folder concept has many limitations:

  • A particular item of content can only belong to one folder. Placing it in two folders requires either:
    • Making duplicates. Duplicates are a maintenance problem to keep in sync.
    • Using links. This is problematic too: with ‘soft links’, the content actually resides in only one folder, and if that folder is deleted, the content is gone; with ‘hard links’, it is hard to tell how many ‘folders’ contain the content, and unlinking the last one may unintentionally erase it.
  • Similarly, folders can only be contained within one folder.
  • Organizing content well in folders requires deep levels of sub-folders, which are a challenge to browse.
  • All content must be placed in a folder for it to be well organized in this scheme. Doing this manually is a burden. Setting up rules to place some of the content in folders automatically relieves the burden to a certain extent. However, if a rule turns out to be flawed and has mixed content into the wrong folder along with other content, finding it and moving it to the right folder is an even bigger burden.
  • Folders are static; search results are dynamic. With the computing power available to the average person growing, dynamic search makes better sense than static folders, which shift some of the work from the computer to the user. (A sketch of the contrast follows this list.)
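
To make the contrast concrete, here is a minimal sketch (in PHP, purely as an illustration; the item ids, titles and labels are made up): each content item carries any number of labels, and a “folder” view is nothing more than a search over those labels, so no item is ever forced into exactly one place.

    <?php
    // Minimal sketch: items carry any number of labels instead of living in one folder.
    // The item ids, titles and labels are made up for illustration.
    $items = array(
        'ra23px4' => array('title'  => 'Flyers victory',
                           'labels' => array('sports', 'ice hockey', 'flyers')),
        'b81kq0z' => array('title'  => 'Organizing email with labels',
                           'labels' => array('email', 'personal information management', 'computing')),
    );

    // A "folder" view is just a dynamic search: return every item carrying the given label.
    function findByLabel($items, $label) {
        $results = array();
        foreach ($items as $id => $item) {
            if (in_array($label, $item['labels'])) {
                $results[$id] = $item['title'];
            }
        }
        return $results;
    }

    print_r(findByLabel($items, 'computing'));   // the email article shows up here...
    print_r(findByLabel($items, 'email'));       // ...and here too, with no duplicates and no links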

It should not be mandatory to apply every appropriate label to every piece of content. If the automated content categorization in use employs techniques like artificial intelligence and pattern recognition, and can determine on its own that this article is about personal information management or content management, then applying those particular labels by hand should not be mandatory.
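
As a deliberately crude stand-in for those artificial intelligence and pattern-recognition techniques, even simple keyword matching can suggest labels automatically; the rules below are made up for illustration, not a description of any real categorizer.

    <?php
    // A deliberately crude auto-labeler: real categorizers would use statistical or
    // AI techniques, but keyword matching is enough to show the idea.
    // The keyword lists are made up for illustration.
    $rules = array(
        'personal information management' => array('email', 'organize', 'labels', 'folders'),
        'content management'              => array('content', 'publish', 'articles'),
    );

    function suggestLabels($text, $rules) {
        $text   = strtolower($text);
        $labels = array();
        foreach ($rules as $label => $keywords) {
            foreach ($keywords as $keyword) {
                if (strpos($text, $keyword) !== false) {
                    $labels[] = $label;
                    break;  // one keyword hit is enough to suggest this label
                }
            }
        }
        return $labels;
    }

    print_r(suggestLabels('Organizing email content with labels instead of folders', $rules));
    // Suggests both labels, so neither needs to be applied by hand.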

As the number of labels grows, the labels should not be organized in a taxonomy tree with a folders/sub-folders structure; such a tree inherits the problems of folders. Instead, the labels should be associated with each other in complex relationships, as ‘concepts’ in a language.

For example, content carrying the label “computing” should appear in the search results for “technology”, and content carrying the label “personal information management” should appear in the search results for the concept “email”. Note that in a traditional taxonomy tree, “computing” could be a child of “technology”, but “personal information management” would be a parent of “email”.
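
Here is a minimal sketch of how such a concept network could drive search, assuming the relationships are stored as a simple map (the relationships below are made up and, unlike a taxonomy, need not form a tree): a query concept is expanded to every related label before matching content.

    <?php
    // Sketch of query expansion over a concept network instead of a taxonomy tree.
    // $related maps a search concept to other labels whose content should also match;
    // the relationships are made up for illustration and need not form a tree.
    $related = array(
        'technology' => array('computing'),
        'computing'  => array('personal information management'),
        'email'      => array('personal information management'),
    );

    // Expand a query concept to itself plus everything reachable through $related.
    function expandConcept($concept, $related, $seen = array()) {
        $seen[$concept] = true;
        if (isset($related[$concept])) {
            foreach ($related[$concept] as $other) {
                if (!isset($seen[$other])) {
                    $seen = expandConcept($other, $related, $seen);
                }
            }
        }
        return $seen;
    }

    print_r(array_keys(expandConcept('technology', $related)));  // technology, computing, personal information management
    print_r(array_keys(expandConcept('email', $related)));       // email, personal information management

Feeding the expanded set into a plain label search (like the findByLabel sketch earlier) makes content labeled “computing” show up for “technology”, and content labeled “personal information management” show up for “email”.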

However, web page URLs as they are commonly used, especially on static-HTML sites, are based on the concept of folders, and this is a challenge. URLs do not have to be folder-like in appearance, though. For example, all the news articles on a site could have URLs like “phillynews.com/ra23px4” instead of something like “phillynews.com/sports/ice_hockey/flyers/04-08-27-victory.htm” or “phillynews.com/inquirer/2004/08/27/sports/flyers-victory.htm”. In this fictitious example, “ra23px4” is an automatically generated, short and easy-to-type id pointing to the article, like the shortcuts generated by services such as tinyurl.com and metamark.net.
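
One way such ids could be minted, purely as an assumption for illustration (it is not necessarily how tinyurl.com or metamark.net generate their shortcuts), is to base-36 encode an auto-incrementing article number:

    <?php
    // One possible way to mint short, folder-free article ids: base-36 encode an
    // auto-incrementing article number. This scheme is an assumption for
    // illustration only, not how any particular service actually works.
    function shortId($articleNumber) {
        return base_convert($articleNumber, 10, 36);   // e.g. 1679616 becomes "10000"
    }

    function articleUrl($articleNumber) {
        // phillynews.com is the fictitious site from the example above
        return 'http://phillynews.com/' . shortId($articleNumber);
    }

    echo articleUrl(823752) . "\n";   // prints http://phillynews.com/hnm0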

Consider the organization of email, which seems to be headed in this direction. Some examples in the email space are Google’s GMail, Microsoft’s LookOut Search Plugin for Outlook, and Nelson Email Organizer (NEO).

Some possible labels for this document: “personal information management”, “content management”, “computing”, “technology”.

Preserving URLs of Evergreen Content

Changing the URLs of pages containing narrative content like articles has several disadvantages, especially for a content site:

  1. Readers’ bookmarks to the site’s pages break
  2. Links to evergreen content[1] like articles or news stories, archived in electronic media (e.g. emails, documents) and print media (e.g. books, magazines, newspapers), break
  3. Incoming links from other sites break
  4. Search engines drop the ranking of the pages
  5. It becomes harder for readers of the site to find content
  6. The site loses credibility with the readers
  7. The points above result in a significant loss of traffic to the pages, which in turn results in a loss of revenue

The idea of permanent links to content is gaining renewed popularity with blogs. Almost every blog entry has a ‘permanent link to this item’ link.

Years ago, when I decided to move my web site from an html+cgi platform to a better dynamic web site platform, I selected Microsoft’s Active Server Pages (.asp). I was disappointed that all my content page URLs were going to have to change from the .html extension to .asp, but I reasoned it would be a one-time change. Going with Microsoft’s new standard seemed a safe bet, so I did :-(

A few years later, when the .NET platform came along, I was even more disappointed to learn that I’d have to change my content page URL extensions to .aspx. I figured that, with the criticism MS had received over the change from .asp to .aspx, MS would settle on .aspx for good. So this time going with the new MS standard was surely a safe bet, and I again began to slowly change my page extensions :-(

Now MS has come up with yet another extension for file names in URLs, .mspx, which is beginning to show up on some content pages at microsoft.com. Perhaps it is a sign to switch to a web application platform with stable URL filename extensions, like PHP or JSP. (The PHP developers listened to the user community when they tried to introduce the new .php3 filename extension and remained with .php.)

Yes, there are ways to preserve URL filename extensions while changing the underlying technology, but none of them is a good solution:

  • URL Rewriting. There are some URL rewriting engines on the IIS platform, but none is as well supported, as strongly established in the market, or as feature-rich as mod_rewrite on the Apache platform
  • Redirects. The way to do this correctly is via server configuration. On IIS sites at hosting providers, that is often not an option. (A sketch of an application-level fallback follows this list.)
  • Mapping the old extension to the new technology. Since .asp, .aspx and .mspx pages are incompatible, it is impossible to migrate the pages slowly, a few at a time. This also results in an unsupported usage of the platform, and most hosting providers will not do it
  • Staying with a deprecated technology (keeping my pages .asp) is not an option either, since that technology platform is on its way out and new features are not being added to it. Also, as a technologist, I don’t want my site’s pages to display an obsolete technology
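
For completeness, here is what an application-level fallback for the redirect approach could look like in PHP, assuming requests for the old URLs can even be routed to a PHP handler; the paths and domain below are made up, and server configuration remains the correct place to do this.

    <?php
    // Sketch of an application-level permanent redirect. Assumes requests for the
    // old URLs can be routed to this PHP script at all; the proper place for this
    // is still server configuration. The paths and domain below are made up.
    $oldToNew = array(
        '/articles/labels-vs-folders.asp' => '/articles/labels-vs-folders.php',
        '/articles/evergreen-urls.asp'    => '/articles/evergreen-urls.php',
    );

    $requested = strtok($_SERVER['REQUEST_URI'], '?');   // path without the query string

    if (isset($oldToNew[$requested])) {
        // A 301 tells browsers, linking sites and search engines that the move is
        // permanent, which addresses most of the breakage listed earlier.
        header('HTTP/1.1 301 Moved Permanently');
        header('Location: http://www.example.com' . $oldToNew[$requested]);
    } else {
        header('HTTP/1.1 404 Not Found');
        echo 'Page not found.';
    }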

The fact that microsoft.com’s own pages have been changing extensions from .asp to .aspx to .mspx is a sign that, because these technologies were not designed to be backward compatible, sites will have to keep changing their page extensions.

Ideally, content publishers and readers should not have to deal with these issues. Perhaps I should use a URL rewriter and completely do away with URL filename extensions on my site. Then I could have some pages as .asp, some as .aspx, some as .php, and show readers only a uniform .htm extension (or no extension at all). Maybe I will move to PHP and do this, as Michael Radwin at Yahoo suggests in his blog.
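
If I did drop extensions entirely, one way it could work, sketched here under the assumption that a rewrite rule routes every request to a single PHP entry point (the path-to-file mapping is made up), is a small front controller that maps clean, extensionless paths to whatever file happens to implement them today:

    <?php
    // Front-controller sketch: readers see only extensionless URLs, while the file
    // behind each URL can be renamed or re-implemented without the URL changing.
    // Assumes a rewrite rule routes every request to this script; the mapping is made up.
    $routes = array(
        '/articles/labels-vs-folders' => 'pages/labels-vs-folders.php',
        '/articles/evergreen-urls'    => 'pages/evergreen-urls.html',   // a static page works too
    );

    $path = strtok($_SERVER['REQUEST_URI'], '?');   // drop any query string

    if (isset($routes[$path])) {
        include $routes[$path];   // serve whichever file backs this URL today
    } else {
        header('HTTP/1.1 404 Not Found');
        echo 'Page not found.';
    }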

  [1] evergreen content: pages expected to serve their purpose for a long time.