The topic of integrating legacy technology systems with web technology systems often comes up in the newspaper industry.

There is a key difference between Content Legacy Companies (e.g. newspapers) and Other Legacy Companies (e.g. pharmaceuticals). With the world wide web and information technology becoming part of everyday life, every company becomes a content company in certain ways.

In the case of other legacy companies like pharmaceuticals, aeronautics, construction, etc., the legacy products are neither going away nor changing as drastically because of the world wide web and IT as they are for content legacy companies like newspapers.

For other legacy companies, it makes sense to integrate the web systems like content management with their core products because their other core products are not fading away as a result of the web and IT.

However, for print media companies like newspapers, whose legacy has been content, the product in its legacy form is going away as a direct result of the web and IT. So for them it may make sense not to spend too much effort on integrating legacy systems with web systems. Instead, it may be a better strategy to spend more resources on enhancing and upgrading the web systems and digital media products. So for newspapers today, the 1990s holy grail of having one seamless print+web content management system may be less relevant in 2007. It may actually make better business sense to keep the print CMS and web CMS separate, focus more on web and digital media, and allow newspapers printed on paper to retire over the next decade.

I’m building a pull-down-menu navigation for the rajiv.com site using the Google Web Toolkit (GWT) and I’m impressed by this Google product.

It allows you to create user interface (UI) widgets and dynamic functionality for your web app using the Java programming language. You develop and debug your app in the Eclipse Integrated Development Environment, just like you do any other Java app. When you are done, GWT translates the Java app into AJAX technologies: JavaScript, HTML, CSS and XML. This gives you the advantages of both worlds: You program and test using the robust Java platform and the final output is in AJAX (no Java applets at all) which works consistently across most modern web browsers.

Developing a web page UI using GWT, Eclipse and Java saves a lot of time over the alternative of coding all the AJAX (JavaScript, HTML, CSS, XML) by hand. GWT also takes care of issues like cross-browser compatibility and keeping the AJAX UI from conflicting with the browser’s back button, both of which would otherwise require extra coding and testing.

What’s also great is that the generated HTML pages are clean and nicely documented using comments, all automatically done by GWT.

Java programmers should welcome GWT since it gives them the ability to create rich dynamic HTML functionality in the robust environment they are familiar with. You can view the Java source code here.

www.rajiv.com now has a new hosting provider: Kattare Internet Services. The web site is now powered by Java and related technologies. My previous Internet hosting provider Brinkster gave me good service for many years, but when I converted my site from Microsoft .NET and ASP technologies to Java, I had to move to a service provider that supports Java and related technologies. Kattare was my choice because they have already been providing great service to www.cofax.org for years. Kattare’s customer service is excellent and they provide a rich collection of technologies and products.

I was among the earliest adopters of both ASP and ASP.NET technologies (quote from Bill Gates), and my personal web site’s move to Java in no way reflects a preference for Java over MS .NET. I like both platforms.

The templates for the pages on www.rajiv.com are powered by Tiles, a component of Struts. (Thanks to my colleague Magesh for suggesting it.) With the change from .NET and ASP to Struts and Tiles, the pages now have .htm extensions instead of .aspx and .asp. The old .aspx and .asp pages are redirected to the .htm ones using Apache’s mod_rewrite. The web pages on this site appear to be static HTML, with search-engine-friendly paths and page names ending in .htm, but internally they are dynamically assembled from components using Struts and Tiles.
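The redirect from the old extensions to .htm can be sketched in a few lines. This is only an illustration, in Python, of the kind of mapping a mod_rewrite rule performs; it is not the site’s actual configuration. The function name `redirect_target`, the sample paths and the choice of a 301 (permanent) status are all assumptions.

```python
import re

# Matches a path ending in the old .asp or .aspx extension,
# optionally followed by a query string.
OLD_EXTENSION = re.compile(r"^(?P<stem>/[^?]*)\.aspx?(?P<query>\?.*)?$", re.IGNORECASE)


def redirect_target(path):
    """Return (status, new_path) for an old-style URL, or None if no redirect applies."""
    match = OLD_EXTENSION.match(path)
    if match is None:
        return None
    # Assumption: 301 marks the move as permanent for browsers and search engines.
    return 301, match.group("stem") + ".htm" + (match.group("query") or "")


print(redirect_target("/about/resume.aspx"))        # (301, '/about/resume.htm')
print(redirect_target("/photos/index.asp?page=2"))  # (301, '/photos/index.htm?page=2')
print(redirect_target("/about/resume.htm"))         # None
```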


Being able to assign labels to content to organize information for searching is superior to placing content in folders for manual browsing. The folder concept may be suitable to physical documents on paper, but does not lend itself well to digital information. The labels concept combined with an effective search capability is a faster way to organize content and find information.

Organizing content is a means to the end goal of finding information. Since organizing content is not a goal by itself, it should be as simple, and require as little work, as possible while still meeting the goal of finding information.

The folder concept has many limitations:

  • A particular item of content can only belong to one folder. Placing it in two folders requires either:
    • Making duplicates. This is problematic to maintain.
    • Using links. This is problematic too: With ‘soft links’ the content resides in only one folder and if that folder is deleted, the content is deleted too. With ‘hard links’, it is hard to know how many ‘folders’ contain this content and unlinking the last one may unintentionally erase it.
  • Similarly, folders can only be contained within one folder.
  • To organize content well in folders requires deep levels of sub-folders. These can be a challenge to browse.
  • All content must be placed in a folder for it to be well organized in this scheme. Doing this manually is a burden. Setting up rules for some of the content to be automatically placed in folders relieves the burden to a certain extent. However, after a rule has run and placed a content item in a folder, if the rule was found to have been flawed and it mixed the content in with other content in the wrong folder, it can be a bigger burden to find the content and place it in the right folder.
  • Folders are static. Search results are dynamic. With computing power available to the common person growing, dynamic search makes better sense than static folders which put some of the work on the user rather than the computer.
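The first limitation above, that an item can live in only one folder, disappears when labels are kept in an index that points at the item instead of containing it. Here is a minimal sketch in Python; the class name `LabelIndex` and the sample ids and labels are invented for illustration.

```python
from collections import defaultdict


class LabelIndex:
    """Inverted index from a label to the ids of the items carrying that label."""

    def __init__(self):
        self._items_by_label = defaultdict(set)

    def add(self, item_id, labels):
        # One stored item can carry any number of labels; nothing is duplicated.
        for label in labels:
            self._items_by_label[label].add(item_id)

    def find(self, *labels):
        """Return the ids of items that carry every one of the given labels."""
        if not labels:
            return set()
        sets = [self._items_by_label.get(label, set()) for label in labels]
        return set.intersection(*sets)


index = LabelIndex()
index.add("msg-1", ["email", "travel"])
index.add("msg-2", ["email", "receipts"])
index.add("doc-7", ["travel", "receipts"])

print(sorted(index.find("email")))               # ['msg-1', 'msg-2']
print(sorted(index.find("travel", "receipts")))  # ['doc-7']
```

Removing a label from an item only removes an index entry, so there is no equivalent of deleting the last hard link by accident.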

It should not be mandatory to apply all appropriate labels to all content by hand. If the automated content categorization being used employs techniques like artificial intelligence and pattern recognition, and can determine that this article is about personal information management or content management, then applying that particular label should not be mandatory.

As the number of labels grows, the labels should not be organized in a taxonomy tree with a folders/sub-folders structure. Such a tree structure has the problems of folders associated with it. The labels should be associated with each other in complex relationships as ‘concepts’ in a language.

For example, placing the label “computing” on a piece of content should return it in search results for “technology”. Placing the label “personal information management” on it should return it in search results for the concept “email”. Note that in a traditional taxonomy tree, “computing” could be a child of “technology”, but “personal information management” could be a parent of “email”.
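One way to picture such relationships is as a small directed graph of concepts rather than a tree. The sketch below is in Python and uses only the two relationships from the example above; the names `ALSO_MATCHES`, `expand` and `search` and the sample items are assumptions made for illustration.

```python
# Searching for the key should also match content carrying any of the listed labels.
ALSO_MATCHES = {
    "technology": {"computing"},
    "email": {"personal information management"},
}


def expand(concept):
    """Return the concept plus every label reachable from it through ALSO_MATCHES."""
    seen = set()
    pending = [concept]
    while pending:
        current = pending.pop()
        if current not in seen:  # guards against cycles between concepts
            seen.add(current)
            pending.extend(ALSO_MATCHES.get(current, ()))
    return seen


def search(concept, labels_by_item):
    """Return the ids of items carrying at least one label related to the concept."""
    wanted = expand(concept)
    return sorted(item for item, labels in labels_by_item.items() if wanted & set(labels))


labels_by_item = {
    "article-1": ["computing"],
    "article-2": ["personal information management"],
    "article-3": ["sports"],
}

print(search("technology", labels_by_item))  # ['article-1']
print(search("email", labels_by_item))       # ['article-2']
```

Because the relationships are plain edges, a label can be related to any number of concepts without having to pick a single parent.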

However, web page URLs as they are commonly used, especially on static-HTML sites, are based on the concept of folders, and this is a challenge for a label-based scheme. URLs don’t have to be folder-like in their appearance, though. For example, all the news articles on a site could have URLs like “phillynews.com/ra23px4” instead of something like “phillynews.com/sports/ice_hockey/flyers/04-08-27-victory.htm” or “phillynews.com/inquirer/2004/08/27/sports/flyers-victory.htm”. In this fictitious example, “ra23px4” is an automatically generated, short and easy-to-type id pointing to the article, like the shortcuts generated by services like tinyurl.com and metamark.net.
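How such an id gets generated is left open above. One simple possibility, shown here as a Python sketch, is to encode the article’s numeric database key in base 36 (digits plus lowercase letters). The function names and the idea of a sequential key are assumptions; this is not how tinyurl.com or metamark.net necessarily work.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # 36 characters, all easy to type


def short_id(number):
    """Encode a non-negative integer (e.g. an article's database key) in base 36."""
    if number < 0:
        raise ValueError("number must be non-negative")
    digits = []
    while True:
        number, remainder = divmod(number, 36)
        digits.append(ALPHABET[remainder])
        if number == 0:
            break
    return "".join(reversed(digits))


def article_number(short):
    """Decode a short id back to the integer it was generated from."""
    return int(short, 36)


print(short_id(35))  # z
print(short_id(36))  # 10
print(short_id(article_number("ra23px4")))  # ra23px4
```

Seven base-36 characters cover more than 78 billion articles, so the ids stay short for any realistic archive.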

Let us consider the organization of email. It seems to be headed in this direction. Some examples in the email space are Google’s GMail, Microsoft’s LookOut Search Plugin for Outlook, and Nelson Email Organizer (NEO).

Some possible labels for this document: “personal information management”, “content management”, “computing”, “technology”.

Changing the URLs of pages containing narrative content like articles has several disadvantages, especially for a content site:

  1. Readers’ bookmarks to the site’s pages break
  2. Links archived in electronic mediums (e.g. emails, documents) and print mediums (e.g. books, magazines, newspapers) to evergreen content [1] like articles or news stories break
  3. Incoming links from other sites break
  4. Search engines drop the ranking of the pages
  5. It becomes harder for readers of the site to find content
  6. The site loses credibility with the readers
  7. The points above result in a significant loss of traffic to the pages, which in turn results in a loss of revenue

The idea of permanent links to content is gaining renewed popularity with blogs. Almost every blog entry has a ‘permanent link to this item’ link.

Years ago, when I decided to move my web site from an html+cgi platform to a better dynamic web site platform, I selected Microsoft’s Active Server Pages (.asp). I was disappointed that all my content page URLs were going to have to change from the .html extension to .asp, but I reasoned it would be a one-time change. Going with Microsoft’s new standard seemed a safe bet, so I did :-(

A few years later, when the .NET platform came along, I was even more disappointed to learn that I’d have to change my content page URL extensions to .aspx. I figured that with the criticism MS had received for the change from .asp to .aspx, MS would settle on .aspx for good. So this time, going with the new MS standard was surely a safe bet, and I slowly began to change my pages’ extensions again :-(

Now MS has come up with yet another extension for file names in URLs, .mspx, which is beginning to show up on some content pages at microsoft.com. Perhaps it is a sign to switch to a web application platform with stable URL filename extensions, like PHP or JSP. (The PHP developers listened to the user community when they tried to introduce the new .php3 filename extension and remained with .php.)

Yes, there are ways to preserve URL filename extensions while changing the underlying technology, but none of them is a good solution:

  • URL Rewriting. There are some URL rewriting engines on the IIS platform, but none is as well-supported, strongly established in the market, or feature-rich as mod_rewrite on the Apache platform.
  • Redirects. The way to do this correctly is via server configuration. On IIS sites at hosting providers, that is often not an option.
  • Mapping the old extension to the new technology. Since .asp, .aspx and .mspx pages are incompatible, it is impossible to slowly migrate the pages, a few at a time. This also results in an unsupported usage of the platform. Most hosting providers will not do this.
  • Staying with a deprecated technology (keeping my pages .asp). This is not an option either, since that technology platform is on its way out and new features are not being added to it. Also, as a technologist, I don’t want my site’s pages to display an obsolete technology.

The fact that microsoft.com’s own pages have been changing extensions from .asp to .aspx to .mspx is a sign that, because these technologies were designed without backward compatibility, sites will have to change their pages’ extensions.

Ideally, content publishers and readers should not have to deal with these issues. Perhaps I should use a URL rewriter and completely do away with URL filename extensions on my site. Then I could have some pages as .asp, some as .aspx, some as .php and show readers only a uniform .htm extension (or no extension at all). Maybe I will move to PHP and do this as Michael Radwin at Yahoo suggests in his blog.
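The rewriter idea boils down to a lookup from the public name of a page to whatever resource serves it today. A minimal sketch in Python follows; the table `INTERNAL_PAGES`, the function `resolve` and the page names are hypothetical, and a real site would do this in the web server’s rewrite layer rather than in application code.

```python
# Hypothetical table: public page name (no extension) -> internal resource.
INTERNAL_PAGES = {
    "/index": "/index.php",
    "/resume": "/resume.aspx",
    "/guestbook": "/guestbook.asp",
}


def resolve(public_path):
    """Map a public URL such as /resume.htm or /resume to its internal resource."""
    if public_path.endswith(".htm"):
        stem = public_path[: -len(".htm")]
    else:
        stem = public_path
    # Returns None when no page is registered under that public name.
    return INTERNAL_PAGES.get(stem)


print(resolve("/resume.htm"))  # /resume.aspx
print(resolve("/resume"))      # /resume.aspx
print(resolve("/missing.htm")) # None
```

Migrating a page to a different technology then means changing one value in the table, while the URL that readers bookmarked stays the same.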

  1. evergreen content: pages expected to serve their purpose for a long time.

BeanShell is a fully Java-compatible scripting language, capable of interpreting ordinary Java source files. You can also use it for working with Java interactively, like an interpreted Unix shell or Perl. You can try out Java’s object features, APIs, GUI widgets and other libraries hands-on.

BeanShell is free and also ships bundled with popular applications such as BEA Weblogic, Forte for Java and the NetBeans IDE.

Can’t find what you are looking for using Google? There are other search engines too. For specific searches, some of these may have their own unique advantages over Google. Google is still great too, but isn’t the only option around anymore.

Update: 2008-Feb-02: The above list is now managed as WordPress blogroll links using a plugin called Blogroll Links. So as the Web search engine landscape changes, I can keep the above list current.

Search, when effectively integrated with content, creates a combination that is greater than the sum of the two separately.

Let us consider an example.

A printed phone book has been available to people for decades. The information in it was accessible primarily for one intended purpose: search by name for phone number and home address. Accessing it differently (e.g. search by phone number for home address and name) was practically impossible for most people, even though the information was all there in the phone book. When the same phone book, with the exact same content, is made searchable via a computer, it raises privacy issues. When the same computer-assisted search is made easily available to millions over the Internet, it raises serious privacy concerns. Notice the content didn’t change, but adding search-ability to the content transformed the content into something more powerful.
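The phone book example is easy to make concrete. In the Python sketch below, the same list of entries is indexed twice, once by name and once by phone number; the entries, the `Entry` type and the `build_index` helper are invented for illustration.

```python
from collections import namedtuple

Entry = namedtuple("Entry", "name phone address")

# Invented sample listings.
PHONE_BOOK = [
    Entry("Ada Example", "555-0101", "1 First St"),
    Entry("Bo Sample", "555-0102", "2 Second Ave"),
]


def build_index(entries, field):
    """Index the same entries by any one of their fields."""
    return {getattr(entry, field): entry for entry in entries}


by_name = build_index(PHONE_BOOK, "name")    # the lookup the printed book supports
by_phone = build_index(PHONE_BOOK, "phone")  # the lookup it practically does not

print(by_name["Ada Example"].phone)  # 555-0101
print(by_phone["555-0102"].name)     # Bo Sample
```

The data is identical in both cases; the only thing that changed is the one line that chooses which field to index.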

Search technology is a powerful enabler.

I do not view content and search technology as two separate entities that can be put together to provide better information. Many web sites do this and that is one of the key reasons why their site search is ineffective.

Search is most effective when it is intimately integrated with the content.

Content should not be considered merely a block of text or data. It should be considered an object: a combination of data and functionality. This is similar to an object in the computer science sense of the term object-oriented. The search-ability should not be external to the content; the content itself should be search-enabled. Besides text and data, the content object should include both headers and in-line meta-data to make it more searchable. A better form of content is one that has search-enablement built into it or integrated with it. This search-enablement could be programmable code, rule sets, meta-data or a combination.

Let us discuss an example to illustrate this.

A news media site has many types of content in it. Let us consider two of them: news articles and movie listings. When an external search engine such as Google brings back results from such a site, it does not effectively differentiate between these two types of content. For an external search engine, they are just web pages.

It would be better if the search engine employed on the site had an understanding of the types of content and searched it differently. Some sites such as the new Yahoo Search and C|Net do a fairly good job at this when they bring back search results from different types of content repositories.

When someone searches for “digital cameras”, the following types of content are of relevance to them: product information, product reviews, product storefront. This is because someone looking to buy a digital camera would like to know more about digital cameras, would like to read reviews of different digital cameras and would like to find a place where they can buy one. A search engine that treats all these different types of content the same — as web pages — isn’t very effective. An effective search solution groups these different types of search results for better access.
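Grouping results by content type is simple once each hit carries its type. A minimal Python sketch, with invented sample hits and an assumed `(content_type, title)` shape for each result:

```python
from collections import defaultdict


def group_by_type(results):
    """Group (content_type, title) search hits under their content type."""
    groups = defaultdict(list)
    for content_type, title in results:
        groups[content_type].append(title)
    return dict(groups)


hits = [
    ("product review", "Five compact cameras compared"),
    ("product information", "How digital cameras work"),
    ("product storefront", "Camera deals this week"),
    ("product review", "A month with a pocket camera"),
]

for content_type, titles in group_by_type(hits).items():
    print(content_type, "->", titles)
```

The hard part in practice is knowing the type of each hit in the first place, which is what an external search engine that sees only web pages lacks.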

Search would be even more useful if the content itself (being object-oriented) knew how to interact with the search engine.

This could be achieved using adapters for each type of content. The search engine would talk to the adapter, which would be intimately integrated with the content. This would result in the content (via the adapter) responding differently to different types and combinations of search queries. It would result in a very useful and powerful search for the users.

At a news media site, examples of adapters would include: news article adapter, movie listing adapter, classified ad adapter, etc. These adapters would be implemented using an object-oriented programming language such as Java or C#. Search-ability would be just one aspect of these adapters’ functionality. They would provide a wrapper around the content and provide functionality like accessing and editing the content. An example technical design of these adapters is the subject of my article for a technical audience.
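To give a rough feel for the adapter idea, here is a minimal sketch. It is written in Python for brevity, though the same shape carries over to Java or C#, and it is not the design from the technical article. The class names, the `matches` method and the sample content are all assumptions.

```python
from abc import ABC, abstractmethod


class ContentAdapter(ABC):
    """Wraps one content item and answers search queries on its behalf."""

    def __init__(self, content):
        self.content = content  # a plain dict in this sketch

    @abstractmethod
    def matches(self, query):
        """Return True if this item is relevant to the query string."""


class NewsArticleAdapter(ContentAdapter):
    def matches(self, query):
        # Articles are matched on headline and body text.
        text = (self.content["headline"] + " " + self.content["body"]).lower()
        return query.lower() in text


class MovieListingAdapter(ContentAdapter):
    def matches(self, query):
        # Listings are matched on title or theater only.
        q = query.lower()
        return q in self.content["title"].lower() or q in self.content["theater"].lower()


def search(adapters, query):
    """The engine talks only to adapters; each decides how its content matches."""
    return [adapter.content for adapter in adapters if adapter.matches(query)]


adapters = [
    NewsArticleAdapter({"headline": "Flyers win opener", "body": "The Flyers won at home."}),
    MovieListingAdapter({"title": "Winter Story", "theater": "Main Street Cinema"}),
]

print(search(adapters, "flyers"))       # the news article only
print(search(adapters, "main street"))  # the movie listing only
```

Adding a classified ad adapter would mean writing one more subclass; the search function itself would not change.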

Your feedback on this article will be greatly appreciated. Please share it with me via the contact page.