Toward the Semantic Web

A new standard from the World Wide Web Consortium brings the Web a step closer to realizing the vision of its inventor, Tim Berners-Lee.


When the World Wide Web went live in 1991, it consisted of static pages of text connected to each other by hyperlinks, and that's pretty much what it remained for years. But from the outset, the Web's inventor, Tim Berners-Lee, had envisioned a much more sophisticated Web, a so-called Semantic Web, which wouldn't just store data but would actually know what it meant. Now an MIT professor, Berners-Lee also directs the World Wide Web Consortium (W3C), a standards body whose industrial participants include everybody from Adobe to Yahoo, and which maintains an office at MIT's Computer Science and Artificial Intelligence Lab. The W3C has just published a new standard that should help bring the Semantic Web that much closer to fruition.

If the current Web is like a giant text file, which you can search for instances of particular words, the Semantic Web would be like a database, where every item of information is categorized, and new queries can combine categories in any imaginable way. You could, for instance, search the Web for a restaurant that offers vegetarian lasagna and at least one lamb dish and that sits within a mile of a railway station in a town with a theater. And if you wanted the restaurant's menu, you could pull up just the menu, not page after page of review sites that happened to use the word "menu."

But while an ordinary database has categories selected in advance by a programmer, the Semantic Web is "a database where each person controls their own data," says Sandro Hawke, systems architect at the W3C. "You have your own parts of the database, so you can put whatever data out there that you want."

A giant networked database where people control their own data has obvious advantages: huge numbers of people can contribute to it, and they can ensure that their contributions aren't categorized or recorded incorrectly. But it also has an obvious disadvantage: there's no guarantee that people will organize and label their data in a uniform way.

To take a simple example, suppose that two nearby medical clinics put their staff lists online. Semantic Web technologies would allow the clinics to categorize the information in the lists. But suppose that one clinic chose to label the surnames of its doctors "surname," and the other clinic chose the label "last name." A Web search that listed local doctors by "surname" might not pick up those labeled "last name," and vice versa.

In fact, an existing Semantic Web standard, the Web Ontology Language, solves this problem. The language gives programmers a way to specify that, for instance, "last name," "surname," and maybe "family name" or just "last" all indicate the same type of data.
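The idea of declaring several labels equivalent can be sketched in a few lines of ordinary Python. This is only an illustration of the mapping, not Web Ontology Language syntax (which expresses this kind of relationship with constructs such as owl:equivalentProperty); the label table, the staff records, and the function name below are all made up for the example.

```python
# Hypothetical table declaring which labels mean the same thing.
# The choice of "last name" as the canonical label is arbitrary.
EQUIVALENT_LABELS = {
    "surname": "last name",
    "family name": "last name",
    "last": "last name",
    "last name": "last name",
}


def normalize(record):
    """Return a copy of the record with equivalent labels renamed to the canonical one."""
    return {EQUIVALENT_LABELS.get(label, label): value
            for label, value in record.items()}


# Invented staff lists from two clinics that chose different labels.
clinic_a = [{"surname": "Okafor"}, {"surname": "Lindqvist"}]
clinic_b = [{"last name": "Marchetti"}]

# After normalization, one query finds doctors from both clinics.
doctors = [normalize(record) for record in clinic_a + clinic_b]
surnames = sorted(doctor["last name"] for doctor in doctors)
print(surnames)
```

Once both lists are normalized, a search by "last name" picks up every doctor, whichever label the clinic originally chose.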

The case for rules

But what if a third clinic, while still adopting Semantic Web technology, chooses to dump first names, last names, and middle initials into a single category, labeled "name"? A direct mapping of category to category will no longer work. Instead, unifying the data on different sites requires a rule, such as: put everything up to the first space character in "first name," anything after the last space character in "last name," and anything else in "middle."
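The splitting rule just described is simple enough to write out. The following is a minimal Python sketch of that rule, not RIF syntax; the function name is hypothetical, and the handling of a name with no spaces (treated here as a first name only) is an assumption the rule itself does not settle.

```python
def split_name(name):
    """Split a single "name" value into first name, middle, and last name.

    Everything up to the first space goes in "first name," everything after
    the last space goes in "last name," and whatever lies between goes in
    "middle."
    """
    name = name.strip()
    first_space = name.find(" ")
    last_space = name.rfind(" ")

    # Assumption: a name with no spaces is treated as a first name only.
    if first_space == -1:
        return {"first name": name, "middle": "", "last name": ""}

    return {
        "first name": name[:first_space],
        # Empty when the first and last space are the same character.
        "middle": name[first_space + 1:last_space].strip(),
        "last name": name[last_space + 1:],
    }


print(split_name("Mary J. Smith"))
print(split_name("Mary Smith"))
```

A rule this naive would misfile multiword surnames such as "van der Berg," which is one reason a shared, exchangeable way of writing rules is more useful than a fixed mapping.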

The newly released Semantic Web standard is called the Rule Interchange Format, or RIF, and it gives Web programmers a way to write rules for translating between data on different sites. But that's not the only purpose rules serve on the Web. For instance, Hawke points out, an online retailer might offer customers free shipping if their total purchases exceed some threshold in a given time period; but the retailer's Web servers might store no data about its customers other than individual invoices. The code for sifting through the invoices and determining whether to offer free shipping is another example of a rule. "Part of the standards game is to have these very different use cases around the same table and then get one standard that can be used in all these different pieces of software," Hawke says.
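Hawke's retailer example can be sketched the same way. The Python below is an illustration of such a rule under stated assumptions, not RIF itself: the $100 threshold, the 30-day window, the invoice fields, and the customer names are all invented for the example.

```python
from datetime import date, timedelta

# Assumed values; the article does not specify a threshold or a time period.
THRESHOLD = 100.00
WINDOW = timedelta(days=30)


def qualifies_for_free_shipping(invoices, customer, today):
    """Sum a customer's invoices within the window and compare to the threshold."""
    window_start = today - WINDOW
    total = sum(
        invoice["amount"]
        for invoice in invoices
        if invoice["customer"] == customer
        and window_start <= invoice["date"] <= today
    )
    return total > THRESHOLD


# Invented invoices: the only data the retailer is assumed to store.
invoices = [
    {"customer": "ana", "amount": 60.00, "date": date(2010, 6, 10)},
    {"customer": "ana", "amount": 55.00, "date": date(2010, 6, 20)},
    {"customer": "ana", "amount": 80.00, "date": date(2010, 3, 1)},  # outside the window
    {"customer": "ben", "amount": 40.00, "date": date(2010, 6, 15)},
]

today = date(2010, 6, 30)
print(qualifies_for_free_shipping(invoices, "ana", today))
print(qualifies_for_free_shipping(invoices, "ben", today))
```

The retailer never stores a running total per customer; the rule derives the answer from the invoices each time it is asked.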

If the RIF standard becomes widely adopted, it's likely to go unnoticed by most Internet users. The Web is already replete with pages that aggregate data from other sites: A personalized Google home page, for instance, might include headlines from several different news sources, weather reports from yet another site, and stock prices from still another. When such content aggregators are already popular online destinations, it can be hard to convey exactly what the advantage of a Semantic Web would be. But as Hawke puts it, "You can always build something to aggregate data you already know about"; what the Semantic Web offers is a way to aggregate data you don't already know about. A small site that lists weekend events in a particular neighborhood, for instance, could retrieve data from sources that didn't even exist when it was built, as long as they categorized their data according to Semantic Web standards.

Although it has been nearly 20 years since Berners-Lee launched the first website, if his original idea finally comes to fruition, "it'll happen so quickly that no one will know," Hawke says. "They'll just notice the Internet doing more cool things."

What is the Semantic Web?
Video: Melanie Gonick
Sandro Hawke, systems architect for the World Wide Web Consortium, discusses the new standards that will bring the Semantic Web a step closer to reality.


