BEYOND HTML: markup languages and the future of electronic information

A review article (published in Australian Academic and Research Libraries, v. 29 (2) (1998), 150-156)

by Toby Burrows, Scholars’ Centre, The University of Western Australia Library

The World Wide Web is one of the greatest inventions of the twentieth century. As a kind of amalgam of the library, the art gallery, the museum, and the shopping centre, with elements of the postal system and the telephone service thrown in, it is rapidly becoming an indispensable part of our social fabric. When all the advertisements for major corporations include a Web address as a matter of course, as they are now beginning to do, the Web has moved into the mainstream of the corporate world - and that means into the centre of Western society. While we may choose whether or not to buy the products of these companies, we have little choice but to use the Web if we want to stay up-to-date.

One of the key pillars on which the Web is built is HTML, the HyperText Markup Language. It is HTML which supplies the fundamental structure of Web pages or documents. As the term "markup language" suggests, HTML consists of a set of rules for marking up a document, which are similar to the grammar and syntax of a language. These rules describe how to insert tags into a text to indicate its structure and functions. Independent of any particular hardware or software, HTML is simple in its scope and contents but powerful enough to act as a container for the elaborate programming of database searching and computer graphics. A knowledge of HTML has become, in only the last two or three years, one of the essential skills of an information professional.

The number of books published about HTML and the World Wide Web is now quite staggering, as a visit to your local bookshop will readily confirm. More surprisingly, perhaps, there has recently been a sudden rush of publications devoted to the concepts and standards from which HTML derives - and particularly to SGML, the Standard Generalized Markup Language. At least a dozen major books about SGML have appeared in the last six months or so, and more can be expected in 1998. This is, after all, a very recherché field with only a handful of commercial titles published in the previous decade. Why the sudden surge of interest?

In a nutshell, it’s because large corporations have discovered SGML. Most companies produce vast amounts of internal documentation - product specifications, training manuals, policy and standards manuals, catalogues, and so on. Traditionally, this was in printed form, often in looseleaf binders designed for updating, but now much of it is electronic, often delivered over internal networks (Intranets) or over the World Wide Web. To manage all this information efficiently is a critical factor in the success of a business. The costs of rekeying information can be very high, whether this is in order to transfer it to a different format or simply for the purpose of updating it. An effective document management system is essential to avoid this kind of waste, and SGML is increasingly being seen as the best foundation for such a system.

SGML is not in itself a markup language. Strictly speaking, it is the international standard (ISO 8879) for document representation. The rules it prescribes are generic ones for constructing markup languages, of which HTML is merely one example. These rules cover such things as the correct form which markup should take, valid ways of describing document structures, and the correct method for identifying and coding the components, or elements, of a document. The SGML standard provides a formal basis for developing specific markup languages, each of which is defined in a Document Type Definition or DTD. It also contains rules for validating documents and Document Type Definitions against the standard.
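
The idea of validating a document against a Document Type Definition can be illustrated with a very rough sketch. It is not taken from any of the books reviewed here, and it makes several assumptions: the element names (memo, to, from, body, para) are invented, the document is written in XML-style syntax so that Python's standard xml.etree.ElementTree module can read it, and the "DTD" is reduced to a simple table of which elements may appear inside which. A real DTD is written in SGML's own declaration syntax and expresses far more (order, repetition, attributes, entities).

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a DTD: for each declared
# element, the set of child elements it is allowed to contain.
ALLOWED_CHILDREN = {
    "memo": {"to", "from", "body"},
    "to": set(),
    "from": set(),
    "body": {"para"},
    "para": set(),
}


def validate(element, errors=None):
    """Collect structural errors found in `element` and its descendants."""
    if errors is None:
        errors = []
    if element.tag not in ALLOWED_CHILDREN:
        errors.append(f"undeclared element <{element.tag}>")
        return errors
    for child in element:
        if child.tag not in ALLOWED_CHILDREN[element.tag]:
            errors.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            # Only descend into children that are legal at this point.
            validate(child, errors)
    return errors


GOOD = "<memo><to>Staff</to><from>Library</from><body><para>Hello.</para></body></memo>"
BAD = "<memo><to>Staff</to><body><heading>Hello</heading></body></memo>"

good_errors = validate(ET.fromstring(GOOD))
bad_errors = validate(ET.fromstring(BAD))
print(good_errors)
print(bad_errors)
```

The point of the sketch is only that the rules live outside the document and can be checked mechanically, so that the same rule table could be used to check any number of memos.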

The power and value of SGML come from the way in which it focuses on the content of a document. By describing a document’s content and structure rather than its format and appearance, SGML-based markup avoids the problems associated with the formatting used by proprietary software. Instead of using the "bold" and "14-point" features of a word-processing program to represent a heading, for example, an SGML-derived language uses a tag like <head> or <h1> as part of the document, which is saved as a plain-text file, and not in a proprietary file format like Microsoft Word or WordPerfect. This approach maximizes the portability of documents across different software platforms. It also provides the basis for publishing a document in a variety of formats - print, CD-ROM, Web - without reformatting or editing the content itself.
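
A small sketch may make the principle of publishing one source in several formats more concrete. It rests on assumptions of my own: the sample document and its tags (article, head, para) are hypothetical, XML-style syntax stands in for SGML so that Python's standard xml.etree.ElementTree module can parse it, and the two output routines are deliberately crude.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical source document: the tags say what each piece of text is,
# not how it should look.
SOURCE = """<article>
  <head>Beyond HTML</head>
  <para>Markup describes what the text is.</para>
</article>"""


def to_plain_text(root):
    """Render for print: heading in capitals with an underline."""
    lines = []
    for element in root:
        if element.tag == "head":
            lines.append(element.text.upper())
            lines.append("=" * len(element.text))
        elif element.tag == "para":
            lines.append(element.text)
    return "\n".join(lines)


def to_html(root):
    """Render for the Web: map structural tags onto HTML tags."""
    parts = []
    for element in root:
        if element.tag == "head":
            parts.append(f"<h1>{escape(element.text)}</h1>")
        elif element.tag == "para":
            parts.append(f"<p>{escape(element.text)}</p>")
    return "\n".join(parts)


root = ET.fromstring(SOURCE)
print(to_plain_text(root))
print(to_html(root))
```

Both outputs come from the same unaltered source text. Changing the look of headings means changing a rendering routine, not rekeying the document.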

There are other advantages to using SGML-based markup too. One of these is in controlling different versions of a document. Material can be added to or removed from the text without losing the original version, as long as the markup language contains codes which can distinguish between sections belonging to the current version and those belonging to earlier versions. Similarly, a single document can be encoded in such a way that some sections are only displayed for particular users or in a particular output format. In this way, an abridged printed version can be produced from the full electronic document without any need for rekeying or cutting the text.
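
The following sketch suggests how such codes might work in practice. It is only an illustration under stated assumptions: the attribute names (status, edition) and their values are invented for the example rather than taken from any real DTD, and XML-style syntax parsed with Python's standard xml.etree.ElementTree module stands in for SGML.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one section is marked as belonging to an earlier
# version, and one is marked as appearing only in the full edition.
SOURCE = """<report>
  <section>Summary of findings.</section>
  <section edition="full">Detailed appendix.</section>
  <section status="withdrawn">Superseded paragraph.</section>
</report>"""


def select_sections(root, edition):
    """Return the text of current sections that belong in `edition`."""
    kept = []
    for section in root.iter("section"):
        if section.get("status") == "withdrawn":
            continue  # earlier version: kept in the file, but not output
        only_in = section.get("edition")
        if only_in is not None and only_in != edition:
            continue  # restricted to a different edition
        kept.append(section.text)
    return kept


root = ET.fromstring(SOURCE)
abridged = select_sections(root, "abridged")
full = select_sections(root, "full")
print(abridged)
print(full)
```

The withdrawn paragraph stays in the source file throughout, which is what allows the earlier version to be recovered later.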

The tags inserted in a document can be a powerful tool for searching, especially with large and complex texts. In an SGML-encoded version of the Bible, for instance, it should be possible to limit keyword searches to chapter headings, or to the scholarly commentary, or to the Biblical text itself, or to any combination of these. Proper names can be marked up to allow for variant spellings, nicknames, or other versions of the name, and searches can be expanded and contracted accordingly. A text without such markup can only be searched for literal strings of characters without any context. SGML-based texts provide a far more sophisticated and subtle basis for searching.
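
A brief sketch shows the difference that markup makes to searching. The sample text, the tag names (heading, verse, commentary, name) and the key attribute used for the standard form of a name are all hypothetical, and XML-style syntax parsed with Python's standard xml.etree.ElementTree module again stands in for SGML.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoded text with headings, verses and commentary kept apart.
SOURCE = """<book>
  <chapter>
    <heading>The Flood</heading>
    <verse>And the flood was forty days upon the earth.</verse>
    <commentary>The flood narrative has parallels elsewhere.</commentary>
  </chapter>
  <chapter>
    <heading>The Tower</heading>
    <verse>And the whole earth was of one language.</verse>
    <commentary>Compare the account of <name key="Noah">Noe</name> above.</commentary>
  </chapter>
</book>"""


def search(root, keyword, tags):
    """Find `keyword` (ignoring case) only inside elements named in `tags`."""
    hits = []
    for element in root.iter():
        if element.tag in tags:
            text = "".join(element.itertext())
            if keyword.lower() in text.lower():
                hits.append((element.tag, text))
    return hits


def find_name(root, standard_form):
    """Return the spellings actually used for a name, via its key attribute."""
    return [n.text for n in root.iter("name") if n.get("key") == standard_form]


root = ET.fromstring(SOURCE)
heading_hits = search(root, "flood", {"heading"})
verse_hits = search(root, "earth", {"verse"})
noah_spellings = find_name(root, "Noah")
print(heading_hits)
print(len(verse_hits))
print(noah_spellings)
```

A literal string search for "Noah" would find nothing in this text, since only the variant spelling appears in it. The markup is what connects the two.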

SGML documents are also capable of containing unlimited internal links and cross-references, as well as links to external documents. Using the same principles, a document can have different kinds of non-textual media embedded within it: images, sound files, video clips, and so on. Computer programs can also be run from within the document, as shown most strikingly on the World Wide Web with such methods as Java applets and CGI scripts. The effectiveness of these techniques within the framework of HTML is testimony to the power of the SGML standard from which HTML derives.

The value of SGML to the corporate sector can be seen in the case studies presented by Chet Ensign in $GML: the Billion Dollar Secret. Among these is the Sybase computer company, which saved about $5 million in a single year after moving its software manuals to an SGML-based publishing system. When the Sikorsky Aircraft Corporation began to use SGML for all the documentation associated with helicopter contracts, the benefits included a dramatic improvement in productivity: an increase of 60% in technical information, with less than half the staff. Most interestingly for an academic audience, the Grolier publishing company adopted SGML for its encyclopaedias and managed to improve the timeliness of its products, while also enhancing the content and reducing costs. If you want a clear illustration of the value of SGML and the problems it can solve, Chet Ensign’s book is essential reading - practical, informative, and light on technical jargon.

In the higher education sector, SGML is important in two ways: for publishing institutional information and academic scholarship, and as the basis of various commercial publications acquired by libraries. Into the latter category fall such titles as the electronic literary texts published by companies like Chadwyck-Healey and legal publications from companies like Butterworths, not to mention numerous HTML-based products. The former category includes electronic text projects at universities like Sydney and Western Australia, as well as course handbooks and student dissertations. But academic institutions have been much slower than the commercial sector to adopt SGML - apart from HTML, of course. This is presumably because companies have a much harsher financial imperative, and have recognized the need to invest in SGML to increase their profit margins in the future. For universities and colleges, the initial cost of adopting SGML is harder to recoup quickly.

This cost arises, in part, because SGML, while simple enough at the conceptual level, is complex to learn and apply in detail. To someone without a background in document markup, even the introductory books tend to look daunting and technical. Bill von Hagen’s SGML for Dummies does an excellent job of conveying the basic concepts and rules of SGML in a readable and often humorous way. It also contains a good deal of useful information and sensible advice on business reasons for using SGML and on document management systems. But, because SGML operates at a conceptual level rather than a practical one, it’s impossible for von Hagen to write a step-by-step textbook aimed at helping the reader to master a specific task, in the way that guides to using HTML can. Nor can he avoid devoting a substantial part of his book to the details of Document Type Definitions and their components: elements, attributes, entities, and the like. Nevertheless, he has managed to produce one of the clearest and most straightforward introductions both to the rules of the SGML standard and to the surrounding issues in document management.

Neil Bradley’s The Concise <SGML> Companion is more technical in approach. Aimed at programmers, analysts, and consultants specializing in this field, it contains a thorough and systematic guide to the principles and concepts of the SGML standard. A particularly good feature is an extensive glossary of hundreds of terms connected with SGML, all explained concisely and clearly, with plenty of cross-references. Another distinctive feature is a series of "Road Map" charts designed to show the hierarchical structure of the SGML standard in diagrammatic form. Almost 200 syntactic components are covered, each with brief examples. This visual approach is missing from the standard itself, and is a great addition to it. Features like this make The Concise <SGML> Companion an excellent reference guide for anyone working with SGML-based document management systems. It is not an introduction to SGML for the uninitiated, however, nor a discussion of the uses to which SGML can be put.

The more advanced books are definitely off-putting. Don’t open Steven DeRose’s SGML FAQ Book unless you’re fully conversant with the concepts of SGML and have had considerable experience in applying it in practice. It contains a range of very useful, expert advice but the problems it addresses are highly technical and relevant only to other experts or would-be experts. Its sub-title "Understanding the Foundation of HTML and XML" is quite misleading, since HTML is hardly mentioned. But, for people working with raw SGML markup or developing their own markup languages (DTDs), DeRose has the answers to more than a hundred difficult questions.

While the cost of acquiring expertise in SGML needs to be taken into account, another significant area of cost is the need for software to carry out the various stages of publishing and managing SGML documents. There is an increasingly vast array of this kind of software, some of it very specialized and some of it very expensive. The SGML Buyer’s Guide is designed to identify all this software. Compiled by a team of experts led by Charles Goldfarb (the inventor of SGML), this is indeed the definitive directory. The heart of the Buyer’s Guide is a descriptive listing of over 150 software tools, arranged according to the type of function they perform: editing and composition, document conversion, electronic delivery, workflow management, and so on. In most cases, the descriptive information has been supplied by the manufacturer, and there is no attempt at a comparative analysis or rating of similar products. Also included is a supplement consisting of advertising blurbs from the software companies which sponsored the book. An accompanying CD-ROM contains 45 free pieces of software, though it comes without documentation and is not really suitable for beginners.

To cope with this plethora of products, Goldfarb and his colleagues also offer a new decision-making approach for selecting SGML-based software. Known as HARP analysis, this methodology is intended to provide a formal logic for analysing an existing publishing system, and for identifying the tools needed for particular functions. It uses diagrams which superficially resemble flow charts but are actually designed to track the different representations a document goes through during a publishing process. The abstract and formal nature of HARP analysis is likely to prove off-putting to the general reader, but specialists in SGML and information management will find it of great value as a rigorous analytical tool. It is typical of the approach taken by the Buyer’s Guide, which avoids personal and anecdotal judgments and focuses on the formal properties of SGML-based publishing systems and the software required to operate them.

SGML-based software is also the focus of Bob DuCharme’s SGML CD book, but his aims are much more specific and limited. For experienced SGML users, he provides a set of tools for composing, editing, validating, and publishing SGML documents. All the programs run under Windows 95 and Windows NT. They are all freely available over the Web, and they offer an alternative to the expensive commercial software used by large corporations and by some universities. The value of DuCharme’s book is in the detailed instructions and guidance he provides for each program. With SGML CD, you can build your own SGML publishing system at no immediate cost. This is not for amateurs, however. You will need to have a fairly thorough understanding of SGML already, and to get the full benefit from this package it will also help to be familiar with the programming languages Perl and C.

The relationship between SGML and HTML is a particularly important one, and is addressed by several recent authors. Strictly speaking, HTML is a markup language constructed under SGML rules and embodied in a Document Type Definition. In comparison with other DTDs, however, HTML is very limited. It has only a small number of tags, most of which are concerned with the formatting and presentation of documents on a computer screen. There is little in the way of structural or analytical markup. This simplicity is HTML’s great strength, of course, and has been one of the key elements in its ready acceptance. Another important factor has been the easy availability of software designed for interpreting and displaying documents marked up with HTML. This proved to be something of a double-edged sword, however, as software companies like Netscape quickly became dissatisfied with HTML’s limitations and began to develop their own additions to HTML. The result was several different varieties of HTML, which meant that documents constructed in accordance with one variety did not display properly with software based on another variety. The World Wide Web Consortium is still grappling with the ramifications of this problem.

One approach to the SGML/HTML relationship can be found in a book by Murray Maloney and the late and much-lamented Yuri Rubinsky, one of the greatest enthusiasts for - and popularizers of - SGML. Their SGML on the Web is not really what the title suggests: an analysis of how to provide the full capacity of SGML over the World Wide Web. Instead, it is a guide to SGML for those familiar with HTML but not with the principles which form its foundation. By taking HTML and explaining it in terms of SGML, Rubinsky and Maloney enable the HTML user to master the complexity of the standard in a comparatively straightforward way. Their approach is clear, concise, and often humorous, and uses a step-by-step method to build up a deeper understanding from a firm base. At each step there are several examples to be worked through, using files provided on the accompanying CD-ROM. Pedagogically, this is a well-designed course which would benefit any HTML user who wants to know what SGML is all about, but would prefer to build on what they already know.

Martin Bryan covers some of the same ground, in a quite different way, in his SGML and HTML Explained. This is a completely revised version of Bryan’s 1988 book, SGML: an Author’s Guide, which was one of the first guides to SGML. The new version retains the approach of the original text: a formal analysis of the characteristics of SGML, starting from first principles. HTML (version 3.2) is tackled in much the same way. Bryan manages to convey a great deal of technical information in a very effective and succinct way, but his book is best considered as a reference guide rather than an instructional manual. It will certainly be valuable for users who have some knowledge of SGML or of HTML, but it is not the book to give a beginner in either field. Nor does it have much to say on the broader issues and practical questions involved in the relationship between these two languages, particularly in the context of the World Wide Web. A book on these topics would be very welcome indeed.

The limitations of HTML, and the problems arising from proprietary versions of HTML, are widely recognized. But there is still a widespread feeling that the full SGML standard is too complex to be a real alternative, except for large-scale electronic publishing. In response to this, a third way is now being rapidly developed, in the form of the Extensible Markup Language, or XML. XML is designed to bring the power and flexibility of SGML to the World Wide Web without requiring the full complexity of the standard. Instead of the fixed and limited markup of HTML, XML will allow creators of Web sites to devise their own markup schemes. Instead of the so-called "Tag Wars" between software companies over extensions to HTML, XML will permit a great variety of extensions to co-exist. But XML will provide a clear set of rules to guide all this activity.

XML is still in its preparatory stages. But software for producing and browsing XML documents is rapidly being developed by the major companies, and we can expect to see XML beginning to replace HTML for more complex Web documents and applications in the near future. The first comprehensive guide to XML is Richard Light’s Presenting XML. While he covers all the technical details of XML’s structure and rules, in a very clear and precise way, Light also examines the important larger questions surrounding XML, particularly its relationship to SGML and HTML and its potential future applications. Especially interesting are his thoughts on the probable role of XML in enabling the automated transfer of data between businesses, and between information providers and their customers. His analysis of the potential of XML for automating the interchange of museum catalogue records is very relevant to academic and research libraries. There has already been significant work on reconsidering MARC records in the light of SGML, and XML is likely to give added impetus to this process. Light’s book, though largely technical in approach and content, is highly recommended to anyone interested in the future of the Web and of methods for sharing electronic information.

Another recent book about XML, edited by Dan Connolly, is much more of a mixed bag. Also published as an issue of the World Wide Web Journal, its contents are mostly technical articles about specific applications of XML, accompanied by documents from the World Wide Web Consortium. The authors are all experts drawn from the main organizations and software companies involved in this field. But there are also some more general papers which take up the question of the broader significance of markup languages. Jon Bosak ("XML, Java, and the future of the Web") provides some good insights into the possible commercial uses of XML for Web-based information provision, while Connolly, Khare and Rifkin ("The evolution of Web documents: the ascent of XML") give an overview of the imperatives which led to the emergence of XML. There are even two dissenting, eccentric voices: David Siegel (self-styled "Web terrorist") explains how he ruined the Web by his unorthodox adaptations of HTML, and Ted Nelson (visionary inventor of hypertext) argues against the very concept of embedding markup in the document itself.

Why do markup languages matter to academic and research libraries? The short answer is that we are in the business of designing and maintaining electronic information services. While all libraries offer some kind of static Web pages, many are now experimenting with more complex and dynamic Web-based services involving a mixture of HTML, Java, Perl, and CGI scripts to interface with databases. Even the venerable on-line catalogue is migrating to the Web. This environment involves much more than simply running a vendor’s integrated management system. A whole range of computer applications must be designed and programmed, and an integrated information architecture must be developed. The resulting electronic information services must be effective in delivering their products and efficient in their use of resources.

The format in which these services store their data is critical to their effectiveness, and markup languages are a vital component of such formatting. An understanding of markup languages and their differing capabilities and uses will be essential. But HTML will not be enough. The Web-based information services of the future will need to look beyond HTML to the power of SGML and the flexibility of XML. Otherwise, they will be limited in their scope and incapable of delivering the sophisticated features which their users will come to expect.

Books Reviewed

(Bradley) The Concise <SGML> Companion by Neil Bradley. Harlow: Addison-Wesley, 1997 xi, 324 p. ISBN 0-201-41999-8 A$41.95

(Bryan) SGML and HTML Explained by Martin Bryan. 2nd ed. Harlow: Addison Wesley Longman, 1997 xx, 234 p. + CD-ROM ISBN 0-201-40394-3 A$59.95

(Connolly) XML: Principles, Tools, and Techniques edited by Dan Connolly. Sebastopol, CA: O’Reilly, 1997 (World Wide Web Journal, vol. 2 no. 4) ix, 248 p. ISBN 1-56592-349-9 A$59.95

(DeRose) The SGML FAQ Book: Understanding the Foundation of HTML and XML by Steven J. DeRose. Boston: Kluwer, 1997 (Electronic Publishing Series) xxvi, 250 p. ISBN 0-7923-9943-9

(DuCharme) SGML CD by Bob DuCharme. Upper Saddle River, N.J.: Prentice Hall PTR, 1997 (Charles F. Goldfarb Series on Open Information Management) xix, 353 p. + CD-ROM ISBN 0-13-475740-8 A$89.95

(Ensign) $GML: the Billion Dollar Secret by Chet Ensign. Upper Saddle River, N.J.: Prentice Hall PTR, 1997 (Charles F. Goldfarb Series on Open Information Management) xxvi, 213 p. ISBN 0-13-226705-5 A$59.95

(Goldfarb) SGML Buyer’s Guide by Charles F. Goldfarb, Steve Pepper, Chet Ensign. Upper Saddle River, N.J.: Prentice Hall PTR, 1997 (Charles F. Goldfarb Series on Open Information Management) xxxv, 1135 p. + CD-ROM ISBN 0-13-681511-1 A$89.95

(Light) Presenting XML by Richard Light. Indianapolis: Sams.net, 1997 xxix, 414 p. ISBN 1-57521-334-6 A$59.95

(Rubinsky) SGML on the Web: Small Steps Beyond H.T.M.L. by Yuri Rubinsky and Murray Maloney. Upper Saddle River, N.J.: Prentice Hall PTR, 1997 (Charles F. Goldfarb Series on Open Information Management) xxvi, 501 p. + CD-ROM ISBN 0-13-519984-0 A$59.95

(von Hagen) SGML for Dummies by Bill von Hagen. Foster City, CA: IDG Books, 1997 xxiv, 383 p. + CD-ROM ISBN 0-7645-0175-5 A$59.95


Last updated: October 1998