Friday, February 23, 2018

What is a document? - Part 7



The word “document” is, like the word “database”, simple on the outside and complex on the inside. 

Most of us carry around pragmatically fuzzy definitions of both in our heads. Since the early days of personal computers, software suites/bundles have included distinct tools for managing “documents” and “databases”, treating them as different types of information object. The first such package I used was called SMART, running on an IBM PC XT in the late Eighties. The machine had a 10 MB hard disk. Today, that is hardly enough to store a single document, but I digress...

I have used many other office suites since then, most of which have withered on the vine in enterprise computing, with the notable exception of Microsoft Office. I find it interesting that of the words typically associated with office suites – namely “database”, “word processor”, “presentation” and “spreadsheet” – the two that are today most tightly bound to Microsoft Office are “spreadsheet” and “presentation”, to the point where “Excel” and “PowerPoint” have become generic terms for them. I also think it is interesting that Excel has become the de facto heart of Microsoft Office in the business community, with Word/Access/PowerPoint being of secondary importance as “must haves” in office environments, but again I digress...

In trying to chip away at the problem of defining a “document”, I think it is useful to imagine having the full Microsoft Office suite at your disposal and asking the question “when should I reach for Word instead of one of the other icons when entering text?” The system I worked on in the Nineties, mentioned previously in this series, required a mix of classic field-type information along with unstructured paragraphs/tables/bulleted lists. If I were entering that text into a computer today with Microsoft Office at my disposal, would I reach for the word processor icon or the database icon?

I would reach for the Word icon. Why? Because there are a variety of techniques I can use in Word to enter/tag field-type textual information, and many techniques for entering unstructured paragraphs/tables/bulleted lists. The opposite is not true. Databases tend to excel (no pun intended) at field-type information but to be limited in their support for unstructured paragraphs/tables/bulleted lists – often relegating the latter to “blob” fields that are second-class citizens in the database schema.

Moreover, these days, the tools available for post-processing Word's .docx file format make it easier than ever to extract classic “structured XML” from Word documents, while retaining the vital familiarity and ease of use for authors/editors that I mentioned previously.
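
To make that concrete, here is a minimal sketch of docx post-processing in Python, using only the standard library. A .docx file is just a zip archive whose main body lives in word/document.xml; the file name and paragraph style names below are hypothetical.

    import zipfile
    import xml.etree.ElementTree as ET

    # WordprocessingML namespace, fixed by the OOXML spec
    W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

    def paragraphs(docx_path):
        """Yield (style, text) for each paragraph in the document body."""
        with zipfile.ZipFile(docx_path) as z:
            root = ET.fromstring(z.read("word/document.xml"))
        for p in root.iter(W + "p"):
            style_el = p.find(W + "pPr/" + W + "pStyle")
            style = style_el.get(W + "val") if style_el is not None else "Normal"
            text = "".join(t.text or "" for t in p.iter(W + "t"))
            yield style, text

    # Paragraphs the author tagged with styles like "CourseDuration"
    # can then be mapped onto fields in a structured XML output.
    for style, text in paragraphs("example.docx"):
        print(style, ":", text)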

Are there exceptions? Absolutely. There are always exceptions. However, if your data structure necessarily contains a non-trivial amount of unstructured or semi-structured textual content, and if your author/edit community wants to think about the content in document/word-processor terms, I believe today's version of Word, with its docx file format, is generally speaking a much better starting point than any database front-end, spreadsheet front-end, web-browser front-end or structured XML editing tool.

Yes, it can get messy to do the post-processing of the data, but given a choice between a solution architecture that guarantees me beautifully clean data at the back end but an author/edit community who hate it, and a solution architecture that involves extra content enrichment work at the back end but leaves the author/edit users happy, I have learned to favor the latter every time.

Note that I did not start there! I was on the opposite side of this for many, many years, thinking that structured author/edit tools, enforcing structure at the front end, were the way to go. I built a few beautiful structured systems that ultimately failed to thrive because the author/edit user community wanted something that did not “beep” at them as they worked on content. When writing the books I wrote for Prentice Hall (books on SGML and XML – of all things!), I myself wanted something that did not beep!

Which brings me (finally!) to my answer to the question “What is a document?”. My answer is that a document is a textual information artifact whose final structure is only obvious after it has been created/modified, and which thus requires an author/edit user experience that gets out of the way of the user's creative process until the user decides to impose structure – if they decide to impose a structure at all.

There is no guaranteed schema validity, other than that most generic of schemas which splits text into flows, paragraphs, words, glyphs etc. and allows users to combine content and presentation as they see fit.

On top of that low-level structure, anything goes – at least until the point where the user has decided that the changes to the information artifact are “finished”. Once the intellectual work has been done figuring out what the document should say and how it should say it, it is completely fine – and generally very useful – to be able to validate against higher-level semantic structures such as “chapter”, “statute”, “washing machine data sheet” etc.
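
To illustrate the sequencing, here is a minimal sketch of that kind of deferred, author-invoked validation in Python. Nothing beeps while the document is in flux; these (entirely hypothetical) business rules run only when the author asks for a check.

    def check_statute(doc):
        """Return a list of human-readable problems, not hard errors."""
        problems = []
        if not doc.get("title"):
            problems.append("No title yet.")
        if not doc.get("sections"):
            problems.append("No sections found.")
        for i, s in enumerate(doc.get("sections", []), 1):
            if not s.get("number"):
                problems.append("Section %d has no number." % i)
        return problems

    draft = {"title": "Data Protection Act", "sections": [{"number": "1"}, {}]}
    for problem in check_statute(draft):
        print(problem)   # the author decides when to run this, and what to fix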

The big lesson of my career to date in high volume document management/processing is that if you seek to impose this semantic structure on the author/edit community rather than have them come to you and ask for some structure imposition, you will struggle mightily to have a successful system.

Friday, February 09, 2018

What is a document? - Part 6



By the late Nineties, I was knee deep in the world of XML and the world of Python, loving the way that these two amazing tools allowed tremendous amounts of automation to be brought to traditionally labor-intensive document processing/publishing tasks. This was boom time in electronic publishing, and every new year brought with it a new output format to target: Microsoft Multimedia Viewer, Windows Help, Folio Views, Lotus Notes and a whole host of proprietary formats we worked on for clients. Back then, HTML was just another output format for us to target. Little did we know that it would eclipse all the others.

Just about twenty years ago now – in the fall of 1998 – I co-presented a tutorial on XML at the International Python Conference in Houston, Texas[1]. At that same conference, I presented a paper on high volume XML processing with Python[2]. Back in those days, we had some of the biggest corpora of XML anywhere in the world, here in Ireland. Up to the early/mid noughties, I did a lot of conference presentations and became associated with the concept of XML processing pipelines[3].

Then a very interesting thing happened. We began to find ourselves working more and more in environments where domain experts – not data taggers or software developers – needed to create and update XML documents. Around this time I was also writing books on markup languages for Prentice Hall[4] and had the opportunity to put “the shoe on the other foot”, so to speak, and see things from an author's perspective.

It was then that I experienced what I now consider to be a profound truth about the vast majority of documents in the world – something that gets to the heart of what a document actually is and distinguishes it from other forms of digital information. Namely, that documents are typically very “structured” when they are finished, but highly unstructured while they are being created or in the midst of update cycles.

I increasingly found myself frustrated with XML authoring tools that would force me to work on my document contents in a certain order and beep at me unless my documents were “structured” at all times. I confess there were many times when I abandoned structured editors for my own author/edit work with XML and worked in the free-flowing world of the Emacs text editor or in word processors with the tags plainly visible as raw text.

I began to appreciate that, in most cases, the ability to easily create/update content is a requirement that must be met if the value propositions of structured documents are to be realized. There is little value in a beautifully structured, immensely powerful back-end system for processing terabytes of documents coming in from domain experts unless said domain experts are happy to work with the author/edit tools.

For a while, I believed it was possible to get to something that authors would like, by customizing the XML editing front-ends. However, I found that over and over again, two things started happening, often in parallel. Firstly, the document schemas became less and less structured so as to accommodate the variations in real-world documents and also to avoid “beeping” at authors where possible. Secondly, no amount of GUI customization seemed to be enough for the authors to feel comfortable with the XML editors.

“Why can't it work like Word?” was a phrase that began to pop up more and more in conversations with authors. For quite some time, while Word's file format was not XML-based, I would look for alternatives that would be Word-like in terms of the end-user experience, but with file formats I could process with custom code on the back end.

For quite a few years, StarOffice/OpenOffice/LibreOffice fitted the bill, and we had a lot of success with it. Moreover, it allowed for levels of customization and degrees of business-rule validation that XML schema-based approaches cannot touch. We learned many techniques and tricks over the years to guide authors in the creation of structured content without being obtrusive and interrupting their authoring flow. In particular, we learned to think about document validation as a function that the authors themselves have control over. They get to decide when their content should be checked for structural and business rules – not the software.

Fast forward to today. Sun Microsystems is no more. OpenOffice/LibreOffice do not appear to be gaining the traction in the enterprise that I suspected they would a decade ago. Google's office suite – ditto. Native, browser-based document editing (such as W3C's Amaya [5]) does not appear to be getting traction either...

All the while, the familiar domain expert/author's mantra rings in my ears “Why can't it work like Word?”

As of 2018, this is a more interesting question than it has ever been in my opinion. That is where we will turn next.



Friday, January 26, 2018

What is a document? - Part 5


In the early Nineties, I found myself tasked with the development of a digital guide to third level education in Ireland. The digital product was to be an add-on to a book-based product, created in conjunction with the author of the book. The organization of the book was very regular. Each third level course had a set of attributes such as entry-level qualifications, duration, accrediting institution, physical location of the campus, fees and so on. All neatly laid out, one page per course, with some free-flowing narrative at the bottom of each page. The goals of the digital product were to allow prospective students to search based on different criteria such as cost ranges, course duration and so on.

Step number one was getting the information from the paper book into a computer, and it is in this innocuous-sounding step that things got very interesting. The most obvious approach – it seemed to me at the time – was to create a programmable database in something like Clipper (a database programming language that was very popular with PC developers at the time). Tabular databases were perfect for 90% of the data – the “structured” parts such as dates, numbers and short strings of text. However, they had no good way of dealing with the free-flowing narrative text that accompanied each course in the book. It had paragraphs, bulleted lists, bold/italics and underline...

An alternative approach would be to start with a word processor – as opposed to a database – as it would make handling the free-flowing text (and associated formatting: bold/italic, bulleted lists etc.) easy. However, the word processor approach did not make it at all easy to process the “structured” parts in the way I wanted to (in many cases, the word processors of the day also stored information in proprietary, opaque formats).

My target output was a free viewer that came with Windows 3.1, known as Windows Help. If I could make the content programmable, I reasoned, I could automatically generate all sorts of different views of the data as Windows Help files and ship them on floppy disk without needing to write my own viewer. (I know this sounds bizarre now, but remember this work predated the concept of a generic web browser by a few years!)

I felt I was facing a major fork in the road in the project. By going with a database, some things were going to be very easy but some very awkward. By going with a document instead...same thing. Some things easy, some very awkward. I trawled around in my head for something that might have the attributes of a database AND of a document at the same time.

As luck would have it, I had a Byte Magazine from 1992 on a shelf. It had an article by Jon Udell that talked about SGML – Standard Generalized Markup Language. It triggered memories of a brief encounter I had had with SGML back in Trinity College, when Dr. David Abrahamson had referenced it in his compiler design course back in 1986. Back then, SGML was not yet an ISO standard (it became one later in 1986, as ISO 8879). I remember in those days hearing about “tagging” and how an SGML parser could enforce structure – any structure you liked – on text, in much the same way that programming language parsers enforce structure on, say, Pascal source code.

I remember thinking “surely if SGML can deal with the hierarchical structures you typically find in programming languages, it can deal with the simpler, flatter structures you get in tabular databases?”. If it could, I reasoned, then surely I could create my own data format that had what I needed from database approaches but also what I needed from document approaches to data modelling.

I found – somehow (this is all pre-internet, remember; no Googling for me in those days) – an address in Switzerland to which I could send a money order and get a 3.5 inch floppy back by return post, with an SGML parser on it called ArcSGML. I also found out about an upcoming gathering in Switzerland of SGML enthusiasts. A colleague, Neville Bagnall, went over and came back with all sorts of invaluable information about this new (to us) thing called generalized markup.

We set to work in earnest. We created our first ever SGML data model and used ArcSGML to ensure we were getting the structure and consistency we wanted in our source data. We set about inventing tags for things like “paragraph”, “bold” and “cross-reference”, as well as simpler field-like tags such as “location”, “duration” etc. We then set about looking at ways to process the resultant SGML file. The output from ArcSGML was not very useful for processing, but we soon found out about another SGML parser called SGMLS, by Englishman James Clark. We got our hands on it and, having taken one look at the ESIS format it produced, we fell in love with it. Now we had a tool that could validate the structure of our document/database and feed us a clean stream of data to process downstream in our own software.
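
For readers who have never seen ESIS: it is a line-oriented event stream in which “(” starts an element, “)” ends one, “-” carries character data and “A” carries an attribute. Here is a minimal sketch of the kind of downstream processing it enables – in Python for brevity (back then this was C++), with hypothetical tag names from our course guide model:

    import sys

    def esis_events(stream):
        """Turn raw ESIS lines into (event, value) pairs."""
        for line in stream:
            code, rest = line[0], line[1:].rstrip("\n")
            if code == "(":
                yield "start", rest           # e.g. ("start", "COURSE")
            elif code == ")":
                yield "end", rest
            elif code == "-":
                yield "data", rest.replace("\\n", "\n")
            elif code == "A":
                yield "attr", rest            # e.g. ("attr", "ID CDATA C042")

    # Downstream code becomes a simple loop, e.g. pulling out every
    # DURATION field without caring about the rest of the document:
    inside = False
    for event, value in esis_events(sys.stdin):
        if event == "start" and value == "DURATION":
            inside = True
        elif event == "end" and value == "DURATION":
            inside = False
        elif event == "data" and inside:
            print(value)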

Back then, C++ was our weapon of choice. Over time our code turned into a toolkit of SGML processing components called IDM (Intelligent Document Manager), which we applied to numerous projects in what became known as the “electronic publishing era”. Things changed very rapidly in those days. Floppy disks gave way to CD-ROMs. We transitioned from Windows Help files to another Microsoft product called Microsoft Multimedia Viewer. Soon the number of “viewers” for electronic books exploded and we were working with Windows Help, Multimedia Viewer, Folio Views and Lotus Notes, to name but four.

As the number of distinct outputs we needed to generate grew, so too did the value of our investment getting up to speed with SGML. We could maintain a single source of content but generate multiple output formats from it, each leveraging the capabilities of the target viewer in a way that made them look and feel like they had been authored directly in each tool as opposed to programmatically generated for them.

My concept of a “document” changed completely over this period. I began to see how formatting and content could be separated from each other. I began to see how, in so doing, a single data model could be used to manage content that is tabular (like a classic tabular database) as well as content that is irregular, hierarchical, even recursive. Moreover, I could see how keeping the formatting out of the core content made it possible to generate a variety of different formatting “views” of the same content.
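
Here is a minimal sketch of that single-source/multiple-views idea, using XML rather than SGML purely because the Python standard library parses it directly; the element names are hypothetical echoes of the course guide model:

    import xml.etree.ElementTree as ET

    SOURCE = """<course>
      <title>Computer Science</title>
      <duration>4 years</duration>
      <body><para>A <bold>four year</bold> honours degree.</para></body>
    </course>"""

    course = ET.fromstring(SOURCE)

    def render(el, tag_map):
        """Recursively re-tag the content for one particular output view."""
        tag = tag_map.get(el.tag)
        parts = ["<%s>" % tag] if tag else []
        if el.text:
            parts.append(el.text)
        for child in el:
            parts.append(render(child, tag_map))
            if child.tail:
                parts.append(child.tail)
        if tag:
            parts.append("</%s>" % tag)
        return "".join(parts)

    # One content model, two presentation "views" of it:
    html_view = "<h1>%s</h1>%s" % (course.findtext("title"),
                                   render(course.find("body"),
                                          {"para": "p", "bold": "b"}))
    text_view = "%s (%s)" % (course.findtext("title"),
                             course.findtext("duration"))
    print(html_view)
    print(text_view)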

It would be many years before the limitations of this approach became apparent to me. Back then, I thought it was a completely free lunch. I was a fully paid-up convert to the concept of generalized markup and machine-readable, machine-validatable documents. As luck would have it, this coincided with the emergence of a significant market for SGML and SGML technologies. Soon I was knee deep in SGML parsers, SGML programming languages, authoring systems and storage systems, and was developing more and more of our own tools: first in C++, then Perl, then Python.

The next big transition in my thinking about documents came when I needed to factor non-technical authors into my thinking. This is where I will turn next. What is a document? - Part 6.

Monday, January 15, 2018

What is a document? - Part 4


In the late Eighties, I had access to an IBM PC XT machine that had Wordperfect 5.1[1] installed on it. Wordperfect was both intimidating and powerful. Intimidating because, when it booted, it completely cleared the PC screen, and unless you knew the function keys (or had the sought-after function key overlay [2]) you were left to your own devices to figure out how to use it.

It was also very powerful for its day. It could wrap words automatically (a big deal!). It could redline/strikeout text, which made it very popular with lawyers working with contracts. It could also split its screen in two, giving you a normal view of the document on top and a so-called “reveal codes” view on the bottom. In the “reveal codes” area you could see the tags/markers used for formatting the text. Not only that, but you could choose to modify the text/formatting from either window.

This idea that a document could have two “faces”, so to speak, and that you could move between them, made a lasting impression on me. Every other DOS-based word processor I came across seemed to me to be a variation on the themes I had first seen in Wordperfect, e.g. Wordstar, Multimate and, later, Microsoft Word for DOS. I was aware of the existence of IBM Displaywriter but did not have access to it. (The significance of IBM in all this document technology stuff only became apparent to me later.)

The next big "aha moment" for me came with the arrival of a plug-in board for IBM PCs called the Hercules Graphics Card[3]. Using this card in conjunction with the Ventura Publisher[4] on DRI's GEM graphics environment [5] dramatically expanded the extent to which documents could be formatted - both on screen an on the resultant paper. Multiple fonts, multiple columns, complex tables, equations etc. Furthermore, the on-screen representation mirrored the final printed output closely in what is now universally known as WYSIWYG.

Shortly after that, I found myself with access to an Apple Lisa [6] and then an Apple Fat Mac 512 with Aldus (later Adobe) Pagemaker [7] and an Apple Laserwriter[8]. My personal computing world split into two. Databases, spreadsheets etc. revolved around IBM PCs and PC compatibles such as Compaq, Apricot etc. Document processing and Desktop Publishing revolved around Apple Macs and Laser Printers.

I became intoxicated/obsessed with the notion that the formatting of documents could be pushed further and further by adding more and more powerful markup into the text. I got myself a copy of Adobe's PostScript Language Tutorial and Cookbook[9] and started to write PostScript programs by hand.

I found that the original Apple Laserwriter had a 25-pin RS-232 port. I had access to an Altos multi-terminal machine [10]. It had some text-only applications on it: a spreadsheet from Microsoft called – wait for it – Multiplan (long before Excel), running on a variant of – again, wait for it – Unix called Microsoft Xenix [11].

Well, I soldered up a serial cable that allowed me to connect the Altos terminal directly to the Apple Laserwriter. I found I could literally type PostScript commands at the terminal window and get pages to print out. I could make the Apple Laserwriter do things that I could not make it do via Aldus Pagemaker, by talking directly to its PostScript engine.
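
For flavour, here is roughly what such a session amounted to. The PostScript itself is real; the Python wrapper, the third-party pyserial package and the port name are a modern re-creation, not what I used in the Eighties.

    # A handful of PostScript commands, of the sort you could type
    # line by line at a terminal wired to the Laserwriter:
    ps = b"""%!PS
    /Times-Roman findfont 48 scalefont setfont
    72 500 moveto
    (Hello, Laserwriter) show
    showpage
    """

    # Sending it down a serial line today (pip install pyserial;
    # the device name is an assumption):
    import serial
    port = serial.Serial("/dev/ttyUSB0", 9600)
    port.write(ps)
    port.close()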

Looking back on it now, this was as far down the rabbit hole of “documents as computer programs” as I ever went. Later I would discover TeX and find it in many ways easier to work with than programming PostScript directly. My career then started to take me into computer graphics rather than document publishing. For a few years I was much more concerned with Bezier curves and bitblts[12], using a Texas Instruments TMS 34010[13] to generate real-time displays of financial futures time-series analysis (a field known as technical analysis in the world of financial trading [14]).

It would be some years before I came back to the world of documents, and when I did, my route back caused me to revisit my “documents as programs” world view from the ground up.

It all started with a database program for the PC called dBase by Ashton Tate[15]. Starting from the perspective of a database made all the difference to my world view. More on that, next time.


Tuesday, January 02, 2018

What is a document? - Part 3


Back in 1983, I interacted with computers in three main ways. First, I had access to a cantankerous digital logic board [1] which allowed me to play around with boolean logic via physical wires and switches.

Second, I had access to a Rockwell 6502 machine with 1K of RAM (that's one kilobyte), which had a callous-forming keyboard and a single-line (not single monitor – single line) LED display; it was called an AIM 65[2]. Third, at home I had a Sinclair ZX80 [3] which I could hook up to a black and white TV set and get a whopping 256 x 192 pixel display.

Back then, I had a fascination with the idea of printing stuff out from a computer – an early indication, which I completely blanked on at the time, that I was genetically predisposed to an interest in typesetting/publishing. The AIM 65 printed to a cash register roll, which was not terribly exciting (another early indicator that I blanked on at the time). The ZX80 did not have a printer at all...home printing was not a thing back then. In 1984, however, the Powers That Be in TCD gave us second year computer science newbies rationed access to a VAX 11/780, with glorious ADM-3A[4] terminals.

In a small basement terminal room on Pearse St in Dublin, there was a clutch of these terminals, and we would eagerly stumble down the stairs at the appointed times to get at them. Beating time in the corner of that terminal room, most days, was a huge, noisy dot matrix printer[5], endlessly chewing through boxes of green/white striped continuous computer paper. I would stare at it as it worked, finding it particularly fascinating that it could create bold text by the clever trick of backing up the print head and re-doing the text with a fresh layer of ink.

We had access to a basic e-mail system on the VAX. One day, I received an e-mail from a classmate (the sender is lost in the mists of time) in which one of the words was drawn to the screen twice in quick succession as the text scrolled (these were 300 baud terminals – the text appeared character by character, line by line, from top to bottom). Fascinated by this, I printed out the e-mail and found that the twice-drawn word came out in bold on paper.

"What magic is this?", I thought.  By looking under the hood of the text file, I found that the highlighted word – I believe it was the word “party” – came out in bold because five control characters (Control-H [5] characters[6]) had been placed right after the word. When displayed on screen, the ADM3a terminal drew the word, then backed up 5 spaces because of the Control-H's, then drew the word again. When printed, the printer did the same but because ink is cumulative, the word came out in bold. Ha!

Looking back on it, this was the moment when it occurred to me that text files could be more than simply text. They could also include instructions, and these instructions could do all sorts of interesting things to a document when it was printed/displayed... As luck would have it, I also had access to a wide-carriage Epson FX80[7] dot matrix printer through a part-time programming job I had while in college.

Taking the Number 51 bus to college from Clondalkin in the mornings, I read the Epson FX-80 manual from cover to cover. Armed with a photocopy of the “escape codes”[8] page, I was soon a dab hand at getting text to print out in bold, condensed, strike-through, different font sizes...
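
Those escape codes were simply extra bytes interleaved with the text. A sketch of the idea in Python, with ESC/P codes quoted from memory of the FX-80 manual (treat the exact values as illustrative):

    ESC = b"\x1b"
    BOLD_ON, BOLD_OFF = ESC + b"E", ESC + b"F"   # "emphasized" mode
    COND_ON, COND_OFF = b"\x0f", b"\x12"         # condensed mode

    # the stream of bytes that actually went to the printer:
    line = (b"Results: " + BOLD_ON + b"PASS" + BOLD_OFF +
            b" (details " + COND_ON + b"attached" + COND_OFF + b")\r\n")

    print(line)   # shows the raw bytes; in 1985 this went straight
                  # down the printer port, e.g. (name is an assumption):
                  # with open("/dev/usb/lp0", "wb") as lp:
                  #     lp.write(line)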

After a while, my Epson FX-80 explorations ran out of steam. I basically ran out of codes to play with – there was a finite set of them to choose from. Also, it became very apparent to me that littering my text files with these codes was an ugly and error-prone way to get nice printouts. I began to search for a better way. The “better way” for me had two related parts. By day, on the VAX 11/780, I found out about a program called Runoff[9]. And by night, I found out about a word processor called Wordstar[10].

Using Runoff, I did not have to embed, say, Epson FX80 codes into my text files; I could embed more abstract commands that the program would then translate into printer-specific commands when needed. I remember using “.br” to create a line break (ring any bells, HTML people?), “.bp” to begin a new page, “.ad” to right-align text, etc.
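
That translation layer is the important idea, and it is tiny. A sketch in Python (all names and codes illustrative, not Runoff's actual implementation): the document carries abstract commands, and one small table per printer turns them into device codes.

    EPSON = {".bp": b"\x0c",    # form feed: begin a new page
             ".br": b"\r\n"}    # line break

    def translate(source_lines, device):
        out = b""
        for line in source_lines:
            cmd = line.strip()
            if cmd in device:
                out += device[cmd]              # abstract command -> device code
            else:
                out += line.encode("ascii") + b"\r\n"
        return out

    doc = ["The first page.", ".bp", "The second page."]
    print(translate(doc, EPSON))
    # Supporting a new printer means writing a new table, not a new document.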

Using Wordstar on an Apple II machine running CP/M (I forgot to mention I had access to one of those also... I wrote my first ever spreadsheet in Visicalc on that machine, but that is another story), I could do something similar. I could add in control codes for formatting and it would translate them for the current printer as required.

So far, everything I was using to mess around with documents was based on visible coding systems, i.e. the coding added to the documents was always visible on the screen, interspersed with the text. So far also, the codes added to the documents were all control codes, i.e. imperative instructions about how a document should be formatted.

The significance of this fact only became clear to me later, but before we get there, I need to say a few words about my early time with Wordperfect on an IBM PC XT; about my first encounter with a pixel-based user interface – it was called GEM [11] and ran on top of DOS on IBM PCs; and about an early desktop publishing system called Ventura Publisher, from Ventura Software, which ran on GEM. I also need to say a little about the hernia-generating Apple Lisa[12] that I once had to carry up a spiral staircase.

Oh, and the mind-blowing moment I first used Aldus Pagemaker[13] on a Fat Mac 512K[14] to produce a two-column sales brochure on an Apple Laserwriter[15] and discovered the joys of PostScript.

Next : What is a document? - Part 4.

[5] Similar to this http://bit.ly/2CFhue9

Thursday, December 14, 2017

What is a document? - Part 2


Back in 1985, when I needed to create a “document” on a computer, I had only two choices. (Yes, I am indeed avoiding trying to define “document” just yet. We will come back to it when we have more groundwork laid for a useful definition.) The first choice involved typing into what is known generically as a “text editor”. Back in those days, US-ASCII was the main encoding for text, and it allowed for just the basic letters, numbers and a few punctuation symbols. The so-called “text files” created by these “text editors” could be viewed on screens which typically had 80 columns and 25 rows. They could also be printed onto paper, using either “dot matrix” printers or higher resolution, computerized typewriters such as the so-called “golf ball” typewriters/printers, which mimicked a human typist using a ribbon-based impact mechanism.

The second choice was to wedge the text into little boxes called “fields”, to be stored in a “database”. Yes, my conceptual model of text in computers in those early days was a very binary one. (Some nerd humour in that last sentence.)

On the one hand, I could type stuff into small “boxes” on a screen, which typically resulted in the creation of some form of “structured” data file, e.g. a CODASYL database [1]. On the other hand, I could type stuff into an expandable digital sheet of paper without imposing any structure on the text, other than a collection of text characters, often chunked with what we used to call CRLF separators (Carriage Return, Line Feed).

(Aside: You can see the typewriter influence in the terminology here. Return the carriage (holding the print head) to the left of the page. Feed the page upwards by one line. So Carriage Return + Line Feed  = CR/LF).

(Aside: I find the origins of some of this terminology are often news to younger developers, who wonder why moving to a new line is two characters instead of one on some machines. Surely “newline” is one thing? Well, it was two originally because one command moved the carriage back (the “CR”) and another command moved the paper up a line (the “LF”), hence the common pairing: CR/LF. When I explain this I double up by explaining “uppercase/lowercase”. The origins of the latter in particular are not well known to digital natives, in my experience.)
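
The lineage is still visible in any programming language today; a two-line illustration in Python:

    print(repr("\r\n"))  # carriage return + line feed: the DOS/Windows convention
    print(repr("\n"))    # a bare line feed: the Unix convention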

From my first encounters with computers, this difference in how the machines handled storing data intrigued me. On one hand, there were “databases”: stately, structured, orderly digital objects. Mathematicians could say all sorts of useful things about them and create all sorts of useful algorithms to process them. Databases are designed for automation.

On the other hand, there was the rebellious, free-wheeling world of text files. Unstructured. Disorderly. A pain in the neck for automation. Difficult to reason about and create algorithms for, but fantastically useful precisely because they were unstructured and disorderly.

I loved text files back then. I still love them today. But as I began to dig deeper into computer science, I began to see that this binary world view – database versus text, structured versus unstructured – was simple, elegant and wrong. Documents can indeed be “structured”. Document processing can indeed be automated. It is possible to reason about documents and create algorithms for them, but it took me quite a while to get to grips with how this can be done.

My journey of discovery started with an ADM-3A+ terminal connected to a VAX 11/780 mini-computer (by day) [2] and an Apple IIe personal computer running CP/M (by night) [3].

For the former, a program called RUNOFF. For the latter, a program called Wordstar and one of my favorite pieces of hardware of all time: an Epson FX80 dot matrix printer.



Thursday, December 07, 2017

What is a document? - Part 1

I am seeing a significant up-tick in interest in the concept of structured/semantic documents in the world of law at present. My guess is that this is a consequence of all the activity surrounding machine learning/AI in law.

It has occurred to me that some people with law/law-tech backgrounds are coming to some of the structured/semantic document automation concepts anew whereas people with backgrounds in, for example, electronic publishing (Docbook etc.), financial reporting (XBRL etc.), healthcare (HL7 etc.) have already “been around the block” so-to-speak, on the opportunities, challenges and pragmatic realities behind the simple sounding – and highly appealing – concept of a “structured” document.

In this series of posts, I am going to outline how I see structured documents, drawing on the 30 (phew!) or so years of experience I have accumulated working with them. My hope is that what I have to say on the subject will be of interest to those newly arriving in the space. I suspect that at least some of the new arrivals are asking themselves “surely this has been tried before?” and looking to learn what they can from those who have “been there”. Hopefully, I can save some people some time and help them avoid some of the potential pitfalls and “gotchas”, as I have had plenty of experience finding those.

As I start out on this series of blog posts, I notice with some concern that a chunk of this history – from the late Eighties to the late Nineties – is getting harder and harder to find online as the years go by. So many broken links to old conference websites, so many defunct publications...

This was the dawn of the electronic publishing era and coincided with a rapid transition from mainframe green-screens to dial-up CompuServe, to CD-ROMs, to the Internet and then to the Web, bringing us to where we are today. It was a period of creative destruction in the world of the written word without parallel in the history of civilization. I cannot help feeling that we have a better record of what happened in the world from the time of Gutenberg's printing press to the glory years of paper-centric desktop publishing than we do for the period that followed it, when we increasingly transitioned away from fixed-format, physical representations of knowledge. But I digress...

For me, the story starts in June 1992, with a Byte magazine article by Jon Udell[1] whose title promised a way to “turn mounds of documents into information that can boost your productivity and innovation”. It was exactly what I was looking for in 1992 for a project I was working on: an electronic education reference guide to be distributed on 3.5 inch floppy disks to every school in Ireland.

Turning mounds of documents into information. Sound familiar? Sound like any recent pitch you have heard in the world of law? Well, it may surprise you to hear that the technology Jon Udell's article was about – SGML – was largely invented by a lawyer called Dr Charles F. Goldfarb[2]. SGML set in motion a cascade of technologies that have led to the modern web. HTML is the way it is, in large part, because of SGML. In other words, we have a lawyer to thank for a large aspect of how the Web works. I suspect that I have just surprised some folks by saying that :-)

Oh, and while I am on a roll making surprising statements, let me also state that the cloud – running as it does in large part on Linux servers – is, in part, the result of a typesetting R&D project at AT&T Bell Labs back in the Seventies.

So, in an interesting way, modern computing can trace its feature set back to a problem in the legal department. Namely, how best to create documents in computers so that the content of the documents can be processed automatically and re-used in different contexts?

More on that later, but best to start at the beginning which for me was 1985. The year when a hirsute computer science undergraduate (me) took a class in compiler design from Dr. David Abrahamson[3] in Trinity College Dublin and was introduced to the wonderful world of machine readable documents.

Yes, 1985.

Next: Part 2.