Thursday, December 07, 2017

What is a document? Part 1.

I am seeing a significant up-tick in interest in the concept of structured/semantic documents in the world of law at present. My guess is that this is as a consequence of the activity surrounding machine learning/AI in law at the moment.

It has occurred to me that some people with law/law-tech backgrounds are coming to some of the structured/semantic document automation concepts anew whereas people with backgrounds in, for example, electronic publishing (Docbook etc.), financial reporting (XBRL etc.), healthcare (HL7 etc.) have already “been around the block” so-to-speak, on the opportunities, challenges and pragmatic realities behind the simple sounding – and highly appealing – concept of a “structured” document.

In this series of posts, I am going to outline how I see structured documents, drawing from the 30 (phew!) or so years of experience I have accumulated in working with them. My hope is that what I have to say on the subject will be of interest to those newly arriving in the space. I suspect that at least some of the new arrivals are asking themselves “surely this has been tried before?” and looking to learn what they can from those who have "been there". Hopefully, I can save some people some time and help them avoid some of the potential pitfalls and “gotchas” as I have had plenty of experience in finding these.

As I start out on this series of blog posts, I notice with some concern that a chunk of this history – from late Eighties to late Nineties – is getting harder and harder to find online as the years go by. So many broken links to old conference websites, so many defunct publications....

This was the dawn of the electronic publishing era and coincided with a rapid transition from mainframe green-screens to dialup compuserv, to CD-ROMs, to the Internet and then to the Web, bringing us to where we are today. A period of creative destruction in the world of the written word without parallel in the history of civilization actually. I cannot help feeling that we have a better record of what happened in the world from the time of Gutenburg's printing press to the glory years of paper-centric desktop publishing, than we do for the period that followed it when we increasingly transitioned away from fixed-format, physical representations of knowledge. But I digress....

For me, the story starts in June 1992 with a Byte magazine article by Jon Udell[1] with a title that promised a way to “turn mounds of documents into information that can boost your productivity and innovation”. It was exactly what I was looking for in 1992 for a project I was working on. An electronic education reference guide to be distributed on 3.5 inch floppy disks to every school in Ireland.

Turning mounds of documents into information. Sound familiar? Sound like any recent pitch you have heard in the world of law? Well, it may surprise you to hear that the technology Jon Udell's article was about – SGML – was largely invented by a lawyer called Dr Charles F. Goldfarb[2]. SGML set in motion a cascade of technologies that have lead to the modern web. HTML is the way it is, in large part, because of SGML. In other words, we have a lawyer to thank for a large aspect of how the Web works. I suspect that I have just surprised some folks by saying that:-)

Oh, and while I am on a roll making surprising statements, let me also state that the cloud – running as it does in large part on linux servers – is, in part, the result of a typesetting R&D project in AT&T Bell Labs back in the Seventies.

So, in an interesting way, modern computing can trace its feature set back to a problem in the legal department. Namely, how best to create documents in computers so that the content of the documents can be processed automatically and re-used in different contexts?

More on that later, but best to start at the beginning which for me was 1985. The year when a hirsute computer science undergraduate (me) took a class in compiler design from Dr. David Abrahamson[3] in Trinity College Dublin and was introduced to the wonderful world of machine readable documents.

Yes, 1985.

Next: Part 2.


No comments: