Featured Post

Linkedin

 These days, I mostly post my tech musings on Linkedin.  https://www.linkedin.com/in/seanmcgrath/

Friday, February 23, 2018

What is a document - Part 7



The word “document” is, like the word “database”, simple on the outside and complex on the inside. 

Most of us carry around pragmatically fuzzy definitions of these in our heads. Since the early days of personal computers there have been software suites/bundles available that have included distinct tools to manage “documents” and “databases”, treating them as different types of information object. The first such package I used was called SMART running on an IBM PC XT machine in the late Eighties. It had a 10MB hard disk. Today, that is hardly enough to store a single document, but I digress...

I have used many other Office Suites since then, most of which have withered on the vine in enterprise computing, with the notable exception of Microsoft Office. I find it interesting that of the words typically associated with office suites, namely, “database”, “word processor”, “presentation”, and “spreadsheet” the two that are today most tightly bound to Microsoft office are “spreadsheet” and “presentation” to the point where “Excel” and “Powerpoint” have become generic terms for “spreadsheet” and “presentation” respectively. I also think it is interesting Excel has become the de-facto heart of Microsoft Office in the business community with Word/Access/Powerpoint being of secondary importance as "must haves" in office environments, but again I digress...

In trying to chip away at the problem of defining a “document” I think it is useful to imagine having the full Microsoft office suite at your disposal and asking the question “when should I reach for Word instead of one of the other icons when entering text?” The system I worked in in the Nineties, mentioned previously in this series, required a mix of classic field-type information along with unstructured paragraphs/tables/bulleted lists. If I were entering that text into a computer today with Microsoft Office at my disposal, would I reach for the word processor icon or the database icon?

I would reach for the Word icon. Why? Well, because there are a variety of techniques I can use in Word to enter/tag field-type textual information and many techniques for entering unstructured paragraphs/tables/bulleted lists. The opposite is not true. Databases tend to excel (no pun intended) at field-type information but be limited in their support for unstructured paragraphs/tables/bulleted lists – often relegating the latter to “blob” fields that are second-class citizens in the database schema. 

Moreover, these days, the tools available for post-processing Word's .docx file format make it much easier than ever before to extract classic “structured XML” from Word documents but with the vital familiarity and ease of use for the authors/editors I mentioned previously.

Are there exceptions? Absolutely. There are always exceptions. However, if your data structure necessarily contains a non-trivial amount of unstructured or semi-structured textual content and if your author/edit community wants to think about the content in document/word-processor terms, I believe today version of Word with its docx file format is generally speaking a much better starting point than any database front-end or spreadsheet front-end or web-browser front-end or any structured XML editing tool front-end.

Yes, it can get messy to do the post-processing of the data but given a choice between a solution architecture that guarantees me beautifully clean data at the back-end but an author/edit community who hate it, versus a solution architecture that involves extra content enrichment work at the back end but happy author/edit users, I have learned to favor the latter every time.

Note I did not start there! I was on the opposite side of this for many, many years, thinking that structured author/edit tools, enforcing structure at the front-end was the way to go. I built a few beautiful structured systems that ultimately failed to thrive because the author/edit user community wanted something that did not “beep” as they worked on content. I myself, when writing the books I wrote for Prentice-Hall (books on SGML and XML - of all things!), I myself wanted something that did not beep!

Which brings me (finally!), to my answer to the question “What is a document?”. My answer is that a document is a textual information artifact where the final structure of the artifact itself is only obvious after it has been created/modified and thus requires an author/edit user experience that gets out of the way of the users creative processes until the user decides to impose structure – if they decide to impose a structure at all.

There is no guaranteed schema validity other than that most generic of schemas that splits text into flows, paragraphs, words, glyphs etc and allows users to combine content and presentation as they see fit.

On top of that low level structure, anything goes – at least until the point where the user has decided that the changes to the information artifact are “finished”. At the point where the intellectual work has been done figuring our that the document should say and how it should say it, it is completely fine - and generally very useful - to be able to validate against higher level, semantic structures such as "chapter", "statute", "washing machine data sheet" etc.

The big lesson of my career to date in high volume document management/processing is that if you seek to impose this semantic structure on the author/edit community rather than have them come to you and ask for some structure imposition, you will struggle mightily to have a successful system.