Monday, June 14, 2010

XML in legislature/parliament environments: content aggregation and its role in content modeling and information architectures

Last time in this KLISS series I talked about the subtle inter-twingularities between content and presentation and how, in legal/regulatory environments, the quantum-like uncertainties concerning what you will see when you observe a document through software-tinted glasses are enough to keep you awake at night. This is especially true if you are trying to apply the XML standard model with its penchant for dismissing "rendering" as merely a "down-stream" activity with no desirable or necessary impact on the "up-stream" content architecture. Would that it were so....

Today I want to close out my outline of the problem areas with some discussion about content aggregation and its role in content modeling and information architectures in legislatures/parliaments. The best way to approach the problem I want to highlight is by reference to the XML standard model. Consider the time-honored concept of "document analysis". In this phase of any XML project, you get your hands on documents produced by the entity under analysis and then you classify and decompose them. By "classification" I mean that some type system is constructed into which all the documents "fit". For example you might proceed like this:
  • The customer thinks of all the documents in this pile as "Bills"
  • Bills always have a number and a short title and a long title and one or more sponsors.
  • Ok, let us declare "Bill-ness" to be present in any document that has:

    • The word "Bill" in its long title, and
    • A unique(-ish) number, and
    • a short title, and
    • a long title, and
    • one or more sponsors.

  • Hmm...now when we look closely at all these bills, they appear to sub-divide into sub-types. There are money bills, bills that change statute, bills that don't, bills that substitute for other bills etc. etc. (A rough sketch of this kind of "Bill-ness" test in code follows this list.)
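
To make that kind of classification test concrete, here is a minimal sketch in Python. It is only an illustration: the element names (number, short-title, long-title, sponsor) and the sample document are invented for this example and are not drawn from any real legislative schema.

    import xml.etree.ElementTree as ET

    def looks_like_a_bill(doc_root):
        """Classify a parsed document as a Bill if the agreed markers are all present."""
        long_title = doc_root.findtext("long-title", default="")
        return ("Bill" in long_title                               # the word "Bill" in the long title
                and doc_root.findtext("number") is not None        # a unique(-ish) number
                and doc_root.findtext("short-title") is not None   # a short title
                and len(doc_root.findall("sponsor")) >= 1)         # one or more sponsors

    sample = ET.fromstring(
        "<doc><number>HB 2001</number><short-title>An act concerning...</short-title>"
        "<long-title>A Bill for an Act concerning...</long-title>"
        "<sponsor>Representative Smith</sponsor></doc>")
    print(looks_like_a_bill(sample))  # True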

There are two primary problems with this "bring out your dead trees" approach in legislatures/parliaments from my perspective.
  • It pre-supposes an acceptable dominant de-composition for each document type
  • It does not take into account the multiple levels of transclusion that can take place during production of the document. Making that transclusion process explicit in the model is oftentimes the key to deriving value from XML-based systems in legislatures/parliaments in my experience.

Let us take each of these in turn.

Pre-supposing an acceptable dominant de-composition for each document type


In many content architecture puzzles, there comes a point where the words in the documents are just that: words. In legislatures/parliaments however, the words are incredibly important because they often speak about the workflows that the documents themselves have already gone through or are about to go through.
Put a bill in front of a layperson and, after the classic structured stuff at the front, they will just see lots of words. Put the same bill in front of an attorney steeped in the lore of the law and of the law-making process, and they see a completely different information vista. They see the formulaic language signaling a repeal. They know that this is a rescission bill, not because it says so, but because of its impact on the general fund. They know that the various enactment dates have been crafted to ensure that conflicting states are avoided, etc. etc.
Now, how many of the things I just mentioned can be or should be tags in the data model? Here is what I have found in my experiences with legislatures/parliaments:
  • Everything is a candidate for being tagged because nobody wants to risk leaving something out.
  • Communities of interest invariably have different, irreconcilable data models in their world views. A bill really does look different to a drafting attorney.
  • All communities share the belief that their model is the most important and should be foremost in the model.
  • The entity doing the modeling is rarely empowered to disappoint anybody in the room and therefore all possible world-views are mashed into a single tag soup in which essentially every possible element can occur in every possible context.

This problem is especially serious when the entity charged with doing the modeling is not incentivized to produce a model that will actually work in practice. A common example is the all-too-common pattern of having entity A create the information models and a separate entity B build the actual system. If entity A has no incentive to tackle the tyranny of the dominant decomposition, they are unlikely to do so.

Even if a dominant decomposition is agreed upon and the worst excesses of tag-soup avoided, information models often degrade towards tag soup over time. A schema goes into production and a new element is required, or perhaps an existing element is required in a new context. The easiest way to accommodate this whilst guaranteeing backwards compatibility is to loosen the content constraints. Do this a few times and your once constrained, hierarchical, validation-enforcing schema has suffered information entropy death.
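
As a rough illustration of what that loosening looks like (a sketch only: lxml for the DTD validation, invented element names, nothing from a real system), compare a constrained content model with the mixed-content soup it tends to decay into. Old documents keep validating; the schema just stops saying anything useful.

    from io import StringIO
    from lxml import etree

    # The original, constrained content model: a bill is a number, a short
    # title, a long title and one or more sponsors, in that order.
    strict_dtd = etree.DTD(StringIO("""
    <!ELEMENT bill (number, short-title, long-title, sponsor+)>
    <!ELEMENT number (#PCDATA)>
    <!ELEMENT short-title (#PCDATA)>
    <!ELEMENT long-title (#PCDATA)>
    <!ELEMENT sponsor (#PCDATA)>
    """))

    # After a few rounds of "just allow it in there too", the content model
    # becomes repeatable mixed content: anything, anywhere, in any order.
    loose_dtd = etree.DTD(StringIO("""
    <!ELEMENT bill (#PCDATA | number | short-title | long-title | sponsor | note)*>
    <!ELEMENT number (#PCDATA)>
    <!ELEMENT short-title (#PCDATA)>
    <!ELEMENT long-title (#PCDATA)>
    <!ELEMENT sponsor (#PCDATA)>
    <!ELEMENT note (#PCDATA)>
    """))

    doc = etree.fromstring(
        "<bill><long-title>A Bill for an Act...</long-title>"
        "<note>added later</note><number>HB 2001</number></bill>")

    print(strict_dtd.validate(doc))  # False: wrong order, missing short-title and sponsor
    print(loose_dtd.validate(doc))   # True: the schema no longer enforces much of anything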

Grammar-based schemas are not statistical in nature and that is one of their great weaknesses in my opinion. All elements do not occur with equal probability. Far from it. In fact, many of the truly document-oriented corpora I have looked at have power law distributions for their elements.

The practical upshot of this is that regardless of how long you spend and how many stake-holders you get into the room and how much money you pay for your schema-to-end-all-schemas, 20% of your information elements will account for 80% of your elements-as-used.
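
If you want to test that claim against a corpus of your own, a throwaway script along these lines will show you the distribution of elements-as-used. It uses only the Python standard library; the command-line directory argument and the assumption that the corpus is a tree of .xml files are mine, not anything from the source material.

    import collections
    import pathlib
    import sys
    import xml.etree.ElementTree as ET

    counts = collections.Counter()
    for path in pathlib.Path(sys.argv[1]).rglob("*.xml"):
        for elem in ET.parse(path).iter():
            counts[elem.tag] += 1

    total = sum(counts.values())
    running = 0
    for rank, (tag, n) in enumerate(counts.most_common(), start=1):
        running += n
        print(f"{rank:4d}  {tag:<30}  {n:8d}  {100 * running / total:5.1f}% cumulative")
    # On document-oriented corpora, the cumulative column typically passes 80%
    # well before you are a fifth of the way down the list of distinct tags.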

Finally on this point, it never ceases to amaze me how much of that 80% - i.e. the tags actually used - consists of paragraph, heading, bold and italic tags, i.e. tags carrying essentially zero classical XML semantics!... Hold that thought. I will be returning to it later in this series.

Insufficient modelling of transclusions


The second problem I have with the classic document analysis in legislatures/parliaments relates to the critically important area of transclusions. Legislatures/parliaments are simply rife with these things! Bills contain statute sections but the statute sections must also stand alone. Amendment lists contain amendments but the amendments themselves must also stand alone. Journals contain votes but the votes themselves must stand alone. Final calendars contain calendars but the calendars must stand alone...

...The fun really starts when you follow the content back to its creation. Bills contain statute sections and the sections stand alone. Fine. Where did the statute sections come from? They came from the bills that enacted them. Ok, what was in the bills? The statute being amended. Ok. Where did that come from? The bills...
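
One way to make the "contained here, but must also stand alone" relationship explicit in the model is to transclude by reference rather than by copy-and-paste. Here is a minimal sketch using XInclude via lxml; the file names and element names are invented for illustration and do not come from any real bill drafting system.

    from lxml import etree

    # The statute section lives as a standalone document in its own right...
    with open("section-1234.xml", "w") as f:
        f.write("<section num='1234'><text>Existing statute text...</text></section>")

    # ...and the bill transcludes it by reference instead of copying it in.
    with open("bill.xml", "w") as f:
        f.write(
            "<bill xmlns:xi='http://www.w3.org/2001/XInclude'>"
            "<long-title>A Bill for an Act amending section 1234</long-title>"
            "<xi:include href='section-1234.xml'/>"
            "</bill>")

    tree = etree.parse("bill.xml")
    tree.xinclude()  # resolve the transclusion in place
    print(etree.tostring(tree, pretty_print=True).decode())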

Take another example. The Bill Status tells us that something interesting happened to the bill on page 12 of yesterday's journal. Where did the page citation come from? The journal. What was in the journal? The bill, plus other things. When was the Bill Status updated? As soon as the action happened on the bill. When was the journal produced? 24 hours later. When did the page number get into the Bill Status? A few hours after that.

Final example: something happens on the chamber floor that changes the status of a bill. The event is recorded so that the Bill Status can be updated. But the event must also be recorded in the journal, so it is entered there too. Maybe the change to the bill was that it was referred to a committee. That means that the state of that committee needs to change, which will cause a change to its meeting schedule, which will result in meeting minutes, which will result in messages to the chamber, which will result in entries in the journal...
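
To give that cascade a concrete shape, here is a toy sketch of events triggering handlers that record further events. Every name in it is invented; it shows only the feedback pattern, not how any real chamber automation works.

    from collections import defaultdict, deque

    handlers = defaultdict(list)   # event type -> list of handler functions
    queue = deque()                # events waiting to be processed

    def on(event_type):
        def register(fn):
            handlers[event_type].append(fn)
            return fn
        return register

    def publish(event_type, **data):
        queue.append((event_type, data))

    @on("bill.referred")
    def update_bill_status(data):
        print(f"bill status: {data['bill']} referred to {data['committee']}")
        publish("journal.entry", text=f"{data['bill']} referred to {data['committee']}")
        publish("committee.agenda_changed", committee=data["committee"])

    @on("committee.agenda_changed")
    def reschedule_meeting(data):
        publish("journal.entry", text=f"{data['committee']} meeting agenda updated")

    @on("journal.entry")
    def record_in_journal(data):
        print("journal:", data["text"])

    # One floor action sets the whole cascade in motion.
    publish("bill.referred", bill="HB 2001", committee="Ways and Means")
    while queue:
        event_type, data = queue.popleft()
        for handler in handlers[event_type]:
            handler(data)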

Around and around it goes. Where it stops...actually it never stops. Legislatures/parliaments are the biggest Hermeneutic circles I have ever encountered. No field of human endeavour, with the possible exception of software engineering (yes, I will be coming back to that), is more worthy of using Escher's Drawing Hands as its emblem.

"So what?", is perhaps what you are thinking. So the information is complex, the flows are complex and they feed back on themselves. I am dragging you through this because I firmly believe that fully engaging with these complexities is what Legislative Enterprise Architecture needs to be all about if true value is to be derived from legislative IT projects. It is possible to use these feedback loops to generate efficiencies and reduce errors and increase transparency and improve service to Members and do all those things...but you won't get there if you ignore them.

What you will get instead are silos. If you do not take the time to look at all the information flows and all the feedback loops, you get silos. I have seen a goodly number of fully XML-compliant, state-of-the-art document management systems that are good old-fashioned silos.
  • It is not unusual for journal systems to be developed independently of bill drafting systems even though bills flow into journals.
  • It is not unusual for bill drafting systems to be developed independently of statute codification systems even though bills are the starting point for codification and the codifications end up being the source of statute for new bills.
  • It is not unusual for journal systems to be developed independently of bill status systems even though the journal and the bill status systems need to agree on what happens to bills.


I hope I have convinced you that the overlaps and inter-relationships between the data items in legislatures/parliaments are many and deep. This is not a problem that can be solved by throwing tags around like pixie dust. This is not a domain in which serious value can be extracted from computerization projects without fully engaging with the domain based on what it really is, not on what we might like it to be. Bismarck's sausage machine is not hidden from us but we cannot understand it unless we are willing to look deep into it...

When you do so, what you see is a machine fuelled by its own feedback loops; brimming with time-based citations and time-based transclusions; replete with subtle inter-twinglements between content and presentation; overflowing with cascading event streams. The complexity can be overpowering at first but after you have seen a few of them up close, the patterns emerge and opportunities to leverage the patterns with technology present themselves.

In the next post, I want to start looking at these patterns.
