Sunday, September 04, 2005

Why you should care about file formats

On Sept. 1, 2005, the Information Technology Division of the Commonwealth of Massachusetts proposed the Enterprise Technical Reference Model v.3.5 (ETRM v3.5). It is one step in the implementation of the Enterprise Open Standards Policy. The policy would force state agencies to store and transmit digital documents only in certain formats that comply with publicly accessible and implementable standards. Some of these are quite familiar. The state would standardize its email in ASCII text[1] and its web pages in HTML 4.01[2]. XML (eXtensible Markup Language)-based formats will be preferred for databases. The policy also expresses a preference for Adobe's Portable Document Format for some unspecified documents.[3] The most dramatic, and probably the most noticeable, change will be the adoption of OASIS OpenDocument formats for office documents. This post will hopefully explain why you should care about how the state government stores its documents, and answer some of the misinformation aimed at derailing the policy's implementation. The public comment period extends until Sept. 9. Public comments should be directed to

  1. File format matters. Have you ever tried opening a file and all you got was some junk text on screen? Maybe it included some incomprehensible symbols. The most likely cause of that problem is that you tried to open the file using a program that could not read the file's format. As I explained elsewhere, computers don't natively understand context. File formats provide the context of how to turn a string of meaningless bits in a particular file into usable data. If your software can't provide that context, the file is useless.
  2. Standardization within an organization is a practical necessity. This is true whether the large organization is a private corporation or a state government. If an organization's parts use multiple incompatible file formats, a format conversion has to be done whenever those parts send electronic data to each other. This puts an additional burden on the Information Technology staff, who have to support all of the inevitable conversion problems between the formats. The burden is especially high if the formats are unpublished, because the IT staff would have no recourse in performing the conversions other than the companies that made the formats incompatible in the first place. In that case, sharing documents and data between different branches of the organization becomes harder or impossible, and the most likely outcome is that the data remains unshared.
  3. You pay for the documents the state produces. Taxpayer money funds the production of all documents coming out of the state government. The Computer Technology Industry Association would like you to believe that standardizing on open formats will be costly to taxpayers. It is true that there will be some transitional costs. Workers may have to be retrained in new software (if the current software providers refuse to support the new standard formats). If a given agency decides to standardize its software stack on proprietary software that supports the formats, it will have to purchase new software licenses. If it decides to standardize on free/open-source software, it may need new support contracts. But once the new standards are in place, the cost of software licensing will likely be driven down by increased competition in the marketplace. Even Microsoft, which used to be able to set its prices by fiat, is now being forced to lower its software licensing prices for some customers because of the threat of free/open-source software. Beyond the cost of ownership and support, the state is also a public entity. This means that you, the residents, should be able to access documents that the state produces on your behalf without the additional burden of buying proprietary software. As email is used for more official functions, you may in the future even be able to submit government forms online. That requires that you be able to view and modify the documents before you send them back to the state. You were taxed once by Massachusetts. You should not be taxed a second time by Microsoft.
  4. The policy does not automatically favor free/open source software over proprietary software. The OASIS OpenDocument format is described by standards documents, which may be obtained freely. There are no additional terms required to be placed on implementations of the format. Proprietary vendors already implement read and write filters for a number of other applications to aid migration and interoperability between different vendors' products. There is no technical or legal reason that commercial software manufacturers can't implement fully functional filters for OASIS OpenDocument well before 2007. Microsoft has chosen to place itself out of the market of interoperable programs by relying solely on its own format.
  5. Using open formats is not a statement against free enterprise. I never really understood how using standards that originated in open-source hurt free enterprise in any way. Most of the Internet architecture is based on such standards, and companies seem to be able to sell all kinds of Internet applications.
  6. “XML” does not mean the format is open. Microsoft, one of the big losers in the change, argues that the XML-based format used in Office 12 is a sufficiently open format. But XML alone does not make the format usable for interoperability. Here's why. XML is not a format for storing data. It's a language used to describe a format for storing data. To make sure that a document is not just well-formed XML, but also a properly formed instance of a derivative format (such as the XHTML in which this web page is written), one must have access to a description of the derivative format. Those schemas are as important for being able to read a particular XML-formatted document as the document being well-formed XML. The schemas for Microsoft Office 12's XML format are released under a relatively liberal license. But Microsoft has attempted to patent the XML schemas used in its format. If its application is successful, it will give Microsoft ultimate control over who can implement the format. In the patent license, this term is included:

    You are not licensed to sublicense or transfer your rights.

    That term alone is enough to shut out all free/open-source software, which requires that anyone be allowed to redistribute source code. The patent prevents interoperability between Office 12 and the most successful of its competitors. What Microsoft gave with one hand, they took away with the other.
  7. The release version of OpenOffice.org (1.1.4) does not support OpenDocument. The release candidate for OOo 1.1.5 supports it, and it is the default format in the 2.0 beta version. Release versions of AbiWord and the KOffice suite already support OpenDocument.
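
To make point 6 a little more concrete, here is a minimal Python sketch of the difference between "well-formed XML" and "a document in a format I know how to read." The "memo" vocabulary below is entirely made up for illustration. It is not OpenDocument, and it is not Office 12's schema. A real application would validate against the published schema rather than a hand-written tag list.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini "schema": a memo is a <memo> root element containing
# exactly <to>, <from>, and <body>, in that order. Invented for illustration.
EXPECTED_ROOT = "memo"
EXPECTED_CHILDREN = ["to", "from", "body"]


def is_well_formed(text):
    """Return True if the text parses as XML at all."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def conforms_to_memo_format(text):
    """Return True if the text is well-formed XML AND follows the memo layout."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    child_tags = [child.tag for child in root]
    return root.tag == EXPECTED_ROOT and child_tags == EXPECTED_CHILDREN


# Well-formed, and in the vocabulary we have a description of.
memo = "<memo><to>Residents</to><from>ITD</from><body>Hello</body></memo>"

# Equally well-formed, but in a vocabulary we know nothing about.
other = "<doc><blob kind='x17'>4f2a</blob></doc>"

# Not even well-formed: <to> is never closed.
broken = "<memo><to>Residents</memo>"

for name, text in [("memo", memo), ("other", other), ("broken", broken)]:
    print(name, is_well_formed(text), conforms_to_memo_format(text))
```

Any XML parser will happily read the second document, but without a description of what `blob` and `kind` mean, a program can do nothing useful with it. That description is the part a vendor can still control.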

1 I don't understand why they chose ASCII text, as opposed to Unicode UTF-8. ASCII only supports the unaccented Latin alphabet, while Unicode supports most alphabets used today. Considering that state documents, which may include emails to residents, have to be in multiple scripts, this makes no sense. UTF-8 is also backwards-compatible with ASCII, so it would not cause problems with older email readers that don't support the more general format.
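
The backwards-compatibility claim in footnote 1 is easy to check for yourself. The following is a small Python sketch. The sample strings and the helper name `fits_in_ascii` are my own and are only there for illustration.

```python
# Text that uses only ASCII characters.
ascii_text = "The public comment period extends until Sept. 9."

# For pure-ASCII text, the ASCII bytes and the UTF-8 bytes are identical,
# so an ASCII-only reader sees exactly what it always saw.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Text in several scripts (Portuguese, Chinese, Russian greetings).
multilingual = "Olá, 你好, Привет"


def fits_in_ascii(text):
    """Return True if every character in the text can be stored as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# UTF-8 can store the multilingual text and get it back unchanged.
round_trip = multilingual.encode("utf-8").decode("utf-8")

print("ASCII and UTF-8 bytes identical:", same_bytes)
print("Multilingual text fits in ASCII:", fits_in_ascii(multilingual))
print("UTF-8 round trip preserved text:", round_trip == multilingual)
```

In other words, choosing UTF-8 costs nothing for existing English-only mail, and it is the only one of the two that can carry the other scripts at all.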

2 Despite being an older format, this one may still make sense over XHTML 1.0, since some people still have old browsers that do not fully implement XHTML. There is also no indication that HTML 4.01 support will go away any time soon. But, considering the emphasis on XML-based formats in the rest of the document, I'm surprised there isn't an option to use XHTML 1.0, which is a more consistent format, and easier to convert to other formats.

3 PDF was once only usable as a read-only format by non-Adobe applications. Now, there are many good PDF readers available on multiple platforms, and a number of applications that can write PDF files. Unfortunately, I don't know of any open-source applications capable of modifying already-written PDF files, including filling in PDF-based forms. It would have been nice if PDF's use were limited to read-only informational documents. But, since I don't see any reason that open-source applications can't support the PDF features in the future, it would probably be unreasonable to block its use.


The change is not happening fast enough. Unless you consider chisel-and-stone a "standard format," since that's the most functional thing in my workspace at the moment.
They've got over a year to implement the changeover. And, for now, it's just a file format switch, which may mean that, in the end, you'll be using the same nonfunctional equipment and software as you were at the beginning.
An open file format beginning about twelve and a half months ago would have made my life a heck of a lot easier. The equipment would still be crappy, but at least I wouldn't be draining years of my life converting back and forth between MSWord and WordPerfect (ancient versions of each, I'll have you know).
It will be interesting to see what happens to all those older documents after the transition.
flpsed is an open-source program which leaves a lot to be desired but does let you import PDFs, write text on them, and save the result. Not great, but better than nothing. On the other hand, the current round of IRS tax forms has document rights applied that are supposed to let you save data in the forms even with just the basic Reader. Which is not quite what you were asking for, but at least it should be functional.
Thanks, aaron. I'd never heard of that before, maybe I'll give it a try. :-)

Of course, you could always convert PDFs to other formats (e.g., pictures or PostScript) and edit them. It's just terribly inconvenient.

I don't know enough about PDF to know whether the fill-in form problem is one with the format itself or just that nobody's written good software to do it yet.

The Acrobat Reader for Linux itself has always been buggy and about one to two versions behind the Windows one, aside from it being non-free software. Last time I used it, it didn't support fill-in forms anyway. But, I haven't seen it in maybe a year. I now use xpdf (does not support fill-in forms) for viewing PDFs.