Designing Patterns

View from the Acropolis

A high level look at the theory and practice of software engineering




Characteristics of a Good Software Documentation Pipeline

My previous entry discussed software documentation standards and the importance of establishing one for a software development group. This is not sufficient to ensure that a group produces quality software documentation, however. A documentation creation and publishing pipeline also must be built to support the standard. Such a pipeline transforms software documentation created by programmers into formatted documents, such as web pages, and then publishes these documents on one or more repositories. A pipeline must satisfy a number of requirements in order for a group to derive the greatest benefit from creating documentation while expending the least effort and minimizing developer pain, which is crucial to ensuring wholehearted compliance with and peer enforcement of the standard.

The heart of a documentation pipeline is a tool or series of tools that read software documentation and generate formatted documents; these tools are called documentation generators. Such a tool might process a Ruby source file containing a class and generate a web page listing all of the class’s methods and displaying each method’s comments. Different documentation generation tools support different languages; in general, the more popular a language, the more tools will be available for it. Some tools, such as Doxygen, support a number of programming languages, while others, such as RDoc, focus on a single language (Ruby, in this case). If a tool focuses on only one language, it likely will provide much better support for that language than does a more general tool. RDoc, for instance, is the only tool that supports Ruby’s metaprogramming constructs.

When a group works with several languages, using the optimal tool for each language must be weighed carefully against using one tool for multiple languages. This is a nuanced decision that should be made separately for each language. It depends on how tightly coupled the language’s code is with that of the other languages, how long it will take the developers to master the language’s optimal tool, and how much benefit that tool offers over a multi-language tool. If a group maintains a code base in which different languages call into each other, then I recommend using a single tool that can handle all of these languages and that preferably can understand the relationships across the language boundaries. RDoc, for example, provides this functionality for C and Ruby; although its purpose is to document Ruby code, it can parse C and understands the C/Ruby calling conventions, allowing it to document properly the common case of libraries consisting of both C and Ruby. At a prior employer I worked in a group that maintained a system written in C, C++, Fortran 77, and Fortran 95, and a documentation tool that could have understood the inter-language relationships would have been very valuable.
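
To make the idea of a documentation generator concrete, here is a minimal sketch in Python rather than Ruby. It is not how RDoc or Doxygen work internally; it only illustrates the basic transformation from a class and its comments into a web page. The function generate_class_page and the Stack class are hypothetical names invented for this illustration.

```python
import html
import inspect


def generate_class_page(cls):
    """Return an HTML page listing the public methods of cls and their docstrings."""
    parts = [
        "<html><body>",
        f"<h1>{html.escape(cls.__name__)}</h1>",
        f"<p>{html.escape(inspect.getdoc(cls) or '')}</p>",
    ]
    # getmembers returns (name, function) pairs sorted by name.
    for name, member in inspect.getmembers(cls, inspect.isfunction):
        if name.startswith("_"):
            continue  # skip private and special methods
        signature = f"{name}{inspect.signature(member)}"
        doc = inspect.getdoc(member) or "(undocumented)"
        parts.append(f"<h2>{html.escape(signature)}</h2>")
        parts.append(f"<p>{html.escape(doc)}</p>")
    parts.append("</body></html>")
    return "\n".join(parts)


class Stack:
    """A last-in, first-out container (hypothetical example class)."""

    def __init__(self):
        self._items = []

    def push(self, item):
        """Add item to the top of the stack."""
        self._items.append(item)

    def pop(self):
        """Remove and return the top item."""
        return self._items.pop()
```

Calling generate_class_page(Stack) returns a page with one heading per public method, followed by that method’s comment. A real generator adds parsing of source files, cross-references, and styling on top of this core loop.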

Aside from generating documents from code and comments, most widely-used documentation generation tools also define a markup language that can describe elements of the software documentation. A key characteristic of such a markup language is its expressiveness, particularly whether the markup can describe all program elements properly. Beyond being able to describe common elements like lists, headings, and emphasis, the markup also should be able to describe code constructs like method parameters, return values, and possible exceptions. While markup is not needed to generate basic documents, it greatly increases the expressiveness of the documentation and enhances the information conveyed by the generated documents. Describing elements with markup allows document formatting to be tweaked to display the elements in special ways, such as using a particular color for method parameters. Without expressive markup, developers often resort to primitive markup in order to format documentation properly. RDoc, for instance, does not offer markup for parameters and return values, and so my company simulates such markup by using the less descriptive markup for headings and lists. Not only is the required markup longer, more painful to write, and duplicated throughout our code base, but RDoc also does not understand the elements properly and so cannot generate optimal documents with them.
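
The difference that expressive markup makes can be sketched as follows. This example assumes the field syntax used in Python docstrings processed by Sphinx (:param name:, :returns:, :raises Name:), not RDoc markup, and the parse_fields function is a hypothetical helper written for this illustration. Once parameters, return values, and exceptions are tagged, a tool can extract them as distinct elements and format each one differently.

```python
import re

# Matches lines such as ":param amount: text", ":returns: text", ":raises ValueError: text".
FIELD = re.compile(r"^:(param|returns|raises)(?:\s+(\w+))?:\s*(.*)$")


def parse_fields(comment):
    """Split a comment into free text plus parameter, return, and exception fields."""
    doc = {"summary": [], "params": {}, "returns": None, "raises": {}}
    for line in comment.strip().splitlines():
        line = line.strip()
        match = FIELD.match(line)
        if not match:
            if line:
                doc["summary"].append(line)  # ordinary prose, kept as the summary
            continue
        kind, name, text = match.groups()
        if kind == "param":
            doc["params"][name] = text
        elif kind == "raises":
            doc["raises"][name] = text
        else:
            doc["returns"] = text
    doc["summary"] = " ".join(doc["summary"])
    return doc


COMMENT = """
Withdraw money from the account.

:param amount: Number of cents to withdraw; must be positive.
:returns: The remaining balance in cents.
:raises ValueError: If amount exceeds the balance.
"""
```

With headings and lists standing in for these fields, the same comment would still render, but a tool could not tell a parameter from any other list item.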

Another important characteristic of a documentation generation tool’s markup language is how easily it can be understood, modified, and written. XML and HTML, for instance, are very verbose (since each tag must be closed), and I find that the tags obscure the content. Such complicated languages also are difficult to write correctly and so are painful for developers to use to write comments. By contrast, there are a variety of lightweight markup languages, such as Markdown, that are easier to read and write (wikis often use such languages for this reason). It is vital that marked up comments be legible as text, since programmers working directly with the source code must be able to read them. This also is an important consideration if the package-level documentation is distributed as text files (for example, RDoc transforms a README that I maintain into beautiful HTML, but the text file remains very easy to read).

A defining characteristic of a documentation generation tool is the format of the documents that it generates. At a minimum, it should be able to generate web pages, since web pages can be viewed on virtually every platform, provide a plethora of stylistic options, make cross-references very easy to follow (via links), and can be searched easily. Additionally, if the web pages are published on the public internet, they will be indexed by search engines and so will advertise the software. Finally, web pages easily can be proofread by developers before releasing documentation. It is crucial that the default look of the web pages be as aesthetically pleasing and as readable as possible, so that the documents will be a pleasure to read. The look also should be customizable, so that the group’s developers can tweak it, and so that developers around the world can craft better looks. In addition to being more usable, appealing web pages are strong positive reinforcement for developers writing documentation. The tool also should generate cross-reference links to the appropriate documents for class names, method names, modules, and other source code entities, as such links make documentation much easier to use. The tool additionally should be able to link methods to syntax-highlighted source code, allowing readers to browse the source easily in the context of the documentation. Finally, the tool should be able to generate and embed class diagrams. Javadoc, Doxygen, and RDoc are examples of tools that offer all of these features.
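
Cross-reference generation can be sketched roughly as below. This is an illustration under simple assumptions, not the algorithm used by Javadoc, Doxygen, or RDoc: the tool is assumed to already hold a mapping from entity names to page addresses, and link_known_names is a hypothetical function name.

```python
import html
import re


def link_known_names(text, pages):
    """Escape text for HTML and wrap every known entity name in a link to its page."""
    if not pages:
        return html.escape(text)
    # Try longer names first so that "Stack.push" wins over "Stack".
    names = sorted(pages, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    pieces = []
    last = 0
    for match in pattern.finditer(text):
        pieces.append(html.escape(text[last:match.start()]))
        name = match.group(1)
        pieces.append(f'<a href="{html.escape(pages[name])}">{html.escape(name)}</a>')
        last = match.end()
    pieces.append(html.escape(text[last:]))
    return "".join(pieces)
```

Given the mapping {"Stack": "Stack.html", "Stack.push": "Stack.html#push"}, the comment text "Call Stack.push on a Stack" gains two links, while a longer word that merely begins with "Stack" is left alone because the pattern matches whole words only.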

Depending on the readers of a group’s software documentation, it may be desirable to publish documentation in other formats besides web pages. Some UNIX programmers, for instance, prefer to read software documentation on the command line, and so tools exist for this, such as man for UNIX utilities, perldoc for Perl scripts and libraries, pydoc for Python libraries, and ri for Ruby libraries. Windows programmers, on the other hand, often read software documentation in Windows Help, and so many tools can generate documents in this format. In order to make the documentation as usable as possible, the pipeline should generate all applicable output formats from a single documentation source, either with one tool or a series of tools. Javadoc, Doxygen, and RDoc all can generate documents in multiple formats.
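
Generating several formats from a single source amounts to keeping the extracted documentation in one neutral structure and applying one renderer per format. The sketch below assumes a very simple structure (a dictionary with a name and a comment); the function names and the RENDERERS table are hypothetical and chosen only for illustration.

```python
import html
import textwrap


def render_html(entry):
    """Render one documented method as an HTML fragment."""
    return (
        f"<h2>{html.escape(entry['name'])}</h2>\n"
        f"<p>{html.escape(entry['doc'])}</p>"
    )


def render_text(entry, width=60):
    """Render the same entry as indented plain text for a command-line reader."""
    body = textwrap.fill(entry["doc"], width=width,
                         initial_indent="    ", subsequent_indent="    ")
    return f"{entry['name']}\n{body}"


# One renderer per output format; adding a format means adding one entry here.
RENDERERS = {"html": render_html, "text": render_text}


def render_all(entry):
    """Produce every supported output format from a single documentation entry."""
    return {fmt: render(entry) for fmt, render in RENDERERS.items()}
```

Because both renderers read the same entry, the web page and the command-line view cannot drift apart; only the presentation differs.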

After generating documents, a pipeline must publish them to at least one repository in order to allow developers to view them. The nature of this repository will, of course, depend on the document format. At a minimum, however, it should allow developers to search the documents. In addition, it should be accessible from wherever the developers will work; if developers log into a corporate system from home and code, for instance, then they also should be able to access the documentation repository from home. Depending on the software’s release cycle, the repository may need to store multiple versions of documentation, tracking different versions of the software. Lastly, it must be kept up to date, so that developers always can trust that it reflects the code base faithfully. This should be automated in order to eliminate human error and in order to save developers yet another chore when moving code. A prior employer ran a scheduled job that generated documentation nightly from the production code. At Designing Patterns, our build system generates and publishes documentation when releasing new versions of our open-source packages.
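
An automated publishing step that keeps multiple versions might look something like the following sketch. It is not a description of our build system; the directory layout (repository/package/version) and the LATEST marker file are assumptions made for this example, and publish is a hypothetical function name.

```python
import shutil
from pathlib import Path


def publish(generated_dir, repository_dir, package, version):
    """Copy generated documents into repository_dir/package/version and mark it as latest."""
    target = Path(repository_dir) / package / version
    if target.exists():
        shutil.rmtree(target)  # republishing a version replaces the stale copy
    shutil.copytree(generated_dir, target)
    # A small text file stands in for a "latest" symlink, which not every platform supports.
    (Path(repository_dir) / package / "LATEST").write_text(version + "\n")
    return target
```

A release script or scheduled job would call publish after running the documentation generator, so that earlier versions remain available and no developer has to remember to copy files by hand.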

While it may take some work to establish, a good documentation pipeline will justify the effort by increasing programmer productivity and happiness, positively reinforcing the documentation standard.


Gold Standard for Software Documentation

We are told throughout our educations and careers that software documentation, the documentation that we write about our software for other programmers, is critically important. Though virtually every software engineer pays lip service to its importance, many fail to create proper documentation for their projects due to lack of interest, lack of time, or lack of direction. Documentation is as important as design, implementation, and testing in constructing quality software, however, and inadequately documented software is poor software. It therefore follows that groups producing poorly documented software are fundamentally flawed. Establishing a documentation standard and a documentation creation and publishing pipeline is part of setting a group’s technical direction. The standard ensures that the developers prioritize creating documentation and the pipeline ensures that excellent documentation is generated as efficiently and painlessly as possible, which in turn positively reinforces the group’s documentation standard.

The first kind of software documentation that a standard should mandate is code documentation, or comments in source code. There are two kinds of code comments:

  1. Comments discussing the intention of a piece of code, where this is not obvious from the code itself
  2. Comments describing an interface (class, function, or method), particularly its inputs and outputs

I write such comments after finishing a module, while reviewing the code, because I have found that comments written while a module is being developed need to be revised constantly during the development, wasting time. Aside from just creating documentation, this process actually improves the code itself. Writing the first kind of comment forces re-examination of tricky code and often leads to simplification and clarification of such code. Writing the second kind of comment forces rethinking of interfaces, in particular whether an interface really is necessary, desirable, and well-expressed; I frequently remove methods, alter argument lists, or change names while documenting code. In addition, documenting an interface’s inputs often reveals potential error conditions. Both types of commenting greatly increase the speed at which other programmers can understand and modify code and so directly contribute to a group’s efficiency, repaying the original time investment to write the comments. Interface documentation is particularly important in this regard, because it often can prevent the interface from being invoked incorrectly, saving debugging time. Comments also make it much easier to maintain code without the original author being present, which is crucial given that people change jobs very frequently (US citizens change jobs roughly once every two years) and that open source code may be used and modified by programmers from all around the world.
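
As a small illustration of the two kinds of comments, consider the hypothetical Python function below (average_rating is invented for this example). The docstring is an interface comment describing inputs, outputs, and errors; the comment inside the body explains an intention that is not obvious from the code. Writing the docstring is what prompted the two input checks, which is the effect described above.

```python
import math


def average_rating(ratings):
    """Return the mean of ratings as a float.

    ratings: a non-empty sequence of numbers between 1 and 5.
    Raises ValueError if ratings is empty or contains an out-of-range value.
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("each rating must be between 1 and 5")
    # Intention: use math.fsum rather than sum so that long lists of fractional
    # ratings do not accumulate floating-point rounding error.
    return math.fsum(ratings) / len(ratings)
```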

The second kind of software documentation that a standard should mandate is package and library level documentation. README files distributed with packages and websites describing libraries are this kind of documentation (this README is a concrete example for an open-source package that I maintain). While code documentation should be exhaustively complete and can discuss low-level details, package documentation should discuss only important, high-level concepts. It should not duplicate code documentation. Each method in a class should be commented, for instance, but package documentation discussing the class should review just the class’s main features and the methods that need to be called in the class’s most common use cases. In fact, package documentation should be organized around a series of examples that both illustrate what the package can do and quickly teach the reader its basic functionality. Such examples make the package easier to use, since the reader may not need to look at the source code for common use cases, and also serve as excellent advertisement for the package’s features, making it more likely that the package will be used. This is especially important for open-source software, as excellent package-level documentation distinguishes such software from the competition and also can be indexed by search engines.

Good programmers always will create documentation, even in the absence of any support. One group that I worked in had no documentation standard, for instance, and some developers still documented their work. There are a couple of problems with such a group’s documentation, however. Firstly, without a documentation standard, different developers inevitably will document differently and to different degrees. While inconsistent documentation is better than no documentation, it does hinder the creation and use of a documentation repository. Time will be lost as developers are forced to switch between documentation styles throughout the day. Secondly, some developers inevitably will create insufficient documentation, dragging down the group’s efforts and creating an environment in which poorly documented software is acceptable (the broken windows effect). Parts of the system will not be documented at all, undermining a documentation repository, as developers will not be able to trust that it always has the information that they need. In my prior group, for instance, many developers never bothered to use the Doxygen repository because it was incomplete and instead always went directly to the source code.

In order to establish a documentation standard, a group’s management must classify software documentation as a mandatory deliverable for every project, and project planners always need to budget time for it. Management also must instruct developers and team leaders to enforce the standard during code reviews and should evaluate programmers partially on the basis of the documentation that they produce. The group’s technical leadership, moreover, should outline requirements, evaluate and purchase products, and ask developers to write tools for an efficient documentation creation and publishing pipeline for the standard. This will ensure that the group gets the greatest bang for the buck from creating documentation. Also, as even good programmers often do not enjoy writing documentation (viewing it as a necessary evil), an excellent pipeline is a carrot that helps to ensure acceptance of and compliance with the documentation standard. I will discuss what constitutes a good pipeline in my next article.
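
One example of a small tool that developers could write to support enforcement during code reviews is a check for undocumented interfaces. The sketch below assumes Python source and uses the standard ast module; the undocumented function is a hypothetical helper, and a real check would need to reflect the group’s own standard.

```python
import ast


def undocumented(source):
    """Return the sorted names of public classes and functions in source that lack a docstring."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Names starting with an underscore are treated as private and skipped.
            if not node.name.startswith("_") and ast.get_docstring(node) is None:
                missing.append(node.name)
    return sorted(missing)
```

Run over the files touched by a change, such a check gives reviewers a list of interfaces to ask about, which keeps the review discussion on the quality of the comments instead of on their existence.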