Biolib Documentation 1.0

Revision as of 03:17, 23 September 2016 by WikiSysop (talk | contribs) (Parsing Doxygen XML)

Biolib Documentation

Aiming for a first Biolib 1.0 release for Perl with POD documentation.

The success of any software project depends on its documentation. For this there are many documentation generators that work from the source code - like POD for Perl and JavaDoc for Java. For Biolib the situation is complicated by the fact that it deals with a potpourri of languages (a potpourri is a mixture of dried, naturally fragrant plant material, used to provide a gentle natural scent in houses, so it isn't all negative).

Initially I started off with Doxygen, a documentation generation tool for C code. Unfortunately this is not good enough. People who write Perl want to see Perl documentation; likewise Ruby people want Ruby documentation, Python people want Python documentation, etc.

So requirement number one is to turn the C API into readable API documentation for Perl, Ruby, Python etc.

Also with Biolib we write integration and (doc)testing scripts. These can also act as documentation. I would like to integrate that with the main documentation.

Generating good API documentation for multiple languages is a problem. Ideally we would use the C/C++ code base to expose the interface for all mapped languages - if possible including generated example code for every language (!). SWIG has little support for that (there have been attempts in the past, but they were apparently dropped). Unfortunately the code SWIG generates does not really lend itself to the native scripting language documentation generators, though there are examples like Math::GSL which have done exactly that.

What SWIG can do is generate XML. The C function:

int my_mod(int x, int y);

produces output like this (simplified here):

<cdecl>
  <attributelist>
    <attribute name="sym_name" value="my_mod" />
    <attribute name="name" value="my_mod" />
    <attribute name="decl" value="f(int,int)." />
    <parmlist>
      <parm>
        <attributelist>
          <attribute name="name" value="x" />
          <attribute name="type" value="int" />
        </attributelist>
      </parm>
      <parm>
        <attributelist>
          <attribute name="name" value="y" />
          <attribute name="type" value="int" />
        </attributelist>
      </parm>
    </parmlist>
    <attribute name="kind" value="function" />
    <attribute name="type" value="int" />
  </attributelist>
</cdecl>

This allows transforming the function definitions into some other format. Likewise there are facilities for structs, classes etc.
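To make that concrete, here is a minimal sketch of reading such a cdecl into a Hash. It uses REXML rather than the libxml bindings used elsewhere on this page, simply so it runs without extra setup; the helper names attributes_of and parse_cdecl are made up for this illustration and are not part of swig2doc.

```ruby
require 'rexml/document'

# Simplified SWIG XML for: int my_mod(int x, int y);
SWIG_XML = <<~XML
  <cdecl>
    <attributelist>
      <attribute name="sym_name" value="my_mod" />
      <attribute name="decl" value="f(int,int)." />
      <parmlist>
        <parm>
          <attributelist>
            <attribute name="name" value="x" />
            <attribute name="type" value="int" />
          </attributelist>
        </parm>
        <parm>
          <attributelist>
            <attribute name="name" value="y" />
            <attribute name="type" value="int" />
          </attributelist>
        </parm>
      </parmlist>
      <attribute name="kind" value="function" />
      <attribute name="type" value="int" />
    </attributelist>
  </cdecl>
XML

# Collect the direct <attribute> children of an <attributelist> into a Hash.
def attributes_of(attributelist)
  attributelist.get_elements('attribute').each_with_object({}) do |a, h|
    h[a.attributes['name']] = a.attributes['value']
  end
end

# Turn one <cdecl> element into a Hash (hypothetical helper).
def parse_cdecl(cdecl)
  list = cdecl.elements['attributelist']
  func = attributes_of(list)
  func['parmlist'] = list.get_elements('parmlist/parm/attributelist').map do |parm|
    attributes_of(parm)
  end
  func
end

func = parse_cdecl(REXML::Document.new(SWIG_XML).root)
params = func['parmlist'].map { |p| "#{p['type']} #{p['name']}" }.join(', ')
puts "#{func['type']} #{func['sym_name']}(#{params})"
```

From a Hash like this, writing the function out in another notation is a matter of string formatting.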

Still, a simple transformation does not fully solve sharing function descriptions and/or examples. It would also be nice to have languages use their native documentation generators - e.g. we would have to create POD files for Perl - as these are what users are comfortable with, and they allow Biolib inclusion into CPAN etc. I'll try and do something with that later.

For the 1.0 release, I am thinking of using the SWIG generated XML to list all methods, and simply link them to relevant Doxygen documentation, the source code, or some other type of documentation (like EMBOSS' own). For every method in Biolib we would like to know where the most relevant documentation is.

Another useful tool may be ctags - as it can point out where the methods reside in the C/C++ code and where they are referred from.

In all we have three tools which generate output: SWIG, Doxygen and ctags. Add to that some loose documentation and the (possible) 'native' doctests for Ruby, Perl and Python, that come with Biolib.

Doxygen API document generator

Biolib supports generating Doxygen docs.

Use the master switch to ./CMakeLists.txt named BUILD_DOCS:BOOLEAN=TRUE, so

rm CMakeCache.txt
cmake -DBUILD_DOCS:BOOLEAN=TRUE . 
make apidoc

will generate Doxygen output in ./doc/apidoc/html for the C files listed in the ./doc/doxygen/Doxyfile config file. Through this route the Affyio module was documented in the early days of Biolib.

Doxygen can also generate XML (!). This means we can inject documentation from the C files into a native format. Unfortunately Doxygen is not all that is required. The reason is that Biolib does not control all code - and upstream authors do not necessarily like Doxygen-type documentation in their code - so this will only work with close co-operation from upstream maintainers.

So I'll also have to think of a way of injecting documentation without touching the original code base.

You see, there are several challenges here. First, code and documentation come from different sources and need to be treated differently. Second, the generated documentation comes in different forms - one for every supported Biolib language.

It is taking me a while to understand all implications (and I am sure I am missing some).

SWIG XML generator

The different languages in Biolib may have different mappings - i.e. not all functions/modules in Biolib may be supported for all languages at the same time.

This means for every language (Perl, Ruby etc.) Biolib needs to generate the XML for the specific module. Generating the XML is really fast, so I could do that at build time. The SWIG command line syntax at build time is the same, replacing the -perl switch with -xml. Doing this at build time has the added advantage we can use the settings at that point, like the include paths for the C code, and we don't need to write separate documentation scripts/make files.

We can write the XML to, for example,

./build/doc/xml/swig/perl/emboss.xml
./build/doc/xml/swig/ruby/emboss.xml
./build/doc/xml/apidoc/emboss.xml
...

resulting in

./build/doc/html/ruby/emboss.html
./build/doc/html/perl/emboss.html
./build/doc/html/apidoc/emboss.html

Where the individual methods in the Perl docs will point to the details in apidoc. 'apidoc' reflects the native C documentation. And each language gets its own documentation. Later we can also generate

./build/doc/POD/emboss.pm
./build/doc/rdoc/emboss.rb

etc. To be parsed by the relevant tools.

With regard to the CMake build system: inside the swig/perl/affyio/CMakeLists.txt file we call BUILD_PERL_BINDINGS. MapPerl includes MapSwig and PerlMacros. All it does, currently, is copy the SWIG generated .pm file to the build dir (see cmake-tools/modules/PerlMacros.cmake). What we want to do is use the environment SWIG uses and simply generate XML.

Generating SWIG XML, in place

We will use the main build scripts to generate SWIG XML. A master switch will switch on/off XML generation.

The CMake program comes with two swig modules:

/usr/share/cmake-2.6/Modules$ ls *SWIG*
FindSWIG.cmake  UseSWIG.cmake

in these are defined

SWIG_EXECUTABLE
swig_include_dirs

and in cmake-tools are added

CMAKE_INCLUDE_PATH
USE_SWIG_INCLUDEPATH

which we, probably, need as SWIG needs to find all relevant include files.

To create XML we can run SWIG by hand:

cd src/mappings/swig/perl/affyio/
swig -I../../../../clibs/affyio/src/ -xml perl_affyio.i 

generates perl_affyio_wrap.xml, sized 470K, in half a second. To create a special output file

swig -I../../../../clibs/affyio/src/ -o build/affio_perl.xml -xml perl_affyio.i

multiple include paths just get added. Note this command just runs as stated. Likewise for EMBOSS:

swig -I../../../../../contrib/EMBOSS/ajax/core/ -I../../../../clibs/emboss/src/ -xml perl_emboss.i 

Parsing XML

I had half a mind to use XSLT to simplify the XML, but we'll start with direct XML parsing using the libxml library. I need to build a 'logic' tree which will order methods by group - and eventually class. Writing Perl, Ruby, or Python should be handled by writers using the logic tree.

First we will build up the tree from the SWIG XML. Next we fetch comments from the Doxygen XML.

SWIG XML

Every method starts with a 'cdecl'. The contained attribute list contains attributes 'sym_name' (method name), and the return value. Next is a parmlist describing each parameter.

Doxygen XML

Doxygen generates XML. Biolib generates this in ./build/doc/xml/apidoc. A method is described as:

    <sectiondef kind="func">
    <memberdef kind="function" id="group__affyio_1ge0bdacab9809b9b475b9d7bfe1332bc6" prot="public" static="no" const="no" explicit="no" inline="no" virt="non-virtual">
      <type><ref refid="structCELOBJECT" kindref="compound">CELOBJECT</ref> *</type>
      <definition>CELOBJECT* open_celfile</definition>
      <argsstring>(const char *celfilename)</argsstring>
      <name>open_celfile</name>
      <param>
        <type>const char *</type>
        <declname>celfilename</declname>
      </param>
      <briefdescription>
      </briefdescription>
      <detaileddescription>
      <para>Open a cel file using the Affyio library and return a pointer to
        a <ref refid="structCELOBJECT" kindref="compound">CELOBJECT</ref>,
        which maintains state keeping track of the opened CEL data. The full
        array data gets stored in memory - including stddev, npixels, masks and
        outliers.</para><para>Use the direct celfile_methods instead, for more
        effecient memory usage.</para><para>
       <parameterlist kind="param"><parameteritem>
      <parameternamelist>
      <parametername>celfilename</parametername>
      </parameternamelist>
      <parameterdescription>
        <para>points to a valid Affy CEL file (or .gz edition)</para>
      </parameterdescription>
      </parameteritem>
      </parameterlist>

which is pretty elaborate. The 'group' file is pretty similar. And structures get their own XML definition files.

I just found someone has been working on combining SWIG with Doxygen (for Java) in: http://swig.svn.sourceforge.net/viewvc/swig/branches/gsoc2008-cherylfoil/Doc/Manual/Doxygen.html. I am not sure what has been achieved here, but it looks incomplete (an often seen feature of OSS projects is that they fail to gain momentum, and the developers leave - it is a form of evolution). Also, this implementation does not appear to use the XML output of SWIG. I don't know for sure, but the project only targets Java and looks dead anyway. It does make me realise this is a separate project which would benefit others. So let's start something on github and name it swig2doc. See http://github.com/pjotrp/swig2doc

Inject Extra Information

I am also going to add injecting additions and overrides for method definitions. This will be in a non-XML format, which allows rapid editing (I am no fan of XML). For example a simple file format may contain class+method names:

~
Name: Class::method
Description: A multi line description
Example: A multi-line example
~
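A parser for such a file can stay tiny. The sketch below is only one possible reading of the format: it assumes that a line holding just "~" separates records, and that any line not starting with a known field name continues the previous field. The method name parse_overrides and the sample record are invented for the illustration.

```ruby
# Parse the override format into an Array of Hashes.
# Assumptions: "~" on its own line separates records; unknown lines
# continue the field that came before them.
FIELDS = %w[Name Description Example].freeze

def parse_overrides(text)
  records = []
  current = nil
  field = nil
  text.each_line do |line|
    line = line.chomp
    if line.strip == '~'
      records << current if current
      current = nil
      field = nil
    elsif (m = line.match(/\A(#{FIELDS.join('|')}):\s*(.*)\z/))
      current ||= {}
      field = m[1].downcase.to_sym
      current[field] = m[2]
    elsif current && field
      current[field] += "\n" + line.strip
    end
  end
  records << current if current
  records
end

OVERRIDES = <<~TEXT
  ~
  Name: Affyio::open_celfile
  Description: Open a CEL file
    and keep it in memory.
  Example: cel = open_celfile("test.cel")
  ~
TEXT

parse_overrides(OVERRIDES).each do |r|
  puts "#{r[:name]}: #{r[:description].inspect}"
end
```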

Create example code

Creating examples for each native implementation is (perhaps) tricky. We also want to test the examples - as this will allow the documentation to validate. I guess I'll start with specific examples:

Example-ruby:

>> object = open_txtfile("test.txt")
>> t = read_txt(object)
=> 20
>> close_txtfile(object)

Here three methods are tested, and one returns a tested value. The Perl version will be really similar:

Example-perl:

>> $object = open_txtfile("test.txt");
>> $t = read_txt($object);
=> 20
>> close_txtfile($object);

We could generalize this to some simpler parseable format

>> var object = open_txtfile("test.txt")
>> var t = read_txt(var object)
=> int 20
>> close_txtfile(var object)

which would also easily translate into Python. In fact, this could even be translated into C and injected back into the original C code - for the upstream authors to accept, or reject. The Java version would be

>> File object = open_txtfile("test.txt")
>> int t = read_txt(object)
=> int 20
>> close_txtfile(object)

I would also have to add overrides for hairy code. For example we could add language specific lines:

ruby>> x =~ /#{myval}/
perl>> $x =~ /$myval/

so as to keep the example generator really simple.
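A rough sketch of such a generator for the Ruby and Perl targets follows. It assumes the generic notation always introduces variables as "var name" and expected values as "=> type value", and that "ruby>>" or "perl>>" prefixes mark language overrides which are copied through verbatim. The method name translate_example is hypothetical.

```ruby
# Translate the generic example notation into Ruby or Perl doctest lines.
GENERIC = <<~'TEXT'
  >> var object = open_txtfile("test.txt")
  >> var t = read_txt(var object)
  => int 20
  >> close_txtfile(var object)
  ruby>> x =~ /#{myval}/
  perl>> $x =~ /$myval/
TEXT

def translate_example(text, lang)
  lines = text.each_line.map do |raw|
    line = raw.chomp
    case line
    when /\A>> (.*)\z/
      code = Regexp.last_match(1)
      if lang == :perl
        # Perl variables get a sigil and statements end with a semicolon
        ">> #{code.gsub(/\bvar (\w+)/, '$\1')};"
      else
        ">> #{code.gsub(/\bvar (\w+)/, '\1')}"
      end
    when /\A=> \w+ (.*)\z/
      # drop the type annotation of the expected value
      "=> #{Regexp.last_match(1)}"
    when /\A(\w+)>> (.*)\z/
      # language-specific override: keep it only for the matching language
      ">> #{Regexp.last_match(2)}" if Regexp.last_match(1) == lang.to_s
    end
  end
  lines.compact
end

puts translate_example(GENERIC, :ruby)
puts
puts translate_example(GENERIC, :perl)
```

Python would need little more than the Ruby branch; typed targets such as Java or C would need the type information that this sketch throws away.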

libxml2

We will use libxml2, as it is the fastest XML parser. Use RubyGems to install the Ruby bindings - except on Debian, where you use 'apt-get install libxml-ruby' instead!

gem install libxml-ruby

We want to create a tree of methods. This will be in memory:

class
  name
  type=class|group|module
  method1
    name
    description
    example
    returns
      type
      description
      C version
    par1
      type
      name
      description
      C version
    par2
      type
      name
      description
      C version

In code we should be able to query:

groups.each do | group |
  group.each_var do | var |
  end
  group.each_method do | method |
    print method.name
    print method.description
    method.each_param do | param | 
      print param
    end
  end
end
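One way to back that query interface is with a few plain Ruby objects. The class names below (DocParam, DocMethod, DocGroup) and the sample data are placeholders chosen for this sketch, not the names swig2doc ends up using.

```ruby
# Minimal in-memory tree supporting each_var / each_method / each_param.
DocParam = Struct.new(:type, :name, :description) do
  def to_s
    "#{type} #{name}"
  end
end

DocMethod = Struct.new(:name, :description, :returns, :params) do
  def each_param(&block)
    params.each(&block)
  end
end

class DocGroup
  attr_reader :name, :kind

  def initialize(name, kind = :group)
    @name = name
    @kind = kind # :class, :group or :module
    @vars = []
    @methods = []
  end

  def add_method(meth)
    @methods << meth
    self
  end

  def each_var(&block)
    @vars.each(&block)
  end

  def each_method(&block)
    @methods.each(&block)
  end
end

group = DocGroup.new('affyio', :module)
group.add_method(
  DocMethod.new('open_celfile', 'Open a CEL file', 'CELOBJECT *',
                [DocParam.new('const char *', 'celfilename', 'path to a CEL file')])
)

[group].each do |g|
  g.each_method do |m|
    puts "#{m.name}: #{m.description}"
    m.each_param { |p| puts "  #{p}" }
  end
end
```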

SWIG2DOC

SWIG2DOC accepts SWIG XML files as input. E.g.

swig2doc emboss.xml affyio.xml ...

by default it generates HTML in the ./swig2doc/ directory. We will add other formats like Doxygen XML too.

Parsing XML (part II)

I have always disliked XML. Most parsing libraries are a royal pain. I am going to use libxml-ruby, as it is the fastest and does not require loading everything in memory. But after half a day of parsing SWIG I am no less annoyed. It is just too non-intuitive. After looking at the XML I decide I simply want to do:

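if doc.is_swig?
  header = parse_list(doc,'attributelist')
  modname = header['name']  # 'module' is a reserved word in Ruby
  doc.each(['cdecl','class']) do | type, attrs |
    # SWIG stores 'function' in the 'kind' attribute; 'type' is the return type
    if attrs['kind'] == 'function'
      add_function(attrs)
    end
  end
end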

where attrs is a nested Hash and add_function creates the object. What I want to have is the XML parser outside the logic describing the structure. Also I want to change the 'language' of the XML parser to read more naturally. With this document we are parsing embedded lists of attributes (the SWIG XML is not a nice example of XML, really, but it has repeated patterns for parsing).

With an attribute list we can do something like:

def parse_list(attributelist, name = 'attributelist')
  list = {}
  attributelist.each do | attribute |
    if attribute.is_list?
      # nested parameter list: recurse into the attribute list of each parm
      list[:parmlist] = parse_list(attribute,attributelist.fetch('parmlist/parm/attributelist'))
    else
      list[attribute['name']] = { :value => attribute['value'] }
    end
  end
  list  # return the Hash rather than the result of each
end

If it reads like this it is easy to match with the XML document - and, perhaps, easy to find bugs and/or maintain the code base.

I also want to do away with some of the XML parser logic. For example when you want an element name, with libxml you have to do two reads to move the pointer to the next element. So let us make that something like:

element = read_element()
# with properties
element.name
element.start?
element.end?
element.text
element.attributes

For this I am creating XMLEasyReader as a wrapper of XML::Reader. To allow for raw processing I pass in an existing XML::Reader. Now I can parse SWIG XML into a Hash type structure, for example:

{"kind"=>"function", "name"=>"cel_mm",
"decl"=>"f(p.CELOBJECT,p.CDFOBJECT,unsigned int,unsigned int).",
"sym_overname"=>"__SWIG_0", "sym_name"=>"cel_mm", "type"=>"double",
"sym_symtab"=>"b7cd53a8", "parmlist"=>[{"compactdefargs"=>"1",
"name"=>"celobject", "type"=>"p.CELOBJECT"}, {"name"=>"cdfobject",
"type"=>"p.CDFOBJECT"}, {"name"=>"probeset", "type"=>"unsigned
int"}, {"name"=>"probe", "type"=>"unsigned int"}]}

I wish to abstract this into an object structure - but separate from the XML processing, as this may change in time. So we want to end up with an easy to use OOP object tree - rather than a Hash table. This means:

XML data -> Hash -> OOP classes

Which, as it happens, puts functionality where it belongs.
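The last arrow of that chain could look roughly like this. SwigParam and SwigFunction are illustrative names; the input is the Hash shown above, trimmed to the keys the sketch uses. The only SWIG-specific knowledge it relies on is that SWIG writes a pointer type with a "p." prefix.

```ruby
# Sketch of the "Hash -> OOP classes" step.
SwigParam = Struct.new(:name, :type)

class SwigFunction
  attr_reader :name, :return_type, :params

  def initialize(attrs)
    @name = attrs.fetch('sym_name')
    @return_type = attrs.fetch('type')
    @params = attrs.fetch('parmlist', []).map do |p|
      SwigParam.new(p['name'], p['type'])
    end
  end

  # "p.CELOBJECT" reads as "CELOBJECT *"; other types pass through unchanged.
  # Only a single pointer level is handled in this sketch.
  def self.c_type(swig_type)
    swig_type.sub(/\Ap\.(.*)\z/, '\1 *')
  end

  def signature
    args = params.map { |p| "#{self.class.c_type(p.type)} #{p.name}" }
    "#{self.class.c_type(return_type)} #{name}(#{args.join(', ')})"
  end
end

CEL_MM = {
  'kind' => 'function', 'sym_name' => 'cel_mm', 'type' => 'double',
  'parmlist' => [
    { 'name' => 'celobject', 'type' => 'p.CELOBJECT' },
    { 'name' => 'cdfobject', 'type' => 'p.CDFOBJECT' },
    { 'name' => 'probeset',  'type' => 'unsigned int' },
    { 'name' => 'probe',     'type' => 'unsigned int' }
  ]
}.freeze

puts SwigFunction.new(CEL_MM).signature
```

Keeping the Hash as an intermediate step means the XML reader can change without touching these classes.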

I created my first output with ./bin/swig2doc test/data/swig/perl_affyio_wrap.xml

swig2doc 0.02 (February 2010)
2.7.3 LibXML reading test/data/swig/perl_affyio_wrap.xml
  affyio:cdf_mmprobe_info
  affyio:cdf_num_probesets
  affyio:cdf_pmprobe_info
  affyio:cdf_probeset_info
  (...)

For listing methods I can use the order of definitions in SWIG, as well as sort on function names. It would also be nice to present the functions in the order of the source files, as the author may have thought about the ordering when presenting the source (yes, some people are that clever). That information, however, is not in the SWIG XML. I may pick it up from the Doxygen XML, which contains line numbers, or from the ctags file.

Parsing Doxygen XML

The next step is to get the additional information from Doxygen. A nice job for the flight from Tokyo to Paris tonight.

For the Affyio module I have put an example in test/data:

biolib__affyio_8c.xml
biolib__affyio_8h.xml

Usually the .c file will contain the useful information on functions, though people are known to use the headers too. Initially I am going to parse the c files only. To pick up the C XML files we can assume (for now) the doxygen generated files will be in one directory.

swig2doc --doxydir test/data/doxygen/xml emboss.xml affyio.xml ...

So, the first step is to test whether this is a doxygen file; next we find the first memberdef where kind==function. We can pick up the descriptions of function and parameters, as well as the location/line in the file. This is nice, as we can add references to the code itself. For those cases the code *is* the documentation. So, essentially:

if doc.is_doxygen?
  doc.each_memberdef do | member |
    kind = member.attrib['kind']
    if kind == 'function'
      attrs = {}
      attrs['name']       = member['name']
      attrs['definition'] = member['definition']
      attrs['brief']      = member['briefdescription']
      attrs['detailed']   = member['detaileddescription']
      attrs['parameters'] = {}
      member['parameternamelist'].each do | param |
        attrs['parameters'][param['name']] = param['description']
      end
      add_function(attrs)
    end
  end
end
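For comparison, here is a small runnable version of the same idea. It uses REXML instead of libxml so it needs nothing beyond a standard Ruby install, and a cut-down memberdef as input: the id is shortened and the file path reduced to its base name. The method name doxy_functions is made up, and nested markup such as <ref> inside a description is ignored.

```ruby
require 'rexml/document'

DOXY_XML = <<~XML
  <sectiondef kind="func">
    <memberdef kind="function" id="f1">
      <type>unsigned long</type>
      <definition>unsigned long cel_num_cols</definition>
      <name>cel_num_cols</name>
      <briefdescription></briefdescription>
      <detaileddescription>
        <para>number of columns on the chip</para>
      </detaileddescription>
      <location file="biolib_affyio.c" line="80" bodystart="79" bodyend="82"/>
    </memberdef>
  </sectiondef>
XML

# Return one Hash per memberdef with kind="function".
def doxy_functions(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//memberdef[@kind="function"]').map do |m|
    loc = m.elements['location']
    paras = m.get_elements('detaileddescription/para')
    {
      'name'     => m.elements['name'].text.to_s,
      'type'     => m.elements['type'].text.to_s,
      # first text node of each top-level paragraph only
      'detailed' => paras.map { |p| p.text.to_s.strip }.join("\n"),
      'file'     => loc && loc.attributes['file'],
      'line'     => loc && loc.attributes['line'].to_i
    }
  end
end

doxy_functions(DOXY_XML).each do |f|
  puts "#{f['type']} #{f['name']} (#{f['file']}:#{f['line']})"
end
```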

On a side note, this is why I call this a BioBlog, Paris is gone and we are in India now. I gave lectures on OSS and Bioinformatics in Bangalore (IISc, IBAB) and Chennai (University of Madras) - and another one coming up in Trivandrum next week. In India OSS is something of a novelty, though I just read the state of Kerala is opting for FOSS, over Microsoft Windows, for education. So India should get on the OSS map, eventually.

Anyway, I left off with parsing attributes during the Paris flight. I want to send the attributes into a Hash, and in code it reads ugly

if (a = getfirst_attribute()) {
  do {
    do something
  } while (a = getnext_attribute())
}

Annoyingly the following Ruby code

if @reader.has_attributes?
  while @reader.move_to_next_attribute == 1
    h[@reader.name] = @reader.value
  end
end

throws a libxml segmentation fault in random locations of the XML file. Ah, after reading the attributes one has to call move_to_element - which resets the pointer to the attribute's element. This is what I mean when I say XML parsers are non-obvious (rant). The non-segfaulting code is now:

if @reader.has_attributes?
  if @reader.move_to_first_attribute == 1
    h[@reader.name] = @reader.value
    while @reader.move_to_next_attribute == 1
      h[@reader.name] = @reader.value
    end
  end
  @reader.move_to_element # reset main pointer in libxml
end

At this stage I can parse Doxygen XML. I want to create an object tree that is separate from the SWIG objects - so they are easier to test. That will be another DoxyModules hierarchy for every module. Meanwhile I'll set up a separate object tree that will reference the other trees SwigModules and DoxyModules, named CModules. The call interface will be similar. In a bigger setting we might prefer using Ruby's namespacing, e.g. Doxy::Modules, Swig::Modules and C::Modules - but for my purpose here I figure it does not really help.

The code in DoxyXMLParser turns the XML into a hash, containing:

{"name"=>"cel_num_cols", "bodystart"=>79, "line"=>80, "detaileddescription"=>"\n<para><simplesect kind=\"return\"><para>number of columns on the chip </para></simplesect>\n</para>        ", "briefdescription"=>"\n ", "bodyend"=>82, "type"=>"unsigned long", "file"=>"/mnt/auto/flash/git/opensource/biolib/src/clibs/affyio/src/biolib_affyio.c"}

which gives the information we need - i.e. the source file location of the function 'cel_num_cols', as well as the descriptions in the source file as parsed by Doxygen. You can see the detaileddescription contains some special Doxy XML that can be transformed into HTML, for example. At this point I am leaving that as is - I'll add transformations later (maybe I get to show off some XSLT).

Meanwhile I find the XML of the .c files contains similar information to the .h files - and minor differences. The .h files look complete, including a reference to the function body in the .c file. So I should use that information. That means I have to compile both, and merge the results. Life is about merging...

Another thing to parse, outside the function info, is the information at the top of the source file - which authors use to give a more global introduction. It comes *after* the <sectiondef> in a <detaileddescription> section before the <programlisting> section. Wow, the full source comes with the XML - no wonder these files are so large.

Combining SWIG and DOXY information

The Doxygen XML can contain more, or fewer, function definitions than the SWIG XML mapped ones. That depends on the completeness/coverage of the SWIG mappings. SWIG mapped functions that miss a Doxygen counterpart should issue a warning.

Another complication is that named SWIG modules, e.g. the Affyio module, are not named in Doxygen - the only thing Doxygen gives are source file names and function names. I have to have a think about that, as there can be naming conflicts between modules. That module name information has to come from outside, or SWIG - perhaps matching multiple function names from Doxygen lists to SWIG lists is the best way to handle this automatically. Alternatively we can generate documentation one module at a time. Perhaps that is the cleanest anyway. I'll only need to find a way to generate a 'master' index for all modules.

(a little later)

Having thought about it - module based parsing makes sense as it will use only the namespace of the particular module. To create a global index I can write a small database file containing the module names - and the functions.

This also means I don't have to distinguish names on the command line - the parser can figure out whether an XML file is a Doxy file, SWIG, or something else. It is a simplification. So now it becomes

swig2doc doxy1.xml doxy2.xml swigmodule.xml

Normally one SWIG module will be passed in, and multiple Doxy modules for each Biolib module. The SWIG module will give both the module name and the language type (Perl, Ruby etc.) that it should generate.

Just to be sure we'll combine information through something like

objects.each_swig_function do | swigfunc |
  objects.each_doxy_function do | doxyfunc |
    funclist.combine(swigfunc, doxyfunc)
  end
end

where the funclist is the ultimate reference list of functions that are, or aren't, mapped by SWIG. I have decided to also add functions to the documentation that are not mapped. Marked as such, they will help people searching the Internet to find the definitions. If they are not implemented people can always put in a request. Mapping usually goes by need - which is good, because actual usage is a form of testing the success of the mapping.
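A sketch of how such a reference list could be built follows. Instead of the nested loop it indexes both sides by function name, which is an assumption about the data layout on my part: both inputs are taken to be Hashes keyed by C function name. The method name combine and the output fields are hypothetical.

```ruby
# Merge SWIG and Doxygen function lists by name. Functions that are only
# known to Doxygen are kept and flagged as unmapped; functions that SWIG
# maps but Doxygen does not document trigger a warning.
def combine(swig_funcs, doxy_funcs, warn_io = $stderr)
  names = (swig_funcs.keys + doxy_funcs.keys).uniq.sort
  names.map do |name|
    swig = swig_funcs[name]
    doxy = doxy_funcs[name]
    if swig && !doxy
      warn_io.puts "Warning: #{name} is mapped by SWIG but has no Doxygen entry"
    end
    { name: name, mapped: !swig.nil?, swig: swig, doxy: doxy }
  end
end

swig_funcs = { 'cel_mm' => { 'type' => 'double' }, 'cel_num_cols' => { 'type' => 'unsigned long' } }
doxy_funcs = { 'cel_num_cols' => { 'line' => 80 }, 'cel_pm' => { 'line' => 120 } }

combine(swig_funcs, doxy_funcs).each do |f|
  puts "#{f[:name]}#{f[:mapped] ? '' : ' (not mapped)'}"
end
```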

Listing functions

There are multiple ways of listing mapped functions. First would be ordered by module:functionname. The other order would be based on module:sourcefilename:line - assuming the way functions are ordered in the source are meaningful (again, some authors are smart that way). For HTML output I can make both available. For Perl POD files - it may be either.
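Both orderings are one sort_by away once every function carries its module, source file and line. In the sketch below FuncEntry is a made-up record type, and apart from cel_num_cols at line 80 the line numbers are invented for the example.

```ruby
FuncEntry = Struct.new(:mod, :name, :file, :line)

FUNCS = [
  FuncEntry.new('affyio', 'open_celfile', 'biolib_affyio.c', 40),
  FuncEntry.new('affyio', 'cel_num_cols', 'biolib_affyio.c', 80),
  FuncEntry.new('affyio', 'cel_mm',       'biolib_affyio.c', 120)
].freeze

# module:functionname order
def by_name(funcs)
  funcs.sort_by { |f| [f.mod, f.name] }
end

# module:sourcefilename:line order
def by_source(funcs)
  funcs.sort_by { |f| [f.mod, f.file, f.line] }
end

puts by_name(FUNCS).map(&:name).join(', ')
puts by_source(FUNCS).map(&:name).join(', ')
```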

Perl POD files

Let's start with POD files - as these are the first 'deliverable' for a BioLib release 1.0.

I am going to take hints from Jonathan Leto's Math::GSL Perl module. He has generated POD files that look like:

%perlcode %{
@EXPORT_OK = qw/
               gsl_monte_miser_integrate 
               gsl_monte_miser_alloc 
               gsl_monte_miser_init 
             /;
%EXPORT_TAGS = ( all => [ @EXPORT_OK ] );

__END__
=head1 NAME

Math::GSL::Monte - Routines for multidimensional Monte Carlo integration

=head1 SYNOPSIS

This module is not yet implemented. Patches Welcome!

    use Math::GSL::Monte qw /:all/;

=head1 DESCRIPTION

Here is a list of all the functions in this module :

=over

=item * C<gsl_monte_miser_integrate >

=item * C<gsl_monte_miser_alloc >

=item * C<gsl_monte_miser_init >

(...)

=head1 AUTHORS

(authorlist)

=head1 COPYRIGHT AND LICENSE

Copyright (C) (...)

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

=cut
%}

which is a really minimalistic POD. I'll want to add more information. Seems that Perl has the philosophy of putting documentation outside source files (i.e. POD files). The best POD example I have on this laptop (now offline in India) is in /usr/lib/perl5/XML/LibXML, so I'll use that as a template.

Swig2doc configuration files

We want to keep the generator as simple as possible. Repeated code, like a copyright statement, should be defined outside the code base. We will create standard configuration in the source tree in ./etc/default.yaml, which is available through the DefaultConfig class.

Every module can have its own configuration parameters. For example it is possible to set a copyright message for the swig2doc POD generated files. Other options may include web URLs to point to, etc. See ModuleConfig for the complete set of options. The information comes from a YAML file sitting in the Biolib source tree - with the module. Perhaps in ./src/clib/module/doc/swig2doc.yaml, or maybe better, ./src/mappings/swig/doc/affyio.yaml. An example for testing is in ./test/data/affyio.yaml. This config file can also direct to other files for fetching information relative to the source tree, including regular expressions (this to prevent having several versions defined in separate places). For example:

# Extra documentation for Affyio module
:global:
  :modulename: affyio
  :paths:
    :csrc: ./src/clibs/$modulename
    :contrib: ./contrib/$modulename
  :version:
    :file: $contrib/DESCRIPTION
    :regex: 'Version: (\w+)'
  :author:
    :file: $contrib/DESCRIPTION
    :regex: 'Author: '
  :license:
    :type: LGPL2
    :descr: "LGPL - GNU LESSER GENERAL PUBLIC LICENSE version 2.1"
    :file: $contrib/LICENSE
  :description:
    :brief: Access to Affymetrix microarray .CEL and .CDF files
    :detailed: "Fetch PM (perfect match) and MM (mismatch) signal for individual
      probes, and combine probes into probesets. This code is based on the original
      Affyio R package by Ben Bolstad"
:mapping:
  :author: Mapping author
  :methods:
    :file: $csrc/doc/affyio_descr.txt
  :doctest:
    :file: $csrc/doc/affyio_doctest.txt
:documentation:
  :author: Docs author
:perl:
  :license: Perl License

etc. etc. The only problem with YAML is that it is a data format. The 'business logic' that ties elements together is not necessarily obvious from the layout of the data file, though the naming and ordering of the data structure do help to glean meaning. The named fields above (tuples), and their nested structure can help in interpretation - and therefore the YAML file ought to be relatively easy to write and maintain. When creating these formats it is a good idea to keep the business logic in the actual parser as light (and logical) as possible.

In the above file you can see some references to file locations. These are locations that are valid for all mappings - i.e. the definition is shared between Perl, Python, Ruby etc. mappings. This implies a simplification for the command line, as all Doxy defs are shared too and can be added to YAML:

:doxygen:
  - test/data/doxygen/xml/affyio/biolib__affyio_8c.xml 
  - test/data/doxygen/xml/affyio/biolib__affyio_8h.xml

or

:doxygen:
  - test/data/doxygen/xml/affyio/*[ch].xml 

so you end up with the YAML file and the SWIG definition on the command line.
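Reading such a file and resolving the $name references could be done along these lines. This is a sketch under assumptions: the helper names load_config, expand and resolve_paths are invented, the config is a cut-down version of the example above with the regex quoted, and paths are assumed to refer only to names defined earlier in the file.

```ruby
require 'yaml'

CONFIG = <<~'YAML'
  :global:
    :modulename: affyio
    :paths:
      :csrc: ./src/clibs/$modulename
      :contrib: ./contrib/$modulename
    :version:
      :file: $contrib/DESCRIPTION
      :regex: 'Version: (\w+)'
YAML

# The keys are Ruby symbols, which safe_load only accepts when permitted.
def load_config(text)
  YAML.safe_load(text, permitted_classes: [Symbol])
end

# Replace $name by its value from vars; unknown names are left untouched.
def expand(value, vars)
  value.gsub(/\$(\w+)/) do
    vars.fetch(Regexp.last_match(1).to_sym, Regexp.last_match(0))
  end
end

# Build the variable table: the module name first, then each path in order.
def resolve_paths(global)
  vars = { modulename: global[:modulename] }
  global[:paths].each { |key, path| vars[key] = expand(path, vars) }
  vars
end

global = load_config(CONFIG)[:global]
vars = resolve_paths(global)
puts expand(global[:version][:file], vars)
puts Regexp.new(global[:version][:regex]).inspect
```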

It dawns on me a Rake file would be a possibility for handling configuration too. Especially the functions for wild card expansion can come in handy. Rake functions are part of standard Ruby now, so I can use those without penalty.

Transforming Doxy XML

The Doxygen information is in XML. To transform this to text I basically remove all tags, but it still looks a bit ugly. Meanwhile I have realized the XML is quite rich in information, so the natural thing to do is use some XSLT transformation to, for example, HTML. In the past I have used different engines - and I like xalan for its speed - though it lacks fancier XSLT2. Here it should be fine. However it means calling xalan as a binary program for every documentation snippet. Maybe we can bind to it later with SWIG. For now I'll just try the low level conversions. Doxy data looks like this:

<para>Open a cdf file using the Affyio library and return a pointer
to a <ref refid="structCDFOBJECT"
kindref="compound">CDFOBJECT</ref>, which maintains state keeping
track of the opened CDF data. Unlike the Affyio internal
representation the Biolib affyio CDF methods represent a unified
layout for Affymetrix chips. Basically every probeset can return the
name, probe values (PM, MM) and QC. Loading all other information is
trivial, as Affyio makes it available - but not implemented here, at
this point.</para><para><simplesect kind="note"><para>FIXME: XDA
format not tested</para></simplesect>
<parameterlist kind="param"><parameteritem>
<parameternamelist>
<parametername>cdffilename</parametername>
</parameternamelist>
<parameterdescription>
<para>points to a valid Affy CDF file (textual or XDA
binary)</para></parameterdescription> </parameteritem>
</parameterlist>
<simplesect kind="return"><para>pointer to <ref
refid="structCDFOBJECT"
kindref="compound">CDFOBJECT</ref></para></simplesect> <simplesect
kind="see"><para>close_cdffile </para></simplesect>
</para>        

You can see it is quite rich in describing parameters, return value etc. For conversions I added ./etc/xsl/doxy2html.xsl. You can do really fancy stuff with XSLT - it almost pretends it is a programming language, though a very ugly one - as it is basically XML too. I sometimes wonder who is crazy enough to come up with this stuff. Anyway, the bottom line is that if you need to transform XML it is, probably, hard to beat.
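For reference, the plain tag-stripping approach mentioned earlier can be sketched in a few lines. The method name doxy_to_text is made up; the sketch treats each closing </para> as a paragraph break, drops every other tag and collapses whitespace, which also shows how much structure gets lost that way.

```ruby
# Crude Doxygen-XML-to-text conversion by removing tags.
def doxy_to_text(xml)
  text = xml.gsub(%r{</para>}, "\n\n")  # paragraph ends become blank lines
  text = text.gsub(/<[^>]+>/, '')       # drop every remaining tag
  text = text.gsub('&lt;', '<').gsub('&gt;', '>').gsub('&amp;', '&')
  paragraphs = text.split(/\n{2,}/).map { |p| p.split.join(' ') }
  paragraphs.reject(&:empty?).join("\n\n")
end

SNIPPET = <<~XML
  <para>Open a cdf file using the Affyio library and return a pointer
  to a <ref refid="structCDFOBJECT"
  kindref="compound">CDFOBJECT</ref>.</para><para>FIXME: XDA
  format not tested</para>
XML

puts doxy_to_text(SNIPPET)
```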

Here we are going to transform XML to XHTML with CSS style sheet controlled layout. For example, to transform <para>...</para> into <p>...</p> we simply say

 <xsl:template match="para">
    <p><xsl:apply-templates /></p>
 </xsl:template>

Compared to using regular expressions - which is also quite feasible - XSLT has the advantage of allowing more 'context' - so you can treat the same 'para' differently when in a different setting. With regular expressions this can rapidly become complex. However, the most important reason to use XSLT is that it is less 'brittle'. That is, if we get offered an unanticipated input it is more able to deal with it gracefully. Certainly when transforming to other XML-like data formats, like HTML.

Anyway, I am creating a unit test for this. Nothing beats unit tests for this type of setup. I have added open_cdf.xml in ./test/data/doxygen/xml/output and unit test in ./test/unit/test_doxy_transform.rb.

Parsing concluded

OK, at this point we can parse the SWIG and Doxygen XML formats, and transform them into any type of output format. We also have a global and module configuration in YAML. This allows us to start writing POD for Perl.

I have plans to inject information from other sources, like doctest code, and to add generic example code. But that should be trivial to implement, at least compared to what we have achieved so far. This is going to be fun, and it is highly relevant also to other projects that use some form of SWIG.

Writing POD

I downloaded Larry Wall's description of Perl POD. Looks straightforward to generate. The two questions I have are what to do with function parameters - and what to do with code examples. It appears to be a roundabout way of generating documentation - we may generate much nicer HTML. Still, POD generated HTML is consistent with CPAN module online documentation - and I would like to fit it in.

With every output format there are some recurrent features. We want to create a template and fill in the blanks. The global descriptions go at the top, followed by each mapped (and unmapped) function, giving descriptions, parameters, return value and, possibly, examples. Also things like copyright information and web links need to be injected into the template.

Ruby can handle basic templating using Perl-like here-documents with string interpolation:

author='Pjotr Prins'
buf = <<TEMPLATE
  Copyright #{author}
TEMPLATE

This is quite nice. However, for looping inside a template something more powerful is needed. The erb module, part of standard Ruby, does exactly that. With erb we can do:

% functions.each do |func|
    =item <%= func.name %>
    
    <%= func.description %>
% end

which allows us to write a single 'master' template file. Rails, the web application framework, uses erb to create (HTML) views.
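As a minimal, runnable sketch of such a master template: the Func struct, the render_pod method and the exact POD layout are assumptions made for this example, not the actual swig2doc code. The description text is taken from the GSL entry quoted further down. Note that the "% ruby" line syntax only works when erb is given trim_mode '%'.

```ruby
require 'erb'

# Hypothetical stand-in for a parsed function record.
Func = Struct.new(:name, :description)

# Single-quoted heredoc, so Ruby itself does not interpolate anything.
POD_TEMPLATE = <<~'TEMPLATE'
  =head1 FUNCTIONS

  =over

  % functions.each do |func|
  =item <%= func.name %>

  <%= func.description %>

  % end
  =back
TEMPLATE

# Fill in the template; 'functions' becomes a local variable inside it.
def render_pod(functions)
  ERB.new(POD_TEMPLATE, trim_mode: '%').result_with_hash(functions: functions)
end

description = 'The standard deviation is defined as the square root of the variance.'
functions = [
  Func.new('gsl_stats_sd', description),
  Func.new('gsl_stats_sd_m', description)
]
puts render_pod(functions)
```

The blank lines around =item matter, since POD commands are paragraph based.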

Parsing TexInfo for GSL

Jonathan Leto, whom I mentioned earlier, expressed interest in swig2doc for his Perl mapping of the GNU Scientific Library (Math::GSL). I found that the thousands of GSL functions are documented in separate texi files, as part of the GSL source tree (e.g. doc/statistics.texi).

A typical texi entry should be easy to parse:

@deftypefun double gsl_stats_sd (const double @var{data}[], size_t @var{stride},
  size_t @var{n})
@deftypefunx double gsl_stats_sd_m (const double @var{data}[], 
  size_t @var{stride}, size_t @var{n}, double @var{mean})
The standard deviation is defined as the square root of the variance.
These functions return the square root of the corresponding variance
functions above.
@end deftypefun
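A first attempt at parsing such an entry directly could look like the sketch below. It is an assumption-laden illustration, not the swig2doc parser: the TexiFunc struct and parse_deftypefun method are made-up names, and it only handles the simple case shown above (no braces around multi-word return types, no parentheses inside the argument list, no nested texinfo commands in the description).

```ruby
# Hypothetical record for one documented C function.
TexiFunc = Struct.new(:return_type, :name, :params, :description)

# One whole @deftypefun ... @end deftypefun block.
BLOCK     = /^@deftypefun\s.*?^@end deftypefun/m
# One signature: @deftypefun or @deftypefunx, up to the closing parenthesis.
SIGNATURE = /^@deftypefunx?\s+([^()]+?)\s*\(([^)]*)\)[ \t]*\n?/

def parse_deftypefun(texi)
  texi.scan(BLOCK).flat_map do |block|
    # The description is what is left once signatures and @end are removed.
    description = block.gsub(SIGNATURE, '')
                       .sub(/^@end deftypefun\s*\z/, '')
                       .split.join(' ')
    block.scan(SIGNATURE).map do |head, args|
      *type, name = head.split
      params = args.gsub(/@var\{(\w+)\}/, '\1')   # strip the @var{...} markup
                   .split(',')
                   .map { |arg| arg.split.join(' ') }
      TexiFunc.new(type.join(' '), name, params, description)
    end
  end
end

texi = <<~'TEXI'
  @deftypefun double gsl_stats_sd (const double @var{data}[], size_t @var{stride},
    size_t @var{n})
  @deftypefunx double gsl_stats_sd_m (const double @var{data}[],
    size_t @var{stride}, size_t @var{n}, double @var{mean})
  The standard deviation is defined as the square root of the variance.
  These functions return the square root of the corresponding variance
  functions above.
  @end deftypefun
TEXI

parse_deftypefun(texi).each do |f|
  puts "#{f.return_type} #{f.name}(#{f.params.join(', ')})"
end
```

Functions sharing one block through @deftypefunx each get the same description, which matches how the texi source groups them.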

The generated documentation of the GSL itself does not look very sexy - see http://www.gnu.org/software/gsl/manual/html_node/Statistics.html

I think we can do better. Jonathan's generated POD files include documentation, see http://search.cpan.org/dist/Math-GSL/lib/Math/GSL/Statistics.pm

You can see the content of the documentation matches. So somehow the texi got migrated to POD (Jonathan writes that it was mostly a manual job, with some simple scripting).

I am going to support Math::GSL as a direct test case for swig2doc. One important reason is that Jonathan will criticize my work. It is important to have other people looking over your shoulder. Math::GSL is very large - lots of functions in different modules. Good test case for swig2doc, and I can compare the results with the 'manual' conversion.

Outside of parsing texi files, the main challenge will be to support Math::GSL's modules. Rather than one large SWIG file, we have to parse a bunch of small ones and combine them into a consolidated package.

Anyway, we have another flight ahead - New Delhi to Amsterdam, so I can wrap up the YAML parser, and start on the texinfo parser. I just noticed texinfo can generate XML too. The above definition becomes:

    <definition>
      <definitionterm><indexterm index="fn">gsl_stats_sd</indexterm>
        <defcategory>Function</defcategory>
        <deftype>double</deftype>
        <deffunction>gsl_stats_sd</deffunction>
        <defdelimiter>(</defdelimiter>
        <defparamtype>const</defparamtype>
        <defparam>double</defparam>
        <defparam>data</defparam>
        <defdelimiter>[</defdelimiter>
        <defdelimiter>]</defdelimiter>
        <defdelimiter>,</defdelimiter>
        <defparamtype>size_t</defparamtype>
        <defparam>stride</defparam>
        <defdelimiter>,</defdelimiter>
        <defparamtype>size_t</defparamtype>
        <defparam>n</defparam>
        <defdelimiter>)</defdelimiter>
      </definitionterm>
      <definitionitem>
        <para>The standard deviation is defined as the square root of the variance. These functions return the square root of the corresponding variance functions above.</para>
      </definitionitem>
    </definition>
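Reading a definition like the one above takes only a few lines with REXML. This is a sketch under assumptions: parse_definition and the returned hash keys are made-up names, and the sample is a shortened copy of the XML above. Note that the parameter markup is not very clean (type and name both show up as defparam), so the sketch simply collects the tokens between the delimiters.

```ruby
require 'rexml/document'

def parse_definition(xml)
  doc  = REXML::Document.new(xml)
  term = doc.elements['//definitionterm']

  params = [[]]                      # one token list per parameter
  term.each_element do |el|
    text = el.text.to_s.strip
    case el.name
    when 'defparamtype', 'defparam'
      params.last << text
    when 'defdelimiter'
      case text
      when ','      then params << []   # start the next parameter
      when '[', ']' then params.last << (params.last.pop.to_s + text)
      end                               # '(' and ')' are ignored
    end
  end

  {
    name:        term.elements['deffunction'].text,
    return_type: term.elements['deftype'].text,
    params:      params.reject(&:empty?).map { |tokens| tokens.join(' ') },
    description: doc.elements['//definitionitem/para'].text.strip
  }
end

definition_xml = <<~'DEFINITION'
  <definition>
    <definitionterm><indexterm index="fn">gsl_stats_sd</indexterm>
      <defcategory>Function</defcategory>
      <deftype>double</deftype>
      <deffunction>gsl_stats_sd</deffunction>
      <defdelimiter>(</defdelimiter>
      <defparamtype>const</defparamtype>
      <defparam>double</defparam>
      <defparam>data</defparam>
      <defdelimiter>[</defdelimiter>
      <defdelimiter>]</defdelimiter>
      <defdelimiter>,</defdelimiter>
      <defparamtype>size_t</defparamtype>
      <defparam>stride</defparam>
      <defdelimiter>)</defdelimiter>
    </definitionterm>
    <definitionitem>
      <para>The standard deviation is defined as the square root of the variance.</para>
    </definitionitem>
  </definition>
DEFINITION

p parse_definition(definition_xml)
```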

The 'logical' structure of the documentation is also represented, so we can use that if we want. Parsing XML is, probably, the easier option as we don't have to deal with texinfo logic, like

  @tex
  \beforedisplay
  $$
  {\Hat\mu} = {1 \over N} \sum x_i
  $$
  \afterdisplay
  @end tex
  @ifinfo
  @example
  \Hat\mu = (1/N) \sum x_i
  @end example
  @end ifinfo

Oh, no! I just found out that the makeinfo XML generator simply ignores this stretch, so no formula is displayed. It is not even a bug, as the input states that these blocks are for either TeX or info output. This also explains why we see these ugly definitions in the HTML on the GSL website. Meanwhile, when generating HTML using makeinfo, these formulas also get ignored (unlike the HTML on the GSL website). I think one has to set certain parameters to override this behaviour. Even for the GSL website, it would be nice to show proper formulas in some graphical format using a LaTeX to PNG converter.

I'll assume I can override this for XML too, though I have to find out later (I can't find the relevant information in the GSL make files). Ah, I found the switch, use

makeinfo --xml --ifinfo doc/statistics.texi 

and it generates

      <definitionitem>
        <para>This function returns the arithmetic mean of data, a
        dataset of length n with stride stride.  The
        arithmetic mean, or sample mean, is denoted by
        <math>\Hat\mu</math> and defined as ,</para> <example
        xml:space="preserve">\Hat\mu = (1/N) \sum x_i</example> <para
        role="continues">where <math>x_i</math> are the elements of the
        dataset data.  For samples drawn from a gaussian
        distribution the variance of <math>\Hat\mu</math> is <math>\sigma^2
        / N</math>.</para>
      </definitionitem>

which, as a matter of fact, is what we want.