Category Archives: Report Designer & Engine

MySQL with Pentaho Report Designer for those who dont read manuals

Several people in the forums seem to have severe problems setting up MySQL or understanding how MySQL works. They also seem to have severe problems reading the database manuals pointed out in the forum, insisting the problem lays within the Report Designer. So I am going to write up some simple explanations on how to set up MySQL, how to configure the database users for access over the network and how to work around all the little and large weird quirks of that database. And then hopefully I just can point these users to here and never have to deal with it again.*


I do assume that you have MySQL installed and up and running. If not, then go and get PostgreSQL, a database I personally trust.

Connecting to a MySQL database – the basics

All the Pentaho tools and thus the Pentaho Report Designer are Java Applications and therefore use JDBC to access the databases. For other databases, the general rule is: Use the driver that ships with your database and keep the version numbers in sync.

To work with a MySQL database, you will need a suitable MySQL driver. In my experience MySQL drivers have not been the most stable nor most reliable drivers around. At Pentaho, we recommend to use the MySQL JDBC Driver version 5.1.10 as default driver, as this seems to be the most stable driver. Other versions may produce invalid results, and if you want to use any other version, make sure you test it carefully.

If you use the Pentaho Report Designer to create a connection, you will be asked for several properties. To connect to the database directly, select the connection mode “native”. Then enter your hostname or IP-address, port (default port number for MySQL is 3306) and username and password.

For generic JDBC database connections, you need the JDBC-URL, your username and password and optionally the connection properties:

Setting Value

Driver

The driver class to load. The name of the class that implements java.sql.Driver in MySQL Connector/J is com.mysql.jdbc.Driver.
JDBC-URL

The JDBC URL for the MySQL JDBC Driver uses the following format. Items in square brackets ([, ]) are optional.

jdbc:mysql://[host:port]/[database][?propertyName1][=propertyValue1][&propertyName2][=propertyValue2]...

If the host name is not specified, it defaults to 127.0.0.1. If the port is not specified, it defaults to 3306, the default port number for MySQL servers.


Username

Your user name as given to you by your database administrator. See below how MySQL translates that information and matches this against the internal user database.

Password

Your password for the connection.

To get the same connection as shown in the dialog screenshot above, you would have to specify the JDBC-URL:

jdbc:mysql://localhost:3306/sampledata

with username “pentaho_user” and “password” as password.

In most cases your database will be running on a different host than the computer you are editing reports or where the BI-Server is installed. When the database is running on a different computer, you will be able to distribute the work-load of answering requests and generating reports a lot better. If both the reporting server and the database share resources like the CPU, memory or disks, there is always the danger that both systems compete for resources and slow each other down unnecessarily.

User management and access control in MySQL

When you connect to a MySQL server, your identity is determined by the host from which you connect and the user name you specify. A MySQL server side user name is the combination of the user name you specified on the connection (“pentaho_user” in the example above) and the host from where you try to connect to the database server.

MySQL users are defined by their username and hostname and generally written in the format ‘username’@’hostname’.

In the example above we connect to the special host address ‘localhost’ (or by its IP address: 127.0.0.1). Only applications running on the same computer can make connections to this interface. By default a server will listening to these connections will see them as originating from “localhost” itself.

So in the above example, the full identity would be ‘pentaho_user’@’localhost’.

MySQL uses a two-stage authentication system. In the first stage it checks, whether you are allowed to connect to the database and validates your username and password. In the second stage MySQL checks whether you have the access rights to see the contents of the tables stored in the database or whether you are allowed to update or delete data.

When we try to establish a connection to the MySQL server, the server will check its local user database for an entry matching the username and password. If there is no user defined for the given hostname, you will see the error message

Host [...] is not allowed to connect to this MySQL server

The error message

Access denied for user 'pentaho_user'@'localhost' (using password: YES)

indicates that either the password is invalid or that this user has not been defined for connections from this host.

If you see the error message

ERROR 1044 (42000): Access denied for user 'pentaho_user'@'localhost' to database 'sampledata'

then you need to grant additional permissions to the user to allow access to the database.

Listing the defined users

MySQL stores all administrative information in a special system database called ‘mysql’. You can connect to this database like any other defined database. To see which users are defined, invoke the following SQL statement via the MySQL console:

SELECT User, Host FROM mysql.user;

MySQL will check your identity (username and host) against the user table. Both the user and the host must exist. If the table contains a host name called “%” this particular entry matches all hosts. If the table contains a empty user name, it matches any user name.

To allow the user ‘pentaho_user’ to connect from any localhost, you would need to have a matching entry for it in the table.

+------------------+-----------+
| User             | Host      |
+------------------+-----------+
| pentaho_user     | localhost | 
+------------------+-----------+

To allow connections for ‘pentaho_user’ from any host, a entry with ‘%’ as hostname must exist.

+------------------+-----------+
| User             | Host      |
+------------------+-----------+
| pentaho_user     | %         | 
+------------------+-----------+

Big fat warning: Even though both entries have the same user name (pentaho_user), they define two entirely different and unrelated identities for the database. It is totally possible and legal for ‘pentaho_user’ from localhost to have access to different databases than the same user for the other hosts. For MySQL these are two different entities.

Connections from Pentaho Report Designer via Connections from PhpMyAdmin or other Server Side Software

The MySQL identity system of treating users from different hosts as completely different entities can cause a lot of headaches, especially if you are not familiar with the underlying workings of the internet.

If your MySQL server is set up to only allow connections from specific hosts, you can run into problems where you can access the database from your web-server but not from within the report designer or vice versa.

Lets look at the most common usage scenarios:

Pentaho Reporting, the Web-Server and MySQL run on the same machine

This is a typical development scenario, where the developer installed all software on his own box for convenience.

+---------------+
| PRD-Host      |
+---------------+
| Web-Server    |
+---------------+
| My-SQL Server |
+---------------+

MySQL must be configured to allow access from ‘localhost’ and all connections in the report must use ‘localhost’ as the hostname for the MySQL connections.

A identity for ‘pentaho_user’@’localhost’ must exist.

+------------------+-----------+
| User             | Host      |
+------------------+-----------+
| pentaho_user     | localhost | 
+------------------+-----------+

Pentaho Reporting, the Web-Server and MySQL all use different machines

This is the typical production scenario. Both the database server and the web-server run on different machines for maximum performance. Both machines are usually installed in a server room or data centre.

+----------+          +------------+         +---------------+
| PRD-Host |  --+-->  | Web-Server |  ---->  | My-SQL Server |
+----------+    |     +------------+         +---------------+
                |                                 /|\ 
                +----------------------------------+

Both the developer sitting on the Pentaho Report Designer and the Web-Server need access to the database. While the developer creates the reports he needs access to the database (or a test database with a similar structure than the production database). When the report runs, the web-server must connect to the database as well to execute the reports.

Both the developer PC and the web-server must have entries for the user ‘pentaho_user’ in the user table.

+------------------+-----------------------------+
| User             | Host                        |
+------------------+-----------------------------+
| pentaho_user     | IP or hostname of webserver | 
+------------------+-----------------------------+
| pentaho_user     | IP or hostname of PRD-host  | 
+------------------+-----------------------------+

Alternatively, you could allow access from all hosts. Note that if your database server is not shielded by a firewall, then you may easily open your database to the world. Any self-respecting database administrator may try to strangulate you for suggesting such an configuration.

+------------------+-----------+
| User             | Host      |
+------------------+-----------+
| pentaho_user     | %         | 
+------------------+-----------+

The Web-Server and MySQL share the same machine

You find this set up occasionally in smaller hosting companies or from people who own just a single server. The database and web-server share the same machine, which is hosted in the data centre or server room. The developer runs the report designer on his own desktop machine.

+----------+          +---------------+ 
| PRD-Host |  --+-->  | Web-Server    |
+----------+    |     +---------------+
                +-->  | My-SQL server |
                      +---------------+

The web-server can access the MySQL server via ‘localhost’, while the developer needs to access the MySQL server from his own host outside of the server room.

+------------------+-----------------------------+
| User             | Host                        |
+------------------+-----------------------------+
| pentaho_user     | localhost                   | 
+------------------+-----------------------------+
| pentaho_user     | IP or hostname of PRD-host  | 
+------------------+-----------------------------+

So if you ever wondered why you can access your MySQL server from PhpMyAdmin or similar tools, but not from the Pentaho Report Designer, you should now be able to see the answer and to fix the problem.

You can find more information on this topic in the chapter about MySQL user management in the MySQL manual.

Fine grained access permissions

Even if you are able to connect to the database, you may not be able to query the database. The second stage of MySQL’s security system checks your identity against a set of defined permissions. These permissions control whether you are able to query or modify the data in the database or whether you are allowed to change the table structures or perform other administrative tasks.

Pentaho Reporting only queries the database. The Pentaho reporting tool does not modify data and does not perform any administrative tasks. All MySQL users need to have the “SELECT” permission on all objects you want to access from the reports.

To grant a specific user the permission to select data, use the following statement from the MySQL console.

GRANT SELECT ON database.* TO 'pentaho_user'@'localhost';

MySQL connection properties you should know

To make the JDBC driver work properly in MySQL, you will have to specify some additional connection properties. The driver’s default settings can cause severe problems for the reporting system. These defaults have been defined for historic releases of MySQL and are harmful these days.

Property Definition Comment
useCompression Use zlib compression when communicating with the server (true/false)? Defaults to ‘false’. Useful if you access a remote server through a low bandwidth connection. Disable for production systems.
passwordCharacterEncoding What character encoding is used for passwords? Leaving this set to the default value (null), uses the platform character set, which works for ISO8859_1 (i.e. “latin1”) passwords. For passwords in other character encodings, the encoding will have to be specified with this property, as it’s not possible for the driver to auto-detect this. If you are not using an all ASCII password and work outside the area of Western European languages, knowing about this is a life-saver.
characterEncoding If ‘useUnicode’ is set to true (which is the default), what Java character encoding should the driver use when dealing with strings? (defaults is to ‘autodetect’). If the encoding cannot be determined, then an exception will be raised. Sometimes MySQL server do detect the character set correctly and therefore return garbage when querying tables with non-ascii data. In this case this property can be used to override the autodetection of character sets.
functionsNeverReturnBlobs Should the driver always treat data from functions returning BLOBs as Strings – specifically to work around dubious metadata returned by the server for GROUP BY clauses? The MySQL JDBC-Driver returns string results of functions in SQL statements as byte-arrays. If you use any functions in SELECT statements, set this property to “true” or you will see garbage in the results.
emptyStringsConvertToZero Should the driver allow conversions from empty string fields to numeric values of ‘0’? MySQL treats empty strings or values as zero for numeric columns by default. If you need to see the values as they are stored in the database without helpful corrections from the JDBC driver, then set this flag to false.
zeroDateTimeBehavior What should happen when the driver encounters DATETIME values that are composed entirely of zeros (used by MySQL to represent invalid dates)? Valid values are “exception”, “round” and “convertToNull”. Use this if your select statement fails with an exception and you have a zero-date in it. Set it to ‘convertToNull’ to come to a more reasonable result.
serverTimezone Override detection/mapping of timezone. Used when timezone from server doesn’t map to Java timezone Use this if your dates returned by the database seem to be offset by some hours. Also see ‘useGmtMillisForDatetimes’, ‘useJDBCCompliantTimezoneShift’, ‘useLegacyDatetimeCode’ and ‘useTimezone’ for more time zone related functions.
holdResultsOpenOverStatementClose Should the driver leave the result sets open on Statement.close() (enabling violates JDBC specification) This can be a very helpful property for reports with subreports. It allows the Pentaho Reporting engine to process several server-side resultsets in parallel. Without this setting, the reporting engine will fully buffer result-sets in memory instead, which increases the memory consumption of the reporting system.

Further readings

If you have made it this far, then let me tell you one big hint: MySQL comes with a nice administration tool called My SQL Workbench which allows you to administrate the server without having to go through the command line.

* Yeah. At least I can fake hope, even when I don’t feel it.

An easy way of Printing Aggregations

Setting up a large number of aggregated values like sums, counts or averages can be a dreadful experience in the Pentaho Report Designer.

Calculations are performed by functions and expressions (let me call it just functions). These functions are added in the data tab of the report designer. Each function gets a name under which its result can be referenced. And then it needs to be configured properly. Usually, you have to set at least the field it should work on and the group on which it should reset.

Doing it for one field is not exciting, but ok. Doing it for 10 fields is no longer fun.

If you just want to print the result of that calculation, without ever using it as part of another computation, then here’s an easier way of aggregating data:

(1) Drag a number field onto your report and assign a field name to it (attribute common::fieldname).
(2) Set the aggregation you want to use via the attribute “wizard::aggregation-type”.
(3) Optional: Define the group on which you want to reset. If not defined, the current group is used.

This way of defining aggregations is used by the Report Design Wizard. To see a working example – just generate a report with it and add a summary on the fields.

Element layouting strategies in Pentaho Reporting

Back in February on the London Pentaho User-Group meeting I promised to make the contents of that presentation available as blog entries. This is the first of these three articles.

The Pentaho Reporting Engine supports several lay-outing strategies to create sophisticated reports easier and to simply the report creation process.

You can change the layout strategy of a band using the “layout” style-key on that band.

Canvas-Layout

A canvas layout positions elements freely on the area of the parent band. Elements have no relationship to each other during the lay-outing. Therefore if an element expands its size, it does not push elements out of the way. Expanding elements always increase the size of the parent band.

Elements inside a Canvas level band are positioned using the position::x and position::y style-keys.

The Canvas layout is the default layout for all new reports and bands.

If you are familiar with HTML and CSS, then think of this layout as a collection of absolutely positioned elements.

Block-Layout

Elements in a block layout band are laid out one after each other vertically. Block-level elements span the full width of the parent band. If a element expands it pushes all other elements down so that no element overlaps the other elements.

Master- and SubReport elements as well as Groups are always laid as block elements.

If you are familiar with HTML and CSS, think of this layout as a set of

or

elements.

Inline-Layout

In an inline formatting context, boxes are laid out horizontally, one after the other, beginning at the top of a containing block. Horizontal margins, borders, and padding are respected between these boxes. The boxes may be aligned vertically in different ways: their bottoms or tops may be aligned, or the baselines of text within them may be aligned. The rectangular area that contains the boxes that form a line is called a line box.

A inline element that is placed in a non-inline layout band creates an artificial paragraph to wrap around this element during the lay-outing process. The most common use of this layout strategy is to produce rich-text content from several independent report elements.

If you are familiar with HTML and CSS, think of this layout as a set of elements.

Row-Layout

The row layout positions elements one after each other on the horizontal axis. All elements are printed in a single row, expanding their height as needed. If all elements should expand to the height of tallest element, set the min-height to “100%”.

Row-layout is a natural match for list reports, where multiple columns of data should be printed per row of data. When a element expands its width all other elements get pushed to the right.

If you are familiar with HTML and CSS, think of this layout as a HTML table with a single row of data.

When you use a row-layout for your list-reports, you will no longer need to layout elements manually. To create spacing between elements use either padding on your elements or place a empty band as padding element into the row layout band. The report design wizard makes use of the row-layout to position elements in the details band and details-header bands.

Combining layout strategies for better effects

You can combine several layout strategies in one root element by adding extra bands to your report. All elements placed into these bands will be governed by the band’s layout setting.

How to avoid that dynamic-height elements overlap other elements

You can use this to avoid overlapping elements in your report whenever you use “dynamic-height” elements and to create proper table-rows so that elements of the second row get pushed down by the expanding content of the first row.

Use the following steps to create a two-row details band.

  1. make your details band a block-layout by setting the layout-style of this band to “block”.
  2. Add two bands.
  3. Add the elements for the first row into the first band, and all elements for the second row into the second band.

That’s it. When your first row elements expand, your second row elements will be pushed down.

Download a sample report to see the row- and block-layouter in action.

ENV-Fields: Why, When and How to use them

The reporting-environment came to life as a simple way to inject runtime information from the platform into a report. Ideally, a report should not know about details from the server it runs on. Certain parts of the server’s configuration, like the server’s base-url, easily change when the report travels from development to the test and then to the production environment.

Earlier versions relied on magical string replacements during the parse process to inject that kind of information. Parse-time modifications are generally a bad thing. They are insecure,
as every String manipulation of XML files is. If you manage to get a valid XML-Fragment into your replacement string, you can do everything. There is no way a GUI tool could be written for it. The injection magically happens everywhere without giving the user a say on that matter. At design time, or outside of the legacy code of the Pentaho Platform’s JFreeReportComponent, this replacement does not happen. I don’t want to explain that mad system to anyone, and surely neither do you!

And finally: It really makes it hard to do caching inside the reporting engine. For the reporting engine, these replaced strings look exactly like static strings and all of the sudden the cache is poisoned with a invalid copy.

Moving that information from parse-time to runtime solves that problem for me.

The report-environment is a special interface that allows third party systems like the Pentaho BI-Server to inject configuration settings in a safe and pre-defined way. That configuration information is then available to all expressions and functions. It completely eliminates all magic from the report processing.

A report-environment entry is not a parameter. A parameter is a external value that has been provided by the user or that has been calculated from end-user or report-designer input. A
report-environment property comes solely from the local runtime environment.

Inside the reporting engine, a report-environment setting is a single string. The setting is either one of the well-known environment-keys provided by the runtime implementation, a session-entry or a user-defined environment setting.

Well-Known Keys

  • serverBaseURL:
    The server’s base URL without any servlet context. Use this one to link to other resources on the same server.

    http://127.0.0.1:8080/
    
  • pentahoBaseURL:
    The server’s base url including the local web-application context. Use this one to link to other Pentaho Services.

    http://127.0.0.1:8080/pentaho
    
  • solutionRoot:
    The solution-repository root directory.
  • hostColorPort:
    The host name and port, without the protocol prefix or any path information.

    localhost:8080
    
  • requestContextPath:
    The context path of the web-application

    /pentaho
    
  • username:
    The username of the currently logged in user.
  • roles:
    The set of roles the user is assigned to.

Session keys

A environment setting starting with the prefix “session:” will be treated as a local session lookup. A session lookup will only work inside the Pentaho BI-Server. The BI-Server will compute the session attribute name by removing the “session:” prefix and will lookup the value in the users HTTP-Session instance.

If the value found there is a string it will be returned as valid entry. If the value is a string array, it will be converted into a CSV text using the excel quoting syntax (double-quote-character
are quote characters, and text containing either the comma, a quote character or a line-break will be quoted.

The value Joe, the horse, says "Ho" will become "Joe, the horse, says ""Ho""".

You can use the CSVARRAY formula function to parse that text back into a String-array inside the reporting engine.

User-Defined Environment setting

If a key does not match either one of the well-known keys or if the engine cannot identify it as a session-entry, it will be treated as user-defined setting. The reporting engine will consult the global configuration and will try to find the value by appending the environment key to the configuration prefix “org.pentaho.reporting.engine.classic.core.environment.“.

Therefore a environment lookup for the key “Foo” will query the global configuration for a configuration entry called “org.pentaho.reporting.engine.classic.core.environment.Foo“.

How to use them

The report-environment can be accessed via the OpenFormula function
ENV“. Example:

=ENV("pentahoBaseURL")

The result of the lookup can be used in the same formula:

=ENV("pentahoBaseURL") & "/content/reporting/report.html?solution=steel-wheels&path=reports&name=BuyerReport.prpt"

You can also make the value accessible as a separate field. Place the ENV lookup into the post-processing formula of a hidden parameter or into the “formula” property of a OpenFormula-Expression and the value appears as field.

But there is a easier way to convert your environment lookups into fields!

Auto-mapping

The auto-mapping definitions allow the Pentaho Reporting Engine to automatically lookup a set of report-environment keys and to publish those values as “env::” fields. Which fields are auto-mapped is controlled by a set of configuration entries in the reporting engine’s global configuration.

A mapping rule follows the following syntax:

org.pentaho.reporting.engine.classic.core.env-mapping.=

If the environment string is a CSV string, you can tell the reporting engine about it and it will produce a field that contains a String-array.

org.pentaho.reporting.engine.classic.core.env-mapping.-array=

It is generally a good idea to keep your environment fields in a separate namespace by using the “env::” prefix for your fields. This way you will not run into conflicts with fields from your database or expressions later on. I also heavily recommend to try to keep the reference to the environment key in the name, this provides a sensible self-documenting entry.

Where to use Report-Environment fields

The predefined keys can be split into two groups. The first group, all path and URL related entries are used to calculate relative resource URLs, like logo images or HTML and JavaScript includes.

The second group ,”username” and “roles” is used to enforce security rules in datasources or on calculations.

The third use-case would be to provide system dependent parametrization, like printing static text. It is a easy way to include the department name or the name of the staging system (test, dev, production) in every report.

And last but not least: The environment is accessible to scripts and formulas, so it can be used as a system wide parametrization avenue.

Setting up session values

You can define the values for the session-keys of the report-environment inside the BI-Server with the help of a on-logon-XAction. Within that XAction, use a JavaScript-Rule to access the HttpSession object and to store the string-value on it.

Setting up global user-defined report-environment values

The global user-defined report-environment values are defined in the reporting engine’s global “classic-engine.properties” file. You will find that file in the Pentaho-Report-Designer’s “resources” directory. The BI-Server’s global reporting configuration file can be found in the WEB-INF/classses directory.

The report-environment is a simple and easy to setup way to make your reports more reusable. Whenever you find yourself changing reports manually  while transitioning them from development to testing or production, you now have the tools at hand to elimenate these manual steps.

Charting in Citrus – hardened and extended Chart-Expressions

As we are now reaching RC stage within the next week or so, I think its time to spread some light on the new stuff to come for our Charting.

The charting in Pentaho Reporting was always a very delicate matter. Charting was (and is) implemented via a dual system of Chart-Expressions and Data-Collector functions. The data-collectors are responsible to produce the datasets, while the chart-expressions consume the data-set and produce a JFreeChart object. As JFreeChart-objects are Drawable-implementations, the engine can render them as vector-image for high-quality output.

But until recently, there was a creepy pattern when using our chart-related expressions. Make an error in the data-collector configuration, and you are killed. Define an invalid value – killed. Null values – killed. A invalid property in the chart-expression? Kaboom. Killed big time. An invalid combination of otherwise valid properties? Bloody nightmare. You are dead. Over time, we caused more dead developers than the browser wars at the beginning of the century. Many of these bloody incidents would have been avoidable by just applying defensive programming patterns. The old code behaved generally very unfriendly and rough. Wild casts, no check for invalid values or null-pointers caused all kind of funny exceptions. When asked what is wrong with a certain report-definition, I usually fired up the debugger to understand why an – from the outside – perfectly sane chart-definition failed to produce a chart. (Yeah, I could have resorted to guessing, but you can’t come up with the illogical and arcane rules needed to understand what’s going on.)

As the Citrus release is our big-overhaul, and everything is under revision and no back-pocket of insanity is left untouched. It was not surprising to find charting on the list of things to work on.

Now when creating new charts in Citrus, the first thing you will notice is that the beast (yes, beast!) no longer bites. The majority of properties has been hardened against invalid or missing values, and once arcane “free-form-text” properties are now equipped with real property-editors to guide you towards the few acceptable inputs.

When you create a chart, you first have to set-up the data-collector. There are currently 5 types available:

  • a Categorical-DataSet-Collector for Bar-Charts,
  • a Pivoting Categorical DataSet Collector, which reads series data by rows, instead by column
  • a XY-DataSetCollector, for numeric data charts
  • TimeSeries Collector, for plotting Date-Values on the X-Axis
  • a XYZ-DataSet-Collector for 3-Dimensional charts.

Thanks to the backward-compatibility promise, we cannot go in an change the old collectors without breaking existing reports. So we decided that creating new collectors was the smartest option. May the old code rot in hell. In most cases, the new collectors produce the same results as the old ones. As part of the hardening, they react differently to invalid types and null-values, as they now ignore the invalid values instead of crashing the whole chart-creation process. The Pivoting-Collector was built from scratch, based on how the original one was expected to work originally. The collectors now have a greatly reduced complexity and a smaller set of properties needed to configure them. And while we were at it, we removed the Number One trouble maker, the “summary-only” flag, as this can be safely deducted from other properties.

The Chart-Expressions themselves have seen a bit of change as well. WebDetails, the creators of the Community Dashboard Project, contributed logarithmic axes and a smarter way to generate human readable labels for them. Roman Wild contributed a Radar-Chart implementation. And finally we went through our own support-cases bucket and addressed many of the issues reported there. As a result, Citrus-PRD now allows to configure the tick-mark generation on the axes, the Bar-Line chart can have shared y-axis and the percentage-mode on stacked-area-charts finally works too.

Oh, and the chart-editor looks a lot better now, thanks to the hard work of our UI-designer Brett.

Oh No! I sold my soul to Xul

Xul, noun:
(1) the sumerian word for “evil”. According to the “Necronomicon” (the book of the dead, well known from various movies) conceived by H.P. Lovecraft, most of the less pleasant Gods of ancient Sumer were named “~ Xul”.
(2) the name of the evil diety in the Ghostbuster movies.
(3) the name of the Xml-UserInterface-definition Language of the Mozilla Foundation.

Yesterday, on a dark and stormy night, I finally completed the ritual of summoning. While thunder rolled in like waves from the ocean, crows screamed through the night and an unholy gloom covered the landscape in a rotten light, Xul heard my calls and in (yet another) exchange for my immortal soul I gained control over powers no sane human should possess.

It all started a long time ago, on the fateful day we decided to sacrifice the old report-designer’s codebase on the altar of the Gods of Clean and Maintainable Designs. The reborn designer should not only be bigger, better, faster .. [a lengthy list of positive properties later] but also more maintainable, so that we spend less time fighting the structure and more time being productive. In my book, that immediately boils down to modularization. It worked perfectly well for the reporting engine, works for Kettle, even the Platform heard the call and transformed the Platform 2.0 into a modular design.

So we all agree – “modular is good”(tm) . But only the evil guys can take the treasure without paying. Good things always come at a price. And that price in software engineering boils down to adding extra code and rules to manage the modules.

In the meantime, on a continent far far away, Pentaho created a new UI project to use the powers of Xul to make common dialogs and UI elements more reusable. Pentaho-Xul reads a subset of the Xul language and allows to create scripted UIs much similar to what Mozilla’s webbrowsers do. The Kettle-Database dialog was the first one to be Xulified; later some internal projects followed.

Xul provides several easy ways to create pluggable and extensible UIs. You can either include Xul-Fragments or you can use overlays to reconfigure a existing UI from within a plugin. (Xul, being a evil deity knows: Ascetic purity is for saints. Offer an easy way, and you won’t have troubles harvesting sinners.)

For the new report-designer, full Xulification was never an option. After all, we wanted to get rid of magic instead of adding yet another layer of automagical super-configurable UI-or-whatever-generators. But we still have – no, had – the problem of allowing customizations of at least the toolbar and the menu.

And so Xul entered my life through the backdoor. The voice roamed through my nightmares and promised easy salvation. At the end I had to give in. Luckily Xul is not one of the old fashioned daemons, who always request bloody sacrifices of goats or virgins (Ever tried to find a virgin these days? Forget it!) Xul just wanted to be a integral part of the project and all plugins we or anyone else ever writes, so that he can fest on the souls of the unhappy fools who try to find a Xul-specification document or officially blessed documentation of any kind. I’ve spend my early years as Cobol and RPG/400 programmer, there’s not much soul left here, so buying in was easy for me:

Report-Designer 2.0 ships with Xul.

Pro: Plugins can inject themself in all menus, context-menus and the toolbar.
Con: Your soul may be at risk, if you try to use Xul to modify anything else in the report-designer or if you try to find documentation other than the code.

Purifying action: Report-Designer 2.0 is on the way

Three weeks! It now has been three weeks since the purification started. Three weeks of furious deleting, refactoring, screaming, crying, contemplating suicide, restructuring, analyzing and finally fixing the report-designer’s codebase.

Are we finished now? Not yet. The current SVN version is now in a state where all the major parts are in place, but the overall experience is still .. experimental.

My development theme for the new report-designer is “back to the roots”. Almost all external dependencies are gone now. Well, some minor dependencies are still waiting for their extermination:

  • JGoodies Formlayout – a creature spawned in the depths of the seventh circle of hell – is still used in many of the old dialogs. As this layout manager is as hard to maintain as the infamous Gridbag-Layout of the core JDK, there are no advantages of using it. Whenever we now touch one of the old dialogs, we will replace this nightmare with a plain Gridbag-Layout. GridbagLayout may be ugly, but at least we don’t have to ship a Jar for it or force other developers to learn yet another useless trick.
  • IntelliJ IDEAs NotNull-annotations. The idea of marking parameters, fields and return-values as nullable seems to be a good idea at first. But sadly, the implementation of said idea is horrible. When using IntelliJ, violations of the annotation constraints easily remain undetected – the compiler does not warn you and so you won’t see errors until you get the Exception at runtime. In addition, anyone not using IntelliJ wont have much use of these annotations anyway. Therefore, for the new code and all classes touched during the redesign: Let’s kill these annotations. Now we replace them with strict coding guidelines. We start with explict checks on all public or protected methods, wherever we do pass object-references around. On a architecture level we prevent NullPointerExceptions by having a strict object-model with strict type-checks. And of course, this includes that we do not pass private collections around, as it seems all so common these days.
  • DOM4J and Jaxen: These classes are used by the Pentaho-Version-Checker to parse a structurally simple XML document. One could use a combined total of 500KB of libraries or one could use down-to-earth XML-parser implementations that use JDK-1.5 functionality only. Guess what the old code does and guess with what we will end up with.

Very early in the game, we already removed the Pentaho-Platform, as using a ?? MB collection of libraries to read a single properties-file was a bit ridiculous. We removed all magical code-generation (that was done via asm-lib and reflection) as all of these uses were easily replaceable by a single well-defined interface. We removed the PullParser and all the XML-magic as XML parsing in JDK 1.5 is really well-defined (heck, it was well defined in the age of JDK 1.2, for the usecases we need here). We removed SWT as I cant stand libraries that crash the Virtual Machine just because they cannot cope with the processor architecture or operating system. I still believe in the “100% pure Java” world Sun advertised ages ago. And we removed Castor, as a OR-mappers used for parsing a non-complex XML file is a WTF on its own.

Inside the code itself, we sliced through the event handling and created a clear event-dispatching path. Although it must be fun to virtually connect every object with every other object, I do not want to spend the few remaining years of my sanity on maintaining such a beast. Almost all cases of invoking methods by reflection were removed by simply creating interfaces containing the method to be called. But that road was no fun to walk – so at the end we ripped out the old data-model and rendering and editors and … and so the new report-designer works directly against the reporting-engine’s datastructures and with components that follow a strict separation of concerns.

And now, with all magic killed, we leave the dark age and enter the age of enlightening. With a sane and maintainable architecture to start with, we will finally be able to spend less time on maintenance and more time on developing new features. After all, bug-hunting is not a fun activity, so spending less time there leaves us more time for the fun stuff.

The new report-designer data-model is a thin wrapper around the visual objects of the reporting-engine. The model itself only provides some caching of the layout, change tracking and event-notification and nothing else. There is no point in wrapping around datasources or functions, as long as we set the social rule that after these objects have been altered we always notify the model of it.

The report-designer’s code now no longer contains copies of existing reporting-engine functionality. Datasources, layouting, expressions – it all comes from the one and only principal source now.

And the result of all of this: The report-designer now automatically provides access to all features of the reporting engine as soon as the reporting engine implements them. We can now guarantee that whatever the report-designer shows you in the edit view will be the same result the reporting engine will show you. Everything the reporting-engine can do, can now be done with the report-designer.

We now also started to define the first set of plugin-points for the report-designer, so that anyone can hook-in own editors, datasources and actions without dealing with all the complexity of the whole project.

Independence Day

While the former colonies celebrated their successful rebellion to evade British taxes on tea and other goods, I created my own Declaration of Independence (from old and obsolete rules that burden the pioneer spirit and now just exist for historical reasons).

As the name suggests, the “extensions-reportdesigner-parser” project is a Classic-Engine extension that allows the engine to use *.report files as produced by the current Report-Designer as ordinary report-definition files.

On the quick-win side, this immediately removes the need to convert reports into the extended-XML format during the publish process. As long term gain, this module acts as a compatibility layer as soon as we start the report-designer redesign.

There are several reasons why it is a good idea to separate the report-definition fileformat from the internal model representation:

  • The report-designer’s XML file references classnames and therefore makes it hard to do refactorings without affecting the names put in there. Now we can start to move classes and functionality around without having to worry how our actions may break existing reports.
  • The report-designer’s internal model is a incomplete (and in some cases incompatible) copy of the reporting-engine’s report-model. This results in subtle bugs and diverging feature sets, which in return leads to frustration and ugly hacks.
  • As the designer maintains a own copy of the reporting-engine’s model, changes in the reporting engine’s processing (notably new features or extensions of existing features) need a long time until they become available in the designer.
  • And of course: Maintaining a own model is expensive. The time we have to spend to keep the models in sync, is time we cannot spend to make the designer better.

This module now is the living proof, that there is no need for a separate report-designer model and therefore no need to have a *.report format in the future.

With 0.8.11, we can safely map everything that is inside the report-designer’s file format into the native elements of the reporting-engine. The charting in the designer is covered by the legacy-charting sub-project, the design-time properties in the *.report-files (like the horizontal and vertical rulers) are mapped into Element-attributes, and for each data-set offered by the designer, we have a sane equivalent. (And in case the dataset-specification itself was a bit insane in itself, as it happens with the XML-datasources, we now have legacy-projects for these parts as well. No matter how strange the code or behavior is, for the sake of full backward compatibility we’ll keep that code alive.)

During the next two weeks, we will concentrate on making the Report-Designer 2.0 ready for a bright future. At the end, we will end up with a designer that is a natural extension of the reporting engine core. In the process to that glorious goal, we will happily remove the code that simply duplicates the engine’s behavior. Add a bit of reducing information redundancy and let us reduce the magic of reflection to a minimum and we shall be in a good shape for the next years to come.

Converting Paintings into Tables

In Pentaho Reporting Classic, all Report-Elements are positioned somewhere on a canvas. Whenever an band is being printed, the layouting system jumps in an computes the final layout for the band and all child elements. After the layouting is complete, each element has some valid ‘bounds’, which describe where the painter would have to place the element on the canvas.

The table-generator’s work starts after all elements have a valid layout. For each visible element in the band, the layouter takes the bounds and generates a column or row break for each edge position. All bands of the report are added to a single table. Therefore the table’s column will be based on all generated column breaks of all bands.

Pentaho Reporting Classic has two table-export modes. In the ‘simple’ export mode, the table-generator ignores the right and bottom edge of the child elements (but not for the root-level band). If a ‘complex’ layout is requested, all boundary informations are taken into account.

Theory is dry, lets take a look at some numbers:

Lets assume we have a root-level band with a width of 500 point and a height of 200 points. The band has two childs, a label and a text-field. I’ll specify the bounds as absolute positions, (X1,Y1) denotes the upper-left corner, and (X2,Y2) denotes the lower right corner.

The bounds of all elements involved are:

  • Band: X1=0, Y1=0, X2=500, Y2=200
  • Label: X1=100, Y1=0, X2=300, Y2=100
  • Textfield: X1=100, Y1=100, X2=250, Y2=200

Let’s follow the table-generator’s steps. We assume that the complex layouting is used.

  1. The Band gets added: As there are no column breaks yet, A column break will be added for X1=0 and X2=500. A rowbreak is added at Y1=0 and Y2=200.The first break always marks the start of the table, and the last break marks the end (and total-width) of the table. The table now consists of a single cell, that has a width of 500 points and a height of 200 points.
     
  2. The Label gets added: As there is no column break for X1=100, a new column break is inserted. The table’s only cell splits into two columns.
      Label

    A column break for X2 gets inserted at position 300. The table now contains 3 columns.

      Label  

    The Label’s Y1 does not cause a row-break, as the band already caused one at this position. A row break for Y2 gets inserted at position 100. The table now consists of two rows.

      Label  
     
  3. The text field is added to the table. X1 does not cause a column break, as there is already one at this position. X2 causes a new column break at position 250. Note that the label already occupies the cell from X=100 to X=300. This cell will now span two columns. There is already a column break for the text-field’s Y1 position (at Y=100, caused by the labels bottom edge) and for the Y2 position (at Y=200, caused by the band’s bottom edge).
      Label  
      TextField  

If the table-generator uses the simple algorithm, the resulting table gets simplified in a second step. The column breaks at position 250 and 300 have been caused by a right edge of an report element. These breaks now get removed, so that the resulting table looks like this:

  Label
  TextField

Now it should be clear, that the table-generator works best, if all elements are properly aligned. All elements that should go into one row or column have to start at the same X and y positions. If the strict layouting mode is used, they also must end at the same position. Elements that should go into neighbouring cells must share a common edge. And finally: Elements that do not start at position Zero will cause an empty column or row.

In the next post, I’ll cover how Pentaho Reporting Classic computes cell backgrounds and borders.