
Pentaho, the Platform: How about not being a server anymore?

Out there in the wild, when it comes to talking about Pentaho, the first impression people have is: it’s large, it’s big, it’s heavyweight, it’s THE SERVER. It’s a beast that eats CPUs at night, and the next day administrators make barbecue on whatever CPUs are left.

You can really stun people by revealing the well-hidden secret that Pentaho, the Platform, is not heavyweight at all. Quite the contrary, if used correctly.

My journey on that started a couple of years ago, when I had to debug the reporting integration in the platform. This was the first time in my life I actually contemplated retiring as a farmer somewhere in the North American desert (not many plants, true, but one or two interesting ones exist). With every compile, redeploy and restart of the heavyweight JBoss server, I wondered: Have I sinned that much to deserve this hell? I’m sure Dante Alighieri had a 10th circle in his Divine Comedy, one that involved lots and lots of J2EE and JBoss debugging. But obviously he considered that too cruel to be believable, and therefore cut it out. Even Satan would not sink that low …

Obviously, as I’m sitting here writing this article, I found salvation beyond Peyote or eternal torture. And that salvation is simply not to use JBoss* (or any other J2EE system).

Revelation One: The Pentaho Preconfigured Installation is Not the Pentaho Platform

The official documentation and whitepapers make a fine (and really not obvious) distinction between the “Pentaho Platform” and the “Pentaho Server”. The server is the big, heavyweight thing that actually relies on a working J2EE infrastructure to run. So from now on, we will ignore this big pink elephant. The interesting pearl is the Pentaho Platform buried inside the Server.

The Platform is a couple of JARs, which are almost entirely made up of infrastructure and glue code. The Platform has only one purpose: to orchestrate the various components into a BI-related symphony. If I had to describe the platform in one single sentence, it would be: “The platform is a runtime environment for an XML-based process language that additionally provides auditing, logging, configuration and the other infrastructure needed to run BI jobs.”

Revelation Two: The Platform does not require J2EE at all.

Now we are leaving the heavyweight area and diving deep into the sacred land of resource efficiency! Although the platform provides several implementations of its interfaces that integrate seamlessly into J2EE environments, it also ships with implementations that are not tainted by any J2EE-related code. These clean implementations make it possible to integrate the platform into all kinds of Java applications.

For me, running the platform outside of a J2EE server means I can debug the components I write from inside my IDE. I do not have to deal with a heavyweight server that takes five or more minutes to start up or shut down. I do not have to dig through layer upon layer of application-server code before I reach the parts of the application that interest me. I do not have to deal with HTTP requests. I do not have to configure a server before I can work. I can start my work immediately.

When I have to deal with XActions and have to find out why the $%&&$ the thing is not working, I also tend to be faster simply attaching a debugger and seeing what’s going on under the covers instead of performing a pen-and-paper analysis of the XAction file itself. Run, listen for the crash, jump to the crash, and search the burning ruins for hints on what happened. Fast and simple. And since I started using the platform as an embedded tool, I have never had to set up JNDI datasources in JBoss or any other J2EE system, and I have never had to write a single XML deployment descriptor again. This is what heaven must be like.

But having the platform as an embedded toolkit opens a whole new world of opportunities. Maybe you have to provide bursting capabilities (that is: generating lots of reports and sending them out to a predefined list of recipients, much like what spammers do daily, but clean and family-friendly); the platform can do this for you with minimal effort. Maybe you need to integrate reporting into your application and at the same time have to ensure (and prove later) that the reports have been generated and distributed correctly. Or, in an extreme case, you need to query a web service to provide parameters for a query against an OLAP server to feed a Kettle transformation to run a sequence of reports that are distributed via email; the embedded platform lets you run that whole XAction as easily as a simple report.

Revelation Three: Code!

Up to Platform 1.2.0, there was a sub-project called the Pentaho-SDK, which contained a couple of examples on how to execute XActions in standalone mode. An SDK for an open-source project (where the full sources are always available) was some sort of strange beast, so this project ceased to exist and only the SVN server knows where its spirit went. However, the death of the SDK cut off the audience that just wanted to run the platform and did not want to deal with all of the platform’s code.

So here we start again.

(1) Setup the project

  • Grab the latest sources, collect all JARs from “thirdparty/lib” and all its subdirectories, and copy them into your project’s lib directory.
  • Add all these JARs to your project’s CLASSPATH.
  • Build the platform and add the generated JARs to your classpath as well.
  • Grab a configured copy of the solution directory.
  • Configure JNDI so that the components know how to access the database(s) (see the sketch after this list).
  • (Remove BIRT from the system-listeners, as it does not seem to initialize in standalone mode.)

or

  • Download the preconfigured standalone environment 🙂 (scroll down)
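
For the JNDI step, here is a minimal sketch of one possible setup that works without an application server. It assumes you use the Simple-JNDI library to resolve datasource lookups against plain property files; the factory class, the property names and the directory path are assumptions based on Simple-JNDI and have to match whatever your environment actually ships with.

  // Hedged sketch: let JNDI lookups resolve against property files on disk
  // via Simple-JNDI instead of an application server. The directory below
  // is assumed to contain e.g. a jdbc.properties file that defines the
  // datasources your solution refers to; adjust names and paths as needed.
  System.setProperty("java.naming.factory.initial",
      "org.osjava.sj.SimpleContextFactory");
  System.setProperty("org.osjava.sj.root",
      "/home/src/pentaho/pentaho-demo/simple-jndi");
  // Simple-JNDI splits lookup names like "SampleData/url" on this delimiter.
  System.setProperty("org.osjava.sj.delimiter", "/");

If your environment provides JNDI through some other mechanism, skip this and configure that mechanism instead.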

(2) Java: Initialize the platform.

Initializing the platform is easy: all you have to do is provide a standalone context and point it to your solution directory.

  public static boolean initialize()
  {
    try
    {
      // We need to be able to locate the solution files.
      // In this example, we point directly at the extracted demo solution.
      final File solutionRoot =
          new File("/home/src/pentaho/pentaho-demo/pentaho-solutions/");
      final File applicationRoot = new File("/home/src/pentaho/pentaho-demo/");
      final StandaloneApplicationContext context =
          new StandaloneApplicationContext(solutionRoot.getAbsolutePath(),
              applicationRoot.getAbsolutePath());

      // Initialize the Pentaho system.
      return PentahoSystem.init(context);
    }
    catch (Throwable t)
    {
      // Of course, you should have some better error handling than I have ;)
      t.printStackTrace();
      return false;
    }
  }

(3) Execute your XAction. The XAction path must be given relative to the solution repository’s root directory. The parameters must be passed in the HashMap and must match the parameters declared by the XAction. With some more code it is possible to put a UI on top of this process that queries the parameters in the same way the Pentaho Server’s HTML UI does.

    final String xactionPath =
        "samples/steel-wheels/reports/Income Statement.xaction";
    final HashMap parameters = new HashMap();
    parameters.put("output-type", "pdf");

    final FileOutputStream out = new FileOutputStream("/tmp/report.pdf");
    try
    {
      final ISolutionEngine engine = SolutionHelper.execute
          ("Just a description used for logging", "User (only for logging)",
              xactionPath, parameters, out);
      // The messages generated during the run can be inspected here.
      final List messages = engine.getExecutionContext().getMessages();
      engine.getExecutionContext().dispose();

      // 'out' now contains whatever the XAction produced.
    }
    catch (Exception e)
    {
      e.printStackTrace();
    }
    finally
    {
      out.close();
    }
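
To make the bursting scenario from earlier a little more concrete, here is a hedged sketch built on the very same SolutionHelper call and xactionPath as above. The recipient list, the “recipient” parameter and the sendMail() helper are inventions for this example; your XAction has to declare whatever per-recipient inputs it really needs, and the mail transport is entirely up to you.

    // Illustrative bursting loop: run the same XAction once per recipient
    // and hand the generated PDF to your own delivery code.
    final String[] recipients = {"alice@example.com", "bob@example.com"};
    for (int i = 0; i < recipients.length; i++)
    {
      final HashMap burstParameters = new HashMap();
      burstParameters.put("output-type", "pdf");
      // Hypothetical parameter: only works if the XAction declares it.
      burstParameters.put("recipient", recipients[i]);

      final ByteArrayOutputStream report = new ByteArrayOutputStream();
      final ISolutionEngine burstEngine = SolutionHelper.execute
          ("Bursting run", "batch-user", xactionPath, burstParameters, report);
      burstEngine.getExecutionContext().dispose();

      // sendMail() is not part of the platform; plug in your own transport.
      sendMail(recipients[i], report.toByteArray());
    }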

(4) Clean up. Always shut down the platform before you exit the application. You want to be sure that all data is written into the databases and that all buffers are flushed.

    PentahoSystem.shutdown();
    System.exit(0);

So go ahead, download the package and start walking the lightweight path.

* Nitpicker’s corner: JBoss as used in this article actually represents all the evilness found in all J2EE servers. No matter whether you choose JBoss, WebSphere, BEA or whatever J2EE server you prefer, they are heavyweight machinery and not meant to be used for developing applications. Once you are finished developing, they surely form a superior runtime environment for your J2EE code, but everything that makes them good in production makes them horrible for development. Slow startups, a heavy footprint and lots and lots of XML descriptors: efficient development should look different from that.

Converting Paintings into Tables

In Pentaho Reporting Classic, all report elements are positioned somewhere on a canvas. Whenever a band is being printed, the layouting system jumps in and computes the final layout for the band and all of its child elements. After the layouting is complete, each element has valid ‘bounds’, which describe where a painter would have to place the element on the canvas.

The table-generator’s work starts after all elements have a valid layout. For each visible element in the band, the layouter takes the element’s bounds and generates a column or row break for each edge position. All bands of the report are added to a single table, so the table’s columns are based on the column breaks generated by all bands.

Pentaho Reporting Classic has two table-export modes. In the ‘simple’ export mode, the table-generator ignores the right and bottom edges of the child elements (but not those of the root-level band). If the ‘complex’ layout is requested, all boundary information is taken into account.

Theory is dry, so let’s take a look at some numbers:

Let’s assume we have a root-level band with a width of 500 points and a height of 200 points. The band has two children, a label and a text field. I’ll specify the bounds as absolute positions: (X1,Y1) denotes the upper-left corner, and (X2,Y2) denotes the lower-right corner.

The bounds of all elements involved are:

  • Band: X1=0, Y1=0, X2=500, Y2=200
  • Label: X1=100, Y1=0, X2=300, Y2=100
  • Textfield: X1=100, Y1=100, X2=250, Y2=200

Let’s follow the table-generator’s steps. We assume that the complex layouting is used.

  1. The Band gets added: As there are no column breaks yet, a column break is added at X1=0 and at X2=500. A row break is added at Y1=0 and at Y2=200. The first break always marks the start of the table, and the last break marks the end (and total width) of the table. The table now consists of a single cell that has a width of 500 points and a height of 200 points.
     
  2. The Label gets added: As there is no column break for X1=100, a new column break is inserted. The table’s only cell splits into two columns.
      +----------+--------------------------------+
      |          | Label                          |
      +----------+--------------------------------+

    A column break for X2 gets inserted at position 300. The table now contains 3 columns.

      +----------+--------------------+-----------+
      |          | Label              |           |
      +----------+--------------------+-----------+

    The Label’s Y1 does not cause a row-break, as the band already caused one at this position. A row break for Y2 gets inserted at position 100. The table now consists of two rows.

      +----------+--------------------+-----------+
      |          | Label              |           |
      +----------+--------------------+-----------+
      |          |                    |           |
      +----------+--------------------+-----------+
     
  3. The text field is added to the table. X1 does not cause a column break, as there is already one at this position. X2 causes a new column break at position 250. Note that the label already occupies the cell from X=100 to X=300; this cell will now span two columns. There is already a row break for the text field’s Y1 position (at Y=100, caused by the label’s bottom edge) and for the Y2 position (at Y=200, caused by the band’s bottom edge).
      +----------+-------------+------+-----------+
      |          | Label              |           |
      +----------+-------------+------+-----------+
      |          | TextField   |      |           |
      +----------+-------------+------+-----------+

If the table-generator uses the simple algorithm, the resulting table gets simplified in a second step. The column breaks at positions 250 and 300 were caused by the right edge of a report element. These breaks now get removed, so that the resulting table looks like this:

  +----------+--------------------------------+
  |          | Label                          |
  +----------+--------------------------------+
  |          | TextField                      |
  +----------+--------------------------------+

Now it should be clear that the table-generator works best if all elements are properly aligned. All elements that should go into one row or column have to start at the same X and Y positions. If the strict layouting mode is used, they also must end at the same positions. Elements that should go into neighbouring cells must share a common edge. And finally: elements that do not start at position zero will cause an empty column or row.
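
To make the mechanics of the example explicit, here is a small, self-contained sketch of the break-collection idea. It is not the actual Pentaho Reporting code; the class and method names are made up for illustration, and it only models the complex mode walked through above.

  import java.util.SortedSet;
  import java.util.TreeSet;

  // Illustrative only: collects the sharp cell boundaries that the
  // table-generator derives from the element bounds.
  public class BreakCollector
  {
    private final TreeSet xBreaks = new TreeSet(); // column breaks
    private final TreeSet yBreaks = new TreeSet(); // row breaks

    // The root-level band always contributes all four of its edges.
    public void addBand(final int x1, final int y1, final int x2, final int y2)
    {
      xBreaks.add(new Integer(x1));
      xBreaks.add(new Integer(x2));
      yBreaks.add(new Integer(y1));
      yBreaks.add(new Integer(y2));
    }

    // In complex mode, child elements contribute all four edges as well;
    // the simple mode would skip x2 and y2 here.
    public void addElement(final int x1, final int y1, final int x2, final int y2)
    {
      addBand(x1, y1, x2, y2);
    }

    public SortedSet getColumnBreaks() { return xBreaks; }
    public SortedSet getRowBreaks() { return yBreaks; }

    public static void main(final String[] args)
    {
      final BreakCollector collector = new BreakCollector();
      collector.addBand(0, 0, 500, 200);        // root-level band
      collector.addElement(100, 0, 300, 100);   // label
      collector.addElement(100, 100, 250, 200); // text field
      // Prints [0, 100, 250, 300, 500] and [0, 100, 200]: four columns
      // and two rows, exactly as in the walk-through above.
      System.out.println(collector.getColumnBreaks());
      System.out.println(collector.getRowBreaks());
    }
  }

Each pair of adjacent break positions then becomes one column (or row), and an element spans every column between its X1 and X2 positions.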

In the next post, I’ll cover how Pentaho Reporting Classic computes cell backgrounds and borders.

Pagination and Content Generation Strategies

In the reporting world (and maybe for other heavy-duty content generators as well), one of the most common (and most difficult) problems is laying out the generated content into pages. It is not just about describing where to put an element; the real problems start when you have to insert page breaks and page headers and footers.

The easy approach is to simply ignore pagination while the content is generated. This allows the content generator to be as simple as possible: that implementation does not have to know anything about layouts, pages, headers or footers. In a second step, a layouter jumps in, processes the generated content stream and cuts it into pieces that fit on a page.
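
As a rough illustration of that separation (the interfaces below are invented for this sketch and do not belong to any of the engines named here), the two passes could look like this:

  import java.util.List;

  // Pass one: produce a flat content stream without any notion of pages.
  // The generator knows about data and report definitions, nothing else.
  interface ContentGenerator
  {
    List generateContent(Object reportDefinition, Object data);
  }

  // Pass two: cut the finished stream into page-sized pieces and merge in
  // the page header and footer templates.
  interface Paginator
  {
    List paginate(List contentStream, Object pageLayout);
  }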

This strategy is usually found in document-oriented reporting engines, like BIRT, Windward Reports, the Pentaho Reporting Flow Engine or the truly horrible reporting system of OpenOffice 2.0. In their report-generation approach, they are very close to the mail-merge functionality of today’s word processors. After the content has been generated, a second engine jumps in to perform the layouting and pagination. In the case of OpenOffice, the layouting is outsourced to the word processor itself, which makes the reporting engine extremely lightweight.

BIRT, on the other hand, simply defines that a page-header or footer cannot reference the normal-flow content in any way. Although this makes it very easy to implement the layouting afterwards, it severely limits the usefulness of these headers and footers.

The classic office documents have a separation between page sections and the so-called normal-flow content. The content that is distributed over the pages is contained in the normal flow. The page header and footer are defined outside the normal flow in a page layout (sometimes called a master page), and during the layouting, the master page and the normal flow are merged together. Page headers and footers behave like templates: they are defined once and applied multiple times. Some systems allow the header and footer to reference properties from the normal flow (like the current section title).

The more complicated approach performs the pagination while the content is generated. The separation-of-concerns architectural pattern clearly states that such behaviour is stupid, if not evil, as it leads to overly complicated, tightly coupled systems. Maintaining such an architecture is a nightmare.

But there are some advantages which justify this. Coupling the content generation with the layouting and pagination allows the content generator to use feedback from the pagination process. A reporting function can then perform page-local computations based on the pagination events it receives.

In the reporting field, there are a few cases where it is necessary to couple the pagination with the content generation. One such requirement is to perform page-based calculations (for instance, the count and/or sum of all items printed on the current page). But agreed, these special requirements are rarely encountered.
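
To illustrate what the coupling buys you, here is a minimal sketch of such a page-local calculation driven by pagination events. The listener interface and its methods are invented for this example; they only mimic the general shape of the page events such engines expose, not any concrete API.

  // Illustrative only: a function that maintains a per-page running sum by
  // listening to events fired by the pagination/layouting stage.
  interface PageEventListener
  {
    void pageStarted();
    void itemPrinted(double value);
    void pageFinished();
  }

  class PageSumFunction implements PageEventListener
  {
    private double pageSum;

    public void pageStarted()
    {
      // The layouter announces a new page: reset the page-local state.
      pageSum = 0;
    }

    public void itemPrinted(final double value)
    {
      // Called for every item that actually ended up on the current page.
      pageSum += value;
    }

    public void pageFinished()
    {
      // At this point the sum could be handed to the page footer.
      System.out.println("Sum of the items on this page: " + pageSum);
    }
  }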

The mixed approach is common among the classic reporting engines, like Crystal Reports, JasperReports and Pentaho Reporting Classic. All of these reporting engines allow their users to put any content into the page header and footer, and generally treat these page sections as dynamic content that can be changed during the report processing run. Although this offers the maximum flexibility possible, it is also the reason for most of the complexity found in these engines.