Category Archives: Advanced Topic

Unix: Tell your system how to treat PRPT files as ZIP files

As a BI-Solution developer, every now and then there comes a time to bring out the serious tools. When dealing with report files, you may find that you need to peek inside the ZIP structure, to trace down an error, or to simply look at some of its metadata quickly.

But unzipping manually is tedious. Luckily, on every Linux system (and on Cygwin), the Midnight Commander makes file operations on the console a lot easier. The MC treats ZIP files as a virtual file system, and allows you to browse them like subdirectories and makes it possible to add, edit or remove files from these archives easily.

The only problem is: It only works if the Midnight Commander recognizes the file as a ZIP file. And sadly, for PRPTs it does not.

Internally the Midnight Commander relies on the ‘file’ command, which relies on the ‘magic’ database to check files against known signatures. So lets teach MC how to play nice with PRPTs.

Create a $HOME/.magic file with your favourite text editor, and add the following content:

# ZIP archives (Greg Roelofs, c/o zip-bugs@wkuvx1.wku.edu)
0    string        PK\003\004
>30    ubelong        !0x6d696d65
>>4    byte        0x00        Zip archive data
!:mime    application/zip
>>4    byte        0x09        Zip archive data, at least v0.9 to extract
!:mime    application/zip
>>4    byte        0x0a        Zip archive data, at least v1.0 to extract
!:mime    application/zip
>>4    byte        0x0b        Zip archive data, at least v1.1 to extract
!:mime    application/zip
>>0x161    string        WINZIP          Zip archive data, WinZIP self-extracting
!:mime    application/zip
>>4    byte        0x14        Zip archive data, at least v2.0 to extract
!:mime    application/zip

>30    string        mimetype
>>50    string    vnd.pentaho.        Zip archive data, Pentaho Reporting

Then all you need to do is compile the file, and you are ready to go:

file -C -m $HOME/.magic

This creates a $HOME/.magic.mgc file, which contains the compiled magic code.
Test it via

file a_test_report.prpt

and if it says, “Zip archive data, Pentaho Reporting” your system has learned how to treat PRPTs as ZIPs.

PRD-4090: Data-Factories calling other Data-Factories

DataFactories now have a changed “initialize(..)” method, which
takes a single “DataFactoryContext” object.

The context contains the known init-properties (configuration,
resource manager, resource bundle factory) – nothing new here.
In addition to that, there is now also a property called
‘contextDataFactory’, which is the data-factory of your current
report.

Runtime use:
————

Calling initialize is now mandatory for all datasources. If you
get strange exceptions from the class
“AbstractDataFactory”, then you forgot to call initialize(..).

The data-factory you get from the context is already initialized,
so you can happily start using it. Please do not make any assumptions
on the contents of the data-factory. Calling queries is fine, but
trying to dismantle it to get to the underlying implementation is
not.

Design-time:
————

I steamlined the designtime code a bit. When you need to preview
a data-factory, you can now get a properly initialized data-factory via:

org.pentaho.reporting.engine.classic.core.designtime.datafactory.DataFactoryEditorSupport#configureDataFactoryForPreview

The DataSourcePlugin interface now has a new parameter on its
performEdit method. The “DataFactoryChangeRecorder” accepts
“DataFactoryChange” objects from your editor.

As a data-factory editor, you should not modify the report-definition directly. Return your changed data-factory via the method return
value. If you edit existing data-factories, record the original,
unmodified data-factory and the changed data-factory in the
ChangeRecorder.

The report-designer or report-design-wizard will take that change
information and incorporate it into the report. Letting the caller
have control over that allows us to record Undo information in PRD.

(Note: Undo recording is not implemented yet, as I ran out of time
here. For now, we just add/replace/delete data-factories without
creating undo-objects. We will have that fixed before we go GA,
simply work as if it is there for now.)

 

Killing XActions for Reporting – slowly and with pleasure

The new Pentaho-Metadata data-source scripting-extension I produced recently seems to be well-received. We now have a great opportunity to fully cut back on XActions for plain reporting uses.

Many users still have to stick with their old XAction driven reports. I assume that they do not do that because they enjoy the pain, or love programming in XML. No, the majority needs to drive parameter queries where the query itself is computed. Reasons for that are many:

  • If everything is placed in one query, the query gets insanely complex and unmaintainable.
  • The query access legacy systems with no sane data models or weird partitioned tables.
  • The data-source needs to be configured based on some other parameter before it is used.

 Now users who have been locked in by these cases can now free themselves from the slavery of weird XML-programming. I happily announce that

Pentaho Reporting now ships with sane scripting support for JDBC, Pentaho Metadata, all Mondrian and all OLAP4J data-sources.

The data-source scripting system of the old days is now a dark memory of the past. The new scripting is integrated into the data-source itself, so scripts become more reusable. Editing large scripts on the report’s “query::name” attribute was never really fun anyway.

The scripting extension allows you to configure the data-source before it is used for the first time. You can now configure all aspects of the data-source via your script, including but not limited to any supported connection-parameter like the JNDI name or additional JDBC-connection properties:

var jndiConnectionProvider = dataFactory.getConnectionProvider();
jndiConnectionProvider.setConnectionPath("java://myjndiconnection");

You cannot get away with not configuring the data-sources at all, as the report designer and all other tools love to have something static to work with. But you can always reconfigure them to your liking before the report is run. The scripts are always called before the a query is executed on the data-source, even inside the Pentaho Report Designer or the data-source editors.

I don’t believe that we will need similar scripting support for the Kettle, Java-Method-Invocation, Table or the Scriptable data-sources. These data-sources either do their own scripting anyway or would not profit from any scripting abilities.

Pentaho Reporting’s Metadata DataSources now with more scripting power

Pentaho Reporting is rather flexible. No, astonishingly flexible. Calculations for styles, attributes and even queries make sure that there is always a way to tweak a report exactly the way you need it. But until now, there were a few things that were not exactly easy. Custom Data-sources and their method of calculated queries were one of them.

With today’s code drop, this story has fundamentally changed. The Pentaho Metadata data-source is the first data-source that combines static queries with optional scripting features. There is no need to have a “Custom-Metadata” datasource anymore. No need to cram your query calculation logic into a tiny field on the report’s query-attribute.

Here it is: The new and greatly improved Pentaho Metadata Data-Source:

The scripting extensions hide beneath the two new tabs on the top and on the query-text area. Those who don’t need scripting or who are scared of having to program can safely continue with their life – just ignore the scripting-tabs and you will be safe in your static world. But if you feel you are born to greatness and therefore you cannot be limited to predefined queries, then follow me on a exciting journey into the land of power-scripting.

The scripting extensions are split into a global script and a per-query script. The global script can be used to define shared functions or global variables that can be seen by all query-scripts. The init()function can also be used to change the configuration of the data-source itself. The per-query-scripts allow you to (a) customize the query string, (b) calculate the “additional fields”  information for the query-caching and (c) allows you to post-process the returned table-model.

The original “Custom-*-Datasources” always had the problem that you can only add one of these datasources per report or subreport. Now that we combine scripts with named queries, we can have a independent script for each of the queries and of course as many datasources as necessary.

The scripting backend uses the JSR-223 (javax.script) scripting system. By default we ship with both JavaScript and Groovy support. There are quite a few JSR-223 enabled languages out there, so if you have a favourite language, just drop it into the classpath on both the Pentaho Platform server and the Pentaho Report Designer and you are ready to code.

The picture above shows you the global script pane. For the two languages (JavaScript and Groovy) we ship with, we have a prepared template script that contains some documentation as well as the empty function declarations of the functions we expect to be able to call.

It is absolutely OK to delete any function you dont need – in that case Pentaho Reporting simply ignores that part. This will help you to keep your scripts small and simple. You can load scripts from external sources as well, but be aware that it is your responsibility to make these external scripts available to the report at runtime. So if you load a script from a local directory, you have to ensure that the same works on the BI-Server (or your production environment) too.

For the upcoming Pentaho Reporting 3.9.0 release, the Pentaho Metadata datasource is the only datasource with this sort of scripting. Once that design has proven itself and I have more feedback on how well it works, we will add this scripting system to all major data-sources in Pentaho Reporting 4.0.

How to create a fancy tooltip in HTML reports with PRD

When you create reports that are mainly used on the web, you probably want to enrich your reports with some basic interactivity. Charts need links and tooltips on their data, drill downs need to be defined and hopefully the information overload of ordinary reports gets reduced via fancy images, hidden sections that only show up on demand and other techniques.

The most basic way of creating a annotation on a report is to provide tooltips. Sadly the HTML creators were weird scientists who were used to long and boring lists of footnotes instead of in-lined annotations.

Today I am showing you how to create fancy, JavaScript based tooltips onto a report. You can adapt the same technique to create other interactive elements, including Google Maps integrated reports or other Web-2.0 mash-ups.

Rich-Text tooltips: The idea

The tool-tip system I am integrating is based on the blog posting written by Michael Leigeber, a web designer and .NET developer who runs the Leigeber Web Development Blog.

The tool-tips behaviour is defined in a central script that gets included in the report’s generated HTML output. To include the script, we simply copy the whole raw HTML/javascript into the report attribute “html::append-body”. When executing the report, the contents of this attribute are copied into the HTML file before the first table is rendered.

If you want to know how the tooltip script works, please refer to Michael Leigeber’s blog posting.

On each element that we define, we can now define the tooltip as HTML text on the “onmouseover” attribute.

tooltip.show('Testing  123 ', 200);

The tooltip gets hidden when the mouse leaves the element via the “onmouseout” attribute.

tooltip.hide()

The first parameter in the “show” method call is used as ‘innerHTML’ on the generated tooltip element. So you instead of just plain text, you can include any HTML content you like, including images, tables or animations.

Dynamic Tooltips: Show me my data in the tooltip

We all agree that static tooltips are rather boring. A tooltip, well-placed can mean all the difference between a information-overloaded report and a report that shows the exactly the right information you need at exactly the right time. Show the main sales numbers, and move the detail content into the tooltip.

In Pentaho Reporting, most attributes and styles can either contain a static value or can be computed at runtime via an expression or formula. When you see a green plus on the last column of either the style or attribute table, then you will be able to add a calculation for that property.

To make the tooltip show your own data, you will need to make the first parameter of the tooltip.show(..) function call dynamic. The “onmouseover” property expects a string that is valid JavaScript. The reporting engine does not interpret this string at all, it just passes it into the resulting HTML document and lets the browser decide what to do with it.

So all we need to do, is to compute a JavaScript text that contains our data:

="tooltip.show('" & [myfield] & "');"

But careful. If myfield contains any character with special meaning in JavaScript, like quotes, your script is not going to work. To make it work we will need some proper quoting to turn the text into proper JavaScript code.

The Pentaho Reporting Engine offers the “QUOTETEXT” function for this purpose. This function takes two parameter. The first parameter is the text you want to quote, and the second parameter is a constant telling how you want to quote the text: “javascript”, “xml”, “html”, “formula-string” or “formula-reference”.

For now we will need two sets of qouting, “javascript” and “html” as our quoting, and alter the formula above to read:

="tooltip.show('" & QUOTETEXT(QUOTETEXT([myfield];"html");"javascript") & "');"

The inner QUOTETEXT ensures that the text given in [myfield] is proper HTML and that all special HTML characters are encoded properly. So ‘>’ gets converted into >, ‘<' into < and so on. The outer QUOTEXT function then ensures that the resulting text is also proper JavaScript code. It encodes all single and double quote characters and all newlines, tabs and so on into their properly encoded JavaScript counter parts. This is already a working tooltip, even though it is a bit primitive. Earlier on I said, you can use HTML text as tooltip. So lets do a bit of HTML magic here. Again, text in HTML needs to be encoded properly as well.

="tooltip.show('" & QUOTETEXT("

" & QUOTETEXT([PRODUCTNAME];"html") & "


" & QUOTETEXT([PRODUCTDESCRIPTION];"html") & "

";"javascript") & "');"

This produces a tooltip that prints the product code from the steel-wheels example and the product description divided by a horizontal line.

You can grab a sample report for Pentaho Reporting 3.8.0-GA:
fancy-html-tooltip.prpt

Access Databases with Dynamic Table Names in Pentaho Reporting

Every now and then we get the question on how to create a report on database tables with a dynamic date stamped table name. Sounds weirdly cryptic? Well, instead of having a single table for all orders, these systems may have many small tables for all orders of a specific month.

I tried hard to come up with a potential reason for these things:

  • Maybe the database is so weak that it can’t handle more than a few thousand rows in a table. But both MySQL nor PostgreSQL are quite capable for some serious data storage (if used right). And even Oracle can do that – and if you have enough money to buy a Oracle license you have money to buy a server (-farm, Oracle ain’t cheap!) too.
  • Maybe they want to speed up data access to the table. After all, a SELECT * FROM TABLE takes a long time if you don’t use a WHERE clause or a proper index.
  • Maybe they have not heard of normalization.
  • Maybe they love to have separate tables so that they can remove the data of older months. A SQL DELETE command is black magic. Deleting the database file in the filesystem is quicker.

So if you are using such a system – drop me a note. I am dying to know what application creates such a data-structure.

Reporting on these database structures is not straight forward and usually considerably slower than working on a proper data model. And here is how you do it, in simple steps:

Prerequisite: The data model we use in this tutorial

Lets assume we have a database filled with sales data. For the sake of simplicity, lets assume there are several tables named in the schema of “SALES_yyyymm”. The ‘yyyy’ part of the table name represents the four digit year, the ‘mm’ represents the two-digit month.

The tables are made up of the following structure (including some sample data*):

CUSTOMERNUMBER DATE ORDERID ARTICLEID QUANTITY PRICE
0001 20091015 112 Fluffy Dogs 100 15
0001 20091015 112 Gumpy Rabbit 9 14
0001 20091015 112 Angry Birds 12 5
0002 20091024 113 Busy Bees 99 1

*) Note: Yes I know that putting strings as article ID is bad. But it safes me from outlining a second table.

Prerequisite 2: Reporting made dynamic with a date parameter

Creating a static report with a hardcoded table name is easy. Write your query and be done with it. The tricky part is to create a report that selects the right table based on the a given date. That date could be the current system date for reports returning the current month’s turnover or it could be a historic date.

If you want to make the report dynamic, you will need a parameter to feed the report’s target date into the reporting system and into the query. The table-name in the database contains a partial date value. To make the report return data for a given date, we must calculate the database table name when the report runs based on whatever date the user entered.

Standard SQL datasources do not allow you to parametrize structural elements of a query. This is a security measure to prevent an attack known as “SQL Injection“.

If you need to compute structural elements of a query, you will have to use a “JDBC (Custom)” datasource (Data -> Add Data-Source -> Advanced -> JDBC (Custom)). Whenever you use a data-source marked as (Custom), you opt out of the standard parametrization. At that point, it is your responsibility to properly validate all incoming user data and to ensure that no one can enter illegal values.

The Simple-Case: Reporting within the same table

The Custom JDBC datasource only contains connection information. The actual SQL query that is executed is computed at runtime. To define a query, put a formula or expression into the “query::name” attribute of your master- or sub-report object.

Mike Tarallo create a excellent video tutorial on how to use the Custom JDBC data sources. So instead of repeating everything here, watch the video and learn what the Custom JDBC datasource is and how to use it properly.

Simple Parametrization: Using a string parameter

So we need to declare a parameter to select the table the report should use. You want it dead simple (just for testing, of course!)? Use a plain text parameter that allows the user to enter the year and month to use.

Parameter:
 Name: Simple
 Value Type: String

and finally, to prevent SQL injections, some validation:

Post-Processing-Formula: =IF(LEN([Simple]) = 6; IF (ISNUMBER([Simple]); [Simple]; NA()); NA())

Complex Parametrization ideas

You can use a ordinary date parameter to make use of the date-picker. Note that this returns a Date object. You need to convert this date into a valid text. For that add a second hidden parameter with the following Post-Processing Formula:

=MESSAGE("{0,date,yyyyMM}";[Date])

Alternatively, assuming you use a proper database, you can extract the valid dates from your database. Just query the database metadata for the tables that are available. (I am using MySQL for the examples here. Consult your database manual on how to access the database schema.)

SELECT SUBSTRING(table_name, 5) AS "SalesDates" 
  FROM INFORMATION_SCHEMA.TABLES
 WHERE table_schema = ''
   AND table_name LIKE 'SALES_%'

You can use this information to feed a list parameter.

Create your query

I assume you have a parameter that contains a suitable string (6 characters, yyyyMM format). Lets call this parameter “DateSuffix”. On the master-report’s “query-name” attribute, create a formula with the following syntax:

="SELECT * FROM SALES_" & [DateSuffix]

At runtime, the report engine will evaluate the formula and construct the proper query string. That query string is then executed by the Custom JDBC Data-Source.

Tip: The report designer does not use queries from formulas during the design process. Read this article to learn how to get designtime column information into the Pentaho Report Designer.

Summary

To create a simple dynamic report:

  • Create a parameter that returns the suffix of your table
  • Add a Custom JDBC Datasource to your report
  • Add a formula to the query-name-attribute of your report

Creating reports that summarize a full year

Single reports for a single month – that is easy stuff. But sooner or later someone will come back to you and ask for sales data spanning the last year.

Lets skip the parameter defining details. Create a parameter that returns the year as string and call it “YearText”.

To your database, each of the monthly tables is unrelated to any other of the monthly tables. Databases are great on providing referential integrity on the data within the tables. But generally they care little about the names of the tables and particularly do not assume that table names have any higher meaning.

In SQL merging data from several tables happens via an UNION operator. Tables that are merged via UNION must have the same structure – the number and types of the columns must match.

A SQL Union takes the form of

 UNION 

Or for our example:

SELECT * FROM SALES_200910 
 UNION
SELECT * FROM SALES_200911

Each select statement in a UNION is considered separately. Any “ORDER BY” clause is applied only to the particular table. If you want to sort globally, you will have to wrap the UNION result into another SQL query via a SUB-SELECT. All proper databases support Sub-Selects, but again: For details, consult your database manual.

Here is an example on how to use a sub-select to sort globally

SELECT * FROM 
  (SELECT * FROM SALES_200910 
     UNION
   SELECT * FROM SALES_200911) 
ORDER BY ARTICLEID

Sub-selects are considerably slower than normal selects. Most databases will start to use temporary tables for the selection result and therefore will answer such queries a lot slower than normal queries. If your performance goes down a lot, consider creating a data warehouse.

A query spanning a whole year will be long and complex.

SELECT * FROM SALES_200901
     UNION
SELECT * FROM SALES_200902
     UNION
SELECT * FROM SALES_200903
     UNION
SELECT * FROM SALES_200904
     UNION
SELECT * FROM SALES_200905
     UNION
SELECT * FROM SALES_200906
     UNION
SELECT * FROM SALES_200907
     UNION
SELECT * FROM SALES_200908
     UNION
SELECT * FROM SALES_200909
     UNION
SELECT * FROM SALES_200910
     UNION
SELECT * FROM SALES_200911
     UNION
SELECT * FROM SALES_200912

The matching formula for making the query dynamic would look like:

="SELECT * FROM SALES_" & [YearText] & "01
     UNION
SELECT * FROM SALES_" & [YearText] & "02
     UNION
SELECT * FROM SALES_" & [YearText] & "03
     UNION 
..

Creating more complex reports on such a model is a pain. I would recommend to create a single SQL VIEW that contains all your sales table.

This normalizes your data model without breaking your existing programsViews are virtual database tables that make it easier to write SQL queries.

CREATE VIEW Sales AS 
  SELECT * FROM SALES_200901
    UNION
  SELECT * FROM SALES_200902
    UNION
  SELECT * FROM SALES_200903
    UNION
  ..

If your database supports materialized views, it will create a permanent temporary table that can considerably speed up your reports.

The insanely complex case: Queries spanning over a user defined period of time

So you implemented a yearly sales report. The world is bright. Really? Not long.

One unhappy morning, your boss will come to your office telling you he wants to see the sales data for the last four month. And last years first half of the year, oh, and a up to date report showing the last 90 days of sales (ex: 15th of January to 13th of April).

If you use a consolidated view as outlined above, you are safe. But for the sake of argument, assume your database administrator’s religion does not allow the creation of VIEWs**.

Creating SQL queries using a BeanShell-Script

Formulas are a great tool for relatively simple problems. But as soon as the problem is more complex or requires dynamic elements, formulas quickly become unmaintainable.

Attributes can be computed by any report expression that returns a suitable value for that attribute. For complex logic, I recommend to use one of the Scripting expressions that comes with Pentaho Reporting.

The Bean-Scripting-Framework (BSF) Expression allows you to use a programming language to compute the query string. The BSF expression comes with support for Java, JavaScript, Groovy and many more programming languages.

Here is an example that computes a query for the last 90 days:

import java.util.Calendar;
import java.util.Date;
import org.pentaho.reporting.engine.classic.core.DataRow;

    Date date = (Date) dataRow.get("Date");
    Calendar cal = Calendar.getInstance();
    cal.setTime(date);
    StringBuilder b = new StringBuilder();
    b.append("SELECT * FROM SALES_");

    int year = cal.get(Calendar.YEAR);
    int month = cal.get(Calendar.MONTH);
    b.append(String.format("%1$04d", year));
    b.append(String.format("%1$02d", month));
    b.append(" WHERE ORDERDATE  ${Date} ");
      }
    }
    return b.toString();

This results in a query string that looks like this:

SELECT * FROM SALES_201106 WHERE ORDERDATE  ${Date} 
 UNION 
SELECT * FROM SALES_201104 WHERE ORDERDATE > ${Date} 
 UNION 
SELECT * FROM SALES_201103 WHERE ORDERDATE > ${Date} 

And now: Happy reporting!

**) This happens more often than you would think.

Creating your own Parameter UI for the Pentaho BI-Server

Our BI-Server ships with a default GWT parameter UI for the parameter defined on a report. If you had been around for a while, then you will remember the sigh of relieve when we freed everyone from the tyranny of XActions for running simple reports. Since then the parameter capabilities of the reporting module grew and grew with every release making these parameters easier to use than the XActions’s original design.

GWT is nice for a old and grumpy Java developer like me, as I do not have to worry about JavaScript (untyped, for heaven’s sake, untyped!). But the hardcore nerds like Pedro “JavaScript is my life” Alves do not like the monolithic garbage the GWT compiler spits out. To slow, to heavy, and foremost: Not really extensible unless you recompile the beast for each change. And worst of all: I agree to these complaints.

However, there is a silver lining on the horizon. Our architecture is open, so you are able to replace the GWT code with your own magic with no problems at all. And here is how you would do it:

Basics: The parameter UI architecture

The parameter UI works as a REST service. The server receives calls and sends responses based on the parameters given, without holding any server side state elsewhere.

(1) The browser loads the report.html page and initializes the GWT parameter application (GWT from now on).

(2) GWT calls the server’s report handler with “renderMode=PARAMETER” or “renderMode=XML”. If there are values for any of the parameters known, then these values are given on the URL using ordinary HTTP-GET or POST requests.

The URL that is called is something like this:

http://demo.pentaho.com/pentaho/content/reporting?renderMode=XML&solution=steel-wheels&path=%2Freports&name=Product+Sales.prpt

(3) The server responds with the parameter XML document. If the renderMode is XML, this document also contains the number of pages in the report. The server only returns page numbers if the parameter validate correctly and if the pagination does not cause any other errors.

(If you are logged into the demo server, call the URL from step 2 in your browser to see the XML document the server returns.)

(4) GWT creates a UI for all parameters based on the Parameter XML document. All information is given as attributes on the parameter. The parameter’s possible values and current value are given in that document as well. These sets of values can change if a other parameter changes.

(5) If all parameters validated correctly (according to the Parameter XML document), it now sends a request to retrieve the rendered report. Again, this is a ordinary HTTP-GET call with all parameters attached onto the URL.

http://demo.pentaho.com/pentaho/content/reporting?renderMode=REPORT&solution=steel-wheels&path=%2Freports&name=Product+Sales.prpt

(6) The Browser displays the report content in the IFrame below the GWT parameter UI.

(7) Paging through the report jumps back to step (6) and updates the report frame.

(8) Changing a parameter value jumps back to step (2) and updates the parameter information.

The mystical Parameter XML

The parameter XML document is a description of all known parameters that the reporting plugin understands. The same format is also used by the Analyzer component and you can even get parameter information in this format out of XActions.

<?xml version="1.0" encoding="UTF-8"?>
  <parameters accepted-page="-1" autoSubmitUI="true" is-prompt-needed="false" layout="vertical" page-count="1" paginate="true" subscribe="false">
    <parameter is-list="true" is-mandatory="false" is-multi-select="false" is-strict="true" name="PROD_LINE" type="java.lang.String">
      <attribute name="role" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="user"/>
      <attribute name="parameter-layout" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="horizontal"/>
      <attribute name="parameter-render-type" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="togglebutton"/>
      <attribute name="label" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="Line"/>
      <attribute name="mandatory" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="true"/>
      <values>
        <value label="Classic Cars" null="false" selected="true" type="java.lang.String" value="Classic Cars"/>
        <value label="Motorcycles" null="false" selected="false" type="java.lang.String" value="Motorcycles"/>
        <value label="Ships" null="false" selected="false" type="java.lang.String" value="Ships"/>
        <value label="Planes" null="false" selected="false" type="java.lang.String" value="Planes"/>
      </values>
    </parameter>
    <parameter is-list="true" is-mandatory="false" is-multi-select="false" is-strict="true" name="PROD_CODE" type="java.lang.String">
      <attribute name="role" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="user"/>
      <attribute name="parameter-visible-items" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="6"/>
      <attribute name="parameter-render-type" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="list"/>
      <attribute name="label" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="Product"/>
      <attribute name="mandatory" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="true"/>
      <values>
        <value label="1952 Alpine Renault 1300" null="false" selected="true" type="java.lang.String" value="S10_1949"/>
        <value label="1972 Alfa Romeo GTA" null="false" selected="false" type="java.lang.String" value="S10_4757"/>
        <value label="1962 LanciaA Delta 16V" null="false" selected="false" type="java.lang.String" value="S10_4962"/>
        <!-- loads of parameter values removed -->
        <value label="1961 Chevrolet Impala" null="false" selected="false" type="java.lang.String" value="S24_4620"/>
        <value label="1982 Camaro Z28" null="false" selected="false" type="java.lang.String" value="S700_2824"/>
      </values>
    </parameter>
    <parameter is-list="true" is-mandatory="true" is-multi-select="false" is-strict="true" name="output-target" type="java.lang.String">
      <attribute name="role" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="system"/>
      <attribute name="preferred" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="true"/>
      <attribute name="parameter-group" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="parameters"/>
      <attribute name="parameter-group-label" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="Report Parameters"/>
      <attribute name="label" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="Output Type"/>
      <attribute name="parameter-render-type" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="dropdown"/>
      <attribute name="hidden" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="false"/>
      <values>
       <value label="HTML (Paginated)" null="false" selected="true" type="java.lang.String" value="table/html;page-mode=page"/>
       <value label="HTML (Single Page)" null="false" selected="false" type="java.lang.String" value="table/html;page-mode=stream"/>
       <value label="PDF" null="false" selected="false" type="java.lang.String" value="pageable/pdf"/>
       <value label="Excel" null="false" selected="false" type="java.lang.String" value="table/excel;page-mode=flow"/>
       <value label="Excel 2007" null="false" selected="false" type="java.lang.String" value="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;page-mode=flow"/> 
       <value label="Comma Separated Value" null="false" selected="false" type="java.lang.String" value="table/csv;page-mode=stream"/> 
       <value label="Rich-Text-Format" null="false" selected="false" type="java.lang.String" value="table/rtf;page-mode=flow"/>
       <value label="Text" null="false" selected="false" type="java.lang.String" value="pageable/text"/>
     </values>
   </parameter>
   <!-- many many more parameter -->
 </parameters>

To create a usable UI, you need to look at some critical information in the parameter UI.

The parameter element itself contain the most critical information for a parameter.

<parameter is-list="true" is-mandatory="true" is-multi-select="false" is-strict="true" name="output-target" type="java.lang.String">

The parameter name gives the parameters internal name. Usually each parameter also has a “label” defined as attribute. The “label” is shown to the user, the “name” is sent to the server. The parameter element also indicates whether a parameter is a list or plain parameter. List parameter have a predefined list of values from which the user can choose from. Some list parameter allow the user to select multiple values (“is-multi-select”) or control whether the user can specify additional values (“is-strict”) or whether the parameter can be empty (“is-mandatory”). You can use this to perform client-side validity checks if you want to. But no matter what you do, the server checks the input anyway, as client side input is untrusted input. Not all parameter that are listed in the parameter document are shown to the user. Some are system level parameter that are only useful in certain situations (like all subscription parameter). A parameter is hidden if the hidden attribute is set to true.

 <attribute name="hidden" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="false"/>

How a parameter is presented to the user is defined by the “parameter-render-type”. Right now, this attribute can have the following values:

  • dropdown
  • list
  • radio
  • checkbox
  • togglebutton
  • textbox
  • datepicker
  • multi-line
 <attribute name="parameter-render-type" namespace="http://reporting.pentaho.org/namespaces/engine/parameter-attributes/core" value="dropdown"/>

Data formats for sending and receiving data

Sending parameter information over the internet can be a funny exercise. HTTP only allows to send text, so objects like dates and numbers need to be encoded properly. The Pentaho Reporting plugin expects all parameters in a locale-independent standard format. Numbers must be encoded as decimal numbers in plain english format (point as decimal separator, no thousands separators) using the format string “#0.#########”. Dates and Times must be given in the standard ISO format using the format string “yyyy-MM-dd’T’HH:mm:ss.SSS” or (if encoded with timezone information “yyyy-MM-dd’T’HH:mm:ss.SSSZZZZ”). (Also see: “Its about time – better Date parameter handling in Pentaho Report Designer“) Multi-selection parameter must be given by repeating the parameter name for each value. The order of multi-selection values is taken into account when processing the parameters on the server side. Example:

../report?multi-select=Value1&multi-select=Value2

Now all we need is a bit of JavaScript magic on the client side to replace the grumpy old GWT parameter application with a lightweight and smooth pure JavaScript/AJAX solution. Stay tuned ..

Additional Reading Material

Integrating PHP/ASP/Ruby-Web-Applications with Pentaho Reporting

From time to time we get the same question on the Pentaho Reporting Forum:
“How do I integrate the Pentaho Bi-Server and reporting capabilities with my own (PHP/ASP/JSP/Ruby/whatever-) web-application?” Thrilled by the BI-Server’s blazing features, many users want to enrich their applications with rich reporting and analysis capabilities, and unlike the dinosaurs of the BI-Market, our Open-Source solution makes this easy and cost effective.

The techniques I am going to describe in this blog-post are no magic and can be applied to other web-application (like our Data-Integration server) as well.

First, you need to get a Pentaho system up and running. For now, and for the sake of easiness, lets stick with the standard Pre-Configured Installation (PCI). The Pentaho PCI is a Apache Tomcat server that runs the Pentaho BI-Server as its only web-application.

Installing the Pentaho BI-Server

  1. Download Pentaho
  2. Unpack the ZIP or tar.gz archive into a directory. On Linux or Unix systems, I would recommend something like “/opt” or “/usr/local/” as target. On Windows .. well, in that case you are just experimenting and not planing for production, I assume. So any folder will do.
  3. Setup a publish password

    You need that later for publishing reports or other files to the server. This is the major obstacle for most new users when trying to get their own content up and running.

  4. Edit the file “bi-server-ce/tomcat/webapps/pentaho/WEB-INF/web.xml”

    Change the parameter “fully-qualified-server-url” to the public URL of your server.

    
    
      fully-qualified-server-url
      http://localhost:8080/pentaho/
    
    
  5. Optional: If some other tomcat installation runs on your system, your port 8080 may be in use already.

    You can change the port in the file “bi-server-ce/tomcat/conf/server.xml”:

  6. Start the Hypersonic database
  7. Start the Pentaho BI-server
  8. Optional: Install the BI-Server as a system service.

You should now be able to direct your web-browser to the BI-Server and should see the login screen when you type http://localhost:8080/ into your browser’s address bar.

Integration by Direct Linking

Integrating a BI-Server into a existing PHP application is easy now. All you need is a running BI-Server instance that is accessible by your users. From now on, I will simply use the Pentaho Demo Server for our little integration project.

http://demo.pentaho.com

The Pentaho Demo Server is a slightly tuned down installation of the Pentaho BI-Server. Anything I am going to show will work in the same way with the standard community edition.

When refering to the two applications, I will use “application” for the PHP, ASP or Ruby application that wants to utilize the Pentaho Platform. I will use the term “Pentaho” for the Pentaho BI-Server.

The most primitive way of connection two web applications together is to make the application include Pentaho in an frame or by opening up the specific report or analysis view in a new window.


  
    My Web Application
  
  
    

    This file could have been generated by any server side program. Use your phantasy here.    

   

    Reports can be parametrized by adding the parameters for the report to the URL. The parameters must be properly URL encoded. Use UTF-8 as character encoding.    

   

    This shows how to open up a report in a new Window:          A report    

     

Inside the Pentaho Platform, each report has its own page that can be accessed independently of the Pentaho User Console. The sample above links to a report on the system, using the predefined username and password. Note that the username and password is given in cleartext. To make this more secure you will need to set up a Single-Sign-On solution.

Pentaho Enterprise Edition comes with support for setting up Single-Sign-On for CAS rather easily via some setup scripts that take care of the ugly work. If you are using community edition and are willing to spend the time, you can always set it up yourself.

Alternatively you can set up Pentaho to work with Active Directory or LDAP instead.

Here are some sites with documentation to get you started.

And don’t forget: The Pentaho Wiki also contains a load of information.

Internally, Pentaho uses Spring Security, so with enough time and patience you should be able to connect Pentaho to anything on this planet for authentication purposes.

Access session and environment variables in a PRPT

Reports need to interact with the outside world in many ways. To formalize this, we allow to pass values from the outside to the inside. Pentaho Reporting knows several classes of inputs:

  • Environment variables
  • Parameter: User supplied values usually coming from an interaction of the user with a user interface
  • DataRow-Fields: part of the result-sets from the data-sources
  • Expressions and Functions: Calculated values

Today, we will talk about environment variables. Environment variables are system level parameter. You can use them to configure reports based on where you deploy the report.

We provide several properties for use within the Pentaho BI-Server.

  • serverBaseURL: the server URL, for example: “http://localhost:8080”
  • pentahoBaseURL: the location of the Pentaho Web-application: “http://localhost:8080/pentaho”
  • solutionRoot: The file-system location where the local copy of the solution repository can be found.
  • requestContextPath: the servlet’s request context path pointing to the local web-application: “/pentaho/”
  • username: The current user’s name
  • roles: The roles as CSV-encoded list. The CSV string is quoted according to the Excel CSV quoting rules.

Inside the BI-Server, environment variables can also be used to interact with the server’s environment to query session and global parameter information. The report environment allows you to access values stored on the current user’s session or as global parameter by using a special prefix.

Requesting a environment property “session:name” will query the session context for an entry called “name” and returns the string representation of this object to the report.

Likewise, prefixing a environment property with “global:” will query the global parameter collection of the BI-Server. Therefore querying a environment variable “global:name” will look for a entry “name” in the Bi-Server’s global parameters.

Accessing Environment variables with Formulas

The Pentaho Reporting Engine offers the formula function “ENV” to access the report environment properties.

Definition:

=ENV(String)

To access the variable called “example” as plain text, use

=ENV("example")

To access a value bound to the server-side session variable “example” use

=ENV("session:example")

If the variable is a CSV encoded string, you can use the CSVARRAY function to parse the string into a array of strings. This array is then suitable to use for multi-selection parameters or SQL-IN clauses.

=CSVARRAY(ENV("roles"))

You can now use these variables as part of other formulas, for instance to compute URLs pointing to documents on the server server.

=ENV("serverBaseURL") & "/myotherwebapp/downloads/pentaho-report-designer.zip"

Accessing Environment variables as predefined fields

When you open up the Pentaho Report Designer, you will see several of the well-known environment variables listed as fields on the report itself. Using the ENV function is flexible, but not very convenient. So we added a automatic mapping to the reporting engine.

The various env:* fields you see in a report are what I call “well-known” fields. As Pentaho Engineers we know knows that they exist on the server and are heavily used by our users. So we added a mapping to the system so that they automatically become fields without having to use a formula function to read their values.

The list of “well-known” fields is defined in the global report configuration. The global configuration can be edited by creating or editing a file called “classic-engine.properties” inside your Pentaho Report Designer and Pentaho BI-Server directories.

Inside the Pentaho Report Designer, this file resides in the “resources” directory in your Pentaho Report Designer installation directory. For the BI-Server you can find this file in the “pentaho/WEB-INF/classes” directory. To make your reports work consistently inside the Report Designer and the Server, make sure that both files have matching definitions.

For example: If you want to map your environment-property “value” into the field “env::value” then you would add the following line to your “classic-engine.properties” file:

org.pentaho.reporting.engine.classic.core.env-mapping.value=env\:\:value

In case value contains a comma separated list (using the Excel-CSV quoting rules for strings that need to be quoted), then you can make that list available as array by appending -array to its name.

org.pentaho.reporting.engine.classic.core.env-mapping.value-array=env\:\:value-array

The engine will recognize the “-array” suffix and will look for a “value” property instead, parses that into a string-array and returns you the array. The array then can be used in a SQL-IN clause.

The prefix “env::” is a convention inside the reporting engine to separate environment fields from other fields. If you omit this, you have to make sure that none of the fields returned by your data sources and none of your expressions or functions provide a value under the same name. This would overshadow your environment property and renders it inaccessible.

Performance monitoring and tuning in Pentaho Reporting

Are your reports running slow? Do you have problems finding the solution to the problem? The performance of reports depends on many factors, and as many of these factors are interdependent, it is not always easy to untangle that mix.

But since the release of Pentaho Reporting 3.8, at least there is some help in your quest for performance with some additional log settings that spit out some performance metrics.

With this release, we added a bunch of loggers at strategic places to monitor the performance of running reports.

org.pentaho.reporting.engine.classic.core.ProfileReportProcessing=true

This setting enables the profiling mode via the logger “org.pentaho.reporting.engine.classic.core.layout.output.PerformanceProgressLogger”.
If this property is set to ‘true’, a new progress-monitor gets added to the report process. This monitor can print periodic status messages while the report is processed.

The Monitor itself can be tuned to output only the information that you seem most relevant. You can enable the logger to print a statement on every new page, on every new processing level or you can print periodic updates for every 5000 rows that have been processed by the engine.

org.pentaho.reporting.engine.classic.core.performance.LogPageProgress=true
org.pentaho.reporting.engine.classic.core.performance.LogLevelProgress=true
org.pentaho.reporting.engine.classic.core.performance.LogRowProgress=true
org.pentaho.reporting.engine.classic.core.DebugDataSources=true
org.pentaho.reporting.engine.classic.core.ProfileDataSources=true

Pentaho Reporting is a layouting engine that combines tabular data from a data source with calculated data and a layout description to produce the final documents. During the last few years I have not seen a report where the report functions or expressions formed the bottle neck of the report processing, so I happily ignore them unless everything else fails.

The database as bottle neck

If you are using SQL or MQL data sources, then your database may be responsible for the slow execution of the report.

Depending on the database, the use of scrollable result sets may be lethal to the report performance. With a scrollable result-set the database only supplies a sliding window of the data to the reporting engine. This reduces the reporting engine’s memory footprint, but puts a larger strain on the network and I/O capabilities of the system. If you pull a lot of data and the database is slow on supplying the data via a scrollable result-set then you pay a lot for accessing the data without knowing it. The query itself will return quickly, as it only initializes the cursor leaving the job of transferring the data out of the cost-equation.

You can test whether the data source is the problem by disabling the use of scrollable result sets via the report configuration setting:

org.pentaho.reporting.engine.classic.core.modules.misc.tablemodel.TableFactoryMode=simple

Be aware that the memory consumption will go up with this setting as we fully buffer the result set in the local memory. On the positive side, access to the local memory is blazing fast and therefore your report will run faster now.

I usually test the database first, as this test is simple to make and rules out a large chunk of the potential trouble makers.

The layout as bottle neck

In the reporting engine, the performance of the report is usually limited by the layouting and text processing that is necessary to produce the report. Performance is mainly dependent on the amount of content printed. The more content you generate, the more time we will need to perform all layout computations.

Subreports: Inline vs Banded

Large Inline-sub-reports are the most notorious reason for bad performance. The layouted output of a inline-sub-report is always stored in memory. Layouting pauses until the sub-report is fully generated and then inserted into the layout model and subsequently printed. The memory consumption for this layouting model is high, as the full layout model is kept in memory until the report is finished. If the amount of content of the sub-report is huge, you will run into out-of-memory exceptions in no time.

A inline-sub-report that consumes the full width of the root-level band should be converted into a banded sub-report. Banded sub-reports are layouted in a streaming fashion and output is generated while the sub-report is processed. The memory footprint for that operation is small as only the active band or the active page has to be held in memory.

Images loaded from HTTP-URLs

Some image generating scripts on the web can cause troubles with the reporting engine. For content loaded via HTTP connections, we rely on the cache control headers sent by the server. For this we check the “last-modified” header of the HTTP-response. If there is none, then we consider this file non-cacheable.

We check cache header against the original object by issuing a HEAD request to the URL. This may cause problems in case the image service provider does not properly implement a HEAD handler.

In Pentaho Reporting 3.8, we work around most of these troubles by assuming the worst from the web-authors. A non-cachable file is still cached for at least 10 seconds (the default configuration) as most programmers of web-services do not care, do not know or are not able to send a “last-modified” header or implement a HEAD response handler. For long running reports, this value must be increased to a larger number that matches at least the expected runtime length of the report.

You can also disable cache control completely, which will cause the reporting engine to never reload the content as long as the main resource cache still holds a reference to it.

To configure both settings, edit the “loader.properties” file in the root of your classpath and set the following properties:

# Controls the mandatory lifetime of HTTP objects before we check for updates on the object 
org.pentaho.reporting.libraries.resourceloader.config.url.FixedCacheDelay=10000

# Disable cache updates completely. Once a resource is loaded it stays loaded until its 
# end of life in the underlying resource cache has been reached.
org.pentaho.reporting.libraries.resourceloader.config.url.FixBrokenWebServiceDateHeader=false

Misconfigured caching as source of slow-downs

As a general rule of thumb: Caching must be configured properly via a valid EHCache file. If caching is disabled or misconfigured, then we will run into performance trouble when loading reports and resources. A misconfigured cache can considerably slow down reports if you use a resource-message, resource-label or resource-field element in the report. The resource elements load their translations via LibLoader, which in return depends on properly working caches to avoid repeated and expensive I/O operations.

You can configure EHCache via an “eh-cache.xml” file placed into the root of the classpath.

Output targets as bottle neck

Pageable outputs

A pageable report generates a stream of pages. Each page has the same height, even if the page is not fully filled with content. When a page is filled, the layouted page will be passed over to the output target to render it in either a Graphics2D context or a streaming output (PDF, Plaintext, Pageable-HTML etc).

When the content contains a manual pagebreak, the engine will consider the page as full. If the pagebreak is a “before-print” break, the break will be converted to a “after-break” on the previous band. To apply this change, the internal report states must be rolled back and parts of the report processing restarts to regenerate the layout with the new constraint in place. A similar roll-back happens, if the current
band does not fit on the page.

To avoid unnecessary rollbacks prefer “break-after” pagebreak markers over “break-before” markers.

In most cases, the pageable outputs will be fairly simple and non-problematic, as they only hold a minimal amount of data in memory.

Table outputs

A table export produces a table output from a fully layouted display model. A table export cannot handle overlapping elements and therefore has to remove such elements.

In versions prior to Pentaho Reporting 3.8, we also added some debug information to the layout nodes. This did increases the memory consumption but made developing reporting solutions easier. Although these debug settings should never be enabled in production environments, many users of the community edition may still run with these settings enabled. In Pentaho Reporting 3.8 we changed the built-in default so that no debug-information gets added. Those who could interpret the output probably know how to enable the setting, and everyone else just suffered the bad performance.

# Config settings for the 'classic-engine.properties' file
org.pentaho.reporting.engine.classic.core.modules.output.table.base.ReportCellConflicts=false
org.pentaho.reporting.engine.classic.core.modules.output.table.base.VerboseCellMarkers=false

Streaming HTML exports

In HTML exports, there are a few settings that can affect export performance.

org.pentaho.reporting.engine.classic.core.modules.output.table.html.CopyExternalImages=true

This setting controls whether images linked from HTTP(S) or FTP sources are linked from their original source or copied (and possibly recoded) into the output directory. The default is “true” as this ensures that reports always have the same image.

Set it to false if the image is dynamically generated and should always show the most recent view. Disabling this setting also avoid some I/O for copying and recoding the images during the report generation, although for most images used in the report this performance gain wont be noticeable at all.

org.pentaho.reporting.engine.classic.core.modules.output.table.html.InlineStyles=false
org.pentaho.reporting.engine.classic.core.modules.output.table.html.ExternalStyle=true
org.pentaho.reporting.engine.classic.core.modules.output.table.html.ForceBufferedWriting=true

The style settings and the buffered writing settings control how stylesheets are produced and whether the generated HTML output will be held in a buffer until the report processing is finished.

Style information can either be inlined, stored in a external *.css file or can be contained in the HEAD element of the generated HTML file (when Inlinestyles is set to false and ExternalStyle is also set to false).

Buffering is forced when styles need to be inserted into the element of a report. The Buffering configuration setting should be set to true if the resulting report is read by a browser, as browsers request all resources from the server as soon as they find a reference to them in the HTML stream. If a browser requests a stylesheet that has not yet been fully generated, the report
will not be rendered correctly.

It is safe to disable buffering if the styles are inlined, as the browser will not need to fetch a external stylesheet in that case. Inlined styles greatly increase the size of the generated HTML and may slow down the rendering of the generated HTML file tremendously if the file is large.

Buffered content will appear slower to the user than non-buffered content, as browsers render partial HTML pages while data is still received from the server. Buffering delays that rendering until the report is fully processed on the server.

Reports that are only written to disk or downloaded from the server to be archived and then forgotten should have buffering disabled so that they can be streamed directly to the client.

Buffering consumes memory and can potentially lead to OutOfMemory conditions on smaller servers if too many users request large reports at the same time. However, in most reasonable configurations, buffering is harmless. So far no browser is able to properly render a HTML file that is larger than 20 MB. So if excessive buffering causes troubles then rest assured that in case buffering works your browser surely wont.