Monthly Archives: July 2011

How to use the Barcode element in Pentaho Reporting

Barcodes are a easy way to convey information in a machine readable way. With Pentaho Reporting, you can print a large set of barcodes in your reports. Barcodes in Pentaho Reporting are provided by the “simple-barcodes” element. And creating barcodes is as simple as the name of this element promises.

Pentaho Reporting 3.8 uses the Barbecue project as Barcode renderer and supports the following barcode types:

  • 2of5
  • 2of5-interleaved
  • Codabar
  • Code39(ext)
  • Code128(a,b,c)
  • EAN13
  • UCC-EAN128
  • ISBN
  • PostNet
  • PDF-417

With Pentaho Reporting 4.0, we also integrate the Barcode4J project (in addition to the existing Barbecue barcodes), and thus add support for the following barcode types:

  • EAN8
  • EAN128
  • UPCE
  • DataMatrix
  • RoyalMail
  • US-Postal-Service IntelligentMail

Think of a barcode element as some sort of fancy text- or number-field. A barcode element prints a single field value and generates a barcode vector image for it. To add a barcode to your report, all you need to do is drag the barcode element from the palette on the left-hand side of the window into the report canvas.

Each barcode has its range of allowed characters. The barcodes EAN, UPCA, ISBN, 2of5 and PostNet only accept numbers. The Code39, Code128 and Codabar barcodes accept a limited set of characters (the letters A-Z, 0-9 and a more or less extensive set of extra characters, all from the lower 128 characters of the ASCII-Charset). And some are free to accept nearly everything as input (PDF-417 and DataMatrix).

Usually, when you print barcodes, your data should already be in a format that can be handled by the barcode you are going to use. All barcoded data is generated by machines for other machines, and the person who generated the data usually (or hopefully) knows the accepted format already.

Like ordinary text-fields, barcode elements either receive a static text in the “value” attribute or a field-name in the “field” attribute.

If the barcode support printing a textual representation of the data, you can enable this via the “barcode-settings::show-text” attribute. This is also the place to fine tune the bar-width or height or to enable the calculation of checksums for the data given.

Styling of barcode elements

Styling the elements is easy. Like the “Sparkline Elements”, barcodes reuse many of the existing style properties.

  • text-color: The foreground colour of the barcode. This is the colour the bars are rendered with.
  • background-color: The background colour of the barcode. The barcode area will be filled with that colour prior to printing the barcode. This is handy to ensure that the barcode is printed on a white background for maximum contrast.
  • font-family, font-size, bold, italics: The text properties define the style and size of the text printed on the barcode. Not all barcodes use text.
  • scale: Whether the barcode is scaled according to the element’s size. Some barcodes do not have an intrinsic size and therefore are always scaled.
  • keep-aspect-ratio: If the barcode is scaled, this defines whether the ratio between the width and the height remains constant. For some barcodes, this is mandatory to ensure that the code can be scanned properly.

And that’s it. As said: There is no big secret to using barcodes. So happy coding!

Access Databases with Dynamic Table Names in Pentaho Reporting

Every now and then we get the question on how to create a report on database tables with a dynamic date stamped table name. Sounds weirdly cryptic? Well, instead of having a single table for all orders, these systems may have many small tables for all orders of a specific month.

I tried hard to come up with a potential reason for these things:

  • Maybe the database is so weak that it can’t handle more than a few thousand rows in a table. But both MySQL nor PostgreSQL are quite capable for some serious data storage (if used right). And even Oracle can do that – and if you have enough money to buy a Oracle license you have money to buy a server (-farm, Oracle ain’t cheap!) too.
  • Maybe they want to speed up data access to the table. After all, a SELECT * FROM TABLE takes a long time if you don’t use a WHERE clause or a proper index.
  • Maybe they have not heard of normalization.
  • Maybe they love to have separate tables so that they can remove the data of older months. A SQL DELETE command is black magic. Deleting the database file in the filesystem is quicker.

So if you are using such a system – drop me a note. I am dying to know what application creates such a data-structure.

Reporting on these database structures is not straight forward and usually considerably slower than working on a proper data model. And here is how you do it, in simple steps:

Prerequisite: The data model we use in this tutorial

Lets assume we have a database filled with sales data. For the sake of simplicity, lets assume there are several tables named in the schema of “SALES_yyyymm”. The ‘yyyy’ part of the table name represents the four digit year, the ‘mm’ represents the two-digit month.

The tables are made up of the following structure (including some sample data*):

CUSTOMERNUMBER DATE ORDERID ARTICLEID QUANTITY PRICE
0001 20091015 112 Fluffy Dogs 100 15
0001 20091015 112 Gumpy Rabbit 9 14
0001 20091015 112 Angry Birds 12 5
0002 20091024 113 Busy Bees 99 1

*) Note: Yes I know that putting strings as article ID is bad. But it safes me from outlining a second table.

Prerequisite 2: Reporting made dynamic with a date parameter

Creating a static report with a hardcoded table name is easy. Write your query and be done with it. The tricky part is to create a report that selects the right table based on the a given date. That date could be the current system date for reports returning the current month’s turnover or it could be a historic date.

If you want to make the report dynamic, you will need a parameter to feed the report’s target date into the reporting system and into the query. The table-name in the database contains a partial date value. To make the report return data for a given date, we must calculate the database table name when the report runs based on whatever date the user entered.

Standard SQL datasources do not allow you to parametrize structural elements of a query. This is a security measure to prevent an attack known as “SQL Injection“.

If you need to compute structural elements of a query, you will have to use a “JDBC (Custom)” datasource (Data -> Add Data-Source -> Advanced -> JDBC (Custom)). Whenever you use a data-source marked as (Custom), you opt out of the standard parametrization. At that point, it is your responsibility to properly validate all incoming user data and to ensure that no one can enter illegal values.

The Simple-Case: Reporting within the same table

The Custom JDBC datasource only contains connection information. The actual SQL query that is executed is computed at runtime. To define a query, put a formula or expression into the “query::name” attribute of your master- or sub-report object.

Mike Tarallo create a excellent video tutorial on how to use the Custom JDBC data sources. So instead of repeating everything here, watch the video and learn what the Custom JDBC datasource is and how to use it properly.

Simple Parametrization: Using a string parameter

So we need to declare a parameter to select the table the report should use. You want it dead simple (just for testing, of course!)? Use a plain text parameter that allows the user to enter the year and month to use.

Parameter:
 Name: Simple
 Value Type: String

and finally, to prevent SQL injections, some validation:

Post-Processing-Formula: =IF(LEN([Simple]) = 6; IF (ISNUMBER([Simple]); [Simple]; NA()); NA())

Complex Parametrization ideas

You can use a ordinary date parameter to make use of the date-picker. Note that this returns a Date object. You need to convert this date into a valid text. For that add a second hidden parameter with the following Post-Processing Formula:

=MESSAGE("{0,date,yyyyMM}";[Date])

Alternatively, assuming you use a proper database, you can extract the valid dates from your database. Just query the database metadata for the tables that are available. (I am using MySQL for the examples here. Consult your database manual on how to access the database schema.)

SELECT SUBSTRING(table_name, 5) AS "SalesDates" 
  FROM INFORMATION_SCHEMA.TABLES
 WHERE table_schema = ''
   AND table_name LIKE 'SALES_%'

You can use this information to feed a list parameter.

Create your query

I assume you have a parameter that contains a suitable string (6 characters, yyyyMM format). Lets call this parameter “DateSuffix”. On the master-report’s “query-name” attribute, create a formula with the following syntax:

="SELECT * FROM SALES_" & [DateSuffix]

At runtime, the report engine will evaluate the formula and construct the proper query string. That query string is then executed by the Custom JDBC Data-Source.

Tip: The report designer does not use queries from formulas during the design process. Read this article to learn how to get designtime column information into the Pentaho Report Designer.

Summary

To create a simple dynamic report:

  • Create a parameter that returns the suffix of your table
  • Add a Custom JDBC Datasource to your report
  • Add a formula to the query-name-attribute of your report

Creating reports that summarize a full year

Single reports for a single month – that is easy stuff. But sooner or later someone will come back to you and ask for sales data spanning the last year.

Lets skip the parameter defining details. Create a parameter that returns the year as string and call it “YearText”.

To your database, each of the monthly tables is unrelated to any other of the monthly tables. Databases are great on providing referential integrity on the data within the tables. But generally they care little about the names of the tables and particularly do not assume that table names have any higher meaning.

In SQL merging data from several tables happens via an UNION operator. Tables that are merged via UNION must have the same structure – the number and types of the columns must match.

A SQL Union takes the form of

 UNION 

Or for our example:

SELECT * FROM SALES_200910 
 UNION
SELECT * FROM SALES_200911

Each select statement in a UNION is considered separately. Any “ORDER BY” clause is applied only to the particular table. If you want to sort globally, you will have to wrap the UNION result into another SQL query via a SUB-SELECT. All proper databases support Sub-Selects, but again: For details, consult your database manual.

Here is an example on how to use a sub-select to sort globally

SELECT * FROM 
  (SELECT * FROM SALES_200910 
     UNION
   SELECT * FROM SALES_200911) 
ORDER BY ARTICLEID

Sub-selects are considerably slower than normal selects. Most databases will start to use temporary tables for the selection result and therefore will answer such queries a lot slower than normal queries. If your performance goes down a lot, consider creating a data warehouse.

A query spanning a whole year will be long and complex.

SELECT * FROM SALES_200901
     UNION
SELECT * FROM SALES_200902
     UNION
SELECT * FROM SALES_200903
     UNION
SELECT * FROM SALES_200904
     UNION
SELECT * FROM SALES_200905
     UNION
SELECT * FROM SALES_200906
     UNION
SELECT * FROM SALES_200907
     UNION
SELECT * FROM SALES_200908
     UNION
SELECT * FROM SALES_200909
     UNION
SELECT * FROM SALES_200910
     UNION
SELECT * FROM SALES_200911
     UNION
SELECT * FROM SALES_200912

The matching formula for making the query dynamic would look like:

="SELECT * FROM SALES_" & [YearText] & "01
     UNION
SELECT * FROM SALES_" & [YearText] & "02
     UNION
SELECT * FROM SALES_" & [YearText] & "03
     UNION 
..

Creating more complex reports on such a model is a pain. I would recommend to create a single SQL VIEW that contains all your sales table.

This normalizes your data model without breaking your existing programsViews are virtual database tables that make it easier to write SQL queries.

CREATE VIEW Sales AS 
  SELECT * FROM SALES_200901
    UNION
  SELECT * FROM SALES_200902
    UNION
  SELECT * FROM SALES_200903
    UNION
  ..

If your database supports materialized views, it will create a permanent temporary table that can considerably speed up your reports.

The insanely complex case: Queries spanning over a user defined period of time

So you implemented a yearly sales report. The world is bright. Really? Not long.

One unhappy morning, your boss will come to your office telling you he wants to see the sales data for the last four month. And last years first half of the year, oh, and a up to date report showing the last 90 days of sales (ex: 15th of January to 13th of April).

If you use a consolidated view as outlined above, you are safe. But for the sake of argument, assume your database administrator’s religion does not allow the creation of VIEWs**.

Creating SQL queries using a BeanShell-Script

Formulas are a great tool for relatively simple problems. But as soon as the problem is more complex or requires dynamic elements, formulas quickly become unmaintainable.

Attributes can be computed by any report expression that returns a suitable value for that attribute. For complex logic, I recommend to use one of the Scripting expressions that comes with Pentaho Reporting.

The Bean-Scripting-Framework (BSF) Expression allows you to use a programming language to compute the query string. The BSF expression comes with support for Java, JavaScript, Groovy and many more programming languages.

Here is an example that computes a query for the last 90 days:

import java.util.Calendar;
import java.util.Date;
import org.pentaho.reporting.engine.classic.core.DataRow;

    Date date = (Date) dataRow.get("Date");
    Calendar cal = Calendar.getInstance();
    cal.setTime(date);
    StringBuilder b = new StringBuilder();
    b.append("SELECT * FROM SALES_");

    int year = cal.get(Calendar.YEAR);
    int month = cal.get(Calendar.MONTH);
    b.append(String.format("%1$04d", year));
    b.append(String.format("%1$02d", month));
    b.append(" WHERE ORDERDATE  ${Date} ");
      }
    }
    return b.toString();

This results in a query string that looks like this:

SELECT * FROM SALES_201106 WHERE ORDERDATE  ${Date} 
 UNION 
SELECT * FROM SALES_201104 WHERE ORDERDATE > ${Date} 
 UNION 
SELECT * FROM SALES_201103 WHERE ORDERDATE > ${Date} 

And now: Happy reporting!

**) This happens more often than you would think.

Video: Design for Developers

More often than I like to see it, I receive sample reports (usually as part of a bug report) that makes me feel sorry for the poor souls receiving the resulting documents.

Reporting is as much about numbers as it is about conveying a message effectively. Yes, it is possible to just send out a flood of numbers in a 100+ pages document and let the receiver mine for the right information. But in most cases, such reports simply never get read.

Remember: Data is not information.

Data is just random facts thrown out into the world. Information is data put into a context that makes sense to the receiver. Don’t send out data, send information.

A simple way of doing this is to begin from the end. Don’t think about what data you can push out, think about what information the receiver of your report needs and what they are supposed to do with it. And then either cut out everything that distracts them from the core message. Or, if you are required to put all your raw data into the document for legal or organizational reasons, at least produce a short summary with information and put the data into an appendix.

Choosing the right data for a maximal informative report is only one aspect of making good reports. The important parts need to be properly emphasized. Your reader needs to be guided by layout and structure. Your report must be understandable – hopefully at the first glimpse, or it will be filed under “read later”. And ‘later’ never happens anyway.

Good design is not a magical thing and you don’t have to be an expert to make a report that looks decent. Most design guidelines are rather complex and not necessarily practical. This video here is.

Idan Gazit’s talk “Design for developers: Making your Frontend Suck Less” contains some very basic (and extremely helpful) design advice. Even though he talks about web-applications, the majority of his advice can be easily applied to the design of reports.

Idan Gazit’s talk “Design for developers: Making your Frontend Suck Less”

Enrich your report with sparklines

Sparklines are a great way to convey a lot of information in a simple and readable way. Sparklines can be used to plot changes of metrics over time. Edward Tufte uses temperatures of patients in a hospital and share prices as examples.

Bart Maertens wrote about sparklines when Pentaho Reporting 3.5 came out. At that time, he had to do a lot of SQL magic to get data into them. With the addition of the MULTIVALUEQUERY formula function in Pentaho Reporting 3.8, this process got a lot easier.

Pentaho Reporting comes with three Sparkline type elements. Sparkline elements are minimalistic, they do not have any axis labels or legends.

  1. Line Sparks: A small line chart that shows the change of values over time. Line charts are good to visualize trends over time. The last data point can be highlighted for better readability.
  2. Bar Sparks: A small bar chart that allows to compare absolute values over a time scale. Bar charts should be used when you need to concentrate on the ratio between values instead of the general trends.
  3. Pie Sparks: A small pie chart that visualizes a single value in relation to a total value.


Now how do you add sparklines to a report created with the Pentaho Report Designer?

Getting the data

Except for Line and Bar charts, you will need a array of numeric values to display the data. You can construct arrays from different values of the same data row with a OpenFormula function like this:

=NORMALIZEARRAY({[Value1] | [Value2] | [Value3]})

The NORMALIZEARRAY formula function converts a internal OpenFormula array into a valid Java-Array that can be used by other components.

Alternatively, you can get a array of values from a data-source by using the “MULTIVALUEQUERY” formula function.

With a SQL datasource using the sample data you can use a query like this to get the sales numbers for each year:

SELECT
     SUM("ORDERFACT"."TOTALPRICE") AS "Value",
     "ORDERFACT"."YEAR_ID",
     "ORDERFACT"."PRODUCTCODE"
FROM
     "ORDERFACT"
WHERE 
          "ORDERFACT"."PRODUCTCODE" = ${PRODUCTCODE}
GROUP BY
     "ORDERFACT"."PRODUCTCODE",
     "ORDERFACT"."YEAR_ID"

This creates a result-set similar to this:

Value                   YEAR_ID PRODUCTCODE
-------------------------------------------
57363.100000000006 2003 S18_1749
19656.739999999998 2003 S18_2248
33451.850000000006 2003 S18_4409
12768.21 2003 S24_3969
39983.85999999999 2003 S18_2325
52395.560000000005 2003 S18_2795
11024.69 2003 S24_1937
14770.83 2003 S24_2022
38895.38 2003 S18_1342
22257.399999999998 2003 S18_1367
72913.26999999999 2003 S10_1949
40103.54 2003 S10_4962
46714.39 2003 S12_1666

You can then reference this data from a formula with the MULTIVALUEQUERY formula function.

=MULTIVALUEQUERY(“SparklineQuery”, “Value”)

Pie sparks simply require a single value in the range of 0 (zero) to 1 (one). Zero represents an empty pie (0% of the total), 0.5 a pie filled to the half (50% of the total), and 1 represents a fully filled pie (100% of the total).

Formatting the sparklines

Sparklines get formatted via the styles properties.

Bar-Sparks:

  • last-color: The color of the last bar
  • high-color: The color of the bar with the largest value.
  • text-color: The color of all other bars

Line-Sparks:

  • last-color: The color of the marker dot at the end of the spark
  • text-color: The color of the line

Pie-Spark:

  • low-color: The color of the slice if the value is below the “low-slice” threshold.
  • medium-color: The color of the slice if the value is below the “medium-slice” threshold.
  • high-color: The color of the slice if the value is below the “high-slice” threshold.
  • text-color: The color of the pie that is not filled by the slice.

Charting: Categorical Charts and XY-Charts

Reporting without charting is like zombies without the inevitable hunt for fresh brains. You can do it, yes, but it is sure not fun.

These days, charting in Pentaho Reporting is done via the “chart-element” in the Pentaho Report Designer. Drag the chart field into the canvas, double click on it to open the chart-editor, and start configuring your chart.

So far it’s all theory, lets see how charting really works.

Architecture of Charting

Charting in the Pentaho Reporting engine requires three parts. It requires (1) a data-collector to extract the charting-data from the datasources, (2) a chart-expression to produce a chart from the collected data and (3) a report element to display the resulting chart object.

The chart element that you can drag into the report is a front-end to hold the data-collector and the chart-expression and – of course – is responsible for rendering the chart once it has been produced.

Each chart-type is produced by a separate chart-expression. The chart-editor dialog selects the right chart expression for you when you click on one of the chart-type buttons on the top of the window. The visual properties of the chart are configured on the left-hand side of the chart dialog.

Depending on the chart type, you will need a suitable chart-datacollector. Data-collectors are configured in the right-hand table of the dialog. The dialog only offers collectors that can produce valid datasets for the currently selected chart type.

Chart Types

The chart types the Pentaho Report Designer supports can be grouped into three different groups:

1. Categorical Charts

A categorical chart uses a nominal scale to group data. The charting system makes no assumptions on relationships between the data. The X-Axis of such charts display labels for each data point, and all datapoints are printed in the order they arrive at the data-collector.

You can use categorical charts to display nominal data. In statistics, the nominal scale is the lowest measurement level you can use. A nominal scale, as the name implies, is simply some placing of data into categories, without any order or structure.

A example of a nominal scale is the sales regions of a company. Regions have no real relationship or natural order between each other. One region is as good as an other (from a statistical point of view).

Pentaho Reporting supports the following categorical charts:

  • Bar Chart
  • Line Chart
  • Area Chart
  • Combined Bar and Line Chart
  • Waterfall Charts
  • Radar Charts

2. XY-Charts

A XY-Chart uses an interval scale to organize the data it displays. All data items are comparable and can be sorted and all data points have a defined distance between each others.

The most common example for a interval scale in the context of reporting is the date-axis for comparing events on a time scale. (sales over the last years).

Pentaho Reporting supports the following XY-Charts:

  • Bar Chart
  • Line Chart
  • Area Chart
  • Combined Bar and Line Chart
  • Combined Area and Line Chart
  • Scatter Plot Charts
  • Bubble Charts

3. Pie and Ring charts

Pie charts can be used to compare the overall ratio of several numeric values. Your values must be complete (ie add up to 100% of what you want to show) or your chart will be misleading.

Pie charts should only be used if you are comparing the size of a slide with the overall size of the pie. However, this type of charts is not suitable for comparing the sizes of two slices or comparing different pie charts.

The Multi-Pie-Chart chart type should not be used at all. It only exists because some users insisted on this sort of functionality. But this sort of chart is hard to read at best and usually just plainly misleading.

Creating a categorical bar chart

Now, in the hands-on section of this entry, lets create a bar chart that shows the sales of our various product lines over the last few years. This example uses the sample data that comes with the Pentaho Report Designer.

Create a new report in the Report Designer and follow me.

First, You need data

Create a new JDBC datasource (Data->Add Datasource->JDBC). Select the predefined connection “SampleData(Memory)”, create a new query called “chart-data” and add the following query text.

SELECT
     PRODUCTS.PRODUCTLINE,
     ORDERFACT.YEAR_ID,
     sum(ORDERFACT.TOTALPRICE) AS SALES,
     sum(ORDERFACT.QUANTITYORDERED) AS VOLUME
FROM
     PRODUCTS INNER JOIN ORDERFACT ON PRODUCTS.PRODUCTCODE = ORDERFACT.PRODUCTCODE
GROUP BY
     PRODUCTS.PRODUCTLINE,
     ORDERFACT.YEAR_ID
ORDER BY
     PRODUCTS.PRODUCTLINE ASC,
     ORDERFACT.YEAR_ID ASC

Make this query your currently active query: In the data-tab, expand the new JDBC data-source and right-click on the query node below and select “Select Query”.

Then you need a chart object

Drag the chart-object from from left hand palette into the report-header area. This ensures that the chart is printed once on the report. If you put a chart into the Group-header or -footer the chart is printed for each occurrence of that group. Putting a chart into the details band makes no sense. It would print the exact same chart as many times as there are rows in your data source.

And finally we need to configure the chart

Double click on the chart object to start the chart-editor. The “Bar Chart” should already be selected.

First, we define what data the chart will be using. On the right-hand side you find the chart-data collector. Enter the following values into the properties:

  • Category Column: YEAR_ID
  • Value Column: SALES
  • Series-by-Field: PRODUCTLINE

A chart could have several series for each column, which can be useful if you want to compare data-series with each other.

Next define the chart’s appearance. For now, we keep it simple and just change the Chart’s title and the y-axis title.

  • Chart Title: Product Line Sales Trend
  • Y-Axis Title: Sales

Confirm the changes in the chart editor by pressing “OK” on the bottom of the dialog and preview your report.

You can find several chart examples in the samples that ship with the Pentaho Report Designer. Explore them!

Pentaho Reporting 3.8.1-RC1 released

On Friday without much ado, Pentaho uploaded the Release Candidate 1 of Pentaho Reporting 3.8.1 to the Sourceforge servers. This bug-fix release contains only bug-fixes, among the more noteworthy we have PRD-3349 and PRD-3375 (Numbers rounded wrongly) and some changes to the parameter handling so that cascading prompts get more usable inside the report designer and the Pentaho BI-Server.

Full list of all cases closed in this release

Download: Download Pentaho Report Designer 3.8.1-RC1