Category Archives: Report Designer & Engine

Crosstab update – Pagebreaks and header visibility

prd-screen-capture-6It has been a while since I wrote something about the eternal project. So here’s a quick update.

I just checked in a few changes to the crosstab backend and the magical create-a-crosstab dialog. In addition to selecting the row- and column-dimensions (as usual), you now get a bunch of extra options for your crosstab.

The most interesting ones are the switch from a static width and height (80x20pt) to relative sizes. With that a crosstab now tries to fill the available space as good as possible, expanding and shrinking elements when needed.

Marius requested an option to show title headers for the measures. You can control whether you want such headers (they are there by default) or not. As bonus, you get control over the title headers of your column dimensions as well, in case you like it minimalistic.

Last but not least: When a crosstab is larger than a single page, then we now create proper pagebreaks and preprint the header-section on the next page.

For this release, that basically concludes the feature hunt. Until we actually wrap up and do a release build, its hardening and bug-fixing time. So give it a try, and if you tickle a bug out of it, I would be pleased if you could feed our JIRA beast with it.

PRD-2087: Widows, Orphans and we are all keeping together, aren’t we?

Thomas_kennington_orphans_1885One long-standing, never resourced, never fixed issue we had was the case of managing orphans and widows in reports. Well, with the cold wind of austerity blowing over Europe, we can’t forget the widows and orphans, can’t we?

What are Widows and Orphans?

In typography, an orphan is the single line left of a paragraph left on the previous page. A widow is a lonely line that did not fit on the previous page and now sits alone on the next page. With texts, these rules are somewhat easy to solve, as paragraphs are a flat list and not nested into each other.

In the field of reporting, we usually care less about lines of text, we care about the greater unit of sections. When you create a report, you don’t want a group-header being all alone at the bottom of the page, without at least one more details band to go with it. Likewise, a group-footer should not be the only thing on the last page for that group. The trouble starts when you consider these rules in a deeply hierarchical structure as we see in reports.

Like so many layouting concepts, orphans and widows are easy to explain, but usually a pain to resolve. Orphan and Widow rules are cumulative. When you have nested groups, then the orphan declarations of the outer group cannot be solved in isolation.

Lets take the simple example of a two-level report , where each group declares that it wants at least two sections as orphan area. Assuming the group-headers are filled, it means that group’s header and at least the next section must be kept together. For the outer group, that is the outer group-header and the inner group-header. For the inner group it is the group-header and the first details section.

The inner group’s header is covered by two orphan rules now. It is both part of the first group’s unbreakable section, as well as part of the second group’s section. When rules partially overlap each other both rules must be merged.

Last but not least, in the light of these rules, we now can redefine the ‘keep-together’ (or in PRD speech: Avoid-page-break-inside) as a infinitely large number of orphans in the break-restricted area.

How to use this feature

The Orphans, Widows and Keep-together properties can be defined on any section or band. By default, all root-level bands (details, group-header,footer etc) have a default value for ‘keep-together’ of ‘true’.

The Orphan and Widow style settings take a positive integer as value. Negative values are ignored.

A orphan or widow constraint controls how pagebreaks within that element are handled. A widow or orphan constraint only affects child nodes of the element that has the constraint defined. So if you want to keep a group-header together with the next few detail sections, you have to define the orphan-constraint on the group-element. Defining it on the group-header will not have the desired effect.

The reporting engine treats all root-level bands as elements that count as content in the keep-together calculations. All other elements are ignored for the purpose of the widow-orphan calculations. If you explicitly want an element to be used for these calculations, you can set the style-key ‘widow-orphan-opt-out’ to false on that element.

If a element that counts for the widow-orphan calculation contains other widow-orphan enabled elements, the parent element will be ignored for the widow-orphan calculations.

Elements with an canvas or row-layout form a leaf node for the widow-orphan calculation. Their child elements cannot take part in any of the parent’s widow- and orphan calculations. However, they can establish their own widow-orphan context. Therefore, all subreports, even inline-subreports, can declare widow-orphan rules.

The defaults built into the reporting engine ensure that each section on the report is treated as an element for the widow-orphan calculations, even across subreports.

Performance

Solving widow and orphan rules is a costly exercise. Our reporting engine allows user-calculations and user-defined formatting to react to page break events. This allows you, for instance, to reset row-banding at the beginning of the page, or to format pages differently for odd and even page numbers. And finally, it allows you to update the page-header and page-footer on a page break so that you can show data from the current page on the headers.

If an section is finished (for instance a group has been fully processed), we can safely evaluate widows and orphans for that group.

For ongoing content generation: When an orphan value greater than zero is declared on a section, the engine suspends the layout calculation until enough content has been generated to fulfill all orphan rules currently on the report. Likewise, for widow-calculations the report processing is suspended until more than the number of widow elements have been generated as content – and only those elements that are not marked as covered by the widow-rule will be considered for layouting.

Suspending the layout processing can have a severe negative impact on the report processing time. When the engine suspends the layout-calculation, it keeps the unfinished layout in memory until it reaches a point where the layout can be safely calculated again. In the worst case, this suspends the layouting until the report generation finishes.

Keeping the unfinished layout in memory does consume more memory than the normal streaming report processing. When the engine finally detects a page break that fulfills all orphan and widow rules that are active on the report, it has to roll-back to the state that generated the last visible element on the current page to inform any page-event listener about the page break in the right context. Every rollback is expensive and the reporting engine has to discard any content that had already been generated after that page break, as functions may have reconfigured the report state in preparation or response of the page break.

Orphan calculations are usually less expensive as Widow or Keep-together rules.

However, if you export large amounts of data, try to avoid widow- or orphan-rules on your report. Your report will finish up to 100% faster that way.

 

Finally: This major fix is available for both Pentaho Reporting 3.9 and Pentaho Reporting 4.0. The fix did not make it into this month’s roll-up release for the Pentaho Suite 4.8.1 release, but will be available for the general public in the next roll-up release in July. In the mean time, the fix is in the source code repositories, ready to be checked out and built locally ;)

 

LibCGG – how to render CCC charts without a server

Sunburst_chart_smallThe CGG plugin does a nice job, trouble is: It is vendor locked-in. Lets see whether we can change that.

Years ago the smart guys at Web-Details started to use Protovis to create modern charts for their Dashboards in a project called CCC (Community Chart Components). Inevitably, these charts need to be printed from time to time, so shortly after that they created the CGG-plugin for the Pentaho BI-Server to do that.

I like the Bi-Server. I also like printing. But I don’t like having to have a server running to get my charts as images into a report. So a few weeks ago, I took the CGG plugin and pruned everything that relates to BI-Server specific code. Refactored. Sliced it a bit. As a result, we now have LibCGG, readily available on GitHub.

What is LibCGG

LibCGG is an abstract layer to render CCC/Protovis charts. Its only focus is rendering. It takes the relevant javascript that makes up the charts and produces SVG or PNG output. LibCGG comes with some JUnit test-cases showing that simple samples provided by Web-Details actually run. None of these samples have been modified in any way, they just run.

What is it NOT

LibCGG does not deal with data-sources. It does provide an interface that can be implemented, but it does not come with data-sources itself.
LibCGG does not deal with HTTP requests or even the format in which charts may or may not be stored, defined or delivered to users. It is up to the actual implementation to deal with that. I have modified a version of CGG to use LibCGG as a prove of concept. After all, we dont want to loose functionality, don’t we?

What do we need to use LibCGG in the reporting engine?

At the moment, I have not written any glue code to connect LibCGG with the reporting engine. Ultimately, this will happen though – why else would I care to separate CGG from the server? The barriers are surprisingly low. Pentaho Reporting already handles SVG data, and thus LibCGG needs just a thin wrapper around an existing element for a first show-off.

After that, we will need a chart editor. Pedro assured me that CCC charts come with enough metadata to make it easy to get a basic one up and running quickly. Once we have that, I am sure our UI team will want to come in to make that experience less geeky.

And last but not least: We need to separate CCC from CDA (Community Data Access) a bit. At the moment, there is a silent assumption that CCC charts exclusively communicate with a CDA datasource. It should not be too hard to reroute those calls to directly go to the report’s declared data-sources instead.

And now: The one million dollar question: When .. will it be ready?

With a bit of week-end magic, how about May? April should (hopefully) see us get feature complete on the committed features for Pentaho Reporting 4.0, so there is plenty of time for some Ninja coding. I even have a designated place for it: The ‘extensions-charting’ module, which was reserved for Pentaho’s next-generation charting that never really made it. CCC – be welcome, and never mind the ghosts of past visualizations.

Introducing the Pentaho Reporting compatibility mode

Every time I worked on the heart of Pentaho Reporting, the layout system, in the past, I wondered how the heck am I going to ensure that I do not break our customers existing reports – again.

Before we started work on the crosstab mode, we put a tiny layer of safety onto the engine by creating a set of ‘golden sample’ reports. A golden sample is the pre-rendered output of a report. Each time we make a change, our automated tests generate the output again and compare it with the known good output we have stored.

Over the last four long weeks (where I expected only two to spend) I rewrote large parts of the layout system. Crosstabs are a more dynamic structure than banded list-reports. While banded reports only grow downwards to fill an endless stream of pages, crosstabs grow both horizontally and vertically. The crosstab expands to the right for each new value of the column dimensions it finds, and it expands vertically, when the engine prints more row-dimension values.

The newly introduced table-layout system that powers the crosstabbing requires stricter rules for the layout elements to arrange them in a sensible fashion. Ordinarily, we want the resulting layout to be minimal (use as little space as possible, within the constraints set by the designer), stable (produce the same layout every time) and performant (don’t make me wait).

The old layout rules, however, were historically grown. They evolved around bugs, misunderstandings and the desperate need to not break reports already created ages ago. Breaking reports is fun – if fun includes loosing customers or getting angry calls. I value my sleep, so no more breaking reports for me, if I can avoid it.

From now on, Pentaho Reporting contains a brand new compatibility layer. This layer emulates all the old and buggy behavior to get a report output that is as close to the original release as possible. Our main concern with the compatibility is not necessarily to emulate show-stopper bugs, but to avoid those subtle changes where your report elements start slightly shifting around. When that happens, you can end up with either more pages than before, overlapping elements (and thus lost data in Excel and HTML exports) or anything in between.

How does it work?

Since Pentaho Reporting 3.9.0-GA, each report file contains a version marker in the “meta.xml” file contained in each PRPT-file. When we parse a report, we also read that version number and store it as the default compatibility setting. The report-designer preserves this setting over subsequent load/save cycles, so editing an old report in PRD-4.0 does not automatically erase or replace that marker. 

We consider reports without a marker to be old reports that must have been created with PRD-3.8.3 or an even earlier version. Of course, the reporting engine treats any of the ancient xml-formats and the PRD-3.0 “.report” files as ancient and gives them the version number “3.8.3″.

When a report is executed, the report processor checks the compatibility marker. If it is an pre-4.0 marker, we enable our first compatibility mode. The mode changes how elements are produced and how styles are interpreted.

The most important part change is, that a defined ‘min-width’ or ‘min-height’ automatically serves as a definition for a ‘max-width’ and ‘max-height’. There are additional rules, for instance, we ignore layout settings on structural sections, like groups or the group-bodies.

The most important rule, however, is: If you have a legacy report, it cannot contain tables, and thus cannot contain any crosstabs. Tables require a proper interpretation of the layout rules. The old rules tend to contradict each other from time to time, which causes great distress to the table-calculations.A distressed table-layout calculation may commit suicide or may throw away your data, so we better do not allow that to happen.

So before you can start to use newer features in the reporting system, you have to

Migrate your reports

Report migration is the process of rewriting a report’s layout definition to match the new layout rules. During that rewrite, we try to keep the layout as closely as possible to the original. While we are at it, we remove some invalid properties (like layout styles on groups) and migrate the sizing to the updated width/height system (not using the min-max height hack).

You can initiate the migration by entering the migration dialog via “Extra->Migration”. The dialog will list what will happen to your report, and will prompt you to save your report before the migration starts. The migration cannot be undone with the “Edit->Undo” function, so this saved report is your security blanket for the migration.

If you are sure that your report will be fine without the rewrite, you can manually force a report to a different compatibility level via the “compatibility-level” attribute on the master-report object. Be aware that this voids your warranty – your report may run just fine, or it may blow up completely. All bets are open.

Once the migration is done, your report should work as before, but within the corset of the new, and stricter, layout rules.

And to be sure: Let me repeat it – you only need to migrate reports when you want to use new features on them. Your old and already published reports will continue to work just fine without any manual intervention.

Bonus Content: Min/Max and Preferred width and height

Until recently, the layout system was not able to handle the layout constraints for minimum, maximum and preferred sizes correctly. The safe and default option was to rely on the minimum sizes only. The system magically treated all minimum sizes as maximum sizes for most cases, unless the element had a dynamic-height flag set or had an explicit maximum width or height.

With PRD-4.0, the layout system uses better rules with less contradictions. Therefore it is safe now to rely on the preferred size for most cases.

In reports in PRD-4.0 compatibility, a minimum size defines an absolute minimum, and the element will not shink below that size. The maximum size defines the absolute maximum, if defined, then the element will never grow larger than that. The preferred size defines a flexible recommended size. In most cases this is the size your box will use.

But if your element has content that requires more space, it will get it (up to the limit imposed by the maximum size). Each element computes a ‘minimum chunk size’ – think of it as the largest word in text – and uses the maximum of the chunk-size and the defined preferred-width as effective size.

Try our new compatibility mode. See whether it preserves your reports, and if not, please, please, file me a bug-report!

Style Sheets in Pentaho Reporting 4.0 – I blame Marius

As a small deviation from the usual crosstab and layouter work, I cleaned out the style system of Pentaho Reporting.

It all started not long ago on a stormy night. Creepy rays of moonlight, dark clouds looming over the mountains, wolves howling. On this day, Marius was doing some experiments in getting cascading stylesheet styling into the reporting engine. (I suspect lightning rods and wild laughter was on the menu as well.)

His approach reminded me on the dreadful task I had to go through for the next release. Our style system, a ghoulish monster, has overstayed its welcome. Being ashamed of its nature, we did not expose it to anyone. So this rotten pile of code was sitting around, festering and stinking. I had to kill it. It was an act of mercy for all of us. 

With the style crud gone, the reporting engine now has a simplified style system. With the monster, all style definitions on elements always showed a resolved picture of the global style status. For example, each label’s style fully queried all parent elements to compute the effective font information whenever anyone queried it.

In the new world elements no longer maintain that state. Styles are now resolved as part of the report processing after the style and attribute expressions have been evaluated. It not only simplifies cloning (a lot!), it also tunnels style computation through a single code path.

And then there came polly .. ah, well, the release of Suite 4.8, and the customary fun-week, where the daily drill is suspended for exploring the possibilities of the code.

Mix parts of LibCSS – the zombie that wanted to be a CSS-styled better reporting system – with the single code-path of style resolving, and you have CSS3 selector on top of the existing reporting style-system.

Pentaho Reporting now has a fully fledged style system. Any master-report contains a style-definition, which is basically what you know as style-sheet from HTML. Each style-definition consists of style-rules, where each rule has CSS3-selectors attached.

Style-definitions can either be part of the report definition itself or they can be stored as external files so that they can be shared between reports. The external files have the extension “.prptstyle” and are xml-files containing standard element style definitions.

The Pentaho Report Designer allows you to edit both internal and external style definitions. Use ‘Window’->’Style Definition Editor’ to edit standalone files, or use ‘Format’->’Edit Style Definition…’ to edit the internal style definition.

Both editor allow you to load and save ‘prptstyle’ files. Loading a style-file in the internal editor has the effect of importing that external style-definition into the local report, and saving exports it.

So don’t wait, download your own multi-functional report-designer now!

If you download within the next 24 hours, you not only get a style-definition editor, you also get this:

The ‘master-report’ object now has a new attribute called ‘style-sheet-reference’. Set this to a file-name or URL and Pentaho Reporting will load an external stylesheet when the report runs. Put a ‘prptstyle’ file on a public server and all your reports can share the same style definition.

All elements now also have two new attributes called ‘style-class’ and ‘id’ (not to be mixed with ‘xml-id’ for the HTML export) which allow you to match style-rules to elements. Like HTML’s ‘class’ attribute, class is a whitespace separated list of style-class-names that should apply to the element.

And don’t forget: The green-plus means that all of these attributes can be computed via a style-expression.

Go grab your PRD preview now!

A quick introduction to crosstabs in Pentaho Reporting 4.0

Crosstabs have been on the horizon for several years now. They lived a happy, undisturbed life along with the unicorns and gnomes guarding the pot of gold at the end of the rainbow.

With an endless recession and central banks selling off their gold reserves, the unicorn has been sold to a meat factory and the gnomes now assemble luxury cell-phones in a chinese factory.

So the day had to come that crosstabs have to work for a living. That day is now.

The added crosstab capabilities are the head feature of Pentaho Reporting 4.0 (now to be released in April 2013). To make crosstabs possible, the reporting engine had to move away from the strict banded approach of top-down layouting. The engine now has a true table layout that ensures that even misbehaving crosstab definitions will look and behave good.

Some sort of crosstab abilities were in experimental mode for very much of the last 4 years. The good news: The current implementation fixes all the problems we had with that experimental stuff. The bad news: Your old crosstab definitions will most-likely not work in 4.0 – you will have to recreate them properly with the new version.

To spin off crosstabs quickly, the Pentaho Report Designer ships with a quick-and-dirty crosstab creation dialog. This dialog pops up whenever you add a crosstab group.

Until we update the report designer with a proper user interface, you will have to edit all crosstab properties and elements in the structure tree.

Be aware: The code is pretty much happy-path – stray too far away from it and the big bad wolf will eat you, your children and your data. Wolves are a protected species, so bear your share and help to feed them properly. Before you are fully digested – file bug reports and feature requests in our JIRA system.

Enough prose: Let me just list some of the notable current behaviours and general ramblings.

  • normalization works now, but it is not all-knowing. It will only normalize on existing data items. If you whole dataset has Q1, Q2 and Q4 but fails to mention Q3 at least once, you will not see a Q3 anywhere.
  • The engine wants its data raw and bloody. Denormalize your data! If you use OLAP datasources, use Denormalized version of these datasources.
  • Sort your data to match your group structure.
  • column-headers currently do not repeat on the subsequent pages.
  • Setting your cell-contents (or any other row-layout child) to a percentage width will yield a zero-width in the final layout. (PRD-3970)
  • the report wizard really does not know how to handle anything crosstab related. (PRD-3860)
  • you can add charts into a crosstab cell.
  • if there is more than one row of data for a single crosstab cell, you have the choice to print either the first item, the last item or all items in a large list.

Features that will come, but are not implemented yet:

  • butterfly headers: Move row-headers into the middle of the data. (PRD-4005)
  • Subreports on all crosstab cells (PRD-4006)
  • style- and attribute-expressions for all crosstab elements (PRD-4007)
  • The no-data-band on the crosstab-group has no effect. (PRD-4008)
  • Summary-rows and columns will be positioned either at the start or the end of the respective group. (PRD-4009)
  • Details header will be printed. (API is defined, but not used yet). Printing of them will be optional. (PRD-3949)
  • Header cells can either be spanned cells, or can repeat its content for each child. (PRD-3386)
  • There will be a switch to make row- and column-headers disappear. You can then use the crosstab purely as a layouting utility. (PRD-4010)
  • sorting of data. This will be optional and will blow up if you expect us to do big-data sorting. Get a real database with SQL support ;)

Now go and download a CI build of PRD-trunk, will you!?

Printing Watermarks with Pentaho Report Designer

A watermark is a image or other content that is printed in the background of your report. You can use watermarks to indicate a status of your report (Confidential, Draft or For Internal Use Only). Or you can use watermarks to print a form template into the background of your report.

A well defined watermark can transform a report from a uninspired bunch of numbers into a fully branded document that reinforces your companies image in all receivers of the report.

So how do you define a Watermark in Pentaho Reporting?

Activating the watermark

When you create a new report, the watermark section is hidden. Before you can create new elements on this section, you will have to show it in the Pentaho Report Designer.

  1. Switch to the Structure Tree
  2. Locate the Watermark node under the master report node.
  3. Select the Watermark node and switch the properties table to the “Attribute” view
  4. Set the “hide-on-canvas” attribute to “false”

At the bottom of your design area in the report designer window, you will now see the watermark section below the page-footer section.

Properties of a watermark

The watermark section is printed behind all other content. Watermarks are special content that is only useful for printing and the PDF export. Plain-Text exports and all table-exports do not allow overlapping content. A watermark would prevent the printing of other content on that page. Therefore watermarks are disabled for these export types. Even if your watermark section contains content, the Pentaho Reporting Engine would not process any of it.

A watermark always spans the complete width and height of a page and content printed on the watermark does not affect any other content on the page.

Defining content for watermark sections

Watermark section is a ordinary band. You can drag and drop fields, labels or images into the watermark just as you would do for any other of the bands. The watermark is printed when a new page is started and fields print the same values as fields defined in the page header. Like all page-header or -footer bands, watermarks cannot contain any subreports.

The Pentaho Report Designer ships with the “Production Reports/Invoice” sample report that makes use of a watermark to show a background image to reinforce the branding of the “Steelwheels” toy company and prints a stamp text on the report.

Align Text in Number Fields and other Formatting Basics

Today lets talk about you can format report elements and what options our Pentaho Report Designer offers you.

Anatomy of Elements

Elements can be grouped into two classes. Text elements transform data into a textual representation. Labels, String-Fields, Message-Fields and Number- and Date-Fields are examples of these fields. Graphical elements produce a image to display the data. Charts, Sparklines or Image-Fields are examples for this class of elements.

A report element in Pentaho Reporting consists of style information and attributes. Most styles and attributes can be provided either statically or can be calculated by a function. The static properties are called “design time properties”. The calculated properties are called “runtime properties”.

Attributes control the element’s behaviour and how data is processed for displaying in the element. Examples for this are format-strings, rich-text processing and so on. Style properties control the visual appearance of the data printed.

Basic Formatting

With the Pentaho Report Designer you will find many formatting options that you can find in other text processor or graphical programs.

All elements share some common properties:

  • visible: Defines whether a element is shown on the final print out. Visibility is mostly used as runtime property instead of being specified at design time.
  • min-width, min-height: The width and height of an element define how much space a element takes on the paper. It also controls how many lines of text can be printed and how many characters fit onto each line. The size is either given in percentages (of the width and/or height of the element) or in points (1/72th of inches).
  • x, y: The position of the element if placed into a canvas element. The position is either given in percentages (of the width and/or height of the element) or in points (1/72th of inches).
  • dynamic-height: Dynamic height is a flag you can set to let the element expand its height to match the content you attempt to print.
  • various border properties: Allows you to draw a border around the element. You can define the border of the currently selected element very easily by using the menu option “Format->Border..”
  • padding: Padding allows you to insert some space between the edge of the element (where the borders sit) and the content printed in the element. The padding size is given in points (1/72th of inches).
  • text-color, background-color: Defines the element’s foreground and background colours.

Text elements

For text elements, the most common style you are going to change is probably the font properties.

  • font-name: Defines the font name. The font must be available, or a default font is used instead. This is especially important if you intend to publish the report to a server, which may not have the same fonts as you installed.
  • font-size: Padding allows you to insert some space between the edge of the element (where the borders sit) and the content printed in the element.
  • bold, italics, underline, strike-through: Various flags controlling the appearance of the text
  • embedded: This flag is only useful for PDF exports. It controls whether the font is embedded into the PDF document. Only TTF-fonts can be embedded and not all fonts allow you to embed them.
  • h-align, v-align: Positions the text within the element.

Graphical Elements

Graphical elements have different properties based on what sort of element you are dealing with.

  • scale, aspect-ratio: Defines whether the shape will scale up or down to fit the element’s bounding box. If “keep-aspect-ratio”
  • fill-element: Defines whether the shape will be filled. This has no effect on lines. Elements will be filled with the colour defined in fill-color
  • draw-outline: Defines whether the shape outline will be drawn. The outline will be drawn with the colour defined in text-color
  • stroke
  • : Defines the stroke that is used to draw the outline of the shape.

Attributes

Most of the attributes for elements are advanced properties and not required to actually get a basic report up and running. So I will concentrate on the ones that really matter:

  • value: The static value that should be printed. If this is defined,
  • field: The name of the field from the reporting engine reads the value. If a value is given in the “value” attribute, that static value will be used and the field will be ignored.
  • if-null: A static value that is printed if the field would evaluate to otherwise.
  • data-format: A flag that defines whether the element will update its format string from the meta-data given by the data source.
  • style-format: A flag that defines whether the element will update its style properties from the meta-data given by the data source.
  • show-changes: If this flag is set to true, the element will only print if the value has changed from the previous value. The element will always print if it is the first element in the group or on the page.

Message-, Date- and Number-Fields

  • format: Defines the format string. To make number fields align correctly, ensure that your format specifies a fixed set of decimals in the format string and that you right-align the field in the style options.

Armed with this knowledge, formatting elements in a report should no longer be a daunting task. “Be fruitful and multiply your reports, and fill the sheets and subdue it; and have dominion over your data and over the numbers of the accounting and over every piece of information that moves in your company.”

Pentaho Report Designer – Essential Readings to get started

When you download the Pentaho software for the very first time, the bunch of cryptic applications can be daunting. Reading about all the new and complex concepts of OLAP-cubes, data-warehouses, ETL processes to fill the data-warehouses or metadata layers is daunting at best.

If all you want to do is start with a report to see your own data printed, then at first, you can ignore the cryptic stuff and stick to the bread-and-butter parts: Using the Pentaho Report Designer to create your first report. Lets find out how.

First, you need to make one critical decision: Is what you want a classic report that is primarily meant to be printed (or used as if being printed)? Or do you want to explore your data to discover new knowledge.

Reading: Learn about the two kinds of reports.

If you are seeking an analytic tool for exploring your data, then you have no other option than to setup a OLAP-Cube to use the Pentaho Analyzer. These days, for simple cases you can get quick results with Pentaho’s Agile BI Initiative, so check them out.

From now on, I assume you are more interested in conventional reports with the sturdiness and durability of real paper. For over 500 years, since the invention of the double-entry bookkeeping, business reporting played a crucial rule in the business world. Reports are made to last.

To get started with the Pentaho Report Designer, you do not need any special setup or a data-warehouse as Pentaho Reporting can work with most relational databases after installing the right JDBC driver. You should get these drivers from your database vendor.

The Pentaho Report Designer already comes with drivers for MySQL, Hypersonic and H2 database systems. You can install your own drivers by copying the driver jar into the “lib/jdbc” directory inside the report designer.

Reading: Follow the walk-through to learn how to create your first report

This tutorial uses the sample database that we provide inside the report designer. This step-by-step guide will show you how to setup a database connection, how to get data into your report and how to layout elements to make a visually appealing result. When you use the SQL datasources you need a basic understanding of SQL. If you are not familiar with SQL, you can learn the basics of SQL in this rather good tutorial. Your database manual should also contain a introduction to SQL.

Work through the tutorial and you are ready to make your first steps in the world of Pentaho’s Business Intelligence tools.

There are some articles in this blog that help you further:

MySQL with Pentaho Report Designer for those who dont read manuals

Several people in the forums seem to have severe problems setting up MySQL or understanding how MySQL works. They also seem to have severe problems reading the database manuals pointed out in the forum, insisting the problem lays within the Report Designer. So I am going to write up some simple explanations on how to set up MySQL, how to configure the database users for access over the network and how to work around all the little and large weird quirks of that database. And then hopefully I just can point these users to here and never have to deal with it again.*


I do assume that you have MySQL installed and up and running. If not, then go and get PostgreSQL, a database I personally trust.

Connecting to a MySQL database – the basics

All the Pentaho tools and thus the Pentaho Report Designer are Java Applications and therefore use JDBC to access the databases. For other databases, the general rule is: Use the driver that ships with your database and keep the version numbers in sync.

To work with a MySQL database, you will need a suitable MySQL driver. In my experience MySQL drivers have not been the most stable nor most reliable drivers around. At Pentaho, we recommend to use the MySQL JDBC Driver version 5.1.10 as default driver, as this seems to be the most stable driver. Other versions may produce invalid results, and if you want to use any other version, make sure you test it carefully.

If you use the Pentaho Report Designer to create a connection, you will be asked for several properties. To connect to the database directly, select the connection mode “native”. Then enter your hostname or IP-address, port (default port number for MySQL is 3306) and username and password.

For generic JDBC database connections, you need the JDBC-URL, your username and password and optionally the connection properties:

Setting Value

Driver

The driver class to load. The name of the class that implements java.sql.Driver in MySQL Connector/J is com.mysql.jdbc.Driver.
JDBC-URL

The JDBC URL for the MySQL JDBC Driver uses the following format. Items in square brackets ([, ]) are optional.

jdbc:mysql://[host:port]/[database][?propertyName1][=propertyValue1][&propertyName2][=propertyValue2]...

If the host name is not specified, it defaults to 127.0.0.1. If the port is not specified, it defaults to 3306, the default port number for MySQL servers.


Username

Your user name as given to you by your database administrator. See below how MySQL translates that information and matches this against the internal user database.

Password

Your password for the connection.

To get the same connection as shown in the dialog screenshot above, you would have to specify the JDBC-URL:

jdbc:mysql://localhost:3306/sampledata

with username “pentaho_user” and “password” as password.

In most cases your database will be running on a different host than the computer you are editing reports or where the BI-Server is installed. When the database is running on a different computer, you will be able to distribute the work-load of answering requests and generating reports a lot better. If both the reporting server and the database share resources like the CPU, memory or disks, there is always the danger that both systems compete for resources and slow each other down unnecessarily.

User management and access control in MySQL

When you connect to a MySQL server, your identity is determined by the host from which you connect and the user name you specify. A MySQL server side user name is the combination of the user name you specified on the connection (“pentaho_user” in the example above) and the host from where you try to connect to the database server.

MySQL users are defined by their username and hostname and generally written in the format ‘username’@'hostname’.

In the example above we connect to the special host address ‘localhost’ (or by its IP address: 127.0.0.1). Only applications running on the same computer can make connections to this interface. By default a server will listening to these connections will see them as originating from “localhost” itself.

So in the above example, the full identity would be ‘pentaho_user’@'localhost’.

MySQL uses a two-stage authentication system. In the first stage it checks, whether you are allowed to connect to the database and validates your username and password. In the second stage MySQL checks whether you have the access rights to see the contents of the tables stored in the database or whether you are allowed to update or delete data.

When we try to establish a connection to the MySQL server, the server will check its local user database for an entry matching the username and password. If there is no user defined for the given hostname, you will see the error message

Host [...] is not allowed to connect to this MySQL server

The error message

Access denied for user 'pentaho_user'@'localhost' (using password: YES)

indicates that either the password is invalid or that this user has not been defined for connections from this host.

If you see the error message

ERROR 1044 (42000): Access denied for user 'pentaho_user'@'localhost' to database 'sampledata'

then you need to grant additional permissions to the user to allow access to the database.

Listing the defined users

MySQL stores all administrative information in a special system database called ‘mysql’. You can connect to this database like any other defined database. To see which users are defined, invoke the following SQL statement via the MySQL console:

SELECT User, Host FROM mysql.user;

MySQL will check your identity (username and host) against the user table. Both the user and the host must exist. If the table contains a host name called “%” this particular entry matches all hosts. If the table contains a empty user name, it matches any user name.

To allow the user ‘pentaho_user’ to connect from any localhost, you would need to have a matching entry for it in the table.

+------------------+-----------+
| User             | Host      |
+------------------+-----------+
| pentaho_user     | localhost | 
+------------------+-----------+

To allow connections for ‘pentaho_user’ from any host, a entry with ‘%’ as hostname must exist.

+------------------+-----------+
| User             | Host      |
+------------------+-----------+
| pentaho_user     | %         | 
+------------------+-----------+

Big fat warning: Even though both entries have the same user name (pentaho_user), they define two entirely different and unrelated identities for the database. It is totally possible and legal for ‘pentaho_user’ from localhost to have access to different databases than the same user for the other hosts. For MySQL these are two different entities.

Connections from Pentaho Report Designer via Connections from PhpMyAdmin or other Server Side Software

The MySQL identity system of treating users from different hosts as completely different entities can cause a lot of headaches, especially if you are not familiar with the underlying workings of the internet.

If your MySQL server is set up to only allow connections from specific hosts, you can run into problems where you can access the database from your web-server but not from within the report designer or vice versa.

Lets look at the most common usage scenarios:

Pentaho Reporting, the Web-Server and MySQL run on the same machine

This is a typical development scenario, where the developer installed all software on his own box for convenience.

+---------------+
| PRD-Host      |
+---------------+
| Web-Server    |
+---------------+
| My-SQL Server |
+---------------+

MySQL must be configured to allow access from ‘localhost’ and all connections in the report must use ‘localhost’ as the hostname for the MySQL connections.

A identity for ‘pentaho_user’@'localhost’ must exist.

+------------------+-----------+
| User             | Host      |
+------------------+-----------+
| pentaho_user     | localhost | 
+------------------+-----------+

Pentaho Reporting, the Web-Server and MySQL all use different machines

This is the typical production scenario. Both the database server and the web-server run on different machines for maximum performance. Both machines are usually installed in a server room or data centre.

+----------+          +------------+         +---------------+
| PRD-Host |  --+-->  | Web-Server |  ---->  | My-SQL Server |
+----------+    |     +------------+         +---------------+
                |                                 /|\ 
                +----------------------------------+

Both the developer sitting on the Pentaho Report Designer and the Web-Server need access to the database. While the developer creates the reports he needs access to the database (or a test database with a similar structure than the production database). When the report runs, the web-server must connect to the database as well to execute the reports.

Both the developer PC and the web-server must have entries for the user ‘pentaho_user’ in the user table.

+------------------+-----------------------------+
| User             | Host                        |
+------------------+-----------------------------+
| pentaho_user     | IP or hostname of webserver | 
+------------------+-----------------------------+
| pentaho_user     | IP or hostname of PRD-host  | 
+------------------+-----------------------------+

Alternatively, you could allow access from all hosts. Note that if your database server is not shielded by a firewall, then you may easily open your database to the world. Any self-respecting database administrator may try to strangulate you for suggesting such an configuration.

+------------------+-----------+
| User             | Host      |
+------------------+-----------+
| pentaho_user     | %         | 
+------------------+-----------+

The Web-Server and MySQL share the same machine

You find this set up occasionally in smaller hosting companies or from people who own just a single server. The database and web-server share the same machine, which is hosted in the data centre or server room. The developer runs the report designer on his own desktop machine.

+----------+          +---------------+ 
| PRD-Host |  --+-->  | Web-Server    |
+----------+    |     +---------------+
                +-->  | My-SQL server |
                      +---------------+

The web-server can access the MySQL server via ‘localhost’, while the developer needs to access the MySQL server from his own host outside of the server room.

+------------------+-----------------------------+
| User             | Host                        |
+------------------+-----------------------------+
| pentaho_user     | localhost                   | 
+------------------+-----------------------------+
| pentaho_user     | IP or hostname of PRD-host  | 
+------------------+-----------------------------+

So if you ever wondered why you can access your MySQL server from PhpMyAdmin or similar tools, but not from the Pentaho Report Designer, you should now be able to see the answer and to fix the problem.

You can find more information on this topic in the chapter about MySQL user management in the MySQL manual.

Fine grained access permissions

Even if you are able to connect to the database, you may not be able to query the database. The second stage of MySQL’s security system checks your identity against a set of defined permissions. These permissions control whether you are able to query or modify the data in the database or whether you are allowed to change the table structures or perform other administrative tasks.

Pentaho Reporting only queries the database. The Pentaho reporting tool does not modify data and does not perform any administrative tasks. All MySQL users need to have the “SELECT” permission on all objects you want to access from the reports.

To grant a specific user the permission to select data, use the following statement from the MySQL console.

GRANT SELECT ON database.* TO 'pentaho_user'@'localhost';

MySQL connection properties you should know

To make the JDBC driver work properly in MySQL, you will have to specify some additional connection properties. The driver’s default settings can cause severe problems for the reporting system. These defaults have been defined for historic releases of MySQL and are harmful these days.

Property Definition Comment
useCompression Use zlib compression when communicating with the server (true/false)? Defaults to ‘false’. Useful if you access a remote server through a low bandwidth connection. Disable for production systems.
passwordCharacterEncoding What character encoding is used for passwords? Leaving this set to the default value (null), uses the platform character set, which works for ISO8859_1 (i.e. “latin1″) passwords. For passwords in other character encodings, the encoding will have to be specified with this property, as it’s not possible for the driver to auto-detect this. If you are not using an all ASCII password and work outside the area of Western European languages, knowing about this is a life-saver.
characterEncoding If ‘useUnicode’ is set to true (which is the default), what Java character encoding should the driver use when dealing with strings? (defaults is to ‘autodetect’). If the encoding cannot be determined, then an exception will be raised. Sometimes MySQL server do detect the character set correctly and therefore return garbage when querying tables with non-ascii data. In this case this property can be used to override the autodetection of character sets.
functionsNeverReturnBlobs Should the driver always treat data from functions returning BLOBs as Strings – specifically to work around dubious metadata returned by the server for GROUP BY clauses? The MySQL JDBC-Driver returns string results of functions in SQL statements as byte-arrays. If you use any functions in SELECT statements, set this property to “true” or you will see garbage in the results.
emptyStringsConvertToZero Should the driver allow conversions from empty string fields to numeric values of ’0′? MySQL treats empty strings or values as zero for numeric columns by default. If you need to see the values as they are stored in the database without helpful corrections from the JDBC driver, then set this flag to false.
zeroDateTimeBehavior What should happen when the driver encounters DATETIME values that are composed entirely of zeros (used by MySQL to represent invalid dates)? Valid values are “exception”, “round” and “convertToNull”. Use this if your select statement fails with an exception and you have a zero-date in it. Set it to ‘convertToNull’ to come to a more reasonable result.
serverTimezone Override detection/mapping of timezone. Used when timezone from server doesn’t map to Java timezone Use this if your dates returned by the database seem to be offset by some hours. Also see ‘useGmtMillisForDatetimes’, ‘useJDBCCompliantTimezoneShift’, ‘useLegacyDatetimeCode’ and ‘useTimezone’ for more time zone related functions.
holdResultsOpenOverStatementClose Should the driver leave the result sets open on Statement.close() (enabling violates JDBC specification) This can be a very helpful property for reports with subreports. It allows the Pentaho Reporting engine to process several server-side resultsets in parallel. Without this setting, the reporting engine will fully buffer result-sets in memory instead, which increases the memory consumption of the reporting system.

Further readings

If you have made it this far, then let me tell you one big hint: MySQL comes with a nice administration tool called My SQL Workbench which allows you to administrate the server without having to go through the command line.

* Yeah. At least I can fake hope, even when I don’t feel it.