PRD-3553 – or the low performance of large reports

The funny thing about most bugs is that, for the most part, they go completely unnoticed. No one ever hits them, or if they do, they think it's just normal weird behaviour. (Thanks, Crystal Reports & Co, for training your former (and our current) users!)

One of these bugs, which had been sitting in my bucket for the last week, was a rather nasty error condition on reports with a large number of rows. Reports like that are usually used either for data exports or for performance tests, as a human user barely reads the first sentences of an e-mail, let alone hundreds of pages of dull numbers.

What was the bug?

Reports with large numbers of rows were incredibly slow when run as Excel, HTML, RTF, Table-CSV or Table-XML exports. The slowdown got worse the more rows the report contained.

The original e-mail contained a test report that demonstrated the issue. (Note: providing me with a replication path tremendously increases your chances of getting a bug-fix fast. You help me, and I'll help you.) The tests they ran on their machine showed a clearly super-linear curve:

# of Rows Time in seconds
5000 16
10000 28
20000 90
50000 360

What caused the bug?

Short version: The iterative processing in the reporting engine was broken and resulted in an increasingly large layout tree. Iterating this tree to calculate the layout gets more and more expensive the larger the tree gets.
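
For illustration, assume that each of the n row events triggers a layout pass, and that after k rows the tree holds on the order of k nodes because nothing is ever pruned. The total work then grows quadratically:

    \[ \sum_{k=1}^{n} c\,k \;=\; c\,\frac{n(n+1)}{2} \;=\; O(n^2) \]

so doubling the row count roughly quadruples the runtime, which is consistent with the accelerating timings in the table above.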

Long version:
The Pentaho reporting engine uses an iterative layouting engine to process the report. During the layouting stage we build up a layouting-DOM tree containing the various boxes and content that make up a report. Our model is heavily inspired by the CSS specification and is optimized towards using as little memory as possible. Boxes get added to the layouter by the reporting (data processing) engine, and once the layout reaches a stable intermediate state, the content is printed and the processed boxes get removed from the layout-DOM. This results in a sliding zone of activity on the layout model and allows us to process huge reports with a minimal memory footprint. And we don't even have to swap to disk for this – all processing happens in memory.
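
To make the mechanism concrete, here is a minimal sketch of the sliding-zone idea. All class and method names are invented for this post; they are not the engine's actual API:

    // Illustrative sketch only -- invented names, not the real engine API.
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    class LayoutBox {
      final List<LayoutBox> children = new ArrayList<>();
      boolean finished; // true once this box can no longer move in the output

      void add(LayoutBox child) {
        children.add(child);
      }
    }

    class SlidingLayoutProcessor {
      interface ContentPrinter {
        void print(LayoutBox box);
      }

      private final LayoutBox root = new LayoutBox();

      /** The data-processing engine feeds boxes into the layout tree. */
      void addContent(LayoutBox box) {
        root.add(box);
      }

      /** On a stable intermediate state: print finished subtrees, then prune them. */
      void processStableState(ContentPrinter printer) {
        Iterator<LayoutBox> it = root.children.iterator();
        while (it.hasNext()) {
          LayoutBox child = it.next();
          if (child.finished) {
            printer.print(child); // flush to the Excel/HTML/RTF/CSV/XML output
            it.remove();          // prune, so the active tree stays small
          }
        }
      }
    }

The pruning in processStableState() is what creates the sliding zone: only the not-yet-finished tail of the document stays in memory.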

To make this model work, we have to track which layout nodes have been processed already and which nodes could still move around in the final document.

The whole process of tagging finished nodes worked nicely for paginated reports, but failed for flow- and stream-layouted reports. The system never tagged nodes as ready for print, so the layout tree grew bigger and bigger. In the end, this resulted in longer processing times (as iterating the tree took longer) and a huge memory footprint (as more and more nodes had to be held in memory).
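
In terms of the illustrative sketch above, the failure mode looks roughly like this (again, invented names, not the actual engine code):

    // Illustrative continuation of the sketch above.
    enum OutputMode { PAGINATED, FLOW, STREAM }

    class FinishedTagger {
      void markFinishedNodes(LayoutBox box, OutputMode mode) {
        for (LayoutBox child : box.children) {
          markFinishedNodes(child, mode);
        }
        // BUG: flow and stream outputs never reached this tagging step, so
        // processStableState() found nothing to prune and the layout tree --
        // and with it the cost of every later pass -- grew without bound.
        if (mode == OutputMode.PAGINATED && isStable(box)) {
          box.finished = true;
        }
        // The fix, conceptually: tag stable boxes for flow and stream, too.
      }

      private boolean isStable(LayoutBox box) {
        // Placeholder heuristic for this sketch; the real stability check
        // depends on the layout model.
        return box.children.stream().allMatch(c -> c.finished);
      }
    }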

To fix this problem, I had to repair the “ready-for-process” tagging [SVN].

Once the iterative process sprang to life, the memory footprint went down, but processing performance was not as good as I had expected. In fact, the iterative processing ran so often that its overhead cost more time than it saved. A quick and dirty throttling of the layouter's processing queue made performance jump up: we now only process every 50th iterative event, trading a slightly larger memory footprint for a huge increase in processing speed.
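
A hedged sketch of such a throttle, building on the invented classes above – the factor of 50 comes from the post, everything else is illustrative:

    // Illustrative throttle: run the expensive layout pass only on every
    // 50th event instead of on each one.
    class ThrottledLayoutPump {
      private static final int PROCESS_EVERY_N_EVENTS = 50;

      private final SlidingLayoutProcessor processor;
      private final SlidingLayoutProcessor.ContentPrinter printer;
      private int pendingEvents;

      ThrottledLayoutPump(SlidingLayoutProcessor processor,
                          SlidingLayoutProcessor.ContentPrinter printer) {
        this.processor = processor;
        this.printer = printer;
      }

      void onLayoutEvent() {
        pendingEvents++;
        if (pendingEvents >= PROCESS_EVERY_N_EVENTS) {
          processor.processStableState(printer); // the expensive full-tree walk
          pendingEvents = 0;
        }
        // Boxes accumulate between passes: a bit more memory in exchange
        // for far fewer full-tree iterations.
      }
    }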

How do we perform now?

The bug fix was an absolute success beyond my wildest dreams. I can say I have witnessed an increase of 800% in report processing speed. (OK, it is not difficult for this class of bugs: all you need is patience and a large enough report.)

# of Rows    Before the fix (s)    After the fix (s)    Change in %
5000         11                    11                   0%
10000        24                    22                   9%
20000        414                   47                   880%
50000        (crash)               146                  (a rather large number, trust me)

When can I get this fix?

The fix will be included in the 3.8.2-GA release, which should be built within the next few weeks. At the moment, everyone at Pentaho is still busy finalizing the BI-Server 4.0/PRD-3.8.1 release, so it will take a moment before we can fire up another release.

In the meantime, you can either grab the sources from the Subversion repository or grab the CI build. The 3.8.2 engine is a drop-in replacement for the 3.8.1 build, so you can patch your system by simply copying the final jar files over. I have not tested whether similar patching works with 3.8 or 3.7 installations.

About Thomas

After working as all-hands guy and lead developer on Pentaho Reporting for over a decade, I have learned a thing or two about report generation, layouting and general BI practices. I have witnessed the remarkable growth of Pentaho Reporting from a small niche product to an enterprise-class Business Intelligence product. This blog documents my own perspective on Pentaho Reporting's development process and our steps towards upcoming releases.

2 thoughts on “PRD-3553 – or the low performance of large reports”

  1. Nick

    Hi, we are using Pentaho Reporting in our project. What I am facing is that we are using some inline subreports (subreports in the details group), and the performance of the report is really slow. The query executes in a couple of seconds, but the report output takes around a couple of minutes.

    Moreover, we see a lot of OutOfMemory issues. We have a 32-bit system, and the memory allocated to Pentaho is maxed out at 2GB.

    We are using version 3.5.1.

    Our report structure contains an XACTION, used for parameter selection, which then forwards the requested parameters to the PRPT.

    Please suggest if there is anything that we can do.

    Thanks,
    Nick

  2. Thomas Morgner

    This version is ancient. I do not have the bandwidth to port fixes to years-old versions. Fixes usually only make it into the latest bug-fix release, unless a customer requests it for their production use via the official Pentaho support channels.

    At the moment, the most recent version is 3.8.2.
