The funny thing about most bugs is that, for the most part, they go completely unnoticed. No one ever hits them, or if they do, they think it's just normal weird behaviour. (Thanks, Crystal Reports & Co, for training your former (and our current) users!)
One of the bugs I had in my bucket for the last week was a rather nasty error condition on reports with a large number of rows. Reports like that are usually used either for data exports or for performance tests, as a human user barely reads the first sentences of an e-mail, let alone hundreds of pages of dull numbers.
What was the bug?
Reports with large numbers of rows were incredibly slow when run as Excel, HTML, RTF, Table-CSV or Table-XML exports. The slowdown got worse the more rows the report contained.
The original e-mail contained a test report that demonstrated the issue. (Note: Providing me with a replication path tremendously increases your chances of getting a bug-fix fast. You help me, and I’ll help you.) The tests they ran on their machine showed a clearly exponential curve:
| # of Rows | Time in seconds |
| --- | --- |
What caused the bug?
Short version: The iterative processing in the reporting engine was broken and resulted in an ever-growing layout tree. Iterating this tree to calculate the layout gets more and more expensive the larger the tree gets.
The Pentaho reporting engine uses an iterative layouting engine to process the report. During the layouting stage we build up a layouting-DOM tree containing the various boxes and content that make up a report. Our model is heavily inspired by the CSS specification and is optimized towards using as little memory as possible. Boxes get added to the layouter by the reporting (data processing) engine, and once the layout reaches a stable intermediate state, the content is printed and the processed boxes get removed from the layout-DOM. This results in a sliding zone of activity on the layout model and allows us to process huge reports with a minimal memory footprint. And we don't even have to swap to disk for this – all processing happens in memory.
To make this model work, we have to track which layout nodes have already been processed and which nodes could still move around in the final document.
The whole process of tagging finished nodes worked nicely for paginated reports, but failed for flow- and stream-layouted reports. The system never tagged nodes as ready for print, and so the layout tree grew bigger and bigger. In the end, this resulted in longer processing times (as iterating the tree took longer) and a huge memory footprint (as more nodes had to be held in memory).
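The sliding-zone idea described above can be sketched in a few lines. This is a minimal, hypothetical model (the class and method names are mine, not the engine's): boxes enter the layout model, and once a stable intermediate state is reached, the finished boxes are printed and pruned, so the in-memory tree stays small regardless of how many rows the report has. The bug was, in effect, that the pruning step never happened for flow and stream layouts.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the "sliding zone of activity": boxes are added
// by the data-processing engine, and finished boxes are printed and
// removed, keeping the live layout model small.
class SlidingLayoutModel {
    private final Deque<String> pendingBoxes = new ArrayDeque<>();
    private int printedCount = 0;

    void addBox(String box) {
        pendingBoxes.addLast(box);
    }

    // Tag the first 'stableCount' boxes as finished: print them and
    // prune them from the model. If this is never called (the bug),
    // the tree grows without bound.
    void commitFinished(int stableCount) {
        for (int i = 0; i < stableCount && !pendingBoxes.isEmpty(); i++) {
            pendingBoxes.removeFirst(); // "print" and prune the box
            printedCount++;
        }
    }

    int inMemory() { return pendingBoxes.size(); }
    int printed()  { return printedCount; }
}
```

With pruning in place, adding 100 boxes and committing 90 of them leaves only 10 in memory; without it, all 100 would stay resident, and every layout pass would have to walk all of them.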
To fix this problem, I had to fix the “ready-for-process” tagging [SVN].
Once the iterative process sprang to life, the memory footprint went down, but the processing performance was not as good as I had expected. In fact, the iterative processing worked so well that it caused more overhead than it saved. A quick-and-dirty throttling of the layouter's processing queue made performance jump up. We now only process every 50th iterative event, trading a bit more memory for a huge increase in processing speed.
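The throttling itself is simple to illustrate. The sketch below is my own illustration, not the engine's actual code: a counter skips most layout events and only triggers a full layout pass on every 50th one, so a report generating thousands of events pays for far fewer tree iterations.

```java
// Hypothetical sketch of throttling the layouter's processing queue:
// instead of recomputing the layout on every event, only every Nth
// event triggers a full pass over the layout tree.
class ThrottledLayouter {
    private static final int THROTTLE_INTERVAL = 50; // process every 50th event
    private int eventCount = 0;
    private int layoutPasses = 0;

    void onLayoutEvent() {
        eventCount++;
        if (eventCount % THROTTLE_INTERVAL == 0) {
            runLayoutPass();
        }
    }

    // Stand-in for the expensive walk over the layout-DOM.
    private void runLayoutPass() {
        layoutPasses++;
    }

    int passes() { return layoutPasses; }
}
```

For 5,000 incoming events this runs only 100 layout passes instead of 5,000, at the cost of holding up to 49 extra events' worth of boxes in memory between passes.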
How do we perform now?
The bug fix was an absolute success beyond my wildest dreams. I have witnessed an increase of 800% in report processing speed. (OK, it is not difficult for this class of bugs: all you need is patience and a large enough report.)
| # of Rows | Time in seconds (before the fix) | Time in seconds (after the fix) | Change in % |
| --- | --- | --- | --- |
| 50000 | (crash) | 146 | (a rather large number, trust me) |
When can I get this fix?
The fix will be included in the 3.8.2-GA release and should be built within the next few weeks. At the moment, everyone at Pentaho is still busy finalizing the BI-Server 4.0/PRD-3.8.1 release, so it will take a moment before we can fire up another release.
In the meantime, you can either grab the sources from the Subversion repository or grab the CI build. The 3.8.2 engine is a drop-in replacement for the 3.8.1 build, so you can patch your system by simply copying the final jar files over. I have not tested whether similar patching works with 3.8 or 3.7 installations.