Ever felt that getting the Pentaho BI-server to spit out CSV or Excel files should be faster? As unbelievable as it sounds, with case PRD-4921 closed, we have up to 5 (in words: five!) times faster exports now.
Its one of those occasions where talking about customer problems creates a wacky idea that falls on fruitful ground.
Many customers use the Interactive Reporting tools (Pentaho Interactive Reporting, Saiku Reporting or the ugly Adhoc Reporting (WAQR)) to get data out of the data-warehouse into Excel or CSV files. Such reports are usually rather simple list reports, with no fancy charting or complex layout structures. However, the reporting engine does not know that, it’s pessimistic nature always assumes the worst.
The Pentaho Reporting engine allows an insane degree of freedom, and via custom report functions, it allows to reconfigure the report on the fly while the report is running. But with that freedom, we no longer can make any assumptions about how a report will look like in the next row. Thus the engine, and the layout subsystem, assume nothing and (apart from a bit of caching of reusable bits) recalculate everything from scratch.
Which takes time.
With PRD-4921, I added a fast-mode to some of the export types (Stream CSV, Stream HTML and XLS/XLSX). The new exporters check whether the report uses only ‘safe’ features, and if so, switches to a template based output instead of using the full layouting.
A report is safe if it does not contain any of the following items:
- inline subreports. They are evil, as they can appear anywhere and can be of any complexity.
- crosstabs. They are a complex layout and can’t be easily condensed into templated sections.
- functions that listen for page-events, other than the standard page-function and page-of-pages functions. During fast-mode, we don’t generate page events and thus these wont output correct values. I am willing to ignore page functions, as data exports are less concerned about page numbers.
- any layout-processor-function other than the Row-Banding function. They exist to rewrite the report, which stops us from making assumptions about the report’s structure.
If a report is not safe, the engine falls back to the normal, slow mode. You now just have to wait a bit longer to get your data, but you wont get sudden service interruptions.
For fast reports, the engine produce a template of each root-level band. If the style of a band changes over time (as a result of having Style-expressions), we produce a template for each unique combinations of styles the reporting engine encounters.
Once the engine has a valid template, it can skip all layouting on all subsequent rows of data and can just fill in the data into the template’s place holders. The resulting output is exactly the same as the slow output – minus the waiting time.
So how does this system perform? Here is what my system produces using a 65k rows report (to stay within the limits imposed by Excel97) with 35 columns of data exported. The report has no groups, it is just one big fat stream of data. All times are given in seconds.
|Export||5.1 with fix||5.0 with fix||5.0 GA|
As you can see from the data, the fix gave a 4 to 5 times speed up for HTML and CSV exports. The Excel exports were extra slow in 5.0 (and 4.x), and a few fixes in the layout handling and Excel specific exports gave the ‘normal’ mode a speed-up of 3 to 4 times. On top of that, we now have the fast mode, that gives another 2-3 times more raw speed.
Not bad for one week of frantic coding, I guess.
Go grab the 5.1 CI builds to give it a go. You will need an updated BI-Server reporting plugin to make the BI-server (and thus the Adhoc reporting tools) pick up that change.
The 5.0 branch does not have those changes in, so don’t even try the CI builds for it. As the 5.0 codeline is in lock-down for bug-fixes only, these performance improvements will take a while to go in, as we don’t want to introduce regressions that break systems in production.