Ever felt that getting the Pentaho BI-server to spit out CSV or Excel files should be faster? As unbelievable as it sounds, with case PRD-4921 closed, we have up to 5 (in words: five!) times faster exports now.
It's one of those occasions where talking about customer problems sparks a wacky idea that falls on fertile ground.
Many customers use the Interactive Reporting tools (Pentaho Interactive Reporting, Saiku Reporting or the ugly Adhoc Reporting (WAQR)) to get data out of the data warehouse into Excel or CSV files. Such reports are usually rather simple list reports, with no fancy charting or complex layout structures. However, the reporting engine does not know that; its pessimistic nature always assumes the worst.
The Pentaho Reporting engine allows an insane degree of freedom, and via custom report functions it lets you reconfigure the report on the fly while the report is running. But with that freedom, we can no longer make any assumptions about what a report will look like in the next row. Thus the engine and the layout subsystem assume nothing and (apart from a bit of caching of reusable bits) recalculate everything from scratch.
Which takes time.
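To make that concrete: an element's style can be driven by an expression that is evaluated against every single row. The tiny, hypothetical sketch below (not the real engine API, just an illustration) shows the kind of per-row decision that keeps the layouter from reusing the previous row's work:

```java
// Hypothetical style expression (not the actual engine API): the font color
// depends on the current row's value, so the computed style can change on
// every row, and the previous row's layout cannot simply be reused.
final class NegativeValueHighlighter {
    /** Evaluated once per row; the result feeds the element's font-color style. */
    String fontColorFor(double revenue) {
        return revenue < 0 ? "#ff0000" : "#000000";
    }
}
```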
With PRD-4921, I added a fast mode to some of the export types (Stream CSV, Stream HTML and XLS/XLSX). The new exporters check whether the report uses only 'safe' features, and if so, switch to a template-based output instead of running the full layouting.
A report is safe if it does not contain any of the following items (see the sketch after this list):
- inline subreports. They are evil, as they can appear anywhere and can be of any complexity.
- crosstabs. Their layout is complex and can't easily be condensed into templated sections.
- functions that listen for page events, other than the standard page function and page-of-pages functions. During fast mode, we don't generate page events, and thus these won't output correct values. I am willing to ignore the page functions here, as data exports are less concerned with page numbers.
- any layout-processor-function other than the Row-Banding function. They exist to rewrite the report, which stops us from making assumptions about the report’s structure.
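Here is a rough sketch of what such a check could look like. The report model below is hypothetical and much simpler than the real engine's classes; it exists only to make the four rules above concrete:

```java
import java.util.List;

// Hypothetical report model: just enough structure to express the rules above.
interface ReportFunction {
    boolean listensToPageEvents();
    boolean isStandardPageFunction(); // page or page-of-pages function
    boolean isLayoutProcessor();
    boolean isRowBanding();
}

interface Report {
    boolean hasInlineSubReports();
    boolean hasCrosstabs();
    List<ReportFunction> functions();
}

final class FastExportSafety {
    /** Returns true if the report qualifies for the template-based fast path. */
    static boolean isSafe(Report report) {
        if (report.hasInlineSubReports()) {
            return false; // can appear anywhere, at any complexity
        }
        if (report.hasCrosstabs()) {
            return false; // too complex to condense into templated sections
        }
        for (ReportFunction f : report.functions()) {
            if (f.listensToPageEvents() && !f.isStandardPageFunction()) {
                return false; // fast mode generates no page events
            }
            if (f.isLayoutProcessor() && !f.isRowBanding()) {
                return false; // layout processors rewrite the report structure
            }
        }
        return true;
    }
}
```

An exporter would run a check like this once, before processing starts, and take the fast path only when it passes.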
If a report is not safe, the engine falls back to the normal, slow mode. You then just have to wait a bit longer to get your data, but you won't get sudden service interruptions.
For fast reports, the engine produces a template of each root-level band. If the style of a band changes over time (as a result of style expressions), we produce a template for each unique combination of styles the reporting engine encounters.
Once the engine has a valid template, it can skip all layouting on subsequent rows of data and just fill the data into the template's placeholders. The resulting output is exactly the same as the slow output – minus the waiting time.
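A minimal sketch of that idea, again with hypothetical names (the real templates come out of the layouter and are far richer than this): a band is layouted once per unique style combination, cached, and afterwards only filled with row data:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical fast-path writer: layout each root-level band once per unique
// style combination, cache the result as a template, then only substitute
// field values for every following row.
final class TemplatedBandWriter {
    private final Map<String, BandTemplate> templates = new HashMap<>();

    /** styleFingerprint identifies the row's unique combination of computed styles. */
    String writeRow(String styleFingerprint, Map<String, String> row) {
        BandTemplate t = templates.computeIfAbsent(styleFingerprint, this::layoutBand);
        return t.fill(row); // cheap placeholder substitution, no layouting
    }

    private BandTemplate layoutBand(String styleFingerprint) {
        // Stand-in for the expensive full layout run, executed once per style combo.
        return new BandTemplate(List.of("NAME", "REVENUE"));
    }
}

final class BandTemplate {
    private final List<String> placeholders;

    BandTemplate(List<String> placeholders) {
        this.placeholders = placeholders;
    }

    /** Fill the template's placeholders with the current row's values (CSV-style). */
    String fill(Map<String, String> row) {
        StringBuilder out = new StringBuilder();
        for (String field : placeholders) {
            if (out.length() > 0) {
                out.append(',');
            }
            out.append(row.getOrDefault(field, ""));
        }
        return out.toString();
    }
}
```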
So how does this system perform? Here is what my system produces for a 65,000-row report (to stay within the row limit imposed by Excel 97) with 35 columns of data exported. The report has no groups; it is just one big fat stream of data. All times are given in seconds.
Export | 5.1 with fix | 5.0 with fix | 5.0 GA
---|---|---|---
Fast-CSV | 4.5 | 5.4 | –
CSV | 25.8 | 24.5 | 24.8
Fast-XLS | 11.7 | 11.3 | –
XLS | 53.2 | 51.3 | 213.4
Fast-XLSX | 31.3 | 37.7 | –
XLSX | 86.0 | 82.8 | 232.4
Fast-HTML | 10.0 | 11.1 | –
HTML | 42.9 | 43.5 | 44.9
PDF | 66.7 | 69.2 | 66.4
As you can see from the data, the fix gives a 4-5x speed-up for HTML and CSV exports. The Excel exports were extra slow in 5.0 (and 4.x); a few fixes in the layout handling and the Excel-specific exporter code gave the 'normal' mode a 3-4x speed-up. On top of that, we now have the fast mode, which adds another 2-3x of raw speed.
Not bad for one week of frantic coding, I guess.
Go grab the 5.1 CI builds to give it a go. You will need an updated BI-Server reporting plugin to make the BI-server (and thus the Adhoc reporting tools) pick up that change.
The 5.0 branch does not include those changes, so don't even try the CI builds for it. As the 5.0 codeline is in lock-down for bug fixes only, these performance improvements will take a while to land, as we don't want to introduce regressions that break systems in production.
awesome!
Hi Thomas,
I am struggling a bit with a crosstab report.
I have about 10,000 records to display in a crosstab, but it takes ages to generate the report.
First of all, I have a heavy SQL statement that takes 3 seconds to fetch the data from the database. The Pentaho version is 3.9.2-GA and the database is MS SQL Server 2008 R2.
A progress bar in a dialog shows multiple stages, and each stage recomputes the 10,000 rows, counting up from 1 to 10,000. Stage 1 is pretty fast, stage 2 is slow, and stage 3 takes ages to display the report. Altogether it takes more than 10 minutes.
I have displayed the same amount of data in a plain report, which took 1 minute 30 seconds. I also have some functions that add IDs and names to dynamically created fields, so I know that affects the time too. However, I am still wondering whether this is expected behavior for Pentaho, or whether there is anything I can do to bring the computation time down. There is not much on the internet regarding this issue.
Any help is appreciated.
Thanks in advance.
Oz.
Crosstabs in version 3.9 are fairly broken. Use the latest CI build and you will see large improvements in both the performance and the correctness of crosstabs. Luckily, the release of the Pentaho BI-Suite and Pentaho Reporting 5.1 is only a few weeks away.
Thank you very much for taking the time to reply.
Oz.
Hi Thomas
Where can I find the updated BI-Server reporting plugin for the CI 5.1 builds?
Thanks!
In the CI server under the name “reporting-platform-plugin”.
http://ci.pentaho.com/job/reporting-platform-plugin/
However, only use that if you know how to manually assemble a BI-server – otherwise use the full build here:
http://ci.pentaho.com/job/BISERVER-CE/
If you intend to use this build to patch up an older 5.0 server – don't. It won't work, as the BI-server code itself changed in addition to the reporting engine and reporting plugin code.
Ok, thank you very much!