Performance tuning settings for Pentaho Reporting

General thoughts on report processing and performance

Performance of Pentaho Reporting depends mainly on the amount of content printed.
The more content you generate, the more time we need to perform
all layout computations.

Use Inline Subreports with care

Large inline-subreports are the most notorious reason for bad performance.
The laid-out output of an inline-subreport is always stored in memory. Layout
processing pauses until the subreport is fully generated; the result is then inserted
into the layout model and subsequently printed. Memory consumption for this
processing model is high, as the full layout model is kept in memory until the
report is finished. If the subreport produces a large amount of content,
you will run into out-of-memory exceptions in no time.

An inline-subreport that consumes the full width of the root-level band
should be converted into a banded subreport. Banded subreports are
laid out, and their output is generated, while the subreport is processed.
The memory footprint for this is small, as only the active band or the
active page has to be held in memory.
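
To make the difference concrete, here is a minimal sketch of the two ways of attaching a subreport, assuming the classic engine's API; the query name is made up and the subreport is attached to the report header purely for illustration:

import org.pentaho.reporting.engine.classic.core.ClassicEngineBoot;
import org.pentaho.reporting.engine.classic.core.MasterReport;
import org.pentaho.reporting.engine.classic.core.SubReport;

public class BandedSubReportSketch
{
  public static void main(final String[] args)
  {
    ClassicEngineBoot.getInstance().start();

    final MasterReport report = new MasterReport();
    final SubReport details = new SubReport();
    details.setQuery("detail-query"); // hypothetical query name

    // Banded: the subreport is processed and printed band by band,
    // so only the active band or page stays in memory.
    report.getReportHeader().addSubReport(details);

    // Inline (avoid for large content): the subreport is added as an
    // element inside the band and its full layout is kept in memory.
    // report.getReportHeader().addElement(details);
  }
}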

Resource Caching

When images are embedded from servers (HTTP/FTP sources), it is critical
for good performance that the server sends a last-modified date (the “Last-Modified”
header in HTTP). We use that header as part of the caching. A missing header means we
do not cache the resource and will reload the image every time we access it.

As a general rule of thumb: Caching must be configured properly via
a valid EHCache file. If caching is disabled or misconfigured, then
we will run into performance trouble when loading reports and resources.

Performance Considerations for Output types

Within Pentaho Reporting there are two major output processing models, each with its
own memory and CPU consumption characteristics.

(1) Pageable Outputs

A pageable report generates a stream of pages. Each page has the same
height, even if the page is not fully filled with content. When a page is filled,
the laid-out page is handed over to the output target, which renders it
into either a Graphics2D context or a streaming output (PDF, plain text,
HTML, etc.).
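
As an example, a full pageable export to PDF can be produced via the engine's utility classes. This is only a sketch; the file names are placeholders and error handling is omitted:

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;

import org.pentaho.reporting.engine.classic.core.ClassicEngineBoot;
import org.pentaho.reporting.engine.classic.core.MasterReport;
import org.pentaho.reporting.engine.classic.core.modules.output.pageable.pdf.PdfReportUtil;
import org.pentaho.reporting.libraries.resourceloader.ResourceManager;

public class PdfExportSketch
{
  public static void main(final String[] args) throws Exception
  {
    ClassicEngineBoot.getInstance().start();

    // load the report definition ("report.prpt" is a placeholder)
    final ResourceManager manager = new ResourceManager();
    manager.registerDefaults();
    final MasterReport report = (MasterReport)
        manager.createDirectly(new File("report.prpt"), MasterReport.class).getResource();

    final OutputStream out = new BufferedOutputStream(new FileOutputStream("report.pdf"));
    try
    {
      PdfReportUtil.createPdf(report, out);
    }
    finally
    {
      out.close();
    }
  }
}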

Prefer “break-after” over “break-before” pagebreaks.

When the content contains a manual pagebreak, the page is considered
full. If the pagebreak is a “before-print” break, the break is converted
into an “after-print” break, the internal report states are rolled back,
and parts of the report processing restart to regenerate the layout
with the new constraints. A similar roll-back happens if the current
band does not fit on the page.
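
If you build the report definition in code, this preference translates into setting the break-after style instead of break-before on the band in question. A minimal sketch, assuming the classic engine's style keys; the method and class names are made up:

import org.pentaho.reporting.engine.classic.core.Band;
import org.pentaho.reporting.engine.classic.core.style.BandStyleKeys;

public class PageBreakSketch
{
  public static void markBreakAfter(final Band band)
  {
    // preferred: break after the band has been printed
    band.getStyle().setStyleProperty(BandStyleKeys.PAGEBREAK_AFTER, Boolean.TRUE);

    // avoid where possible: a break before the band can force the
    // state roll-back described above
    // band.getStyle().setStyleProperty(BandStyleKeys.PAGEBREAK_BEFORE, Boolean.TRUE);
  }
}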

Stored PageStates: good for browsing a report, but they eat memory

When processing a pageable report, the reporting engine assumes that
the report will be run in interactive mode. To make browsing through
the pages faster, a number of page-states are stored so that output
processing can be restarted at those points.

Reports that are run to fully export all pages usually do not need
to store those pagestates. A series of settings controls the number
and frequency of the pagestates stored:

org.pentaho.reporting.engine.classic.core.performance.pagestates.PrimaryPoolSize=20
org.pentaho.reporting.engine.classic.core.performance.pagestates.SecondaryPoolFrequency=4
org.pentaho.reporting.engine.classic.core.performance.pagestates.SecondaryPoolSize=100
org.pentaho.reporting.engine.classic.core.performance.pagestates.TertiaryPoolFrequency=10

The reporting engine uses three lists to store the page-states. With the default
configuration shown above, this works as follows:

The first 20 states (pages 1 to 20) are stored in the primary pool. All of them
are held with strong references and will not be garbage-collected.

The next 400 states (pages 21 to 420) are stored in the secondary pool. Of those,
every fourth state is held with a strong reference and cannot
be garbage-collected as long as the report processor is open.

All subsequent states (pages beyond 420) are stored in the tertiary pool,
where every tenth state is held with a strong reference.

For a 2,000-page report this adds up to roughly 270-280 strongly referenced states
(20 primary + 100 secondary + one in ten of the remaining pages).

In server mode, the settings can be cut down to:

org.pentaho.reporting.engine.classic.core.performance.pagestates.PrimaryPoolSize=1
org.pentaho.reporting.engine.classic.core.performance.pagestates.SecondaryPoolFrequency=1
org.pentaho.reporting.engine.classic.core.performance.pagestates.SecondaryPoolSize=1
org.pentaho.reporting.engine.classic.core.performance.pagestates.TertiaryPoolFrequency=100

This reduces the number of states stored for a 2,000-page report to about 22, cutting
the memory consumption for the page states to roughly a tenth of the default.
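
If editing the configuration file is not an option, the same keys can also be set programmatically before the report run. A sketch, assuming the classic engine's editable global configuration:

import org.pentaho.reporting.engine.classic.core.ClassicEngineBoot;
import org.pentaho.reporting.libraries.base.config.ModifiableConfiguration;

public class ServerModePageStates
{
  public static void configure()
  {
    ClassicEngineBoot.getInstance().start();
    final ModifiableConfiguration config =
        ClassicEngineBoot.getInstance().getEditableConfig();
    // store as few page-states as possible for non-interactive exports
    config.setConfigProperty(
        "org.pentaho.reporting.engine.classic.core.performance.pagestates.PrimaryPoolSize", "1");
    config.setConfigProperty(
        "org.pentaho.reporting.engine.classic.core.performance.pagestates.SecondaryPoolFrequency", "1");
    config.setConfigProperty(
        "org.pentaho.reporting.engine.classic.core.performance.pagestates.SecondaryPoolSize", "1");
    config.setConfigProperty(
        "org.pentaho.reporting.engine.classic.core.performance.pagestates.TertiaryPoolFrequency", "100");
  }
}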

(Note: In PRD 3.7 full exports no longer generate page states and thus these settings
will have no effect on such exports. They still affect the interactive mode.)

(2) Table exports

A table export produces tabular output from a fully laid-out display model. A
table export cannot handle overlapping elements and therefore has to remove
such elements.

To support the debugging of report layouts, we store a lot of extra information
in the layout model. This increases memory consumption but makes developing
reporting solutions easier. These debug settings should never be enabled in
production environments. In 3.6 and earlier, the pre-built “classic-engine” has them
enabled, as this helps inexperienced developers find their report-definition errors faster.

org.pentaho.reporting.engine.classic.core.modules.output.table.base.ReportCellConflicts=true
org.pentaho.reporting.engine.classic.core.modules.output.table.base.VerboseCellMarkers=true

Note: With PRD-3.7, the defaults for these settings will change to “false” as
we assume that most users use PRD for developing reports now. PRD comes with
its own method to detect overlapping elements and does not rely on these settings.
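
For a 3.6-era classic-engine where you cannot ship a modified configuration, the markers can also be switched off in code. Again, this is only a sketch using the editable global configuration:

import org.pentaho.reporting.engine.classic.core.ClassicEngineBoot;
import org.pentaho.reporting.libraries.base.config.ModifiableConfiguration;

public class DisableTableDebugMarkers
{
  public static void disable()
  {
    ClassicEngineBoot.getInstance().start();
    final ModifiableConfiguration config =
        ClassicEngineBoot.getInstance().getEditableConfig();
    // switch off the verbose debug information in the table layout model
    config.setConfigProperty(
        "org.pentaho.reporting.engine.classic.core.modules.output.table.base.ReportCellConflicts", "false");
    config.setConfigProperty(
        "org.pentaho.reporting.engine.classic.core.modules.output.table.base.VerboseCellMarkers", "false");
  }
}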

Special Notes on the HTML export

In HTML exports, there are a few settings that can affect export performance.

org.pentaho.reporting.engine.classic.core.modules.output.table.html.CopyExternalImages=true

This setting controls whether images linked from HTTP(S) or FTP sources are linked from
their original location or copied (and possibly re-encoded) into the output directory. The
default is “true”, as this ensures that a report always shows the same image.

Set it to “false” if the image is dynamically generated and should always show the most
recent view.

org.pentaho.reporting.engine.classic.core.modules.output.table.html.InlineStyles=false
org.pentaho.reporting.engine.classic.core.modules.output.table.html.ExternalStyle=true
org.pentaho.reporting.engine.classic.core.modules.output.table.html.ForceBufferedWriting=true

The style settings and the buffered-writing setting control how stylesheets are produced and whether the generated HTML output is held in a buffer until report processing is finished.

Style information can either be inlined, stored in an external *.css file, or
contained in the <head> element of the generated HTML file
(InlineStyles == false and ExternalStyle == false).

Buffering is forced when styles need to be inserted into the <head> element of the
generated report. Buffering should be enabled if the resulting report is read by a
browser, as browsers request all resources they find in the HTML stream. If a
browser requests a stylesheet that has not yet been fully generated, the report
cannot be displayed correctly.

It is safe to disable buffering if the styles are inlined, as the browser will
not need to fetch an external stylesheet in that case.

Buffered content will appear slower to the user than non-buffered content, as
browsers render partial HTML pages while data is still being received from the server.
Buffering delays that rendering until the report is fully processed on the server.
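
Putting the HTML settings together: a server-side streaming export with inlined styles and buffering switched off could look roughly like the following sketch (the output file name is a placeholder, report loading is omitted, and the class and method names are made up):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

import org.pentaho.reporting.engine.classic.core.ClassicEngineBoot;
import org.pentaho.reporting.engine.classic.core.MasterReport;
import org.pentaho.reporting.engine.classic.core.modules.output.table.html.HtmlReportUtil;
import org.pentaho.reporting.libraries.base.config.ModifiableConfiguration;

public class HtmlExportSketch
{
  public static void exportToHtml(final MasterReport report) throws Exception
  {
    ClassicEngineBoot.getInstance().start();

    final ModifiableConfiguration config =
        ClassicEngineBoot.getInstance().getEditableConfig();
    // inline all style information and skip the buffered writing
    config.setConfigProperty(
        "org.pentaho.reporting.engine.classic.core.modules.output.table.html.InlineStyles", "true");
    config.setConfigProperty(
        "org.pentaho.reporting.engine.classic.core.modules.output.table.html.ExternalStyle", "false");
    config.setConfigProperty(
        "org.pentaho.reporting.engine.classic.core.modules.output.table.html.ForceBufferedWriting", "false");

    final OutputStream out = new BufferedOutputStream(new FileOutputStream("report.html"));
    try
    {
      HtmlReportUtil.createStreamHTML(report, out);
    }
    finally
    {
      out.close();
    }
  }
}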

This entry was posted in Advanced Topic, Tech-Tips by Thomas.

About Thomas

After working as an all-hands guy and lead developer on Pentaho Reporting for over a decade, I have learned a thing or two about report generation, layouting and general BI practices. I have witnessed the remarkable growth of Pentaho Reporting from a small niche product to an enterprise-class Business Intelligence product. This blog documents my own perspective on Pentaho Reporting's development process and our steps towards upcoming releases.

3 thoughts on “Performance tuning settings for Pentaho Reporting”

  1. saad

    Hi,
    I am using Pentaho 3.7 to create a .prpt report, and I integrated the report into my web application (running on Linux). It generates the report, but it takes too much memory when I run it and it doesn’t free the memory. I would like to follow the advice that you wrote, but I don’t know how to configure the EHCache file. Thanks for any suggestion that will help me figure out how to free and reduce the memory. (The memory keeps increasing even after I close the application!)

  2. Thomas Morgner

    You can find great information about EHCache on their website: http://ehcache.org/documentation/configuration.html

    Put the edited ehcache.xml file into your “WEB-INF/classes” directory.

    However, it seems you need to learn a bit more about how web applications really work. You *close* a web application by shutting down your server. Just closing your browser does not do anything. Also: once the Java Virtual Machine allocates physical memory, it does not free it. Inside the VM’s memory, it performs its own memory management – but that is a different story, and any textbook on more advanced Java programming can tell you the details.

    Alternatively, follow the links given in this Stackoverflow question and you should have a fairly good idea of how Java manages its memory.

    http://stackoverflow.com/questions/2957171/jvm-memory-management-garbage-collection-book

  3. Abhishek Kumar

    Hi Thomas,

    We have started using the Pentaho reporting engine recently and are having issues rendering reports that contain more than 1000 rows. As long as the number of rows returned by the underlying query is within 100-1500 rows, the results come back fast, but with an increasing number of rows the performance drops drastically. Even though the database returns 100000 rows in under 30 seconds, the report rendering takes more than 5-6 minutes, and it also hangs many times. The database we are using is EnterpriseDB, a variant of PostgreSQL. Please give us some insight into what may be the cause and what a possible workaround for this problem could be.
