Performance monitoring and tuning in Pentaho Reporting

Are your reports running slowly? Are you having trouble finding the cause? The performance of reports depends on many factors, and as many of these factors are interdependent, it is not always easy to untangle the mix.

But since the release of Pentaho Reporting 3.8, there is at least some help in your quest for performance: additional log settings that print performance metrics.

With this release, we added a bunch of loggers at strategic places to monitor the performance of running reports.

org.pentaho.reporting.engine.classic.core.ProfileReportProcessing=true

This setting enables the profiling mode via the logger “org.pentaho.reporting.engine.classic.core.layout.output.PerformanceProgressLogger”.
If this property is set to ‘true’, a new progress-monitor gets added to the report process. This monitor can print periodic status messages while the report is processed.

The monitor itself can be tuned to output only the information that you deem most relevant. You can configure the logger to print a statement on every new page, on every new processing level, or as a periodic update for every 5000 rows processed by the engine.

org.pentaho.reporting.engine.classic.core.performance.LogPageProgress=true
org.pentaho.reporting.engine.classic.core.performance.LogLevelProgress=true
org.pentaho.reporting.engine.classic.core.performance.LogRowProgress=true
org.pentaho.reporting.engine.classic.core.DebugDataSources=true
org.pentaho.reporting.engine.classic.core.ProfileDataSources=true
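These settings normally live in the classic-engine.properties file on your classpath, but if you embed the engine you can also flip them per report. The following is a minimal sketch, assuming the 3.8 API where MasterReport.getReportConfiguration() exposes a modifiable configuration that overrides the global defaults:

import org.pentaho.reporting.engine.classic.core.ClassicEngineBoot;
import org.pentaho.reporting.engine.classic.core.MasterReport;

public class ProfilingSetup {
  public static void enableProfiling(MasterReport report) {
    // Boot the engine once before working with report objects.
    ClassicEngineBoot.getInstance().start();

    // Assumption: the report-level configuration overrides classic-engine.properties.
    report.getReportConfiguration().setConfigProperty(
        "org.pentaho.reporting.engine.classic.core.ProfileReportProcessing", "true");
    report.getReportConfiguration().setConfigProperty(
        "org.pentaho.reporting.engine.classic.core.performance.LogRowProgress", "true");
  }
}

Keep in mind that the performance logger writes through the engine's normal logging, so your logging configuration must let messages from that category through before you will see any output.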

Pentaho Reporting is a layouting engine that combines tabular data from a data source with calculated data and a layout description to produce the final documents. During the last few years I have not seen a report where the report functions or expressions formed the bottleneck of the report processing, so I happily ignore them unless everything else fails.

The database as bottleneck

If you are using SQL or MQL data sources, then your database may be responsible for the slow execution of the report.

Depending on the database, the use of scrollable result sets may be lethal to report performance. With a scrollable result set the database only supplies a sliding window of the data to the reporting engine. This reduces the reporting engine's memory footprint, but puts a larger strain on the network and I/O capabilities of the system. If you pull a lot of data and the database is slow at supplying it via a scrollable result set, then you pay a lot for accessing the data without knowing it. The query itself will return quickly, as it only initializes the cursor, leaving the job of transferring the data out of the cost equation.

You can test whether the data source is the problem by disabling the use of scrollable result sets via the report configuration setting:

org.pentaho.reporting.engine.classic.core.modules.misc.tablemodel.TableFactoryMode=simple

Be aware that the memory consumption will go up with this setting, as we fully buffer the result set in local memory. On the positive side, access to local memory is blazingly fast, and your report will therefore run faster.
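To illustrate what the "simple" mode changes under the hood: at the JDBC level the difference roughly corresponds to the result-set type requested when the statement is created. The sketch below is plain JDBC for illustration only, not the engine's internal code, and the connection URL and credentials are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ResultSetTypes {
  public static void main(String[] args) throws SQLException {
    // Placeholder URL and credentials - replace with your own.
    try (Connection con = DriverManager.getConnection(
        "jdbc:mysql://localhost/sampledata", "user", "pass")) {

      // Scrollable cursor: the query returns quickly, but rows are fetched
      // lazily in a sliding window, so network and I/O costs hit during layouting.
      Statement scrollable = con.createStatement(
          ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);

      // Forward-only cursor: all rows are read once, front to back. The "simple"
      // table factory mode buffers them in memory, trading RAM for speed.
      Statement forwardOnly = con.createStatement(
          ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
    }
  }
}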

I usually test the database first, as this test is simple to make and rules out a large chunk of the potential troublemakers.

The layout as bottleneck

In the reporting engine, the performance of the report is usually limited by the layouting and text processing that is necessary to produce the report. Performance is mainly dependent on the amount of content printed. The more content you generate, the more time we will need to perform all layout computations.

Subreports: Inline vs Banded

Large inline-sub-reports are the most notorious reason for bad performance. The layouted output of an inline-sub-report is always stored in memory. Layouting pauses until the sub-report is fully generated, then the result is inserted into the layout model and subsequently printed. The memory consumption for this layouting model is high, as the full layout model is kept in memory until the report is finished. If the sub-report produces a huge amount of content, you will run into out-of-memory exceptions in no time.

An inline-sub-report that consumes the full width of the root-level band should be converted into a banded sub-report. Banded sub-reports are layouted in a streaming fashion, and output is generated while the sub-report is processed. The memory footprint for that operation is small, as only the active band or the active page has to be held in memory.
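If you build reports through the API rather than the Report Designer, the difference is simply where the sub-report is attached. This is a rough sketch, assuming the 3.8 API in which root-level bands accept banded sub-reports via addSubReport(), while adding the sub-report as an ordinary element makes it an inline sub-report:

import org.pentaho.reporting.engine.classic.core.MasterReport;
import org.pentaho.reporting.engine.classic.core.SubReport;

public class SubReportPlacement {
  public static void attach(MasterReport report, SubReport details) {
    // Banded: streamed while processed; only the active band or page stays in memory.
    report.getReportHeader().addSubReport(details);

    // Inline (avoid for large content): the whole sub-report layout is buffered in memory.
    // report.getReportHeader().addElement(details);
  }
}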

Images loaded from HTTP-URLs

Some image-generating scripts on the web can cause trouble for the reporting engine. For content loaded via HTTP connections, we rely on the cache-control headers sent by the server. For this we check the “last-modified” header of the HTTP response. If there is none, then we consider the file non-cacheable.

We check the cache headers against the original object by issuing a HEAD request to the URL. This may cause problems if the image service provider does not properly implement a HEAD handler.

In Pentaho Reporting 3.8, we work around most of these troubles by assuming the worst of the web authors. A non-cacheable file is still cached for at least 10 seconds (the default configuration), as most programmers of web services do not care, do not know, or are not able to send a “last-modified” header or implement a HEAD response handler. For long-running reports, this value must be increased to a number that at least matches the expected runtime of the report.

You can also disable cache control completely, which will cause the reporting engine to never reload the content as long as the main resource cache still holds a reference to it.

To configure both settings, edit the “loader.properties” file in the root of your classpath and set the following properties:

# Controls the mandatory lifetime of HTTP objects before we check for updates on the object 
org.pentaho.reporting.libraries.resourceloader.config.url.FixedCacheDelay=10000

# Disable cache updates completely. Once a resource is loaded it stays loaded until its 
# end of life in the underlying resource cache has been reached.
org.pentaho.reporting.libraries.resourceloader.config.url.FixBrokenWebServiceDateHeader=false

Misconfigured caching as a source of slow-downs

As a general rule of thumb: caching must be configured properly via a valid EHCache file. If caching is disabled or misconfigured, we will run into performance trouble when loading reports and resources. A misconfigured cache can considerably slow down reports if you use a resource-message, resource-label or resource-field element in the report. The resource elements load their translations via LibLoader, which in turn depends on properly working caches to avoid repeated and expensive I/O operations.

You can configure EHCache via an “eh-cache.xml” file placed into the root of the classpath.

Output targets as bottleneck

Pageable outputs

A pageable report generates a stream of pages. Each page has the same height, even if the page is not fully filled with content. When a page is filled, the layouted page is passed over to the output target to render it in either a Graphics2D context or a streaming output (PDF, Plaintext, Pageable-HTML, etc.).

When the content contains a manual pagebreak, the engine will consider the page full. If the pagebreak is a “break-before” marker, it will be converted into a “break-after” marker on the previous band. To apply this change, the internal report states must be rolled back and parts of the report processing restart to regenerate the layout with the new constraint in place. A similar rollback happens if the current band does not fit on the page.

To avoid unnecessary rollbacks, prefer “break-after” pagebreak markers over “break-before” markers.
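In the Report Designer these are the pagebreak-before and pagebreak-after band styles; via the API they are band style keys. A minimal sketch, assuming the BandStyleKeys constants of the 3.8 engine:

import org.pentaho.reporting.engine.classic.core.Band;
import org.pentaho.reporting.engine.classic.core.style.BandStyleKeys;

public class PageBreaks {
  public static void breakAfter(Band band) {
    // Prefer a break after this band ...
    band.getStyle().setStyleProperty(BandStyleKeys.PAGEBREAK_AFTER, Boolean.TRUE);
    // ... instead of a break before the following band, which may force a rollback.
    band.getStyle().setStyleProperty(BandStyleKeys.PAGEBREAK_BEFORE, Boolean.FALSE);
  }
}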

In most cases, the pageable outputs will be fairly simple and non-problematic, as they only hold a minimal amount of data in memory.

Table outputs

A table export produces a table output from a fully layouted display model. A table export cannot handle overlapping elements and therefore has to remove such elements.

In versions prior to Pentaho Reporting 3.8, we also added some debug information to the layout nodes. This did increase the memory consumption, but made developing reporting solutions easier. Although these debug settings should never be enabled in production environments, many users of the community edition may still run with them enabled. In Pentaho Reporting 3.8 we changed the built-in default so that no debug information gets added: those who can interpret the output probably know how to enable the setting, while everyone else previously just suffered the bad performance.

# Config settings for the 'classic-engine.properties' file
org.pentaho.reporting.engine.classic.core.modules.output.table.base.ReportCellConflicts=false
org.pentaho.reporting.engine.classic.core.modules.output.table.base.VerboseCellMarkers=false

Streaming HTML exports

In HTML exports, there are a few settings that can affect export performance.

org.pentaho.reporting.engine.classic.core.modules.output.table.html.CopyExternalImages=true

This setting controls whether images linked from HTTP(S) or FTP sources are linked from their original source or copied (and possibly recoded) into the output directory. The default is “true” as this ensures that reports always have the same image.

Set it to false if the image is dynamically generated and should always show the most recent view. Disabling this setting also avoids some I/O for copying and recoding the images during report generation, although for most images used in a report this performance gain won't be noticeable at all.

org.pentaho.reporting.engine.classic.core.modules.output.table.html.InlineStyles=false
org.pentaho.reporting.engine.classic.core.modules.output.table.html.ExternalStyle=true
org.pentaho.reporting.engine.classic.core.modules.output.table.html.ForceBufferedWriting=true

The style settings and the buffered writing settings control how stylesheets are produced and whether the generated HTML output will be held in a buffer until the report processing is finished.

Style information can either be inlined, stored in an external *.css file, or contained in the HEAD element of the generated HTML file (when InlineStyles and ExternalStyle are both set to false).

Buffering is forced when styles need to be inserted into the HEAD element of a report. The buffering configuration setting should be set to true if the resulting report is read by a browser, as browsers request all resources from the server as soon as they find a reference to them in the HTML stream. If a browser requests a stylesheet that has not yet been fully generated, the report will not be rendered correctly.

It is safe to disable buffering if the styles are inlined, as the browser will not need to fetch an external stylesheet in that case. Inlined styles, however, greatly increase the size of the generated HTML and may slow down the rendering of the generated HTML file tremendously if the file is large.

Buffered content will appear slower to the user than non-buffered content, as browsers render partial HTML pages while data is still received from the server. Buffering delays that rendering until the report is fully processed on the server.

Reports that are only written to disk or downloaded from the server to be archived and then forgotten should have buffering disabled so that they can be streamed directly to the client.
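Putting the pieces together for the archive/download case: inline the styles and disable buffering before exporting. A hedged sketch, assuming that MasterReport.getReportConfiguration() returns a modifiable configuration and that HtmlReportUtil.createStreamHTML(report, outputStream) is available as the streaming entry point:

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.pentaho.reporting.engine.classic.core.ClassicEngineBoot;
import org.pentaho.reporting.engine.classic.core.MasterReport;
import org.pentaho.reporting.engine.classic.core.modules.output.table.html.HtmlReportUtil;

public class ArchiveHtmlExport {
  public static void export(MasterReport report, String fileName) throws Exception {
    ClassicEngineBoot.getInstance().start();

    // Inline the styles so no external stylesheet is needed, and stream instead of buffering.
    report.getReportConfiguration().setConfigProperty(
        "org.pentaho.reporting.engine.classic.core.modules.output.table.html.InlineStyles", "true");
    report.getReportConfiguration().setConfigProperty(
        "org.pentaho.reporting.engine.classic.core.modules.output.table.html.ExternalStyle", "false");
    report.getReportConfiguration().setConfigProperty(
        "org.pentaho.reporting.engine.classic.core.modules.output.table.html.ForceBufferedWriting", "false");

    try (OutputStream out = new FileOutputStream(fileName)) {
      HtmlReportUtil.createStreamHTML(report, out);
    }
  }
}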

Buffering consumes memory and can potentially lead to OutOfMemory conditions on smaller servers if too many users request large reports at the same time. However, in most reasonable configurations, buffering is harmless. So far no browser is able to properly render an HTML file that is larger than 20 MB, so if excessive buffering ever causes you trouble, rest assured that even where the buffering works, your browser surely won't.
