Category Archives: Performance

Understanding and tuning the Java garbage collector

While working on one of my many side-projects, I stumbled across a very complete and understandable explanation of the various garbage collector algorithms used inside the JDK and their performance impacts.

If you want to squeeze the most out of your BI Server or Carte server installations, the article “Java Garbage Collectors distilled” on the Mechanical Sympathy blog may be worth a read.

Look out for the JDK Performance trap

Would you believe that running the same code on the same system can be up to 100% faster depending on which JDK you choose? (And mind you, I am talking about standard Oracle JDKs here, nothing experimental or expensive!)

A couple of weeks ago I decided to run my standard performance tests for the Pentaho Reporting engine. I had just upgraded from a six-year-old computer to a shiny new, big, fat machine and was itching to give it a run. With all the software freshly installed and the filesystem sparkly clean, let’s see how the various reporting engine versions run.

And lo and behold, Pentaho Reporting 4.8 seemed to outperform Pentaho Reporting 5.0 by a staggering 100%. Digging into it with JProfiler, I could not find a clear target: the ratios of time spent in the major subsystems were the same for both versions; 5.0 was simply twice as slow as 4.8 at every task.

After nearly two weeks of digging, I finally found the culprit: for some arcane reason I had used a 64-bit JDK for the 4.8 test, but a 32-bit JDK for the 5.0 test run. Using the exact same JDK fixed it.

The numbers: JDK7 – 32bit vs 64bit vs server vs client vm

I have a standard test: a set of reports that print 100,000 elements in configurations of 5,000 rows with 20 elements per row, 10,000 rows with 10 elements, 20,000 rows with 5 elements, and 50,000 rows with 2 elements.

Here are the results of running Pentaho Reporting 5.0 in different JDK configurations. All times are in seconds.

JVM configuration   5k_20   10k_10   20k_5   50k_2
32-bit / -client    3.75    4.31     5.52    9.203
32-bit / -server    2.2     2.5      3.2     5.3
64-bit / -client    1.92    2.2      2.8     4.75
64-bit / -server    1.9     2.18     2.78    4.75

Running Pentaho Reporting 4.8 in the same configurations yields no different results (apart from statistical noise). So with all the major work that went into Pentaho Reporting 5, at least we did not kill our performance.

So the lesson learned is: choose your JVM wisely, and if you happen to use a 32-bit system or JDK, make sure you use the ‘-server’ command line switch for all your Java tools, or you will wait twice as long for your data to appear.
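The switch itself is trivial to apply; a minimal example (the jar name here is only a placeholder for whatever tool you actually launch):

java -server -jar report-runner.jar my-report.prpt

On a 64-bit JDK the server VM is the default anyway, which is why the 64-bit ‘-client’ and ‘-server’ numbers above are nearly identical; the switch really matters on 32-bit installations.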

 

Doing the performance dance (again)

I just changed another bit of the table-export while integrating a patch for PRD-3631. Although the patch itself took a few illegal shortcuts, it showed me an easier way of calculating the cell-backgrounds for the HTML, Excel and RTF exports of Pentaho Reporting.

After a bit more digging, I also fixed some redundant calls in the HTML and Excel exports for merged cells and row-styles. Both resulted in repeated calls to the cell-background calculator and were responsible for slowing down the reporting engine more than necessary.

The performance of my test reports improved a bit with those changes. But if anything, this case has shown me that clean report design is the major driver of a fast export.

The performance of the reports went up by 15 to 30 percent, with the larger gains on the reports with larger row-counts. However, the reports I test are surely not representative, as all their elements are perfectly aligned and the reports are designed to avoid merged cells.

The patch specifically claims to address performance problems in the cell-style calculation. Agreed, there were problems, and the patch addressed them. But there was no way I could see a 100% improvement on normal reports. Well, not on reports that are well-designed and use the powerful little helpers that the Pentaho Report Designer offers to keep reports well-aligned.

When I receive production reports for debugging, the picture is usually bleaker. Fields are placed rather randomly and are usually misaligned by a few points. They start and end at rather random positions, and elements are usually not aligned to element boundaries across different sections.

Let’s take this visually fairly clean report as an example:

The many fine grey lines you can see mark element boundaries. The more lines you see, the more cells your report will have. Each cell not only means larger HTML files, it also means more processing time spent on computing cell-styles and cell-contents. Thick grey lines spanning across the whole section usually indicate elements that are off by less than one pixel.

These lines are produced by the Report Designer’s “View->Element Alignment Hints” feature. When this menu item is selected, you will get a better idea of how your report will look when exported into a table-export. If you cannot see the details clearly, zoom in; the Report Designer happily magnifies the working area for you.

When exported to HTML, this report created a whopping 35 columns and another 35 rows. That is potentially 1,225 cells. The resulting HTML file has a size of 21,438 bytes. For a report with just a few items of text, this is a lot.

In general you’ll want to avoid having too many of these boundaries. In basic design courses, teachers point out fairly early on that layouts where element edges are aligned look cleaner and more pleasing to the eye. When you look at adverts or magazines, you can see this in how articles and images sit along visual boundaries or dividing lines. For a well-designed report this is no different.

To help you design your reports in a well-aligned fashion, the Report Designer comes with the “View->Snap to Elements” feature.

To clean up a report, I usually start by aligning elements that sit close together. Visually, it makes no difference whether an element starts at x=24 or x=24.548. For the reporting engine it does make a difference, as a dumb little engine cannot decide whether the user was just lazy or had a very good reason to place a cell at exactly that position (or whether some visual design would break by attempting to blindly fix it).

With “Snap To Elements” enabled, just select one element and drag the misaligned edge until it snaps away from its current position, then move it back into place; this time it will snap to one of the other elements. If the edges are very close together, I drag the current edge towards the top (for the y-axis) or the left (for the x-axis) until it leaves the crowded area. When I return with it, it will snap to the first (top-most or left-most) edge in the group of elements by default. Repeat that with all elements that sit near the edge, and very soon you should see only one thin line indicating a perfect alignment.

Additionally, you can change the element’s position and size to integer numbers in the “style” table on the right-hand side of the Report Designer. When you do that for all elements, your alignment task becomes a lot easier: elements are then either aligned or at least one point apart (and the misalignment is easier to spot).

The quickly cleaned-up version of my sample report now has only 24 columns and 16 rows, yet visually you cannot tell the difference between the two. Theoretically, the resulting table can have 384 cells, a reduction to less than a third of the mis-aligned report’s 1,225 cells. And finally, the generated HTML file shrank to a mere 8,853 bytes, roughly 40% of the original size. In my experience, and with those numbers in mind, the computing time for this optimized report should be roughly 10% to 15% better than for the unoptimized version. In addition to that boost, your report will download faster and rendering it in the browser will be a lot quicker as well.

So remember, performance optimization starts in your report designer: When you optimize your report designs it instantly pays off with quicker rendering and smaller downloads.

Further optimization

That report uses line-elements to draw borders around the statistic sections. By using sub-bands with border definitions, that report could be simplified further, but I’ll leave that as an exercise for the class.

Spotlight on caching: LibLoader

When the Classic-Engine was started, we did not care much about resource-loading. Resource-loading problems were something that happened to other people, not us. When a resource was needed, we simply wrote some code to load the resource in place. We lived our happy life until, one day, we wanted to support image-loading.

Image-loading is easy, as long as you just use the AWT-Toolkit and its built-in capabilities. But with Pixie (our WMF-renderer) and the many other image libraries out there, we hit the wall for the first time. Now the resource-loading code that had been scattered all around the codebase backfired on us: either we copied the new image-handling code to every single occurrence of the old code, or we created some callable library code.

Luckily we chose the library path and created an image-factory. Our XML parser code was the same story: first a funny collection of random code, and the next moment a nice little library. Then came the Drawable-factory, so we now had three different resource-loader implementations.

With the dawn of the Flow-Engine, the numbers started to explode. DataSource-definitions, stylesheets, subreports – suddenly every piece was (potentially) loadable.

The walls started to move so that they could hit us from a better angle ..

On a parallel thread of events, deep inside the Pentaho platform, a new potential source of problems was hatched by the Pentaho engineers. Smart as they are (as long as they don’t fall into the ‘…, but it works’ pattern of creating horrible hacks, of course), they created their Platform to be independent of the underlying storage system. So in the Platform, you store your reports and their resources in a ‘SolutionRepository’, which can be anything from a filesystem-based solution to a relational database.

And that was the point where it clicked: what we need is simply a common, reusable and (hopefully) well-designed library that helps us with locating and loading resources. I’m bad at creating fancy names, so I simply called it ‘LibLoader’.

LibLoader is here to solve all the resource-loading problems we encountered in the past:

  • Resource loading is slow, so LibLoader adds caching to the IO-layer.
  • Resource creation is slow, so it also caches the actual parsing or resource-interpretation step.
  • Resource loading from different sources (filesystems, DB-repositories, network storages) is awfully complicated, so we address this as well by creating a common resource naming schema.
  • And last but not least: it must be lightweight and should not reinvent the wheel.

To add effective caching, we have to solve a couple of problems.

1. Make your cacheable resources identifiable

First, we have to create some sort of naming schema. For us, only two systems seem to be important: hierarchical storages, where your entries form a tree with parent-child relations (as in filesystems), and flat storages, where entries have no relation to each other (as in database storages). Names must offer at least some minimal interoperability: it must be possible to go from one naming system (like URLs) to another system (like the infamous solution repository). And finally, for hierarchical names there must be some way to construct names using relative paths (as this is crucial inside the reporting engine).
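To make that a bit more concrete, here is a minimal sketch of what such a naming abstraction could look like. The interface and method names are made up for illustration and are not the actual LibLoader API.

// Illustrative sketch only, not the real LibLoader classes.
public interface ResourceName
{
  // A string form of the name, e.g. an URL or a solution-repository path.
  public String toExternalForm ();
}

public interface HierarchicalResourceName extends ResourceName
{
  // The parent entry, or null if this name denotes the root.
  public HierarchicalResourceName getParent ();

  // Resolves a relative path (like "../images/logo.png") against this name.
  public HierarchicalResourceName resolve (String relativePath);
}

A flat storage would simply implement ResourceName without the hierarchical extension.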

2. Make it possible to detect changes in the underlying resources

That’s quite easy: every storage system has such a facility. Filesystems call it the ‘last modified timestamp’, CVS calls it a version, as does the Solution-repository. All we have to do is map it into a global scope. For that, we simply let the storage system implement a service interface, pushing the whole problem down to the low-level layer. Problem solved 🙂
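As a sketch (again with made-up names, not the real LibLoader interfaces), that service does not need to be more than this:

// Illustrative only: the storage backend maps its native change-marker
// (last-modified timestamp, revision number, ...) into an opaque version object.
public interface VersionedResourceData
{
  // Returns the backend's change-marker, or null if the backend cannot tell.
  public Object getVersion ();

  // Opens the raw data stream for this resource.
  public java.io.InputStream getResourceAsStream () throws java.io.IOException;
}

The cache then simply compares the stored version with the current one and discards the cached entry if they differ.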

3. Avoid unnecessary work

Parsing itself is also an expensive operation. XML processing is no fast job; even if you use streaming parsers like SAX, you still waste a lot of time on string processing and the construction of Attribute collections.

So if your resource can be stored safely (so it is protected from changes or it is immutable), LibLoader can also cache the parsing result for optimal performance.

4. Make it easy to use

For historical reasons, the Classic-Engine supports multiple report-definition descriptions. Our policy on parsing resources is simple: the user of our code should not have to care where a resource came from; the only thing they should have to care about is the result. In the Classic-Engine, and now in LibLoader, we implemented a resource-loading multiplexer. Sounds complicated, but its meaning is simple: for any resource that should be loaded, the library tries all known resource-handlers to interpret the raw data. If there is at least one implementation that handles the given data and is able to produce the requested resource-type from it, the resource will be loaded.
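Stripped down to its core, such a multiplexer is just a loop over all registered handlers. The following sketch uses made-up class names; the real LibLoader implementation is more involved, as it also has to deal with caching and dependencies.

// Illustrative sketch of a resource-loading multiplexer.
public interface ResourceHandler
{
  // Can this handler turn the given raw data into the requested type?
  public boolean canHandle (Object rawData, Class targetType);

  public Object create (Object rawData, Class targetType) throws Exception;
}

public class ResourceMultiplexer
{
  private java.util.ArrayList handlers = new java.util.ArrayList();

  public void register (ResourceHandler handler)
  {
    handlers.add(handler);
  }

  public Object create (Object rawData, Class targetType) throws Exception
  {
    // Try all known handlers; the first one that understands the data
    // and can produce the requested resource-type wins.
    for (int i = 0; i < handlers.size(); i++)
    {
      final ResourceHandler handler = (ResourceHandler) handlers.get(i);
      if (handler.canHandle(rawData, targetType))
      {
        return handler.create(rawData, targetType);
      }
    }
    throw new IllegalArgumentException("No handler found for " + targetType);
  }
}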

Now, if a new resource-type should be added, the implementor only has to care about the actual loading; caching and dependency management come at no additional cost.

The Art of Caching: How to be lazy in a good way

We all agree that avoiding work is a ‘Good Thing’ (tm). However, in most cases we can’t simply use the direct approach and tell the caller (in a more or less friendly way) to get lost. And if we look at common programming languages, most of them even lack the ‘tell to get lost’ operation.

So we have to find other ways.

The most common way of avoiding work is caching. Caching is the process of storing the results of an expensive computation for later reuse. Implementing caches therefore trades memory for computation time.

Caching can be done on all levels of a computation. First, we can cache the final result of an operation. Where this is not possible, we can fall back to cache intermediate results. And if even this path is blocked, we can try to collect hints to make later computations more efficient and cache those hints instead.

How can you tell whether caching makes sense in your code?

  1. Repeated computations: Caching only makes sense if the operation in question is repeatedly called with the same parameters. The more often the operation is repeated, the more we gain through caching. Adding caching to a one-time operation is a waste of time.
  2. Expensive computations: The more expensive a computation is, the more caching is justified. Even an operation that is executed only twice in your program’s lifetime may be a good target for caching, if its execution is expensive enough.

But of course, not everything is cacheable by default. There are a couple of prerequisites that must be fulfilled before you can think about implementing the cache:

  1. Deterministic operations without side-effects: If your application’s architecture uses a clean model and its operations are kept free of hidden side-effects, then you should have no trouble adding caches. But if you write spaghetti-code and rely heavily on global state, then adding caches will create an uncontrollable monster with funny, but hard to track, state-bugs. Good luck in that case; you will need it.
  2. Separable units: The operation in question must be a separate unit with one entry and one exit point. It should be a function in the mathematical sense: the result should depend only on the given input parameters. The function call must be repeatable, meaning the same input parameters always yield the same result; therefore the function must not modify global state.
  3. Clean inputs: The operation must have identifiable boundaries: the input parameters must be identifiable (which can be hard if you have to deal with a deeply nested object-structure), and it must be possible to compare the parameters at as little cost as possible. The fewer parameters the operation has, and the easier it is to compare them, the more efficient the caching will be.
  4. Clean output: The computed result must be an atomic unit. The result that is stored in the cache must retain its state and must not be modifiable from outside. Ideally, the result is an immutable object. If that’s not possible, the result should allow creating an independent copy of itself (Java, for instance, offers the clone() method for that purpose).

In object-oriented systems, caching can be added easily. As classes are treated as black boxes (we know which operations an object offers, but we don’t need to know how these operations are implemented), we can rely on polymorphism to substitute the original implementation with a cache. One of the easiest ways to implement caching is to implement the cache as a proxy for the original implementation.

Let’s look at some code to make this clearer.

public interface ExpensiveOperationModule
{
  public Object compute (Object parameter);
}

public class ExpensiveOperationModuleImpl
  implements ExpensiveOperationModule
{
  public Object compute (Object parameter)
  {
    final long startTime = System.currentTimeMillis();
    while (System.currentTimeMillis() < (startTime + 100000))
    {
      // this loop simulates a long running computation.
      // sit back and wait while we wait for the computation
      // to be finished.
      // This loop will take 100 seconds to complete and
      // will bring your CPU usage to 100% to show you how
      // expensive it is.
    }
    return String.valueOf (parameter);
  }
}

// lets run it
ExpensiveOperationModule module =
new ExpensiveOperationModuleImpl();
System.out.println ("Start");
module.compute (new Integer (10));
System.out.println ("One");
module.compute (new Integer (10));
System.out.println ("Two");

As you can see in the implementation, the computation is a long-running task. Each call takes a painful 100 seconds. Let’s cache that.

public class CachingExpensiveOperationModule
  implements ExpensiveOperationModule
{
  private ExpensiveOperationModule parent;
  private Object lastParameter;
  private Object lastResult;

  public CachingExpensiveOperationModule (ExpensiveOperationModule parent)
  {
    this.parent = parent;
  }

  public Object compute (Object parameter)
  {
    if (lastParameter != null)
    {
      if (lastParameter.equals(parameter))
      {
        return lastResult;
      }
    }
    lastResult = parent.compute(parameter);
    lastParameter = parameter;
    return lastResult;
  }
}

// lets run this one
ExpensiveOperationModule module =
new CachingExpensiveOperationModule
(new ExpensiveOperationModuleImpl());
System.out.println ("Start");
module.compute (new Integer (10));
System.out.println ("One");
module.compute (new Integer (10));
System.out.println ("Two");

Except for the first line, the main method has not changed. But with the caching in place, the second call to ‘compute()’ returns immediately, as the result has already been computed. Of course, this example is very simple and does not work efficiently when a more complex call-sequence is used (for example, alternating parameters: compute(10), compute(20), compute(10), …).

In most cases, caching can be seen as a simple key => value lookup, where the input parameters of the computation act as the key and the computation result forms the value. All serious programming languages offer a Map or Dictionary collection type, which maps a key to a value. When using Java, the built-in HashMap is almost always a good Map implementation that offers fast lookups.
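As a sketch, the proxy from above can be rewritten on top of a HashMap so that it also survives alternating parameters. This is still a toy implementation: it grows without bounds and is not thread-safe.

public class MapCachingExpensiveOperationModule
  implements ExpensiveOperationModule
{
  private ExpensiveOperationModule parent;
  // The input parameter acts as the key, the computed result as the value.
  private java.util.HashMap results = new java.util.HashMap();

  public MapCachingExpensiveOperationModule (ExpensiveOperationModule parent)
  {
    this.parent = parent;
  }

  public Object compute (Object parameter)
  {
    if (results.containsKey(parameter))
    {
      return results.get(parameter);
    }
    final Object result = parent.compute(parameter);
    results.put(parameter, result);
    return result;
  }
}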

But a word of warning: when implementing cache-lookup strategies, make sure that these lookups are as inexpensive as possible. Caching makes no sense if the cache lookup becomes more expensive than the operation that should be cached. A huge and complicated key will be a lot more expensive to compare (and therefore to look up) than a simple key.

The Quest for Performance

The new layouter implementation once again sent me on the quest for more performance.

I’m sure we all know the three basic principles of performant job execution:

(1) Cache: The best job is the one you don’t have to do at all.
(2) Algorithms: If you can’t avoid the work, do it as quickly as possible, using the best algorithms available.
(3) Resources: Don’t make a mess while doing your job. You’ll have to clean up afterwards, and that costs time too.

After applying these principles, the layouter implementation is now three times faster than before (400 seconds pagination time vs. 120 seconds with the new layouter for 200,000 rows). However, this is still bad compared to the 75 seconds of the old layouter. But that lazy beast had mastered the art of caching: it avoided almost all real work by using its extensive caches.

A new layout algorithm for the Classic-Engine

When you’re afraid of touching your old code, it is a good time for a rewrite. The code I was most afraid of was buried deep inside the Layout-Controller.

The development of the Classic-Engine’s layout-engine was purely driven by need, with an almost obscene lack of planning. However, it worked surprisingly well: the engine was swift and fast. Even when the layout-manager implementation was almost out of control (better not to touch it, but still working), we were able to get more performance out of it.

Stephane Grenier, founder and author of LandLordMax, was truly amazed by the performance gains we achieved over the last year. To quote his comment from his blog:

In regards to performance, as far as I understood the biggest gains were in upgrading to 0.8.7-10 I even posted a message to the forum about this. Based on the performance differences, and the fact that version 0.8 (based on information from their head developer) will eventually be able to print combined reports (which is great if you want to print multiple invoices directly within the invoice list) we definitely upgraded to the latest 0.8 version.

But in JFreeReport 0.8.8, subreporting caused some major changes in the layouting system. If you add new features, you always pay a price for them.

Fixing this innocent little bug #1681595 is more difficult than expected. A quick and dirty fix would tear down the remaining architecture; in the end we might have fixed one bug, but we would have laid the foundation for a whole new series of bugs. No! Heading down this path would have been stupid and a recipe for doom and desperation.

But: we had already stolen the subreport-implementation from the Flow-Engine. And once you’ve started stealing, you can’t stop halfway. LibLayout, the layouting system of that engine, looks nice too …

During the last week, I’ve ported down and modified the rendering pipeline of LibLayout. The Classic-Engine has no need for CSS-StyleSheets, and it does not (and, for compatibility reasons, cannot) make use of the DOM-oriented architecture of LibLayout. So all we need is the last stage of the rendering process: the part where abstract elements turn into content.

So far, the extracted renderer is working and produces reports with the new layouting system. With this layouter, many of the most hated restrictions of the old engine will be history:

  • Bands can span multiple pages.
  • Text-Elements can produce formatted text. A single text element can use several different fonts and font-styles.
  • Text-Elements can contain inline-images.
  • Dynamic-Height Text-Elements will be considerably faster.
  • The layouter now has documented rules for how the layout is computed.

For me, the last point is the most important one. It took me several days to understand how the old layouting system really works. It is easy if all elements are static, but as soon as one element uses relative positioning (percentages instead of absolute values), things become very messy. With the new system in place, the report layout finally behaves deterministically.

But you always pay a price …

As the layouter now uses different principles to perform its job, the internal API has changed massively. The old OutputTargets are almost gone; right now they are dummy classes to maintain some backward compatibility. All of the old layouter implementation classes are gone, and only a few will continue to exist as dummy or backward-compatibility classes.

There will be border-cases where the new layouting engine does not behave exactly like the old one. Everyone who relied on the old behaviour that bands would not span multiple pages will now have to change their report definitions.

Functions which modify the page-footer or a repeating group-footer during the page-finished event will produce different results now. In the old days, such changes never changed the page-break position; now this can happen. If the page-footer grows, it may push the last band onto the next page. However, it is still better not to change the footer-structure during this event.

Well, the fact that this two-week-old implementation is, right now, slower than the old one should surprise no one. Once we have the same insane amount of caching and heuristics in place, we shall get back to our old performance highs.

One final word: once this implementation is complete, the next release of the Classic-Engine will be called ‘JFreeReport-0.8.9’. One step closer to a ‘1.0’ version.