Spotlight on caching: LibLoader

When the Classic-Engine was started, we did not care much about resource-loading. Resource-loading patterns are something that happens to other people, not us. When a resource was needed, we simply wrote some code to load the resource in place. We lived our happy life until, one day, we wanted to support image-loading.

Image-loading is easy, as long as you just use the AWT-Toolkit and its built-in capabilities. But with Pixie (our WMF-renderer) and the many other image libraries out there, we hit the wall for the first time. Now the resource-loading code that had been scattered all around the codebase backfired on us. Either we would copy the new image-handling code to every single occurrence of the old code, or we would create some callable library code.

Luckily we chose the library path and created an image-factory. Our XML parser code was the same story – first a funny collection of random code, and in the next moment a nice little library. Then came the Drawable-factory, so we then had three different resource-loader implementations.

With the dawn of the Flow-Engine, the numbers started to explode. DataSource-definitions, stylesheets, subreports – suddenly every piece was (potentially) loadable.

The walls started to move so that they could hit us from a better angle ..

On a parallel thread of events, deep inside the Pentaho platform, a new potential source of problems was hatched by the Pentaho engineers. Smart as they are (as long as they don't fall into the ‘.., but it works’ pattern of creating horrible hacks, of course), they designed their Platform to be independent of the underlying storage system. So in the Platform, you store your reports and their resources in a ‘SolutionRepository’, which can be anything from a filesystem-based solution to a relational database.

And actually, that was the point where it clicked. What we need is simply a common, reusable and (hopefully) well-designed library that helps us with locating and loading resources. I’m bad at creating fancy names, so I simply called it ‘LibLoader’.

LibLoader is here to solve all the resource-loading problems we have encountered in the past:

  • resource loading is slow, so it adds caching to the IO-layer
  • resource creation is slow, so it caches the actual parsing or resource-interpretation step
  • resource loading from different sources, ranging from filesystems and database repositories to network storage, is awfully complicated, so we address this as well by creating a common resource-naming schema.
  • And last but not least: It must be lightweight and should not reinvent the wheel.

To add effective caching, we have to solve a couple of problems.

1. Make your cacheable resources identifiable

First, we have to create some sort of naming schema. For us, there are only two systems that seem to be important: hierarchical storages, where your entries form a tree with parent-child relations (like in filesystems), and flat storages, where entries have no relation to each other (like in database storages). Names must have at least some minimal interoperability: it must be possible to go from one naming system (like URLs) to another system (like the infamous solution repository). And finally, for hierarchical names, there must be some way to construct names using relative paths (as this is crucial inside the reporting engine).
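To make the relative-path requirement a bit more concrete, here is a minimal sketch in Java. It is not the LibLoader API: the class `SketchKey`, its fields and the `derive` method are hypothetical names invented for this illustration, and the example URL is made up. The only real library call is `java.net.URI.resolve`, which does the relative-path arithmetic for hierarchical names. Converting between naming systems is left out.

```java
import java.net.URI;
import java.util.Objects;

// Hypothetical sketch, not the LibLoader API: a name is a schema plus an identifier.
final class SketchKey {
    final String schema;        // e.g. "url" for a tree-like storage, "db" for a flat one
    final String identifier;    // path-like for hierarchical schemas, opaque id otherwise
    final boolean hierarchical; // only hierarchical names support relative paths

    SketchKey(String schema, String identifier, boolean hierarchical) {
        this.schema = Objects.requireNonNull(schema);
        this.identifier = Objects.requireNonNull(identifier);
        this.hierarchical = hierarchical;
    }

    // Builds a new name relative to this one, e.g. an image next to a report.
    SketchKey derive(String relativePath) {
        if (!hierarchical) {
            throw new UnsupportedOperationException(
                "flat schema '" + schema + "' has no relative names");
        }
        String resolved = URI.create(identifier).resolve(relativePath).toString();
        return new SketchKey(schema, resolved, true);
    }

    // Keys must be usable as cache keys, so equality is value-based.
    @Override
    public boolean equals(Object other) {
        if (!(other instanceof SketchKey)) {
            return false;
        }
        SketchKey that = (SketchKey) other;
        return schema.equals(that.schema) && identifier.equals(that.identifier);
    }

    @Override
    public int hashCode() {
        return Objects.hash(schema, identifier);
    }

    @Override
    public String toString() {
        return schema + ":" + identifier;
    }
}

class SketchKeyDemo {
    public static void main(String[] args) {
        SketchKey report = new SketchKey("url", "http://example.com/reports/sales.xml", true);
        System.out.println(report.derive("images/logo.png"));
        System.out.println(report.derive("../styles/base.css"));
    }
}
```

Because equality is value-based, two keys derived along different relative paths to the same resource end up as the same cache key, which is what a cache needs.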

2. Make it possible to detect changes in the underlying resources

That's quite easy: every storage system has such a facility. Filesystems call it the ‘last modified timestamp’, CVS calls it a version, as does the Solution-repository. All we have to do is to map it into a global scope. For that, we simply let the storage system implement a service interface. So we pushed the whole problem down to the low-level layer. Problem solved 🙂
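A rough sketch of that idea in Java, under assumptions: the interface `VersionedStore`, the cache `RawDataCache` and the in-memory backend `InMemoryStore` are hypothetical names for this post, not LibLoader classes, and the version is assumed to fit into a single `long`. The backend answers "which version is this?", and the cache re-reads the bytes only when that answer changes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical service interface that each storage backend would implement.
interface VersionedStore {
    long versionOf(String name); // timestamp, revision number, ... mapped to one long
    byte[] read(String name);
}

// Caches raw bytes and reloads only when the store reports a different version.
final class RawDataCache {
    private static final class Entry {
        final long version;
        final byte[] data;

        Entry(long version, byte[] data) {
            this.version = version;
            this.data = data;
        }
    }

    private final VersionedStore store;
    private final Map<String, Entry> entries = new HashMap<>();
    int loads = 0; // counts real reads; for demonstration only

    RawDataCache(VersionedStore store) {
        this.store = store;
    }

    byte[] get(String name) {
        long current = store.versionOf(name);
        Entry cached = entries.get(name);
        if (cached == null || cached.version != current) {
            cached = new Entry(current, store.read(name));
            entries.put(name, cached);
            loads++;
        }
        return cached.data;
    }
}

// Toy backend for the demo: every write bumps the version by one.
final class InMemoryStore implements VersionedStore {
    private final Map<String, byte[]> data = new HashMap<>();
    private final Map<String, Long> versions = new HashMap<>();

    void put(String name, byte[] bytes) {
        data.put(name, bytes);
        versions.merge(name, 1L, Long::sum);
    }

    @Override
    public long versionOf(String name) {
        return versions.getOrDefault(name, 0L);
    }

    @Override
    public byte[] read(String name) {
        return data.get(name);
    }
}
```

Asking for the version still costs one call to the backend per lookup, but that is usually far cheaper than reading the whole resource again.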

3. Avoid unnecessary work

Parsing itself is also an expensive operation. XML processing is no fast job – and even if you use streaming parsers like SAX you waste a lot of time doing all the string processing and construction of Attribute-Collections.

So if your resource can be stored safely (so it is protected from changes or it is immutable), LibLoader can also cache the parsing result for optimal performance.
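Sketched in Java, and again with made-up names (`ParsedCache`, `ParsedCacheDemo` and the line-splitting stand-in parser are assumptions for illustration, not LibLoader code), caching the parse result could look like this. The parsed object is kept together with the version it was built from, and the demo parser returns an unmodifiable list so that sharing the cached instance is safe.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch: remembers the parsed object per name and version.
final class ParsedCache<T> {
    private static final class Entry<T> {
        final long version;
        final T value;

        Entry(long version, T value) {
            this.version = version;
            this.value = value;
        }
    }

    private final Map<String, Entry<T>> entries = new HashMap<>();
    private final Function<byte[], T> parser;
    int parses = 0; // counts real parser runs; for demonstration only

    ParsedCache(Function<byte[], T> parser) {
        this.parser = parser;
    }

    // The raw data is only requested from the supplier when a parse is needed.
    T get(String name, long version, Supplier<byte[]> raw) {
        Entry<T> cached = entries.get(name);
        if (cached == null || cached.version != version) {
            cached = new Entry<>(version, parser.apply(raw.get()));
            entries.put(name, cached);
            parses++;
        }
        return cached.value;
    }
}

class ParsedCacheDemo {
    // Stand-in for an expensive parser: splits text into an unmodifiable list of lines.
    static List<String> parseLines(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return Collections.unmodifiableList(Arrays.asList(text.split("\n")));
    }

    public static void main(String[] args) {
        ParsedCache<List<String>> cache = new ParsedCache<>(ParsedCacheDemo::parseLines);
        byte[] raw = "first\nsecond".getBytes(StandardCharsets.UTF_8);
        cache.get("report", 1L, () -> raw);
        cache.get("report", 1L, () -> raw); // same version: served from the cache
        System.out.println("parser runs: " + cache.parses);
    }
}
```

If the parsed object were mutable and handed out to several callers, one caller could corrupt the cached copy for everyone else, which is why the safe-storage condition matters.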

4. Make it easy to use

For historical reasons, Classic-Engine supports multiple report-definition descriptions. Our policy on parsing resources is simple: the user of our code should not have to care where a resource came from; the only thing he should have to care about is the result. In Classic-Engine, and now in LibLoader, we implemented a resource-loading multiplexer. Sounds complicated, but its meaning is simple: for any resource that should be loaded, the library tries all known resource-handlers to interpret the raw data. So if there is at least one implementation that handles the given data and is able to produce the requested resource-type from it, the resource will be loaded.
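The multiplexer idea fits into a few lines of Java. This is a sketch under assumptions, not the real implementation: `SketchHandler`, `SketchMultiplexer` and the two toy handlers (one that understands decimal integers, one that accepts any text) are hypothetical and only show the "try every handler until one produces the requested type" loop.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical handler contract: return a result only if the raw data is understood.
interface SketchHandler {
    Optional<Object> tryCreate(byte[] raw);
}

final class SketchMultiplexer {
    private final List<SketchHandler> handlers = new ArrayList<>();

    void register(SketchHandler handler) {
        handlers.add(handler);
    }

    // Tries the handlers in registration order; the first one that understands the
    // data AND produces the requested type wins.
    <T> T load(byte[] raw, Class<T> target) {
        for (SketchHandler handler : handlers) {
            Optional<Object> result = handler.tryCreate(raw);
            if (result.isPresent() && target.isInstance(result.get())) {
                return target.cast(result.get());
            }
        }
        throw new IllegalArgumentException(
            "no handler produced a " + target.getSimpleName());
    }
}

class SketchMultiplexerDemo {
    static String text(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    static SketchMultiplexer create() {
        SketchMultiplexer mux = new SketchMultiplexer();
        // Toy handler 1: understands decimal integers only.
        mux.register(raw -> {
            try {
                return Optional.of(Integer.valueOf(text(raw).trim()));
            } catch (NumberFormatException e) {
                return Optional.empty();
            }
        });
        // Toy handler 2: accepts any data and hands it back as text.
        mux.register(raw -> Optional.of(text(raw)));
        return mux;
    }

    public static void main(String[] args) {
        SketchMultiplexer mux = create();
        byte[] raw = "42".getBytes(StandardCharsets.UTF_8);
        System.out.println(mux.load(raw, Integer.class) + 1);
        System.out.println(mux.load(raw, String.class).length());
    }
}
```

The caller only names the type it wants; which handler did the work stays an internal detail, so adding a new format means registering one more handler.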

Now, if a new resource-type should be added, the implementor only has to care about the actual loading; caching and dependency management come at no additional cost.

This entry was posted in Development, Performance by Thomas.

About Thomas

After working as all-hands guy and lead developer on Pentaho Reporting for over a decade, I have learned a thing or two about report generation, layouting and general BI practices. I have witnessed the remarkable growth of Pentaho Reporting from a small niche product to an enterprise-class Business Intelligence product. This blog documents my own perspective on Pentaho Reporting's development process and our steps towards upcoming releases.