Monthly Archives: May 2007

Dead and buried – now back alive

No, I’m not talking about this blog, which also went dead quiet during the last month ..

Back in December, after the release of the Flow-Engine (JFreeReport 0.9), I declared the classic engine dead. I had wanted to let it die right after the release of version 0.8.8.

I dug its grave, called the carpenter to make a cheap coffin for it, went into the neighbour’s garden to steal some flowers (you can’t have a burial without flowers, right?) and shovelled earth onto it.

But like every zombie, the thing came back from its grave. Bug-fix after bug-fix went into it. I’m quite sure that somewhere in the swamps of Florida there’s also a voodoo witch (paid by Pentaho, of course :)) who cast her own dark magic to resurrect the thing. So finally it was back alive.

I’ve seen enough B- and C-Movies to know how to handle such a situation.

In April I drew the final card and ripped out the heart of the engine. Blood everywhere. Organs and meat flew around – just think of the worst horror movie you’ve ever seen, multiply it by the accumulated evil of seven years of unrestrained development, and you come close to what this monster had become. But the parts kept moving. They crawled and reunited – and finally reached the jars where I kept the replacement parts of the Flow-Engine. And in true horror I watched as this old and undead monster ate the very heart of the Flow-Engine and implanted the renderer of that sacred child into itself.

And now the classic engine is back! Stronger, bigger, and more evil than ever. It now shares major parts with the Flow-Engine, and whenever one of them improves, the other receives the good as well.

Watch out for the final version of Pentaho Reporting Classic 0.8.9 coming to a network node near you at the end of the month.

The Art of Caching: How to be lazy in a good way

We all agree that avoiding work is a ‘Good Thing’(tm). However, in most cases we can’t simply use the direct approach and tell the caller (in a more or less friendly way) to get lost. And if we move to common programming languages, most of them even lack the ‘Tell to get lost’ operation.

So we have to find other ways.

The most common way of avoiding work is caching. Caching is the process of storing the results of an expensive computation for later reuse. Implementing caches therefore trades memory for computation time.

Caching can be done on all levels of a computation. First, we can cache the final result of an operation. Where this is not possible, we can fall back to cache intermediate results. And if even this path is blocked, we can try to collect hints to make later computations more efficient and cache those hints instead.

How can you tell whether caching makes sense in your code?

  1. Repeated computations: Caching only makes sense if the operation in question is repeatedly called with the same parameters. The more often the operation is repeated, the more we gain through caching. Adding caching to a one-time operation is a waste of time.
  2. Expensive computations: The more expensive a computation is, the more caching is justified. Even an operation that is executed only twice in your program’s lifetime may be a good target for caching, if its execution is expensive enough.

But of course, not everything is cacheable by default. There are a couple of prerequisites that must be fulfilled before you can think about implementing a cache:

  1. Deterministic operations without side-effects: If your application’s architecture uses a clean model and its operations are kept free of hidden side-effects, then you should have no trouble adding caches. But if you write spaghetti code and rely heavily on global state, then adding caches will create an uncontrollable monster with funny but hard-to-track state bugs. Good luck in that case – you will need it.
  2. Separable units: The operation in question must be a separate unit with one entry and one exit point. It should be a function in the mathematical sense: its result should depend only on the given input parameters. The function call must be repeatable – the same input parameters must always yield the same result. Therefore the function must not modify global state.
  3. Clean inputs: The operation must have identifiable boundaries: the input parameters must be identifiable (which can be hard if you have to deal with a deeply nested object structure), and it must be possible to compare the parameters as cheaply as possible. The fewer parameters the operation has, and the easier they are to compare, the more efficient the caching will be.
  4. Clean output: The computed result must be an atomic unit. The result that is stored in the cache must retain its state and must not be modifiable from outside. Ideally, the result is an immutable object. If that’s not possible, the result should allow creating an independent copy of itself (Java, for instance, offers the clone() method for that purpose).
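To illustrate the ‘clean output’ rule, here is a minimal sketch of a result object that cannot be modified from outside. The class name is my own invention, not something from the engine: it makes a defensive copy of the incoming data and only hands out an unmodifiable view, so the state kept in the cache stays intact.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical result object for illustration only: the cached
// state is copied on the way in and wrapped on the way out, so
// neither the original caller nor later consumers can change it.
public class CachedResult
{
  private final List<String> values;

  public CachedResult (List<String> values)
  {
    // defensive copy on the way in
    this.values = new ArrayList<String>(values);
  }

  public List<String> getValues ()
  {
    // unmodifiable view on the way out
    return Collections.unmodifiableList(values);
  }
}
```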

In object-oriented systems, caching can be added easily. As classes are treated as black boxes (we know which operations an object offers, but we don’t need to know how those operations are implemented), we can rely on polymorphism to substitute the original implementation with the cache. One of the easiest ways to implement caching is to implement the cache as a proxy for the original implementation.

Let’s look at some code to make this clearer.

public interface ExpensiveOperationModule
{
  public Object compute (Object parameter);
}

public class ExpensiveOperationModuleImpl
  implements ExpensiveOperationModule
{
  public Object compute (Object parameter)
  {
    final long startTime = System.currentTimeMillis();
    while (System.currentTimeMillis() < (startTime + 100000))
    {
      // This loop simulates a long-running computation.
      // Sit back and relax while we wait for it to finish.
      // It will take 100 seconds to complete and will drive
      // your CPU usage to 100% to show you how expensive it is.
    }
    return String.valueOf (parameter);
  }
}

// let's run it
ExpensiveOperationModule module =
    new ExpensiveOperationModuleImpl();
System.out.println ("Start");
module.compute (new Integer (10));
System.out.println ("One");
module.compute (new Integer (10));
System.out.println ("Two");

As you can see in the implementation, the computation is a long-running task. Each call takes a painful 100 seconds. Let’s cache that.

public class CachingExpensiveOperationModule
  implements ExpensiveOperationModule
{
  private ExpensiveOperationModule parent;
  private Object lastParameter;
  private Object lastResult;

  public CachingExpensiveOperationModule
      (ExpensiveOperationModule parent)
  {
    this.parent = parent;
  }

  public Object compute (Object parameter)
  {
    if (lastParameter != null)
    {
      if (lastParameter.equals(parameter))
      {
        return lastResult;
      }
    }
    lastResult = parent.compute(parameter);
    lastParameter = parameter;
    return lastResult;
  }
}

// let's run this one
ExpensiveOperationModule module =
    new CachingExpensiveOperationModule
        (new ExpensiveOperationModuleImpl());
System.out.println ("Start");
module.compute (new Integer (10));
System.out.println ("One");
module.compute (new Integer (10));
System.out.println ("Two");

Except for the first lines, the main method has not changed. But with the caching in place, the second call to ‘compute()’ returns immediately, as the result was computed previously. Of course, this example is very simple and does not work efficiently once a more complex call sequence is used (for example, alternating parameters: compute(10), compute(20), compute(10) ..).

In most cases, caching can be seen as a simple Key => Value lookup, where the input parameters of the computation act as the key and the computation result forms the value. All serious programming languages offer a Map or Dictionary collection type, which maps a key to a value. In Java, the built-in HashMap is almost always a good Map implementation, offering fast lookups.
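As a sketch of that idea, the proxy from above can be generalized with a HashMap so that alternating call sequences still hit the cache. The class name is my own; the interface is repeated here only so the snippet stands on its own.

```java
import java.util.HashMap;
import java.util.Map;

// The interface from the example above, repeated so this
// sketch compiles on its own.
interface ExpensiveOperationModule
{
  public Object compute (Object parameter);
}

// Hypothetical map-backed variant of the caching proxy: instead of
// remembering only the last parameter, it keeps every result it has
// seen, so sequences like compute(10), compute(20), compute(10)
// still produce cache hits.
public class MapCachingExpensiveOperationModule
  implements ExpensiveOperationModule
{
  private final ExpensiveOperationModule parent;
  private final Map<Object, Object> cache = new HashMap<Object, Object>();

  public MapCachingExpensiveOperationModule (ExpensiveOperationModule parent)
  {
    this.parent = parent;
  }

  public Object compute (Object parameter)
  {
    if (cache.containsKey(parameter))
    {
      return cache.get(parameter);
    }
    final Object result = parent.compute(parameter);
    cache.put(parameter, result);
    return result;
  }
}
```

Note that an unbounded map like this grows forever; in a real application you would limit its size or use an eviction strategy.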

But a word of warning: when implementing cache-lookup strategies, make sure that these lookups are as inexpensive as possible. Caching makes no sense if the cache lookup becomes more expensive than the operation being cached. A huge and complicated key will be a lot more expensive to compare (and therefore to look up) than a simple one.
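One common trick to keep lookups cheap is to precompute the key’s hash code once, up front. The following key class is a hypothetical example (the names are my own invention, not from any real reporting API): every HashMap lookup then only compares ints before falling back to the more expensive string comparison.

```java
// Hypothetical composite cache key for illustration: the hash code
// is computed once in the constructor, so repeated HashMap lookups
// do not have to re-hash the components every time, and equals()
// does the cheap int comparisons first.
public final class ReportCacheKey
{
  private final String reportName;
  private final int pageNumber;
  private final int hash;

  public ReportCacheKey (String reportName, int pageNumber)
  {
    this.reportName = reportName;
    this.pageNumber = pageNumber;
    this.hash = reportName.hashCode() * 31 + pageNumber;
  }

  public int hashCode ()
  {
    return hash;
  }

  public boolean equals (Object o)
  {
    if (this == o)
    {
      return true;
    }
    if (!(o instanceof ReportCacheKey))
    {
      return false;
    }
    final ReportCacheKey that = (ReportCacheKey) o;
    // cheap comparisons first: the precomputed hash and the int field
    if (hash != that.hash || pageNumber != that.pageNumber)
    {
      return false;
    }
    return reportName.equals(that.reportName);
  }
}
```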