Understanding and tuning the Java garbage collector

While working on one of my many side-projects, I stumbled across a very complete and understandable explanation of the various garbage collector algorithms used inside the JDK and their performance impacts.

If you want to squeeze the most out of your BI-server or Carte server installations, the article “Java Garbage Collectors distilled” at the Mechanical Sympathy blog may be worth a read.

Blazing fast data exports with Pentaho Reporting 5.1

Ever felt that getting the Pentaho BI-server to spit out CSV or Excel files should be faster? As unbelievable as it sounds, with case PRD-4921 closed, we have up to 5 (in words: five!) times faster exports now.

It’s one of those occasions where talking about customer problems sparks a wacky idea that falls on fertile ground.

Many customers use the interactive reporting tools (Pentaho Interactive Reporting, Saiku Reporting or the ugly Adhoc Reporting (WAQR)) to get data out of the data-warehouse into Excel or CSV files. Such reports are usually rather simple list reports, with no fancy charting or complex layout structures. However, the reporting engine does not know that; its pessimistic nature always assumes the worst.

The Pentaho Reporting engine allows an insane degree of freedom, and via custom report functions, it allows you to reconfigure the report on the fly while the report is running. But with that freedom, we can no longer make any assumptions about how a report will look in the next row. Thus the engine, and the layout subsystem, assume nothing and (apart from a bit of caching of reusable bits) recalculate everything from scratch.

Which takes time.

With PRD-4921, I added a fast-mode to some of the export types (Stream CSV, Stream HTML and XLS/XLSX). The new exporters check whether the report uses only ‘safe’ features, and if so, switch to a template-based output instead of using the full layouting.

A report is safe if it does not contain any of the following items:

  • inline subreports. They are evil, as they can appear anywhere and can be of any complexity.
  • crosstabs. They are a complex layout and can’t be easily condensed into templated sections.
  • functions that listen for page-events, other than the standard page-function and page-of-pages functions. During fast-mode, we don’t generate page events and thus these won’t output correct values. I am willing to ignore page functions, as data exports are less concerned about page numbers.
  • any layout-processor-function other than the Row-Banding function. They exist to rewrite the report, which stops us from making assumptions about the report’s structure.
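The checks above can be sketched in a few lines. This is an illustrative Java sketch only – the class and field names are made up for explanation purposes and do not reflect the actual engine internals:

```java
import java.util.List;

// Illustrative sketch of the fast-mode safety check; names are hypothetical.
public class FastExportSafetyCheck {
    // Hypothetical report model: just the bits the check cares about.
    public static class ReportModel {
        public boolean hasInlineSubreports;
        public boolean hasCrosstabs;
        public List<String> pageEventListeners;       // names of functions listening for page events
        public List<String> layoutProcessorFunctions; // functions that may rewrite the report
    }

    /** Returns true if the report uses only features the template-based fast path can handle. */
    public static boolean isSafeForFastMode(ReportModel report) {
        if (report.hasInlineSubreports) return false; // can appear anywhere, any complexity
        if (report.hasCrosstabs) return false;        // too complex to condense into templates
        for (String fn : report.pageEventListeners) {
            // only the standard page functions are tolerated; fast-mode emits no page events
            if (!fn.equals("page") && !fn.equals("page-of-pages")) return false;
        }
        for (String fn : report.layoutProcessorFunctions) {
            // anything that rewrites the report invalidates our structural assumptions
            if (!fn.equals("row-banding")) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        ReportModel plainList = new ReportModel();
        plainList.pageEventListeners = List.of("page");
        plainList.layoutProcessorFunctions = List.of("row-banding");
        System.out.println(isSafeForFastMode(plainList)); // → true: a plain list report qualifies
    }
}
```

A typical data-export report – one big item band, row banding, no subreports – sails through all four checks.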

If a report is not safe, the engine falls back to the normal, slow mode. You now just have to wait a bit longer to get your data, but you won’t get sudden service interruptions.

For fast reports, the engine produces a template of each root-level band. If the style of a band changes over time (as a result of having style-expressions), we produce a template for each unique combination of styles the reporting engine encounters.

Once the engine has a valid template, it can skip all layouting on all subsequent rows of data and can just fill in the data into the template’s place holders. The resulting output is exactly the same as the slow output – minus the waiting time.
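The caching idea can be sketched as follows. This is a minimal illustration, not engine API – band names, style signatures and the “template” strings are all stand-ins:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the template-caching idea: one template per unique
// style combination per root-level band. Names are illustrative.
public class BandTemplateCache {
    private final Map<String, String> templates = new HashMap<>();
    private int layoutRuns = 0;

    /** Pretend "layouting": expensive, so we only want to run it once per style combination. */
    private String layoutBand(String bandName, String styleSignature) {
        layoutRuns++;
        return bandName + "[" + styleSignature + "]{${value}}"; // template with a placeholder
    }

    /** Returns a cached template, laying out only when this style combination is new. */
    public String templateFor(String bandName, String styleSignature) {
        return templates.computeIfAbsent(bandName + "|" + styleSignature,
                key -> layoutBand(bandName, styleSignature));
    }

    /** Fill the placeholder with the current row's data - no layouting involved. */
    public String print(String bandName, String styleSignature, String value) {
        return templateFor(bandName, styleSignature).replace("${value}", value);
    }

    public int getLayoutRuns() { return layoutRuns; }

    public static void main(String[] args) {
        BandTemplateCache cache = new BandTemplateCache();
        // 10 rows, but only two style signatures (plain row banding):
        for (int row = 0; row < 10; row++) {
            cache.print("itemband", row % 2 == 0 ? "even" : "odd", "row " + row);
        }
        System.out.println("layout runs: " + cache.getLayoutRuns()); // → layout runs: 2
    }
}
```

With simple row banding, 65k rows trigger only two layout passes; every other row is a string-fill operation.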

So how does this system perform? Here is what my system produces using a 65k-row report (to stay within the limits imposed by Excel 97) with 35 columns of data exported. The report has no groups; it is just one big fat stream of data. All times are given in seconds.

Export       5.1 with fix   5.0 with fix   5.0 GA
Fast-CSV              4.5            5.4        -
CSV                  25.8           24.5     24.8
Fast-XLS             11.7           11.3        -
XLS                  53.2           51.3    213.4
Fast-XLSX            31.3           37.7        -
XLSX                 86.0           82.8    232.4
Fast-HTML            10.0           11.1        -
HTML                 42.9           43.5     44.9
PDF                  66.7           69.2     66.4

As you can see from the data, the fix gave a 4 to 5 times speed up for HTML and CSV exports. The Excel exports were extra slow in 5.0 (and 4.x), and a few fixes in the layout handling and Excel specific exports gave the ‘normal’ mode a speed-up of 3 to 4 times. On top of that, we now have the fast mode, that gives another 2-3 times more raw speed.

Not bad for one week of frantic coding, I guess.

Go grab the 5.1 CI builds to give it a go. You will need an updated BI-server reporting plugin to make the BI-server (and thus the ad-hoc reporting tools) pick up that change.

The 5.0 branch does not have those changes in, so don’t even try the CI builds for it. As the 5.0 codeline is in lock-down for bug-fixes only, these performance improvements will take a while to go in, as we don’t want to introduce regressions that break systems in production.

Book Review: Pentaho Data Integration Cookbook (2nd Edition)

After my review for the newest Pentaho Reporting book, Packt Publishing asked me to write a review for the latest Pentaho Data Integration Cookbook as well, which just came out in December.

Pentaho Data Integration (PDI) is Pentaho’s answer to overpriced and proprietary ETL tools. In a Business Intelligence setting, you use ETL tools like PDI to populate your data warehouse, and outside of that, PDI is a Swiss army knife for moving and transforming vast amounts of data from and to virtually any system or format.

María Carina Roldán and Adrián Sergio Pulvirenti already wrote the first edition of the Pentaho Data Integration Cookbook. María is a Webdetails fellow and this is her fourth book about Pentaho Data Integration. For this book they are joined by Alex Meadows, from Red Hat, a long-term member of the Pentaho community.

The book provides a very hands-on approach to PDI. True to its title as a cookbook, the book divides all information into handy recipes that show you – in a very practical, no-fuss way – the problem and its solution. All recipes follow the same schema:

  • state the problem,
  • create a transformation or job in PDI to solve the problem,
  • and after that, provide a detailed explanation of what just happened and explain potential pitfalls

The book covers every potential area, from database input and output to text-files, XML and Big-Data access (Hadoop and MongoDB). After the basic Input and Output tasks, the book explores the various flow control and data lookup options PDI offers. It also ventures beyond ordinary ETL tasks by showing how PDI integrates with Pentaho Reporting and the Pentaho BI-Server. If you want it even more advanced than that, the book also covers how to read and manipulate PDI transformations as XML files or read them from databases.

The book’s examples stay simple and practical and, when used in conjunction with the downloadable content, are easy to follow. If you have worked with ETL tools before or have at least a basic understanding of how to get data in and out of databases and computer systems, you will find this book a valuable companion that answers your questions quickly and completely.

The book is a bit sparse on screenshots, which makes it more difficult to follow the examples without the accompanying downloads. If you run PDI in a non-English environment, I would thus recommend switching it to English first, so that the label texts align with the book’s descriptions.

The authors focus on solving the practical task at hand and keep that focus through the whole book. For a reference book, this is a perfect setup, and I enjoyed the fact that you can jump directly to the correct chapter simply by reading the headings in the table of contents.

In conclusion, this book is great for practitioners who need a quick reference guide and need solutions to solve their starting problems with PDI. If you are familiar with ETL from other tools, then this book gets you started in no time at all.

If you are both new to PDI and data warehousing in general, make sure to read this book in conjunction with general introductory books on data warehousing or ETL, like Ralph Kimball’s The Data Warehouse ETL Toolkit or Matt Casters’ and Roland Bouman’s Pentaho Kettle Solutions.

Look out for the JDK Performance trap

Would you believe that running the same code on the same system can be up to 100% faster depending on which JDK you use? (And mind you, I am talking standard Oracle JDKs here, nothing experimental or expensive!)

A couple of weeks ago I decided to run my standard performance tests for the Pentaho Reporting engine. I had just upgraded from a six-year-old computer to a shiny new, big, fat machine and was itching to give it a run. All the software freshly installed, the filesystem sparkly clean, let’s try and see how the various reporting engine versions run.

And lo and behold, Pentaho Reporting 4.8 seemed to outperform Pentaho Reporting 5.0 by a staggering 100%. Digging into it with JProfiler, I could not find a clear target – the ratios of time spent in all major subsystems were the same in both versions – 5.0 was just twice as slow as 4.8 at every task.

After nearly two weeks of digging, I finally found the culprit: for some arcane reason I was using a 64-bit JDK for the 4.8 test, but a 32-bit JDK for the 5.0 test run. Using the exact same JDK fixed it.

The numbers: JDK7 – 32bit vs 64bit vs server vs client vm

I have a standard test, a set of reports that print 100k elements in configurations of 5000 rows/20 elements per row, 10000 rows and 10 elements, 20000 rows and 5 elements and 50000 rows and 2 elements.

Here are the results of running Pentaho Reporting 5.0 in different JDK configurations. All times are in seconds.

JVM configuration   5k_20   10k_10   20k_5   50k_2
32bit / -client      3.75     4.31    5.52   9.203
32bit / -server      2.2      2.5     3.2    5.3
64bit / -client      1.92     2.2     2.8    4.75
64bit / -server      1.9      2.18    2.78   4.75

Running Pentaho Reporting 4.8 in the same configurations yields no different results (apart from statistical noise). So with all the major work that went into Pentaho Reporting 5, at least we did not kill our performance.

So the lesson learned is: choose your JVM wisely, and if you happen to use a 32-bit system or JDK, make sure you use the ‘-server’ command line switch for all your Java tools, or you will wait twice as long for your data to appear.
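Before trusting any benchmark numbers, it is worth checking which VM you are actually running. A few standard system properties tell you everything you need:

```java
// Quick check of which VM is actually executing your code - handy before
// trusting any benchmark. Uses only standard system properties.
public class WhichVm {
    public static void main(String[] args) {
        // A HotSpot server VM typically reports something like
        // "Java HotSpot(TM) 64-Bit Server VM" here.
        System.out.println("VM name: " + System.getProperty("java.vm.name"));
        // 32 or 64 on Oracle/HotSpot JDKs; may be absent on other vendors.
        System.out.println("VM bits: " + System.getProperty("sun.arch.data.model", "unknown"));
        System.out.println("Version: " + System.getProperty("java.version"));
    }
}
```

If the VM name does not contain “Server”, you are on the client VM and should add `-server` to your launch scripts.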


Pentaho 5.0 hits the sourceforge servers – go grab it!

Today the long wait is over, as Pentaho released the community editions of version 5.0 of the Pentaho BA-Suite, which includes Pentaho Reporting and Pentaho Data Integration.

Pentaho Reporting 5.0 is a big step closer to the crosstab feature that has been sorely missing. To keep the waiting time exciting, we added some goodies.

The feature with the most impact is the support for CSS-inspired style-sheets. These stylesheets combine CSS3 selectors with our own flavor of style-properties. The stylesheet selectors are powerful enough to replace the majority of uses of style-expressions for conditional formatting, and above all: they can be shared across reports.

A new “sequence” data-source joined the list of data sources, and combines an easy API and an auto-generated UI into a handy package. This data source allows you to feed data into the reporting engine by writing just a handful of classes, reducing the time to spin up a new data source from days to hours.

While speaking of data sources: Pentaho Reporting 5 has a new way to interface with Pentaho Data-Integration. The big-data-mode uses a convention based approach to make KTR files (PDI transformations) available as standalone data sources. Inputs and outputs are auto-mapped and we reuse PDI-dialogs inside the report-designer to configure queries. This hides the complexity of the data-integration process and makes querying data a seamless experience.

The BA-server contains a host of goodies, including the final transition away from the home cooked file repository to a JCR-backed repository, along with a full set of REST APIs to communicate with the server. Anyone trying to integrate the BA-server into an existing set of applications can now rejoice in happiness.

So go to community.pentaho.com and grab the software.

Unix: Tell your system how to treat PRPT files as ZIP files

As a BI-Solution developer, every now and then there comes a time to bring out the serious tools. When dealing with report files, you may find that you need to peek inside the ZIP structure, to trace down an error, or to simply look at some of its metadata quickly.

But unzipping manually is tedious. Luckily, on every Linux system (and on Cygwin), the Midnight Commander makes file operations on the console a lot easier. The MC treats ZIP files as a virtual file system, and allows you to browse them like subdirectories and makes it possible to add, edit or remove files from these archives easily.

The only problem is: It only works if the Midnight Commander recognizes the file as a ZIP file. And sadly, for PRPTs it does not.

Internally the Midnight Commander relies on the ‘file’ command, which relies on the ‘magic’ database to check files against known signatures. So let’s teach MC how to play nice with PRPTs.

Create a $HOME/.magic file with your favourite text editor, and add the following content:

# ZIP archives (Greg Roelofs, c/o zip-bugs@wkuvx1.wku.edu)
0    string        PK\003\004
>30    ubelong        !0x6d696d65
>>4    byte        0x00        Zip archive data
!:mime    application/zip
>>4    byte        0x09        Zip archive data, at least v0.9 to extract
!:mime    application/zip
>>4    byte        0x0a        Zip archive data, at least v1.0 to extract
!:mime    application/zip
>>4    byte        0x0b        Zip archive data, at least v1.1 to extract
!:mime    application/zip
>>0x161    string        WINZIP          Zip archive data, WinZIP self-extracting
!:mime    application/zip
>>4    byte        0x14        Zip archive data, at least v2.0 to extract
!:mime    application/zip

>30    string        mimetype
>>50    string    vnd.pentaho.        Zip archive data, Pentaho Reporting

Then all you need to do is compile the file, and you are ready to go:

file -C -m $HOME/.magic

This creates a $HOME/.magic.mgc file, which contains the compiled magic code.
Test it via

file a_test_report.prpt

and if it says “Zip archive data, Pentaho Reporting”, your system has learned how to treat PRPTs as ZIPs.
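If you prefer code over magic files: a PRPT is an ODF-style ZIP with a ‘mimetype’ entry (that is exactly what the magic rules above match on), so a few lines of java.util.zip can peek at it directly. The file name in main is just a placeholder:

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Reads the 'mimetype' entry from a PRPT (or any ODF-style ZIP archive).
public class PrptMimeType {
    public static String readMimeType(String path) throws IOException {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("mimetype");
            if (entry == null) {
                return null; // not a PRPT-style archive
            }
            return new String(zip.getInputStream(entry).readAllBytes(),
                    StandardCharsets.UTF_8).trim();
        }
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "a_test_report.prpt"; // placeholder name
        if (new File(path).exists()) {
            System.out.println(readMimeType(path));
        } else {
            System.out.println("no such file: " + path);
        }
    }
}
```

For a genuine PRPT, the printed mime-type starts with “application/vnd.pentaho.”.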

Book Review: Pentaho 5.0 Reporting by Example

After visiting the Pentaho London User Group, Diethard Steiner surprised me with a brand-new book about Pentaho Reporting: Pentaho 5.0 Reporting By Example. I had been way too bogged down with the 5.0 release to notice much of anything, but missing a whole book is new even for me. This book is so fresh, the software it is describing has not even been released as community edition.

The book is written by the two founders of eGluBI, an Argentinian BI-consultancy and training company. Both have a strong background as lecturers at the Aeronautical University Institute in Córdoba, Argentina. Their teaching experience shows throughout the book, as the writing is hands-on, practical and concentrates on getting the mechanics across instead of drowning you in theory or endless lists of properties.

The book starts with a quick overview about Pentaho Reporting showing examples of some of the reports you can create. It then dives directly into the learning action and gets you started by installing the report-designer and giving a tour around the user interface.

When you go through the content of the book, you’ll notice that the book swings back and forth between guided, step-by-step actions titled “Time for action” and, after you have created something on the screen, an explanation section named “What just happened” that gives you some theoretical understanding of the task you just performed.

This very hands-on approach effectively demonstrates the mechanics of the reporting engine, without distracting you with unnecessary information. Along the way it showers you in bits of instant gratification, which makes the dry topic of business reporting a rather pleasant experience.

When you work through the chapters, you will touch all the important bits and pieces, from Data-sources, parameters and formulas to groups, charts and subreports.

The book’s structure reminds me of a course or practical teaching session and shows that both authors have a teaching background.

The book is clearly aimed at beginners, and thus concentrates on breadth instead of depth. I think this is one of the strong points of this book. It focuses on helping you understand what is going on, and enables you to find your way around the more advanced topics in the Pentaho Infocentre or the forum.

The only thing I found puzzling was the servlet programming example hidden in the last chapter. The whole book is aimed at non-programmers and business users, and thus the coding part feels out of place. And aside from that, the book already covers the BI-server and publishing reports there. As an integrator, I would recommend running a BI-server in parallel to your own web application. It saves you from reimplementing boring infrastructure parts, like security, repositories and so on, and the servlet specs contain enough goodies to access content from other web applications on the same server if needed.

Verdict

Would I recommend ‘Pentaho 5.0 Reporting by Example’ to new users? Absolutely. This book greatly lowers the hurdles to becoming productive with the Pentaho Report Designer, and helps you get started quicker. If you are a seasoned Pentaho Reporting user, you probably won’t find much new knowledge in here. But you might want to hand out copies of the book to clients to help them on their road to success.

And if your job is to teach Pentaho Reporting to new users, or to create a beginners course for Pentaho Reporting, then this book forms an ideal base for this teaching work.

CI builds for Pentaho Reporting 5.0

As the Pentaho CI server has had technical difficulties offering report-designer snapshots over the last few weeks (which are finally resolved now), I decided to dust off my Jenkins and Artifactory servers to get community builds out there.

And even though the Pentaho CI server now once again offers a Pentaho Report Designer snapshot build, I think it is a good idea to have a second, failover CI server.

So from now on, you can always get the latest build from the CI-Builds page on this blog. I will add a feed for the 3.9-x branch and an additional (unofficial and unsupported) 5.0 branch in a few days.

So go grab your CI build!

Pentaho Reporting version will be 5.0, in sync with BA-Server

It is now official. The next release of Pentaho Reporting will be numbered 5.0, in sync with the Pentaho Suite 5.0. Future releases will then keep in sync with the suite version number and will be released at the same time as the Pentaho Server.

Although it is a shame that you will no longer be able to impress friends and family with the arcane knowledge of what PRD goes with a given release of the BA-Server, it will simplify the story for support, marketing, sales and everyone else.

We are now also officially in feature lock-down. We have thus finally entered the last stages of the development process (the remaining phases being ‘stress’, ‘panic’ and ‘outright agony’). That means no new features get added, regardless of how good they are, and we now concentrate solely on working through the remaining items of the sprint backlogs.

Pentaho Reporting extension points

Pentaho Reporting provides several extension points for developers to add new capabilities to the reporting engine. When you look at the code of both the reporting engine and the report-designer, you can easily see many of the existing modules.

Each extension point comes with its own metadata structures and is initialized during the boot-up process. The engine provides the following extension points:

  • Formula Functions
  • Named Function and Expressions
  • Data-Sources
  • Report Pre-Processors
  • Elements
    • Attributes
    • Styles

Formula functions are part of LibFormula. LibFormula is Pentaho’s adaptation of the OpenFormula standard, a vendor-independent specification for spreadsheet calculations. Formula functions provide a very easy way to extend the formula language with new elements without having to worry about the details of the evaluation process. It is perfect if you want to encapsulate a calculation and still be flexible enough to use it in a general-purpose calculation.
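To see why this extension point is so cheap to implement, here is a tiny self-contained stand-in for the idea. This is not the real LibFormula API – its actual interfaces and metadata registration differ – but it shows the shape of the contract: you supply a name plus an evaluation rule, and the evaluator does the rest:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative mini-registry mirroring the idea behind pluggable formula
// functions. The real LibFormula interfaces are more elaborate.
public class MiniFormulaRegistry {
    private final Map<String, Function<double[], Double>> functions = new HashMap<>();

    public void register(String name, Function<double[], Double> impl) {
        functions.put(name.toUpperCase(), impl); // formula names are case-insensitive
    }

    public double evaluate(String name, double... args) {
        Function<double[], Double> fn = functions.get(name.toUpperCase());
        if (fn == null) {
            throw new IllegalArgumentException("Unknown function: " + name);
        }
        return fn.apply(args);
    }

    public static void main(String[] args) {
        MiniFormulaRegistry registry = new MiniFormulaRegistry();
        // Registering a new "formula function" is a one-liner; the
        // evaluation machinery does not need to change at all.
        registry.register("AVERAGE", xs -> {
            double sum = 0;
            for (double x : xs) sum += x;
            return xs.length == 0 ? 0 : sum / xs.length;
        });
        System.out.println(registry.evaluate("average", 2, 4, 6)); // → 4.0
    }
}
```

The real engine additionally wants metadata (parameter counts, types, localized descriptions), but the evaluation contract itself stays this small.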

Named functions and expressions are the bread-and-butter system for calculating values in a report. Expressions can be chained together by referencing the name of another expression or database field. Named functions are the only way to calculate values over multiple rows. Adding functions is relatively easy, as named functions only need the implementation as well as the necessary metadata.

Data-Sources are responsible for querying external systems and providing the report with tabular mass data. Pentaho Reporting already ships with data sources for relational data, OLAP, a PDI data source that executes ETL transformations to compute the data for the report, and various scripting options. Adding a data source is more complex, as an implementor needs to write the data source itself, the metadata, and the XML parsing and writing capabilities. In addition to that, the author needs to provide a UI to configure the new data source.

With Pentaho Reporting 4.0 we add two additional data-source options, which make it easier to create new data-sources.

The first option uses our ETL tool as backend to parametrize template-transformations. Therefore a data-source developer only has to provide the transformation template, and the system will automatically provide the persistence as well as all dialogs needed to configure the data-source.

The second option uses a small parametrized Java class, similar to formula expressions. These calculations, called sequences, are managed by the Sequence-Data-Source, which takes care of all persistence and all UI needs.

Report-Pre-Processors are specialized handlers that are called just before the report is executed the first time. They allow you to alter the structure of the report based on parameter values or query results. These implementations are ‘heavy stuff’ for the advanced user or system integrator.

Last but not least, you can create new element types. Elements hold all data and style information needed to produce a renderable data-object. The reporting engine expects elements to return either text (with additional hooks to return raw objects for export types that can handle them), graphics or other elements. An element that produces other elements for printing acts as a macro-processor and can return any valid content object, including bands and subreports.

Element metadata is split into 3 parts. The element itself is a union of the element’s type, attributes and style information. Implementing new basic elements requires you to write a new ElementType implementation (the code that computes the printable content) and to declare all styles and attributes the element uses.

The available style-options are largely defined by the capabilities of the underlying layout engine and thus relatively static in their composition.

An element’s attributes are a more free-form collection of data. Elements can contain any object as attributes. The built-in XML parser and writer handles all common primitive types (strings, numbers, dates and arrays thereof). If you want to use more complex data structures, you may have to write the necessary XML parser and writer handlers yourself.