Category Archives: Announcements

Book Review: Pentaho Data Integration Cookbook (2nd Edition)

After my review for the newest Pentaho Reporting book, Packt Publishing asked me to write a review for the latest Pentaho Data Integration Cookbook as well, which just came out in December.

Pentaho Data Integration (PDI) is Pentaho’s answer to overpriced and proprietary ETL tools. In a Business Intelligence setting, you use ETL tools like PDI to populate your data warehouse, and outside of that, PDI is a Swiss army knife of tools to move and transform vast amounts of data from and to virtually any system or format.

María Carina Roldán and Adrián Sergio Pulvirenti already wrote the first edition of the Pentaho Data Integration Cookbook. María is a Webdetails fellow and this is her fourth book about Pentaho Data Integration. For this book they are joined by Alex Meadows, from Red Hat, a long-term member of the Pentaho Community.

The book provides a very hands-on approach to PDI. True to its title as a cookbook, the book divides all information into handy recipes that show you, in a very practical way, a no-fuss example of the problem and solution. All recipes follow the same schema:

state the problem,
create a transformation or job in PDI to solve the problem,
and after that, provide a detailed explanation of what just happened and explain potential pitfalls

The book covers every potential area, from database input and output to text-files, XML and Big-Data access (Hadoop and MongoDB). After the basic Input and Output tasks, the book explores the various flow control and data lookup options PDI offers. It also ventures beyond ordinary ETL tasks by showing how PDI integrates with Pentaho Reporting and the Pentaho BI-Server. If you want it even more advanced than that, the book also covers how to read and manipulate PDI transformations as XML files or read them from databases.

The book’s examples stay simple and practical and, when used in conjunction with the downloadable content, are easy to follow. If you have worked with ETL tools before or have at least a basic understanding of how to get data in and out of databases and computer systems, you will find this book a valuable companion that answers your questions quickly and completely.

The book is a bit sparse on screenshots, which makes it more difficult to follow the examples without the accompanying downloads. If you run PDI in a non-English environment, I would thus recommend switching it to English first, so that the label texts align with the book’s descriptions.

The authors focus on solving practical tasks at hand and keep that focus through the whole book. For a reference book, this is a perfect setup and I enjoyed the fact that you can jump directly to the correct chapter simply by reading the headings in the Table of Contents.

In conclusion, this book is great for practitioners who need a quick reference guide and solutions to their first problems with PDI. If you are familiar with ETL from other tools, then this book gets you started in no time at all.

If you are new to both PDI and data warehousing in general, make sure to read this book in conjunction with general introductory books on Data Warehousing or ETL, like Ralph Kimball’s The Data Warehouse ETL Toolkit or Matt Casters’ and Roland Bouman’s Pentaho Kettle Solutions.

Pentaho 5.0 hits the SourceForge servers – go grab it!

Today the long wait is over, as Pentaho released the community editions of version 5.0 of the Pentaho BA-Suite, which includes Pentaho Reporting and Pentaho Data Integration.

Pentaho Reporting 5.0 is a big step closer to the crosstab feature that has been sorely missing. To keep the waiting time exciting, we added some goodies.

The feature with the most impact is the support for CSS-inspired style-sheets. These stylesheets combine CSS3 selectors with our own flavor of style-properties. The stylesheet selectors are powerful enough to replace the majority of uses of style-expressions for conditional formatting, and above all: they can be shared across reports.

A new “sequence” data-source joined the list of data sources, and combines an easy API and an auto-generated UI into a handy package. This data source allows you to feed data into the reporting engine by writing just a handful of classes. This reduces the time to spin up a new data source from days to hours.

While speaking of data sources: Pentaho Reporting 5 has a new way to interface with Pentaho Data-Integration. The big-data-mode uses a convention based approach to make KTR files (PDI transformations) available as standalone data sources. Inputs and outputs are auto-mapped and we reuse PDI-dialogs inside the report-designer to configure queries. This hides the complexity of the data-integration process and makes querying data a seamless experience.

The BA-server contains a host of goodies, including the final transition away from the home-cooked file repository to a JCR-backed repository, along with a full set of REST APIs to communicate with the server. Anyone trying to integrate the BA-server into an existing set of applications can now rejoice.

So go to community.pentaho.com and grab the software.

Book Review: Pentaho 5.0 Reporting by Example

After visiting the Pentaho London User Group, Diethard Steiner surprised me with a brand new book about Pentaho Reporting: Pentaho 5.0 Reporting By Example. I had been way too bogged down with the 5.0 release to notice much of anything, but missing a whole book is new even for me. This book is so fresh, the software it is describing has not even been released as community edition.

The book is written by the two founders of eGluBI, an Argentinian BI-Consultancy and training company. Both have a strong background as lecturers at the University Aeronautic Institute in Córdoba, Argentina. Their teaching experience shows throughout the book, as the writing is hands-on, practical and concentrates on getting the mechanics across instead of drowning you in theory or endless lists of properties.

The book starts with a quick overview about Pentaho Reporting showing examples of some of the reports you can create. It then dives directly into the learning action and gets you started by installing the report-designer and giving a tour around the user interface.

When you go through the content of the book, you’ll notice that the book swings back and forth between guided, step-by-step actions titled “Time for action” and, after you created something on the screen, an explanation section named “What just happened” that gives you some theoretical understanding of the task you just performed.

This very hands-on approach effectively demonstrates the mechanics of the reporting engine, without distracting you with unnecessary information. Along the way it showers you in bits of instant gratification, which makes the dry topic of business reporting a rather pleasant experience.

When you work through the chapters, you will touch all the important bits and pieces, from Data-sources, parameters and formulas to groups, charts and subreports.

The book’s structure reminds me of a course or practical teaching session and shows that both authors have a teaching background.

The book is clearly aimed at beginners, and thus concentrates on breadth instead of depth. I think this is one of the strong points of this book. It focuses on helping you understand what is going on, and enables you to find your way around the more advanced topics in the Pentaho Infocentre or the forum.

The only thing I found puzzling was the servlet programming example hidden in the last chapter. The whole book is aimed at non-programmers and business users, and thus the coding part feels out of place. And aside from that, they covered the BI-server and publishing reports there. As an integrator, I would recommend running a BI-server in parallel to your own web-application. It saves you from reimplementing boring infrastructure parts, like security, repositories and so on, and the servlet specs contain enough goodies to access content from other web-applications on the same server if needed.

Verdict

Would I recommend ‘Pentaho 5.0 Reporting by Example’ to new users? Absolutely. This book greatly lowers the hurdles to becoming productive with the Pentaho Report Designer, and helps you get started more quickly. If you are a seasoned Pentaho Reporting user, you probably won’t find much new knowledge in here. But you might want to hand out copies of the book to clients to help them on their road to success.

And if your job is to teach Pentaho Reporting to new users, or to create a beginners course for Pentaho Reporting, then this book forms an ideal base for this teaching work.

Pentaho Reporting version will be 5.0, in sync with BA-Server

It is now official. The next release of Pentaho Reporting will be numbered 5.0, in sync with the Pentaho Suite 5.0. Future releases will then keep in sync with the suite version number and will be released at the same time as the Pentaho Server.

Although it is a shame that you will no longer be able to impress friends and family with the arcane knowledge of what PRD goes with a given release of the BA-Server, it will simplify the story for support, marketing, sales and everyone else.

We are now also officially in feature-lock-down. We thus have finally entered the last stages of the development process (the remaining phases being ‘stress’, ‘panic’ and ‘outright agony’). That means no new features get added, regardless of how good they are, and we now concentrate solely on working through the remaining items of the sprint backlogs.

Crosstabs got pushed from 5.0 – but my work goes on

On Friday, Pentaho development and product management made the decision to remove crosstabs from the next release. Therefore, the Pentaho Suite 5.0 will not claim support for crosstabs. The decision came upon us after looking at the scope of the remaining work across the whole stack.

So let me give some context to this decision and how it impacts the ongoing and future development work of the reporting stack.

What does it mean for the Pentaho Suite 5.0 release?

The BI-Suite will ship with Pentaho Reporting 4.0. I will continue to finalize the reporting engine’s support for crosstabs, and that will ship with the release. The engine will fully support crosstab reports.

However, the user interface for creating crosstabs will be basically limited to what we have in the current development version, plus fixes to make it stable. Given the fact that we can’t reasonably expect anyone but consultants and programmers to use a feature that has no proper UI, we will refrain from publicly claiming that the reporting system supports crosstabs (yet).

That means, you create crosstabs via a dialog and you can select and format elements in the graphical editor. Everything beyond that won’t have proper UI support, and thus reconfiguring an existing crosstab requires hard work in the structure tree.

The Pentaho Reporting engine still has some interesting new features to offer that make an upgrade worthwhile. Stylesheet support now allows you to create shared style definitions that can be hosted centrally. A new class of datasources driven by Kettle templates adds support for Big-Data/No-SQL datasources. These templates are parametrized and deployed as plugins, and thus allow you to write reusable datasources that are user-friendly. And last but not least: We opened up the layout system, giving you new options on how to format reports.

What were the reasons for the push?

The decision was a classical result of reducing the risk of the upcoming release. The next release contains a massive rewrite of the internals of the Pentaho BI-Server, updating the architecture to the standards of the 21st century. Adding the JCR repository and REST services required a large amount of work, which will pay off in the future with faster release cycles and easier to maintain code.

We have roughly 6 weeks left until the code needs to be finished and handed over to the release process. Many of the committed features have only a few bits missing or need debugging to be checked off as finished.

Over the last month or two, the crosstab UI work was somewhat sidelined by bug-fixing work for the service packs and by work on other features. With too many tasks for not enough hands, at some point something has to give.

Faced with a large workload that would ultimately leave little time for QA and documentation, it was only sensible to cut our losses for this release. To create a release that contains as many goodies as possible, it is more sensible to finish what is nearly done than to work on the larger risk items.

It is difficult to push cases where the missing feature would be a regression or leave a visible gap. In the end, the crosstab cases were clearly separate from the other features, and with clear lines for the cut they were the safest to push.

Pentaho publishes its first monthly Service Pack

Last week, Pentaho delivered its first service pack full of bug-fixes for the last two releases to our existing customers. I think this now marks the point where Pentaho crossed over from being a wild teenager towards being a responsible adult.

We provide commercial support for our customers as part of the Pentaho support offering, and as part of that we have a long history of fixing critical bugs in releases outside of the normal release cycle.

The main selling point of any commercial support contract is that of an insurance policy – if something goes wrong and your critical systems are down, there is someone who cares and who can fix them for you. It is the kind of service that lets managers sleep at night in the safe knowledge that their factories will run and their reports continue to be delivered when they wake up the next day.

Until recently, customers with show-stopping problems (severity class 1, with no work-around) would have to go through an escalation process to get the bug-fix machine rolling. The escalation, received by our support department, would land on the desk of the engineering group, who would scramble to fix it as fast as possible. Once we have a fix, it goes through some more testing (which can include sending out early versions of the fix to validate that it really works in your production system) before it gets wrapped up and officially handed back to support as a ‘customer deliverable’ patch release.

One major drawback of that system was: If a bug was not a show-stopper, you would have a rather hard time getting it through as a worthy emergency fix. This easily leads to situations where a low-intensity bug affects a lot of customers, making everyone unhappy, but it never gets addressed for existing releases, as the bug is not severe enough for a single customer.

This system works well when there is a crisis, and it stays around. Sometimes you just can’t wait until the next patch release comes out.

But for us, as engineering group, dropping all tools and jumping onto emergency bug-fixes causes large disruptions in our engineering process. Emergency patches are born in an expensive process.

Therefore, Pentaho now introduces ‘Service Packs’. Similar to how Microsoft, Oracle and all the old companies publish bug-fixes for their software on a regular schedule, Pentaho’s service packs follow the same approach.

Roughly every 4 weeks (to be precise, usually in the 3rd week of the month) we package up all bug-fixes that we created over the last month, and make them available to all customers as a patch release.

When we allocate some quality bug-fix time in our planning well before there is a panic call, we can work on the fixes without having to jump around wildly. We get more work done by concentrating for a week or so on fixing a series of bugs than by context-switching between our product development work and delivering emergency fixes.

And when we fix bugs regularly, it makes everyone happy.

Customers are happier, as they see we care, that we fix bugs that annoy them, even though they are not blocker problems. Engineering is happier, as we can fix bugs under less pressure, creating a larger number of fixes with fewer tears. And when it comes to renewal, sales is happier too, as customers who got help during the year are more likely to see the value of a support contract.

How do we decide what issues get fixed?

When the time comes to assemble the list of things we want to address, we have a list of criteria that help us pick and choose. Here are some of the criteria we use, but bear in mind that this list is in no particular order and not complete:

How critical is it? (we rather fix critical issues than cosmetic issues)
Is it a regression of an existing functionality?
What is the impact on customer(s)?
Is there a workaround available?
How many customers reported the problem?
Is it a data & security issue?
How complex is the fix? Does it require large changes? Is it risky?
How close to the patch package cut-off date has this bug been reported?

All these metrics get mixed together to help us form an opinion. So a more severe bug that affects only one customer in a highly arcane scenario may get fixed later than a small fix that affects dozens of customers.

Some issues cannot be solved in the short time frame of the allocated bug-fix time. These issues are likely to be scheduled to the next feature release, especially if fixing them involves major code work, along with the risk to create new problems. A bug fix is not really a bug fix if it introduces new bugs, right?

We currently produce service packs for the Pentaho 4.5 and Pentaho 4.8 releases. For the report designer, this maps to Pentaho Report Designer 3.9 and 3.9.1 respectively.

Let me repeat it to be extra clear: The old escalation process for show-stopper problems (severity class 1, no workaround available) is still there and will not go away. So when you encounter an issue that has a very negative impact on your operations, please continue to use the escalation process to make us aware of that. We then work together to resolve your problem.

Adding Service packs as an additional tool just makes it easier for us to improve our existing products in a more timely fashion, with fixes made to work within your existing product and installation. This way, getting and installing bug fixes can be as easy as installing the latest Windows Update, so that you can spend more time growing your business.

Moving to Git and easier builds

During the last year, as part of my work with the Rabbit-Stew-Dio, I fell in love with Git. Well, sort of, that marriage is not without conflict, and from time to time I hate it too. But when the time came to move all our Pentaho Reporting projects to Git, we all were happy to jump on that boat.

As a result, you can now access all code for the 4.0/TRUNK version of Pentaho Reporting via our GitHub Project. This project contains all libraries, all runtime engine modules and everything that forms the report-designer and design-time tools.

Grab it via:

git clone git@github.com:pentaho/pentaho-reporting.git

Code organization

Our code is split into three groups of modules.

“/libraries” contains all shared libraries and code that provides infrastructure that is not necessarily reporting related.
“/engine” contains the runtime code for Pentaho Reporting. Whether you want to embed our reporting engine into your own Swing application or deploy it as part of a J2EE application, this contains all you ever need.
“/designer” contains our design-time tools, like the report-designer and the report-design-wizard. It also contains all data source UIs that are used in both the Report Designer and Pentaho Report Wizard.

If you use IntelliJ IDEA for your Java work, then you will be delighted to find that the sources act as a fully configured IntelliJ project. Just open the ‘pentaho-reporting’ directory as a project in IntelliJ and off you go. If you use Eclipse, well, why not give IntelliJ a try?

Branching system

At Pentaho we use Scrum as our development process. We end up working on a set of features for about 3 weeks, called a Sprint. All work for that Sprint goes into a feature branch (sprint_XXX-4.0.0GA) and gets merged with the master at the end of the sprint.

If you want to keep an eye on our work while we are sprinting, check out the sprint branches. If you prefer it more stable, and are happy with updates every three weeks, stick to the master-branch.
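As a rough sketch of how to follow the sprint work: the helper below lists only the sprint branches of a clone. The function name is made up for illustration, and it assumes the sprint branches keep the ‘sprint_’ prefix mentioned above; ‘git fetch’, ‘git branch -r’ and ‘git checkout’ are standard Git commands.

```shell
# Hypothetical helper: read branch names on stdin (for example the output of
# "git branch -r") and print only the sprint branches, without indentation.
filter_sprint_branches() {
  sed -n -e 's/^ *//' -e '/sprint_/p'
}

# Typical use inside a clone of the repository (commented out so the
# snippet can be sourced anywhere):
#   git fetch origin
#   git branch -r | filter_sprint_branches
#   git checkout master && git pull   # back to the stable line
```

Check out one of the printed branches if you want to track a sprint, and return to master whenever you prefer the three-week rhythm.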

During a Sprint, our CI system will build and publish artifacts from the sprint branches. If you don’t want that, then it is now easy to get your own build up and running in under 5 minutes (typing time, not waiting time).

Building the project

The project root contains a global multibuild.xml file that can build all modules in one go. If you want it more finely granulated, each top level group (‘libraries’, ‘engine’, ‘designer’) contains its own ‘build.xml’ file to provide the same service for these modules.

To successfully build Pentaho Reporting, you do need Apache Ant 1.8.2 or newer. Go download it from the Apache Ant Website if you haven’t done it yet.
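If you are not sure which Ant version is installed, a small check along these lines can tell you before you start. This is only a sketch: the ‘version_ge’ helper is my own illustration, it handles purely numeric versions, and it assumes that ‘ant -version’ prints a line containing the word ‘version’ followed by the number.

```shell
# Succeeds when dotted numeric version $1 is at least version $2,
# for example: version_ge 1.10.0 1.8.2
version_ge() {
  a=$1
  b=$2
  while [ -n "$a" ] || [ -n "$b" ]; do
    x=${a%%.*}    # leading component of each version
    y=${b%%.*}
    # drop the component we just read
    if [ "$a" = "$x" ]; then a=; else a=${a#*.}; fi
    if [ "$b" = "$y" ]; then b=; else b=${b#*.}; fi
    x=${x:-0}     # missing components count as 0
    y=${y:-0}
    if [ "$x" -gt "$y" ]; then return 0; fi
    if [ "$x" -lt "$y" ]; then return 1; fi
  done
  return 0
}

# Assumption: "ant -version" prints something like "... version 1.8.2 ...".
if command -v ant >/dev/null 2>&1; then
  found=$(ant -version 2>/dev/null | sed -n 's/.*version \([0-9][0-9.]*\).*/\1/p' | head -n 1)
  if version_ge "${found:-0}" 1.8.2; then
    echo "Ant $found is new enough"
  else
    echo "Ant ${found:-unknown} is too old, 1.8.2 or newer is needed"
  fi
else
  echo "Ant is not installed or not on the PATH"
fi
```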

After you cloned our Git repository, you have all the source files on your computer. But before you can use the project, you will have to download the third party libraries used in the code.

On a command line in the project directory, call

ant -f multibuild.xml resolve

to download all libraries.

If you’re going to use IntelliJ for your work, you are all set now and can start our IntelliJ project.

To build all projects locally without running the tests, invoke

ant -f multibuild.xml continuous-local-testless

If you feel paranoid and want to run the tests while building, then use the ‘continuous-local’ target. This can take quite some time, as it also runs all tests. Expect to wait an hour while all tests run.

ant -f multibuild.xml continuous-local

After the process is finished, you will find “Report Designer” zip and tar.gz packages in the folder “/designer/report-designer/assembly/dist”.
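If you do not want to remember the target names, you could wrap them in a tiny helper. The sketch below is just one way to do it: the function names and the short mode names (‘resolve’, ‘fast’, ‘full’) are invented for this example, while the Ant targets are the ones described above. It must be run from the project root.

```shell
# Map a short, made-up mode name to the Ant target named in the text above.
prd_target() {
  case "$1" in
    resolve) echo "resolve" ;;                     # download third party libraries
    fast)    echo "continuous-local-testless" ;;   # build without tests
    full)    echo "continuous-local" ;;            # build and run all tests
    *)       echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Run the build from the project root; set DRY_RUN=1 to only print the command.
prd_build() {
  target=$(prd_target "$1") || return 1
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "ant -f multibuild.xml $target"
  else
    ant -f multibuild.xml "$target"
  fi
}
```

For example, ‘prd_build resolve’ followed by ‘prd_build fast’ reproduces the two steps from this section, and ‘DRY_RUN=1 prd_build full’ only shows what would be executed.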

If you get OutOfMemoryErrors pointing to a JUnitTask, or if you get OutOfMemory “PermGen Space” errors, increase the memory of your Ant process to 1024m by setting the ANT_OPTS environment variable:

export ANT_OPTS="-Xmx1024m -XX:MaxPermSize=256m"

Building the project on a CI server

Last but not least: Do you want to build Pentaho Reporting on your own continuous integration server and publish all created artifacts to your own Maven server? Then make sure you set up Maven to allow you to publish files to a repository.

Install Artifactory or any other maven repository server.
Copy one of the ‘ivy-settings.xml’ configurations from any of the modules and edit it to point to your own Maven server. Put this file into a location outside of the project, for instance into “$HOME/prd-ivy-settings.xml”
Download and install Maven as usual, then configure it to talk to the Artifactory server.

Edit your $HOME/.m2/settings.xml file and locate the ‘servers’ tag. Then configure it with the username and password of a user that can publish to your Artifactory server.
Replace ‘your-server-id’ with a name describing your server. You will need that later.
Replace ‘publish-username’ and ‘publish-password’ with the username and password of an account of your Artifactory installation that has permission to deploy artifacts.

<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"           
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"           
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 
                    http://maven.apache.org/xsd/settings-1.0.0.xsd">
   ...
   <servers>
     <server>
       <id>your-server-id</id>
       <username>publish-username</username>
       <password>publish-password</password>
       <configuration>
         <wagonProvider>httpclient</wagonProvider>
         <httpConfiguration>
           <put>
             <params>
               <param>
                 <name>http.authentication.preemptive</name>
                 <value>%b,true</value>
               </param>
             </params>
           </put>
         </httpConfiguration>
       </configuration>
     </server>
   </servers>
    ..
</settings>

Now set up your CI job. You can either override the ivy properties on each CI job, or you can create a global default by creating a ‘$HOME/.pentaho-reporting-build-settings.properties’ file. The settings of this file will be included in all Ant-builds for Pentaho Reporting projects.

ivy.settingsurl=file:${user.home}/prd-ivy-settings.xml
ivy.repository.id=your-server-id
ivy.repository.publish=http://repo.your-server.com/ext-snapshot-local

After that, test your setup by invoking

ant -f multibuild.xml continuous

It should run without errors now. If you see errors on publish, check your Maven configuration or your Artifactory installation.
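Before digging into Maven or Artifactory, it can help to confirm that the three files from the steps above are actually in place. The following is a minimal sketch, not part of the Pentaho build: the function name is invented, and it assumes you used the example location ‘$HOME/prd-ivy-settings.xml’ for the ivy settings.

```shell
# Hypothetical preflight check: report which of the files described above are
# missing. $1 is the home directory to inspect (defaults to $HOME).
check_ci_prereqs() {
  home=${1:-$HOME}
  missing=0
  for f in "$home/prd-ivy-settings.xml" \
           "$home/.m2/settings.xml" \
           "$home/.pentaho-reporting-build-settings.properties"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (commented out so the snippet can be sourced safely):
#   check_ci_prereqs && ant -f multibuild.xml continuous
```

The check only looks for the files; it does not validate their content or the credentials inside them.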

Conclusion

With the new build structure and the move to Git, it has become tremendously easy to download and work with the Pentaho Reporting source code. Even formerly daunting tasks like setting up a CI server have become simple enough to be documented in a single blog post.

Enjoy!

Pentaho Reporting 3.9.1 released on SourceForge

Doug just uploaded the latest stable release of the Pentaho Report Designer to SourceForge. Thanks Doug!

This is a bug-fix release addressing issues in the drill-linking editor and the formula-editor. It also enables support for smarter Mondrian-Caching via the new ‘JDBCConnectionUUID’ property.

Be aware that using PRD-3.9.1 requires that you also use BI-Server 4.8. If you have an older BI-Server installation, you have to stick to the report-designer that came with it or upgrade your server to 4.8.

Thank you, Sulaiman, for handling the whole release process! Well done.

Saiku Reporting 1.0 hits the market

.. and WAQR can finally rest in peace.

The community edition of the Pentaho BI-Platform just got a whole lot better: On Friday, Marius Giepz released version 1.0 of Saiku-Reporting (SR).

Saiku-Reporting is an adhoc reporting plugin for the BI-Server. SR uses your existing Pentaho Metadata models to produce stylish reports. SR provides you with everything you would expect from an adhoc tool:

Drag&Drop Report-Design with What-you-see-is-what-you-get design of your report
Export to: PDF,CSV,XLS,CDA,PRPT
Uses PRPT-Templates (bye, bye, WAQR templates!)
Grouping
Aggregation
Totals
Calculate additional columns with formulas

Creating a report is rather simple. Select your data from the left-hand side of your screen and drag it into the fields for columns, groups or filters. Double-click on the field-names to open a dialog with some more formatting options.

When creating your reports, you have the choice to view your data either as paginated report or as raw-data. The raw-data view comes in handy when you create calculated columns, as you can see the data in the same way the reporting engine actually sees it.

And last but not least: Saiku-Reporting 1.0 is miles ahead of WAQR, so now we can finally bury that zombie for good.

Thank you, Marius, for that new member of the Pentaho space. And thank you for not naming it CAR (Community Adhoc Reporting). 😀

Pentaho Reporting 3.9-RC is on the slide to its release

Right now the engineers in Orlando are building the Release Candidate of the 4.0 release of the BI-Server. Along with it, we will have a bug-fix release of the Report-Designer and Reporting Engine named 3.9-RC.

The release ships with only a handful of new features and a shipload of bug-fixes.

New feature: Justified Text

This feature has been sitting on my list of things to do for ages. More specifically, since the old “Pre-Pentaho” days, when the internet was still young. We now have a new text-alignment option for your content and a button for it on the toolbar.

New feature: Heavily Scripted Data-Sources

This is my personal favourite of this release. All major data-sources (SQL, MetaData, Mondrian and OLAP4J) can now customize the configuration of the data-source, the query that gets executed and the result-set that gets produced. At the moment we ship with support for JavaScript and Groovy scripting to make these customizations.

Improvement: CDA-datasource local calls

When running on the same server, CDA data-sources now call the CDA plugin via Java-calls instead of routing all calls through a network layer. Thank you, web-details, for this addition.

Improvement: JQuery based report-viewer

Well, not sure whether this is a new feature or improvement, but we now have a pure JavaScript report-viewer based on jQuery and standard JavaScript. It is so clean that even I can understand that code. Over the last few years, we heard more and more desperate calls for a better report viewer that can be customized to customer needs. If you are a web-developer or know one, you now are one step closer to a customized report parametrization experience. Credits go to Jordan Ganoff for coming up with this great addition.

Improvement: Data-sources now obfuscate stored passwords.

When writing PRPT files, we now obfuscate all passwords so that they are not as easy to read. Note that this makes no difference to a skilled attacker, as anyone with a debugger or access to the source-code can see how these passwords are en- and de-coded. If you need true security for your database credentials, you still have to use JNDI connections and ensure that only trustworthy administrators can access the server configuration.
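To see why obfuscation is not protection, consider this small illustration. It deliberately uses Base64 as a stand-in; this is an assumption made for the example and not the scheme used in PRPT files. The function names are made up as well, and the ‘-d’ decode flag is the one used by GNU coreutils.

```shell
# Stand-in obfuscation: Base64 is NOT necessarily what PRPT files use; it only
# illustrates that a reversible encoding without a secret key hides nothing.
obfuscate()   { printf '%s' "$1" | base64; }
deobfuscate() { printf '%s' "$1" | base64 -d; }

stored=$(obfuscate 'db-secret')      # what would end up in the file
echo "stored:    $stored"
echo "recovered: $(deobfuscate "$stored")"
```

The stored value no longer looks like the password, but anyone who knows the encoding gets the original back with a single command. The same holds for any scheme whose decoding routine ships with the software.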

Bug-Fixes: A load of layouter related issues. Subreport processing with page-footers and repeated group-footers was buggy. The HTML content produced by the HTML output writer is now much cleaner and less verbose. The table-datasource editor now can remove multiple rows at once and has other usability problems fixed as well. For a complete list, have a look at the release notes on our JIRA system.

Get the full list of cases in our JIRA system.

Reporting Tales

Pentaho Reporting Tips and Tricks