Pentaho Data Integration Cookbook (2nd edition)

After my review for the newest Pentaho Reporting book, Packt Publishing asked me to write a review for the latest Pentaho Data Integration Cookbook as well, which just came out in December.

Pentaho Data Integration (PDI) is Pentaho’s answer to overpriced and proprietary ETL tools. In a Business Intelligence setting, you use ETL tools like PDI to populate your data warehouse. Outside of that, PDI is a Swiss army knife for moving and transforming vast amounts of data from and to virtually any system or format.

María Carina Roldán and Adrián Sergio Pulvirenti wrote the first edition of the Pentaho Data Integration Cookbook. María is a Webdetails fellow and this is her fourth book about Pentaho Data Integration. For this book they are joined by Alex Meadows of Red Hat, a long-time member of the Pentaho Community.

The book provides a very hands-on approach to PDI. True to its title as a cookbook, the book divides all information into handy recipes, each of which shows you a practical, no-fuss example of a problem and its solution. All recipes follow the same schema:

  • state the problem,
  • create a transformation or job in PDI to solve the problem,
  • and after that, provide a detailed explanation of what just happened and explain potential pitfalls

The book covers a broad range of areas, from database input and output to text files, XML, and Big Data access (Hadoop and MongoDB). After the basic input and output tasks, the book explores the various flow control and data lookup options PDI offers. It also ventures beyond ordinary ETL tasks by showing how PDI integrates with Pentaho Reporting and the Pentaho BI-Server. For more advanced needs, the book also covers how to read and manipulate PDI transformations as XML files or read them from databases.
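To give a flavor of what reading a transformation as XML can look like, here is a minimal sketch of my own (it is not taken from the book). It assumes the layout commonly seen in PDI .ktr files, with a transformation root element, step elements carrying name and type, and hop elements under order. The sample content and the function name summarize_transformation are hypothetical and only serve as illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed .ktr content for illustration only.
# Real transformation files contain many more elements per step.
KTR_SAMPLE = """<transformation>
  <info><name>sample_transformation</name></info>
  <order>
    <hop><from>Read customers</from><to>Write customers</to><enabled>Y</enabled></hop>
  </order>
  <step><name>Read customers</name><type>TableInput</type></step>
  <step><name>Write customers</name><type>TableOutput</type></step>
</transformation>"""


def summarize_transformation(ktr_xml):
    """Return the transformation name, its (step name, step type) pairs, and its hops."""
    root = ET.fromstring(ktr_xml)
    name = root.findtext("info/name")
    # Steps are assumed to be direct children of the root element.
    steps = [(s.findtext("name"), s.findtext("type")) for s in root.findall("step")]
    # Hops describe which step feeds which; assumed to live under <order>.
    hops = [(h.findtext("from"), h.findtext("to")) for h in root.findall("order/hop")]
    return name, steps, hops


if __name__ == "__main__":
    name, steps, hops = summarize_transformation(KTR_SAMPLE)
    print(name)
    for step_name, step_type in steps:
        print(f"  step: {step_name} ({step_type})")
    for source, target in hops:
        print(f"  hop:  {source} -> {target}")
```

For a real file you would read its contents from disk first; the element names above should be checked against your own PDI version before relying on them.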

The book’s examples stay simple and practical and, when used in conjunction with the downloadable content, are easy to follow. If you have worked with ETL tools before or have at least a basic understanding of how to get data in and out of databases and computer systems, you will find this book a valuable companion that answers your questions quickly and completely.

The book is a bit sparse on screenshots, which makes it more difficult to follow the examples without the accompanying downloads. If you run PDI in a non-English environment, I would thus recommend switching it to English first, so that the label texts align with the book’s descriptions.

The authors focus on solving the practical tasks at hand and keep that focus throughout the whole book. For a reference book, this is a perfect setup, and I enjoyed the fact that you can jump directly to the correct chapter simply by reading the headings in the Table of Contents.

In conclusion, this book is great for practitioners who need a quick reference guide and solutions to the problems they run into when starting out with PDI. If you are familiar with ETL from other tools, this book gets you started in no time at all.

If you are new to both PDI and data warehousing in general, make sure to read this book in conjunction with general introductory books on Data Warehousing or ETL, like Ralph Kimball’s The Data Warehouse ETL Toolkit or Matt Casters’ and Roland Bouman’s Pentaho Kettle Solutions.

