Access Databases with Dynamic Table Names in Pentaho Reporting

Every now and then we get the question on how to create a report on database tables with a dynamic date stamped table name. Sounds weirdly cryptic? Well, instead of having a single table for all orders, these systems may have many small tables for all orders of a specific month.

I tried hard to come up with a potential reason for these things:

  • Maybe the database is so weak that it can’t handle more than a few thousand rows in a table. But both MySQL nor PostgreSQL are quite capable for some serious data storage (if used right). And even Oracle can do that – and if you have enough money to buy a Oracle license you have money to buy a server (-farm, Oracle ain’t cheap!) too.
  • Maybe they want to speed up data access to the table. After all, a SELECT * FROM TABLE takes a long time if you don’t use a WHERE clause or a proper index.
  • Maybe they have not heard of normalization.
  • Maybe they love to have separate tables so that they can remove the data of older months. A SQL DELETE command is black magic. Deleting the database file in the filesystem is quicker.

So if you are using such a system – drop me a note. I am dying to know what application creates such a data-structure.

Reporting on these database structures is not straight forward and usually considerably slower than working on a proper data model. And here is how you do it, in simple steps:

Prerequisite: The data model we use in this tutorial

Lets assume we have a database filled with sales data. For the sake of simplicity, lets assume there are several tables named in the schema of “SALES_yyyymm”. The ‘yyyy’ part of the table name represents the four digit year, the ‘mm’ represents the two-digit month.

The tables are made up of the following structure (including some sample data*):

CUSTOMERNUMBER DATE ORDERID ARTICLEID QUANTITY PRICE
0001 20091015 112 Fluffy Dogs 100 15
0001 20091015 112 Gumpy Rabbit 9 14
0001 20091015 112 Angry Birds 12 5
0002 20091024 113 Busy Bees 99 1

*) Note: Yes I know that putting strings as article ID is bad. But it safes me from outlining a second table.

Prerequisite 2: Reporting made dynamic with a date parameter

Creating a static report with a hardcoded table name is easy. Write your query and be done with it. The tricky part is to create a report that selects the right table based on the a given date. That date could be the current system date for reports returning the current month’s turnover or it could be a historic date.

If you want to make the report dynamic, you will need a parameter to feed the report’s target date into the reporting system and into the query. The table-name in the database contains a partial date value. To make the report return data for a given date, we must calculate the database table name when the report runs based on whatever date the user entered.

Standard SQL datasources do not allow you to parametrize structural elements of a query. This is a security measure to prevent an attack known as “SQL Injection“.

If you need to compute structural elements of a query, you will have to use a “JDBC (Custom)” datasource (Data -> Add Data-Source -> Advanced -> JDBC (Custom)). Whenever you use a data-source marked as (Custom), you opt out of the standard parametrization. At that point, it is your responsibility to properly validate all incoming user data and to ensure that no one can enter illegal values.

The Simple-Case: Reporting within the same table

The Custom JDBC datasource only contains connection information. The actual SQL query that is executed is computed at runtime. To define a query, put a formula or expression into the “query::name” attribute of your master- or sub-report object.

Mike Tarallo create a excellent video tutorial on how to use the Custom JDBC data sources. So instead of repeating everything here, watch the video and learn what the Custom JDBC datasource is and how to use it properly.

Simple Parametrization: Using a string parameter

So we need to declare a parameter to select the table the report should use. You want it dead simple (just for testing, of course!)? Use a plain text parameter that allows the user to enter the year and month to use.

Parameter:
 Name: Simple
 Value Type: String

and finally, to prevent SQL injections, some validation:

Post-Processing-Formula: =IF(LEN([Simple]) = 6; IF (ISNUMBER([Simple]); [Simple]; NA()); NA())

Complex Parametrization ideas

You can use a ordinary date parameter to make use of the date-picker. Note that this returns a Date object. You need to convert this date into a valid text. For that add a second hidden parameter with the following Post-Processing Formula:

=MESSAGE("{0,date,yyyyMM}";[Date])

Alternatively, assuming you use a proper database, you can extract the valid dates from your database. Just query the database metadata for the tables that are available. (I am using MySQL for the examples here. Consult your database manual on how to access the database schema.)

SELECT SUBSTRING(table_name, 5) AS "SalesDates" 
  FROM INFORMATION_SCHEMA.TABLES
 WHERE table_schema = ''
   AND table_name LIKE 'SALES_%'

You can use this information to feed a list parameter.

Create your query

I assume you have a parameter that contains a suitable string (6 characters, yyyyMM format). Lets call this parameter “DateSuffix”. On the master-report’s “query-name” attribute, create a formula with the following syntax:

="SELECT * FROM SALES_" & [DateSuffix]

At runtime, the report engine will evaluate the formula and construct the proper query string. That query string is then executed by the Custom JDBC Data-Source.

Tip: The report designer does not use queries from formulas during the design process. Read this article to learn how to get designtime column information into the Pentaho Report Designer.

Summary

To create a simple dynamic report:

  • Create a parameter that returns the suffix of your table
  • Add a Custom JDBC Datasource to your report
  • Add a formula to the query-name-attribute of your report

Creating reports that summarize a full year

Single reports for a single month – that is easy stuff. But sooner or later someone will come back to you and ask for sales data spanning the last year.

Lets skip the parameter defining details. Create a parameter that returns the year as string and call it “YearText”.

To your database, each of the monthly tables is unrelated to any other of the monthly tables. Databases are great on providing referential integrity on the data within the tables. But generally they care little about the names of the tables and particularly do not assume that table names have any higher meaning.

In SQL merging data from several tables happens via an UNION operator. Tables that are merged via UNION must have the same structure – the number and types of the columns must match.

A SQL Union takes the form of

 UNION 

Or for our example:

SELECT * FROM SALES_200910 
 UNION
SELECT * FROM SALES_200911

Each select statement in a UNION is considered separately. Any “ORDER BY” clause is applied only to the particular table. If you want to sort globally, you will have to wrap the UNION result into another SQL query via a SUB-SELECT. All proper databases support Sub-Selects, but again: For details, consult your database manual.

Here is an example on how to use a sub-select to sort globally

SELECT * FROM 
  (SELECT * FROM SALES_200910 
     UNION
   SELECT * FROM SALES_200911) 
ORDER BY ARTICLEID

Sub-selects are considerably slower than normal selects. Most databases will start to use temporary tables for the selection result and therefore will answer such queries a lot slower than normal queries. If your performance goes down a lot, consider creating a data warehouse.

A query spanning a whole year will be long and complex.

SELECT * FROM SALES_200901
     UNION
SELECT * FROM SALES_200902
     UNION
SELECT * FROM SALES_200903
     UNION
SELECT * FROM SALES_200904
     UNION
SELECT * FROM SALES_200905
     UNION
SELECT * FROM SALES_200906
     UNION
SELECT * FROM SALES_200907
     UNION
SELECT * FROM SALES_200908
     UNION
SELECT * FROM SALES_200909
     UNION
SELECT * FROM SALES_200910
     UNION
SELECT * FROM SALES_200911
     UNION
SELECT * FROM SALES_200912

The matching formula for making the query dynamic would look like:

="SELECT * FROM SALES_" & [YearText] & "01
     UNION
SELECT * FROM SALES_" & [YearText] & "02
     UNION
SELECT * FROM SALES_" & [YearText] & "03
     UNION 
..

Creating more complex reports on such a model is a pain. I would recommend to create a single SQL VIEW that contains all your sales table.

This normalizes your data model without breaking your existing programsViews are virtual database tables that make it easier to write SQL queries.

CREATE VIEW Sales AS 
  SELECT * FROM SALES_200901
    UNION
  SELECT * FROM SALES_200902
    UNION
  SELECT * FROM SALES_200903
    UNION
  ..

If your database supports materialized views, it will create a permanent temporary table that can considerably speed up your reports.

The insanely complex case: Queries spanning over a user defined period of time

So you implemented a yearly sales report. The world is bright. Really? Not long.

One unhappy morning, your boss will come to your office telling you he wants to see the sales data for the last four month. And last years first half of the year, oh, and a up to date report showing the last 90 days of sales (ex: 15th of January to 13th of April).

If you use a consolidated view as outlined above, you are safe. But for the sake of argument, assume your database administrator’s religion does not allow the creation of VIEWs**.

Creating SQL queries using a BeanShell-Script

Formulas are a great tool for relatively simple problems. But as soon as the problem is more complex or requires dynamic elements, formulas quickly become unmaintainable.

Attributes can be computed by any report expression that returns a suitable value for that attribute. For complex logic, I recommend to use one of the Scripting expressions that comes with Pentaho Reporting.

The Bean-Scripting-Framework (BSF) Expression allows you to use a programming language to compute the query string. The BSF expression comes with support for Java, JavaScript, Groovy and many more programming languages.

Here is an example that computes a query for the last 90 days:

import java.util.Calendar;
import java.util.Date;
import org.pentaho.reporting.engine.classic.core.DataRow;

    Date date = (Date) dataRow.get("Date");
    Calendar cal = Calendar.getInstance();
    cal.setTime(date);
    StringBuilder b = new StringBuilder();
    b.append("SELECT * FROM SALES_");

    int year = cal.get(Calendar.YEAR);
    int month = cal.get(Calendar.MONTH);
    b.append(String.format("%1$04d", year));
    b.append(String.format("%1$02d", month));
    b.append(" WHERE ORDERDATE  ${Date} ");
      }
    }
    return b.toString();

This results in a query string that looks like this:

SELECT * FROM SALES_201106 WHERE ORDERDATE  ${Date} 
 UNION 
SELECT * FROM SALES_201104 WHERE ORDERDATE > ${Date} 
 UNION 
SELECT * FROM SALES_201103 WHERE ORDERDATE > ${Date} 

And now: Happy reporting!

**) This happens more often than you would think.

This entry was posted in Advanced Topic on by .
Thomas

About Thomas

After working as all-hands guy and lead developer on Pentaho Reporting for over an decade, I have learned a thing or two about report generation, layouting and general BI practices. I have witnessed the remarkable growth of Pentaho Reporting from a small niche product to a enterprise class Business Intelligence product. This blog documents my own perspective on Pentaho Reporting's development process and our our steps towards upcoming releases.

6 thoughts on “Access Databases with Dynamic Table Names in Pentaho Reporting

  1. Mariano García Mattío

    Hi, thanks for the post.
    I need to connect to the database name dynamically. I’ve been using sources PDI, and PDI that finally does.
    But, for this case I need use Java API call. Can be passed the URL as parameter? or the connection in self?
    regards
    Mariano

  2. Thomas Morgner

    PRD does not (yet) support fully dynamic connections.

    We do support JNDI to allow you to use server defined connections based on the current runtime environment. This way you can have a different connection while developing the report, another connection for testing and finally a connection to the production database when the report gets promoted to the production system.

  3. wselwood

    I once worked on a system that billed telephone calls. A million calls per day was about normal. Daily tables for call records. The partitioning was mainly done for the speed of loading, each day loaded into its own table with no indexes, once loaded index’s added. Doesn’t impact reporting while loading as they are looking at different tables. The call break down section of peoples telephone bills were annoying.

    I used to use the xactions to build the query for me (It was a while ago) did wonder how it was done these days.

    Thanks

  4. Shanty George

    Thomas,

    Enjoy your blogs..Is there a way to retrieve data in Pentaho Report Desginer from Oracle database using range of numbers as parameters (also pad with zeros)? For example, I like to enter order number range parameter as 100-300 and expect the sql to be like select * from order where order_number in (‘00100′,’00200′,’00300’)

    Thanks
    Shanty

  5. malay

    Hi,

    I wanted to give autocomplete limit in text box. I have one prpt report that contains text box parameter, when I deployed it over the server it shows auto complete which some times turn into problem if the number of records are more. I wanted to achieve some functionality where I can give limit to the auto suggestion. Similar to this jsfidle example http://jsfiddle.net/andrewwhitaker/vqwBP/
    also I have posted my question here with the attached example .
    http://forums.pentaho.com/showthread.php?173407-autocomplete-in-textbox

  6. Rey

    Is this typo, missing a = or > sign?

    SELECT * FROM SALES_201106 WHERE ORDERDATE ${Date}
    UNION

Comments are closed.