Every now and then we get the question on how to create a report on database tables with a dynamic date stamped table name. Sounds weirdly cryptic? Well, instead of having a single table for all orders, these systems may have many small tables for all orders of a specific month.
I tried hard to come up with a potential reason for these things:
- Maybe the database is so weak that it can’t handle more than a few thousand rows in a table. But both MySQL nor PostgreSQL are quite capable for some serious data storage (if used right). And even Oracle can do that – and if you have enough money to buy a Oracle license you have money to buy a server (-farm, Oracle ain’t cheap!) too.
- Maybe they want to speed up data access to the table. After all, a SELECT * FROM TABLE takes a long time if you don’t use a WHERE clause or a proper index.
- Maybe they have not heard of normalization.
- Maybe they love to have separate tables so that they can remove the data of older months. A SQL DELETE command is black magic. Deleting the database file in the filesystem is quicker.
So if you are using such a system – drop me a note. I am dying to know what application creates such a data-structure.
Reporting on these database structures is not straight forward and usually considerably slower than working on a proper data model. And here is how you do it, in simple steps:
Prerequisite: The data model we use in this tutorial
Lets assume we have a database filled with sales data. For the sake of simplicity, lets assume there are several tables named in the schema of “SALES_yyyymm”. The ‘yyyy’ part of the table name represents the four digit year, the ‘mm’ represents the two-digit month.
The tables are made up of the following structure (including some sample data*):
CUSTOMERNUMBER |
DATE |
ORDERID |
ARTICLEID |
QUANTITY |
PRICE |
0001 |
20091015 |
112 |
Fluffy Dogs |
100 |
15 |
0001 |
20091015 |
112 |
Gumpy Rabbit |
9 |
14 |
0001 |
20091015 |
112 |
Angry Birds |
12 |
5 |
0002 |
20091024 |
113 |
Busy Bees |
99 |
1 |
*) Note: Yes I know that putting strings as article ID is bad. But it safes me from outlining a second table.
Prerequisite 2: Reporting made dynamic with a date parameter
Creating a static report with a hardcoded table name is easy. Write your query and be done with it. The tricky part is to create a report that selects the right table based on the a given date. That date could be the current system date for reports returning the current month’s turnover or it could be a historic date.
If you want to make the report dynamic, you will need a parameter to feed the report’s target date into the reporting system and into the query. The table-name in the database contains a partial date value. To make the report return data for a given date, we must calculate the database table name when the report runs based on whatever date the user entered.
Standard SQL datasources do not allow you to parametrize structural elements of a query. This is a security measure to prevent an attack known as “SQL Injection“.
If you need to compute structural elements of a query, you will have to use a “JDBC (Custom)” datasource (Data -> Add Data-Source -> Advanced -> JDBC (Custom)). Whenever you use a data-source marked as (Custom), you opt out of the standard parametrization. At that point, it is your responsibility to properly validate all incoming user data and to ensure that no one can enter illegal values.
The Simple-Case: Reporting within the same table
The Custom JDBC datasource only contains connection information. The actual SQL query that is executed is computed at runtime. To define a query, put a formula or expression into the “query::name” attribute of your master- or sub-report object.
Mike Tarallo create a excellent video tutorial on how to use the Custom JDBC data sources. So instead of repeating everything here, watch the video and learn what the Custom JDBC datasource is and how to use it properly.
Simple Parametrization: Using a string parameter
So we need to declare a parameter to select the table the report should use. You want it dead simple (just for testing, of course!)? Use a plain text parameter that allows the user to enter the year and month to use.
Parameter:
Name: Simple
Value Type: String
and finally, to prevent SQL injections, some validation:
Post-Processing-Formula: =IF(LEN([Simple]) = 6; IF (ISNUMBER([Simple]); [Simple]; NA()); NA())
Complex Parametrization ideas
You can use a ordinary date parameter to make use of the date-picker. Note that this returns a Date object. You need to convert this date into a valid text. For that add a second hidden parameter with the following Post-Processing Formula:
=MESSAGE("{0,date,yyyyMM}";[Date])
Alternatively, assuming you use a proper database, you can extract the valid dates from your database. Just query the database metadata for the tables that are available. (I am using MySQL for the examples here. Consult your database manual on how to access the database schema.)
SELECT SUBSTRING(table_name, 5) AS "SalesDates"
FROM INFORMATION_SCHEMA.TABLES
WHERE table_schema = ''
AND table_name LIKE 'SALES_%'
You can use this information to feed a list parameter.
Create your query
I assume you have a parameter that contains a suitable string (6 characters, yyyyMM format). Lets call this parameter “DateSuffix”. On the master-report’s “query-name” attribute, create a formula with the following syntax:
="SELECT * FROM SALES_" & [DateSuffix]
At runtime, the report engine will evaluate the formula and construct the proper query string. That query string is then executed by the Custom JDBC Data-Source.
Tip: The report designer does not use queries from formulas during the design process. Read this article to learn how to get designtime column information into the Pentaho Report Designer.
Summary
To create a simple dynamic report:
- Create a parameter that returns the suffix of your table
- Add a Custom JDBC Datasource to your report
- Add a formula to the query-name-attribute of your report
Creating reports that summarize a full year
Single reports for a single month – that is easy stuff. But sooner or later someone will come back to you and ask for sales data spanning the last year.
Lets skip the parameter defining details. Create a parameter that returns the year as string and call it “YearText”.
To your database, each of the monthly tables is unrelated to any other of the monthly tables. Databases are great on providing referential integrity on the data within the tables. But generally they care little about the names of the tables and particularly do not assume that table names have any higher meaning.
In SQL merging data from several tables happens via an UNION operator. Tables that are merged via UNION must have the same structure – the number and types of the columns must match.
A SQL Union takes the form of
UNION
Or for our example:
SELECT * FROM SALES_200910
UNION
SELECT * FROM SALES_200911
Each select statement in a UNION is considered separately. Any “ORDER BY” clause is applied only to the particular table. If you want to sort globally, you will have to wrap the UNION result into another SQL query via a SUB-SELECT. All proper databases support Sub-Selects, but again: For details, consult your database manual.
Here is an example on how to use a sub-select to sort globally
SELECT * FROM
(SELECT * FROM SALES_200910
UNION
SELECT * FROM SALES_200911)
ORDER BY ARTICLEID
Sub-selects are considerably slower than normal selects. Most databases will start to use temporary tables for the selection result and therefore will answer such queries a lot slower than normal queries. If your performance goes down a lot, consider creating a data warehouse.
A query spanning a whole year will be long and complex.
SELECT * FROM SALES_200901
UNION
SELECT * FROM SALES_200902
UNION
SELECT * FROM SALES_200903
UNION
SELECT * FROM SALES_200904
UNION
SELECT * FROM SALES_200905
UNION
SELECT * FROM SALES_200906
UNION
SELECT * FROM SALES_200907
UNION
SELECT * FROM SALES_200908
UNION
SELECT * FROM SALES_200909
UNION
SELECT * FROM SALES_200910
UNION
SELECT * FROM SALES_200911
UNION
SELECT * FROM SALES_200912
The matching formula for making the query dynamic would look like:
="SELECT * FROM SALES_" & [YearText] & "01
UNION
SELECT * FROM SALES_" & [YearText] & "02
UNION
SELECT * FROM SALES_" & [YearText] & "03
UNION
..
Creating more complex reports on such a model is a pain. I would recommend to create a single SQL VIEW that contains all your sales table.
This normalizes your data model without breaking your existing programsViews are virtual database tables that make it easier to write SQL queries.
CREATE VIEW Sales AS
SELECT * FROM SALES_200901
UNION
SELECT * FROM SALES_200902
UNION
SELECT * FROM SALES_200903
UNION
..
If your database supports materialized views, it will create a permanent temporary table that can considerably speed up your reports.
The insanely complex case: Queries spanning over a user defined period of time
So you implemented a yearly sales report. The world is bright. Really? Not long.
One unhappy morning, your boss will come to your office telling you he wants to see the sales data for the last four month. And last years first half of the year, oh, and a up to date report showing the last 90 days of sales (ex: 15th of January to 13th of April).
If you use a consolidated view as outlined above, you are safe. But for the sake of argument, assume your database administrator’s religion does not allow the creation of VIEWs**.
Creating SQL queries using a BeanShell-Script
Formulas are a great tool for relatively simple problems. But as soon as the problem is more complex or requires dynamic elements, formulas quickly become unmaintainable.
Attributes can be computed by any report expression that returns a suitable value for that attribute. For complex logic, I recommend to use one of the Scripting expressions that comes with Pentaho Reporting.
The Bean-Scripting-Framework (BSF) Expression allows you to use a programming language to compute the query string. The BSF expression comes with support for Java, JavaScript, Groovy and many more programming languages.
Here is an example that computes a query for the last 90 days:
import java.util.Calendar;
import java.util.Date;
import org.pentaho.reporting.engine.classic.core.DataRow;
Date date = (Date) dataRow.get("Date");
Calendar cal = Calendar.getInstance();
cal.setTime(date);
StringBuilder b = new StringBuilder();
b.append("SELECT * FROM SALES_");
int year = cal.get(Calendar.YEAR);
int month = cal.get(Calendar.MONTH);
b.append(String.format("%1$04d", year));
b.append(String.format("%1$02d", month));
b.append(" WHERE ORDERDATE ${Date} ");
}
}
return b.toString();
This results in a query string that looks like this:
SELECT * FROM SALES_201106 WHERE ORDERDATE ${Date}
UNION
SELECT * FROM SALES_201104 WHERE ORDERDATE > ${Date}
UNION
SELECT * FROM SALES_201103 WHERE ORDERDATE > ${Date}
And now: Happy reporting!
**) This happens more often than you would think.