Pages

Monday, September 8, 2014

Pentaho Data Integration

In order to present the ETL process using the project Pentaho Data Integration, a small case study is proposed. The case study consist in a Car Dealer Business, which has the following basic information sources:

  • Sales database implemented in MySQL called ibis_dealer_sales. The relational model is presented in the next figure.
  • CRM database implemented in MySQL called ibis_dealer_crm. The relational model is presented in the next figure.
  • Car details, which is an Excel file. Its data is presented in he next figure

The objective is creating an ETL process for including all required information in the datawarehouse presented in the next figure

For this case study, a Job is created. This job includes all transformations of the ETL process. The following job includes four transformations. They are executed sequentially.

The first transformation, has the following steps:

  • crm_customer gets the table customer from the database ibis_dealer_crm
  • crm_category gets the table category from the database ibis_dealer_crm
  • Concat name allows concatenate the first name and last name of the customer
  • Stream lookup allows joining information from both sources based on the primary key of the table category and foreign key (which relates the customer with the category) of the table customer.
  • dwh_customer exports all information to the data base ibis_dealer_dwh

Each step in the transformation that requires information from one database allows creating connection. Next wizard is used for this task. It allows including all required parameters for the connection.

The second transformation, has the following steps:

  • sales_branch gets the table branch from the database ibis_dealer_sales
  • sales_city gets the table city from the database ibis_dealer_sales
  • sales_country gets the table country from the database ibis_dealer_sales
  • Stream lookup allows joining information from both sources based on the primary key of the table city and foreign key (which relates the branch with the city) of the table branch.
  • Stream lookup allows joining information from previous sources based on the primary key of the table country and foreign key (which relates the city with the country) of the table city.
  • dwh_customer exports all information to the data base ibis_dealer_dwh

The third transformation, has the following steps:

  • sales_car gets the table car from the database ibis_dealer_sales
  • Microsoft Excel Input gets the information from the excel file
  • Stream lookup allows joining information from both sources based on the attribute idmodel the table car and the column Id model of the excel file.
  • dwh_car exports all information to the data base ibis_dealer_dwh

For configuring the excel file, it is necessary to take into account the following configurations by double clicking the step.

  • In the tab “Files”, add the excel file
  • In the tab “Sheet” get the sheetnames. Select the first row and the first column of the information (e.g., in the file presented, the cell with the text “Id Model”, which corresponds whit the first cell is in the column A row 2).
  • In the tab “Fields” get fields from header row.

The fourth transformation, has the following steps:

  • sales_sale gets the table sale from the database ibis_dealer_sales
  • sales_car gets the table car from the database ibis_dealer_sales
  • Stream lookup allows joining information from both sources based on the attribute primary key of the table car and the foreign key (which relates the city with the country) of the table sale.
  • Calculator allows calculating the profit for each sale.
  • dwh_sale exports all information to the data base ibis_dealer_dwh

Finally, by executing the job, all information is extracted from the 3 sources, transformed based on the requirements and loaded in the data warehouse.

5 comments: