“Handle” your data with Talend Open Studio
Integration-server, May 30th, 2007
In the beginning there was our .txt file, which held all the information we needed, kept on separate lines.
Then relational databases arrived to organize information better. They could link data efficiently, which made it easier to extract what was required.
As time went by, many different and more sophisticated formats hit the market, creating an integration problem.
Nowadays we have at least the following formats to choose from:
• XML
• Excel
• CSV
• Flat file
• All the various databases on the market
It looks like a lost cause, but there is a class of products meant to deal with this very problem: integration servers, or simply ETL (Extract, Transform & Load) tools.
Such software extracts data from a source in a generic format, makes changes (either manipulating the incoming data or filtering it) and finally saves the result in a specific output format.
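To make the three ETL steps concrete, here is a minimal sketch in plain Python. It is not taken from any product: the function names, the column names (name, amount) and the sample data are assumptions made up for illustration. It extracts rows from CSV text, filters and cleans them, and loads the result into JSON.

```python
import csv
import io
import json


def extract(csv_text):
    """Extract: parse CSV text into a list of dictionaries (one per row)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows, min_amount):
    """Transform: convert the amount to a number and drop rows below a threshold."""
    result = []
    for row in rows:
        amount = float(row["amount"])
        if amount >= min_amount:
            result.append({"name": row["name"].strip(), "amount": amount})
    return result


def load(rows):
    """Load: serialize the rows to the target format (JSON in this sketch)."""
    return json.dumps(rows)


# Hypothetical sample data, invented for this example.
source = "name,amount\nalice,10\nbob,3\ncarol,25\n"
output = load(transform(extract(source), min_amount=5))
print(output)
```

Keeping the three steps as separate functions mirrors the way an ETL tool keeps source, transformation and destination as separate building blocks.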
In this article we are going to take a look at an open source product from Talend called Open Studio.
This software is available for Windows and Linux (as binaries and as source code) and the installation is very simple, as it does not require any interaction with the user.
Once Open Studio starts up, it is easy to spot that it is based on Eclipse, so the GUI will feel familiar to anyone who already knows the famous Java IDE.
To work on a project with Open Studio, we need to create a job: a source file/format type to start with, then one or more transformations, and finally a destination file/format type.
The GUI offers two modes: one based on dragging and dropping elements, and a textual one (perfect when certain parts of the transformation need to be modified).
There is only one problem with this software: the back end is written in Perl.
This choice of language is understandable, as Perl is great at handling regular expressions and text manipulation, and it has wide support for the most common RDBMSs. However, you do need to know the language if you want to customize the processes.
For this reason Open Studio 2.0 introduced Java support, allowing a wider range of programmers to modify the code of their projects.
Java support is stable but does not cover all components. We might still need to use Perl for some transformations.
That said, in the majority of cases there is no need to write code, as the visual components do the job perfectly.
Let’s have a look at a typical use of this product:
1. An Excel file as input
2. Apply a transformation to one of the columns (in this case, double the current value)
3. A new Excel file as output
Let’s create a new job. From the palette, let’s take three components:
1. tFileInputExcel
2. tMap
3. tFileOutputExcel
By clicking on each one we can define characteristics such as the file name, the schema it refers to and, for the tMap component, the transformation we want to make. Once the components are in position, we can link them logically in sequence, from the first to the third, by right-clicking and drawing the link.
tMap lets us define visually the transformation we need to apply to our data.
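For readers who prefer to see the logic spelled out, the job above can be sketched in plain Python. This is only an illustration of what the three linked components do, not the code Open Studio generates. CSV stands in for Excel so that the sketch needs nothing beyond the standard library, and the function names, column names and sample data are all assumptions.

```python
import csv
import io


def read_rows(text):
    """Input stage (the role played by tFileInputExcel): read rows with a header."""
    return list(csv.DictReader(io.StringIO(text)))


def double_column(rows, column):
    """Mapping stage (the role played by tMap): double one column, copy the rest."""
    return [{**row, column: str(int(row[column]) * 2)} for row in rows]


def write_rows(rows, fieldnames):
    """Output stage (the role played by tFileOutputExcel): write the rows back out."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data: CSV text standing in for the Excel sheet.
data = "item,quantity\npens,4\nbooks,7\n"
result = write_rows(double_column(read_rows(data), "quantity"), ["item", "quantity"])
print(result)
```

Each function takes the output of the previous one, which is the same first-to-third chaining we draw between the components in the GUI.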
It is possible to define more than one output file; for example, one file with the values that satisfy a certain group of criteria and another with all the rejected values.
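The idea of splitting one input into an accepted output and a rejected output can be sketched as follows. Again this is plain Python written for illustration, with a made-up function name, made-up criteria and made-up data, not anything produced by the tool.

```python
def split_rows(rows, is_valid):
    """Route each row to 'accepted' or 'rejected' according to a predicate."""
    accepted, rejected = [], []
    for row in rows:
        (accepted if is_valid(row) else rejected).append(row)
    return accepted, rejected


# Hypothetical rows and criterion: a quantity must not be negative.
rows = [
    {"item": "pens", "quantity": 4},
    {"item": "books", "quantity": -1},
    {"item": "paper", "quantity": 12},
]
accepted, rejected = split_rows(rows, lambda row: row["quantity"] >= 0)
print(len(accepted), len(rejected))
```

Passing the criterion in as a function keeps the routing logic independent of the rule itself, so the same split can be reused with different criteria.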
A key tool is the integrated debugger, which steps through the whole job using all the aids typical of an IDE, such as watch expressions, local variables, etc.
Since this tool is used to automate recurring transformations (for example, the daily arrival of an XML feed that needs to be put in a database), the scheduler is a must.
Scheduling a job starts by creating a cron job, which is added to a cron file for execution. This gives you native support in a Linux environment, while on Windows you will need to install a suitable software package.
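As a rough idea of what such a cron entry looks like, the small Python helper below builds a line in the standard crontab field order (minute, hour, day of month, month, day of week, command). The helper name and the script path are hypothetical, and the exact line the scheduler writes may well differ.

```python
def daily_cron_line(hour, minute, command):
    """Build a crontab entry that runs `command` once a day at hour:minute."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # crontab field order: minute, hour, day of month, month, day of week
    return f"{minute} {hour} * * * {command}"


# Hypothetical path to a script that launches the job.
line = daily_cron_line(2, 30, "/opt/jobs/import_feed.sh")
print(line)
```

The asterisks mean "every day of the month, every month, every day of the week", so this entry would run the command daily at 02:30.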
P.S.: While running some jobs, you might see errors from Perl. These are caused by missing libraries: just take note of the missing package and launch PPM to download and install it.
…and now you are a bit more informed!







