4. Transformations

4.1 Definition

A transformation is a set of operations applied to the extracted data. It is made of steps linked by hops that define the data flow, and it produces the transformed data as its output.

4.2 Documenting

A good practice in creating and maintaining transformations is to document them. A short note for the whole transformation and a description for each step can be added with a right-click.

To insert a note for a transformation, right-click anywhere in the transformation area, choose New note and write the note.

The new note appears in the transformation area in a yellow frame and you can move it anywhere in the area.

It is also possible to document each step individually. To do so, right-click on a step, choose Edit step description and write a short description.

These descriptions can later be used by the application to document the whole process. See the CookBook documentation for further details about generating documentation of your ETL processes.

4.3 Transformation steps

A step is one part of a transformation. Steps provide a wide range of functionality, from reading text files to implementing slowly changing dimensions.

Characteristics of transformation steps

  • A step must have a name that is unique within the transformation;
  • Steps read data from their incoming hops or, in the case of a first step, from an outside input;
  • Steps write data to one or more outgoing hops;
  • In a transformation, steps run simultaneously, each in its own thread, as they are fed rows through their incoming hops;
  • Because of this parallelism, it is not possible to determine an order in which the steps of a transformation run (a minimal sketch of this threading model follows this list);
  • To perform tasks in a specific order, please refer to the Jobs section later in this documentation.
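
The threading model can be pictured with plain Java. The sketch below is an illustration of the principle only, not Spatialytics ETL engine code: two "steps" run as concurrent threads connected by a bounded buffer, so the downstream step starts processing rows before the upstream step has finished, and no overall step order exists.

  import java.util.concurrent.ArrayBlockingQueue;
  import java.util.concurrent.BlockingQueue;

  // Illustrative model only: each step runs in its own thread and is fed
  // rows through a bounded buffer, like the rowset of a hop.
  public class StepThreadsSketch {

      static final Object[] END = new Object[0];   // end-of-stream marker

      public static void main(String[] args) throws InterruptedException {
          BlockingQueue<Object[]> hop = new ArrayBlockingQueue<>(10000);

          // First step: no incoming hop, it reads from an outside input
          // (here it simply generates 50 rows).
          Thread generate = new Thread(() -> {
              try {
                  for (int i = 1; i <= 50; i++) hop.put(new Object[]{ i });
                  hop.put(END);
              } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
          }, "Generate Rows");

          // Second step: consumes rows as soon as they arrive; both threads
          // run simultaneously, so there is no global order of execution.
          Thread output = new Thread(() -> {
              try {
                  Object[] row;
                  while ((row = hop.take()) != END) {
                      System.out.println("processed row " + row[0]);
                  }
              } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
          }, "Text File Output");

          generate.start();
          output.start();
          generate.join();
          output.join();
      }
  }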

Transformation step descriptions

The transformation steps available in Spatialytics ETL are listed below, grouped by category; each entry gives the step name followed by a short description.

Input

  • Access Input: This step reads directly from Microsoft Access “MDB” files.
  • CSV file Input: This step provides the ability to read data from a delimited file.
  • CSW Input: This step provides you with the ability to read data from a CSW catalog.
  • De-serialize from file: This step, formerly known as Cube Input, reads rows of data from a binary Kettle file containing rows and metadata.
  • ESRI Shapefile Reader: This step provides you with the ability to read data from one or more Esri shapefiles; it also allows you to associate a DBF file with the selected shapefile.
  • Excel Input: This step provides you with the ability to read data from one or more Excel and OpenOffice files.
  • Fixed File Input: This step is used to read data from a fixed-width file.
  • Generate Rows: This step outputs a given number of rows.
  • Generate Random Value: This step generates random numbers, integers, strings and even UUIDs.
  • Get File Names: This step allows you to get information associated with file names on the file system.
  • Get Files Row Count: This step counts the number of rows in a file or set of files.
  • Get Subfolder names: This step retrieves subfolder names from the selected directory or directories.
  • Get System Info: This step retrieves information from the Kettle environment.
  • Get Data From XML: This step provides the ability to read data from any type of XML file using XPath specifications.
  • GML File Input: This step provides you with the ability to read data from one or more GML files.
  • JSON Input: This step extracts relevant portions out of JSON structures, files or incoming fields, and outputs rows.
  • KML File Input: This step provides you with the ability to read data from one or more KML files.
  • LDAP Input: This step allows you to read information such as users, roles and other data from an LDAP server.
  • LDIF Input: This step reads data from LDIF files (LDIF is short for LDAP Data Interchange Format).
  • Mondrian Input: This step reads data from a database using a Mondrian schema description (catalog) and an MDX query.
  • OGR Input: This step provides you with the ability to read data from one or more OGR (GDAL) data sources.
  • Property Input: This step reads from Java properties files.
  • RSS Input: This step provides the ability to read articles from RSS feeds.
  • S3 CSV Input: This step reads data from a delimited file stored in Amazon S3.
  • SalesForce Input: This step provides you with the ability to read data directly from SalesForce using the SalesForce Web Service.
  • SOS Input: This step provides you with the ability to read data from a Sensor Observation Service.
  • Shapefiles Input: This step provides you with the ability to read data from one or more shapefiles.
  • Table Input: This step is used to read information from a database, including geographic databases, using a connection and SQL.
  • Text File Input: This step is used to read data from a variety of text-file types.
  • WFS Input: This step is used to retrieve data from a WFS server.
  • Xbase Input: This step makes it possible to read data from most types of DBF file derivatives, known as the XBase family (for example dBase III/IV, FoxPro, Clipper and more).

Output

  • Access Output: This step stores records in a Microsoft Access database table.
  • CSW Output: This step provides you with the ability to store data in a CSW catalog.
  • Delete: This step is the same as the Update step, except that rows are deleted instead of updated.
  • Excel Output: This step allows you to write data to one or more Excel files.
  • GML File Output: This step provides you with the ability to write data to one or more GML files.
  • Insert / Update: This step first looks up a row in a table using one or more lookup keys, then inserts the row if it is not found or updates it if its values differ.
  • JSON Output: This step provides you with the ability to write data to one or more JSON files.
  • KML File Output: This step provides you with the ability to write data to one or more KML files.
  • OGR Output: This step provides you with the ability to write data to one or more OGR data destinations.
  • Property Output: This step outputs a set of rows of data to a Java properties file.
  • RSS Output: This step writes RSS feeds.
  • SQL File Output: This step writes input data to a text file in the form of a set of SQL statements.
  • Serialize to file: This step, formerly known as Cube File Output, stores rows of data in a binary format in a file.
  • Shapefiles Output: This step provides you with the ability to write data to one or more shapefiles.
  • Synchronize after merge: This step is used in conjunction with the Merge rows (diff) transformation step to apply the detected changes to a target table.
  • Table Output: This step allows you to load data into a database table.
  • Text File Output: This step is used to export data to a text file.
  • Update: This step is similar to the Insert/Update step, except that only updates are performed.
  • XML Output: This step allows you to write rows from any source to one or more XML files.

Transform

  • Add XML: This step allows you to encode the content of a number of fields in a row as XML.
  • Add a checksum: This step calculates checksums for one or more fields in the input stream and adds the result to the output as a new field.
  • Add constants: The Add constant values step is a simple and high-performance way to add constant values to the stream.
  • Add sequence: This step adds a sequence number to the stream.
  • Calculator: This step provides you with predefined functions that can be executed on input field values.
  • Closure generator: This step generates a reflexive transitive closure table for Mondrian.
  • Dummy Plugin: The Dummy step does not do anything; its primary function is to be a placeholder for testing purposes.
  • Meta Geocoding module: Spatialytics ETL's Meta Geocoding module geocodes data using existing geocoders.
  • Number range: This step creates ranges based on a numeric field.
  • Replace in String: This step replaces all occurrences of a word in a string with another word.
  • Row Normaliser: This step normalises de-normalised information.
  • Row De-normaliser: This step de-normalises rows by looking up key-value pairs and assigning them to new fields in the output rows.
  • Row flattener: This step flattens consecutive rows based on the order in which they appear in the input stream.
  • Select values: This step selects or removes fields in a row.
  • SRS Transformation: This step provides you with the ability to transform a Spatial Reference System and reproject geometric features.
  • Set SRS: This step provides you with the ability to set a Spatial Reference System on geometric features.
  • Sextante Module: Spatialytics ETL's Sextante module enables advanced geoprocessing by providing various algorithms to simplify, smooth, split and run many other complex processes on geometries.
  • Split fields: The Field splitter step allows you to split fields based on delimiter information.
  • Sort rows: This step sorts rows based on the fields you specify, in ascending or descending order.
  • Spatial Analysis: This step provides you with the ability to analyse the topology and properties of geometric features.
  • Split field to rows: This step allows you to split a row containing a delimited field into multiple new rows, one per split value.
  • String cut: This step cuts out a substring of a string field; it returns blank if the designated portion is out of range.
  • Unique rows: This step removes duplicate rows, leaving only unique occurrences.
  • Unique rows by HashSet: This step removes duplicate rows using a HashSet, leaving only unique occurrences.
  • Value mapper: This step maps values of a certain field from one value to another.
  • XSL Transformation: This step transforms an XML stream using XSL (eXtensible Stylesheet Language).


Utility

  • Clone row: This step creates copies (clones) of a row and outputs them directly after the original row to the next steps.
  • Delay row: For each input row, this step waits the indicated timeout period before passing the row to the next step.
  • Execute a process: You can use this step to execute a shell script on the host where the job will run.
  • If field value is null: This step replaces null values with a given value.
  • Mail: This step sends e-mail messages (this is the Mail transformation step, not the Mail job entry).
  • Metadata structure of stream: This step reads the fields of a stream and outputs their metadata.
  • Null if...: If the string representation of a certain field is equal to the specified value, the value is set to null (empty).
  • Process files: This step copies, moves or deletes files, taking the source and target file names from the data stream.
  • Write to log: This step writes a specific string to the PDI logging system.

Flow

  • Abort: This step aborts the transformation as soon as it receives input rows.
  • Append streams: This step reads the data from two steps, only processing the second stream after the first is finished.
  • Blocking step: This step blocks all output until the very last row has been received from the previous step.
  • Detect empty stream: This step outputs one empty row if, and only if, the input stream is empty.
  • Dummy (do nothing): The Dummy step does not do anything; its primary function is to be a placeholder for testing purposes.
  • Filter rows: This step allows you to filter rows based on conditions and comparisons.
  • Identify last row in a stream: This step adds a boolean field that marks the last row of the stream.
  • Switch / Case: This step implements the switch/case statement found in popular programming languages such as Java.

Scripting

  • Execute SQL script: You can execute SQL scripts with this step, either once during the initialization phase of the transformation or once for every input row the step is given.
  • Execute row SQL script: Like Execute SQL script, except that the SQL statement is read from a field of the data stream.
  • Formula: The Formula step can calculate formula expressions within a data stream.
  • Modified Java Script Value: This step executes JavaScript code on the rows of the data stream.
  • Regex Evaluation: This step allows you to validate an input field against a regular expression.
  • User Defined Java Expression: This step allows you to enter user-defined Java expressions as a basis for the calculation of new values.

Lookup

  • Call DB Procedure: This step allows you to execute a database procedure (or function) and get the result(s) back.
  • Check if a column exists: This step allows you to verify the existence of a specific column in a database table.
  • Database Join: This step allows you to run a query against a database using data obtained from previous steps.
  • Database Lookup: This step allows you to look up values in a database table.
  • Dynamic SQL row: This step allows you to execute an SQL statement that is defined in a database field.
  • File exists: This step verifies the existence of a file whose name comes from previous steps.
  • HTTP Post: This step submits an HTTP POST request to a specified URL.
  • HTTP client: This step performs a simple call to a base URL with options appended.
  • Stream Lookup: This step allows you to look up data using information coming from other steps in the transformation.
  • Table Exists: You can use this step to verify whether a certain table exists in a database.
  • WPS Client: This step provides you with the ability to use a remote WPS service instance to execute exposed processes.
  • Web services lookup: This step performs a Web Services lookup using the Web Services Description Language (WSDL).

Joins

  • Join rows (Cartesian product): This step allows you to produce combinations (Cartesian product) of all rows in the inputs.
  • Merge Join: This step performs a classic merge join between data sets coming from two different input steps (a minimal sketch of the algorithm follows this list).
  • Merge rows (diff): This step allows you to compare two streams of rows.
  • Sorted Merge: This step merges rows coming from multiple input steps, provided these rows are themselves sorted on the given key fields.
  • XML Join: This step allows you to add XML tags from one stream into a leading XML structure from a second stream.
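
A merge join requires both inputs to be sorted on the join key; the two streams can then be joined in a single pass. The plain-Java fragment below is a minimal sketch of the idea (an inner join on integer keys), not the step's actual implementation; real merge joins also handle duplicate keys and the other join types.

  // One-pass inner merge join over two inputs sorted on the key.
  public class MergeJoinSketch {
      public static void main(String[] args) {
          int[] left  = {1, 2, 4, 7};        // both sides sorted on the key
          int[] right = {2, 3, 4, 7, 9};

          int i = 0, j = 0;
          while (i < left.length && j < right.length) {
              if (left[i] < right[j]) {
                  i++;                       // advance the smaller side
              } else if (left[i] > right[j]) {
                  j++;
              } else {
                  System.out.println("joined on key " + left[i]);
                  i++;
                  j++;
              }
          }
          // prints: joined on key 2, 4 and 7
      }
  }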

Data Warehouse

  • Combination lookup/update: This step updates a junk dimension in a data warehouse.
  • Dimension lookup/update: This step updates a slowly changing dimension in a data warehouse.

Validation

  • Credit card validator: This step tells you (1) whether a credit card number is valid, using the LUHN10 (MOD-10) algorithm sketched after this list, and (2) which credit card vendor handles the number (VISA, MasterCard, Diners Club, EnRoute, American Express (AMEX), …).
  • Data validator: Data validation is typically used to make sure that incoming data has a certain quality.
  • Mail Validator: This step checks whether an e-mail address is valid.
  • XSD Validator: This step performs an XSD validation against data in a file or in an input field.
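
The LUHN10 (MOD-10) check used by the Credit card validator is simple enough to show directly. The standalone Java method below is an illustration of the algorithm, not the step's code: starting from the rightmost digit, every second digit is doubled (subtracting 9 when the result exceeds 9), and the number is valid when the digit sum is divisible by 10.

  // Minimal LUHN10 (MOD-10) validity check for a numeric string.
  public class LuhnSketch {
      static boolean isValid(String number) {
          int sum = 0;
          boolean doubleIt = false;          // every second digit, from the right
          for (int i = number.length() - 1; i >= 0; i--) {
              int d = number.charAt(i) - '0';
              if (doubleIt) {
                  d *= 2;
                  if (d > 9) d -= 9;         // same as adding the two digits of d
              }
              sum += d;
              doubleIt = !doubleIt;
          }
          return sum % 10 == 0;
      }

      public static void main(String[] args) {
          System.out.println(isValid("79927398713"));   // true (a classic test number)
      }
  }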

Statistics

  • Analytic Query: This step allows you to peek forward and backward across rows.
  • Group By: This step allows you to calculate values over a defined group of fields (a minimal sketch follows this list).
  • Reservoir sampling: This step samples a fixed number of rows from the incoming stream.
  • Sample rows: This step filters rows based on the line number.
  • Univariate Statistics: This step computes some simple statistics based on a single input field.
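
As an illustration of what a Group By computes, the short Java fragment below (illustrative only, with made-up country and amount fields) sums an amount per group key in a single pass:

  import java.util.HashMap;
  import java.util.Map;

  // Illustrative Group By: sum the "amount" field per "country" key.
  public class GroupBySketch {
      public static void main(String[] args) {
          Object[][] rows = { {"CA", 10.0}, {"CA", 5.0}, {"FR", 7.5} };

          Map<String, Double> sumByCountry = new HashMap<>();
          for (Object[] row : rows) {
              sumByCountry.merge((String) row[0], (Double) row[1], Double::sum);
          }
          System.out.println(sumByCountry);   // e.g. {CA=15.0, FR=7.5}
      }
  }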

Job

  • Copy rows to result: This step allows you to transfer rows of data (in memory) to the next transformation (job entry) in a job.
  • Get Variables: This step allows you to get the value of a variable.
  • Get files from result: Every time a file is processed, used or created in a transformation or a job, its details (the file, the job entry, the step, etc.) are captured and added to the result; this step retrieves that file information.
  • Get rows from result: This step returns rows that were previously generated by another transformation in a job.
  • Set variables: This step allows you to set variables in a job or in the virtual machine.
  • Set files in result: This step can be used to route the list of files to the result stream.

Mapping

  • Mapping (sub-transformation): When you want to re-use a certain sequence of steps, you can turn the repetitive part into a mapping.
  • Mapping input: This step acts as an input placeholder in a mapping, a.k.a. sub-transformation.
  • Mapping output: This step acts as an output placeholder in a mapping, a.k.a. sub-transformation.

Inline

  • Injector: This step allows rows to be injected into the transformation through the Java API (a minimal sketch follows this list).
  • Socket Reader: A socket client that connects to a server (a Socket Writer step) and reads rows from it.
  • Socket Writer: A socket server that can send rows of data to a Socket Reader.
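
Rows are typically pushed into the Injector step from Java code. The sketch below follows the classic Kettle Java API pattern (TransMeta, Trans, RowProducer); the file name and step name are examples, and the exact class and method signatures should be checked against the Kettle version bundled with your Spatialytics ETL release.

  import org.pentaho.di.core.KettleEnvironment;
  import org.pentaho.di.core.row.RowMeta;
  import org.pentaho.di.core.row.ValueMeta;
  import org.pentaho.di.core.row.ValueMetaInterface;
  import org.pentaho.di.trans.RowProducer;
  import org.pentaho.di.trans.Trans;
  import org.pentaho.di.trans.TransMeta;

  // Sketch: inject rows into a transformation through a step named "Injector".
  // Verify the API details against your Kettle version.
  public class InjectorSketch {
      public static void main(String[] args) throws Exception {
          KettleEnvironment.init();

          TransMeta transMeta = new TransMeta("my_transformation.ktr"); // example file
          Trans trans = new Trans(transMeta);
          trans.prepareExecution(null);

          // Attach a row producer to the Injector step (copy 0), then start.
          RowProducer producer = trans.addRowProducer("Injector", 0);
          trans.startThreads();

          // Describe the row layout and push two rows.
          RowMeta rowMeta = new RowMeta();
          rowMeta.addValueMeta(new ValueMeta("name", ValueMetaInterface.TYPE_STRING));
          producer.putRow(rowMeta, new Object[]{ "hello" });
          producer.putRow(rowMeta, new Object[]{ "world" });
          producer.finished();               // signal end of input

          trans.waitUntilFinished();
      }
  }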

Experimental

  • Greenplum Bulk Loader: This step wraps the Greenplum GPLoad data loading utility.
  • GZIP CSV Input: This step reads data from a delimited file compressed with GZIP.
  • MySQL Bulk Loader: An experimental step that streams data from inside Kettle to a named pipe and loads it into the database using “LOAD DATA INFILE 'FIFO File' INTO TABLE ….”.
  • PostgreSQL Bulk Loader: An experimental step that streams data from inside Kettle to the psql command using “COPY DATA FROM STDIN”.
  • Set Field Value: This step sets the value of a field from another field.
  • Set Field Value to a Constant: This step sets the value of a field to a constant.

Deprecated

  • Aggregate rows: This step allows you to quickly aggregate rows based upon all the rows.
  • XML Input: This step allows you to read information stored in XML files.
  • Streaming XML Input: The purpose of this step is to provide value parsing.
  • Get Previous Row Fields: This step allows you to retrieve fields from the previous row.

Bulk Loading

  • Infobright loader: This step loads data into an Infobright database table.
  • Oracle Bulk Loader: This step uses the Oracle bulk loader utility to load data.
  • MonetDB bulk loader: This step loads data into MonetDB by using its bulk load command in streaming mode.
  • LucidDB bulk loader: This step loads data into LucidDB by using Remote Rows UDX.

4.4 Transformation Hops

Hops link steps together and define the data flow. There are three types of hops:

Copy

With a copy hop, all the rows are copied to every step to which they are sent. In the figure below, for example, the 50 rows coming from the Filter rows step are sent to the Formula 1, Formula 2 and Formula 3 steps in parallel.

Distribute

With distribution, the rows are sent alternately to each step to which the hops are pointing. In the figure below, for example, the first row coming from the Filter rows step is sent to the Formula 1 step, the second row to the Formula 2 step, the third row to the Formula 3 step, the fourth row back to the Formula 1 step, and so on. The Formula 1 step therefore receives rows 1, 4, 7, 10, 13, etc., the Formula 2 step receives rows 2, 5, 8, 11, 14, etc., and the Formula 3 step receives rows 3, 6, 9, 12, 15, etc.

Conditional output

It is also possible to define conditions for hops. For example, in the case of integer output from the Filter rows step, one can specify that all odd numbers are sent to the Formula 1 step while all even numbers flow to the Formula 2 step.
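
The three behaviours can be modelled in a few lines of plain Java (an illustration of the routing rules, not engine code): copying sends every row to every target, distribution deals rows out in round-robin fashion, and a conditional hop routes each row on a predicate, here odd versus even.

  import java.util.ArrayList;
  import java.util.List;

  // Illustrative routing of rows over outgoing hops.
  public class HopRoutingSketch {
      public static void main(String[] args) {
          int[] rows = {1, 2, 3, 4, 5, 6};

          // Distribute: row i goes to target (i - 1) modulo the target count.
          // (Copying would instead add every row to every target list.)
          List<List<Integer>> targets =
              List.of(new ArrayList<>(), new ArrayList<>(), new ArrayList<>());
          for (int i = 0; i < rows.length; i++) {
              targets.get(i % targets.size()).add(rows[i]);
          }
          System.out.println(targets);          // [[1, 4], [2, 5], [3, 6]]

          // Conditional output: odd rows to one step, even rows to another.
          List<Integer> odd = new ArrayList<>();
          List<Integer> even = new ArrayList<>();
          for (int row : rows) {
              (row % 2 != 0 ? odd : even).add(row);
          }
          System.out.println(odd + " " + even); // [1, 3, 5] [2, 4, 6]
      }
  }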

Hops also play a buffer role between two steps. This buffer is called a rowset, and its size can be set in Menu/Transformation/Miscellaneous/Nb of rows in rowset; the maximum value is 1 000 000 000. Finally, bear in mind that loops are not allowed in transformations.
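
This buffering is the classic bounded producer/consumer pattern: when the rowset is full, the upstream step blocks on its write until the downstream step has consumed rows, which automatically throttles a fast step to the speed of a slow one. A minimal Java illustration with a deliberately tiny rowset of 5:

  import java.util.concurrent.ArrayBlockingQueue;
  import java.util.concurrent.BlockingQueue;

  // A rowset as a bounded queue: put() blocks when the buffer is full,
  // take() blocks when it is empty.
  public class RowsetSketch {
      public static void main(String[] args) throws InterruptedException {
          BlockingQueue<Integer> rowset = new ArrayBlockingQueue<>(5);

          Thread fastStep = new Thread(() -> {
              try {
                  for (int i = 1; i <= 20; i++) {
                      rowset.put(i);         // blocks once 5 rows are buffered
                  }
              } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
          });

          Thread slowStep = new Thread(() -> {
              try {
                  for (int i = 1; i <= 20; i++) {
                      Thread.sleep(50);      // simulate a slow step
                      System.out.println("consumed " + rowset.take());
                  }
              } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
          });

          fastStep.start();
          slowStep.start();
          fastStep.join();
          slowStep.join();
      }
  }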
