Meanwhile, back at the farm

The personal blog of Michael Farmer

Archive for the ‘Data Warehousing’ Category

Parallel ETL Servers

Comments

I recently read an abstract by Google on MapReduce. The core principles of MapReduce are first, evaluate which tasks you need to perform that can be done in parallel.  Then execute those tasks in parallel utilizing commodity resources (this is the “map” portion of MapReduce).  Next, take the results of your parallel execution and consolidate (reduce) the results of all your parallel processes into a single result.  The effect is that you can perform very large computations such as sorting and indexing on very large data sets very quickly and effectively.  This is basically what makes Google so fast.

As I started to contemplate my own programming practices, I was able to identify ways that I can utilize the principles of MapReduce in my own applications.  Let’s take ETL, for example.  Many of the tasks in ETL can be performed in parallel.  Loading dimension tables is often a precurser to loading fact tables.  Since dimension tables rarely have dependencies on other tables, they can all be loaded in parallel.  If you had a lot of computation (CPU and RAM intensive) that needed to be done to load the dimensions then those intensive tasks can scale horizontally if your ETL architecture allowed for it.  Here is a simple diagram of what I’m talking about.

The concept here is simple.  Lets say that Dimension 1 is a type 2 dimension which requires maintaining history over time.  Loading type 2 dimensions can be a time consuming and very resource heavy ETL operation.  But dimensions 2 and 3 are simple type 1 dimensions that are going to be inserts with no lookups.  Dimension 4 may also be a type 2 dimension with complex history management.  Under a traditional ETL server setup you would load each dimension in serial.  There would be 4 ETL operations running on a single server that would load each table.  It is easy to see that in this traditional method, the total time to load all dimensions would be the sum of each of the load times.

In many ETL platforms you are allowed to run ETL jobs in parallel but only on a single server.  This works by starting up individual processes to run each job.  This helps, but still requires that your ETL server have enough RAM and CPU to perform the ETL jobs in parallel.  The ideal solution here would be to scale utilizing additional servers.  This can be really expensive, however, as most commercial ETL platforms attach ETL servers to either the machine or to the CPU.

This is another reason why we need a new ETL framework that can scale horizontally across hardware without additional licensing costs.  Interpreted languages like perl, python, and ruby are ideal for this because they are free and very flexible.  They can also be version controlled to make deploying new code on all the different servers a snap. The trouble with using these languages is that writing ETL is very monotinous when you have to write it from scratch every time.  This is another reason why an ETL framework is needed.

Bringing this post back to the main point: by reading about MapReduce, I was able to apply the ideas and principles to a programming task at hand.  The concepts of parallelization, although I knew about them before, came to greater light and understanding by seeing it a little differently as applied by Google.

Reblog this post [with Zemanta]

Written by mikefarmer

December 31st, 2008 at 10:09 am

Posted in Data Warehousing

Tagged with ,

The ETL Process

Comments

I am currently working on a framework for ETL written in Ruby.  For a brief description on why I am doing this, please read the linked post. To see all related content, view everything tagged “ETL Framework“.

Now that we have some ETL Steps that we can build we need some logical way of running them.  Favoring composition over inheritence, I went with a container class called the ETLProcess.  The basic UML for the ETLProcess class along with its relationship to the ETLStep class is below:

ETLProcess class with relationships to ETLStep classes

ETLProcess class with relationships to ETLStep classes

A lot can be said about why I chose this design, but I think the best way to talk about it is to actually show how it works.  The following ruby code shows how you would use ETLProcess and ETLStep classes to build out your ETL:

require 'etl_lib'

etl = ETLProcess.new 'My ETL Process'
etl.add_step(MyETLStep1.new)
etl.add_step(MyETLStep2.new)
etl.start

Lets walk through the parts here to get an idea of what is going on.  The first line is a simple require to make sure that we have access to the ETLProcess and ETLStep classes.  Next, we create a new object called “etl” that is the ETL process.  Then we add the steps to the etl object with the add_step method.  A lot of behind the scenes stuff happens when this is done which I will go into in another post.  Notice that when a new step is added to the process that we actually create a new object on the fly. When the add_step method is called, the ETLStep object is added to an array called steps.  Why not just add them directly?  Because we need to do a lot with that Step object before it is ready to be run.

The start method simply loops through the steps array and executes the start method on the step.  I wanted to hide a lot of the background stuff from the actual ETL contained in the ETLStep, thus the two methods in the step, run and start.  The start method does some background work and then executes the run method in the step.  This keeps the development of new ETL sqeaky clean and focused on the actual ETL rather than on backend tasks like bookmarking and database connection handling.

The other interesting thing that the ETLProcess class does is that it tracks when there is an error in the ETLStep and controls whether the step should be rolled back or not.  By default, the ETL will run the rollback method in the Step if it fails.  To turn this functionality off, you can specify it as an option when you add the step to the etl object.  For example:

etl.add_step(MyStep.new, {:rollback_on_fail => false})

Another way the ETLProcess controls the running of ETL Steps is by providing a global rollback method.  That way, if you wanted, you could rollback the entire ETL process if something didn’t work just right.  Here’s an example:

require 'etl_lib'

etl = ETLProcess.new 'My ETL Process'
etl.add_step(MyETLStep1.new)
etl.add_step(MyETLStep2.new)
etl.start

success = true
# code to check to see if the ETL ran as expected..
# set success to false if it doesn't look good

etl.rollback unless success

That’s a quick introduction to the ETLProcess Class.  The next big hurdle is how to create unit tests for the ETL steps.  I’ll show you how I tackled that problem in the next post.

Written by mikefarmer

September 23rd, 2008 at 11:46 am

Posted in Data Warehousing

Tagged with

Infobright – Open source data warehousing

Comments

Keep your eyes on Infobright. They are doing some amazing things with redefining the Data Warehousing physical architecture.  This is is the kind of thinking that needs to happen if Data Warehousing is ever going to be made available to the long-tail.

What Infobright is doing is creating a new MySQL database engine that you would use as opposed to MyISAM or InnoDB.  This kind of low level optimization using column-based tables rather than row-based allows them to store terabytes of data at very high compression with very fast query times.  This removes the need to perform common DW optimizations such as Materialized Views and Table Partitioning.

This product is still really young and has a ways to go before it’s ready for mainstream but it is definitely something to keep your eyes on.

Written by mikefarmer

September 19th, 2008 at 10:23 am

Posted in Data Warehousing

Tagged with ,

The ETL Step

Comments

I am currently working on a framework for ETL written in Ruby.  For a brief description on why I am doing this, please read the linked post. To see all related content, view everything tagged “ETL Framework“.

The foundation of my ETL framework is the ETL step.  An ETL step is where the developer actually programs a specific task for the ETL to perform.  An example would be to have the ETL download files from an FTP site or to load one or more tables that are related.  The key here is that the task be at a level that you want to track progress.  Some steps may even be re-usable.  For example, you may want to have a step that loads files from a specific location into a processing queue for later steps to parse and load.  You may want just one step with which you can pass a parameter that indicates where the source files live and then just execute that task several times.

Since Ruby is, by its nature, object oriented, the ETL should be designed in an object-oriented manner.  Here is a simple class diagram of the ETLStep class.

ETLStep Class Diagram <br>(inherited properties omitted for subclasses)

ETLStep Class Diagram (inherited properties omitted for subclasses)

The superclass, or parent class, is where all the functionality of the class lives.  Here we can control bookmarking and meta information about the ETL step.  Notice also that ETLStep is abstract.  As in Rails and other Ruby based frameworks, in order to actually write ETL in this framework you actually create a subclass of ETLStep.  The subclasses require only two methods, run and rollback.  The run method is where you develop your ETL process and the rollback is where you specify instructions on how to rollback that step.

The other methods and attributes in the parent class are for additional functionality that I will talk about in another post.  This design gives us some very basic, but powerful functionality when building ETL.  Each step has responsibility for all the ETL that it performs.  If it should fail in any way, the rollback method should clean up after the task leaving the data environment just how it was before the task ran.  Also, each ETL step inherits the ability to perform logging, access database connections, and access global information from the entire ETL process.  The ETL developer doesn’t have to build this, it just comes through inheritance.

So now all that is needed is a way to execute the run method.  This is done by the ETLProcess object. In the next post, I’ll talk about the ETLProcess class and how to actually create and run the ETL.

Written by mikefarmer

September 17th, 2008 at 11:40 am

Posted in Data Warehousing

Tagged with ,

Developing an ETL Framework

Comments

I have recently undertaken a project to write an ETL framework from the ground up.  Why would I want to do this?  In my study of the data warehousing world there are basically two options for building ETL.  1) you can buy a commercial off-the-shelf tool or 2) you can build your own using a combination of scripting languages and stored procedures.  If you know of anything else, let me know.  Here are some of the pros and cons of these options:

Commercial Tool Pros:

  • Well tested and very powerful
  • Technical Support
  • Usually have a GUI to aid in building complex ETL
  • Scalable
  • Handles very large data movement
  • Uniform ETL Development and Maintenance
  • Prepackaged Transforms
  • Data Lineage

Commercial Tool Cons

  • VERY VERY Expensive
  • Overkill for many ETL initiatives
  • Heavy architecture requires specialized skills for maintenance

Scripting Pros:

  • Get something up very quickly
  • Very lightweight
  • Extremely flexible (you can write the ETL to do anything you want)
  • Very low cost of ownership (You can run scripted ETL on a VM if you want)

Scripting Cons:

  • Unwieldy maintenance
  • No scalability
  • Very little code reuse
  • No data lineage

Both solutions have their advantages.  If I had my preference, I would always use a commercial tool in combination with stored procedures (Oracle please).  Unfortunately, not all ETL projects have the budget for this so we compensate with scripting.  I love scripting, don’t get me wrong.  In many ways, scripting was the first ETL tool long before data warehousing was entering the mainstream.  For the project I’m on, scripting was really the only solution.  Because I’ve used an off-the-shelf solution and learned a lot from its benefits, I was really dreading starting from scratch and losing all the built in functionality.  I decided, however, that I could get much of the functionality that I missed by building a simple ETL framework first.  Here is some of the functionality I wanted:

  1. Centralized database connection management (something like ODBC, but utilizing native connections)
  2. A simple database wrapper to take some of the pain out of the interfacing with the database
  3. Ability to reuse ETL steps
  4. Data bookmarking
  5. Integrated testing
  6. Reusable ETL tools

My first inclination was to look for a framework that already exists.  I didn’t really want to use Perl because I have never really had a chance to learn Perl and I don’t know it at all.  Under my current timeline, my best option was to write the ETL in Ruby.  The only Ruby based ETL framework that I could find was a plugin for Ruby on Rails called ActiveWarehouse.  This is a nice start, but it didn’t have the functionality I needed and I didn’t want to waste time digging through someone else’s code trying to figure out how the thing worked.  So, I finally decided to write my own.  What has come out of this is a very interesting framework for building, testing, and deploying ETL that has all the virtues of scripted ETL with some of the advantages of the big boys.  In the coming weeks, I will be writing about some of the design decisions I’ve made and about my experience developing the ETL framework.  I hope the tiny community that I have can help me flush out some issues and provide me with some constructive criticism so that what comes out of this project is actually something that can be a big benefit to others in similar ETL building situations.

Written by mikefarmer

September 16th, 2008 at 7:50 pm

Posted in Data Warehousing

Tagged with ,

Worklight Puts Web 2.0 on Corporate Data

Comments

I stumbled upon a new product by Serendipity Technologies called Worklight.

Worklight is a secure and scalable server based software product that provides workers and consumers with ‘Web 2.0-style’ access to corporate data stored in enterprise information systems and applications… Worklight extracts information from enterprise applications and serves it in a secure and personalized fashion using a host of web 2.0 technologies, both inside and outside the firewall. Personnel are able to easily define their own application views, and then display them through tools like RSS aggregators, popular web based personalized homepages, or desktop gadgets.
(taken from the Worklight Product Sheet – registration is required to see the product sheet!)

The architecture is pretty simple and it’s very close to my Event Based Reporting idea. Basically, you build a server product that aggregates snapshots of data from different data sources and applications. Then you feed those snapshots via RSS feeds, Web Services, and the like.

I really think this approach is the future of data delivery. Of course there will always be a need for printed reports, but actionable data needs to be delivered in more general and malleable forms so that tech-saavy managers can gain the competitive edge they need.

Written by mikefarmer

June 21st, 2007 at 9:09 am