Meanwhile, back at the farm

The personal blog of Michael Farmer

Archive for the ‘ETL’ tag

Parallel ETL Servers

Comments

I recently read an abstract by Google on MapReduce. The core principles of MapReduce are first, evaluate which tasks you need to perform that can be done in parallel.  Then execute those tasks in parallel utilizing commodity resources (this is the “map” portion of MapReduce).  Next, take the results of your parallel execution and consolidate (reduce) the results of all your parallel processes into a single result.  The effect is that you can perform very large computations such as sorting and indexing on very large data sets very quickly and effectively.  This is basically what makes Google so fast.

As I started to contemplate my own programming practices, I was able to identify ways that I can utilize the principles of MapReduce in my own applications.  Let’s take ETL, for example.  Many of the tasks in ETL can be performed in parallel.  Loading dimension tables is often a precurser to loading fact tables.  Since dimension tables rarely have dependencies on other tables, they can all be loaded in parallel.  If you had a lot of computation (CPU and RAM intensive) that needed to be done to load the dimensions then those intensive tasks can scale horizontally if your ETL architecture allowed for it.  Here is a simple diagram of what I’m talking about.

The concept here is simple.  Lets say that Dimension 1 is a type 2 dimension which requires maintaining history over time.  Loading type 2 dimensions can be a time consuming and very resource heavy ETL operation.  But dimensions 2 and 3 are simple type 1 dimensions that are going to be inserts with no lookups.  Dimension 4 may also be a type 2 dimension with complex history management.  Under a traditional ETL server setup you would load each dimension in serial.  There would be 4 ETL operations running on a single server that would load each table.  It is easy to see that in this traditional method, the total time to load all dimensions would be the sum of each of the load times.

In many ETL platforms you are allowed to run ETL jobs in parallel but only on a single server.  This works by starting up individual processes to run each job.  This helps, but still requires that your ETL server have enough RAM and CPU to perform the ETL jobs in parallel.  The ideal solution here would be to scale utilizing additional servers.  This can be really expensive, however, as most commercial ETL platforms attach ETL servers to either the machine or to the CPU.

This is another reason why we need a new ETL framework that can scale horizontally across hardware without additional licensing costs.  Interpreted languages like perl, python, and ruby are ideal for this because they are free and very flexible.  They can also be version controlled to make deploying new code on all the different servers a snap. The trouble with using these languages is that writing ETL is very monotinous when you have to write it from scratch every time.  This is another reason why an ETL framework is needed.

Bringing this post back to the main point: by reading about MapReduce, I was able to apply the ideas and principles to a programming task at hand.  The concepts of parallelization, although I knew about them before, came to greater light and understanding by seeing it a little differently as applied by Google.

Reblog this post [with Zemanta]

Written by mikefarmer

December 31st, 2008 at 10:09 am

Posted in Data Warehousing

Tagged with ,

The ETL Step

Comments

I am currently working on a framework for ETL written in Ruby.  For a brief description on why I am doing this, please read the linked post. To see all related content, view everything tagged “ETL Framework“.

The foundation of my ETL framework is the ETL step.  An ETL step is where the developer actually programs a specific task for the ETL to perform.  An example would be to have the ETL download files from an FTP site or to load one or more tables that are related.  The key here is that the task be at a level that you want to track progress.  Some steps may even be re-usable.  For example, you may want to have a step that loads files from a specific location into a processing queue for later steps to parse and load.  You may want just one step with which you can pass a parameter that indicates where the source files live and then just execute that task several times.

Since Ruby is, by its nature, object oriented, the ETL should be designed in an object-oriented manner.  Here is a simple class diagram of the ETLStep class.

ETLStep Class Diagram <br>(inherited properties omitted for subclasses)

ETLStep Class Diagram (inherited properties omitted for subclasses)

The superclass, or parent class, is where all the functionality of the class lives.  Here we can control bookmarking and meta information about the ETL step.  Notice also that ETLStep is abstract.  As in Rails and other Ruby based frameworks, in order to actually write ETL in this framework you actually create a subclass of ETLStep.  The subclasses require only two methods, run and rollback.  The run method is where you develop your ETL process and the rollback is where you specify instructions on how to rollback that step.

The other methods and attributes in the parent class are for additional functionality that I will talk about in another post.  This design gives us some very basic, but powerful functionality when building ETL.  Each step has responsibility for all the ETL that it performs.  If it should fail in any way, the rollback method should clean up after the task leaving the data environment just how it was before the task ran.  Also, each ETL step inherits the ability to perform logging, access database connections, and access global information from the entire ETL process.  The ETL developer doesn’t have to build this, it just comes through inheritance.

So now all that is needed is a way to execute the run method.  This is done by the ETLProcess object. In the next post, I’ll talk about the ETLProcess class and how to actually create and run the ETL.

Written by mikefarmer

September 17th, 2008 at 11:40 am

Posted in Data Warehousing

Tagged with ,

Developing an ETL Framework

Comments

I have recently undertaken a project to write an ETL framework from the ground up.  Why would I want to do this?  In my study of the data warehousing world there are basically two options for building ETL.  1) you can buy a commercial off-the-shelf tool or 2) you can build your own using a combination of scripting languages and stored procedures.  If you know of anything else, let me know.  Here are some of the pros and cons of these options:

Commercial Tool Pros:

  • Well tested and very powerful
  • Technical Support
  • Usually have a GUI to aid in building complex ETL
  • Scalable
  • Handles very large data movement
  • Uniform ETL Development and Maintenance
  • Prepackaged Transforms
  • Data Lineage

Commercial Tool Cons

  • VERY VERY Expensive
  • Overkill for many ETL initiatives
  • Heavy architecture requires specialized skills for maintenance

Scripting Pros:

  • Get something up very quickly
  • Very lightweight
  • Extremely flexible (you can write the ETL to do anything you want)
  • Very low cost of ownership (You can run scripted ETL on a VM if you want)

Scripting Cons:

  • Unwieldy maintenance
  • No scalability
  • Very little code reuse
  • No data lineage

Both solutions have their advantages.  If I had my preference, I would always use a commercial tool in combination with stored procedures (Oracle please).  Unfortunately, not all ETL projects have the budget for this so we compensate with scripting.  I love scripting, don’t get me wrong.  In many ways, scripting was the first ETL tool long before data warehousing was entering the mainstream.  For the project I’m on, scripting was really the only solution.  Because I’ve used an off-the-shelf solution and learned a lot from its benefits, I was really dreading starting from scratch and losing all the built in functionality.  I decided, however, that I could get much of the functionality that I missed by building a simple ETL framework first.  Here is some of the functionality I wanted:

  1. Centralized database connection management (something like ODBC, but utilizing native connections)
  2. A simple database wrapper to take some of the pain out of the interfacing with the database
  3. Ability to reuse ETL steps
  4. Data bookmarking
  5. Integrated testing
  6. Reusable ETL tools

My first inclination was to look for a framework that already exists.  I didn’t really want to use Perl because I have never really had a chance to learn Perl and I don’t know it at all.  Under my current timeline, my best option was to write the ETL in Ruby.  The only Ruby based ETL framework that I could find was a plugin for Ruby on Rails called ActiveWarehouse.  This is a nice start, but it didn’t have the functionality I needed and I didn’t want to waste time digging through someone else’s code trying to figure out how the thing worked.  So, I finally decided to write my own.  What has come out of this is a very interesting framework for building, testing, and deploying ETL that has all the virtues of scripted ETL with some of the advantages of the big boys.  In the coming weeks, I will be writing about some of the design decisions I’ve made and about my experience developing the ETL framework.  I hope the tiny community that I have can help me flush out some issues and provide me with some constructive criticism so that what comes out of this project is actually something that can be a big benefit to others in similar ETL building situations.

Written by mikefarmer

September 16th, 2008 at 7:50 pm

Posted in Data Warehousing

Tagged with ,