Parallel ETL Servers
I recently read an abstract by Google on MapReduce. The core principles of MapReduce are these. First, identify which of the tasks you need to perform can be done in parallel. Then execute those tasks in parallel on commodity resources (this is the “map” portion of MapReduce). Finally, consolidate (reduce) the results of all your parallel processes into a single result. The effect is that very large computations, such as sorting and indexing very large data sets, can be performed quickly and effectively. This is a big part of what makes Google so fast.
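To make the map and reduce steps concrete, here is a minimal single-machine sketch in Python. It is not Google's implementation: the function names (`map_chunk`, `merge_counts`, `word_count`) and the sample input are my own assumptions, a thread pool stands in for a cluster of commodity machines, and the shuffle and grouping stages of a real MapReduce system are left out.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(chunk):
    """Map step: count the words in one independent chunk of input."""
    return Counter(chunk.split())


def merge_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(chunks, workers=4):
    # Each chunk is processed independently, so the map calls can run
    # side by side. The pool here only illustrates the structure; it is
    # a stand-in for separate machines.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Consolidate all partial results into a single result.
    return reduce(merge_counts, partials, Counter())


# Hypothetical input, split into chunks that do not depend on each other.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
totals = word_count(chunks)
```

The important property is that `map_chunk` needs nothing but its own chunk, which is what allows the work to be spread out before `merge_counts` pulls the pieces back together.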
As I started to contemplate my own programming practices, I was able to identify ways to apply the principles of MapReduce in my own applications. Let’s take ETL, for example. Many of the tasks in ETL can be performed in parallel. Loading dimension tables is often a precursor to loading fact tables. Since dimension tables rarely have dependencies on other tables, they can all be loaded in parallel. If loading the dimensions requires a lot of CPU- and RAM-intensive computation, those tasks can scale horizontally, provided your ETL architecture allows for it. Here is a simple diagram of what I’m talking about.
The concept here is simple. Let’s say that Dimension 1 is a type 2 dimension, which requires maintaining history over time. Loading type 2 dimensions can be a time-consuming and very resource-heavy ETL operation. Dimensions 2 and 3 are simple type 1 dimensions that are going to be inserts with no lookups. Dimension 4 may also be a type 2 dimension with complex history management. Under a traditional ETL server setup you would load each dimension serially: four ETL operations running one after another on a single server, each loading one table. In this traditional method, the total time to load all dimensions is the sum of the individual load times. Loaded in parallel, the total is closer to the time of the slowest single load.
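The difference between the two approaches can be sketched with simulated loads. Everything here is hypothetical: the dimension names, the load times, and the `load_dimension` stand-in, which just sleeps instead of touching a database. A thread pool on one machine plays the role of the extra servers, which works for this illustration because sleeping threads overlap; real CPU-heavy loads would need separate processes or machines.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical load times in seconds: dimensions 1 and 4 are the slow
# type 2 loads, dimensions 2 and 3 are the quick type 1 loads.
LOAD_SECONDS = {
    "dimension_1": 0.20,
    "dimension_2": 0.05,
    "dimension_3": 0.05,
    "dimension_4": 0.15,
}


def load_dimension(name):
    """Stand-in for a real dimension load; sleeps to simulate the work."""
    time.sleep(LOAD_SECONDS[name])
    return name


def load_serial(names):
    """Load one dimension after another and time the whole run."""
    start = time.perf_counter()
    loaded = [load_dimension(name) for name in names]
    return loaded, time.perf_counter() - start


def load_parallel(names):
    """Load all dimensions at once, one worker per dimension."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        loaded = list(pool.map(load_dimension, names))
    return loaded, time.perf_counter() - start


names = list(LOAD_SECONDS)
serial_loaded, serial_elapsed = load_serial(names)
parallel_loaded, parallel_elapsed = load_parallel(names)

print(f"serial:   {serial_elapsed:.2f}s (about the sum of the load times)")
print(f"parallel: {parallel_elapsed:.2f}s (about the slowest single load)")
```

With these made-up numbers the serial run takes roughly the sum of the four load times, while the parallel run is bounded by dimension 1, the slowest load. Fact table loads would start only after all of the dimension loads have finished.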
Many ETL platforms allow you to run ETL jobs in parallel, but only on a single server. This works by starting an individual process for each job. This helps, but it still requires that your ETL server have enough RAM and CPU to run the jobs in parallel. The ideal solution would be to scale out across additional servers. This can be really expensive, however, as most commercial ETL platforms license their ETL servers per machine or per CPU.
This is another reason why we need a new ETL framework that can scale horizontally across hardware without additional licensing costs. Interpreted languages like Perl, Python, and Ruby are ideal for this because they are free and very flexible. Code written in them can also be version controlled, which makes deploying new code to all the different servers a snap. The trouble with using these languages is that writing ETL is very monotonous when you have to write it from scratch every time. This is another reason why an ETL framework is needed.
Bringing this post back to the main point: by reading about MapReduce, I was able to apply its ideas and principles to a programming task at hand. I already knew the concepts of parallelization, but seeing them applied a little differently by Google brought them into clearer light.
