Saturday, October 11, 2008

GridGain and GigaSpaces. Process data in time.

Today I'd like to talk about GigaSpaces and GridGain, mostly in the context of job scheduling, and touch on some key concepts of both products.

We all know that it is sometimes critical to execute a particular job at a certain time. Say you have a database that is available only from 2pm to 5pm, and all tasks/jobs of a certain type need to be executed within that time range. This feature is essential for monsters like Globus, but what about the niche occupied by GridGain/Hadoop/GigaSpaces/JPPF?

Let's shed some light, starting with GigaSpaces.

The GigaSpaces computational grid is (to my personal understanding) a consequence of having a data grid. In GigaSpaces you declare so-called processing units (PUs). Basically, a PU is a kind of bean that picks up some data from the GigaSpaces distributed cache, processes it and puts the results back into the cache. The master node then picks up the results. PUs have to be deployed on certain nodes in the grid.

So, as far as I can see, all nodes in a GigaSpaces grid are both data and computational nodes (correct me if I'm wrong), and the way a PU interacts with the data grid has some pros and cons:
Pros:
  • Out-of-the-box, load-optimized balancing. As soon as a PU finishes processing, it is ready to handle another portion of data.
  • You can start as many PUs as you need, on whichever nodes you need, at deployment time.
  • With some minor changes you can implement scheduling by starting/deploying PUs at a certain time.
Cons:
  • If I understand it right, one has to start PUs (or at least define their number) at deployment time. So if the grid becomes overloaded, new PUs won't be created automatically, and this caps scalability at a certain level. On the other hand, you can start more PUs at deployment time if you need them.
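The idea from the pros list of starting PUs at a certain time can be sketched in plain Java. This is only an illustration under assumptions: `WindowStart`, `delayUntilOpen` and `startWhenOpen` are hypothetical names, the 2pm to 5pm window comes from the database example above, and the `Runnable` is a stand-in for whatever actually deploys or starts the PU (no GigaSpaces API is used here).

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of GigaSpaces): delays a start action until a daily window opens. */
final class WindowStart {

    /** Time to wait from 'now' until the window opens; zero if 'now' is inside [opens, closes). */
    static Duration delayUntilOpen(LocalTime now, LocalTime opens, LocalTime closes) {
        if (!now.isBefore(opens) && now.isBefore(closes)) {
            return Duration.ZERO; // already inside the window
        }
        Duration untilOpen = Duration.between(now, opens);
        // A negative value means today's window has passed, so wait for tomorrow's.
        return untilOpen.isNegative() ? untilOpen.plusDays(1) : untilOpen;
    }

    /** Schedules 'deploy' (a stand-in for starting a PU) to run once the window is open. */
    static ScheduledFuture<?> startWhenOpen(ScheduledExecutorService timer, LocalTime now,
                                            LocalTime opens, LocalTime closes, Runnable deploy) {
        long delayMs = delayUntilOpen(now, opens, closes).toMillis();
        return timer.schedule(deploy, delayMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        LocalTime opens = LocalTime.of(14, 0);
        LocalTime closes = LocalTime.of(17, 0);

        System.out.println(delayUntilOpen(LocalTime.of(10, 0), opens, closes)); // before the window
        System.out.println(delayUntilOpen(LocalTime.of(18, 0), opens, closes)); // after the window

        // A fixed "now" inside the window keeps the demo from waiting for hours.
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        startWhenOpen(timer, LocalTime.of(15, 0), opens, closes,
                () -> System.out.println("window open, starting workers")).get();
        timer.shutdown();
    }
}
```

In a real deployment one would pass `LocalTime.now()` and would also need something that stops the workers when the window closes; that part is left out of the sketch.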
This kind of architecture gives some extra features:
  • Connected jobs can be implemented with several PUs: each PU does its own duty and puts its result into the cache, and that result is the source data for another PU.
  • One PU can return several results (IMHO this needs some custom code).
  • One can implement a universal PU and pass the processing logic together with the data to be processed.
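To make the connected-jobs and universal-PU ideas above a bit more concrete, here is a minimal in-memory sketch. It does not use the GigaSpaces API: a `ConcurrentHashMap` stands in for the distributed cache, key prefixes such as `raw:` stand in for entry types, and `ChainedUnits.runUnit` is a hypothetical "universal unit" that receives its processing logic as a parameter.

```java
import java.util.ArrayList;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for processing units chained through a shared cache. */
final class ChainedUnits {

    /**
     * One pass of a "universal" unit: takes every entry whose key starts with inPrefix,
     * applies the supplied logic, and writes the result back under outPrefix.
     * Returns the number of entries processed.
     */
    static int runUnit(Map<String, String> cache, String inPrefix, String outPrefix,
                       UnaryOperator<String> logic) {
        int processed = 0;
        // Iterate over a snapshot of the keys so the writes below do not affect iteration.
        for (String key : new ArrayList<>(cache.keySet())) {
            if (!key.startsWith(inPrefix)) {
                continue;
            }
            String value = cache.remove(key); // "take" the entry from the cache
            if (value == null) {
                continue; // another unit took it first
            }
            String id = key.substring(inPrefix.length());
            cache.put(outPrefix + id, logic.apply(value));
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        Map<String, String> cache = new ConcurrentHashMap<>();
        cache.put("raw:1", " hello ");
        cache.put("raw:2", " grid ");

        // The first unit's output is the second unit's input.
        runUnit(cache, "raw:", "trimmed:", String::trim);
        runUnit(cache, "trimmed:", "result:", String::toUpperCase);

        System.out.println(new TreeMap<>(cache)); // {result:1=HELLO, result:2=GRID}
    }
}
```

Each call to `runUnit` here is a single pass; a real PU would keep polling or be notified when matching entries arrive.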
Now let's take a look at GridGain.

Unlike GigaSpaces, GridGain is a computation-centric product in which every node can execute any task. The user can control at runtime which tasks/jobs are executed on which node, so it is a little more flexible (if some nodes have failed you can still execute your computations), but you need to configure a data grid for GridGain and put the data into it independently (usually outside the processing code).

Pros:
  • Since you execute the processing code on the master node and every node can execute any task, computations are re-balanced automatically when remote nodes fail.
  • Computational nodes and data nodes are separate, so you can "connect" GridGain to an existing data-grid infrastructure.
  • In case of overload, the user can define a policy that handles it at the collision SPI level.
Cons:
  • Load balancing has to be defined in configuration and, despite the different policies available, it is not as optimal as in GigaSpaces; in conjunction with job stealing, though, it should be fine.
  • It's hard to implement job scheduling, or at least it requires a new collision resolution SPI implementation written from scratch.
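What such time-aware collision resolution would have to decide can be sketched without touching the real SPI. Everything below is hypothetical: `WindowGate`, `Job` and `activate` are illustration names, not GridGain classes, and the sketch only shows the core rule (a waiting job is activated only when the current time falls inside its window and a slot is free).

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of time-window job activation; not the GridGain collision SPI. */
final class WindowGate {

    /** A waiting job together with the daily window [opens, closes) in which it may start. */
    static final class Job {
        final String name;
        final LocalTime opens;
        final LocalTime closes;

        Job(String name, LocalTime opens, LocalTime closes) {
            this.name = name;
            this.opens = opens;
            this.closes = closes;
        }

        boolean mayRunAt(LocalTime now) {
            return !now.isBefore(opens) && now.isBefore(closes);
        }
    }

    /**
     * Removes up to freeSlots eligible jobs from the waiting list (in queue order)
     * and returns them for activation; jobs outside their window stay waiting.
     */
    static List<Job> activate(List<Job> waiting, LocalTime now, int freeSlots) {
        List<Job> started = new ArrayList<>();
        Iterator<Job> it = waiting.iterator();
        while (it.hasNext() && started.size() < freeSlots) {
            Job job = it.next();
            if (job.mayRunAt(now)) {
                it.remove();
                started.add(job);
            }
        }
        return started;
    }

    public static void main(String[] args) {
        LocalTime two = LocalTime.of(14, 0);
        LocalTime five = LocalTime.of(17, 0);
        List<Job> waiting = new ArrayList<>(Arrays.asList(
                new Job("db-report", two, five),
                new Job("cleanup", LocalTime.of(0, 0), LocalTime.of(23, 59)),
                new Job("db-export", two, five)));

        // At 1pm the database jobs are held back; at 3pm they start as slots allow.
        for (Job j : activate(waiting, LocalTime.of(13, 0), 2)) System.out.println("13:00 start " + j.name);
        for (Job j : activate(waiting, LocalTime.of(15, 0), 1)) System.out.println("15:00 start " + j.name);
        System.out.println("still waiting: " + waiting.size());
    }
}
```

A real implementation would be called by the grid whenever the set of waiting or active jobs changes, and it would also need a timer to re-check held jobs when a window opens.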
GridGain also has connected jobs, implemented through the task session, and keeps the processing logic separate from the data.

And again, there are no conclusions. It's up to you.

5 comments:

Kirill Uvaev said...

Just to clarify: you can define a PU which uses a remote space (or doesn't use one at all :). In this case you can have a GigaSpaces cluster with only computational PUs. For such PUs you can define a scaling policy, which allows the number of processors to scale if, for example, the working queue grows over a defined value (http://www.gigaspaces.com/wiki/display/XAP66/Service+Grid+Processing+Unit+Container#ServiceGridProcessingUnitContainer-SLAPolicy).

Andy Chung said...

Thanks for sharing.

To resolve the 'cons' of GridGain (i.e. the load balancing and the difficulty of implementing job scheduling), is there any roadmap to improve these? Or does GridGain intend to leave these areas to developers in future releases?

dkharlamov said...

Thank you all, guys.

As for load balancing in GridGain, it has several policies and works fine if you use both early load balancing (at the "map" phase) and late load balancing (i.e. job stealing). I'm not sure if there is a ticket in JIRA to implement time scheduling in GridGain because I don't work at GridGain anymore.

Guy Nirpaz said...

Hello Dimitry,

Allow me to clarify a little bit further the philosophy behind GigaSpaces compute grid features.

We call it SVF - Service Virtualization Framework - the objective is to enable multiple grid-computing paradigms by using familiar APIs like Spring Remoting.

By paradigms I mean Map/Reduce, Master/Worker, Remote Commands and other familiar grid execution schemes.

With GigaSpaces you can deploy a Processing Unit which contains a remote service (a la Spring Remoting) and call it from a client without the need to really 'understand' grids. Developers benefit from dependency injection and context binding on the remote end.
Furthermore, we've created a new executor services capability in 6.6 which enables 'sending' jobs to nodes on demand based on application context.
With regard to dynamic scale up and down, one can define dynamic scaling policies to enable such common scenarios.

For more information please refer to: http://www.gigaspaces.com/wiki/display/XAP66/Remoting+(SVF)
and: http://www.gigaspaces.com/wiki/display/XAP66/Executors+Component

Cheers,
Guy Nirpaz,
GigaSpaces Technologies

Anonymous said...

Gigaspaces is cool, but too expensive, even for small projects...