Thursday, October 16, 2008

Elastic Grid on EC2

Recently I had a talk with Jerome Bernard from Elastic Grid and was really impressed by their product.

It was about deploying any application and scaling it in and out by adding new instances to the same EC2 node or to a new one. I remembered the time I spent writing my own scripts to control application state on EC2: a kind of nightmare. Of course the EC2 guys give you some ways to start new EC2 nodes and install images, but they are not very convenient IMHO.

Elastic Grid provides a simple UI to control any kind of application, see the number of instances and their current state, and react to an application failure or even an EC2 instance failure. It is very intuitive, which I like ;) But the most interesting part is that they have policy-like files to automate the process. Wow! You don't even need to glance at your application. Just define the scaling behaviour for overload and underload times.
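To make the overload/underload idea concrete, here is a tiny sketch of a threshold rule. It is not Elastic Grid's policy format (I don't reproduce that here); the function name, the load thresholds and the instance limits are all my own assumptions.

```python
def desired_instances(current, cpu_load, low=0.2, high=0.8,
                      min_instances=1, max_instances=10):
    """Return how many instances should run for an average CPU load in 0..1.

    All thresholds and limits are illustrative assumptions.
    """
    if cpu_load > high:
        return min(current + 1, max_instances)  # overload: scale out
    if cpu_load < low:
        return max(current - 1, min_instances)  # underload: scale in
    return current                              # inside the band: do nothing
```

A monitoring loop would call something like this periodically and start or stop instances to match the returned number.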

Jerome presented this product at JavaZone 08 this year in Oslo and here is a link to his presentation. But believe me, the real-life example that he showed was much more impressive to me.

Elastic Grid slides

Monday, October 13, 2008

Hey what's up with Linux?

Recently I found myself in trouble with Fedora Core 9. I'm actually a Linux guy and can solve minor issues by myself. Or maybe even more complicated ones ;)

So I started the day as usual with music in my head (it makes my day) and decided to copy some stuff to the player via USB. After I unmounted it and turned it on, I found all the files corrupted. WTF, I thought, now I have to start my laptop again!! And I pressed the power button...

A few seconds later I was confused by GDM, which did not start properly. It looked like it was being respawned again and again, and all I saw was a blinking cursor. Shoot. Something was definitely wrong with my box.

I went back over everything I had done and all the packages I had installed and ... removed them ... but it did not help.

After an hour or so I came to a strange conclusion: I simply had no space left on "/" (root). I understand why all the files on my music player were cut off. I understand why GDM, unable to get any free space, could not show the menu. What I don't understand is why FC did not tell me that there was no free space, neither in the logs nor as an error/warning message on the screen.
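Since the system did not warn me, a small check of my own would have saved that hour. A minimal sketch using Python's standard `shutil.disk_usage`; the 5% threshold and the name `low_on_space` are my assumptions, not anything Fedora ships.

```python
import shutil

def low_on_space(path="/", min_free_ratio=0.05):
    """Return True when the filesystem holding `path` has less than
    `min_free_ratio` of its capacity free (the threshold is an assumption)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / usage.total < min_free_ratio

if __name__ == "__main__":
    if low_on_space("/"):
        print("WARNING: almost no free space left on /")
    else:
        print("Root filesystem has room to spare")
```

Run from cron or a login script, something like this would at least put the missing warning on the screen.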

I think Fedora Core 9 is much less reliable than previous versions. I'm not looking forward to the new version anymore.

Saturday, October 11, 2008

GridGain and GigaSpaces. Process data in time.

Today I'd like to talk about GigaSpaces and GridGain, mostly in the context of job scheduling, along with some key concepts of both products.

We all know that sometimes it's critical to execute a particular job at a certain time. Say you have a database available from 2pm to 5pm, and all tasks/jobs of a certain type need to be executed within that time range. This feature is quite essential for monsters like Globus, but what about the niche occupied by GridGain/Hadoop/GigaSpaces/JPPF?
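To pin down what I mean by such a time window, here is a minimal sketch, independent of any of these products. The job type name "db" and both function names are made up for the illustration.

```python
from datetime import datetime, time

WINDOW_START = time(14, 0)  # 2pm, from the database example
WINDOW_END = time(17, 0)    # 5pm

def in_window(now, start=WINDOW_START, end=WINDOW_END):
    """True when `now` falls inside the half-open window [start, end)."""
    return start <= now.time() < end

def runnable_jobs(jobs, now):
    """Pick the jobs that may run right now. Jobs of type "db" (a made-up
    type name) are held back outside the window; other jobs always run."""
    return [j for j in jobs if j["type"] != "db" or in_window(now)]
```

A scheduler built on this would re-evaluate `runnable_jobs` as time passes and keep the held-back jobs queued until the window opens.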

Let's shed some light, starting with GigaSpaces.

The GigaSpaces computational grid (to my personal understanding) is a consequence of having a data grid. In GigaSpaces you declare so-called processing units (PUs). Basically a PU is a kind of bean that picks up some data from the GigaSpaces distributed cache, processes it and puts the results back into the cache. The master node then picks up the results. One has to deploy PUs on certain nodes in the grid.
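The pattern itself is easy to sketch without GigaSpaces. Below, two in-memory queues stand in for the distributed cache and threads stand in for PUs; this is my own illustration of the take/process/write loop, not the GigaSpaces API, and every name in it is hypothetical.

```python
import queue
import threading

def processing_unit(space_in, space_out, process):
    """Take entries from the shared space, process them, write results back."""
    while True:
        item = space_in.get()
        if item is None:           # sentinel: no more data for this PU
            break
        space_out.put(process(item))

def run_grid(data, process, pu_count=3):
    """Feed `data` (a list) to `pu_count` PUs and collect the results."""
    space_in, space_out = queue.Queue(), queue.Queue()
    units = [threading.Thread(target=processing_unit,
                              args=(space_in, space_out, process))
             for _ in range(pu_count)]
    for unit in units:
        unit.start()
    for item in data:
        space_in.put(item)
    for _ in units:
        space_in.put(None)         # one sentinel per PU
    for unit in units:
        unit.join()
    # The "master" collects results; arrival order depends on which PU
    # finished first, so they are sorted here to get a stable answer.
    return sorted(space_out.get() for _ in data)
```

For example, `run_grid([1, 2, 3, 4], lambda x: x * x)` squares each entry on whichever PU is free at the moment.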

So as far as I can see, all nodes in a GigaSpaces grid are both data and computational ones (correct me if I'm wrong), and the way a PU interacts with the data grid has some pros and cons:
Pros:
  • Out-of-the-box optimized load balancing. As soon as a PU finishes processing, it is ready to handle another portion of data.
  • You can start as many PUs as you need, on the nodes you need, at deployment time.
  • With some minor changes you can implement scheduling by starting/deploying PUs at a certain time.
Cons:
  • If I understand it right, one has to start PUs (or at least define their number) at deployment time. So if the grid becomes overloaded, new PUs won't be created automatically, and this caps scalability at a certain level. On the other hand, you can start more PUs at deployment time if you need to.
This kind of architecture gives some extra features:
  • Connected jobs can be implemented with different PUs: every PU does its own duty and puts its result into the cache, and this result is the source data for another PU.
  • One PU can return several results (this IMHO needs some custom code to be written).
  • One can implement a universal PU and pass the processing logic together with the data to be processed.
Now let's take a look at GridGain.

Unlike GigaSpaces, GridGain is a computation-centric product in which every node can execute any task. The user can control at runtime which tasks/jobs to execute on which node, so it is a little more flexible (if some nodes fail you can still execute your computations), but you need to configure the data grid for GridGain and load the data independently (usually outside the processing code).

Pros:
  • Since you execute the processing code on the master node and every node can execute any task, computations are re-balanced automatically when remote nodes fail.
  • Computational nodes and data nodes are separate, and you can "connect" GridGain to an existing data-grid infrastructure.
  • In case of overload the user can define a policy that handles it at the collision SPI level.
Cons:
  • Load balancing has to be defined in configuration and, despite the different policies available, it is not as optimal as in GigaSpaces; in conjunction with job stealing, though, it should be fine.
  • It's hard to implement job scheduling, or at least it requires a new collision resolution SPI implemented from scratch.
GridGain also has connected jobs, implemented through the task session, and keeps the processing logic separate from the data.
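The "any node can execute any job" property is what makes failover simple. Here is a minimal sketch of the idea, with plain callables standing in for nodes and jobs; it is not GridGain's API, and `NodeDown`, `execute_task`, `healthy` and `broken` are names I made up.

```python
class NodeDown(Exception):
    """Raised by a (simulated) node that cannot run a job."""

def execute_task(jobs, nodes):
    """Run every job on some node, failing over to the next node on error.

    `jobs` is a list of zero-argument callables; `nodes` is a list of
    callables that take a job and return its result (both hypothetical).
    """
    results = []
    for i, job in enumerate(jobs):
        for attempt in range(len(nodes)):
            node = nodes[(i + attempt) % len(nodes)]  # round-robin start
            try:
                results.append(node(job))
                break
            except NodeDown:
                continue                              # try the next node
        else:
            raise RuntimeError("no node could execute job %d" % i)
    return results

def healthy(job):
    """A node that simply runs the job."""
    return job()

def broken(job):
    """A node that is always down."""
    raise NodeDown()
```

With `nodes=[broken, healthy]` every job still completes, because the jobs that land on the broken node are retried on the healthy one.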

And again there are no conclusions. Up to you.

Friday, October 10, 2008

You want to fire some developers? I know whom

Strange slogan, isn't it? But surprisingly, it's not.

A friend of mine is looking for investment. His company has a pretty cool product that clearly shows developers' contributions and skills in a certain problem domain. It is like PMD or XRadar with some extra unique features.

The idea of the product is to give a comprehensive view of the team and of every developer in it. If a developer's contribution is less than expected, then it becomes a subject for a discussion about salary, extra payments and so on.

What he is going to do is build a server around the product plus some plugins for IDEs and continuous integration tools like Bamboo or Cruise Control, and make this server scalable using, for example, GridGain or any other computational grid product. One could then easily set this product up within the corporate network and keep track of how effective the teams are.

The IDE plugin shows a developer his, say, effectiveness, such as the average speed next to his own, and thus should motivate him to work better.

Adding database support will give a manager a history and a good tool to see a particular developer's progress, or give a development director a simple yet powerful way to track all projects.

This should definitely lead to the company's success, as it gives a unified, manageable view of what happens inside.

Tuesday, October 7, 2008

Hadoop and GridGain. The major difference.

I'm not going to compare the products themselves; moreover, I'm not going to argue about which is better. Let the guys from Hadoop and GridGain spend (waste) their time on that.

What I'm definitely gonna do is find the major, principal differences between those two products to give you a chance to choose the appropriate one.

1) Unlike GridGain, Hadoop is primarily a data-processing-oriented product. According to their Map/Reduce description (taken from this page http://hadoop.apache.org/core/docs/current/mapred_tutorial.html) they split the data set into small pieces (usually files on HDFS). Thus, their Map/Reduce approach is data oriented.
On the other side is GridGain. This framework is intended to split computational tasks. So you need to find a way to split your complicated calculation (task) into simple pieces called jobs and execute them on the grid.
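To show what "splitting the data" looks like in practice, here is the classic word count written as plain map and reduce functions. It is a single-process sketch of the model only; it uses neither the Hadoop nor the GridGain API, and the function names are mine.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one piece of the data."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

def word_count(chunks):
    """Count words across all chunks of the data set."""
    # Each chunk could be mapped on a different node; here it is sequential.
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(pairs)
```

In the data-oriented style the chunks are the unit of distribution; in the task-oriented style you would instead decide yourself how to cut the calculation into jobs.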

2) Both of them can process large data sets, but Hadoop has the underlying HDFS (a distributed file system), which works with huge files and is able to hold tens of millions of files. GridGain relies on distributed caches instead and, unlike Hadoop, has pluggable SPIs to work with different cache implementations.

3) Hadoop has a Map/Reduce implementation that is pretty close to Google's; GridGain's, to my understanding, is more flexible but differs from Google's in some ways.

4) Hadoop's jobs/tasks are executed as external Java processes with their own configuration. GridGain's jobs/tasks are executed in the grid node's space (within the node VM).

5) GridGain supports Windows/Linux/Mac OS. Hadoop has been tested on Linux, as they say on their site.

That's all I wanted to say in this post. I won't draw any conclusions. They are all up to you.
In my next post I will compare GridGain with GigaSpaces (I'll try to learn the latest GigaSpaces features to be more objective).

Monday, October 6, 2008

Reducing expenses

Just a thought: our software development business is about reducing expenses. Whatever we develop, we do it for users who'd like to make more money from their business.

One could say that some software (e.g. an operating system) is not meant to reduce anything, but let me object. Even if you develop an operating system, you do it to make a device usable, be it a mobile phone or a laptop.

This laptop is there to make something easier, faster and thus cheaper (of course there are a lot of end users who use those devices for entertainment, but those who sold them the devices use laptops for business as well).

Thursday, October 2, 2008

Looking for the senior position in Russia/Europe

Hi,
Shoot happens sometimes. And the strangest and most dangerous thing in this life is a misunderstanding.
Thus I'm looking for a senior position in Russia/Europe. I have deep knowledge of Java and am familiar with a wide range of technologies.

But anyway I would prefer remote work. So if you have any offers - welcome.

Wednesday, October 1, 2008

Grid your code (cont.)

Thinking about grid use cases, I found another one which can be even more useful than all those previously described.

What do you usually use to balance hundreds and thousands of user requests? Apache? IIS? What about the application server that handles all those HTTP requests, Tomcat or something like it?

Right, LAMP or things like it work just fine but require different software to be installed and configured. And since all those requests usually go to the database and take quite a long time (seconds+), this becomes the subject of our discussion.

What can a grid give you in this case (a lot of HTTP requests)? Let's see.

Say you have Tomcat with a grid integrated into it in such a way that you can make a call to the grid from your servlet or whatever you have running under the application server. And now let's think of the rest of the grid nodes as a "cloud", just an abstraction that gives us some computing power. So when your servlet gets an HTTP request, it does a bit of request transformation and then posts the heavy tasks/jobs to the grid.

Now it's the grid's duty to execute the job(s) and find CPU/IO on another node, while your Tomcat box stays almost free and ready to take new requests and post new jobs.
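The shape of that hand-off can be sketched in a few lines. Here a local thread pool stands in for the grid and a plain function stands in for the servlet; nothing below is GridGain or Tomcat API, and `heavy_job` and `handle_request` are names I invented for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the grid: a pool of workers that are not the request thread.
grid = ThreadPoolExecutor(max_workers=4)

def heavy_job(user_id):
    """Placeholder for the slow part, e.g. a database-heavy computation."""
    return "report for user %d" % user_id

def handle_request(params):
    """What the servlet would do: light transformation, then hand off."""
    user_id = int(params["user"])           # cheap request transformation
    return grid.submit(heavy_job, user_id)  # returns a Future immediately

if __name__ == "__main__":
    future = handle_request({"user": "42"})
    print(future.result())  # blocks only when the result is actually needed
```

The request handler returns as soon as the job is submitted, which is exactly what keeps the front box free for the next request.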

GridGain, as an enterprise-level grid, has tight integration with different application servers and also has a built-in load balancer to spread jobs automatically among grid nodes.

Thus the only thing you need is to install GridGain under Tomcat with a little configuration and change your servlet so that it posts GridGain tasks/jobs to the grid.

Simple yet powerful.