Sunday, August 24, 2008

Grid your code

Most people "know", or at least "feel", how to parallelize their code, but I still see quite a lot of questions about it. In general they look like "How can I execute my code in parallel?" or "What should I do with my code?".

So I think it would be a good idea to give some hints to make it simpler for everyone.

First of all, I'd like to note that I'm not going to give you comprehensive or deep knowledge of code parallelization. If you need that, you'd better read some books and articles or study it at university :) What I'd like to give you in this post is just a few general approaches that you can apply to your code to make it more robust and efficient.

Two simple and basic ways to execute your code in parallel are:
  1. Postponed data
  2. Loops

1. Postponed data.

Let's take a look at some abstract code:

SomeData data = calculateSomeData();
doAnythingElse(doItLong);
handleSomeData(data);

The variable data is calculated at the very beginning but used only later, after some other calculations, so it can be calculated in parallel with the doAnythingElse() method. The simplest way to do this is to use an executor service (if we are talking about Java) or whatever equivalent your programming language provides.
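Here is a minimal sketch of that idea in Java. The three methods are hypothetical stand-ins for calculateSomeData(), doAnythingElse() and handleSomeData() from the abstract code above (the real ones would do real work); only the executor usage is the point.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PostponedDataExample {
    // Hypothetical stand-in for calculateSomeData(): a CPU-bound calculation.
    static long calculateSomeData() {
        long sum = 0;
        for (int i = 1; i <= 1_000_000; i++) {
            sum += i;
        }
        return sum;
    }

    // Hypothetical stand-in for doAnythingElse(): pretends to wait on a slow device.
    static void doAnythingElse() throws InterruptedException {
        Thread.sleep(100);
    }

    // Hypothetical stand-in for handleSomeData().
    static String handleSomeData(long data) {
        return "sum=" + data;
    }

    static String run() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Start the calculation in the background right away.
            Callable<Long> calculation = () -> calculateSomeData();
            Future<Long> data = executor.submit(calculation);

            // Meanwhile the calling thread does the other work.
            doAnythingElse();

            // Block only at the point where the data is really needed.
            return handleSomeData(data.get());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Calculation failed", e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The call to data.get() is the only place where the main thread may have to wait, and only if the calculation is still running at that moment.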

Of course, you can argue that the first two lines could simply be swapped, like this:

doAnythingElse(doItLong);
SomeData data = calculateSomeData();
handleSomeData(data);

And of course you will be right. But what if the doAnythingElse() method waits for something (some external data), or uses the network or any other "slow" device? In that case the CPU load will be really low while it waits, and you will waste that CPU time. That's why it's much better to execute the code in parallel.

2. Loops.

Everyone has used "for" or "while" loops, and hardly anyone ever thinks about parallelizing them. But executing these loops in parallel is a really good way to speed up your code. Keep in mind that it only makes sense if the loop body takes long enough (not just 5 microseconds ;)).

A typical for-loop is

for (condition)
for-loop-body

Usually we know the number of iterations to be executed. So we can execute the for-loop-body in parallel that known number of times and then "merge" the results of the executions into the final loop result.
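As a rough sketch of this "execute and merge" idea in Java: each iteration becomes a Callable, the executor runs them in parallel, and the results are summed at the end. The loopBody() method is a hypothetical placeholder; a real loop body would have to be much more expensive for this to pay off.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelLoopExample {
    // Hypothetical loop body: in real code this would be a long computation.
    static long loopBody(int i) {
        return (long) i * i;
    }

    static long sumOfSquares(int iterations) {
        ExecutorService executor = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());
        try {
            // One independent unit of work per loop iteration.
            List<Callable<Long>> bodies = new ArrayList<>();
            for (int i = 0; i < iterations; i++) {
                final int index = i;
                bodies.add(() -> loopBody(index));
            }

            // invokeAll() runs all bodies and returns once every one is done.
            long total = 0;
            for (Future<Long> result : executor.invokeAll(bodies)) {
                total += result.get(); // "Merge" step: combine partial results.
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Loop body failed", e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(10)); // prints 285
    }
}
```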

There are some issues you may run into, and one of them is that loop bodies may depend on each other. This happens pretty often. If the next iteration requires the results of the previous one at its very beginning, the loop cannot be executed in parallel. But if the next iteration needs the previous results only later (say, in the middle of its execution), then these are "connected jobs" and they can still be executed, at least partly, in parallel.
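One possible way to picture such "connected jobs" in Java is with CompletableFuture. This is only a sketch under my own assumptions: every iteration is split into a hypothetical independentPart() that needs nothing from its predecessor and a combining step that needs the previous result. The independent parts all start at once; only the combining steps wait for each other.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ConnectedJobsExample {
    // Hypothetical first half of an iteration: needs nothing from the previous one.
    static int independentPart(int i) {
        return i * 10;
    }

    // Hypothetical second half: needs the result of the previous iteration.
    static int dependentPart(int own, int previous) {
        return own + previous;
    }

    static int run(int iterations) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            // Result "before the first iteration".
            CompletableFuture<Integer> previous = CompletableFuture.completedFuture(0);

            for (int i = 1; i <= iterations; i++) {
                final int index = i;

                // The independent half is submitted immediately for every iteration.
                CompletableFuture<Integer> firstHalf =
                    CompletableFuture.supplyAsync(() -> independentPart(index), executor);

                // The dependent half runs once both its own first half
                // and the previous iteration's result are available.
                previous = firstHalf.thenCombine(
                    previous, ConnectedJobsExample::dependentPart);
            }

            // Result of the last iteration.
            return previous.join();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(4)); // prints 100
    }
}
```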

To sum it up:
Execute your loops in parallel and calculate data in parallel wherever possible, but always keep in mind that parallelization has some overhead, so do not parallelize "short" calculations.

Saturday, August 16, 2008

Progress bar with GridGain

When running a task that takes a long time, one may be interested in getting feedback from its jobs about their state or the progress they have made so far. This can be shown in a GUI as a progress bar, which all of us are used to seeing in all kinds of applications.

Some time ago, in a discussion of the differences between GridGain and Hadoop, some features were mentioned that GridGain supposedly did not support. One of them was counters (a.k.a. events for the progress bar), and now I've decided to refute that. Even the version we discussed (GridGain 2.0.3) already had two different ways of calculating progress. Let's take a look at one of them, the simplest and as flexible as it can be.

Basically, GridGain has an event system inside, and one can track/collect events from time to time and, for example, receive a JOB_FINISHED event to move the progress bar in the GUI. But sometimes the jobs themselves take a long time, and it would be much more convenient to track checkpoints inside every job as well, so that the progress bar moves smoothly.

In this case the distributed task session and its attributes work just perfectly. The main idea behind the solution is to set up a so-called attribute listener on the master (or task-originating) node and listen for attributes from all of the task's jobs. The jobs, in turn, post/set attributes as soon as they reach a new state. These events drive the progress bar; the more often jobs set/change these attributes, the more smoothly it moves.

Let's take a look at some simple code for the master node and for the job to see the approach.

Here is a simple task that creates 10 jobs.

GridProgressTask Code

public class GridProgressTask extends
    GridTaskSplitAdapter<String, Void> {
    @Override
    public Collection<? extends GridJob> split(int gridSize,
        String arg) throws GridException {
        // Always create 10 jobs; the task argument is not used here.
        List<GridJob> jobs = new ArrayList<GridJob>(10);

        for (int i = 0; i < 10; i++) {
            jobs.add(new GridProgressJob());
        }

        return jobs;
    }

    public Void reduce(List<GridJobResult> results)
        throws GridException {
        // Nothing to aggregate in this example.
        return null;
    }
}

Now the job code, which for simplicity's sake sets just 2 attributes.

GridProgressJob Code

public final class GridProgressJob extends GridJobAdapter<String> {
    // The task session is injected by GridGain.
    @GridTaskSessionResource
    private GridTaskSession ses = null;

    public Serializable execute() throws GridException {
        // First checkpoint: publish an attribute to the task session.
        ses.setAttribute("state", "gotIt");
        ...
        // Second checkpoint.
        ses.setAttribute("anotherState", "gotIt");
        ...
        return null;
    }
}

OK. With 10 jobs setting 2 attributes each, we expect 20 attributes/values to be set, so our progress bar should move forward 5 points (100 / 20) every time we receive a new attribute value.
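The arithmetic itself can be kept in a small helper. The ProgressTracker class below is a hypothetical sketch of my own, not part of GridGain: it only counts updates and converts the count into a percentage. It uses an atomic counter on the assumption that attribute notifications may arrive on different threads.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ProgressTracker {
    private final int expectedUpdates;
    private final AtomicInteger received = new AtomicInteger();

    ProgressTracker(int expectedUpdates) {
        if (expectedUpdates <= 0) {
            throw new IllegalArgumentException("expectedUpdates must be positive");
        }
        this.expectedUpdates = expectedUpdates;
    }

    // Records one attribute update and returns the progress in percent (0..100).
    int onUpdate() {
        // Cap the count so unexpected extra updates never exceed 100%.
        int count = Math.min(received.incrementAndGet(), expectedUpdates);
        return count * 100 / expectedUpdates;
    }

    public static void main(String[] args) {
        // 10 jobs x 2 attributes = 20 expected updates.
        ProgressTracker tracker = new ProgressTracker(20);
        System.out.println(tracker.onUpdate()); // prints 5
        System.out.println(tracker.onUpdate()); // prints 10
    }
}
```

An attribute listener could call onUpdate() once per received attribute and pass the returned value to the progress bar.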

Here is the master node code where we start our task.

Master node Code

GridFactory.start();

try {
    Grid grid = GridFactory.getGrid();

    GridTaskFuture<Void> future = grid.execute(
        GridProgressTask.class, "Argument");

    // Listen on the master node for attributes set by the jobs.
    future.getTaskSession().addAttributeListener(
        new GridTaskSessionAttributeListener() {
            public void onAttributeSet(Serializable key,
                Serializable val) {
                // Move our progress bar here.
            }
        }, true
    );

    // Wait for the task to complete.
    future.get();
}
finally {
    GridFactory.stop(true);
}

That's it. A few additional lines of code give you a simple way of tracking job state. Of course, you can make it more complicated and treat different attributes or values as different states if you need to. But again, it is simple and elegant.