Antbook

—————-

by R. Zuczek

1. Introduction

2. The n-cube

3. Point and ant variables

4. Input/output

5. Assigning data to points

6. Embedded spaces

7. Subroutines and other constructs

8. Asynchronous rules

9. Implementation

10. Creation theory

11. Epilogue

- Introduction

——————

This is an attempt to revive an old and little-known "ant method" of programming highly parallel algorithms. I invented this method 20 years ago while working on my dissertation on parallel computations at Warsaw University. Later I made some improvements. You can find references to my old papers at the end of this text.

Ant programming shows its power in describing parallel algorithms in which a very large number of processes (say 65,536) are active simultaneously. Describing such computations would be impossible if the problem being solved did not exhibit some symmetry of data and operations. The space called the n-cube, hypercube, or simply cube, as well as other spaces, determines this symmetry.

The ant method makes use of the power of traditional procedural languages, particularly of the C language. As a companion file to this tutorial you can find the kernel program kern.c, enabling you to run ant computations on an IBM PC under Microsoft DOS. You can also find programs and subroutines serving as examples of fundamental parallel algorithms. I tested the kernel with the Borland C compiler v.2.0 configured as described in the subpage named Readme. I believe the kernel would work with other C compilers and operating systems after some modifications.

Of course, the described implementation for the PC will not speed up computations, i.e. you certainly do not obtain a supercomputer. You may even get results more slowly, due to the extra operations inside the kernel. The kernel has other inconveniences and limitations as well; however, it is intended primarily as a tool for experimenting with parallel algorithms. On this occasion you will see what must be done for massively parallel machines to be competitive with traditional computers.

An informal picture of an ant computation can be given as follows. Ants, fictitious beings of point size, move in a space called a cube. They move from one point to another, expand, exit, and perform computations on data associated with points and on data that they carry. You can think of the programs governing the behaviour of ants as being carried by the ants themselves. Due to the symmetry of data and algorithms, all the ants carry the same ant program.

The important feature of the ant method is that ant programs describe synchronous computations. Every ant program is interpreted by means of a synchronous computation, and every ant program must be designed to produce a correctly defined synchronous computation. The work of ants under synchronous interpretation resembles mechanical clockwork. There is no nondeterminism in the motion of ants; all moves are strictly programmed. On the other hand, when an ant program is performed on a computer and some asynchronous rule is applied, the computational process may look less coordinated. Nevertheless, it produces the same, correct results.

- The n-cube

——————-

Generally, a space is a set of points together with some structure imposed on it. The n-cube is a set of N points, where N is the n-th power of 2. For instance, the 3-cube has 8 points, the 4-cube has 16 points, etc. Let us label the points of the n-cube with the integers 0,1,…,N-1. A vector r can be added to a given point p to express a move to a neighbouring point. Vectors are represented by the numbers 0,1,…,N-1, and addition is the C-language operation ^. In other words, if points and vectors are written in binary form, we add bits modulo 2. For example 3= 0011(bin), 10= 1010(bin) and 3^10= 1001(bin)= 9.

The most simple, elementary vectors in the n-cube are the so-called generators, which are the integers {1,2,4,8,…}, i.e. powers of 2. When expressed in binary, generators have the form "all zeroes and a single one". The generators are called generators because they generate all the elements of the n-cube when added to the starting point 0 in all possible ways.

The n-cube can be drawn as a graph where vertices represent points and edges represent generators. For example, the 3-cube can be drawn as follows:

      6 --------- 7            generators:
     /|          /|
    4 --------- 5 |               4  2
    | |         | |               | /
    | 2 --------|-3               |/
    |/          |/                o ------ 1
    0 --------- 1

The cube can be considered as a subspace of the all-containing, "infinite-dimensional" oo-cube. Thus, a cube can be defined by a subset of the set of all possible generators {1,2,4,…}. We call such a subset a base of the given cube. Note that a base can be expressed as a single integer. For instance, the number

56 == 111000 (bin)

determines the base {32,16,8}. We say that a cube is spread over the generators from a given base. A cube doesn't necessarily contain zero. For example, consider the cube containing zero and spread over 32,16,8. Let us add a shift vector d to its 8 points. The resulting set {p ^ d, where p ranges over the 8 points of the cube} is also called a cube.

We distinguish three operations by which ants can move in the n-cube: transfer, expansion and quit. Transfer, denoted t(r), moves an ant from a given point p along the vector r to the point p^r. Expansion, denoted tx(r), splits an ant in a given point p so that two ants exist after the operation: an old one that stays in p and a new one that appears in p^r. The operation of quit, denoted o(), removes an ant from the space.

The function

w(r) = number of binary 1’s in r

is called the weight of r. This measure says how many elementary transfers t(g), g from {1,2,4,…}, are contained in t(r). It is commonly supposed that transfers will be the most time-consuming operations in any future implementation. Thus, it is of utmost importance to develop programs which are optimized with respect to the sum of w(r) over all r used in the computation. The functions contained in bope.c mostly support writing programs using t(g), tx(g), where g are generators.

Now we explain how the moving operations are used in ant programs. The basic programming construct is a sequence of operations. Assume a single ant placed in point 0 of the 3-cube and consider the following sequence:

tx(1); /* executed by 1 ant */

tx(2); /* executed by 2 ants */

tx(4); /* executed by 4 ants */

There is a special rule of interpreting ant programs which says that interpretation is by synchronous computation. As the comments suggest, tx(1) is executed by the initial ant. The two ants that exist after this operation execute the next command tx(2) synchronously, thus creating 4 ants. Finally these 4 ants synchronously execute tx(4), and as a result 8 ants occupy the 8 points of the 3-cube.

Let us look at what happens when the 8 ants continue with the following sequence:

t(1); /* transfer along gen 1 */

t(2); /* transfer along gen 2 */

t(4); /* transfer along gen 4 */

By executing t(1), the four ants placed in 0,2,4,6 transfer to points 1,3,5,7; at the same time, synchronously and instantly, the four ants in 1,3,5,7 transfer to 0,2,4,6. This resembles mirroring relative to a vertical plane (see the picture of the 3-cube). You can observe a similar effect during execution of the next two transfers. Finally, the ant that was originally in 0 ends up in 7, the ant originally in 1 ends up in 6, etc.

The two main principles of ant programming are:

– a program is to be interpreted synchronously with respect to the operations t(), tx(), and o()

– a program is to be written so that no collisions between ants can occur during synchronous program execution

Presented below is a complete ant program consisting of the operations discussed. The main program assumes a single ant initially placed in point zero.

#include <kern.h>

void main()

{

tx(1); tx(2); tx(4); /* expansion */

t(1); t(2); t(4); /* permutation */

o(); /* global quit */

}

The program is finished by the operation o(), which causes the exit of all eight ants from the space. This is a legal ending of a program for our kern.c implementation (where the so-called global asynchronous rule is used), but in general you must insert into a program a sequence which retracts all the ants into a single ant in point zero before the final quit.

- Point and ant variables

—————————–

In order for ants to do some useful work, variables need to be introduced into programs. It is natural to distinguish two kinds of variables: those representing data in points and those representing data in ants. We apply a simple convention in which point data are those defined as globals, whereas ant data are those defined as locals in programs. In order to better distinguish the two kinds of variables, we recommend that point data be written with lowercase letters and ant data with capital letters.

Examples:

a= A; /* ant makes copy of ant data A to point variable a */

A= a+A; /* ant adds point and ant data and stores the result in A */

…

Every expression describes a local operation performed by one ant in the point where this ant is currently placed.

The program that follows is a modification of the previous example. The expansion sequence is rewritten as a for loop. The integer variable G that controls the loop represents a generator; the values of G are successively 1,2,4. The variable G necessarily had to be defined as an ant variable. If we used a point variable, say g, for this purpose, each new ant produced by the operation tx(g) would find the value of g undefined in the "expanded to" point p^g.

#include <kern.h>

int a,g; /* point variables */

void main()

{

int A,G; /* ant variables */

for (G= 1; G <= 4; tx(G), G<<= 1); /* expansion */

a= 1; /* a= 1 in all points */

g= 4;

while (g) { /* for g=4,2,1 */

A= a; /* pick up the value a */

t(g); /* transfer along g */

a= a+ A; /* add to the point value */

g= g >> 1; /* divide g by 2 */

}

o(); /* a== 8 in all points */

}

Let us see in detail how a program (such as the one above) is to be interpreted. Let us fix an arbitrary state of the interpretative computation. At the beginning of that state some arithmetic computations are performed in all points where ants are placed. (This is true at the very beginning, where there is a single initial ant.) These arithmetic computations can run completely asynchronously up to the moment when a moving statement t(), tx(), or o() is encountered in the program by some ant in some point. Every ant encountering such a statement must temporarily stop its activity at this statement and wait until all ants encounter a similar moving operation. Most likely it will be the same moving statement in the same place in the program.

When all the ants eventually reach the place before the moving statement, the moment comes when they all perform the motion synchronously and instantly. After this, everything can run asynchronously again until a new moving statement is encountered, etc.

- Input/output

——————-

Using input/output operations such as printf(), scanf(), fgets(), etc. is straightforward: an ant in a point simply performs the printing, reading, or writing. Of course, we normally cannot associate printers and screens with every point of the space. Usually we will also be restricted in the number of files we can use. Because of these restrictions, normally only the initial ant opens files, windows, etc., and subsequent ants use them in some "unpredictable" order.

The following sample program is almost self-explanatory. An initial ant expands to the 2-cube, where 4 ants open their own files and write something to them. The program illustrates the use of a special system variable

currentpoint

provided by the kernel. It is a point variable and its value is p in point p. It should be treated as read-only in your programs.

#include <stdio.h>

#include <stdlib.h>

#include <kern.h>

char fname[17]; /* buffer for the file name */

FILE *f;

void main()

{

tx(1); tx(2); /* expansion to 4 points */

itoa(currentpoint,fname,2); /* name of file derived from point*/

f= fopen(fname,"w"); /* open file */

fputs("hello world\n",f); /* write something to it */

fputs("from point ",f);

fputs(fname,f);

fclose(f); /* close the file */

o();

}

- Assigning data to points

——————————–

Parallel algorithms work with arrays of data. Every element of an array must be placed in some point of the cube. The assignment of data to points is a delicate problem because it affects the complexity and speed of programs. Some "theory" for this problem can be found in the section about program design. But sometimes theory may prove impractical or needlessly complicated. In particular, if no data transfer is needed, as in the following example, the theory "fails", because any assignment is good.

We present an example of a program that displays something on the screen by means of multiple ants. The initial ant creates a window consisting of 64 = 8 x 8 squares. Every square on the screen has its counterpart in a character variable a placed in a point of the 6-cube. We have chosen the following correspondence between the (i,j)-th element of the matrix of squares and the points of the space:

i= currentpoint & 7; /* 7 == 000111 (bin) */

j= (currentpoint & 56) >> 3; /* 56 == 111000 (bin) */

The program is as follows:

#include <kern.h>

#include <bope.h>

#include <conio.h>

int i,j; /* coordinates */

char a; /* character to display */

void main()

{

clrscr();

window(40,4,55,11);

clrscr(); /* initial ant opens window */

xpand(7); /* expansion along 1-st row */

xpand(56); /* expansion along 1-st column */

a= currentpoint+48; /* assign character */

i= currentpoint & 7; /* row index i */

j= (currentpoint & 56) >> 3; /* column index */

gotoxy(i*2+1,j+1); /* place into window */

putch(a); /* display it */

o(); /*quit */

}

The expansion is performed by means of the function xpand() defined in the kernel extension bope.c. The argument of this function is an integer representing a subbase, i.e. a subset of generators (not a vector). The effect of the operation xpand(subbase) is the same as that of the sequence of operations tx(g) for g taken from the subbase.

Suppose now we want to move array elements in space. Consider the following one-dimensional array of numbers

a0,a1,…,a7

assigned to the points of the 3-cube, the i-th element assigned to point i. If we want to shift this array cyclically one place to the right (the i-th element to the point (i+1) modulo 8), we can do it by applying the function traa(int subbase,int k) defined in bope.c. The parameter subbase defines the subcube where the data are stored. The second parameter defines the shift value (not a vector, simply a number). The function can be used as follows:

/* 8 ants in 3-cube */

A= a; /* pick up element of array */

traa(7,1); /* transfer arithmetically by 1, modulo 8 */

a= A; /* replace old a */

The function traa(subbase,k) is implemented by means of t(r), where r is a vector which depends on the point where the ant is currently placed. If the subbase has the form 00011..111(bin), the vector r is expressed as

r= currentpoint ^ ((currentpoint + k) & subbase)

On the other hand, if the array elements a0,a1,…a7 were assigned to points this way:

a0  a1  a2  a3  a4  a5  a6  a7

 0   1   3   2   6   7   5   4

the operation gtrans(7) could be used instead of traa(7,1). The function gtrans(int subbase) is implemented by means of t(g), where g is a generator. Thus, the cost of gtrans() is always 1. In fact, t(g) == gtrans(g) for a generator g, i.e. for a subbase consisting of a single generator. The algorithm used in gtrans() follows from the following picture:

      4 --------- 5
     /|          /|
    7 --------- 6 |        ---> cube
    | |         | |
    | 3 --------|-2
    |/          |/
    0 --------- 1

The presented mapping is known as the Gray code. The Gray code can be defined in the form of an initialized array, or as a function or macro:

int gray[8]= {0,1,3,2,6,7,5,4};

#define gray(i) (i ^ (i >> 1))

The main property of Gray coding is that neighboring elements of the sequence in gray[] differ in exactly one bit position. The function gray(), its inverse antigray(), and the transfer function gtrans() are defined in bope.c.

- Embedded spaces

——————-

The previous discussion leads to the abstraction of general spaces and of embedding them into a cube. The two pictures below present two ways of embedding the simplest spaces, cycles.

0 ->-- 1 ->-- 2 ->-- 3
|                    |
|                    |   traa(7,1)
|                    |
7 --<- 6 --<- 5 --<- 4


0 ->-- 1 ->-- 3 ->-- 2
|                    |
|                    |   gtrans(7)
|                    |
4 --<- 5 --<- 7 --<- 6

The numbers represent points of the 3-cube. In both cases the operation xpand(7) can be used to expand a single ant into the cycle. In order to move in the cycles we use either traa() or gtrans() functions, respectively.

By the way, the 1-cube is itself a cycle. We draw the corresponding graph simply as

0 ------ 1

and we can use any of the operations t(1), traa(1,1) and gtrans(1) to move in this space.

The direct product is the operation producing larger, higher-dimensional spaces from given smaller spaces. The following graph represents the direct product of two 4-cycles embedded into the 4-cube:

    |      |      |      |
--- 00 --- 01 --- 02 --- 03 -->
    |      |      |      |
--- 10 --- 11 --- 12 --- 13 -->   traa(3,1)
    |      |      |      |        traa(12,1)
--- 20 --- 21 --- 22 --- 23 -->
    |      |      |      |
--- 30 --- 31 --- 32 --- 33 -->
    |      |      |      |
    v      v      v      v

The symbols "ij" represent the points (4*i + j) of the 4-cube. The four "row" cycles and four "column" cycles can be used to transfer in this space. The corresponding operations are traa(3,1) and traa(12,1), respectively. Two consecutive expansions xpand(3) and xpand(12) along two "perpendicular" cycles perform an expansion to all 16 points. The same can be done by applying xpand(15).

The picture below shows the Gray embedding of the 4×4 space:

    |      |      |      |
--- 00 --- 01 --- 03 --- 02 -->
    |      |      |      |
--- 10 --- 11 --- 13 --- 12 -->   gtrans(3)
    |      |      |      |        gtrans(12)
--- 30 --- 31 --- 33 --- 32 -->
    |      |      |      |
--- 20 --- 21 --- 23 --- 22 -->
    |      |      |      |
    v      v      v      v

- Subroutines and other constructs

—————————————

This section was meant to contain notes about using "special" C-language constructs. In this respect, the section is not complete. In principle, any C constructs or system calls can be used. If any "side effect" is encountered, it should be taken as "a rule of programming". For example, if you transfer a pointer

char *a= "xxxx";

to another point by means of

A= a; t(g); a= A;

you obtain a pointer to the local string in the new point, not to the string in the previous point.

Special attention should be paid when using moving operations. For example, an "if" statement containing t() should always contain an "else" part with t() or o():

if (something) t(r);

else t(0); /* or o(), … */

Otherwise the program would behave incorrectly. The general rule is to write "single stream" programs with respect to transfer operations. In other words, a program should be written so that at any time of the interpretative computation it executes transfer operation(s) located "approximately" in the same place of the program.

It is known that subroutines represent the main abstraction method in the process of programming. We can write (parallel) algorithms in the form of collections of functions, place them in libraries, and use them as parts of complex programs. When writing functions we must bear in mind that parameters as well as local variables of functions are pushed onto the stack. Because these stack variables are treated as ant variables, they are all transferred across the space by the operations t() and tx(), if t() and/or tx() are actually used inside a function. If we do not want or need some of the local variables to be transferred, we define them as static.

The following subroutine for matrix transposition is typical in this respect. The matrix is supposed to be defined in main program by

float a;

The constants

#define ROWBASE 56

#define COLBASE 7

determine where the matrix is placed in the cube.

The function of transposition is invoked by

transpose(&a,ROWBASE,COLBASE);

It doesn't make a difference whether the arithmetic or the Gray embedding of the matrix into the cube is used. The idea of transposition is to transfer the value of a from the point where currentpoint has the form "ij" to the point where currentpoint has the form "ji". The version of the algorithm presented below looks complex because it performs the transposition as a sequence of elementary transpositions. If we write "ij" in binary as "i1 i2 i3 j1 j2 j3", then we can perform 3 elementary transpositions "i1 j1" -> "j1 i1", "i2 j2" -> "j2 i2", "i3 j3" -> "j3 i3" to transfer data from "ij" to "ji". Furthermore, the transposition of a "binary" matrix

0 ---- 2
|      |
|      |
1 ---- 3

is to be performed in two steps:

0 ---- 2          1 ---- 3            0 ---- 1
|      |   t(1)   |      | gtrans(3)  |      |
|      |  ----->  |      |  ------->  |      |
1 ---- 3          0 ---- 2            2 ---- 3

Such a partitioning causes the whole transposition to be realized by means of the transfers t(g), where g are generators. You can count the total number of operations t(g) executed and compare it with the sum of w(r) over all t(r) that would be used in the basic "ij" -> "ji" version of the program, with r being arbitrary vectors. This second version would be more efficient (if we do not consider possible losses during routing …).

#include <stdio.h>

#include <kern.h>

#include <bope.h>

#define ROWBASE 56

#define COLBASE 7

void transpose(float*,int,int);

float a;

void main()

{

xpand(ROWBASE^COLBASE);

a= currentpoint;

transpose(&a,ROWBASE,COLBASE);

printf("point %d data %f\n",currentpoint,a);

o();

}

void transpose(float *a, int rowbase, int colbase)

{

static int r,c;

float A;

r= fgoss(rowbase); /* get first generator of subbase */

c= fgoss(colbase); /* get first generator of subbase */

A= *a; /* pick up matrix element */

while (r) {

t(r); /* do elementary */

gtrans(r+c); /* transposition */

r= ngoss(r,rowbase); /* get next generator */

c= ngoss(c,colbase); /* get next generator */

}

*a= A; /* replace matrix element */

}

- Asynchronous rules

————————-

Although ant programs are interpreted by synchronous computation, they can be run asynchronously. The only requirement is that the results of the interpretative and the real computation be the same. In order to understand the relationship between these two ways of obtaining the same result we need the following definitions.

An action is defined as a sequence of operations of an ant, ended with a moving operation t(), tx() or o(). Each action is supposed to take one tick of time in a synchronous computation. Thus, a synchronous computation can be redefined as a sequence of steps, where each step consists of a set of actions performed simultaneously by all ants. Each step takes one tick of time.

A computational diagram is a graph in which each node represents an action. Two actions are connected by a dashed edge if they follow one another immediately and if they take place in the same point of the space. Two actions are connected by a full edge if they follow one another immediately and if they are actions of the same ant.

To illustrate these notions, let us consider a sample program that computes the coefficients of the polynomial (1+x) to the 5-th power in the 8-cycle.

#include <stdio.h>

#include <kern.h>

#include <bope.h>

int a,i; /* point variables */

void main()

{

int A; /* ant variable */

xpand(7); /* expand to all points */

a= 0; /* set zero in all points */

if (currentpoint== 0) a= 1; /* except the initial point */

for (i= 0; i< 5; i++) { /* do it 5 times */

A= a; /* pick up number */

gtrans(7); /* shift in the 8-cycle */

a= a+A; /* add numbers */

}

printf("coef. at %d has value %d\n", /* print result */

antigray(currentpoint),a);

o(); /* quit */

}

The main part of the computation begins after the expansion to all points. Most actions of the main part have the following form:

a= a+ A; /* not present in initial step */

A= a; /* not present in last step */

gtrans(7); /* not present in last step */

As we see, the initial and final actions look a little different from those in between.

The computational diagram graph is as follows:

            \   \   \   \   \
point= 0    o---o---o---o---o---o
             \   \   \   \   \
       1    o---o---o---o---o---o
             \   \   \   \   \
       2    o---o---o---o---o---o
             \   \   \   \   \
       3    o---o---o---o---o---o
             \   \   \   \   \
       4    o---o---o---o---o---o
             \   \   \   \   \
       5    o---o---o---o---o---o
             \   \   \   \   \
       6    o---o---o---o---o---o
             \   \   \   \   \
       7    o---o---o---o---o---o
             \   \   \   \   \
time=       0   1   2   3   4   5

The symbols 'o' represent actions. The individual columns represent steps. The actions linked by dashed edges are performed in the same point. The backslashes \ , i.e. full edges, represent transfers of ants from point to point.

A computational diagram can be viewed as a data flow diagram in which point data flow along the dashed edges —, and ant data flow along the full edges \ .

If we place the initial numbers 1,0,0,… into the nodes of the first column of the above graph and then successively perform the additions in the columns that follow, we obtain 1,1,0,0,… in the second column, 1,2,1,0,0,… in the third column, etc., and finally 1,5,10,10,5,1 in the last column. This is exactly how the program is interpreted. The results are the numbers known as Pascal's triangle numbers.

It is clear that the individual operations in the nodes of the graph need not be executed so rigidly "column after column", and some degree of independence can be allowed in this process. In general, however, a node cannot fire an action before data have been prepared on its input. However, as soon as the data are prepared, the action can be fired. This observation gives the definition of the so-called local asynchronous rule: an action in a given node of a computational diagram is allowed to start if the operations immediately preceding it have finished. Note that the operations in the leftmost column of the graph must be started externally.

More restrictive is the following global asynchronous rule: an operation in a given node of a computational diagram can be started if all the operations from the previous column of the graph have already finished. Under the global rule, an operation that is one step ahead of the other operations must wait until these operations reach its step. In fact, this way of asynchronous computation is very close to what we call synchronous computation.

Both rules, to be applicable in the original environment with ants and spaces, must now be transformed from the computational-diagram form to the ants-in-space form. The idea is to introduce counters to count the number of actions performed by ants and points. Actually, these counters will count the lengths of the paths the ants and points trace across the graph.

Let us associate a counter C with every ant. At the beginning we assume C== 0 in the initial ant. In the course of the computation, every execution of t() and tx() will automatically increase the value of C by 1. In particular, both the old and the new ant receive the same increased value of C after executing tx().

The global asynchronous rule can now be redefined as follows: an ant can perform its operations if and only if its counter C is minimal among the counters C of all ants currently active in the space. It is clear that the difference between the values of any two counters C will never exceed 1. An ant whose value of C is one more than the other ants currently have must wait until the rest of the ants reach the same value of C.

Now we redefine the local asynchronous rule. We introduce an additional counter c, this time for every point of the space. Both counters c and C serve for counting the operations t(), tx(). At the beginning we assume c== 0 in the initial point and C== 0 in the initial ant. During the computation, every execution of t() or tx() will automatically increase the values of both c and C by 1. Moreover, an ant being expanded into a new point causes the counter c in this new point to be set to the value of C carried by the ant, c= C.

We define the local asynchronous rule as follows: an ant can execute its operations if and only if the value of its counter C is the same as the value of the counter c associated with the point where the ant is placed, C== c.

By thorough analysis we can see that this definition is not exactly equivalent to the local rule defined for computational diagrams. For example, we cannot use this "C== c" rule to synchronize a computation in which a single ant is travelling in the space. You can, however, use the local rule for ant programs whose computational diagram has the form

expansion

permutation

contraction

Hence, it should be good practice to write programs leading to such structured computations in order to have a guarantee that the local asynchronous rule produces correct results. Note that subroutines are usually programmed to perform the permutational part of the above scheme.

- Implementation

———————–

We start with the following model of a parallel machine. Let us have p single-processor computers, where p is an n-th power of 2, and let the computers be numbered 0,1,…,p-1. Let us connect the computers by communication lines, for example in the same way as the points 0,1,…,p-1 are connected in the graph of the n-cube (see the graph of the 3-cube in Section 2). The communication lines will serve to transfer "ant packets".

There is a driver in each computer which is responsible for sending and receiving ant packets via the communication lines. One layer higher is a kernel whose task is to schedule ant processes. We suppose the same copies of the kernel, the driver and the ant program are loaded in the same area of memory in each computer (…). All drivers are initially waiting for incoming packets.

The main part of a packet consists of data taken from the ant variables of an ant program, hence from the program stack. The ant packet also contains the contents of the instruction pointer register IPR and the point P= currentpoint ^ r (and/or R= r, if that is more appropriate for routing), identifying the destination of the ant packet.

The following auxiliary fields in the packet serve to implement the local asynchronous rule. The field C represents the number of operations t() and tx() performed. The field X indicates whether the packet is the result of an expansion tx() or of a transfer t(): X== 1 denotes the first case, X== 0 the second. Correspondingly, a counter c is defined as part of the kernel to store the number of operations t() and tx() performed by the kernel.

Let us start a given ant program manually at computer number zero, which can be qualified as the arrival of an initial ant. As long as arithmetic operations of the program are performed, everything runs locally. When an operation tx() or t() is encountered, the kernel takes control and prepares an ant packet. It increases C (and simultaneously c), sets X to 1 in the case of tx(), sets the value of the destination point P, makes a copy of the current IPR and stack, and then passes the ant packet <P, C, X, IPR, stack> to the line driver queue. The line driver in turn sends the packet on an appropriate line.

Let us see how incoming packets are processed. The driver, after receiving a packet (the packet must contain the proper destination point, P== currentpoint), puts it into the queue. When the processor is free, the kernel examines the queue and decides which packet is to be given processor time. The choice follows from the local asynchronous rule. If there is no packet in the queue such that C== c or X== 1, the kernel must wait. If there is such a packet, the kernel loads the ant data from it into the program stack and the IPR of the processor. This causes the ant program to take control. Note that the condition X== 1 means the arrival of an expanded ant, which is sufficient to make this ant active immediately (the condition C== c is not tested).

Now suppose we have fewer processors than the number of points in the space. First consider the extreme case of one processor. In this case the kernel must allocate memory for all the ants encountered during the computation and for all the point data. Because there are no communication lines now, every ant packet created as the result of t() or tx() is simply put back into the local first-in-first-out queue.

Observe the order of ants in the queue. The ant which is the first to leave the queue has a value of counter C less than or equal to the values of C of the rest of the ants in the queue. So we can use the global asynchronous rule to synchronize computations. The counter C can even be left out of the ant packet, because everything works automatically, as can the field X, which is not part of the global asynchronous strategy.

And this is exactly how kern.c on your PC works.
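
A minimal sketch of this single-processor strategy (the names and sizes are my assumptions, not code taken from kern.c): every transfer appends a packet with counter C+1 to one FIFO queue, so the counters leave the queue in nondecreasing order and no counter test is needed at all.

```c
#include <assert.h>

/* Single-processor sketch of the global asynchronous rule.
   Every t()/tx() puts a packet with counter C+1 at the back of
   one FIFO queue; counters therefore leave the queue in
   nondecreasing order and synchronization is automatic. */

#define FMAX 64
static unsigned fifo[FMAX];
static int head = 0, tail = 0;

static void put(unsigned C) { fifo[tail++] = C; }
static unsigned get(void)   { return fifo[head++]; }
static int empty(void)      { return head == tail; }

/* Run a toy computation: two initial ants, each transferring
   `rounds` times.  Returns 1 if the counters were popped in
   nondecreasing order, 0 otherwise. */
int counters_nondecreasing(unsigned rounds)
{
    unsigned last = 0;
    head = tail = 0;
    put(0); put(0);                 /* two initial ants, counter 0 */
    while (!empty()) {
        unsigned C = get();
        if (C < last) return 0;     /* order violated */
        last = C;
        if (C < rounds) put(C + 1); /* the ant transfers once more */
    }
    return 1;
}
```
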

Now consider the intermediate case. If we have m points and want to use k processors, k < m, we have to assign some portion of the points to the first processor, another portion to the second processor, and so on. Some interconnection network is assumed, which connects the processors appropriately. Applying reasonable symmetry, both in assigning points to processors and in interconnecting processors, we can obtain simple rules for sending packets and managing them inside the processors. Note that in this general case the local asynchronous rule must be used.
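
One symmetric assignment (an assumed example of mine - kern.c itself uses a single processor) maps each point to a processor by its high-order bits. Each processor then owns a subcube: a transfer along a low generator stays local, while a transfer along a high generator crosses exactly one interconnection line.

```c
#include <assert.h>

/* Sketch: m = 2^n points distributed over k = 2^q processors by
   the high-order bits of the point number.  This assignment is
   an assumed example of the "reasonable symmetry" mentioned in
   the text, not the scheme of any existing implementation. */
unsigned proc_of(unsigned point, int n, int q)
{
    return point >> (n - q);  /* processor = top q bits of the point */
}
```
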

- Creation theory

————————

In Section 9 a computational diagram was defined as a representation of the computational process associated with a given program. In this section we consider the reverse problem. Suppose we have a parallel algorithm in the form of an acyclic directed graph, where nodes represent operations and edges represent data flow paths. Our aim is to transform this algorithmic graph into a computational diagram and find a corresponding ant program. The following discussion is rather theoretical, but nevertheless very useful, because it reveals the possibilities and limitations of the ant method. We start from the analysis of computational diagrams. We observe that computational diagrams have the property of being binary, i.e. at most 2 edges enter any node and at most 2 edges leave any node (see the computational diagram in Section 9). If a given graph of an algorithm does not have this property, we use transformations of the type

\                   \
 - o -      ->       - o - o -
/                       /

or

  /                     /
o -         ->    o - o -
  \                \

to get the required property. As we normally work with problems of size „n-th power of 2”, we will usually be dealing with transformations of regular trees, e.g. of a „depth-1 degree-4 tree” into a „depth-2 degree-2 regular binary tree”, etc.

Another important property of computational diagrams is that any two different paths starting from the initial node and ending in some other node have the same length. This property follows from the fact that the ant computation is synchronous. The procedure of stuffing redundant nodes into paths in the graph of the algorithm helps to obtain the required property. This is a rather theoretical construction, which should not be needed in the case of highly symmetrical diagrams.

The next step is partitioning the edges of the graph of the algorithm into two classes: dashed and full. We will need the following definition. A directionally alternating path in a binary acyclic graph is a sequence of adjacent edges in which the directions of the edges alternate. Let us find all such paths of maximal length in a given graph of the algorithm. Note that some of them will be cycles. It follows from the property of „being binary” that the set of all maximal directionally alternating paths is determined uniquely and that the elements of this set are mutually edge-disjoint.

Let us redraw every second edge in each directionally alternating path with a dashed line. There are two ways to do this, and we may choose either of them arbitrarily. Finally, if there is only one edge entering some node and it is drawn with a dashed line, we draw an additional full edge parallel to it. This construction ensures that every node of the graph is connected to some initial (leftmost) node by a full-line path. We observe this property in computational diagrams. Its meaning is that every operation in the graph is reachable and executable by some ant.

By partitioning the nodes into columns (similarly to what you see in the computational diagrams in Section 9) we obtain synchronous time. The procedure for determining space is as follows. First, every dashed-line edge of the graph should be contracted, so that the corresponding two nodes glue together. We obtain a homomorphic image of the computational diagram, in which nodes represent points and the full-line edges that remain after contracting represent the trajectories of ants.

This graph can be treated as a certain approximation of the space. Instead of studying which algebra would fit this graph best, we go immediately to the procedure of embedding the graph into a cube. Thus, the nodes of the graph are to be mapped to points of a cube in some way. As a consequence, the edges of the graph will automatically match some vectors. From the variety of all possible mappings we must choose the best one. This resembles the very old problem of coding the states of finite automata, where states are to be assigned n-tuples of 0’s and 1’s, ideally in such a manner that any transition from state to state changes just one bit. In case of success, the Gray embedding is obtained, i.e. the vectors become generators. In any case, some algorithm for optimizing switching functions has to be used to obtain good results.
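
In the successful case the mapping is the classical reflected Gray code. A small sketch (the helper names are mine) showing that consecutive states then differ in exactly one bit, i.e. every transition becomes a single generator:

```c
#include <assert.h>

/* The reflected Gray code: node i of the state-transition graph
   is mapped to point i ^ (i >> 1) of the cube. */
unsigned gray(unsigned i) { return i ^ (i >> 1); }

/* 1 if v is a power of two, i.e. a single generator of the cube */
int is_generator(unsigned v) { return v != 0 && (v & (v - 1)) == 0; }
```
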

The following example explains the rest of the procedure. Consider polynomial evaluation by means of the Horner algorithm. For the polynomial

p(x)== a3*x*x*x + a2*x*x + a1*x + a0 we can write

p(x)== (a3*x + a2)*x*x + (a1*x + a0), and in graphical representation:

a0,x o
      \
       o
      / \
a1,x o   \
          o
         /
a2,x o  /
      \ /
       o
      /
a3,x o

Every node, except the leftmost four, represents two operations, u + v*x and x*x. The result is put on the leaving edge, as the following picture shows.

\ a,x
 \
  \     a + A*x , x*x
   o -----------------
  /
 /
/ A,X

The graph of the algorithm is already binary and „synchronous”, so we can go on to find the directionally alternating paths. There are 3 such paths, and we redraw the original graph using two kinds of edges.

o - - - o - - - - - o
      /           /
    /           /
  /           /
o           /
          /
o - - - o
      /
    /
  /
o

time=   0       1         2

Now we contract the horizontal edges and obtain the graph of ant trajectories, or the „state transition graph” in the terminology of finite automata. We assign the binary numbers 00, 01, 10 and 11 to the nodes and obtain the following graph:

11          10         01          00
o --------> o          o --------> o
 time= 0     \          time= 0   /
 g= 01        \         g= 01    /
               \                /
                ------->-------
                   time= 1
                   g= 10

Using well-known methods of finite automata (or switching circuit) theory you can find Boolean expressions for moving the ants in the 2-cube. An initial version of a program for the evaluation of polynomials will be a simple sequence of transfers, quits and arithmetical operations. After some refinement you obtain the following program:

/* 4 ants in 4 points */

A= a;                       /* copy polynomial coefficient to ant */
X= x;                       /* copy argument to ant */

for (g= 1; g <= 2; g++) {   /* make the generator a control variable */
    if (currentpoint & g) t(g);  /* condition when to transfer */
    else o();                    /* and when to go out */
    A= a= a + A*x;          /* main operation */
    X= x= x*x;              /* new power of x */
}

/* single ant, A (== a) contains result */
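
For such a small example the kernel is not even needed: the program can be checked with a sequential C simulation of the four points. The function name and the driving loop below are my own scaffolding, but the per-ant steps follow the listing (the generator loop runs over g = 1 and g = 2; an ant at a point with the g-bit set transfers, the other ants go out, and the arriving ant executes A= a= a + A*x and X= x= x*x at its destination):

```c
#include <assert.h>

/* Sequential simulation of the 2-cube Horner program.  Points are
   0..3; a[] and px[] are the point variables a and x, A[] and X[]
   the variables of the ant currently sitting at each point. */
double horner_2cube(const double coef[4], double x0)
{
    double a[4], px[4], A[4], X[4];
    int alive[4], p, g;
    for (p = 0; p < 4; p++) {
        a[p] = coef[p]; px[p] = x0;   /* assign data to points */
        A[p] = a[p];    X[p] = px[p]; /* 4 ants copy the point data */
        alive[p] = 1;
    }
    for (g = 1; g <= 2; g++) {
        double inA[4];
        int arrived[4] = {0, 0, 0, 0};
        for (p = 0; p < 4; p++)       /* t(g) when currentpoint & g */
            if (alive[p] && (p & g)) {
                inA[p ^ g] = A[p];
                arrived[p ^ g] = 1;
            }
        for (p = 0; p < 4; p++)
            alive[p] = 0;             /* the other ants go out: o() */
        for (p = 0; p < 4; p++)
            if (arrived[p]) {         /* the arriving ant continues */
                A[p] = a[p] = a[p] + inA[p] * px[p]; /* main operation */
                X[p] = px[p] = px[p] * px[p];        /* new power of x */
                alive[p] = 1;
            }
    }
    return a[0];   /* the single remaining ant sits at point 00 */
}
```
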

Thus, we have seen that the theory of program synthesis, assigning data to points, etc., is fairly simple. At the same time, this theory is very hard to apply due to its large combinatorial complexity.

- Epilogue

————–

There are a lot of interesting things about ant programs. For example, as you may have noticed, there is a duality of ants and points. The duality is particularly visible when drawing every second edge in the directionally alternating paths with dashed or full lines. There are two choices there. This is the place where we decide who will be carrying which data – ants or points. You can obtain a dual program from a given one by exchanging the local (capital-letter) ant variables for the global (small-letter) point variables (this is just a general rule, apply it carefully).

You will probably find that the method is sometimes difficult to apply. Most often this will be due to the enormous complexity of the parallel algorithm you want to program.

However, real difficulties may occur with the ant method itself. I believe that such difficulties can be overcome by implementing so-called parallel multiprocessing (or multiprocessing parallelism). Normally an ant program governs one „cloud” of ants – a single parallel process. Ant multiprocessing means that more than one such process coexists in a cube. You will simply have two or more ant programs active simultaneously (the programs will usually be inside one source file). Particularly interesting is the case when ants from one program can communicate with ants from another program, e.g. by means of messages left in points. Thus, there will be a statement „write_msg” somewhere in the first program and a statement „read_msg” somewhere in the second program. The first program can be considered a producer, the second one a consumer.

Currently I have no implementation of the above idea.

I must also say that I spent years on unproductive meditations and trials in applying the ant and cube ideas to explain the strange behaviour of nature in the micro and macro cosmos. I still believe there is something in it. A stochastic approach is inevitable. It is particularly exciting to imagine a big glass bottle filled with particles of two kinds (points and ants, or vice versa), which collide accidentally but selectively. You don’t need any a priori space and time; everything is created automatically by the selective rules of collisions (recall the „creation theory”).

Finally, references to my old papers:

– A New Approach to Parallel Computing, Acta Informatica, Vol. 7,
  1-13, 1976 /* ant method, programming examples, synchronous
  implementation */

– The Universal Space for Parallel Computations, Information
  Processing Letters, Vol. 6, No. 2, 42-45, 1977
  /* using Gray coding for embedding spaces (originally „immersion”) */

– Implementace takzvanych mravencovych programu [Implementation of
  the so-called ant programs], SOFSEM’82 Proc., 413-417, 1982
  /* asynchronous implementation */

– Souvislost mezi dvema zpusoby popisu paralelnich algoritmu
  [The relation between two ways of describing parallel algorithms],
  MOP’86 Proc., 145-153, 1986
  /* creating programs from computational diagrams */

Orlova, July 4, 1993