Optimization of Mapping Graphs of Parallel Programs onto Graphs of Distributed Computer Systems by Recurrent Neural Network

A distributed computer system (CS) is a set of elementary computers (ECs) connected by a network that is program-controlled from these computers. Each EC includes a computing module (CM) (a processor with memory) and a system unit (message router). The message router operates under CM control and has input and output ports connected to the output and input ports of the neighboring ECs, respectively. The CS structure is described by the graph $G_s(V_s, E_s)$, where $V_s$ is the set of ECs and $E_s \subseteq V_s \times V_s$ is the set of connections between the ECs.


Introduction
A distributed computer system (CS) is a set of elementary computers (ECs) connected by a network that is program-controlled from these computers. Each EC includes a computing module (CM) (a processor with memory) and a system unit (message router). The message router operates under CM control and has input and output ports connected to the output and input ports of the neighboring ECs, respectively. The CS structure is described by the graph $G_s(V_s, E_s)$, where $V_s$ is the set of ECs and $E_s \subseteq V_s \times V_s$ is the set of connections between the ECs.
The topology of a distributed system may change while the system is operating, due to failures or repairs of communication links, as well as due to the addition or removal of ECs (Bertsekas & Tsitsiklis, 1989). CS robustness means that failures and recoveries of ECs result only in an increase or decrease of the task execution time. Control of resources and tasks in a robust distributed CS requires solving the following problems (Tarkov, 2003, 2005): optimal decomposition of the CS into connected subsystems; mapping parallel program structures onto the subsystem structures; static and dynamic balancing of the computation load among the CMs of the computer system (subsystem); static and dynamic message routing (implementation of paths for data transfer), i.e. balancing the communication load in the CS network; distribution of program and data copies for organizing fault-tolerant computations; subsystem reconfiguration and redistribution of the computation and communication load for recovery of computations after failures; and so on.
As a rule, all these problems are treated as combinatorial optimization problems (Korte & Vygen, 2006) and are solved by centralized implementation of permutations on data structures distributed over the elementary computers of the CS. The centralized approach involves gathering data in some (central) EC, solving the optimization problem in this EC, and then scattering the results to all ECs of the system (subsystem). As a result, we have a sequential (and correspondingly slow) solution method with a large overhead for gathering and scattering data. A decentralized approach has now been significantly developed for resource and task control problems in computer systems with distributed memory (Tel, 1994), and in many cases this approach allows the problem solution to be parallelized.
The massive parallelism of data processing in neural networks allows us to consider neural networks as a promising, high-performance, and reliable tool for solving complicated optimization problems (Melamed, 1994; Trafalis & Kasap, 1999; Smith, 1999; Haykin, 1999; Tarkov, 2006). The recurrent neural network (Hopfield & Tank, 1985; Wang, 1993; Siqueira, Steiner & Scheer, 2007, 2010; Serpen & Patwardhan, 2007; Serpen, 2008; da Silva, Amaral, Arruda & Flauzino, 2008; Malek, 2008) is a particularly interesting tool for solving discrete optimization problems. The model of a globally convergent recurrent Hopfield neural network is in good accordance with Dijkstra's self-stabilization paradigm (Dijkstra, 1974). This means that the mappings of parallel program graphs onto graphs of distributed computer systems carried out by Hopfield networks are self-stabilizing (Jagota, 1999). Self-stabilizing mappings are important because failures of ECs and intercomputer connections can break the regularity of the CS graph.
For distributed CSs, the graph $G_p(V_p, E_p)$ of a parallel program is usually defined as a set $V_p$ of program branches (virtual elementary computers) that interact with each other on the point-to-point principle by transferring messages via logical (virtual) channels (which may be unidirectional or bidirectional) of the set $E_p \subseteq V_p \times V_p$. For most parallel applications, interactions between the processing modules are ordered in time and regular in space (line, ring, mesh, etc.) (Fig. 1).
For this reason, the maximum efficiency of information interactions in advanced high-performance CSs is obtained by using regular graphs $G_s(V_s, E_s)$ of connections between individual computers (hypercube, torus) (Parhami, 2002; Yu, Chung & Moreira, 2006; Balaji, Gupta, Vishnu & Beckman, 2011). The hypercube structure is described by a graph known as an $m$-dimensional Boolean cube with $n = 2^m$ nodes. Toroidal structures are $m$-dimensional Euclidean meshes with closed boundaries. For $m = 2$, we obtain a two-dimensional torus (2D-torus) (Fig. 2); for $m = 3$, we obtain a 3D-torus.
In Section 2, we consider a recurrent neural network as a universal technique for solving mapping problems. It is a local optimization technique, and we propose additional modifications (for example, penalty coefficients and splitting) to improve its scalability.
In Section 3, we propose an algorithm based on the recurrent neural network and the WTA ("Winner takes all") approach for constructing Hamiltonian cycles in graphs. This algorithm maps only line- and ring-structured parallel programs. It is therefore less universal than the technique proposed in Section 2 but more powerful, because it implements a global optimization approach and hence is much more scalable than traditional recurrent neural networks.

Mapping graphs of parallel programs onto graphs of distributed computer systems by recurrent neural networks
Let us consider a matrix $v$ of neurons of size $n \times n$, in which each row corresponds to a branch of the parallel program and each column corresponds to an EC. Each row and each column of the matrix $v$ must contain exactly one nonzero entry, equal to one; all other entries must be zero. Let the distance between neighboring nodes of the CS graph be taken as the unit distance, and let $d_{ij}$ be the length of the shortest path between nodes $i$ and $j$ in the CS graph. Then we define the energy of the corresponding Hopfield neural network by the Lyapunov function (1). Here $d_{ij}$ is the distance between the nodes $i$ and $j$ of the system graph that correspond to adjacent nodes of the program graph (the "dilation" of the program graph edge on the system graph), and $Nb(x)$ is the neighborhood of the node $x$ in the program graph.
The value $v_{xi}$ is the state of the neuron in row $x$ and column $i$ of the matrix $v$, and $C$ and $D$ are parameters of the Lyapunov function. The term $L_c$ is minimal when each row and each column of $v$ contains exactly one unit entry (all other entries are zero). Such a matrix $v$ is a correct solution of the mapping problem (Fig. 3).

Fig. 3. Example of correct matrix of neuron states
The minimum of $L_d$ provides the minimum of the sum of distances between adjacent nodes of $G_p$ mapped onto nodes of the system graph $G_s$ (Fig. 4).
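The two energy terms can be illustrated with a short sketch. Equation (1) is not reproduced in this text, so the forms used here are assumptions chosen to match the verbal definitions above: $L_c$ is taken as the sum of squared deviations of every row sum and column sum of $v$ from one, and $L_d$ as the sum of $v_{xi} v_{yj} d_{ij}$ over program nodes $x$, their neighbors $y \in Nb(x)$, and all system nodes $i, j$. The function names and 0-based node indices are illustrative only.

```python
from collections import deque


def torus_neighbors(m):
    """Adjacency lists of an m x m 2D-torus; node (r, c) has index r * m + c."""
    adj = []
    for r in range(m):
        for c in range(m):
            nbrs = {((r + 1) % m) * m + c, ((r - 1) % m) * m + c,
                    r * m + (c + 1) % m, r * m + (c - 1) % m}
            nbrs.discard(r * m + c)  # no self-loops for very small m
            adj.append(sorted(nbrs))
    return adj


def distance_matrix(adj):
    """All-pairs shortest-path lengths d[i][j] by BFS (unit edge length)."""
    n = len(adj)
    d = [[0] * n for _ in range(n)]
    for s in range(n):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for t, val in dist.items():
            d[s][t] = val
    return d


def line_adjacency(n):
    """Neighborhoods Nb(x) of a "line" program graph with n nodes."""
    return [[y for y in (x - 1, x + 1) if 0 <= y < n] for x in range(n)]


def permutation_matrix(perm):
    """Matrix v with v[x][perm[x]] = 1: branch x is placed on EC perm[x]."""
    n = len(perm)
    return [[1 if perm[x] == i else 0 for i in range(n)] for x in range(n)]


def energy_terms(v, d, program_adj):
    """Return (L_c, L_d) under the assumed forms described in the lead-in."""
    n = len(v)
    l_c = sum((sum(v[x]) - 1) ** 2 for x in range(n))
    l_c += sum((sum(v[y][i] for y in range(n)) - 1) ** 2 for i in range(n))
    l_d = sum(v[x][i] * v[y][j] * d[i][j]
              for x in range(n) for y in program_adj[x]
              for i in range(n) for j in range(n))
    return l_c, l_d
```

With this convention every program edge is counted twice in $L_d$ (once from each endpoint), so a line with 9 nodes placed on a 3 x 3 torus with all 8 edges at distance 1 gives $L_d = 16$.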
The Hopfield network minimizing function (1) is described by an equation in which $u_{xi}$ is the activation of the neuron with indices $x, i$ ($x, i = 1, \dots, n$), $v_{xi}$ is the neuron state (output signal), and the activation function is governed by an activation parameter.
The choice of the parameters (the step $\Delta t$, the activation parameter, $C$, and $D$) determines the quality of the solution $v$ of Equation (4). According to (Feng & Douligeris, 2001), a necessary condition of convergence for problem (1)-(4) is given by (5). From (4) and (5) it follows that the parameters $\Delta t$ and $D$ influence the solution of Equation (4) in the same way. Therefore we set $\Delta t = 1$ and obtain Equation (6) (this value was determined experimentally). We then try to choose the value of $D$ that ensures the absence of incorrect solutions.

Mapping by the Hopfield network
Let us evaluate the mapping quality by the number of program graph edges that coincide with edges of the system graph. We call this number the mapping rank. The mapping rank is an approximate measure of the mapping quality, because mappings with different dilations of the program graph edges may have the same rank. Nevertheless, the maximum rank value, which equals the number $|E_p|$ of edges of the program graph, corresponds to an optimal mapping, i.e. to a global minimum of $L_d$ in (1). Our objective is to determine the mapping algorithm parameters that provide the maximum probability of obtaining the optimal mapping. [...] where $l$ is the cyclic subgroup order, are carried out.
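The mapping rank defined above is simple to compute once a mapping is given. The following is a minimal sketch; the function name, the representation of the mapping as a list `perm` (branch `x` is placed on EC `perm[x]`), and the adjacency-dictionary format are assumptions made for the illustration.

```python
def mapping_rank(perm, program_edges, system_adj):
    """Count program edges (x, y) whose images perm[x], perm[y]
    are joined by an edge of the system graph."""
    return sum(1 for x, y in program_edges if perm[y] in system_adj[perm[x]])


# Illustrative data: a "line" program graph with 4 branches
# mapped onto a ring of 4 ECs.
line_edges = [(0, 1), (1, 2), (2, 3)]
ring_adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {2, 0}}

optimal_rank = mapping_rank([0, 1, 2, 3], line_edges, ring_adj)
poor_rank = mapping_rank([0, 2, 1, 3], line_edges, ring_adj)
```

A mapping is optimal in the sense used here when the rank equals the number of program edges, `len(line_edges)` in this example.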
With $D = 8$, correct solutions are obtained for $n = 9$ and $n = 16$, but, as follows from Fig. 5a and Fig. 5b, the number of solutions with optimal mapping, i.e. with the maximal mapping rank, is small.
To increase the frequency of optimal solutions of Equation (6), we replace the distance values $d_{ij}$ by the values $c_{ij}$ given by formula (7), where $p$ is a penalty coefficient applied when the distance $d_{ij}$ exceeds 1, i.e. when an edge of the program graph does not coincide with an edge of the system graph. This gives a modified equation. For the above mappings with $p = n$, we obtain the histograms shown in Fig. 6a and Fig. 6b.
These histograms indicate an improvement in the mapping quality, but for $n = 16$ the suboptimal solutions with rank 13 still have the maximal frequency.
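The effect of the penalty coefficient can be sketched as follows. Formula (7) is not reproduced in this text, so the exact form is an assumption: distances not exceeding 1 are kept, and every distance greater than 1 is multiplied by $p$. The function name is illustrative.

```python
def penalized_distances(d, p):
    """Assumed form of the penalized values c_ij:
    c_ij = d_ij if d_ij <= 1, and c_ij = p * d_ij otherwise."""
    return [[dij if dij <= 1 else p * dij for dij in row] for row in d]


# Distances in a path graph with 3 nodes, penalized with p = n = 3.
d = [[0, 1, 2],
     [1, 0, 1],
     [2, 1, 0]]
c = penalized_distances(d, p=3)
```

Under this assumption, placing two adjacent program nodes on non-adjacent ECs costs at least $2p$ instead of 2, which makes such placements much less attractive to the network.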

Splitting method
To decrease the number of local extrema of function (1), we partition the set $\{1, 2, \dots, n\}$ of subscripts $x$ and $i$ of the variables $v_{xi}$ into $K$ sets [...]. With this approach, which we call splitting, mapping a line with $n = 16$ nodes onto a 2D-torus with $K = 2$ gives the histogram presented in Fig. 8a.
From Fig. 6b and Fig. 8a we see that the splitting method substantially increases the frequency of optimal mappings. Increasing the parameter $D$ to $D = 32$ further increases the frequency of optimal mappings (Fig. 8b).
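One way to picture the splitting is as a block-diagonal restriction of the neuron matrix. The sketch below is an assumption, since the explicit index sets are not recoverable from this text: it partitions the indices into $K$ blocks of consecutive subscripts of equal size and allows $v_{xi}$ to be nonzero only when $x$ and $i$ fall in the same block. The function names and 0-based indices are illustrative.

```python
def split_indices(n, k):
    """Partition 0..n-1 into k blocks of consecutive indices (assumes k divides n)."""
    if n % k != 0:
        raise ValueError("this sketch assumes that k divides n")
    q = n // k
    return [list(range(b * q, (b + 1) * q)) for b in range(k)]


def block_diagonal_mask(n, k):
    """mask[x][i] = 1 where neuron v[x][i] is allowed to be active."""
    mask = [[0] * n for _ in range(n)]
    for block in split_indices(n, k):
        for x in block:
            for i in block:
                mask[x][i] = 1
    return mask
```

For $n = 16$ and $K = 2$ the mask keeps 128 of the 256 neurons, so each of the two subproblems works with an 8 x 8 matrix instead of one 16 x 16 matrix.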

Mapping by the Wang network
In a recurrent Wang neural network (Wang, 1993; Hung & Wang, 2003), [...] is multiplied by a value that depends on an additional parameter. For the Wang network, Equation (9) is reduced to Equation (10). [...] In 100 experiments we obtained the following results:
1. On the Hopfield network (9), we have 23 incorrect solutions, 43 solutions with rank 25, and 34 optimal solutions (with rank 26) (Fig. 9).
2. On the Wang network (10) with the same parameters and the additional parameter equal to 500, all 100 solutions are correct: 27 solutions have rank 25 and 73 solutions are optimal (with rank 26) (Fig. 10).
Further investigations should be directed at increasing the probability of obtaining optimal solutions of the mapping problem as the number of parallel program nodes increases.

Construction of Hamilton cycles in graphs of computer systems
In this section, we consider algorithms for nesting ring structures of parallel programs of distributed CSs, which are based on using recurrent neural networks, under the condition [...]. The traveling salesman problem can be formulated as an assignment problem (Wang, 1993; Siqueira, Steiner & Scheer, 2007, 2010) [...]. For solving problem (11)-(12), J. [...]

Transformation of the resultant decision matrix $x_{ij}$ is performed: the maximum element is sought in the $i$th row of the matrix ($j_{\max}$ is the number of the column with the maximum element).
3. If the cycle returns to row 1 before the value 1 has been assigned to $n$ elements of the matrix $x_{ij}$, the length of the constructed cycle is smaller than $n$. In this case, steps 1 and 2 are repeated.
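The WTA transformation of the decision matrix can be sketched as a row-to-row traversal. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, indices are 0-based (row 0 plays the role of row 1), and columns already chosen are excluded from the search, which has the same effect as setting them to zero when the matrix entries are non-negative.

```python
def wta_cycle(x):
    """Start in row 0, pick the largest remaining element of the current
    row, assign 1 to it, and move to the row whose number equals the
    chosen column. Stop when column 0 is chosen (return to the first row).

    Returns the visited rows in order and the resulting 0/1 decision matrix.
    """
    n = len(x)
    decision = [[0] * n for _ in range(n)]
    used = set()      # columns that have already received a 1
    cycle = [0]
    i = 0
    while True:
        j_max = max((j for j in range(n) if j not in used),
                    key=lambda j: x[i][j])
        decision[i][j_max] = 1
        used.add(j_max)
        if j_max == 0:
            return cycle, decision
        cycle.append(j_max)
        i = j_max


soft = [[0.0, 0.9, 0.1, 0.0],
        [0.1, 0.0, 0.8, 0.1],
        [0.0, 0.1, 0.0, 0.9],
        [0.7, 0.2, 0.1, 0.0]]
cycle, decision = wta_cycle(soft)
```

If `len(cycle)` is smaller than `n`, the traversal closed too early and, as in step 3 above, the network iterations and the transformation would be repeated.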

The transformation
To ensure effective operation of the Hamiltonian cycle construction algorithm, the parameter values of system (14) were chosen experimentally (to an order of magnitude). The experiments show that it is not always possible to construct a Hamiltonian cycle at $\Delta t = 1$, but the cycle construction is completed successfully if the step $\Delta t$ is reduced. We halved the step, replacing $\Delta t$ by $\Delta t / 2$, whenever a correct cycle could not be constructed after ten attempts.
The parameters $c_{ij}$, $i \ne j$, are calculated by formula (7), where $d_{ij}$ is the distance between the nodes $i$ and $j$ of the graph and $p > 1$ is the penalty coefficient applied if the distance $d_{ij}$ exceeds 1. The penalty coefficient was introduced to make the transitions in the travelling agent's cycle coincide with the edges of the CS graph.
We studied the use of iterative methods (Jacobi, Gauss-Seidel, and successive overrelaxation (SOR) methods (Ortega, 1988)) in solving Wang's system of equations.With the notation


the Jacobi method (method of simple iterations) for solving system (14) has a form in which the new values are calculated only after all the required quantities of the current iteration have been found. In contrast to the method of simple iterations, in the Gauss-Seidel method each new value is calculated immediately after the corresponding quantity has been found.
In the SOR method, the calculations are performed with a relaxation parameter; when this parameter equals 1, the SOR method reduces to the Gauss-Seidel method.
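The difference between the three update orders can be shown on a small linear system $Ax = b$. This is a generic textbook illustration, not system (14): the matrix, the function names, and the symbol `omega` for the relaxation parameter are assumptions made for the example.

```python
def jacobi_step(a, b, x):
    """One Jacobi sweep: every new component uses only the old vector x."""
    n = len(b)
    return [(b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)]


def sor_step(a, b, x, omega=1.0):
    """One SOR sweep: each new component is used immediately.
    With omega = 1 this is exactly a Gauss-Seidel sweep."""
    n = len(b)
    x = list(x)
    for i in range(n):
        gs = (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
        x[i] = (1 - omega) * x[i] + omega * gs
    return x


def iterate(step, a, b, iters=50, **kwargs):
    """Apply a sweep function repeatedly, starting from the zero vector."""
    x = [0.0] * len(b)
    for _ in range(iters):
        x = step(a, b, x, **kwargs)
    return x


# A strictly diagonally dominant system with exact solution (0.1, 0.6).
a = [[4.0, 1.0], [2.0, 3.0]]
b = [1.0, 2.0]
x_jacobi = iterate(jacobi_step, a, b)
x_gauss_seidel = iterate(sor_step, a, b, omega=1.0)
x_sor = iterate(sor_step, a, b, omega=0.9)
```

All three variants converge for this small system; they differ only in when freshly computed components become visible to the remaining updates and in how strongly each update is damped.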
Experiments on 2D-tori with the group of automorphisms [...] show that the Jacobi method can only be used for tori with a small number of nodes ($m \in \{3, 4\}$).
The SOR method can be used for tori with $m \in \{3, 4, 6\}$ with an appropriate selection of the relaxation parameter. For $m \ge 8$, it is reasonable to use the Gauss-Seidel method (relaxation parameter equal to 1). Figure 11 shows an example of a Hamiltonian cycle constructed by a neural network in a 2D-mesh with $n = 16$ (the cycle is indicated by the bold line).
The algorithm execution times on a Pentium(R) Dual-Core CPU E5200, 2.5 GHz (a time equal to zero means that standard procedures could not register times shorter than 0.015 s) are listed in Tables 1 and 2. In addition to the quantities listed in Tables 1 and 2, Tables 3 and 4 give the relative increase in the travelling salesman cycle length, as compared with the Hamiltonian cycle length, for a 3D-torus and a hypercube. It follows from Tables 3 and 4 that:
1. In 3D-tori, a Hamiltonian cycle was constructed for $n = 64$. For $n = 216$, 512, and 1000, suboptimal cycles were constructed, which were longer than the Hamiltonian cycles by no more than 1.6%.
2. In hypercubes, Hamiltonian cycles were constructed for $n = 16$ and $n = 64$ (note that the hypercube with $n = 16$ is isomorphic to the 2D-torus with $n = 16$). For $n = 256$ and $n = 1024$, suboptimal cycles were constructed, which were longer than $n$ by no more than 2.3%.

Construction of Hamiltonian cycles in toroidal graphs with edge defects
The capability of recurrent neural networks to converge to stable states can be used for mapping program graphs onto CS graphs whose regularity is violated by the deletion of edges and/or nodes. Such violations of regularity are called defects. In this work, we study the construction of Hamiltonian cycles in toroidal graphs with edge defects. Experiments were conducted on 2D-tori with one deleted edge and with $n = 9$ to $n = 256$ nodes for $p = n$. The experiments show that the construction of Hamiltonian cycles in these graphs by the above-described algorithm is possible, but the value of the step $\Delta t$ at which the cycle can be constructed depends on the choice of the deleted edge. The method of automatic selection of the step $\Delta t$ is described at the beginning of Section 3. Table 5 illustrates the dependence of the step $\Delta t$ on the choice of the deleted edge in constructing [...]
For unification of two cycles $R_1$ and $R_2$, it is sufficient that the system graph has a cycle ABCD of length 4 such that the edge AB belongs to the cycle $R_1$ and the edge CD belongs to the cycle $R_2$ (Fig. 13).
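The cycles $R_1$ and $R_2$ can be united into one cycle by the following algorithm:
1. Find a cycle ABCD possessing the above-noted property.
2. Eliminate the edge AB from the cycle $R_1$ and number the nodes of $R_1$ successively so that the node A is assigned the number 0 and the node B is assigned the number $L_1 - 1$, where $L_1$ is the length of the cycle $R_1$. Include the edge BC in the cycle.
3. Eliminate the edge CD and number the nodes of the cycle $R_2$ successively so that the node C is assigned the number $L_1$ and the node D is assigned the number $L_1 + L_2 - 1$, where $L_2$ is the length of the cycle $R_2$. Include the edge DA in the cycle. The unified cycle of length $L_1 + L_2$ is constructed.
The cycles $R_1$ and $R_2$, as well as the resulting cycle, are marked by bold lines in Fig. 12. The edges that are not included in these cycles are marked by dotted lines.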
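The unification of two cycles through a 4-cycle ABCD can be sketched as follows. This is an illustrative implementation under stated assumptions: cycles are given as node lists in cyclic order, the system graph as a dictionary of neighbor sets, and the function names are hypothetical. The position of a node in the returned list corresponds to the number assigned to it by the numbering described above (A gets 0, B gets $L_1 - 1$, C gets $L_1$, D gets $L_1 + L_2 - 1$).

```python
def open_paths(cycle):
    """Yield every path obtained by deleting one edge of the cycle,
    in both directions; the path ends are the ends of the deleted edge."""
    n = len(cycle)
    for s in range(n):
        path = [cycle[(s + k) % n] for k in range(n)]
        yield path
        yield path[::-1]


def unite_cycles(r1, r2, adj):
    """Search for a 4-cycle ABCD with AB in r1 and CD in r2, then return
    the united cycle A ... B C ... D, or None if no such 4-cycle exists."""
    for p1 in open_paths(r1):            # p1 runs from A to B
        a, b = p1[0], p1[-1]
        for p2 in open_paths(r2):        # p2 runs from C to D
            c, d = p2[0], p2[-1]
            if c in adj[b] and a in adj[d]:
                return p1 + p2
    return None


# Illustrative 2 x 4 mesh: row 0 holds nodes 0..3, row 1 holds nodes 4..7.
mesh = {0: {1, 4}, 1: {0, 2, 5}, 2: {1, 3, 6}, 3: {2, 7},
        4: {0, 5}, 5: {4, 6, 1}, 6: {5, 7, 2}, 7: {6, 3}}
united = unite_cycles([0, 1, 5, 4], [2, 3, 7, 6], mesh)
```

The exhaustive search over deleted edges is chosen for clarity only; it examines on the order of $L_1 L_2$ edge pairs.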
The proposed approach can be applied to constructing Hamiltonian cycles in arbitrary unweighted undirected graphs without multiple edges and loops.
We can use the splitting method to construct Hamilton cycles in three-dimensional tori, because a three-dimensional torus can be considered as a connected set of two-dimensional tori. So, the Hamilton cycle in a three-dimensional torus can be constructed as follows: [...] If the Hamilton cycles in all the two-dimensional tori are optimal, then the resulting Hamilton cycle in the three-dimensional torus is optimal too.
So, the experiments show that the proposed algorithm:
1. Constructs optimal Hamilton cycles in 2D-tori with edge defects;
2. Allows constructing optimal Hamilton cycles in 3D-tori with tens of thousands of nodes (see Table 7).

Conclusion
The problem of mapping graphs of parallel programs onto graphs of distributed computer systems by recurrent neural networks is formulated. The parameter values that ensure the absence of incorrect solutions are determined experimentally. Optimal solutions are found for mapping a "line" graph onto a two-dimensional torus by introducing into the Lyapunov function penalty coefficients for program graph edges that are not mapped onto system graph edges.
To increase the probability of finding an optimal mapping, a method for splitting the mapping is proposed. The essence of the method is the reduction of the solution matrix to a block-diagonal form. The Wang recurrent neural network is used to exclude incorrect solutions of the problem of mapping a line graph onto a three-dimensional torus. This network converges faster than the Hopfield network.
An efficient algorithm based on the recurrent Wang neural network and the WTA principle is proposed for constructing Hamiltonian cycles (ring program graphs) in regular graphs (2D- and 3D-tori, and hypercubes) of distributed computer systems and in 2D-tori disturbed by the removal of an arbitrary edge (edge defect). The neural network parameters for constructing Hamiltonian cycles and suboptimal cycles with a length close to that of Hamiltonian ones are determined.
The resulting algorithm allows us to construct optimal Hamilton cycles in 3D-tori with up to 32768 nodes. This algorithm is relevant for modern supercomputers that use the 3D-torus topology to organize inter-processor communications in the parallel solution of complicated problems.
The recurrent neural (Hopfield and Wang) network is a universal technique for solving optimization problems, but it is a local optimization technique, and additional modifications (for example, penalty coefficients and splitting) are needed to improve its scalability.
The proposed algorithm for constructing Hamiltonian cycles is less universal but more powerful, because it implements a global optimization approach and is therefore much more scalable than traditional recurrent neural networks.
The traditional topology-aware mappings (Parhami, 2002; Yu, Chung & Moreira, 2006; Balaji, Gupta, Vishnu & Beckman, 2011) are constructed specifically for regular graphs (hypercubes and tori) of distributed computer systems. The proposed neural network algorithms are more universal and can be used for mapping program graphs onto graphs of distributed computer systems with defects of edges and nodes.

Fig. 2. Example of a 2D-torus
In this paper, we consider the problem of mapping the graph $G_p(V_p, E_p)$ of a parallel program onto the graph $G_s(V_s, E_s)$ of a distributed CS, where $n = |V_p| = |V_s|$ is the number of program branches (of ECs). The mapping objective is to map the nodes of the program graph $G_p$ one-to-one onto the nodes of the system graph $G_s$ so that the edges of $G_p$ are mapped onto edges of $G_s$.

Fig. 4. Example of optimal mapping of a "line"-graph onto a torus (the mapping is distinguished by bold lines; the line-graph's node numbers are shown in brackets)

Fig. 8. Histograms of mappings for the neural network (9)
We note that in experiments we frequently have incorrect solutions if for a given maximal number of iterations $t_{\max}$ [...]
Fig. 9. Histogram of mappings for the Hopfield network (9)
Such nesting reduces to constructing a Hamiltonian cycle in the CS graph and is based on solving the traveling salesman problem using the matrix of distances $d_{ij}$ ($i, j = 1, \dots, n$) between the CS graph nodes, with the distance between neighboring nodes of the CS graph taken as the unit distance.
[...] the cost of assignment of the element $i$ to the position $j$, which corresponds to the motion of the traveling salesman from the city $i$ to the city $j$; $x_{ij}$ is the decision variable: if the element $i$ is assigned to the position $j$, [...] Siqueira et al. proposed a method of accelerating the solution of system (14), which is based on the WTA ("Winner takes all") principle. The algorithm proposed below was developed on the basis of this method. [...] the specified accuracy of satisfying constraints (12).
[...] All the remaining elements of the $i$th row and of the column numbered $j_{\max}$ are set to zero. Then, there follows a transition to the row numbered $j_{\max}$. Steps 2.2 and 2.3 are repeated until the cycle returns to the first row, which means that the cycle construction is finalized.
of these parameters from the above-indicated values deteriorate algorithm operation, namely: 1. Deviations of the parameter C from the indicated value (at a fixed value of D )deteriorate the solution quality (the cycle length increases).2.A decrease in  increases the number of non-Hamiltonian ring-shaped routes.3.An increase in  deteriorates the solution quality.A decrease in  increases the number of iterations (14).It follows from(Feng & Douligeris, 2001) that max

Fig. 11. Example of a Hamiltonian cycle in a 2D-mesh
In 100 experiments, we obtained Hamiltonian cycles (with the cycle length $L = n$) in 2D-meshes and 2D-tori with up to $n = 1024$ nodes for $m = 2k$, and suboptimal cycles of length $L = n + 1$ for $m = 2k + 1$, $k = 2, 3, \dots, 16$. The penalty coefficients $p$ and the values of $\Delta t$ with which the Hamiltonian cycles were constructed for $n = 16$, 64, 256, and 1024, as well as the algorithm execution times on a Pentium(R) Dual-Core CPU E5200, 2.5 GHz (a time equal to zero means that standard procedures could not register times shorter than 0.015 s), are listed in Tables 1 and 2.
Fig. 12. Examples of Hamiltonian cycles in 2D-torus
[...] where $L_1$ is the length of the cycle $R_1$, to the node B. Include the edge BC in the cycle. 3. Eliminate the edge CD and number the nodes of the cycle $R_2$ successively so that the node C is assigned the number $L_1$ and the node D is assigned the number $L_1 + L_2 - 1$, where $L_2$ is the length of the cycle $R_2$. Include the edge DA in the cycle. The unified cycle of length $L_1 + L_2$ is constructed.

Table 4 .
HypercubeIt follows from Tables  in the travelling salesman cycle length, as compared with the Hamiltonian cycle length, for a 3D-torus and hypercube.

Table 6. Comparison of cycle construction times in 2D-mesh
Table 6 gives the times (in seconds) of constructing Hamiltonian cycles in a 2D-mesh by the initial algorithm ($t_1$) and by the algorithm with splitting of cycle construction ($t_2$) with the number of subgraphs $k = 2$. The times are measured for $p = n$. The cycle construction time can be additionally reduced by parallel construction of cycles in subgraphs.

Table 7. Construction of Hamiltonian cycles in 3D-torus
In Table 7, the times (in seconds) of construction of optimal Hamilton cycles in three-dimensional tori with $n = m \times m \times m$ nodes are presented: $t_{seq}$ is the time of the sequential algorithm, and $t_{par}$ is the time of the parallel algorithm on an Intel Pentium Dual-Core CPU E5200, 2.5 GHz, using the OpenMP parallel programming system (Chapman, Jost & van der Pas, 2008).