Home | | **An Introduction to Parallel Programming** | | **Multi - Core Architectures and Programming** | Two n-Body Solvers

In an n-body problem, we need to find the positions and velocities of a collection of interacting particles over a period of time.

**TWO**** ***n***-BODY SOLVERS**

In an *n*-body problem, we need to find the positions and velocities of a
collection of interacting particles over a period of time. For example, an
astrophysicist might want to know the positions and velocities of a collection
of stars, while a chemist might want to know the positions and velocities of a
collection of molecules or atoms. An *n*-body solver is a program that finds the solution to an* n*-body problem by simulating* *the behavior of the particles. The input to the
problem is the mass, position, and velocity of each particle at the start of
the simulation, and the output is typically the position and velocity of each
particle at a sequence of user-specified times, or simply the position and
velocity of each particle at the end of a user-specified time period.

Let’s first develop a serial *n*-body solver. Then we’ll try to parallelize it
for both shared- and distributed-memory systems.

**1. ****The problem**

For the sake of explicitness, let’s write an *n*-body solver that simulates the motions of
planets or stars. We’ll use Newton’s second law of motion and his law of
universal gravitation to determine the positions and velocities. Thus, if
particle *q* has position **s*** _{q}*(

or, more often, simply the positions and
velocities at the final time *T*∆*t*. Here, ∆*t* and *T *are specified by the user, so the input to the program will be* n*, the number of parti-cles, ∆*t*, *T*, and, for each particle, its mass, its initial position, and its
initial velocity. In a fully general solver, the positions and velocities would
be three-dimensional vec-tors, but in order to keep things simple, we’ll assume
that the particles will move in a plane, and we’ll use two-dimensional vectors
instead.

The output of the program will be the positions
and velocities of the *n* particles at the timesteps 0, ∆*t*, 2∆, ...., or just the positions and velocities at *T*∆*t*. To get the output at only the final time, we can add an input
option in which the user specifies whether she only wants the final positions
and velocities.

**2. ****Two serial programs**

In outline, a serial *n*-body solver can be based on the following
pseudocode:

Get input data;

**for **each
timestep** {**

**if **(timestep
output) Print positions and velocities of

particles;

**for **each
particle q

Compute total force on q;

**for **each
particle q

Compute position and velocity
of q;

*}*

Print positions and
velocities of particles;

We can use our formula for the total force on a
particle (Formula 6.2) to refine our pseudocode for the computation of the
forces in Lines 4–5:

Here, we’re assuming that the forces and the
positions of the particles are stored as two-dimensional arrays, forces and pos, respectively. We’re also assuming we’ve
defined constants X = 0 and Y = 1. So the *x*-component of the force on particle *q *is* *forces[q][X]* *and the* y*-component is* *forces[q][Y]. Similarly, the compo-nents of the position
are pos[q][X] and pos[q][Y]. (We’ll take a closer look at data structures
shortly.)

We can use Newton’s third law of motion, that
is, for every action there is an equal and opposite reaction, to halve the
total number of calculations required for the forces. If the force on particle *q* due to particle *k* is **f*** _{qk}*, then the force on

(Why are the diagonal entries 0?) Our original
solver simply adds all of the entries in row *q* to get forces[q]. In our modified solver, when *q* ** =** 0, the body of the loopp

**Program 6.1: **A reduced algorithm for computing** ***n*-body forces

**for **each particle q** **will add the entries in row 0
into** **forces[0]. It will also add** **the *k*th entry in column 0 into forces[k] for *k* ** =** 1, 2,…….,

Note that in using this modified solver, it’s
necessary to initialize the forces array in a separate loop, since the *q*th iteration of the loop that calculates the
forces will, in general, add the values it computes into forces[k] for *k* *=**q* ** +** 1,

In order to distinguish between the two
algorithms, we’ll call the *n*-body solver with the original force calculation, the *basic* algorithm, and the solver with the number of
calculations reduced, the *reduced* algorithm.

The position and the velocity remain to be
found. We know that the acceleration of particle *q* is given by

where **s^{’’}**

We might at first think that we can simply find
an antiderivative of the function in Formula 6.3. However, a second look shows
us that this approach has problems: the right-hand side contains unknown
functions **s*** _{q}* and

the values of the position and velocity at the
times of interest. There are *many* possible choices for numerical methods, but we’ll use the simplest
one: Euler’s method, which is named after the famous Swiss mathematician
Leonhard Euler (1707–1783). In Euler’s method, we use the tangent line to
approximate a function. The basic idea is that if we know the value of a
function *g*(*t*_{0}) at time *t*_{0} and we
also know its derivative *g^{’}(t*

*y = g(t _{0}) + g’(t_{0})(t
− t_{0}).*

Since we’re interested in the time *t* *=**t*_{0} *+**∆**t*, we get

*g(t + **∆**t) ≈ g(t _{0}) + g’(t_{0})(t
+ *

Note that this formula will work even when *g*(*t*) and *y* are vectors: when this is the case, *g^{’}*(

Now we know the value of **s*** _{q}*(

When we try to extend this approach to the
computation of **s*** _{q}*(2

A struct is also an option. However, if we’re
using arrays and we decide to change our program so that it solves
three-dimensional problems, in principle, we only need to change the macro DIM. If we try to do this with structs, we’ll need to rewrite the code
that accesses individual components of the vector.

For each particle, we need to know the values
of

. its mass,

. its position,

. its velocity,

. its acceleration, and

. the total force acting on it.

Since we’re using Newtonian physics, the mass
of each particle is constant, but the other values will, in general, change as
the program proceeds. If we examine our code, we’ll see that once we’ve
computed a new value for one of these variables for a given timestep, we never
need the old value again. For example, we don’t need to do anything like this

Also, the acceleration is only used to compute
the velocity, and its value can be computed in one arithmetic operation from
the total force, so we only need to use a local, temporary variable for the
acceleration.

For each particle it suffices to store its mass
and the current value of its position, velocity, and force. We could store
these four variables as a struct and use an array of structs to store the data
for all the particles. Of course, there’s no reason that all of the variables
associated with a particle need to be grouped together in a struct. We can
split the data into separate arrays in a variety of different ways. We’ve
chosen to group the mass, position, and velocity into a single struct and store
the forces in a separate array. With the forces stored in contiguous memory, we
can use a fast function such as memset to quickly assign zeroes to all of the
elements at the beginning of each iteration:

If the force on each particle were a member of
a struct, the force members wouldn’t occupy contiguous memory in an array of
structs, and we’d have to use a relatively slow **for** loop to assign zero to each element.

**3. ****Parallelizing the**** ***n***-body solvers**

Let’s try to apply Foster’s methodology to the *n*-body solver. Since we initially want *lots *of tasks, we can start by making our tasks the
computations of the positions, the* *velocities, and the total forces at each timestep. In the basic
algorithm, the algorithm in which the total force on each particle is
calculated directly from Formula 6.2, the

computation of **F*** _{q}*(

For the reduced algorithm, the “intra-particle”
communications are the same. That is, to compute **s*** _{q}*(

Recollect that in the reduced algorithm, we
make use of the fact that the force **f**_{rq}** = - f**

The final stage in Foster’s methodology is
mapping. If we have *n* particles and *T *timesteps, then there will be* nT *tasks in both the basic and the reduced algorithm.* *Astrophysical *n*-body problems typically involve thousands or even millions of
par-ticles, so *n* is likely to be several orders of magnitude greater than the
number of available cores. However, *T* may also be much larger than the number of available cores. So, in
principle, we have two “dimensions” to work with when we map tasks to cores.
However, if we consider the nature of Euler’s method, we’ll see that
attempt-ing to assign tasks associated with a single particle at different
timesteps to different cores won’t work very well. Before estimating **s*** _{q}*(

At first glance, it might seem that any
assignment of particles to cores that assigns roughly *n*=thread count particles to each core will do a good job of balancing the
workload among the cores, and for the basic algorithm this is the case. In the
basic algorithm the work required to compute the position, velocity, and force
is the same for every particle. However, in the reduced algorithm the work
required in the forces computation loop is much greater for lower-numbered
iterations than the work required for higher-numbered iterations. To see this,
recall the pseudocode that computes the total force on particle *q* in the reduced algorithm:

Then, for example, when *q* ** =** 0, we’ll make

However, in a shared-memory setting, a cyclic
partition of the particles among the cores is almost certain to result in a
much higher number of cache misses than a block partition, and in a
distributed-memory setting, the overhead involved in communicat-ing data that
has a cyclic distribution will probably be greater than the overhead involved
in communicating data that has a block distribution (see Exercises 6.8 and
6.9).

Therefore with a composite task consisting of
all of the computations associated with a single particle throughout the
simulation, we conclude the following:

**
**A block
distribution will give the best performance for the basic *n*-body solver.

**
**For the
reduced *n*-body solver, a cyclic distribution will best
distribute the work-load in the computation of the forces. However, this
improved performance *may* be offset by the cost of reduced cache performance in a
shared-memory setting and additional communication overhead in a
distributed-memory setting.

In order to make a final determination of the
optimal mapping of tasks to cores, we’ll need to do some experimentation.

**4. ****A word about I/O**

You may have noticed that our discussion of
parallelizing the *n*-body solver hasn’t touched on the issue of I/O, even though I/O
can figure prominently in both of our serial algorithms. We’ve discussed the
problem of I/O several times in earlier chap-ters. Recall that different
parallel systems vary widely in their I/O capabilities, and with the very basic
I/O that is commonly available it is very difficult to obtain high performance.
This basic I/O was designed for use by single-process, single-threaded
programs, and when multiple processes or multiple threads attempt to access the
I/O buffers, the system makes no attempt to schedule their access. For example,
if multiple threads attempt to execute

printf("Hello from thread %d of %d** n**n",
my_rank, thread count);

more or less simultaneously, the order in which
the output appears will be unpre-dictable. Even worse, one thread’s output may
not even appear as a single line. It can happen that the output from one thread
appears as multiple segments, and the individual segments are separated by
output from other threads.

Thus, as we’ve noted earlier, except for debug
output, we generally assume that one process/thread does all the I/O, and when
we’re timing program execution, we’ll use the option to only print output for
the final timestep. Furthermore, we won’t include this output in the reported
run-times.

Of course, even if we’re ignoring the cost of
I/O, we can’t ignore its existence. We’ll briefly discuss its implementation
when we discuss the details of our parallel implementations.

Study Material, Lecturing Notes, Assignment, Reference, Wiki description explanation, brief detail

**Related Topics **

Copyright © 2018-2020 BrainKart.com; All Rights Reserved. Developed by Therithal info, Chennai.