Cornell Theory Center

Virtual Workshop Module

CTC

Parallel Processing Concepts

Play video for broadband connection
Saleh Elmohamed
Video Introduction
   View with modem or broadband connection
Read the text transcript


Table of Contents

  1. Overview and Goals of Parallel Processing
  2. Why Use Parallel Processing to Reach These Goals?
  3. Taxonomy of Architectures
  4. Terminology of Parallelism
  5. Models of Memory Access
  6. Converting From Serial to Parallel Execution
  7. Costs of Parallel Processing
  8. Parallel Processing at the Theory Center
  9. Parallel Programming Example
  10. Conclusions
Quiz Evaluation Navigation Guide


1. Overview and Goals of Parallel Processing

Overview of Parallel Processing

Parallel processing is the use of multiple processors to execute different parts of the same program simultaneously.

The main goal of parallel processing is to reduce wall-clock time. There are also other reasons, which will be discussed later on in this section.

Imagine yourself having to order a deck of playing cards: a typical solution would be to first order them by suit, and then by rank-order within each suit. If there were two of you doing this, you could split the deck between you and both could follow the above strategy, combining your partial solutions at the end; or one of you could sort by suit, and the other by order within suits, both of you working simultaneously. Both of these scenarios are examples of the application of parallel processing to a particular task, and the reason for doing so is very simple: reduce the amount of time before achieving a solution. If you've ever played card games that use multiple decks, you've almost certainly engaged in parallel processing by having multiple people help with the tasks of collecting, sorting and shuffling all of the cards.

You can use this analogy to see indications of both the power and the weakness of the parallel approach, by taking it gradually to its extreme: as you increase the number of helpers involved in a particular task, you'll generally experience a characteristic speedup curve demonstrating how up-to-a-certain-number of helpers is beneficial, but any over that simply get in each others' way and reduce the overall time to completion. Consider, for example, how little it would help to have 52 people crowding around a table, each responsible for putting one particular card into its proper place in the deck -- this is exactly what is meant by the adage Too many cooks spoil the broth.

If you'd like to do an exercise which will drive home some of the aspects of both the power and the limitations of parallel programming, click here.


1.1 Goals of Parallel Processing

  1. Reduce Wall Clock Time

    In the just-stated goal, you'll notice that it isn't simply time that's being reduced, but wall-clock time; other kinds of "time" could have been emphasized, for example CPU time, which is a count of the exact number of CPU cycles spent working on your job, but wall-clock is considered to be the most significant because it's what you-the-researcher want to spend as little of as possible waiting for the solution: your own time. What is considered to be "acceptable" differs with the situation, and involves characteristics of the user, the particular code being run, and the system it's being run on. But in all cases, it is safe to assume that the user is generally going to be more pleased the faster a solution is produced.

    It should be pointed out that reducing the wall-clock time to solution is only one of many possible goals that might be of interest; some others might be:

  2. Cheapest Possible Solution Strategy

    Running programs costs money; different ways of achieving the same solution could have significantly different costs. If you're in a fiscally tight situation, you may have no reasonable recourse to a parallel strategy if it costs more than your budget allows. At the same time, you may find that running your program in parallel across a large number of workstation-type computers could cost considerably less than submitting it to a large, mainframe-style mega-beast.

  3. Local versus Non-Local Resources

    "Locality," here, usually refers to either "geographical locality" or "political locality." The former is just another way of saying that you want all of your processes to be "close" to one another in terms of communications; the latter indicates that you only want to use resources that are administratively open to you. Both can have a bearing on "cost;" e.g., the more communications latency incurred by your application, the longer it will run, and the more you might be charged, while use of resources "owned" by other organizations may also carry charges.

  4. Memory Constraints

    As researchers become increasingly computationally sophisticated, the complexity of the problems they tackle increases proportionally (some would say superlinearly, that researchers are forever trying to bite off more than their computers can chew). One of the first resources to get exhausted is local memory -- especially for Grand Challenge level projects, the amount of memory available to a single system is rarely going to be sufficient to the computational and data-storage needs encountered during application runs.

    This situation is greatly alleviated by having access to the aggregate memory made available by distributed computing environments: working storage (main memory) requirements can be spread around the various processors engaged in the cooperative computation, and long-term storage (tape and disk) can be accommodated at different locations (indeed, data security can be enhanced by arranging for multiple copies to be maintained at distinct locations in the distributed environment).

You can probably think of others; basically, any limited resource can be considered as the object of optimization, if it is deemed to be the most important quantity to conserve. In most cases involving large-scale computation, however, user time, as measured by the clock on the wall, is considered to be the single most valuable resource to be conserved.


2. Why Use Parallel Processing to Reach These Goals?

Accepting reduction of wall-clock time as the fundamental goal of our activities, why necessarily focus on parallel programming as the means to this end? Aren't there other approaches that can also yield fast turnarounds? Yes, indeed there are, the most significant being the oldest: "beef up that old mainframe," make the standalone single-processor design larger (e.g., increase the amount of memory it can directly address) and more powerful (e.g., increase its basic wordlength and computational precision) and faster (e.g., use smaller-micron etching technology, packing more transistors into less space, and coupling everything with larger and faster communications pathways).

This approach, though, can only be pushed so far, and indications are getting stronger and stronger that fundamental limitations are going to put permanent roadblocks up before too much longer. Three of these are:


3. Taxonomy of Architectures

To set a foundation for our examination of parallel processing, we need to understand just what kinds of processing alternatives have already been identified, and where they fit into the "parallel picture", if you will. One of the longest-lived and still very reasonable classification schemes was proposed by Flynn, in 1966, and distinguishes computer architectures according to how they can be classified along two independent, binary-valued dimensions; independent simply asserts that neither of the two dimensions has any effect on the other, and binary-valued means that each dimension has only two possible states, as a coin has only two distinct flat sides. For computer architectures, Flynn proposed that the two dimensions be termed Instruction and Data, and that, for both of them, the two values they could take be Single or Multiple. The two dimensions could then be drawn like a matrix having two rows and two columns, and each of the four cells thus defined would characterize a unique type of computer architecture.

Single Instruction, Single Data (SISD)

This is the oldest style of computer architecture, and still one of the most important: all personal computers fit within this category, as did most computers ever designed and built until fairly recently. Single instruction refers to the fact that there is only one instruction stream being acted on by the CPU during any one clock tick; single data means, analogously, that one and only one data stream is being employed as input during any one clock tick. These factors lead to two very important characteristics of SISD style computers:


Multiple Instruction, Single Data (MISD)


Few actual examples of computers in this class exist; this category was included more for the sake of completeness than to identify a working group of actual computer systems. However, special-purpose machines are certainly conceivable that would fit into this niche: multiple frequency filters operating on a single signal stream, or multiple cryptography algorithms attempting to crack a single coded message. Both of these are examples of this type of processing where multiple, independent instruction streams are applied simultaneously to a single data stream.

A less technological, but perhaps more cosmopolitan example, was suggested by a participant in the Cornell Theory Center's Virtual Workshop:


Single Instruction, Multiple Data (SIMD)

A very important class of architectures in the history of computation, single-instruction/multiple-data machines are capable of applying the exact same instruction stream to multiple streams of data simultaneously. For certain classes of problems, e.g., those known as data-parallel problems, this type of architecture is perfectly suited to achieving very high processing rates, as the data can be split into many different independent pieces, and the multiple instruction units can all operate on them at the same time.

The most advanced parallel processor arrays that are in production today:

Note: The face of  HPC is changing very quickly. While few years ago a lot of SIMD vector pipeline supercomputers had been produced, few of them are produced today. However some of them are still in production run. You may find here the list of major SIMD platform that may be still available at some supercomputing centers.


Multiple Instruction, Multiple Data (MIMD)

Many believe that the next major advances in computational capabilities will be enabled by this approach to parallelism which provides for multiple instruction streams simultaneously applied to multiple data streams. The most general of all of the major categories, a MIMD machine is capable of being programmed to operate as if it were in fact any of the four.


4. Terminology of Parallelism

Parallel processing has its own lexicon of terms and phrases, emphasizing those concepts that are considered to be most important to its goals and the ways in which those goals may be achieved. The following are some of the more commonly encountered ones. They are listed in an order for you to learn them, assuming you do not know any, so you can start with the first and then build up to the rest.

Task

A logically discrete section of computational work.

This is a somewhat loose definition, but adequate for this introduction. For now, think of a task as computational work you can describe simply, such as, "calculate the mean and standard deviation of 100,000 numbers," or "calculate a Fast Fourier transform."


Parallel Tasks

Tasks whose computations are independent of each other, so that all such tasks can be performed simultaneously with correct results.


Serial Execution

Execution of a program sequentially, one statement at a time.


Parallelizable Problem

A problem that can be divided into parallel tasks. This may require changes in the code and/or the underlying algorithm.

Example of Parallelizable Problem:

Example of a Non-parallelizable Problem:

A non-parallelizable problem, such as the calculation of the Fibonacci sequence above, would entail dependent calculations rather than independent ones -- notice how calculation of the k + 2 value uses those of both k + 1 and k, hence those three terms cannot be calculated independently, nor, therefore, in parallel.


Types of Parallelism

There are two basic ways to partition computational work among parallel tasks:


Observed Speedup

Observed speedup of a code which has been parallelized =

         wall-clock time of serial execution 
       --------------------------------------- 
        wall-clock time of parallel execution
    

One of the most widely used indicators of parallelizability, the calculation of observed speedup is both intuitively satisfying, and potentially misleading; the former because a well-parallelized code can be shown to run in a fraction of the time that it takes the serial version, the latter because, in many respects, this is a comparison of apples and oranges: the codes are different, they perform different tasks, the algorithms may be entirely distinct. Still, there is no discounting the fact that a good job of parallelization will be evident in the amount of wallclock time it has saved the user; what is debatable is the converse: if it is not evident that a lot of time has been saved, is it because the problem itself is not parallelizable, or because the parallelization simply wasn't done well? This, by the way, is where parallel profiling tools, covered later in this presentation, can help tremendously.


Synchronization

The temporal coordination of parallel tasks. It involves waiting until two or more tasks reach a specified point (a sync point) before continuing any of the tasks.


Parallel Overhead

The amount of time required to coordinate parallel tasks, as opposed to doing useful work.

Parallelization doesn't come free, and one of the most insidious costs is the time and cycles put into making sure that all of those separate tasks are doing what they're supposed to be doing. Things that are simply taken for granted in serial execution, or that don't apply, take on special significance when there are many tasks instead of just one; the three most commonly encountered coordination tasks are:


Granularity

A measure of the ratio of the amount of computation done in a parallel task to the amount of communication.




Massively parallel system

A parallel system with many processors. "Many" is usually defined as 1000 or more processors.


Scalable parallel system

A parallel system to which the addition of more processors will yield a proportionate increase in parallel speedup. Whether or not this increase occurs typically depends on some combination of:


5. Models of Memory Access

Memory access refers to the way in which the working storage, be it "main-memory", "cache-memory", or whatever, is viewed by the programmer. Regardless of how the memory is actually implemented, e.g., if it's actually remotely located but is accessed as if it were local, the access method plays a very large role in determining the conceptualization of the relationship of the program to its data.

5.1 Shared Memory

Think of a single large blackboard, marked off so that all data elements have their own unique locations assigned, and all the members of a programming team are working together to test out a particular algorithmic design, all at the same time...this is an example of shared memory in action:


5.2 Distributed Memory


The other major distinctive model of memory access is termed distributed, for a very good reason:


 

5.3 Distributed Memory: Some Approaches

Distributed memory is, for all intents and purposes, virtually synonymous with message-passing, although the actual characteristics of the particular communication schemes used by different systems may hide that fact.