The RASSP Digest - Vol. 2, 4th. Qtr. 1995


An Overview of Automated Processor Specification and Task Allocation Techniques for Embedded Computer Systems

by Jim Beck and Dan Siewiorek


Abstract

This article considers the coupled design problems of processor specification and task allocation for embedded computer systems. Two distinct, complementary representations are presented to model the problem. The first is based on a generalization of vector packing while the second is a variation of graph partitioning. Based on these two problem representations, five distinct solution techniques are developed. The techniques are described, and their performance is analyzed on a suite of sixteen real and synthetic test cases with respect to two figures of merit: solution quality and run-time. The article concludes with a discussion of the results.

1. Introduction

This research is aimed at automating a portion of the design process for embedded, bus-based multicomputers. Since such a computer is embedded [7, 11, 12], it operates as a component within the scope of a larger system. It interacts with this system via sensors and actuators and there are often time constraints that further define the aspects of this interaction. An embedded computer is dedicated to a single application and it executes a single, well-defined set of software routines.

As the term multicomputer implies, the architecture consists of a network of loosely-coupled, autonomous processors [1, 5, 10]. The processors communicate with each other at a high level via message passing over a communication medium. In general, the topology of the communication subsystem may be arbitrary [5]. This research, however, is restricted to the case of communication over a single-level broadcast bus.

Important steps in the design process of embedded multicomputers include specifying the hardware capacities of the processors and statically allocating the software tasks to them. This must be carried out in a way that satisfies all of the software requirements without overutilizing any of the hardware components. Furthermore, hardware specification is complicated by the need to optimize system-level parameters relevant to the design, such as cumulative system cost or power consumption.

Processor specification and task allocation are mutually constrained, coupled problems. Thus, concurrent solution techniques are needed that can provide:

  1. A set of processor specifications
  2. An assignment of tasks to processors

    which are feasible and that optimize a design objective function.

1.1 Software Model

A synchronous data flow model is used to represent the embedded application software [9]. The model consists of a directed, acyclic graph. The nodes of the graph represent tasks and the arcs represent the data flow between them. Synchronous data flow is a special case of data flow in that the amount of data produced and consumed on each task invocation is known a priori. Thus, the resource requirements of the application are statically predictable, allowing static resource allocation.

Additional information is needed to convey the hardware resource requirements associated with the tasks. These requirements are application specific, and can occur across many dimensions such as throughput, memory or I/O channels. To model them, an application-specific resource demand vector is used that explicitly defines the resource needs across the dimensions which are pertinent to the design problem.

An example SDFG and its accompanying resource demand vector are shown in Figure 1.

1.2 Target Architecture

The target architecture consists of an arbitrary number of heterogeneous processors communicating via message passing over a broadcast bus. The hardware resources that are available on the processors are modeled by an application-specific resource capacity vector, which is analogous to and should be consistent with the resource demand vector used to model the tasks. The target architecture with an example resource capacity vector is illustrated in Figure 2.

1.3 Communication Model

Consider two communicating tasks in the SDFG. If these tasks are assigned to the same processor then the communication is essentially free, since it involves the transfer of data within the same, local address space. If, however, communicating tasks are assigned to different processors, then the communication results in message traffic over the broadcast bus. This message traffic must be modeled and considered when specifying the hardware devices and allocating the tasks.

1.4 Feasibility Constraints

A set of processor specifications and task allocations that satisfy all task requirements without overutilizing any of the hardware components are said to be feasible. A set of constraints is needed to define the feasibility condition. Such constraints are application specific, but they can be sorted into two broad categories: processor and bus constraints.

Processor constraints define when a specified processor can support its task set. Likewise, bus constraints define when the bus can support a given set of messages. Note, however, that the set of bus messages is determined as a by-product of task allocation decisions.

1.5 Design Objective Functions

Processor specification is complicated by the need to optimize system-level parameters such as hardware cost or power consumption. The parameters of interest are application-specific, so a user-specified objective function is needed to weight and quantify them.

As an example, consider the objective function that was used during experimentation, which is based on cumulative hardware cost. The system is assumed to consist of a network of heterogeneous, single-chip processing nodes and the objective function is an estimate of the cumulative cost of all processors. The cost of each processor is based on a nonlinear function of the IC die size which is specification-dependent. Die size contributions for each specification option were adapted from a proprietary IC fabrication model developed and used by a commercial producer of automotive electronics. The design objective function is summarized in Figure 3.

1.6 Problem Statement

Based on the information presented in the previous sections, a concise statement of the design problem is shown in Figure 4.

The coupled design problems have been proven to be NP-hard [3, 4] and the size of the design space has been shown to grow exponentially with problem size. Thus, effective heuristic solution techniques are necessary.

2. Problem Representations

Two representations were developed to model the coupled design problems. The first is packing-based while the second is partitioning-based. The details of each are presented in the subsections that follow.

2.1 Packing-Based Representation

Once a processor is specified, only a subset of the SDFG nodes can be assigned to it without violating feasibility constraints. This leads naturally to a packing-based representation of the coupled design problems which is shown in Figure 5.

Under this representation, each processor is viewed as a vector bin with a capacity defined with respect to the resource capacity vector. Similarly, each task is a vector object, whose demand is defined with respect to the resource demand vector. Likewise, the communication bus is modeled as a scalar bin with a capacity equal to its bandwidth. Task allocation becomes a matter of packing the vector objects into the vector bins such that none of the bins (including the scalar bus bin) overflow.

Under this problem representation, solution of the coupled design problems amounts to successively or incrementally specifying the processors and invoking a packing algorithm to perform task allocation. The set of processor specifications that optimizes the design objective function and that leads to a successful packing is chosen as the solution.

2.2 Partitioning-Based Representation

If an SDFG is modified by adding zero-weighted arcs between every pair of non-communicating nodes (i.e. tasks) in the graph, then a one-to-one correspondence exists between k-way partitions of the modified graph and k-processor task assignments. This leads to a partitioning-based representation of the coupled design problems which is shown in Figure 6.

As the figure indicates, an arbitrary partition of the modified SDFG yields k disjoint subgraphs and a set of cut arcs. The tasks represented by the nodes in a subgraph are assigned as a group to a processor, and then the resource capacities of the processor are specified using a method called shrink-wrapping. Shrink-wrapping effectively performs two functions (Note that for shrink-wrapping to be tractable, the hardware capacity elements and the feasibility constraints based on them must be mutually independent. This assumption is valid for most, but not all designs.):

  1. If a processor has a null task set then it is removed from the design.

  2. Each resource capacity in each processor is set to the smallest value such that no feasibility constraints are violated (if possible).

The cut arcs represent the IPC that the bus must support. The message set is formed by transforming each cut arc into a strip-mined sequence of messages. This transformation is based on the message format of the bus and it is represented by the function block in Figure 6.

As Figure 6 indicates, any arbitrary partition of the modified SDFG yields a solution attempt. This solution attempt may or may not be feasible, based on the set of user-defined feasibility constraints. Likewise, each feasible solution attempt will have a cost associated with it, based on the user-defined objective function. Thus, under this problem representation, solution of the coupled design problems amounts to obtaining or refining a partition of the modified SDFG such that the resulting solution attempt is feasible and optimizes the objective function.

3. Solution Techniques

Based on the two problem representation, five solution techniques were developed (SW, DA, SA, TREE and CP*) which are summarized in Figure 7.

The first algorithm, SW, is a packing-based approach. It is a one-pass, heuristic algorithm that establishes a lower bound on run-time. It uses a heuristic packing algorithm to perform task allocation and, after packing, shrink-wrapping is invoked to specify the processors. This algorithm is robust, in the sense that its only limitation on finding a solution when one exists are those inherent in the packing algorithm. It is greedy in the sense that the objective function is not explicitly optimized; rather, hardware specifications are determined as a by-product of the packing decisions.

DA is also a packing-based approach. It uses packing attempts interleaved with invocations of a custom design advisor to evolve a solution via a sequence of incremental refinements [3]. Although it utilizes the same packing algorithm as SW, its incremental design approach leads to greater optimization of the objective function.

SA is a partitioning-based approach. It uses simulated annealing [8] to refine a partition of the SDFG and shrink-wrapping to specify the processors. Since it performs a bounded, non-deterministic search of the design space, it can be used to benchmark solution quality. Its weakness is excessive run-times for moderate- to large-sized problems.

TREE is also a partitioning-based approach. It is a hybrid algorithm that uses a combination of refinement and search techniques to partition the SDFG. TREE begins by building and pruning a decomposition tree of design alternatives via recursive bipartitioning of the SDFG. Each node in the tree represents a shrink-wrapped processor to which a subset of SDFG nodes has been assigned. The tree is then searched for a near-optimal solution via simulated annealing. TREE is an example of an alternative, hybrid implementation of a non-deterministic search algorithm.

The final algorithm, CP*, is also a partitioning-based approach. It is a refinement technique that iterates until no further improvement in solution quality is found. Each iteration consists of coalescing and partitioning phases that are alternately applied to the SDFG. Coalescing attempts to improve the design via a sequence of pair-wise processor mergers, while partitioning employs a sequence of one-way task moves. The partitioning algorithm is a k-way extension of the Kernighan and Lin bipartitioning heuristic [6] and the coalescing algorithm is custom [3].

4. Results

The five algorithms were applied to the sixteen real and synthetic test cases described in Figure 8. (Realize that the average utilization level for any particular dimension on a processor is the value given in the figure multiplied by the number of tasks assigned to the processor. Thus, the examples shown in the figure do, indeed, span a large portion of the design space.)

Two figures of merit are considered: run-time of the algorithm and solution quality (based on the objective function; in this case hardware cost which was defined in Fig. 3). All run-times were measured on a DECstation 3100 using the UNIX utility time.

The real test cases were adapted from a commercial automotive electronic application [2]. The synthetic test cases are a mixture of random and hand-generated SDFGs which, taken as a whole, span a large portion of the design space.

As the figure indicates, all of the test cases assume a six-element task demand vector: (X-PUT, ROM, RAM, DIO, AIO, PIO). The elements of this vector correspond to the task’s demand for processor throughput, ROM, RAM, digital I/O channels, analog I/O channels and pulse I/O channels, respectively. Likewise, a six-element capacity vector is assumed for the processors over the same dimensions. When each processor is specified, the elements of the capacity vector are set to legal values for each particular dimension. The sets of legal values are dictated by the characteristics of the processors (or IC fabrication technology) used in the design. The values shown in the figure represent the average demand for a SDFG node (i.e. task) given as a percentage of the minimum element of the set of legal values.

Normalized results of the five algorithms applied to the sixteen test cases are plotted as scatter diagrams in Figure 9. There is a separate diagram for each algorithm, and each point represents the normalized results for a specific test case. The quadrants of the scatter diagrams can be interpreted as indicated by the key in the figure.

In Figure 10, the point clusters from the scatter diagrams are plotted on the same coordinate system. Each point cluster is represented by a box that is centered at the cluster’s centroid and that has a size equal to its standard deviation in each dimension. Figure 10 quantifies the trade-off between solution quality and run-time for the five algorithms. As the results indicate, run-times vary by several orders of magnitude while solution quality varies by less than a factor of five. The search intensive algorithms (SA and TREE) typically return higher quality solutions but require longer run-times. Conversely, the simple greedy algorithm SW returns the lowest quality solutions with the fastest run-times. The more sophisticated heuristic approaches, DA and CP*, represent an effective compromise. By exploiting domain-specific knowledge, they return solutions with quality that rivals the search techniques while still enjoying over an order of magnitude improvement in run-time.

The algorithms that have been presented offer a powerful set of tools for automating the design of embedded computer systems. However, choosing the appropriate algorithm is a domain-specific issue. As a general rule of thumb, the heuristic approaches (i.e. SW, DA, CP*) are most appropriate if a fast solution is needed for a large (i.e. one thousand or more tasks) design problem. Furthermore, for most real design problems, DA or CP* is a better choice than SW. Conversely, the search-intensive approaches (i.e. TREE and SA) are appropriate for highly constrained, cost-sensitive applications. In this situation, SA is generally preferred over TREE. Note, however, that the difference in solution quality between SA and CP* is marginal even for tightly-constrained applications- while the run-time difference is still around an order of magnitude. Thus, it is typically difficult to justify the use of a search technique such as SA over an effective heuristic approach like CP*, except for the most highly constrained problems.

5. Summary

Processor specification and task allocation are important steps in the design of embedded multicomputer systems. Two unique representations have been proposed to model the problem. The first is a packing-based representation that transforms task allocation into a generalization of vector packing. The second is a partitioning-based representation, which is derived from a variation of graph partitioning. Based on these representations, five solution techniques were presented, described and benchmarked on a suite of sixteen real and synthetic test cases with respect to two figures of merit: solution quality and run-time. The results indicate that the algorithms exhibit a trade-off between solution quality and run-time, and the magnitude of this trade-off was quantified. Taken as a whole, the five algorithms represent a powerful set of techniques that are available to aid the system designer. Although choosing the appropriate tool is still a domain-specific issue, the quantitative comparison presented in this article empowers the designer to make informed choices.

6. References

[1] W. Athas and C. Seitz. Multicomputers: Message-Passing Concurrent Computers. Computer, 21(8):9-23, Aug. 1988.

[2] J. Beck. Characterization of an automotive powertrain control application. Tech. Report, Delco Electronics Corp., May 1994.

[3] J. Beck and D. Siewiorek. Automated Processor Specification and Task Allocation Methods for Embedded Multicomputer Systems. Tech. Report EDRC-18-33-95, Carnegie Mellon Univ. Aug. 1995.

[4] M. Garey and D. Johnson. Computers and Intractability: A guide to the theory of NP-completeness. W. H. Freeman, 1979.

[5] K. Hwang and F. Briggs. Computer Architecture and Parallel Processing. McGraw Hill, 1984.

[6] B. Kernighan and S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. The Bell System Technical Journal, 49(2):291-307, Feb. 1970.

[7] A. Kundig. A Note on the Meaning of Embedded Systems. In A. Kundig, R. Buhrer and J. Dahler, editors, Lecture Notes in Computer Science, Vol. 284: Embedded Systems, pp. 1-5. Springer-Verlag (New York), 1986.

[8] S. Kirkpatrick, C. Gelatt and M. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671-680, May 1983.

[9] E. Lee and and D. Messerschmitt. Synchronous Data Flow. In Proc. of the IEEE (75)9:1235-1245, Sep. 1987.

[10] T. Saponas. A Real-Time Distributed Processing System. In Proc. of the IEEE Real-Time Systems Symposium, pp. 36-43, 1986.

[11] A. Smailagic and D. Siewiorek. A Case Study in Embedded-System Design: The VuMan 2 Wearable Computer. IEEE Design and Test of Computers, 10(3):56-67, Sep. 1993.

[12] W. Wolf and E. Frey. Tutorial on Embedded System Design. In Proc. of the IEEE Int. Conf. on Computer Design, pp. 18-21, 1992.

Jim Beck
Delco Electronics Corporation
One Corporate Center
MS CT-200W
Kokomo, IN 46904
jebeck@kocrsv01.delcoelect.com

Daniel P. Siewiorek
Carnegie Mellon
Pittsburgh, PA 15213-3890
dps@a.gp.cs.cmu.edu



The RASSP Digest - Vol. 2, 4th. Qtr. 1995 newsletter/html/95q4/news_7.html