The RASSP Digest - Vol. 3, September 1996


Automated Generation of VHDL Processor Models for Simulation and Synthesis

by Vijay K. Madisetti and Yong-Kyu Jung


Abstract

We describe a new process for automating the creation of full-behavioral and Instruction Set Architecture (ISA) models in VHDL for complex processors and components, and present results from applying it to a PowerPC 601. We also discuss the advantages of this approach and its impact on the hardware/software codesign and system prototyping processes.

1. Introduction

The Rapid Prototyping of Application-Specific Signal Processors (RASSP) project of the US Department of Defense (ARPA and Tri-Services) targets a 4X improvement in the design, prototyping, manufacturing, and support processes relative to current practice. Under current practice (circa 1993), the prototyping time for multiboard signal processors, from system requirements definition to production and deployment, is between 37 and 73 months [8]. Of this time, 25-49 months are devoted to detailed hardware/software (HW/SW) design and integration, with 10-24 months devoted to integration alone. With a promising top-down, hardware-less codesign methodology based on full-behavioral VHDL models of HW/SW components, it appears feasible to reduce the HW/SW integration time to a few weeks (1-2 months) [10]. A potential show-stopper is the limited availability of high-quality VHDL behavioral models of components (timing and function). In addition, building a single model of a complex RISC processor (such as the i860XP) takes approximately a person-year. We describe a mechanism by which full-behavioral models of complex components can be generated automatically in VHDL from published information available in data manuals. This method could also be used in the iterative design synthesis of custom pipelined processors for domain- or application-specific use.

2. HW/SW Codesign Practice

The Educator/Facilitator current-practice (1993) model for signal processor design is presented in detail in [8] in these proceedings. The stages of a “waterfall”-type process flow are demarcated, together with time ranges (min, max) for each stage. The time lines have also been validated through communications with industrial entities involved in large system design and implementation. In this paper, we focus on the specific tasks of hardware (HW), software (SW), and interface design and their eventual integration.

2.1 Whither True Codesign?

True HW/SW codesign allows both hardware and software to be designed within a common framework and simulated together before fabrication. Current practice attempts to automate this process via HW/SW/interface partitioning followed by three individual paths to HW, SW, and interface design and implementation, respectively (as shown in Figure 1). A drawback of this approach is that software can be designed and tested only if a hardware platform (at board and rack levels) is available. Building such a platform is time-consuming and costly, even if it uses FPGA technology or HW modelers. It must be understood that the software is not just application-specific software, but also control, diagnostic, and test software. Often, control, diagnostic, and test software requires an order of magnitude more person-hours than application software does [8]. Conventional hardware/software codesign methods show only token interest in the software required for control, diagnostic, and test purposes, and attempt to catch all integration issues under the term “interface.” The approach shown in Figure 1(ii) represents “true” HW/SW codesign, wherein software models of the HW (in an HDL such as VHDL) are provided to the SW developers, and the entire software is designed, tested, and integrated with the HW models long before any hardware is fabricated or manufactured. Thus, the design loops L1 and L2 are quick and incur no hardware fabrication or engineering cost, and they also make complete system design possible through a process known as virtual prototyping [1,5,10].

2.2 Showstoppers

The assumption, of course, is that libraries of full-behavioral HW models in SW are accurate, available, and interoperable, and that simulation times can be kept manageable. VHDL can be used to advantage in this true HW/SW codesign philosophy, one that embraces hardware-less system design. Recent experience with hardware-less HW/SW codesign has shown that it is efficient, often reducing the time for HW/SW integration to a matter of weeks, and that it also allows rapid upgrades, together with savings in cost [9]. Once virtual prototyping is completed, the pathway through which a field prototype can be manufactured, supported, and upgraded is expected to be straightforward.

3. Models for HW/SW Codesign

Several classes of models have been found suitable for HW/SW codesign. When emphasizing HW/SW integration, two classes have been found particularly useful: ISA models and full-behavioral models. We use the RASSP taxonomy [6] to define these two classes.

(1) Instruction Set Architecture (ISA) Models: An ISA model describes the function of the complete instruction set recognized by a given programmable processor, along with (and as operating on) the externally known register set and memory/input-output space. An ISA model will execute any machine program for that processor and give exactly the same results as that processor (i.e., it is bit-true), as long as the initial states are the same for both the simulation and the real system. Port registers, if modeled, are also bit-true. Instructions may span multiple clock cycles, and ISA models need not contain any internal structural implementation information.

(2) Full-Behavioral Models (FBMs): A full-behavioral model (also known as a full-functional model [10]) is a processor model that exhibits all documented timing and functionality of the modeled component, without specifying internal structural implementation details. Thus, the full-behavioral model is more detailed than the ISA model in that it includes clock-edge timing information in addition to functionality. A number of full-behavioral processor models are available from Georgia Tech’s RASSP Techbase effort [5]. The creation of ISA models and FBMs is examined next.
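
As a minimal illustration of the distinction between the two classes (it is not part of the RASSP taxonomy or of the AMG), the Python sketch below runs the same toy program through an ISA-style model, which yields only bit-true register contents, and through a variant that also records a running clock-cycle count. The three-instruction set, its latencies, and all names are hypothetical, and the timing is deliberately unpipelined; a real full-behavioral model would also reproduce the documented clock-edge behavior at the component's ports.

```python
MASK32 = 0xFFFFFFFF  # registers are modeled as 32-bit, so results wrap as in hardware

# Hypothetical instruction set: mnemonic -> (function, latency in clock cycles).
# The latencies are illustrative values, not taken from any data manual.
INSTRUCTIONS = {
    "add": (lambda a, b: (a + b) & MASK32, 1),
    "sub": (lambda a, b: (a - b) & MASK32, 1),
    "mul": (lambda a, b: (a * b) & MASK32, 5),
}


def run_isa(program, regs):
    """ISA-level view: bit-true register results, no timing information."""
    regs = list(regs)
    for op, rd, ra, rb in program:
        func, _latency = INSTRUCTIONS[op]
        regs[rd] = func(regs[ra], regs[rb])
    return regs


def run_with_timing(program, regs):
    """Same function as run_isa, plus the cycle at which each instruction completes."""
    regs = list(regs)
    cycle = 0
    trace = []
    for op, rd, ra, rb in program:
        func, latency = INSTRUCTIONS[op]
        regs[rd] = func(regs[ra], regs[rb])
        cycle += latency  # unpipelined: each instruction runs to completion
        trace.append((op, cycle))
    return regs, trace


# Each tuple is (mnemonic, destination, source A, source B) as register indices.
program = [("add", 2, 0, 1), ("mul", 3, 2, 2), ("sub", 4, 3, 0)]
initial = [7, 0xFFFFFFFF, 0, 0, 0]

final_regs = run_isa(program, initial)
timed_regs, trace = run_with_timing(program, initial)
```

Both functions return the same register contents when started from the same initial state; only the second says when each result becomes available.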

3.1 Populating VHDL Model Libraries

While complete or incomplete gate-level VHDL models are sometimes available from vendors, and are accurate enough for use as ISA models and FBMs, they have a number of limitations: (1) gate-level models are very slow to simulate, (2) they reveal confidential component design (intellectual property) information, and (3) HW/SW codesign assumes that the hardware component is continuously being designed (e.g., changing instruction set, optimized behavior, etc.), so a gate-level description does not yet exist. Thus, the focus is on the creation of behavioral models of complex parts. Commercial instruction-set simulators (ISSs), which can provide debug information for processors, have limited applicability within a VHDL-based environment (without wrappers and a loss in efficiency), where multiple models at varying levels of detail are co-simulated during the top-down design process. In addition, they do not allow redesign of the hardware component, a trend that is increasingly finding favor in application-specific markets (e.g., use of core-based functional design of DSP ASICs [2]).

The current approach to model development is best described in [3,4,10]. All these approaches model the internal and external microarchitecture of the component behaviorally from manufacturer-supplied data (or via abstraction to higher levels of functional and timing information from gate-level descriptions). This is a manual, time-consuming (person-years), and error-prone (hence verification-intensive) operation, and it is often equivalent to designing the component all over again. While we have used this approach, and continue to use it in developing ISA models and FBMs, an investigation into automated generation of these models was long overdue.

3.2 A New Approach - Autogeneration

An alternative, automated approach to developing ISA models and FBMs is shown in Figure 2. The processor being modeled (or designed) is described by parametrized, generalized time-stationary [2] pipelines (single or multiple), associated memories/registers, and a generalized controller. The user-defined or vendor-supplied information on the instruction set and architectural constraints (hazards, timing) is captured in processor-specific input data files. These parametric input data files are then automatically converted to lookup tables (LUTs). The LUTs are used by the automated model generator (AMG, described in Section 4) to generate the control (timing) and functional information from the input application instruction stream. We have used this approach to synthesize behavioral models of the PowerPC 601 RISC processor; an implementation is described in the next section.
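
The conversion of parametric input data into lookup tables can be pictured with a short Python sketch. It is only a sketch under stated assumptions: the actual AMG is written in VHDL, and the table layout, the column names, and the function build_luts are hypothetical. The rows are illustrative, except that the 5-cycle multiply latency matches the figure quoted in Section 4.4.

```python
# Hypothetical table layout, one instruction per line:
#   mnemonic  primary_opcode  extended_opcode  latency_cycles
# A "-" in the extended-opcode column means the instruction has none.
TABLE_TEXT = """
# mnemonic opcode ext_opcode latency
add   31 266 1
mul   31 107 5
addi  14 -   1
"""


def build_luts(text):
    """Convert table rows into an opcode LUT and an execute (latency) LUT."""
    oplut = {}  # (primary opcode, extended opcode or None) -> mnemonic
    exlut = {}  # mnemonic -> latency in clock cycles
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        mnemonic, opcode, ext, latency = line.split()
        key = (int(opcode), None if ext == "-" else int(ext))
        if key in oplut:
            raise ValueError(f"duplicate opcode entry {key}")
        oplut[key] = mnemonic
        exlut[mnemonic] = int(latency)
    return oplut, exlut


oplut, exlut = build_luts(TABLE_TEXT)
```

Keeping the processor-specific facts in data, rather than in the model code, is what allows one generator to be reused across processors: only the table changes.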

3.3 A New Approach - Iterative Synthesis

Figure 3 shows the process flow for automating the iterative synthesis of application-specific processors. Here the instruction set of a programmable processor can itself be customized and iteratively designed during the HW/SW codesign process. The application drives the iterative codesign of the instruction set and architecture, which the AMG captures from input data files as LUTs. The controller, pipeline, and associated logic of the AMG are then simulated to measure performance on the target application. After optimization of the instruction set and timing, the AMG may be synthesized using commercial RTL or behavioral synthesis tools. Application-specific functional libraries can also be used to advantage when combined with VHDL and the emerging VITAL standards for sign-off quality timing simulation. Future papers will discuss and document the approach of Figure 3.
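
The evaluate-and-compare step of such an iterative flow can be sketched in a few lines of Python. This is a deliberately crude stand-in for simulating the AMG: the instruction mix, the candidate latency tables, and the cost model (count times latency, with no overlap or hazards) are all hypothetical and serve only to show the shape of the loop.

```python
# Hypothetical dynamic instruction counts for one target application.
APPLICATION_MIX = {"mul": 200, "add": 500, "load": 300}

# Hypothetical candidate designs: mnemonic -> latency in clock cycles.
CANDIDATES = {
    "baseline": {"mul": 5, "add": 1, "load": 2},
    "fast_mul": {"mul": 3, "add": 1, "load": 2},
    "fast_load": {"mul": 5, "add": 1, "load": 1},
}


def estimate_cycles(mix, latencies):
    """Crude cost: sum of count * latency, ignoring pipeline overlap and hazards."""
    return sum(count * latencies[op] for op, count in mix.items())


def pick_best(mix, candidates):
    """Return (name, cycles) for the candidate with the lowest estimate."""
    scores = {name: estimate_cycles(mix, lat) for name, lat in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


best_name, best_cycles = pick_best(APPLICATION_MIX, CANDIDATES)
```

In the flow of Figure 3, the estimate would instead come from simulating the generated model on the application, and the winning table would feed the next iteration.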

4. Automated Model Generator - AMG

The automated model generator (AMG) is an ISA model or FBM that accepts the application instruction stream and processor-specific data in the form of input tables, which are processed internally to provide all documented functional and timing characteristics as output files. The same AMG can be reused to create models of multiple versions of the same chip, or of independent families of processors.

4.1 Structure of the AMG

The automated ISA model generator consists of six major “blocks,” described below (see Figure 4).

  1. Pipeline: A single pipeline for a RISC processor consists of the following six stages — (1) Instruction Fetch (IF), (2) Instruction Dispatch (IDP), (3) Instruction Decode (ID), (4) Instruction Execute (IE), (5) Cache Access (CA), and (6) Write back (WB). These stages were implemented as procedures within a VHDL process description of the pipeline.
  2. Memory Block (MB): The MB consists of an instruction queue (IQ), instruction and data memories (IM & DM), and a cache (CACHE).
  3. Data Register Block (DRB): The DRB consists of a number of register arrays (DER, ECR, CWR), including a general purpose register (GPR) to allow storage for resolution of pipeline data hazards. A number of 32-by-32 bit data register arrays are also reserved for the user.
  4. Control Register Block (CRB): The CRB consists of register arrays (CR, HIR (hazard information registers), SCR (system control registers), and HDR (hazard destination registers)) that control various stages of the pipeline.
  5. System Generating Logic (SGL) Block: The SGL converts specific input data (i.e., in the form of tables.dat) from the manufacturer or instruction-set designer into lookup tables (LUTs) that can be used by the AMG. Thus, information about differing processors can be converted to a standardized internal representation that the AMG can then use in generating instruction function and timing. The six automatically generated internal LUTs are: an opcode lookup table (OPLUT) containing opcodes and extended opcodes for user-defined instructions; a decode lookup table (DCLUT) containing information on the bit length of the opcode and other instruction fields; an execute lookup table (EXLUT) that stores the execution latencies and the identification that maps every instruction to an executable location in the IE; a hazard lookup table (HLUT) containing information on data hazards of registers and memory; an extended opcode lookup table (EOLUT) consisting of data related to extended opcodes; and a system generation lookup table (SGLUT) that is used by the SGL. To reiterate, the SGL creates these LUTs automatically from manufacturer- or designer-supplied processor or instruction-set information.
  6. Stage Buffer Block (SBB): The SBB consists of buffers for stages of the pipeline (e.g., IFB, IDB, WBB, IWB, etc).

4.2 Operation of the AMG

The AMG operates in the following steps:

  1. Step 1 (Fetch and Dispatch): An instruction is fetched from the IM and brought to the IQ and CA in the pipeline. The IF fetches the instruction from the IQ and stores it in the IFB. The IDP then initiates the dispatch of the instruction from the IFB and translates it into the IDB. To decode this instruction, the opcode or the extended opcode is first extracted from it. The instruction type is then decoded from the information available in the lookup table OPLUT.
  2. Step 2 (Decode): The ID then obtains the extended opcode information from the EOLUT and the instruction format from the DCLUT, using the decoded instruction type as a key. In the final step at the ID, the instruction is disassembled, the information disseminated, and valid instruction fields are stored in the IEB. After completing the decode operation, the ID checks the current instruction for data dependencies. The information stored in the DER is used for this check, and the result is propagated to the hazard registers, HIR and HDR, which operate in conjunction with the HLUT. Operands for the operation are put in the IE buffer (IEB).
  3. Step 3 (Execute): The IE begins operation if the IEB is nonempty. The IE updates the GPR and the DER, picks out the appropriate information from the EXLUT (i.e., instruction execution latencies, location of procedures, and requirements for cache access for executing the function or process), and then executes the procedure (the AMG currently supports up to 1024 user-defined operations). The result is then stored in the WBB or sent to the CA (if cache access is needed). The HIR and HDR are then updated to allow hazard resolution for the subsequent instructions in the pipeline.
  4. Step 4 (Cache Access and Write Back): The CA then reads/writes data from/to the cache in case of a cache hit, or the DM in case of a cache miss. The WB updates the CWR through the DER or ECR before writeback. If the instruction is processed by the CA, the CWR is updated by the ECR; otherwise, it is updated by the DER (resolving hazards between IE and CA). The result is then written to the GPR, and all hazard conditions caused by the current instruction are cleared. The WB also generates the output file with the necessary user-specified information on execution times and functional results required from the model.
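
The table-driven decode and the data-hazard check in Steps 1 to 3 can be illustrated with a small Python sketch. It is not the AMG's VHDL: the FORMAT table merely plays the role of a DCLUT entry, the set of pending destination registers stands in for the hazard registers, and all function names are hypothetical. The instruction word is the one quoted in Section 4.4, and the field positions assume the PowerPC register-to-register layout with bit 0 as the most significant bit.

```python
def field(word, start, length):
    """Extract `length` bits starting at bit `start` of a 32-bit word (bit 0 = MSB)."""
    shift = 32 - start - length
    return (word >> shift) & ((1 << length) - 1)


# Hypothetical decode-table entry: field name -> (start bit, length in bits).
FORMAT = {
    "opcode": (0, 6),   # primary opcode
    "rd": (6, 5),       # destination register
    "ra": (11, 5),      # first source register
    "rb": (16, 5),      # second source register
    "xo": (21, 10),     # extended opcode
}


def decode(word):
    """Split an instruction word into named fields using the format table."""
    return {name: field(word, start, length) for name, (start, length) in FORMAT.items()}


def has_data_hazard(decoded, pending_destinations):
    """True if a source register is still awaiting write-back by an earlier instruction."""
    return decoded["ra"] in pending_destinations or decoded["rb"] in pending_destinations


inst = decode(0x7C4118D6)
stalls_on_r3 = has_data_hazard(inst, pending_destinations={3})
stalls_on_r2 = has_data_hazard(inst, pending_destinations={2})
```

Here the word decodes to primary opcode 31 with extended opcode 107, destination register 2, and source registers 1 and 3, so a pending write to register 3 is a hazard while a pending write to register 2 is not.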

4.3 Implementation of the AMG

To test the approach, we first implemented the AMG in VHDL and successfully modeled a subset of the ISA of the PowerPC 601 with a single pipeline. More recently, the AMG has been generalized to model multiple concurrent pipelines and other processors (e.g., i860 and ADSP 21060).

In one of our PowerPC 601 variations of the AMG, which is fully operational, each memory within the MB was implemented as an array of 32-bit vectors (matching the instruction length). The IQ, IM, and DM are 64-by-32, 8K-by-32, and 20K-by-32 arrays, respectively. The SBB was implemented as four buffers. One of these is the IEB, a 256-integer variable buffer that maintains latency and executing-function information in the IE stage; the others each hold a bit-vector and one integer-type variable for maintaining the latencies of the other pipeline stages. Figure 5 summarizes the sizes of the other register and memory arrays used in our implementation. Note that the user of the AMG can tailor the pipeline to suit his/her implementation specifications, and can also use more than one pipeline within the AMG (e.g., the PowerPC 601 has three pipelines: integer, floating-point, and branch). The AMG is currently implemented in about 5K lines of uncommented VHDL source code.

4.4 Performance of the AMG - PowerPC 601

Figure 6 describes the performance of a PowerPC 601 model generated by the AMG. The input source code, described in Test Bench A, was input to the AMG. The AMG then generates the functional and timing behavior via output files (also shown in Figure 6) and via VHDL signals (displayed on a VHDL simulator spreadsheet in Figure 7). Tables 1 and 2 in Figure 6 describe the detailed, clock-cycle-resolved operations of the pipeline for the PowerPC 601. The exact timing for the completion of each instruction is also shown. In Figure 7, for instance, the multiply appears in the decode buffer as 7c4118d6 and has a latency of 5 clock cycles, which is successively decremented as shown on the signal INST.EXE.CYC.1. Typical instructions executed per second on the virtual model generated by an unoptimized AMG were on the order of 500-1000 for single pipelines, and fewer for multiple pipelines (10-200). For a 1000-instruction test bench, the execution times on a Sparc10 workstation were: multiple-pipeline AMG (242.95 sec), PowerPC with multiple pipelines (235.55 sec), single-pipeline AMG (18.0 sec), and PowerPC with single pipeline (1.45 sec). The time required to generate a model is limited only by the time required to enter the input.DAT tables from the manufacturer’s data sheets (or, in the case of iterative synthesis, from the designer), and was about a person-month for the PowerPC. The AMG consists of about 5K lines of VHDL source code and uses the Vantage VHDL Spreadsheet at Georgia Tech’s DSP Laboratory.
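
For reference, the four test-bench times quoted above can be converted to instructions per second with the short Python calculation below. The only inputs are the reported times and the 1000-instruction test-bench size; the variable and function names are hypothetical. These derived rates describe this one test bench only and are not necessarily comparable with the typical ranges quoted in the text.

```python
INSTRUCTION_COUNT = 1000  # size of the test bench reported above

# Reported execution times in seconds on the Sparc10 workstation.
RUN_TIMES = {
    "multiple-pipeline AMG": 242.95,
    "PowerPC with multiple pipelines": 235.55,
    "single-pipeline AMG": 18.0,
    "PowerPC with single pipeline": 1.45,
}


def instructions_per_second(count, seconds):
    """Average simulated instruction rate over one run."""
    return count / seconds


rates = {
    name: instructions_per_second(INSTRUCTION_COUNT, seconds)
    for name, seconds in RUN_TIMES.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} instructions/s")
```

Rounded to one decimal place, the rates are about 4.1, 4.2, 55.6, and 689.7 instructions per second, in the order listed.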

5. Summary and Conclusions

Models have been shown to be very useful in the system prototyping process, often reducing HW/SW design and integration costs by a factor of four or more. The contributions of this paper are as follows:

  1. A new method for automated generation of full-behavioral and ISA models for complex pipelined processors has been proposed. We believe that this is the first such proposal and its implementation.
  2. A new method for iterative synthesis, where the instruction-set of a processor can be customized to the application software, utilizing true hardware/software codesign is proposed.
  3. The proposed method for automated generation has been successfully demonstrated, using the PowerPC 601 as an example. Our results show that the speeds in instructions per second range between 500-1000 for single pipelines and 5-100 for multiple pipelines, and compare well to manually generated behavioral models. The time required for model development is, however, shorter, requiring a few person-months for an ISA model (without interface timing), as opposed to 1-3 person-years for the manual method of model generation.

Further optimization of the automated model generation process is an ongoing investigation.

Acknowledgements

Thanks to M. Rubeiz of Wright Patterson Labs (USAF) for carefully reviewing the manuscript.

References

[1] M. Richards, “The Rapid Prototyping of Application-Specific Signal Processors Program,” Proc. of First Annual RASSP Conference, August 1994.

[2] V. K. Madisetti, VLSI Digital Signal Processors, IEEE Press, Piscataway, NJ, May 1995.

[3] Z. Navabi, “Using VHDL for Modeling and Design of Processing Units,” Proc. of 5th Annual IEEE ASIC Conference and Exhibit, pp. 315-326, 1992.

[4] L. Maliniak, “Process Builds Accurate VLSI Behavioral Models,” Electronic Design, pp. 63-70, May 3, 1993.

[5] V. Madisetti, T. Egolf, S. Famorzadeh, L-R. Dung, “Virtual Prototyping of Embedded DSP Systems,” Proc. of IEEE ICASSP 95.

[6] C. Hein, T. Carpenter, P. Kalutkiewicz, V. Madisetti, “RASSP VHDL Modeling Terminology and Taxonomy - Revision 1.0,” Proc. of Second ARPA RASSP Conference, July 1995.

[7] C. Myers, R. Dreiling, “VHDL Modeling for Signal Processor Development,” Proc. of IEEE ICASSP 95.

[8] V. Madisetti, J. Corley, G. Shaw, “RASSP: Current Practice (1993) E&F Model and Challenges,” Proc. of ARPA Second RASSP Conference, July 1995.

[9] The RASSP Information Server - WWW URL http://rassp.scra.org/

[10] T. Egolf, V. Madisetti, S. Famorzadeh, P. Kalutkiewicz, “Experiences with VHDL Models of COTS RISC Processors in Virtual Prototyping for Complex System Synthesis,” Proceedings VHDL International Users’ Forum (VIUF), Spring 1995.

Vijay K. Madisetti and Yong-Kyu Jung
ECE
Georgia Tech.
Atlanta, GA 30332-0250
vkm@ee.gatech.edu

