In Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
Will defend his PhD dissertation
High-performance embedded computing is characterized by data-intensive loop kernels and abundant parallelism. As the computational requirements and complexity of such systems continue to increase, technologies are adapted from the general-purpose and scientific computing domains while still meeting strict embedded system cost, performance, and power constraints. To meet this challenge, embedded processor designers rely extensively on architectural parallelism to increase system performance and on compact instruction encodings to reduce program code size. Instruction-level parallelism (ILP) is a combination of compilation techniques and architectural features that exploit the fine-grain parallelism present at the machine instruction level. Very long instruction word (VLIW) processors are designed to exploit ILP and have multiple functional units partitioned into clusters with local register files. VLIW processors require optimizing compilers that statically schedule resources before program execution. Due to limits in the scalability of single-processor systems and improvements in the transistor density of integrated circuits, multiple VLIW processors are now connected together with specialized accelerators on single-chip systems. This situation creates new challenges and opportunities for compilers to exploit both coarse-grain thread-level parallelism (TLP) across multiple processors and fine-grain ILP within each processor.
In this dissertation, we first present novel, complementary compiler and architecture technologies that improve the power and performance efficiency of an existing embedded VLIW processor. This is accomplished by reducing program code size and improving the performance of software-pipelined loops. We then present a model for prototyping and programming tightly coupled accelerators. Finally, we show the initial results of implementing OpenMP for an embedded multicore VLIW processor. The experimental results show that the combination of architecture enhancements and compiler optimizations dramatically improves the efficiency of embedded application code.