## Chapter 05 Embedded heterogeneous programming with OpenCL

Copyright © 2016 Elsevier Inc. All rights reserved.

| Cycle  | 0          | 1          | 2           | 3           | 4          | 5          | 6           | 7           |
|--------|------------|------------|-------------|-------------|------------|------------|-------------|-------------|
| Lane 0 | it 0, ln 4 | it 4, ln 4 | it 8, ln 4  | it 12, ln 4 | it 0, ln 5 | it 4, ln 5 | it 8, ln 5  | it 12, ln 5 |
| Lane 1 | it 1, ln 4 | it 5, ln 4 | it 9, ln 4  | it 13, ln 4 | it 1, ln 5 | it 5, ln 5 | it 9, ln 5  | it 13, ln 5 |
| Lane 2 | it 2, ln 4 | it 6, ln 4 | it 10, ln 4 | it 14, ln 4 | it 2, ln 5 | it 6, ln 5 | it 10, ln 5 | it 14, ln 5 |
| Lane 3 | it 3, ln 4 | it 7, ln 4 | it 11, ln 4 | it 15, ln 4 | it 3, ln 5 | it 7, ln 5 | it 11, ln 5 | it 15, ln 5 |

Dependency: it 0, ln 4 (cycle 0) → it 0, ln 5 (cycle 4).

**FIGURE 5.1** Pipelined SIMD execution. "it" stands for "iteration number" and "ln" stands for "line number." The four-cycle latency between lines 4 and 5 of iteration 0 is hidden by instructions from other iterations.
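The schedule in Figure 5.1 can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not code from the chapter: it assumes a 4-lane SIMD unit, a 4-cycle latency, and 16 iterations in flight (lanes × latency), as the figure suggests. The names `schedule` and `issue_cycle` are hypothetical and chosen only for illustration.

```python
# Illustrative model of the schedule in Figure 5.1.
# Assumptions (taken from the figure, not from any vendor documentation):
LANES = 4        # SIMD width
LATENCY = 4      # cycles before a result can be consumed by the next line
FIRST_LINE = 4   # the figure starts at kernel line 4


def schedule(cycle, lane, lanes=LANES, latency=LATENCY, first_line=FIRST_LINE):
    """Return the (iteration, line) pair issued on `lane` during `cycle`."""
    slot = cycle % latency                 # which group of iterations issues now
    iteration = slot * lanes + lane        # consecutive iterations fill the lanes
    line = first_line + cycle // latency   # move to the next line after all groups
    return iteration, line


def issue_cycle(iteration, line, lanes=LANES, latency=LATENCY, first_line=FIRST_LINE):
    """Return the cycle in which `line` of `iteration` is issued."""
    slot = iteration // lanes
    return (line - first_line) * latency + slot


if __name__ == "__main__":
    # Print one row per lane, mirroring the layout of Figure 5.1.
    for lane in range(LANES):
        cells = ["it %d, ln %d" % schedule(c, lane) for c in range(8)]
        print("Lane %d | %s" % (lane, " | ".join(cells)))
```

In this model, line 5 of any iteration issues exactly `LATENCY` cycles after line 4 of the same iteration, and the cycles in between are filled with line 4 of the other in-flight iterations. That is the latency hiding the caption describes.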

ARM embedded system programming



**FIGURE 5.2** Hierarchical workload distribution for the example kernel.