How can pipelining optimize system performance?

In-order designs such as the Alpha and Intel Atom processors have been implemented to achieve peak performance. However, combining peak capacity with increases in clock frequency requires increasingly complex designs. Out-of-order execution: in an in-order pipeline, data dependencies and latencies in the functional units can reduce processor performance.

To overcome this issue, an out-of-order (OOO) method is traditionally used to increase the efficiency of pipelined processors by maximizing the number of instructions issued in every cycle [24].

However, this technique is very costly to implement. In this method, instructions are fetched in a compiler-generated order, and an instruction is executed in the pipeline as soon as it does not depend on a currently executing instruction.

The instructions are dynamically scheduled, and instruction completion may be in order or out of order. This allows sequential instructions that would normally be stalled by certain dependencies to execute non-sequentially (out-of-order execution).

This algorithm also uses a common data bus (CDB) on which computed values are broadcast to all the reservation stations that may need them. This allows improved parallel execution of instructions that would otherwise stall. The Tomasulo algorithm was chosen for comparison because its order of instruction execution is nearly equivalent to that of our proposed algorithm; both algorithms are scheduled statically at the micro-architecture level.
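
As a rough illustration of how a CDB broadcast wakes up waiting reservation stations, the following minimal C sketch models each station as holding, per operand, either a ready value or the tag of the station that will produce it. This is an illustrative toy (the station count, field names, and the cdb_broadcast helper are our own assumptions), not code from the paper or from any real design.

#include <stdio.h>
#include <stdbool.h>

#define NUM_STATIONS 4
#define NO_TAG (-1)

/* One reservation station: each operand is either a ready value
 * or a tag naming the station that will eventually produce it.  */
typedef struct {
    bool busy;
    int  q1, q2;      /* producing-station tags (NO_TAG when the value is ready) */
    int  v1, v2;      /* operand values, valid once the matching tag is cleared  */
} rs_t;

static rs_t rs[NUM_STATIONS];

/* Broadcast a result on the common data bus: every station waiting on
 * `tag` captures the value and marks that operand as ready.             */
static void cdb_broadcast(int tag, int value)
{
    for (int i = 0; i < NUM_STATIONS; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].q1 == tag) { rs[i].v1 = value; rs[i].q1 = NO_TAG; }
        if (rs[i].q2 == tag) { rs[i].v2 = value; rs[i].q2 = NO_TAG; }
    }
}

/* A station may issue to its functional unit once both operands are ready. */
static bool ready_to_execute(const rs_t *r)
{
    return r->busy && r->q1 == NO_TAG && r->q2 == NO_TAG;
}

int main(void)
{
    /* Station 1 waits on the result produced by station 0. */
    rs[0] = (rs_t){ .busy = true, .q1 = NO_TAG, .q2 = NO_TAG, .v1 = 2, .v2 = 3 };
    rs[1] = (rs_t){ .busy = true, .q1 = 0,      .q2 = NO_TAG, .v2 = 5 };

    printf("station 1 ready before broadcast: %d\n", ready_to_execute(&rs[1]));
    cdb_broadcast(0, rs[0].v1 + rs[0].v2);   /* station 0 finishes; result = 5 */
    printf("station 1 ready after broadcast:  %d\n", ready_to_execute(&rs[1]));
    return 0;
}

Once the broadcast clears a station's outstanding tags, that station can issue to its functional unit without waiting for the value to pass through the register file, which is what enables the parallel execution noted above.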

Proposed algorithm LR (Left-Right): The above-mentioned methods and algorithms have their own merits and demerits when executing instructions in a pipelined processor.

Instead of using other methods to reduce power consumption, we propose an algorithm that performs stall reduction in a Left-Right (LR) manner during sequential instruction execution, as shown in Figure 1. Our algorithm introduces a hybrid order of instruction execution in order to reduce power dissipation. More precisely, it executes instructions serially, in order, until a stall condition is encountered, and thereafter it uses the concept of out-of-order execution to replace the stall with an independent instruction.
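
The following self-contained C sketch illustrates this hybrid selection step: the scheduler walks the instruction buffer in program order and, whenever the oldest unissued instruction would stall on a result still in flight, it scans left to right for the first independent instruction and issues that one instead. The window size, register count, toy program, retirement model, and the would_stall check are all illustrative assumptions rather than the paper's Algorithm 1 or 2.

#include <stdio.h>
#include <stdbool.h>

#define WINDOW   6    /* instructions visible in the buffer           */
#define NUM_REGS 16   /* architectural registers in this toy model    */

/* Assumed three-operand "op-code source destination" form
 * (two source registers shown for illustration).                     */
typedef struct {
    const char *op;
    int src1, src2;   /* source registers      */
    int dst;          /* destination register  */
} instr_t;

/* A register is "pending" while an in-flight instruction will write it. */
static bool pending[NUM_REGS];

/* An instruction stalls if it reads, or will overwrite, a register
 * whose result is still in flight.                                     */
static bool would_stall(const instr_t *in)
{
    return pending[in->src1] || pending[in->src2] || pending[in->dst];
}

int main(void)
{
    /* Toy program: instruction 1 depends on instruction 0,
     * instruction 2 is independent of both.                            */
    instr_t buf[WINDOW] = {
        { "MUL", 1, 2, 3 },   /* long-latency producer   */
        { "ADD", 3, 4, 5 },   /* RAW-dependent on r3     */
        { "SUB", 6, 7, 8 },   /* independent             */
        { "ADD", 5, 8, 9 },
        { "MUL", 9, 1, 10 },
        { "ADD", 2, 6, 11 },
    };
    bool issued[WINDOW] = { false };

    int cycle = 0, done = 0;
    while (done < WINDOW) {
        /* LR selection: prefer the oldest unissued instruction, but
         * skip left-to-right past any instruction that would stall.    */
        int pick = -1;
        for (int i = 0; i < WINDOW; i++) {
            if (!issued[i] && !would_stall(&buf[i])) { pick = i; break; }
        }
        if (pick < 0) {
            printf("cycle %d: bubble (no independent instruction ready)\n", cycle);
            /* toy retirement: assume all in-flight results complete during the bubble */
            for (int r = 0; r < NUM_REGS; r++) pending[r] = false;
        } else {
            issued[pick] = true;
            pending[buf[pick].dst] = true;   /* result is now in flight */
            printf("cycle %d: issue %-3s r%d, r%d -> r%d\n",
                   cycle, buf[pick].op, buf[pick].src1, buf[pick].src2, buf[pick].dst);
            done++;
        }
        cycle++;
    }
    return 0;
}

In the proposed scheme this reordering is fixed statically at compile time, so the resulting sequence can simply be consumed by the pipeline at run time.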

Thus, LR increases throughput by executing independent instructions while lengthy instructions are still executing in other functional units or the required registers are involved in an ongoing operation. LR also prevents the hazards that might occur during instruction execution. The instructions are scheduled statically at compile time, as shown in Figure 2.

In our proposed approach, given a buffer that can hold a certain number of sequential instructions, our algorithm generates the sequence in which the instructions should be executed so as to reduce the number of stalls while maximizing processor throughput. It is assumed that all instructions are in op-code source destination format.

In this section, the performance and power gains of the LR and Tomasulo algorithms are compared. As our baseline configuration, we use an Intel Core i5 dual-core processor with 2.

We also use the Sim-Panalyzer simulator [25]. The LR, in-order, and Tomasulo algorithms are developed as C programs, which were compiled with arm-linux-gcc to obtain object files for each of them on an ARM microprocessor model. At the early stages of processor design, simulators at various levels can be used to estimate power and performance, such as transistor-level, system-level, instruction-level, and micro-architecture-level simulators.

Transistor-level simulators estimate voltage and current behaviour over time; they are used for integrated-circuit design and are not suitable for large programs. Micro-architecture-level simulators, on the other hand, provide power estimates across cycles and are used for modern processors. Our work relies on this kind of simulator because our objective is to evaluate the power-performance behaviour of a micro-architecture-level design abstraction.

Although a literature survey suggests several power-estimation tools such as CACTI and Wattch [26], we chose Sim-Panalyzer [25] because it provides accurate power modelling by taking into account both leakage and dynamic power dissipation.
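
For orientation, the first-order CMOS power model that such tools approximate separates these two contributions as follows; this is the textbook formulation, not Sim-Panalyzer's exact internal model:

P_{total} = P_{dynamic} + P_{leakage}, \quad P_{dynamic} = \alpha\, C_{eff}\, V_{dd}^{2}\, f, \quad P_{leakage} = V_{dd}\, I_{leak},

where \alpha is the switching-activity factor, C_{eff} the effective switched capacitance, V_{dd} the supply voltage, f the clock frequency, and I_{leak} the leakage current.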

The actual instruction execution of our proposed algorithm compared against the existing ones is shown in Algorithms 1 and 2. In the LR algorithm, instructions execute serially, in order, until a stall occurs, and thereafter the out-of-order execution technique comes into play to replace the stall with an independent instruction. Therefore, in most cases, our proposed algorithm takes fewer cycles of operation and less cycle time than the existing algorithms, as shown in Algorithm 2.

The comparison of our proposed algorithm against the Tomasulo and in-order algorithms is shown in Table 1. The next section focuses on the power-performance efficiency of our proposed algorithm.

In this setting, instructions per cycle (IPC) is one performance metric that can be considered, and it is well known [27] that an increase in IPC generally indicates good system performance. However, the use of IPC to analyze system performance is contested, at least for multi-threaded workloads running on multiprocessors.

In [27], it was reported that work-related metrics (e.g., time per program) are the preferable measures. We have also shown that our algorithm produces a lower IPC than the Tomasulo algorithm (see Figure 3b). According to this result, work-related metrics such as time per program and time per workload are the more accurate and reliable ways to measure system performance.

The time per program is calculated as shown in Eq. We use time per program and IPC as our performance metrics. (Figure: performance comparison of LR and in-order vs. the Tomasulo algorithm.) We executed our programs on the same machine, so the clock period is the same in all cases.
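
The referenced equation is not reproduced in this excerpt; presumably it is the standard relation, restated here:

T_{program} = \text{Instruction count} \times CPI \times \text{Clock period} = \frac{\text{Instruction count} \times \text{Clock period}}{IPC}.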

Hence, by applying this equation to both algorithms, their times per program can be compared directly. In simulation-based power evaluation methods, the system is modelled as a set of components such as the ALU, the level-1 instruction and data caches, the internal register file (irf), and the clock. The energy consumption of a program is estimated as the sum over all these components, as shown in Table 3, and the mean power dissipation results from Sim-Panalyzer for the same experiment are shown in Table 4.
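
In equation form, this component-wise estimate amounts to the following, where \bar{P}_{i} is the mean power of component i reported by Sim-Panalyzer and t_{exec} is the program's execution time (our summary of the stated sum):

E_{program} \approx \sum_{i \in \{ALU,\; il1,\; dl1,\; irf,\; clock\}} \bar{P}_{i} \cdot t_{exec}.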

To analyse the efficiency of our proposed algorithm, we simulated both algorithms on Sim-Panalyzer and obtained the average power dissipation of the ALU, the level-1 instruction (il1) and data (dl1) caches, the internal register file (irf), and the clock.

We plot the energy consumption of the different components for both algorithms in Figure 4 (comparison of the power dissipation of LR and in-order vs. the Tomasulo algorithm). In terms of ALU power dissipation, there is not much improvement in power-performance. However, comparing the results for dl1 and il1 shows a significant difference in power dissipation in the level-1 data and instruction caches.

In both dl1 and il1, the average switching power dissipation resp. Also, the power dissipation generated by LR is 2. From Eq. In terms of overall power-performance benefits, our proposed LR algorithm outperforms the Tomasulo algorithm. In our experiment, it was also observed that the fraction of clock power dissipation is almost the same for both algorithms.

This significant increase in clock power in Sim-Panalyzer is mostly due to the fact that it depends on the dynamic power consumption. With this simulator, we were able to obtain the power-performance of the various components considered and to compare our results. These results simply state that LR requires less computation than Tomasulo to order the instructions, and they indicate that cache usage and the cache hit ratio are better with LR than with Tomasulo.

Hence, we can deduce that the cache used for holding instructions performs better under LR than under Tomasulo. A slight power-performance improvement of LR over the in-order approach is also achieved. We have presented a stall-reduction algorithm for optimizing power-performance in pipelined processors.


