"For high-performance CPU design, especially on 16 nm and more advanced process nodes, signoff has many corners. Increasing the length of common clock path, improving the consistency of clock delay at each RC end angle and reducing the local clock skew have become the consensus of digital back-end designers. The multi tap flexhtree structure clock tree scheme added by cadence innovus tool not only provides h-tree symmetrical clock buffer unit structure and equal line length, but also reduces the requirements for geometric symmetry and ensures that clock tree synthesis can be carried out after the timing units are placed. An automatic flexhtree implementation process is established to reduce clock skew under different corners. The number of flexhtree tap points and the influence of the subtree clock synthesis engine on clock skew and design timing are discussed in detail. Then a better flexhtree implementation scheme is found. Finally, the clock trees of flexhtree, ccopt and fishbone fishbone structure are compared comprehensively from the aspects of timing, power consumption and unit number, and it is concluded that the design is more suitable for flexible flexhtree structure.
0 Introduction
Modern high-performance processors place ever higher demands on data transmission and data processing. As the carrier of the processor clock signal, the clock tree directly affects the computing performance of the whole processor. Distributing the clock signal to every local area with low clock skew is very challenging in high-performance systems.
Clock structures fall mainly into two types: tree structures and mesh structures. Tree structure design is relatively mature, and Cadence Innovus ccopt is a typical representative. EDA tools can generate clock trees automatically from specified constraints, and the designer can choose between balanced trees and unbalanced trees with useful skew. The tree structure is widely used in chips for mobile phones and the Internet of Things. The mesh clock structure needs a lot of manual work and many tuning attempts before it shows its advantages, and it is found in high-performance computing chips.
A reasonable choice of clock tree implementation scheme for a digital synchronous logic circuit is what makes high CPU performance achievable in practice rather than an empty promise. For example, mesh and fishbone clock structures are commonly used in Intel and IBM CPUs. Their common characteristics are very low clock propagation delay, clock skew and on-chip variation (OCV). The disadvantages of the mesh structure are high power consumption and high routing resource overhead, while the fishbone structure requires more manual work because its subtrees are difficult to partition by hand. In recent years, clock trees based on the flexible h-tree (abbreviated as flexhtree) structure have been widely used in Arm architecture processors. The structure is flexible to use, consumes little power and shows small clock skew at each process corner.
This paper takes the multi-tap flexhtree as its research object and tries to reduce clock skew without an obvious impact on setup timing and power consumption. The design results provide a solution for the clock tree design of high-performance CPUs and an engineering reference for the back-end physical implementation of independently developed high-performance chips.
1. Introduction to multi tap flexhtree and fishbone clock structure
Taking a high-performance CPU as the research object, this paper mainly discusses and compares two clock structures, multi-tap flexhtree and fishbone. The two structures are briefly described below in terms of their structure and characteristics.
1.1 h-tree clock tree structure with multi tap
The traditional single h-tree is mostly used as the front-end driver of mesh and fishbone clock structures, or for clock balancing in circuit structures that require low clock skew. The h-tree with multiple tap points shown in Fig. 1 can be combined with clock tree synthesis (CTS) to control the clock skew of the whole clock tree [1]. The clock root pin can be a clock input port or a clock buffer, and it delivers the clock signal to each leaf node (sink) through the h-tree. The top seven drivers form the "H" structure of the h-tree. When there are many tap points, a multi-level h-tree network can be used to balance the clock delay between tap points across multiple corners. The last-level driver is the root node of a subtree, which can be built with ordinary CTS.
The flexhtree structure has a simple implementation flow and is easy to embed into the overall P&R (place and route) flow. Layouts that contain memories and macro modules can also be handled by the h-tree. In addition, it does not place heavy demands on the complexity of the clock gating structure. The geometric symmetry of the clock buffers and clock trunks on the h-tree does not need much attention: as long as the electrical symmetry of the RC parameters is ensured across multiple corners, clock skew can still be reduced. For the traditional h-tree, by contrast, geometric symmetry requires the number and position of sinks in the h-tree to be constrained [1].
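As a rough illustration of why seven drivers form one "H", the sketch below places the drivers of an idealized, geometrically symmetric h-tree by recursively halving a rectangle. The function name, the one-driver-per-branch-point model and the 400 x 400 region are assumptions made for illustration only; they are not taken from the paper or from any tool.

```python
def htree_drivers(x0, y0, x1, y1, depth, split_x=True, level=0):
    """Return (level, x, y) driver positions of an idealized symmetric h-tree.

    Hypothetical model: one driver sits at the center of each region, and
    the region is halved alternately along x and y at every level.
    """
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    drivers = [(level, cx, cy)]
    if depth == 0:
        return drivers
    if split_x:
        halves = [(x0, y0, cx, y1), (cx, y0, x1, y1)]
    else:
        halves = [(x0, y0, x1, cy), (x0, cy, x1, y1)]
    for hx0, hy0, hx1, hy1 in halves:
        drivers += htree_drivers(hx0, hy0, hx1, hy1,
                                 depth - 1, not split_x, level + 1)
    return drivers


# Two levels of branching: 1 root + 2 mid-level + 4 last-level drivers.
drivers = htree_drivers(0, 0, 400, 400, depth=2)
root = drivers[0]
taps = [d for d in drivers if d[0] == 2]
# Manhattan wire length from the root to each last-level driver.
lengths = [abs(x - root[1]) + abs(y - root[2]) for _, x, y in taps]
print(len(drivers), len(taps), lengths)
```

In this idealized case every last-level driver is the same wire length from the root, which is the geometric symmetry that a traditional h-tree relies on.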
1.2 fishbone clock structure
As the name suggests, the fishbone clock structure is a clock structure shaped like fish bones. According to the number of trunks, fishbone clock networks are usually divided into single-fishbone, double-fishbone and multi-fishbone structures. Figure 2 is a schematic diagram of a single-fishbone structure. The gray triangles are the front-end drivers, the white triangles are the trunk drivers, and the clock branches are drawn in gray. The fishbone front-end driver generally uses an h-tree structure to drive a multi-fan-in buffer array, and the number of stages in this pyramid-shaped buffer array is chosen according to the number of load points. It is precisely because the last-level trunk drivers provide their driving capability in parallel that the fishbone can span the whole floorplan and ensure that the clock arrives at the local buffers on each branch with the same delay. The local buffer then serves as the root node of a subtree, and CTS is used to generate the clock tree below it.
The fishbone structure has the advantages of small skew, short clock latency and small OCV. It needs few buffers, consumes little power, has low routing overhead and can implement useful skew. The disadvantage is that it cannot be automated and requires a lot of manual adjustment [2]. It is best suited to long, strip-shaped and fairly symmetrical floorplans.
2 Implementation of multi tap flexhtree
This section introduces the clock tree synthesis flow of flexhtree and explains each step. It then discusses the influence of the number of tap points on clock skew and compares the timing impact of clock trees generated by the Innovus ICTS and ccopt engines. It also compares the clock skew of CTS, flexhtree and fishbone to help chip designers explore the characteristics of the three structures further.
2.1 flexhtree clock tree synthesis flow
Figure 3 shows the flow for synthesizing a multi-tap flexhtree with the Cadence Innovus tool. The starting point is a database in which the memories, macros and standard cells have been placed and data path delay optimization has already been done. The specific steps are:
(1) The tool creates a clock tree spec from the timing constraints (SDC).
(2) Define the clock tree routing rules, specifying different rules for the clock trunk and the branches.
(3) Set the clock tree design constraints to achieve the expected skew, transition and clock buffer fanout.
(4) Define the flexhtree creation specification, such as the clock source point, symmetry, number of tap points and placement area.
(5) Synthesize the defined flexhtree trunk, then check whether the tap point positions and trunk routing are reasonable.
(6) Create clocks at the placed tap points and define clock groups.
(7) Synthesize the defined subtrees. Inside a subtree, either a balanced tree or an unbalanced tree that borrows useful skew can be used.
If the timing after subtree synthesis is not ideal at this point, it is necessary to analyze whether the subtrees are divided reasonably and whether the sinks are attached reasonably according to their logical relationships and physical locations. At the same time, check whether the clock latency of any single subtree is too long.
The difficulty in building a clock tree with a multi-tap flexhtree lies in determining the number of tap points and in attaching the sinks reasonably under the different tap points.
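To make the sink attachment problem concrete, the sketch below assigns each sink to its nearest tap point by Manhattan distance and reports the sink count per tap. This nearest-tap rule is a naive baseline chosen here for illustration, not the partitioning used in the paper, which also weighs the logical relationships between modules. All names and coordinates are hypothetical.

```python
from collections import Counter


def assign_sinks_to_taps(sinks, taps):
    """Map each sink name to the name of its nearest tap (Manhattan distance).

    sinks and taps are dicts of name -> (x, y). Ties go to the tap that
    appears first in the taps dict.
    """
    assignment = {}
    for sink_name, (sx, sy) in sinks.items():
        assignment[sink_name] = min(
            taps,
            key=lambda t: abs(taps[t][0] - sx) + abs(taps[t][1] - sy),
        )
    return assignment


# Hypothetical tap and sink locations in arbitrary layout units.
example_taps = {"tap0": (100, 100), "tap1": (300, 100)}
example_sinks = {"ff_a": (90, 120), "ff_b": (310, 80), "ff_c": (150, 100)}

assignment = assign_sinks_to_taps(example_sinks, example_taps)
sinks_per_tap = Counter(assignment.values())
print(assignment)
print(dict(sinks_per_tap))
```

A large imbalance in the per-tap counts, or a sink far from its assigned tap, is the kind of symptom that suggests revisiting the number or placement of tap points.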
2.2 impact of different tap points on clock skew
In this section, the ICTS engine is used to generate a balanced tree for each tap subtree, so that the clock latency within the subtrees below the tap points is also balanced. To explore the impact of the number of tap points on clock skew, flexhtrees with 4, 6, 8 and 18 tap points are generated. Figure 4 shows the relationship between the number of tap points and the clock skew distribution. The horizontal axis is the clock skew range in steps of 50 ps, and the vertical axis is the percentage of paths in each clock skew interval relative to the total number of paths.
It can be seen from Figure 4 that the peaks of the setup curve (diamond markers) occur mainly in the range of -150 ps to 125 ps. The skew distributions in Figure 4 (b) and Figure 4 (c) are relatively concentrated in the range of -150 ps to 0 ps, which is related to the tool using negative useful skew to fix hold violations. The positive useful skew in Fig. 4 (b) and Fig. 4 (c) is concentrated in 25 ps to 100 ps, indicating that the tool makes the clock tree more balanced, while the distributions in Fig. 4 (a) and Fig. 4 (d) are more scattered and the clock tree is less balanced. For the hold curve, the skew distributions in Fig. 4 (b) and Fig. 4 (c) are more concentrated in the two intervals of -75 ps to 0 ps and 25 ps to 175 ps, so the setup timing is better than in the other two cases. In Figure 4 (c), clock skew is more concentrated in the ranges of -75 ps to 0 ps and 25 ps to 125 ps, which shows that eight tap points both accommodate setup timing and fix hold. Across the four plots, the clock skew with eight tap points is small and concentrated, and setup and hold are better than in the other three cases.
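A distribution like the one plotted in Figure 4 can be computed by binning per-path clock skew into fixed-width intervals and converting counts to percentages. The sketch below shows one way to do this; the function name and the sample skew values are made up for illustration and are not data from the paper.

```python
import math
from collections import Counter


def skew_histogram(skews_ps, step_ps=50):
    """Return {bin_lower_edge_ps: percent_of_paths} for fixed-width bins.

    A skew s falls in the bin [edge, edge + step_ps) where
    edge = floor(s / step_ps) * step_ps, so negative skews bin correctly.
    """
    counts = Counter(math.floor(s / step_ps) * step_ps for s in skews_ps)
    total = len(skews_ps)
    return {edge: 100.0 * n / total for edge, n in sorted(counts.items())}


# Hypothetical per-path clock skews in picoseconds.
example_skews_ps = [-120, -60, -10, 30, 40, 110]
hist = skew_histogram(example_skews_ps)
for edge, pct in hist.items():
    print(f"[{edge:>5}, {edge + 50:>5}) ps: {pct:5.1f}%")
```

Running the same binning on the setup and hold path sets separately gives the two curves that are compared for each tap count.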
2.3 analysis on rational division of multi tap flexhtree tap points
To further understand why timing is good with eight tap points, Figure 5 shows the sink distribution of the eight-tap flexhtree in the floorplan. As can be seen from the figure, this design is a typical rectangular structure with symmetrically placed memories, and the left and right sides along the top and bottom pin areas belong to the l3C_pipeline_0 and l3C_pipeline_1 modules respectively. The eight blocks in the figure are distinguished by different colors and boundary polylines, and each block is labeled with the module to which its sinks belong. The highlighted line is the 8-tap h-tree structure. This tap point division takes full account of the physical location of the sinks and the interaction (talk) relationships between modules. It avoids the large subtree clock latency that results when tap points are too far from their sinks or carry too many sinks, which would make it harder to close timing on talk paths between subtrees. In the module partition of Figure 5, the number of talk paths is reduced, and several small modules are attached to a single tap point. This definition of eight taps therefore makes timing easier to close.
Table 1 shows the number of sinks carried by each of the eight tap points and the percentage of the tap point common path clock latency (cpcl) in the subtree average clock latency. As can be seen from the table, htree_tap0 has the largest number of attached sinks, corresponding to the region in Figure 5 where the liu_pre_processor module is located. htree_tap1 has the second largest number, corresponding to the l3C_CFG and l3C_pipeline_0 module area in Figure 5. htree_tap4 has the fewest attached sinks, corresponding to the l3C_pipeline_1 module area at the lower right of the middle of Figure 5. The numbers of sinks attached under the other tap points are almost the same. The cpcl at all eight tap points is above 40%, so OCV has less impact on the sinks under the tap points. Comparing the number of sinks under each tap with its cpcl shows that htree_tap0 and htree_tap1 do not have a larger average clock latency even though they carry more sinks. This indicates that the sink partition for tap0 and tap1 is reasonable and that the sink clock latency within these subtrees is relatively balanced. In addition, the cpcl percentages of the eight tap points deviate little from one another, so the eight subtrees are relatively balanced, and hold timing between subtrees is then affected mainly by OCV. Therefore, whether a flexhtree is well built can be analyzed from the subtree sink module division, physical location, number of subtree sinks and cpcl percentage.
Figure 6 compares the common clock latency of the 8 tap points under 5 different corners. The histogram shows that the clock latency of the tap points varies by no more than 70 ps across corners and that the clock latency difference between tap points is less than 3 ps. This confirms the flexhtree characteristics mentioned in Section 1.1: the electrical symmetry of the RC parameters across multiple corners and between tap points.
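The cpcl percentage described above can be written as the common path latency to a tap point divided by the average clock latency of the sinks under that tap. The sketch below assumes that sink latencies are measured from the clock root, so they include the common path; that interpretation, the function name and the numbers are assumptions for illustration, not values from Table 1.

```python
def cpcl_percent(common_path_latency_ps, sink_latencies_ps):
    """Percentage of the average sink clock latency spent on the common path.

    Assumes each sink latency is measured from the clock root and therefore
    already includes the common path up to the tap point.
    """
    average_latency = sum(sink_latencies_ps) / len(sink_latencies_ps)
    return 100.0 * common_path_latency_ps / average_latency


# Hypothetical subtree: 200 ps from root to tap, three sinks below it.
example_cpcl = cpcl_percent(200.0, [400.0, 420.0, 380.0])
print(f"cpcl = {example_cpcl:.1f}%")
```

A higher percentage means more of the launch and capture clock paths is shared, which leaves less unshared delay for OCV derating to act on.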
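The two figures of merit in this comparison, the latency variation of each tap across corners and the latency difference between taps within a corner, can be extracted from a table of per-corner tap latencies. The sketch below uses hypothetical corner names and latency values; only the two spread definitions follow the text.

```python
def latency_spreads(latency_ps):
    """Compute tap-to-tap and corner-to-corner latency spreads.

    latency_ps maps corner name -> list of tap latencies in ps, with the
    taps listed in the same order for every corner.
    Returns (between_taps, across_corners):
      between_taps   : corner name -> max minus min latency among taps
      across_corners : per tap, max minus min latency over all corners
    """
    between_taps = {c: max(v) - min(v) for c, v in latency_ps.items()}
    per_tap = list(zip(*latency_ps.values()))
    across_corners = [max(t) - min(t) for t in per_tap]
    return between_taps, across_corners


# Hypothetical latencies (ps) for three taps at three corners.
example_latency_ps = {
    "corner_a": [300, 301, 302],
    "corner_b": [340, 342, 341],
    "corner_c": [365, 366, 364],
}
between_taps, across_corners = latency_spreads(example_latency_ps)
print(between_taps)
print(across_corners)
```

Small values in the first result indicate electrical symmetry between tap points, and small values in the second indicate consistent behavior across corners.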