Rubix

Overview

Rubix™ is a new breed of EDA tool: a clock concurrent optimization tool. Clock concurrent optimization can be thought of as merging useful skew clock tree synthesis with incremental logic sizing and timing-driven placement, but using a propagated clocks model of design timing, rather than the increasingly inaccurate ideal clocks model of design timing. Rubix increases clock frequencies by up to 25%, reduces leakage power by up to 30%, and accelerates timing closure in the backend of the design flow by up to two months.

Flow

Rubix functions as a plug-in point tool within a digital ASIC design flow, fully replacing the clock tree synthesis (CTS) and Post-CTS Optimization steps inside a Place and Route tool. Rubix integrates with any Place and Route tool using industry-standard file formats as its inputs and outputs. Rubix and Azuro's other product, PowerCentric™, share exactly the same flow interface and scripting language: if one tool is integrated into a design flow, then both tools are integrated into that design flow.

Technology

The traditional role of clock tree synthesis is to distribute a set of source clock signals to thousands of data registers on a chip such that these signals arrive at almost exactly the same time, within some tight skew margin. "Useful skew" clock tree synthesis differs from this traditional "skew balanced" approach by deliberately exploiting skew in the arrival times of clock signals at registers to further increase chip speed. Useful skew is similar to the concept of changing an employee's work schedule from an 8-hour day to a 40-hour week: the total working time is the same, but the 40-hour week schedule is more flexible and therefore enables the employee to put in more hours when there is a lot of work to do. By skewing clocks, certain logic functions on a chip can be given more time to compute a result, provided that other logic functions can be given less time to compute their result.
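The trade described above can be made concrete with a little arithmetic. The following is a minimal sketch, not Rubix's algorithm: it assumes a hypothetical three-register pipeline A → B → C, made-up path delays in picoseconds, and it ignores setup time, hold checks, and clock uncertainty. The function name `min_period_ps` and the `skew_b` parameter are illustrative only.

```python
def min_period_ps(d_ab, d_bc, skew_b=0):
    """Smallest clock period (ps) that meets setup on both stages.

    d_ab, d_bc: logic delays of stages A->B and B->C (hypothetical values).
    skew_b: how much later the clock reaches register B than A and C.
    Setup time, clock uncertainty, and hold checks are ignored.
    """
    # Stage A->B gains skew_b of extra time; stage B->C loses the same amount.
    return max(d_ab - skew_b, d_bc + skew_b)


# Skew-balanced clock: the slow stage alone sets the clock period.
balanced = min_period_ps(1200, 600)

# Useful skew: delay the clock at B so the slow stage borrows from the fast one.
skewed = min_period_ps(1200, 600, skew_b=300)

print("balanced:", balanced, "ps")
print("skewed:  ", skewed, "ps")
```

With these assumed numbers, delaying the clock at B by 300 ps lowers the minimum period from 1200 ps to 900 ps. The total time available to the two stages is unchanged; only its distribution differs.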

Automated tools for building useful skew based clock trees have so far seen only minimal adoption by chip design teams, and where they are used they are typically restricted to skewing the clocks of only a small number of registers. This is because current approaches to useful skew fail to address the following two key problems:

1. Pre vs. Post CTS timing divergence

CTS is a bridge in the design flow between two fundamentally different models of design timing: ideal clock timing, and propagated clock timing.

Before CTS there is no clock network in a design, and therefore it is not possible to model any real signal propagation through the clock network. Instead, a simplified "ideal clocks" model of design timing is used. In this model, clock signals magically arrive at all registers on the chip instantaneously, irrespective of any logic in the clock tree such as clock gates, clock muxes, or clock dividers. Advanced timing margining techniques such as on-chip-variation (OCV) derates and common path pessimism removal (CPPR) cannot be applied, and a credible estimate of hold violations is impossible to determine.

After clock tree synthesis the clock network is present, and therefore clock signals can be propagated through this network using a true "propagated clocks" model of design timing. With propagated clocks timing, clock muxing, clock gating, and clock division are all taken into account correctly. Likewise, OCV derates and CPPR can be applied to the design, and hold violations can be properly estimated.
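The difference between the two timing models can be illustrated with a single register-to-register path. This is a simplified sketch under stated assumptions: all delays are invented values in picoseconds, the capture register is assumed to sit behind one extra clock gate, and OCV derates and CPPR are left out entirely. The helper names (`latency`, `setup_slack`, `hold_slack`, `analyze`) are hypothetical and do not correspond to any tool's API.

```python
# Hypothetical clock-path cell delays (ps) from the clock source to each register.
LAUNCH_PATH = [100, 100]         # two buffers
CAPTURE_PATH = [100, 100, 150]   # two buffers plus a clock gate

PERIOD, DATA_MAX, DATA_MIN = 1000, 900, 80   # assumed clock period and data delays
T_SETUP, T_HOLD = 50, 30                     # assumed register requirements


def latency(clock_path_delays, propagated):
    """Clock arrival time: zero under ideal clocks, summed cell delays otherwise."""
    return sum(clock_path_delays) if propagated else 0


def setup_slack(period, launch_lat, capture_lat, data_max, t_setup):
    # Data must arrive before the next capture edge, minus the setup time.
    return (period + capture_lat) - (launch_lat + data_max + t_setup)


def hold_slack(launch_lat, capture_lat, data_min, t_hold):
    # Data must not change until the hold time after the same capture edge.
    return (launch_lat + data_min) - (capture_lat + t_hold)


def analyze(propagated):
    launch = latency(LAUNCH_PATH, propagated)
    capture = latency(CAPTURE_PATH, propagated)
    return (setup_slack(PERIOD, launch, capture, DATA_MAX, T_SETUP),
            hold_slack(launch, capture, DATA_MIN, T_HOLD))


ideal_result = analyze(propagated=False)
propagated_result = analyze(propagated=True)
print("ideal      (setup, hold):", ideal_result)
print("propagated (setup, hold):", propagated_result)
```

In this made-up case the ideal clocks model reports positive setup and hold slack, while the propagated clocks model shows more setup slack and a hold violation caused by the later clock arrival at the capture register. Neither effect is visible until clock-path delays are taken into account.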

The more complex a chip, the more advanced the process node, and the more aggressively clock gated a design is, the bigger the disconnect between ideal and propagated clocks timing models becomes. For useful skew to be successful on advanced complex designs, the timing picture on which the skewing is based must be a propagated clocks model of timing; however, a propagated clocks model of timing exists only after clocks have been built. This presents a fundamental chicken-and-egg paradox: clocks must be built based on a timing picture which does not exist until after clocks are built.

2. Clocks cost power

Clock networks on complex designs at advanced process nodes consume a significant amount of power. As a general rule of thumb, every transistor inside the clock network costs 20x the power of a transistor inside a logic function. The main reason for this is that transistors in the clock network switch far more often than transistors in logic functions. Any useful skewing of clock signals must consider the impact of this skewing on chip power. Sometimes, deliberate skewing is the only option available to increase chip speed, but in many situations adding skew simply creates more freedom for traditional logic sizing and placement techniques to fix problems in different places on a chip. Useful skew is not a goal in itself; rather, the goal is to take the additional freedom offered by the "40-hour week rather than an 8-hour day" concept, and build upon this freedom by blending useful skew with traditional logic sizing and placement techniques to deliver the "best" solution in terms of speed, power, and area.
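The 20x rule of thumb quoted above suggests a simple way to compare two candidate fixes for the same timing problem. The sketch below is illustrative only: the transistor counts are invented, the function `relative_power` is hypothetical, and a real comparison would depend on actual switching activity, cell sizes, and wiring rather than on a single multiplier.

```python
CLOCK_POWER_FACTOR = 20  # rule of thumb: one clock transistor ~ 20 logic transistors


def relative_power(clock_transistors, logic_transistors):
    """Power cost of a fix, in units of 'one logic-function transistor'."""
    return CLOCK_POWER_FACTOR * clock_transistors + logic_transistors


# Candidate 1 (assumed): add skew by inserting clock buffers totalling 8 transistors.
skew_fix = relative_power(clock_transistors=8, logic_transistors=0)

# Candidate 2 (assumed): upsize logic gates, adding the equivalent of 100 transistors.
sizing_fix = relative_power(clock_transistors=0, logic_transistors=100)

cheaper = "sizing" if sizing_fix < skew_fix else "skew"
print("skew fix:", skew_fix, "| sizing fix:", sizing_fix, "| cheaper:", cheaper)
```

Under these assumptions the logic sizing fix is cheaper in power even though it adds far more transistors, which is why skew and sizing need to be weighed against each other rather than applied in isolation.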

Clock concurrent optimization (ccopt for short) is useful skew based clock tree synthesis delivered in such a way as to address these two key problems. A ccopt tool must weave useful skew clock tree synthesis with incremental logic sizing and placement, and the clocks that it builds must be ripped up and rebuilt many times within a single optimization step. All of this must be done within reasonable tool runtime and memory constraints, and all of it must be based on a propagated clocks model of design timing.

Ripping up and rebuilding clocks multiple times is the only way to make sure that useful skew can be influenced by a true propagated clocks model of timing; interleaving the rebuilding of clocks with incremental logic sizing and placement is the only way to ensure that problems get fixed in the most power-efficient and area-efficient way possible.

The more complex a chip, the more advanced the process node, and the more aggressively clock gated a design is, the bigger the benefits of clock concurrent optimization over traditional skew balanced clock tree synthesis.

Downloads

To learn more about Azuro's products and technology please visit our Downloads page.