Because no other chip's power consumption accounts for 80% of total cost.
Let's check 5nm as an example. Suppose latches/DFFs account for 30% of power.
Doubling the latches takes the power to 1.3X.
Lowering the voltage from 0.4V to 0.3V scales power by (0.3/0.4)^2 = 0.5625.
Since Vth is around 0.2V, the speed will be the same.
A dynamic latch is very small, maybe 5% area.
1.3 * 0.5625 * 1.05 = 0.77.
You can reduce costs by 20%, which could double your profit.
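For reference, the quoted arithmetic can be reproduced in a few lines. This is only a check of the numbers as quoted; every figure is the quoted poster's, and the assumption that the non-power share of cost stays unchanged is mine.

```python
# All figures are taken from the quoted post; nothing here is measured data.
latch_power_share = 0.30          # latches/DFFs as a fraction of total power
voltage_scale = (0.3 / 0.4) ** 2  # dynamic power scaling as V^2
area_overhead = 1.05              # quoted 5% extra area, applied as a multiplier
power_share_of_cost = 0.80        # quoted share of total cost that is power

# Doubling the latches adds another 30% of power, then voltage and area scale it.
power_ratio = (1 + latch_power_share) * voltage_scale * area_overhead

# Assumption: only the power share of cost scales, the rest stays fixed.
cost_ratio = (1 - power_share_of_cost) + power_share_of_cost * power_ratio

print(f"power ratio: {power_ratio:.3f}")
print(f"cost ratio:  {cost_ratio:.3f}")
```

That gives a power ratio of about 0.77 and a cost ratio of about 0.81, which is where the "20%" comes from. Whether the inputs (same speed at 0.3V in particular) hold up is a separate question.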
You're talking about architecture changes here, all of which are perfectly valid but are independent of the tradeoff between PDP and voltage. That suggests to me that you're well familiar with optimizing architectures (gate-level design), but not with optimizing library conditions (transistor-level design and choice of library operating conditions, i.e. building custom gate libraries).
Regardless of the architecture or circuit or process node, you *always* get PDP curves like the ones I posted, and there's *always* a voltage where PDP is minimized. This voltage moves around with process corner, exact circuit design and clock rate (which trades off dynamic vs. leakage power), but it is basically defined by device threshold voltage: below Vth the current drops exponentially (subthreshold slope, which is what causes leakage), and above Vth the current rises roughly as (Vgs-Vth)^2, at least for the small overdrives we're dealing with here for low-voltage operation.
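The curve shape described above can be sketched with a toy model. This is not the simulation behind the posted plots: the smooth current interpolation, the `SLOPE` value, the `leak_weight` factor and the definition of PDP as (dynamic + leakage power) x (C*V/I delay) are all assumptions chosen for illustration, and the absolute voltages it produces are not calibrated to any process. Only the qualitative behaviour (a minimum that tracks Vth) is the point.

```python
import math

SLOPE = 0.04  # assumed n*kT/q in volts; sets how fast current falls below Vth


def drive_current(v_gs, vth):
    """Toy current: exponential below vth, roughly (v_gs - vth)^2 above it."""
    return math.log1p(math.exp((v_gs - vth) / (2 * SLOPE))) ** 2


def pdp(vdd, vth, leak_weight=5.0):
    """Power-delay product in arbitrary units (uncalibrated toy model)."""
    i_on = drive_current(vdd, vth)
    i_off = drive_current(0.0, vth)               # leakage current at Vgs = 0
    power = vdd ** 2 + leak_weight * i_off * vdd  # dynamic + leakage power
    delay = vdd / i_on                            # C*V/I with C normalised to 1
    return power * delay


def min_pdp(vth, v_lo=0.15, v_hi=1.20, step=0.01):
    """Grid-search the supply voltage that minimises pdp() for a given vth."""
    n = round((v_hi - v_lo) / step)
    grid = [v_lo + i * step for i in range(n + 1)]
    best = min(grid, key=lambda v: pdp(v, vth))
    return best, pdp(best, vth)


for vth in (0.20, 0.30):
    v_best, p_best = min_pdp(vth)
    print(f"Vth={vth:.2f}V: min PDP {p_best:.4f} (a.u.) at Vdd={v_best:.2f}V")
```

In this sketch, raising `vth` moves the minimum both up and to a higher supply voltage, and below the minimum the PDP climbs steeply because the delay grows exponentially.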
With higher Vth (e.g. LVT/SVT gate, slow process corner, low temperature) the minimum PDP is both higher and occurs at a higher voltage; with lower Vth (e.g. ULVT gate, fast process corner, high temperature) the minimum PDP is lower and occurs at a lower voltage. Here are some results for N7 again showing these two extreme cases. Even for the fastest gate type (ULVT), the minimum PDP in the slow/cold corner is at about 0.57V; in the fast/hot corner it's off the left-hand side of the plot at around 0.33V, and for typical conditions (not shown) it's around 0.45V.
The majority of this shift is due to process corner, not temperature, and there's nothing you can do about this unless you're willing to lose a lot of yield at either chip or even wafer level. The voltage for minimum PDP hardly changes with process node because it's pretty much linked to Vth; the curves all the way from N7 down to N2 (I've done this analysis for every node) are pretty much similar, except that they move downwards with each node (lower power consumption, lower gate capacitance, higher gate speed).
And I have *never* seen a case where chips from the middle of the process spread have minimum PDP at 0.32V.
In practice you need to decide which power consumption to minimise: the maximum allowed (slow process corner), typical, or somewhere in between. You can't pick the ultra-low voltage for the fast corner (e.g. 0.33V) and claim this as your "operating Vdd" because hardly any chips will run fast enough at such a low voltage. This assumes that you need a constant level of processing power. If there are cases where the required processing power is lower (e.g. a CPU) then of course you can drop the supply voltage and save power by reducing both dynamic and leakage power, but if you're just hammering away crunching data all the time (like a bitcoin miner?) you can't do this.
And just to be even clearer, to do all this you need to build custom cell libraries which meet your operating conditions, and have circuits on-chip which measure gate delay or circuit speed, and have an individually-adjustable supply *per chip*. You can't use standard cell libraries from the vendors (e.g. TSMC, Synopsys) because these invariably have corners which are based on standard supply tolerancing and are "the wrong way round" (e.g. slow process/cold/Vddmin, fast process/hot/Vddmax).
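As a rough illustration of what "individually-adjustable supply per chip" means in practice, here is a minimal sketch of a per-chip calibration loop. Everything in it is hypothetical: `find_min_vdd`, `make_chip`, the toy speed model and the voltage range are made up for the example, and a real part would read an on-chip delay monitor instead of calling a Python function. It assumes measured speed never decreases as Vdd increases.

```python
def find_min_vdd(measure_speed, target, v_lo=0.30, v_hi=0.90, resolution=0.005):
    """Bisect for the lowest Vdd at which measure_speed(vdd) >= target.

    measure_speed is a stand-in for an on-chip speed monitor (hypothetical);
    it is assumed to be non-decreasing in vdd.
    """
    if measure_speed(v_hi) < target:
        raise ValueError("chip cannot meet the target even at v_hi")
    if measure_speed(v_lo) >= target:
        return v_lo
    while v_hi - v_lo > resolution:
        mid = (v_lo + v_hi) / 2
        if measure_speed(mid) >= target:
            v_hi = mid   # fast enough: try a lower supply
        else:
            v_lo = mid   # too slow: needs a higher supply
    return v_hi          # lowest tested voltage known to meet the target


def make_chip(vth):
    """Toy chip: speed ~ (Vdd - vth)^2 / Vdd above vth, zero below it."""
    return lambda vdd: max(vdd - vth, 0.0) ** 2 / vdd


TARGET = 0.1  # required speed in the same arbitrary units as the toy model
v_fast = find_min_vdd(make_chip(0.20), TARGET)  # fast-corner chip (low Vth)
v_slow = find_min_vdd(make_chip(0.30), TARGET)  # slow-corner chip (high Vth)
print(f"fast chip: {v_fast:.3f}V, slow chip: {v_slow:.3f}V")
```

The slow-corner chip ends up at a noticeably higher supply than the fast-corner one for the same throughput, which is why a single "operating Vdd" taken from the fast corner doesn't work.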
This is all just to minimise power consumption for a given function. If you want to minimise die area/cost ("double your profit") the tradeoffs are completely different: you want to run at a higher voltage to get more throughput per mm2. The result will be a smaller, cheaper chip, but one that dissipates more power.