More than L3 - I am digging into the efficiency / effectiveness of L1 - particularly in the area of hit rates and other bottlenecks.
Architectural impacts are tied to these scaling differences. When you look at the "capacity" of I/D$ in terms of KiB or MiB, the difference for L1 vs L2 vs L3 has to do with latency and the corresponding speed of the memory instances. You want single-cycle access time for your L1$, which tends to lead to smaller row/column sizes, and larger bitcells (e.g. 122 vs 111 fin configurations) to give you faster access times. The array overhead then also becomes substantial (overhead meaning the row/column sense amps, decoders, etc., versus the actual array of bitcells) such that your I/D$ data arrays are close to 4:1 or 2:1 bitcell vs periphery area. If you can relax the timing (higher latency / more clock cycles) then you can increase the string length of the array (more bitcells per bitline), opt for slower bitcells, and reduce the overhead on a given instance, increasing the density (bits / mm^2) of the memory instance but sacrificing performance. Large block arrays for L3$ run around ~32 KiB per instance in current ARM CPU designs, versus L1 instances that might be ~4 KiB each.

Similar trends apply to the Tag RAM, which needs to grow along with the capacity of the caches, and to other parts of the core, where you may instantiate a register file instead of SRAM to get the highest performance at the expense of area. Real physical SRAM instances can also have additional bits for ECC, redundant rows/columns for yield recovery, etc., which contribute further to the size overhead.
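To put rough numbers on that density trade-off, here is a back-of-envelope sketch in Python. The bitcell areas and array efficiencies are illustrative assumptions, not figures for any particular process node; the point is just how bitcell choice and periphery overhead multiply:

```python
# Back-of-envelope SRAM macro density comparison.
# All numbers below are illustrative assumptions, not real PDK figures:
# assume the high-performance (122) bitcell is ~30% larger than the
# high-density (111) bitcell, and use bitcell:periphery area ratios in
# the range discussed above (~2:1 for an L1-style macro, better for L3).

HD_BITCELL_UM2 = 0.021  # assumed high-density bitcell area (um^2)
HP_BITCELL_UM2 = 0.027  # assumed high-performance bitcell area (um^2)

def macro_density_mbit_per_mm2(bitcell_um2, array_efficiency):
    """Effective macro density: raw bitcell density scaled by the
    fraction of the macro that is actually bitcell array (the rest
    being sense amps, decoders, and other periphery)."""
    raw_bits_per_mm2 = 1e6 / bitcell_um2  # 1 mm^2 = 1e6 um^2
    return raw_bits_per_mm2 * array_efficiency / 1e6  # -> Mbit/mm^2

# L1-style: fast 122 cells, short bitlines, ~2:1 bitcell:periphery
l1 = macro_density_mbit_per_mm2(HP_BITCELL_UM2, array_efficiency=2 / 3)
# L3-style: dense 111 cells, long bitlines, assume ~6:1 bitcell:periphery
l3 = macro_density_mbit_per_mm2(HD_BITCELL_UM2, array_efficiency=6 / 7)

print(f"L1-style macro: ~{l1:.1f} Mbit/mm^2")
print(f"L3-style macro: ~{l3:.1f} Mbit/mm^2 ({l3 / l1:.2f}x denser)")
```

With these made-up inputs the L3-style macro comes out roughly 1.6x denser per mm^2, which is the kind of gap that lets L3 capacity scale into MiB while L1 stays in the tens of KiB.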
On hit rates, this depends a lot on the core architecture and what it is doing in support of SMT, prefetchers, etc. You may not benefit from larger L1$ sizes if you are supporting SMT2/4, because context switches between threads on the same core may invalidate parts of the cache. Conversely, a larger L2 / victim L3$ may benefit SMT because you can evict those invalidated parts to the next level of the hierarchy for when you switch context back, as opposed to doing a full writeback to DRAM and having to fetch it again from there.

In multi-threaded / multi-core, you also have cache coherency to deal with, and if you attempt to deal with that at the L1$ level, then probe / snoop requests scale roughly quadratically with the number of cores (every core's misses end up probing every other core). You can simplify that with a directory that tracks which core holds which cache line, and manage coherency through a common cache hierarchy level, which in AMD's case is the L3 (each core's L2 is private). You can either use the directory to send a probe request to the specific core that is holding the dirty cache line in its L1 (which may impact the execution of that core), or have all dirty lines write through to the L3 and have other cores update their (now) invalid cache lines from L3.
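As a toy illustration of that scaling argument, here is a simplified probe-count model in Python; the traffic numbers are made up and real protocols are far more nuanced, but the growth rates are the point:

```python
# Toy model of coherence probe traffic (deliberately simplified).
# Broadcast snooping: every ownership-gaining write probes all other
# cores' caches. Directory filtering: a shared-level directory (e.g.
# tracked at the L3) knows which core holds the line, so only the
# actual holder(s) get probed.

def broadcast_probes(num_cores, writes_per_core):
    # Each write is snooped by the other (num_cores - 1) caches.
    return num_cores * writes_per_core * (num_cores - 1)

def directory_probes(num_cores, writes_per_core, avg_sharers=1.0):
    # The directory targets only the core(s) it knows hold the line.
    return int(num_cores * writes_per_core * avg_sharers)

for n in (2, 4, 8, 16):
    b = broadcast_probes(n, writes_per_core=1_000)
    d = directory_probes(n, writes_per_core=1_000)
    print(f"{n:2d} cores: broadcast={b:>8,} probes  directory={d:>8,}")
```

Broadcast traffic grows with the square of the core count while the directory-filtered traffic grows linearly, which is why tracking coherency at a common level like the L3 pays off as CCXs get wider.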
There is always some point of diminishing returns on increasing the cache size at each hierarchical level, but finding that point has to do with the machine width and the types of workloads executing. Apple's CPUs, for example, have very large L1$ relative to industry peers because the amount of parallelism in their machine back-end means they need larger caches to 'feed the beast'. If you took something like an A76 and just doubled the L1$ size (without increasing the parallelism of the execution), you would not see a substantial uplift in performance, because the hit rate is already so high that the extra cache space would get filled with stale data, or instructions for branches that are never taken. And once you increase the address range by another bit, you might as well try to maximize that, which leads to power-of-2 doubling in most cases (but not all; stepping up the associativity instead of the index is how you get 32 --> 48 KiB, 128 --> 192 KiB, etc. sized caches). Core architects ultimately have to determine where the sweet spot is in these implementations (typically through a combination of modeling and workload experimentation).
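That sweet-spot reasoning falls straight out of average memory access time (AMAT) arithmetic. A minimal sketch, with made-up hit rates and latencies purely for illustration:

```python
# Average memory access time (AMAT) for a toy two-level hierarchy,
# showing why doubling an already-effective L1 can buy nothing.
# All hit rates and latencies below are made up for illustration.

def amat(l1_hit_rate, l1_cycles, miss_penalty_cycles):
    """AMAT = L1 hit time + L1 miss rate * miss penalty."""
    return l1_cycles + (1.0 - l1_hit_rate) * miss_penalty_cycles

# Baseline: small, fast L1 with an already-high hit rate.
base = amat(l1_hit_rate=0.96, l1_cycles=4, miss_penalty_cycles=40)
# Doubled L1: hit rate nudges up, but the bigger array may cost an
# extra cycle of access latency.
big = amat(l1_hit_rate=0.975, l1_cycles=5, miss_penalty_cycles=40)

print(f"baseline L1: AMAT = {base:.2f} cycles")  # 5.60
print(f"doubled  L1: AMAT = {big:.2f} cycles")   # 6.00
```

With the hit rate already in the high 90s, the extra access latency of the bigger array can outweigh the handful of misses it removes, which is exactly the trade the modeling and workload experimentation is hunting for.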
TL;DR - you can pack more SRAM bits / mm^2 into L3$ because it is slower and has less periphery circuit overhead than L1$ - but the performance benefit is up to the architecture and how that L3$ is used (in AMD's case, as a victim cache). A larger L1$ in an 8C coherent CCX may not be as beneficial as you might assume from looking at just single-core performance.