Comparative Deep-Dive: Zen vs. Bulldozer Architectures
2019-03-17
AMD's glory days come in waves, as do Intel's, although up until very recently it seemed like Intel had almost full control of the consumer-grade CPU market. Ryzen has added the competition this space so desperately needed. We could use a bit more, if you ask me, but Ryzen provided builders with a value-based alternative that still pushes solid frame rates without running extremely hot, requiring DDR3, requiring beefy overclocking motherboards, or sporting PCIe 2.0. Yeah, the FX platform is that old. We covered the viability of FX processors in our last video, which you can check out right here, but today we'll take one last deep dive into the Bulldozer and Piledriver architectures and compare them with current Zen and Zen+ offerings. Just a heads up: this one's gonna be a bit technical, so let's start right away with the Bulldozer block diagram.

Bulldozer was the codename for the first set of FX CPUs, including the FX-8150 and FX-6100, typically the model names with a 1 in the second place. Vishera, or Piledriver, CPUs were essentially refreshes of the 32-nanometer Bulldozer architecture. Internally everything is nearly identical; the notable improvements were in integer scheduling and power consumption. We'll address both families interchangeably throughout this video because the block diagrams look nearly identical.

The basic breakdown of the Bulldozer architecture is as follows.
A fetcher passes instructions to the decoder, where control words are created, sent to the dispatch, and fed to two separate integer schedulers and a single floating-point scheduler. The center block in each integer cluster is a set of ALUs and AGUs (arithmetic logic units and address generation units), which perform arithmetic operations and calculate memory addresses, respectively. The AGUs are important for memory calls between the CPU and the main memory, also called system RAM; that's these guys right here.

The CPU and RAM communicate via the memory bus, which is a low-latency highway for sending and receiving temporary calculations and instructions. It's made up of two parts: the address bus and the data bus. The data bus is in charge of transferring information to and from the CPU, while the address bus tells memory which location that data should be read from or written to. You can see how the two work hand in hand.
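To make the address bus and data bus split a bit more concrete, here's a toy Python sketch. It's a simplified teaching model under my own assumptions, not how real DDR signalling works; the names `ToyMemoryBus` and `agu_effective_address` are hypothetical. The address math follows the x86-style effective address an AGU computes: base + index × scale + displacement.

```python
# Toy model of the memory bus: the address bus picks a location, the data bus
# carries the value. A teaching sketch only, not real DDR signalling.

def agu_effective_address(base: int, index: int = 0, scale: int = 1,
                          displacement: int = 0) -> int:
    """x86-style address an AGU computes: base + index * scale + displacement."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale factors are 1, 2, 4 or 8")
    return base + index * scale + displacement


class ToyMemoryBus:
    """Hypothetical stand-in for system RAM behind an address bus and data bus."""

    def __init__(self, size: int) -> None:
        self.ram = bytearray(size)  # one byte per address

    def write(self, address: int, value: int) -> None:
        # Address bus selects the cell; data bus carries one byte in.
        self.ram[address] = value & 0xFF

    def read(self, address: int) -> int:
        # Address bus selects the cell; data bus carries the byte back out.
        return self.ram[address]


# Usage: the AGU works out the address, then the bus does the transfer.
bus = ToyMemoryBus(size=64)
addr = agu_effective_address(base=0x10, index=3, scale=4, displacement=2)
bus.write(addr, 42)
print(addr, bus.read(addr))  # 30 42
```

The point of keeping the two functions separate is that they mirror the split in the diagram: working out *where* (the AGU feeding the address bus) is a different job from moving *what* (the data bus).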
Now, back to our block diagram. Instructions not sent to the integer clusters are fed to the FPU, or floating-point unit. Here, numbers are approximated and expressed in a form of scientific notation, so values requiring decimals or involving division, for example, may be sent to the FPU simply because it can handle the significand, base, and exponent separately. The key difference between an FPU and an ALU comes down to this decimal point: if values cannot be expressed as whole numbers, the FPU is probably involved.

By contrast, ALUs are intended to perform integer arithmetic and logical operations involving AND, OR, NOT; you get the point. Math may still be involved, but it won't stray very far from simple addition and subtraction. This is why integer clusters contain multiple pipelines: the more cores and pipelines a program has at its disposal, the more of its work can be parallelized and thus expedited.
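Here's a small Python sketch of that FPU-versus-ALU split, under a couple of assumptions: `math.frexp` stands in for the FPU's significand/exponent view (base 2, which is what real hardware uses), and since Python integers have no fixed width, a bit mask fakes an 8-bit register for the ALU side. The function names `fpu_view` and `alu_view` are hypothetical.

```python
import math

def fpu_view(x: float) -> tuple[float, int]:
    """Split a float into a base-2 significand and exponent: x == m * 2**e."""
    return math.frexp(x)

def alu_view(a: int, b: int, width: int = 8) -> dict[str, int]:
    """Integer ops an ALU handles, wrapped to a fixed register width."""
    mask = (1 << width) - 1
    return {
        "and": a & b,
        "or": a | b,
        "not_a": ~a & mask,      # bitwise NOT, kept within `width` bits
        "add": (a + b) & mask,   # wraps around like a fixed-width register
        "sub": (a - b) & mask,
    }

print(fpu_view(6.5))             # (0.8125, 3) since 6.5 == 0.8125 * 2**3
print(0.1 + 0.2 == 0.3)          # False: floats are approximations
print(alu_view(0b1100, 0b1010))  # {'and': 8, 'or': 14, 'not_a': 243, 'add': 22, 'sub': 2}
print(7 // 2, 7 / 2)             # 3 3.5 (integer vs floating-point division)
```

The last line is the "decimal point" rule of thumb in action: the integer result simply drops the fraction, while the floating-point result keeps it.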
A few side points I want to touch on. GPUs are excellent parallel processors thanks to their typically several thousand cores, whether they be stream processors from AMD or CUDA cores from Nvidia. Graphically driven programs are resource-intensive and extremely demanding in real time, so the GPU handles them instead.
And another thing with respect to the FPU: back in the good old days, basically before I was alive, FPUs used to be add-on chips (math coprocessors) that you could buy and install after the fact if you were running some seriously heavy programs. Nowadays FPUs are almost totally integrated.

So what happens with the three streams of instructions, then?
Two are executed on the integer paths and the third on the floating-point side. Where do they meet? The LSU. No, not that LSU; this LSU, the load/store unit. Well, it isn't technically where they meet; that would be the core interface unit, which we'll discuss next. The LSU literally does what its name suggests: it loads and stores data to and from memory. It's how the AGUs and the FPU send and retrieve data from system memory and the rest of the memory subsystem, so this ties back into the memory bus we discussed earlier. I just wanted to throw that out there since we were talking about memory.

Now, down here toward the bottom, we have two important blocks, and then we'll talk about the cache. We're still just on the Bulldozer architecture, by the way; we haven't even touched Ryzen yet, though you'll see very familiar acronyms there, so we won't need to repeat ourselves too many times. The write coalescing cache (WCC) basically acts as a filter for repeated write requests, easing the load on the L2 cache. The core interface unit (CIU) just below it is the network unifier, linking all the important parts of the module and allowing the integer clusters (ICs) to communicate with the L2 cache directly.

This cache reduces the latency incurred when executing certain tasks. If, for example, you use a particular program quite a bit, the instructions and data it relies on can be kept in cache, so the next time they're needed the CPU can grab them almost immediately instead of waiting on system RAM. That lets the program, or a frequently used section of its code, open and run quicker. System RAM does a similar job of keeping data close at hand, but it's much slower because its latency is significantly higher, and it also holds plenty of instructions and data that haven't been touched yet. Cache, on the other hand, mostly holds what the CPU has used recently and is likely to need again, so it doesn't have to keep going back to RAM for the same things over and over, which is very expensive from a resource perspective.

This is one of the reasons why CPUs support multiple levels of onboard cache, and in general, the smaller a cache is, the faster it is. Level 1 cache is the fastest, but again it's also the smallest, so you can't fit a lot in it.
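To illustrate why a small, fast L1 backed by a bigger, slower L2 and then RAM pays off, here's a toy Python model. The class and function names are hypothetical, the capacities and latencies are made-up round numbers chosen only to show the pattern (not measured Bulldozer or Zen figures), eviction is plain least-recently-used, and it ignores things like the write coalescing cache and the L2 being shared within a module.

```python
from collections import OrderedDict

class CacheLevel:
    """One cache level with LRU eviction. Sizes and latencies are illustrative."""

    def __init__(self, name: str, capacity: int, latency: int) -> None:
        self.name = name
        self.capacity = capacity      # number of lines it can hold
        self.latency = latency        # made-up cycle count, not a real figure
        self.lines = OrderedDict()    # address -> None, oldest first

    def lookup(self, address: int) -> bool:
        if address in self.lines:
            self.lines.move_to_end(address)   # mark as most recently used
            return True
        return False

    def fill(self, address: int) -> None:
        self.lines[address] = None
        self.lines.move_to_end(address)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)    # evict least recently used line


def access(levels: list[CacheLevel], ram_latency: int, address: int) -> int:
    """Walk L1 -> L2 -> RAM, returning total latency and filling caches on a miss."""
    total = 0
    for i, level in enumerate(levels):
        total += level.latency
        if level.lookup(address):
            for upper in levels[:i]:
                upper.fill(address)           # copy into the faster levels above
            return total
    for level in levels:
        level.fill(address)                   # full miss: bring it in from RAM
    return total + ram_latency


l1 = CacheLevel("L1", capacity=4, latency=4)
l2 = CacheLevel("L2", capacity=64, latency=12)
levels = [l1, l2]
print(access(levels, 200, 0x100))  # 216: missed both caches, went to RAM
print(access(levels, 200, 0x100))  # 4: now an L1 hit
for a in range(4):
    access(levels, 200, a)         # four new lines push 0x100 out of the tiny L1
print(access(levels, 200, 0x100))  # 16: L1 miss, but still an L2 hit
```

The repeat access is where cache earns its keep: the second request costs a fraction of the first, and even after the tiny L1 forgets the line, the larger L2 still saves a trip to RAM.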
Now, I know that was a lot to digest; this script took several days to write on account of all the research involved. However, I think you'll find this part of the video a bit easier to understand if a lot of this is new to you, because we'll mostly be comparing the two rather than explaining how they work or why they work a certain way. So here we go with Ryzen. One of the key differences between the architectures is the lack of a split integer cluster layout in Ryzen.
Bulldozer packed two integer clusters, each with its own scheduler, and a single FPU into each module, and there were up to four modules in FX processors. That means in certain cases these CPUs could act as eight-core CPUs, and in other cases they'd act as four-core CPUs. This partly explains why certain programs, including Cinebench, initially detected chips like the FX-8150 as four-core, eight-thread units instead of the eight cores AMD advertised.

Bulldozer implemented what was referred to as a clustered multithreading (CMT) module, which literally implies that some parts of the unit share resources, including the FPU and the L2 cache. Needless to say, a huge shortcoming of FX processors showed up when the floating-point pipeline was fully saturated, since there were only up to four FPUs per die in "eight-core" CPUs, and six-core CPUs only had three. Ryzen largely avoided this issue by dedicating a single integer cluster and FPU to each core.

Compared with the Bulldozer diagram we saw earlier, the Zen block diagram gets into a bit more detail.
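Going back to the FPU-sharing point for a moment, here's a deliberately crude Python model of why FP-heavy workloads hit FX chips harder. It assumes each FPU serves one FP-heavy thread at a time, which is a simplification (Bulldozer's shared FPU could actually interleave work from both of its integer clusters), and it ignores SMT on Zen. The `Chip`, `cmt_chip`, and `zen_chip` names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chip:
    name: str
    integer_cores: int   # integer clusters, i.e. what AMD marketed as cores
    fpus: int            # floating-point units

    def int_threads_at_once(self, threads: int) -> int:
        # Integer-heavy threads each need an integer cluster.
        return min(threads, self.integer_cores)

    def fp_threads_at_once(self, threads: int) -> int:
        # FP-heavy threads also need an FPU; under CMT two integer
        # clusters share one FPU, so the FPU count becomes the ceiling.
        return min(threads, self.integer_cores, self.fpus)


def cmt_chip(name: str, modules: int) -> Chip:
    # Bulldozer/Piledriver module: two integer clusters + one shared FPU.
    return Chip(name, integer_cores=2 * modules, fpus=modules)

def zen_chip(name: str, cores: int) -> Chip:
    # Zen: one integer cluster + one FPU per core (SMT ignored here).
    return Chip(name, integer_cores=cores, fpus=cores)


for chip in (cmt_chip("FX-8150", 4), cmt_chip("FX-6100", 3), zen_chip("8-core Zen", 8)):
    print(chip.name, chip.int_threads_at_once(8), chip.fp_threads_at_once(8))
# FX-8150 8 4
# FX-6100 6 3
# 8-core Zen 8 8
```

With eight integer-heavy threads the FX-8150 looks like an eight-core part, but with eight FP-heavy threads it behaves more like a four-core one, which is the "sometimes eight cores, sometimes four" behavior described above.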
The key things to point out on the Zen diagram are the retire queue, the dispatcher, and the integer cluster. Ryzen CPUs boast simultaneous multithreading, or SMT, which lets each physical core run two threads at once, so the schedulers can fill the core's pipelines with instructions from either thread. It's basically a way to keep more of the core busy. We discuss Hyper-Threading, which is Intel's implementation of the same idea, in this video right here; it's basically the same general process.

You can see the rename block inside the integer cluster, which works alongside the retire queue: instructions are mapped onto physical registers, executed as soon as their inputs are ready, and then retired in order, so executing out of order never changes the program's visible results. The same is true for the floating-point unit.

Inside the integer cluster we see six pipelines, four ALUs and two AGUs, where each Bulldozer integer cluster had only two of each. The extra ALUs speed up integer and logical operations and allow Ryzen to handle more instructions per clock on each core. They're also important for SMT, giving the second thread more execution resources to fill with fewer stalls. This is one of the big reasons why Ryzen is so much more efficient per core: its IPC (instructions per clock) is generally a lot higher than Bulldozer's.

Another key difference between the two architectures has to do with cache. As discussed earlier, Bulldozer's two ICs, which are essentially cores of their own, at least according to AMD themselves, share one large chunk of L2 cache, and remember: the larger the cache, the slower it operates. Ryzen cut this from 2 megabytes per module down to 512 kilobytes per core, which resulted in a much quicker response from the cache, though the trade-off is not being able to store as much data in it.
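Here's some quick arithmetic on the per-core resources mentioned above, written as a Python sketch. The L2 sizes (2 MB per Bulldozer module, 512 KB per Zen core) and the pipeline counts (two ALUs and two AGUs per Bulldozer integer cluster, four and two per Zen core) come from the comparison above; the `CoreLayout` class is hypothetical, and the "per-core share" for Bulldozer is only nominal, since one cluster can use the whole 2 MB when its neighbour is idle.

```python
from dataclasses import dataclass

KIB = 1024
MIB = 1024 * KIB

@dataclass(frozen=True)
class CoreLayout:
    name: str
    l2_bytes: int        # size of one L2 block
    cores_per_l2: int    # integer clusters sharing that block
    alus: int            # per integer cluster
    agus: int            # per integer cluster

    def l2_share_kib(self) -> int:
        # Nominal share; under CMT one cluster can use all of it if its neighbour is idle.
        return self.l2_bytes // self.cores_per_l2 // KIB

    def total_l2_mib(self, cores: int) -> float:
        # Number of L2 blocks on the chip times the size of each block.
        return (cores // self.cores_per_l2) * self.l2_bytes / MIB

    def integer_pipes(self) -> int:
        return self.alus + self.agus


bulldozer = CoreLayout("Bulldozer", l2_bytes=2 * MIB, cores_per_l2=2, alus=2, agus=2)
zen = CoreLayout("Zen", l2_bytes=512 * KIB, cores_per_l2=1, alus=4, agus=2)

for arch in (bulldozer, zen):
    print(arch.name, arch.l2_share_kib(), arch.total_l2_mib(8), arch.integer_pipes())
# Bulldozer 1024 8.0 4
# Zen 512 4.0 6
```

So an eight-core Zen chip actually carries half the total L2 of an "eight-core" FX part, but each core gets its own private, faster slice plus 50% more integer pipelines to feed.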
So in general, Zen's level 1 and level 2 caches are roughly twice as fast as those of the previous architectures, and despite the size drop, it doesn't weigh too heavily on the CPU's ability to perform everyday tasks.

Another radical shift from Bulldozer is Ryzen's reliance on the CCX, or core complex. Each CCX contains four Zen cores, and there are two CCXs per die, allowing for up to 8 cores and 16 threads per chip. CCXs are connected to each other via the Infinity Fabric, which is AMD's interconnect for linking CPUs to GPUs as well as clusters of cores to other cores. Latencies between cores within each CCX are extremely low by comparison, while latencies between CCXs can be up to 10 times higher, depending largely on system RAM frequency, because the Infinity Fabric clock is tied to the memory clock. That's one of the variables involved, and it's why first-generation Ryzen CPUs typically yielded higher frame rates when faster memory kits were used.
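The memory-speed effect can be sketched in a few lines of Python. On Zen and Zen+, the Infinity Fabric clock follows the memory clock, which is half the DDR4 transfer rate (DDR4-3200 runs a 1600 MHz memory clock). The functions below are hypothetical, the core numbering assumes cores 0 to 3 sit in the first CCX and 4 to 7 in the second (real OS numbering, especially with SMT, can differ), and the 100-cycle hop cost is a made-up figure used only to show the scaling.

```python
CORES_PER_CCX = 4
CCXS_PER_DIE = 2

def ccx_of(core: int) -> int:
    """CCX index for a physical core, assuming cores are numbered CCX by CCX."""
    if not 0 <= core < CORES_PER_CCX * CCXS_PER_DIE:
        raise ValueError("core index out of range for one 8-core die")
    return core // CORES_PER_CCX

def crosses_ccx(core_a: int, core_b: int) -> bool:
    # Traffic between CCXs has to go over the Infinity Fabric.
    return ccx_of(core_a) != ccx_of(core_b)

def fabric_clock_mhz(ddr4_rating: int) -> float:
    # Zen/Zen+: fabric clock follows the memory clock, half the DDR4 transfer rate.
    return ddr4_rating / 2

def fabric_hop_ns(cycles: int, ddr4_rating: int) -> float:
    # Hypothetical: assume a cross-CCX hop costs a fixed number of fabric cycles.
    return cycles * 1000 / fabric_clock_mhz(ddr4_rating)


print(crosses_ccx(0, 3), crosses_ccx(3, 4))  # False True
for rating in (2133, 2666, 3200):
    print(rating, fabric_clock_mhz(rating), round(fabric_hop_ns(100, rating), 1))
# 2133 1066.5 93.8
# 2666 1333.0 75.0
# 3200 1600.0 62.5
```

Whatever the real cycle count is, the same number of fabric cycles takes less wall-clock time at a higher memory clock, which is the mechanism behind faster kits helping first-generation Ryzen.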
Ryzen also supports AVX2 rather than just AVX, which was first introduced with the Sandy Bridge and Bulldozer architectures. Features like XFR were bundled in alongside Precision Boost and so on: things you probably wouldn't even know were present unless you purposely looked for them. But the options and tweaks have definitely improved the user experience and even allow us to tinker a bit more in the UEFI, which is a good thing all around. I say all of that to state the very obvious: Ryzen was a big step in the right direction given the technology of its time. We went from 32 nanometers down to 14 and now 12 nanometers in just a few short years.

A few important notes I want to mention. Ryzen chips not ending with a G do not include iGPUs, which are a totally different discussion for a different video. Bulldozer omitted onboard graphics as well. Now, if we looked at the block diagram of an Intel CPU, it would look a little different. I mean, the CPU side would look the same, but you'd also have the IGP on board: the integrated graphics processor. If you have one of those, you can just plug an HDMI cable into your motherboard and run off of the chip's graphics instead of having to use a discrete card like a 980 Ti or an RX 580, and you'll actually get a picture on your monitor. With Ryzen and Bulldozer you couldn't do that unless you were using something like a 785G chipset (I had 786 written in the script, but I'm pretty sure it's 785) or something similar bundled with the board. That's basically a dedicated graphics processor embedded on the board itself; it's not included in the CPU, so the rule still applies.

I do hope you've enjoyed this video, by the way. I know it's a lot to take in, but I appreciate you giving me the chance to explain this stuff as best I can with the limited primary resources at my disposal. I have a lot of fun with these, and I hope you're at least somewhat entertained by or interested in this kind of stuff, because this is how the channel started. We've kind of strayed away from that because, as the channel grew, a lot more people were interested in just PC builds and benchmarks, but this is the heart of the channel right here, and this is why the channel is called Science Studio. So for those wondering, these videos are why the name is still what it is. Leave a comment down below, click that red subscribe button if you're feeling especially special, and give me a thumbs up if you liked the video. If you like this kind of content, look for past videos on topics like these on the channel. You guys are awesome; I will catch you in the next one. This is Science Studio. Thanks for watching, and thanks for learning.
We are a participant in the Amazon Services LLC Associates Program, an affiliate advertising program designed to provide a means for us to earn fees by linking to Amazon.com and affiliated sites.