AMD RDNA / Navi Arch Deep-Dive: Waves & Cache, Ft. David Kanter
2019-06-13
Hey everyone, we're at the AMD Next Horizon event, and I'm joined by David Kanter. David has joined us a few times now and runs Real World Tech at realworldtech.com, one of the most technical outlets. Have you posted anything recently?

Mostly in the forum — I'm more active in the forum — but I've had a couple of things on interesting process-technology-related innovations in the last six months.

Sure. So if you're wondering where I go to read stuff, it would be his website. We're going to be talking about Navi today, but David also works on MLPerf.org, so we'll plug that: if you're interested in machine-learning benchmark performance, you can check out MLPerf.org for some of David's work over there.
Before that, this video is brought to you by Skillshare. Skillshare makes it easy to learn skills and advance yourself professionally, with classes available for just about everything. We found the JavaScript toolkit class, taught by Christian Heilmann, a senior developer at Microsoft, to be of notable interest for our audience. The class is an intro on how to get started with JavaScript and covers the skills you need to be marketable for web development. Skillshare costs $10 per month on an annual subscription, or click the link in the description below for a two-month free trial of Skillshare Premium.
My baby for the last 11 months has been getting the inference benchmarks together: once we have a trained neural network, how do we classify things, or predict things, or recognize images? It's been really exciting to get that out the door, and then also to work on some of the power measurement for those kinds of systems.

Cool.
So — Navi. You also know a good bit about this, because you've spoken with the architects; David was at the event with us. Should we start with a really top-level look at the key changes versus Vega? Because one of the biggest things from Mike Mantor — the big point he was driving home to our group of press — was: "By the end of this, I hope to convince you all that this is actually a new architecture."

Yeah, that's right.
So, for context: AMD has a somewhat constrained starting point, which is that they're in all the consoles. For all the console guys, as well as for the PC folks, in an ideal world you want the software that was written over the last seven years — since the start of GCN — to run well on Navi. That kind of backwards compatibility constrains some of your innovation; you can't go too wild and crazy. Whereas you look at Nvidia, and every couple of generations they're tearing things up and rebalancing — but that does mean that if you wrote software for Maxwell, it's probably not going to run optimally on something like Turing. So the question for AMD is very much: how do we deliver a forward-looking path of innovation while also having everything that was written before, especially for the consoles, run really well?
And so the philosophy behind Navi, I think, is quite different from GCN — from Vega and Polaris. One part is this philosophical change — he probably talked to you about IPC and building a wider compute unit — having higher effective instructions per clock, or higher clock speeds, in the cases that we've seen so far.

Right. And there's one really basic thing — waves — that I want to top off with an explanation. For context: when we come out to these events, the architecture and tech days, they have someone from basically everywhere in the company representing their spot. So you have product specifications, and then you have really low-level architecture, and the two aren't necessarily the same person's area of expertise. On the architecture side, I am NOT an expert, and one of the things Mike Mantor kept mentioning was waves: wave32 versus wave64. Let's start with — what is a wave?
Well, let's step back — I want to take it a little more basic and talk a bit about the philosophy. When we look at performance, a good mental model is that everything today is a multiprocessor: you've got N cores, and if you want to make the whole solution faster, you can do more cores or you can do a faster core. And from the CPU side — which applies to the GPU side here — the way to get a faster core is either higher clock speed or more instructions per cycle: more work per clock.

Which we've seen arise in the discussion as well.

Exactly.
And so, conceptually, if you look at the Vega architecture, the main lever for performance there is more cores, more compute units — you've got to go wider. With Navi and the RDNA architecture, the big shift is to say: actually, what we want to do is go faster on a per-core basis, so that we maybe don't need as many cores.

Let me do one quick interruption: when you say cores, you mean compute units?
You know, everyone — Nvidia, Intel, everyone who builds something that computes — will have a special name for it. Fortunately, the CPU guys all agree and call them cores, so that's the term I like to use, because I'm originally a CPU guy.

And just very quickly — we have this discussion in a separate video you can watch — but your definition of a core?

Very quickly: it's sort of your basic compute pipeline that is going to be fetching instructions, decoding instructions, executing them, and then writing the results back, and reading and writing to memory.

So when David says core, it is not stream processors.

No, no — a lot of the GPU guys like to call their floating-point units a core, or a stream processor, or whatever. I really mean, in Nvidia parlance, an SM; in AMD parlance, a compute unit.
Right. So on the Navi side, you said they're focusing on going faster, and they've narrowed it a bit.

Yes — and this gets into what a wave is. When you look at the programming model for GPUs, the whole idea is that we're going to run one program on every pixel on the screen, but we want to do that with a vector execution unit. That means we can do a multiplication on a bunch of pixels at the same time, but they've got to be running the same instruction. So if we wanted to translate or rotate, you're going to be doing a matrix multiplication on every single pixel, and you can batch all those pixels together. If you've ever looked at some of the Nvidia white papers, they'll call this a warp — a bunch of data items that you're going to process together.

So this is Nvidia's warp to AMD's wave.

Yes, they are the same idea, and every GPU will have a very similar construct — Intel's integrated graphics also has something like this. The whole idea is that this is a bunch of data elements that you process together in the microarchitecture.
Now, one of the big changes with Navi: in previous generations, the waves were 64 data elements wide. That works really well for some things, but there are two downsides. One is when you have a branch, and some of the data elements go this way and the others go that way — well, you have to execute both paths. So if you have, say, one branch out of 64, then in one clock you do 63 lanes and in the other clock you do one, and that's not great utilization. So that's one aspect.
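To put numbers on that example, here's a minimal sketch in Python — an illustration of the idea, not AMD's hardware logic — of how a diverged branch forces a wave through both paths with inactive lanes masked off:

```python
# When a branch splits a wave, the SIMD runs both paths back to back,
# masking off the lanes that didn't take each path.
def divergent_branch_passes(wave_size: int, lanes_taken: int):
    """Active-lane count for each pass needed to execute a two-way branch."""
    if lanes_taken in (0, wave_size):
        return [wave_size]                         # uniform branch: one pass
    return [lanes_taken, wave_size - lanes_taken]  # both paths run, masked

passes = divergent_branch_passes(64, 63)
print(passes)                            # [63, 1]: one clock does 63 lanes, one does 1
print(sum(passes) / (len(passes) * 64))  # 0.5 -> half the issue slots are wasted
```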
The second is that in the GCN architecture — which includes Vega and Polaris — the way you would execute a wave is over four clocks. You'd have a 16-wide SIMD, and in clock one you do a quarter, in clock two you do a quarter, in clock three you do a quarter, in clock four you do a quarter, and you're done; then you can move on to the next instruction. So if you look at a single data element, it only makes forward progress every four clocks. Say, for data element one, you're going to do your multiply instruction first, and then four clocks later you can do an add or a subtract or whatever — its progress, as I said, is once every four clocks. So the "narrower" part is that Navi has a 32-wide wave, and the "faster" part is that the SIMDs are now 32 wide, so you execute a full wave every clock. If you take that same example and look at one particular data element, it now makes forward progress every clock.
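A toy issue-rate model — just the arithmetic David describes, not a cycle-accurate simulator — makes the difference visible for a chain of dependent instructions:

```python
# Clocks for one data element to retire a chain of dependent instructions,
# given how many passes the SIMD needs to sweep a full wave.
def clocks_for_chain(instructions: int, wave_size: int, simd_width: int) -> int:
    passes_per_instruction = wave_size // simd_width
    return instructions * passes_per_instruction

# GCN (Vega/Polaris): a wave64 swept over a 16-wide SIMD -> 4 clocks per instruction.
print(clocks_for_chain(8, wave_size=64, simd_width=16))  # 32 clocks
# RDNA (Navi): a wave32 on a 32-wide SIMD -> forward progress every clock.
print(clocks_for_chain(8, wave_size=32, simd_width=32))  # 8 clocks
```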
The other thing worth mentioning about the new wave32 model with Navi is that it's optional. As I mentioned, it's probably going to be better for performance, but from a backwards-compatibility standpoint, if you have software that was written with wave64: (a) you can still run it, and (b) you can still write new software in wave64 if that's going to be best for you. Behind your back, the machine will take these 64-wide waves and issue them every other clock, essentially — it'll split them up. There are potentially two different ways to split them up, but it'll figure out how. And that goes back to this backwards-compatibility constraint that AMD has as a function of doing consoles and not wanting to disrupt the software ecosystem. The nice thing is, since we're talking about executing a program on all these different data items, that program will execute a lot faster — you get the same FLOPS, but the way you're running it through the machine changes.
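A sketch of that compatibility path, assuming the simplest low/high split — and note David says the hardware actually has more than one way to divide a wave64:

```python
# A legacy wave64 becomes two wave32 halves issued on back-to-back clocks
# on RDNA's 32-wide SIMD; the split shown here is illustrative.
def issue_wave64_on_rdna(lanes):
    assert len(lanes) == 64
    for clock, half in enumerate((lanes[:32], lanes[32:])):
        print(f"clock {clock}: issue wave32 covering lanes {half[0]}..{half[-1]}")

issue_wave64_on_rdna(list(range(64)))
# clock 0: issue wave32 covering lanes 0..31
# clock 1: issue wave32 covering lanes 32..63
```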
The benefit of this, if you think about it: say you have a pixel shader. You're going to have a bunch of associated state with that — you're going to need registers, descriptors, and other things that occupy space within the machine. So if we can burn through that pixel shader 4x faster, we can free up those resources and fill the machine with fewer data items. The question is: how many data items do you need to get the machine to a hundred percent? And the answer is always going to be lower with Navi than with GCN, so that's just going to result in better utilization.
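One way to see why: by Little's law, the work resident in the machine is throughput times lifetime, so shaders that finish roughly 4x sooner need roughly 4x fewer data items — and their registers and descriptors — held live. The numbers below are hypothetical, purely to show the direction of the effect:

```python
# Little's law: items in flight = throughput (lanes/clock) x lifetime (clocks).
def items_in_flight(lanes_per_clock: int, shader_lifetime_clocks: int) -> int:
    return lanes_per_clock * shader_lifetime_clocks

# Both designs peak at 64 lanes/clock per CU (GCN: 4x16 SIMDs, RDNA: 2x32),
# but a dependent shader that retires ~4x sooner occupies far less state.
print(items_in_flight(64, 400))  # 25600 data items resident (GCN-like lifetime)
print(items_in_flight(64, 100))  # 6400 -> the machine fills with fewer items
```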
Yeah — you won't have as many hardware units just sitting there doing nothing, I guess.

It's not just fixed-function units, but also your shaders. And I know some of the work you did previously showed that essentially there was some underutilization of the compute resources in Vega, so you can think of this as an architectural approach to try and fix that.
Right, and there were also things we saw — issues with stuff like cache. We were talking about that previously; cache bandwidth is something you mentioned.

Yeah, and I'd say that's the second big change in Navi. I think some of the work you did was great, and you pointed out that actually the biggest bang for the buck on Vega was overclocking the memory, not the compute. You look at that, and the data says: okay, this machine doesn't have enough bandwidth. At the end of the day, bandwidth is always going to come from either DRAM or cache, and the approach for Navi was to introduce another level of cache. If you look at GCN, including Vega, you have a cache in each compute unit that's pretty small, and you have a shared L2 cache between everything that's not super big, but not super small — I mean, from a CPU standpoint it's small; on a CPU, a quarter of the die is cache, you know, 10 or 20 megabytes.

What about "GameCache"?

Let's not talk about GameCache — oh, man. AMD builds fine products; sometimes the marketing names leave a little bit to be desired.

That is a very kind way to say it.
Yeah. So on GCN you had an L1 cache that was pretty small, per core — per compute unit — then you had this shared L2 cache, and then you had memory. The big change in Navi is that now you have an L1 cache that's actually shared between two compute units — so there's actually a way to communicate there — and then there's what they call the graphics L1 cache, which is in each shader array.

So they're different?

Yes. You have this sort of per-core cache, and then the array — there are four shader arrays in Navi 10.

Are they physically different? Not just logical?

It's not logical — it's actually a physically separate array, and it's used for slightly different things. The graphics L1 cache backs up the per-core L0 cache, and it's also used by the render backends for blending and depth and so forth. Then, in addition to that, you have the globally shared L2, and then you have DRAM, like regular. The nice thing is that this new L1 will absorb a lot of the bandwidth that previously would have spilled over to the L2, and that then reduces the demands on memory.
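A toy model of that hierarchy — the hit rates here are invented purely to illustrate how an extra level in front of the L2 filters traffic before it spills to memory:

```python
# Per-CU L0, graphics L1 per shader array, global L2, then DRAM: each level
# absorbs its share of hits and passes the misses down.
def requests_reaching_dram(requests: float, hit_rates) -> float:
    for rate in hit_rates:
        requests *= (1.0 - rate)  # only the misses continue to the next level
    return requests

# GCN-like: per-CU cache, then the shared L2.
print(requests_reaching_dram(1000, [0.5, 0.6]))       # 200.0 requests hit DRAM
# RDNA-like: L0, the new graphics L1, then L2 (same L0/L2 rates assumed).
print(requests_reaching_dram(1000, [0.5, 0.4, 0.6]))  # 120.0 requests hit DRAM
```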
The thing that's interesting is, if you do out the math, Navi 10 has roughly comparable bandwidth to a Vega 64, but it only has 40 compute units instead of 64. So when you look at the bandwidth per core, it's actually gone up quite a bit.
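Doing out that math with the public launch specs — these figures are mine, not from the interview: roughly 484 GB/s and 64 CUs for Vega 64, versus 448 GB/s and 40 CUs for Navi 10 — the per-CU bandwidth rises by nearly half:

```python
# Bandwidth per compute unit, using the launch specs as assumptions.
vega64_bw_gbs, vega64_cus = 484, 64   # HBM2
navi10_bw_gbs, navi10_cus = 448, 40   # GDDR6 (RX 5700 XT)

print(round(vega64_bw_gbs / vega64_cus, 1))  # 7.6 GB/s per CU
print(round(navi10_bw_gbs / navi10_cus, 1))  # 11.2 GB/s per CU, ~48% more
```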
And so you're going to stall in your shaders less.

So from a testing perspective, what type of workload — you know, not one-to-one, which we're not going to have — but where do you start to see that difference emerge, in theory?

I mean, I think, realistically, you're just going to see better scaling with respect to frequency, and it should be less bottlenecked on memory — we hope. And then it also has a benefit in terms of power: if you are getting data from SRAM on chip, that is always cheaper than getting it from DRAM.
Interesting, too, on power: HBM2, as much as AMD was stuck with it, that's also a power play.

It was, yeah — it's lower power than GDDR5. I'm not sure what the numbers actually look like for GDDR6 versus HBM for power consumption, off the top of my head — I'm sure it's higher. But I think this is one of those things where it's a trade-off of die area, cost, and power, and there are folks who are willing to pay a premium for much more power-efficient memory. I think the challenge is that's mostly in the datacenter and the national-lab space — not consumer, not mainstream.

Right, and that HBM cost is significant.

Yeah, I think it made it impossible for AMD to drop the prices, really. And so, from that standpoint, GDDR6 is a much more cost-effective solution for gaming — it makes a ton more sense — and the downside is you're going to have to spend a little bit more power to get it there. That's a motivator to make things more power efficient elsewhere.
Yeah — and AMD, on the GPU side, has drawn over some of the CPU people, like Sam Naffziger, to work on power for the GPUs now. It sounds like they've done a lot of work there versus Vega.

Well, I think, broadly speaking, power management in CPUs has historically been more aggressive and much more careful than on the GPU side — for PC-class GPUs, anyway. So they're getting to apply some of those techniques to the GPU, and also getting new techniques, because in a CPU you always want to go fast, but you look at something like —

FreeSync? Or PowerPlay — maybe it was Chill?

Chill, right.
really do in a CPU but it's you know a
very clever technique where you know for
those who aren't familiar with it it's
okay you know we're in a scene where
things aren't moving much so I lost my
frame rate right and it's like well
that's a really cool trick that you can
use because you understand the
application you know oh this actually
isn't going to have an impact and we can
cut down a row sniffin Utley and as a
side effect sometimes and improve the
response the latency right exactly and
so there's because graphics is a little
bit more of uh there's more application
awareness there's things that can be
done there so it anyways
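As an illustration of the Chill-style idea — a sketch of the concept, not AMD's actual algorithm — you can think of it as mapping a motion estimate to a frame-rate cap:

```python
# Hypothetical mapping: quiet scenes get capped low to save power,
# fast scenes ramp back toward the display's maximum refresh.
def target_fps(motion: float, fps_min: int = 40, fps_max: int = 144) -> int:
    motion = max(0.0, min(1.0, motion))  # clamp the 0..1 motion estimate
    return round(fps_min + motion * (fps_max - fps_min))

print(target_fps(0.05))  # 45 fps in a near-static scene -> big power savings
print(target_fps(0.9))   # 134 fps when the action picks up -> full responsiveness
```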
Anyway, I think there's been a lot of effort invested — at AMD, really at Intel, at everyone — in getting graphics to be much more power efficient. There are all sorts of tricks that I expect AMD will be bringing over to the GPU, and then unique GPU-only tricks, and that'll get better.
So, are there any closing thoughts you have on Navi in general? I mean, power consumption is obviously one of AMD's biggest punching bags, I guess, over the last couple of generations — the power consumption discussion we'll have to look into more with testing — but TBP was 225 W on the 5700 XT. How do you feel about those numbers, from a glance?

I mean, at the end of the day, I think with a lot of gamers, power is probably more of a second-order thing.

Yeah, definitely.

For me, you know, I actually care much more about the noise levels than power draw — except maybe on a summer day when I don't want the place heating up.
Yeah. I mean, the other thing is, with a new architecture you end up changing so many things that you may not discover all the pain points until you have silicon running real workloads, and on the basis of that, that opens up new avenues for tuning for power. It's sort of a new architecture — the first one in seven years — and so, just like with GCN, where there were all these incremental steps with Polaris and Vega, I'm sure there's going to be a similar trajectory for Navi, and we'll have to see what those look like.
But then I thought the other interesting thing is that Samsung announced that they're going to license this.

That was very interesting, and it's a good move. I mean, it should support AMD where they need it, which is getting some more guaranteed revenue in on the graphics side. For those who aren't familiar, Samsung is going to be licensing the RDNA architecture for mobile — their Exynos chips — probably for high-end smartphones and tablets. And the truth of the matter is, if you do that, that's going to put AMD's engineers under a lot of pressure to get the power down. So it'll be interesting to see what they develop there and how that might apply to the rest of the line. But I think there are always opportunities to cut power, and it's usually a question of time and priorities and money.
Yeah. And the nice thing is — let's be really honest — if you're making things for consoles and you're making things primarily for desktop PCs, power is not necessarily the most critical thing. So, drawing on my either cynicism or economics background, depending on your perspective, you've got to follow the money. If AMD is now getting money to do mobile stuff, then that says someone is putting dollars behind power, so we should expect it to improve.
Yes — very good point on that. So, if you want to see more of David's work, we have other videos; we did the one on "what is a core," where I think you talked about whether a CUDA core is really a core.

Is a CUDA core really a core? TL;DR: no — but you can watch the video for the full discussion.

There's also realworldtech.com — you've got a forum there that David's active in.

Yep.

And MLPerf.org, for people interested in machine learning performance.

That's right.

So I think that'll wrap it for this one. Thank you for watching; David, thank you.

Very good to see you again.

We'll see you all next time.