RTX 2080 Ti Failure Analysis: Artifacting, Thermals, Black Screens, & Defects
2018-11-16
the RTX 2080 Ti failures aren't as
widespread as they might have seemed from
initial posting but they are absolutely
real
when discussing internally whether we
thought the issue of artifacting and
dying RTX cards had been blown out of
proportion by the internet we had two
frames of mind on one side the level of
attention did seem disproportionate to
the size of the issue particularly as
RMA rates are within the norm on the
other side the other frame of mind
actually nothing was blown out of
proportion for people who spent $1,250
and received a brick in return for those
affected buyers the artifacting is
absolutely a real issue and it deserves
real attention it's a very expensive
brick partners though are still often
under 1% and retailers are under 3.5%
for RMA rates which is pretty standard
and good overall so we have to keep both
of those aspects in mind before that
this video is brought to you by the
Thermaltake Level 20 VT micro-ATX case
the Level 20 VT takes the high quality
Level 20 design and makes it more
affordable and shrinks it down to a
micro ATX form factor at that with fully
modular paneling it's possible to
rearrange this case into whatever
configuration you prefer for a micro ATX
case that can be a discussion piece in a
home theater system click the link in
the description below quick overview of
what we're doing then we already have a
live stream that we did where we went
through about maybe half a dozen or so
RTX 2080 Tis from our viewers where we
tested to see do they artifact and a lot of
them did artifact spectacularly during
the stream but not all of them do and
some of the issues on the cards we got in
were blue screens which were resolved
largely with that latest driver update
so the thing to remember here with
the issue of seeing kind of reddit threads
all over the place of dying dead BSODing
2080 Tis is that it was two
separate issues there and one of them is
the blue screens the driver
compatibility the software which is
getting improved and a lot of that's
been fixed but not all of it
particularly things with specific G-Sync
monitors for example were very bad with
BSODs but a lot of that's getting
addressed separately that's not a
hardware level issue the hardware level
issue we are pretty confident anyway is
an artifacting issue that you've
likely also seen some people call them
space invaders artifacting specifically
got the kind of XD icons that fit in a
perfect square if you were to overlap
them so that is those are the two
separate issues we were dealing with
here and a couple of cards we got in
from viewers were just the BSOD issue
that were at least solved on our end and
then the rest a lot of them were
artifacting and we had over 10 devices
total which although a small sample size
in the grand scheme of things
it's pretty good when considering what
it is it's a $1,300 card and we asked a
bunch of people who don't know us to
send it to us on loan so a pretty good
sample size and we were able to validate
the issues we dug through the cards tried
to figure out if there's any common
thread there and primarily what we walk
away with is two things one this
isn't a hugely wide reaching issue so as
far as we can tell from all the RMA
numbers we've gotten from manufacturers
from retailers this really doesn't seem
like it's affected a whole lot of 2080
Tis but it definitely affected people
so even though it's not perhaps
deserving entirely of as much attention
as it did get initially it's absolutely
deserving of attention because if you're
one of the people who bought one of
these things the TI version especially
and it doesn't work and Nvidia might
be slow at getting back to you and they
were in a lot of cases with customers
that we spoke with then yes it deserves
all the attention it can get because
that's a really expensive thing to not
work and it has to work it's that's the
whole idea with guaranteed
merchantability of a product so anyway
we're looking into that and there's a
limited amount of what we can do we set
the wheels turning with some others in
the industry so that's helpful and we
expedited a lot of the RMAs from
people who had board partner cards
things like that but this is more of a
focus on what it isn't what the problem
is not rather than what it was a lot of
the early speculation was things like
thermals and the problem with this is
one image goes out and rather than think
about the image or read what the author
might have said a lot of the comments
online just instantly jump into it's
a thermal problem VRAM's
overheating oh my god and that's not
really that's not the right approach to
a problem like this that's how it starts
just snowballing way out of control
because if you posted that you didn't
know what you were talking about and
there are problems but unless someone
has confirmed that vram thermals are an
issue you really can't just go saying
they are an issue and tweeting everybody
on the planet because it needs to be
validated so we're gonna go through that
we did some VRAM testing now unlike
those people we're not gonna say that
100% for sure VRAM thermals aren't a
problem but it doesn't look quite so bad
as some of those discussions might have
led you to believe which means that
there's plenty of room elsewhere in the
product for the problem of artifacting
to manifest itself it's just not
necessarily in thermals which was the
major speculative talking point for the
last couple of weeks or so we had
attached two thermocouples to the memory
modules in the review originally and
we'll reproduce that chart on the screen
now and never saw any thermal issues
with the memory that said it's always
possible that some cards have thermal
problems where others don't so we took a
week to attach thermocouples all over
the cards that we received from viewers
here's a list of FurMark and Fire
Strike thermal results for the memory
modules we measured the two hottest
memory modules we could find one was
between the GPU and the capacitor bank
for the VRM and the other is near the
PCIe slot at the bottom we determined
that these two locations are the hottest
by probing each module individually on
two cards and then committing to the two
primary modules to test in the
worst-case scenario on this chart the
card I1 failed in FurMark from the
usual modes of failure typically
artifacting freezing and then crashing
after only about 10 minutes
I1 failed so it never reached steady
state but what we can do is look at the
thermals to see if those might have
caused a premature crash this is the
worst card it had memory measuring at 84
degrees for the module near the chokes
or 79 degrees for the module near the
PCIe slot we could not find a card
worse than this one for thermals overall
the spec calls for temperature to be
under about 95 degrees Celsius for
GDDR6 we're measuring the external package
temperature here so it is possible that
the internal die is at its maximum
thermal value or at least the one that
the spec calls for but we can't really
be a hundred percent sure
realistically from experience the delta
is likely closer to five degrees between
the external and
the internal part of the package so it
should be within spec if barely even
still we wanted to see what would happen
if we used one of our known good review
samples that has never failed and
tortured it without any heatsink or air
flow on the memory theoretically if
hitting some magical maximum temperature
number triggers an instant freeze or
artifacting fits then our card without
any heat sinks should encounter that at
the same temperature this next move is
inadvisable if you remember our hybrid
card and we'll show some shots of it the
only way it really works well was to add
a fan or two to blast the PCB our CLC
only covered the GPU here there was no
additional board cooling we redeployed
the hybrid without any cooling on the
memory whatsoever finding that it'll
continually ramp temperature until you
feel uncomfortable we ended up halting the
test right at around 100 degrees on the
thermocouple for the hottest module
which is probably about 105 to maybe 110
internally this is obviously not good
for the card but the point was to see if
a known good card would be made to
instantly start artifacting or crashing
as a result of only high temperatures
that's at least what a lot of internet
conjectures suggested over the past
couple of weeks
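
The arithmetic behind those memory numbers can be sketched roughly like this: take the thermocouple reading from the outside of the package, add an assumed external-to-internal delta, and compare against the roughly 95 degree GDDR6 figure quoted above. The function names and the fixed 5 degree default are illustrative assumptions, not datasheet values, and the real delta could be larger.

```python
# Rough spec check for memory temperatures measured with a thermocouple.
# GDDR6_SPEC_C is the approximate limit quoted in the video, not a datasheet lookup.
GDDR6_SPEC_C = 95.0


def estimate_internal_c(external_c, delta_c=5.0):
    """Estimate internal temperature from an external package reading.

    delta_c is an assumed external-to-internal offset, not a measured value.
    """
    return external_c + delta_c


def within_spec(external_c, delta_c=5.0, spec_c=GDDR6_SPEC_C):
    """True if the estimated internal temperature is at or under the limit."""
    return estimate_internal_c(external_c, delta_c) <= spec_c


# External readings mentioned in the video, in degrees Celsius.
readings = {
    "I1 FurMark, module near chokes": 84.0,
    "I1 FurMark, module near PCIe slot": 79.0,
    "known-good card, no memory cooling": 100.0,
}

for label, external in readings.items():
    internal = estimate_internal_c(external)
    status = "within spec" if within_spec(external) else "out of spec"
    print(f"{label}: {external:.0f} C external, ~{internal:.0f} C internal, {status}")
```

With a 5 degree delta the 84 degree reading lands around 89 internally, which is under the limit, while the 100 degree torture run lands around 105, which is over it, and that known-good card still did not artifact.
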
so this testing looked at that this card
fared well it still works to its full
overclocked potential today and now
again it's not good for the components
but it didn't artifact flicker freeze or
crash and we stopped the test it
didn't fail on its own there does not
appear to be a thermal shutdown that
triggers from high memory temperatures
at least not at 100 to 105 degrees maybe
higher than that it's outside of spec
but there was no mode of failure here
for our known good card back to the
previous chart momentarily other memory
modules ended up at 65 degrees for card
E1 which failed nearly instantly it
did not have time for any components to
exceed the thermal spec and overheat E1
just crashed too fast from likely
other issues this is not thermally
related in this instance unless it's
perhaps a component on the board we
didn't measure there are a lot of them
but we measured the hottest ones and the
ones that are the most susceptible to
thermal crashes if they were to
encounter one card F1 couldn't even
finish loading the application and so was
never under any meaningful load to push
the thermals card I1 with Fire Strike
pushed 78 degrees on the memory
which is within spec card E1 with
Fire Strike failed nearly instantly and
so never had a chance to get hot it was
around 64 degrees for the modules and the
rest of the cards pretty much are at the
same area when they didn't fail
instantly I mean they're all comparable
it's just I1 is the worst case and
that one does not appear to be crashing
from thermal related issues even though
it is very hot and definitely hot enough
that we would consider RMAing it anyway
even if it didn't artifact but you get
the idea it looks like memory thermals so
I1's the worst here it's not really
acceptable it's borderline questionable
or concerning but we don't have any
reason to believe that the artifact is
caused from the memory temperatures it's
likely something else maybe the surface
mount method for the memory was wrong or
something like that so it looks like the
thermals here are not the cause for
failure of at least these cards even on
I1 with FurMark where thermals
were high enough to be concerning we
know that the memory temperature was
still within spec we know because the
card still failed even when putting
higher end cooling on it so even when
you drive those temperatures down below
what they measured with extra fans or a
waterblock it still fails the same way
at about the same time and we know in
those instances for sure it's well within
spec continuing the thermal trend we
took turns measuring inductors and
MOSFETs on the same cards here MOSFETs and
inductors can take 125 to 150 degrees
Celsius depending on which component it is
and if it has any thermal fail-safes
included card I1 was again the
hottest at about 72 degrees for the
hottest MOSFET while running FurMark F1
didn't hold long enough to heat up
meaningfully so we can count that as
another tick against thermals at least
for the vrm being the cause of issues
card E1 in FurMark also failed
quickly never exceeding 54
degrees on its MOSFET I1 remained hot
in Fire Strike at 71 degrees for the
MOSFET but even so none of these
temperatures are legitimately hot 71 is
hotter than the other cards but it's
still way within spec for the MOSFET 71
degrees for MOSFET temperature is
completely reasonable there's really no
reason to complain about it no reason to
think it's crashing as a result of
that and it's just it's higher than the
others but it's still completely fine so
you're looking at 125 degrees as
about where you'd start really being
concerned even the 73 degree value is
within spec just again there's seemingly
no thermal issues with the cards we had
a lot of people also noted that their
back plates were running really hot
we'll note that this is what back plates
are supposed to do since they are heat
sinks and that means they're working but
the Founders Edition back plate does run
a bit hotter than most might be used to
so it's reasonable that people would be
concerned about it sticking a
thermocouple to the backside of the
hottest memory module near the PCIe slot
we measured a maximum backside PCB
temperature of 75 degrees on the I1
in FurMark this is within spec and not
unreasonable seeing as a PCB is just a
giant conductor with shared power planes
running through it and a significant
portion of the PCB is copper it's it's
gonna be hot that's what copper and what
PCBs do so it will be hot on the
backside it's sandwiched that thermocouple
is right between the PCB and the back
plate both of which are conductors and
heat is coming from the memory on the
other side of the PCB anyway being
sunk through the PCB 75 is fine
it's certainly warm yes but it's not
causing problems that we know of lest
there be any concerns of testing
conditions here are some ambient thermal
numbers for each test logged second to
second for the entire test run we stayed
within a range of roughly 22 to 23.5
degrees Celsius for ambient temperature
GPU thermals are also plotted here again
for devices that failed nearly instantly
those never got hot they didn't reach
steady state and they were just
beginning to ramp up when they failed
for the rest I1 was at 76 degrees one
was at 71 degrees and all the others
were really near by that point although
the FE heatsink is not impressive
there's really no red flag with these
thermals so we can somewhat confidently
say that thermals were not the issue
with the cards we got the next point of
consideration is firmware as newer units
are shipping with firmware revision
90.02.17.00.04 and
original cards shipped with
90.02.0B.00.0E we tried flashing a few
cards to newer revisions of firmware
that we obtained officially ultimately
finding the same artifacting results
that we saw previously they were not
resolved you can see some of those on
the screen now if you want to
what the artifacting looks like once
again firmware updates did not resolve
the issue on the cards in our lab
following several requests during our
live stream of the dying 2080 Ti cards we
also decided to test behavior in Windows
versus Linux this was a good idea and it
helps eliminate one of the biggest
possible variables which is the
operating system we installed Ubuntu
18.04 and Unigine Heaven and then tested
two cards that artifact in Windows
against Ubuntu running OpenGL we tested
with driver revisions 410 and 415 and in
both instances on both cards with both
drivers and with the proprietary driver
we saw artifacting as early as the
terminal we also encountered freezes
during the heaven benchmark run often
with the same time period as Windows
would freeze in Time Spy Extreme from
our testing this issue does not appear
to be isolated to Windows does not
appear to be isolated to drivers for the
cards and at this point we can start
assuming that it's almost certainly a
physical board level defect the next
step was to tune frequencies to try and
mitigate the artifacting and freezing
behavior of the card we hosted a multi
hour livestream that included some
frequency tuning to try and mitigate
these crashes most devices seemed to
degrade over time in general but we
noticed that a few benefited from clock
reductions of various sorts that said
most units we ended up with by the time
we got them did not exhibit increased
stability from intentional frequency
throttling we tried all combinations and
permutations we can think of down
clocking memory down clocking core down
clocking both simultaneously negative
power offset power offset positively
mixing the power offsets in both
directions with down clocks in both
directions 100% fan speeds and no
other changes or 100% fan speeds and
lots of other changes we also did power
offsets with no changes and so on
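
To give a sense of how quickly that kind of tuning matrix grows, here is a minimal sketch that enumerates every combination of a few settings. The setting names and levels are hypothetical labels for illustration only, not the actual offsets or tools used on the stream.

```python
from itertools import product

# Hypothetical setting levels; labels only, not the real offsets from testing.
SETTINGS = {
    "memory_clock": ["stock", "downclock"],
    "core_clock": ["stock", "downclock"],
    "power_offset": ["negative", "none", "positive"],
    "fan_speed": ["auto", "100%"],
}


def build_test_matrix(settings):
    """Return every combination of setting levels as a list of dicts."""
    names = list(settings)
    return [dict(zip(names, combo)) for combo in product(*settings.values())]


matrix = build_test_matrix(SETTINGS)
print(len(matrix), "configurations")  # 2 * 2 * 3 * 2 = 24
for config in matrix[:3]:
    print(config)
```

Even with only two or three levels per setting this comes out to 24 runs per card, which is part of why a pattern is hard to pin down across a bit over 10 cards.
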
ultimately although a few of the users
who sent their cards in noted that
these steps could improve stability
temporarily for them we were not able to
reproduce this in any widespread fashion
there didn't seem to be a key solution
here where we could just constantly down
clock the memory and it would always
work or constantly down clock the core
so when these things did work there was
no pattern to it and we can't really
draw any conclusions firmly and just for
good measure we also took apart the
cards and did a cursory look over the
board components we were really only
looking for anything extremely out of
place like a missing thermal pad poor
contact to the thermal pads burned or
damaged components and so on there's
only so much we can do here we don't
have x-ray scanners and double leaves
and things like that so there's only one
device that demonstrated any physical
defect out of all of them so one
defective unit out of all the cards but
it was unrelated to the issue of
artifacting and it's something we may
discuss later just it's not related to
this particular problem so we're gonna
skip over it until we can understand
what went wrong with that specific card
the thermal pad contact on all of the
cards was fine we can see indentations
there's clear contact being made to the
pads we did some pressure paper testing
it was fine there's obviously there's no
visible damage to the components that we
could see although most component damage
would not be visible it's just that is
the most obvious thing to rule out any
defect is going to be something we don't
have the tools or knowledge to see like
something inside the board or inside the
silicon so really what we've done
primarily here is rule out a few things
thermals we strongly suspect are not the
issue now we're leaving room of course
for that to be the problem because with
a sample size of a bit over 10 units
it's hard to draw a 100% sure
conclusion there could absolutely be and
we had one card that was genuinely
running very hot for VRM thermals or
VRAM thermals rather not VRM the problem
was when we ran cards without heat sinks
at all they didn't instantly artifact if
they're known good so it's not like it's
just you hit 95 degrees and instantly
artifacting boom it's potentially
degradation if it is thermal related but
we don't really have any reason to
suspect that it is and it's not
something as simple as missing thermal
pads we thought perhaps the different
types of thermal pads used might have
varying heights but that was not the
case in fact all these devices use the
same thermal pads in the same places so
that wasn't a problem frequency tuning
did not seem to have a patterned result
where you can for sure say that dropping
memory clock would work every time
because sometimes it just didn't
sometimes dropping core clock did
instead Windows does not seem to be the
issue because we used Ubuntu Linux
firmware also didn't appear to be the
issue and we spoke with some people in
the industry as well
who might be closer to this than we are
for example and the takeaway here is
pretty simple it seems like it's either
a manufacturing level issue with some
kind of assembly problem or it's some
kind of in silicon problem whether that
silicon is GPU or memory we're not sure
but it does not appear to be any of the
other things that we walk through today
there is of course a possibility that it
could be those things but there is
strong enough evidence here that likely
it's not and it's it's another issue
that is more difficult to troubleshoot
or solve for Nvidia but it appears that
like they are now aware of the issue
because they did post something on their
forums talking about test escapes they
called them cards that were dying and
were just shouldn't have made it out of
the lab but I found it particularly
comical because all of these cards had a
QC passed sticker on them and all you
had to do in most cases is plug them in
and they would not pass so perhaps the
QC pass should be revamped to
include a plug-it-in step not just a
visual inspection step as it
may be right now
either way hopefully that gives you some
kind of conclusion to some different
ends of the story lines ties a couple
of those up the thermals that was the
big one we want to talk about because
people see one image and jump to the
conclusion that it's overheating
that's the problem that's it but you
know if anything this demonstrates you
really can't do that you've got
to step back and look at the greater
picture and allow for the fact that
they're I mean it's it's within spec
there's no strong evidence all you have
to do to disprove that simple theory is
take the cooler off and run the card with
only a GPU cooler it's not like it just
instantly artifacts at a certain
temperature so maybe degradation from
thermals but not like an instant
TjMax type of thing which is what
seemed to be suggested so anyway yeah we
sent all the cards back thank you for
loaning them to us to those of you who
did loan in your cards hopefully you get
it sorted soon if you do not obviously
you have my information I've got yours
let me know if I can help you expedite
anything but I think you all should be
pretty taken care of at this point so
and we'll get you some some merch for
sending us the cards
so thank you for watching subscribe for more
go to store.gamersnexus.net if you
did not send a card and you would like to
buy something instead of getting it like
the others are and
patreon.com/gamersnexus to get some
behind-the-scenes videos I'll see you
all next time