RTX 2080 Ti Failure Analysis: Artifacting, Thermals, Black Screens, & Defects

The RTX 2080 Ti failures aren't as widespread as they might have seemed from initial postings, but they are absolutely real. When discussing internally whether we thought the issue of artifacting and dying RTX cards had been blown out of proportion by the internet, we had two frames of mind. On one side, the level of attention did seem disproportionate to the size of the issue, particularly as RMA rates are within the norm. On the other side, nothing was blown out of proportion for people who spent $1,250 and received a brick in return. For those affected buyers, the artifacting is absolutely a real issue and it deserves real attention. It's a very expensive brick. Partners, though, are still often under 1% and retailers are under 3.5% for RMA rates, which is pretty standard and good overall, so we have to keep both of those aspects in mind.

Before that, this video is brought to you by the Thermaltake Level 20 VT micro-ATX case. The Level 20 VT takes the high quality of the Level 20 design, makes it more affordable, and shrinks it down to a micro-ATX form factor at that. With fully modular paneling, it's possible to rearrange this case into whatever configuration you prefer. For a micro-ATX case that can be a discussion piece in a home theater system, click the link in the description below.

Quick overview of what we're doing, then. We already have a live stream that we did where we went through about maybe half a dozen or so RTX 2080 Tis from our viewers, where we tested to see: do they artifact? A lot of them did artifact, spectacularly, during the stream, but not all of them do, and some of the issues on the cards we got in were blue screens, which were resolved largely with the latest driver update. So the thing to remember here, with the issue of seeing Reddit threads all over the place of dying, dead, BSODing 2080 Tis, is that there were two separate issues, and one of them is the blue screens: the driver compatibility, the software, which is getting
improved, and a lot of that's been fixed, but not all of it. In particular, things with specific G-Sync monitors, for example, were very bad with BSODs, but a lot of that's getting addressed separately. That's not a hardware-level issue. The hardware-level issue, we are pretty confident anyway, is an artifacting issue that you've likely also seen. Some people call it "Space Invaders" artifacting, specifically; it's got the kind of XD icons that fit in a perfect square if you were to overlap them. So those are the two separate issues we were dealing with here. A couple of cards we got in from viewers just had the BSOD issue, which was at least solved on our end, and a lot of the rest were artifacting.

We had over 10 devices total, which, although a small sample size in the grand scheme of things, is pretty good when considering what it is: it's a $1,300 card, and we asked a bunch of people who don't know us to send it to us on loan. So that's a pretty good sample size, and we were able to validate the issues. We dug through the cards to try to figure out if there's any common thread there, and primarily what we walk away with is two things. One, this isn't a hugely wide-reaching issue. As far as we can tell from all the RMA numbers we've gotten from manufacturers and from retailers, this really doesn't seem like it's affected a whole lot of 2080 Tis. But it definitely affected people, so even though it's perhaps not entirely deserving of as much attention as it got initially, it's absolutely deserving of attention. If you're one of the people who bought one of these things, the Ti version especially, and it doesn't work, and Nvidia might be slow at getting back to you (and they were in a lot of cases with customers that we spoke with), then yes, it deserves all the attention it can get, because that's a really expensive thing to not work. And it has to work; that's the whole idea with guaranteed merchantability of a product. So anyway, we're looking into
that, and there's a limited amount of what we can do. We set the wheels turning with some others in the industry, so that's helpful, and we expedited a lot of the RMAs for people who had board partner cards, things like that. But this is more of a focus on what the problem is not, rather than what it is.

A lot of the early speculation was things like thermals. The problem with this is that one image goes out, and rather than think about the image or read what the author might have said, a lot of the comments online just instantly jump to "it's a thermal problem, VRAM is overheating, oh my god." That's not the right approach to a problem like this. That's how it starts snowballing way out of control, because if you posted that, you didn't know what you were talking about. There are problems, but unless someone has confirmed that VRAM thermals are an issue, you really can't just go saying they are an issue and tweeting at everybody on the planet, because it needs to be validated. So we're going to go through that. We did some VRAM testing. Now, unlike those people, we're not going to say that 100% for sure VRAM thermals aren't a problem, but it doesn't look quite so bad as some of those discussions might have led you to believe, which means that there's plenty of room elsewhere in the product for the problem of artifacting to manifest itself. It's just not necessarily in thermals, which was the major speculative talking point for the last couple of weeks or so.

We had attached two thermocouples to the memory modules in the review originally, and we'll reproduce that chart on the screen now, and we never saw any thermal issues with the memory. That said, it's always possible that some cards have thermal problems where others don't, so we took a week to attach thermocouples all over the cards that we received from viewers. Here's a list of FurMark and Fire Strike thermal results for the memory modules. We measured the two hottest memory modules we could
find: one was between the GPU and the capacitor bank for the VRM, and the other was near the PCIe slot at the bottom. We determined that these two locations are the hottest by probing each module individually on two cards and then committing to the two primary modules to test in the worst-case scenario.

On this chart, card I1 failed in FurMark from the usual modes of failure, typically artifacting, freezing, and then crashing. After only about 10 minutes, I1 failed, so it never reached steady state, but what we can do is look at the thermals to see if those might have caused a premature crash. This is the worst card. It had memory measuring at 84 degrees for the module near the chokes and 79 degrees for the module near the PCIe slot. We could not find a card worse than this one for thermals overall. The spec calls for temperature to be under about 95 degrees Celsius for GDDR6. We're measuring the external package temperature here, so it is possible that the internal die is at its maximum thermal value, or at least the one that the spec calls for, but we can't really be a hundred percent sure. Realistically, from experience, the delta is likely closer to five degrees between the external and the internal part of the package, so it should be within spec, if barely.

Even still, we wanted to see what would happen if we took one of our known-good review samples that has never failed and tortured it without any heatsink or airflow on the memory. Theoretically, if hitting some magical maximum temperature number triggers an instant freeze or artifacting fits, then our card without any heatsinks should encounter that at the same temperature. This next move is inadvisable. If you remember our hybrid card, and we'll show some shots of it, the only way it really worked well was to add a fan or two to blast the PCB. Our CLC only covered the GPU; there was no additional board cooling. We redeployed the hybrid without any cooling on the memory whatsoever, finding that it'll continually ramp temperature
until you feel uncomfortable. We ended up halting the test right at around 100 degrees on the thermocouple for the hottest module, which is probably about 105 to maybe 110 internally. This is obviously not good for the card, but the point was to see if a known-good card could be made to instantly start artifacting or crashing as a result of only high temperatures. That's at least what a lot of internet conjecture suggested over the past couple of weeks, so this testing looked at that. This card fared well. It still works to its full overclocked potential today. Again, it's not good for the components, but it didn't artifact, flicker, freeze, or crash, and we stopped the test; it didn't fail on its own. There does not appear to be a thermal shutdown that triggers from high memory temperatures, at least not at 100 to 105 degrees, maybe higher than that. It's outside of spec, but there was no mode of failure here for our known-good card.

Back to the previous chart momentarily. Other memory modules ended up at 65 degrees for card E1, which failed nearly instantly. It did not have time for any components to exceed the thermal spec and overheat. E1 just crashed too fast, likely from other issues. This is not thermally related in this instance, unless it's perhaps a component on the board we didn't measure. There are a lot of them, but we measured the hottest ones and the ones that are the most susceptible to thermal crashes, if they were to encounter one. Card F1 couldn't even finish loading the application, and so was never under any meaningful load to push the thermals. Card I1 with Fire Strike pushed 78 degrees on the memory, which is within spec. Card E1 with Fire Strike failed nearly instantly and so never had a chance to get hot; it was around 64 degrees for the modules. The rest of the cards are pretty much in the same area when they didn't fail instantly. They're all comparable. It's just that I1 is the worst case, and that one does not appear to be crashing from thermal-related
issues, even though it is very hot, and definitely hot enough that we would consider RMAing it anyway even if it didn't artifact. But you get the idea. For memory thermals, I1 is the worst here. It's not really acceptable; it's borderline questionable or concerning, but we don't have any reason to believe that the artifacting is caused by the memory temperatures. It's likely something else. Maybe the surface-mount method for the memory was wrong, or something like that. So it looks like the thermals here are not the cause of failure for at least these cards. Even on I1 with FurMark, where thermals were high enough to be concerning, we know that the memory temperature was still within spec. We know because the card still failed even when putting higher-end cooling on it. Even when you drive those temperatures down below what they measured, with extra fans or a waterblock, it still fails the same way at about the same time, and we know in those instances for sure it is well within spec.

Continuing the thermal trend, we took turns measuring inductors and MOSFETs on the same cards. MOSFETs and inductors can take 125 to 150 degrees Celsius, depending on which component it is and whether it has any thermal fail-safes included. Card I1 was again the hottest, at about 72 degrees for the hottest MOSFET while running FurMark. F1 didn't hold long enough to heat up meaningfully, so we can count that as another tick against thermals, at least for the VRM, being the cause of issues. Card E1 in FurMark also failed quickly, never exceeding 54 degrees on its MOSFET. I1 remained hot in Fire Strike, at 71 degrees for the MOSFET, but even so, none of these temperatures are legitimately hot. 71 is hotter than the other cards, but it's still way within spec for the MOSFET. 71 degrees for MOSFET temperature is completely reasonable. There's really no reason to complain about it and no reason to think it's crashing as a result of that. It's higher than the others, but it's still
completely fine. You're looking at 125 degrees as about where you'd start really being concerned. Even the 73-degree value is within spec. Again, there are seemingly no thermal issues with the cards we had.

A lot of people also noted that their backplates were running really hot. We'll note that this is what backplates are supposed to do, since they are heatsinks, and that means they're working. But the Founders Edition backplate does run a bit hotter than most might be used to, so it's reasonable that people would be concerned about it. Sticking a thermocouple to the backside of the hottest memory module, near the PCIe slot, we measured a maximum backside PCB temperature of 75 degrees on I1 in FurMark. This is within spec and not unreasonable, seeing as a PCB is just a giant conductor with shared power planes running through it, and a significant portion of the PCB is copper. It's going to be hot; that's what copper and PCBs do. So it will be hot on the backside. That thermocouple is sandwiched right between the PCB and the backplate, both of which are conductors, and heat is coming from the memory on the other side of the PCB anyway, being sunk through the PCB. 75 is fine. It's certainly warm, yes, but it's not causing problems that we know of.

Lest there be any concerns about testing conditions, here are some ambient thermal numbers for each test, logged second-to-second for the entire test run. We stayed within a range of roughly 22 to 23.5 degrees Celsius for ambient temperature. GPU thermals are also plotted here. Again, the devices that failed nearly instantly never got hot. They didn't reach steady state, and they were just beginning to ramp up when they failed. For the rest, I1 was at 76 degrees, one was at 71 degrees, and all the others were really near that point. Although the FE heatsink is not impressive, there's really no red flag with these thermals, so we can somewhat confidently say that thermals were not the issue with the cards we got.

The next point of
consideration is firmware, as newer units are shipping with firmware revision 90.02.17.00.04 and original cards shipped with 90.02.0B.00.0E. We tried flashing a few cards to newer revisions of firmware that we obtained officially, ultimately finding the same artifacting results that we saw previously. They were not resolved. You can see some of those on the screen now, if you want to see what the artifacting looks like. Once again, firmware updates did not resolve the issue on the cards in our lab.

Following several requests during our live stream of the dying 2080 Ti cards, we also decided to test behavior in Windows versus Linux. This was a good idea, and it helps eliminate one of the biggest possible variables, which is the operating system. We installed Ubuntu 18.04 and Unigine Heaven, and then tested two cards that artifact in Windows against Ubuntu running OpenGL. We tested with driver revisions 410 and 415, and in both instances, on both cards, with both drivers, and with the proprietary driver, we saw artifacting as early as the terminal. We also encountered freezes during the Heaven benchmark run, often within the same time period in which Windows would freeze, as timed on our stream. From our testing, this issue does not appear to be isolated to Windows and does not appear to be isolated to drivers for the cards, and at this point we can start assuming that it's almost certainly a physical, board-level defect.

The next step was to tune frequencies to try to mitigate the artifacting and freezing behavior of the cards. We hosted a multi-hour live stream that included some frequency tuning to try to mitigate these crashes. Most devices seem to degrade over time in general, but we noticed that a few benefited from clock reductions of various sorts. That said, most units we ended up with, by the time we got them, did not exhibit increased stability from intentional frequency throttling. We tried all the combinations and permutations we could think of: downclocking memory, downclocking core, downclocking both simultaneously, negative power offset, positive power offset, mixing the power offsets in both directions with downclocks in both directions, 100 percent fan speeds and no other changes, or 100 percent fan speeds and lots of other changes. We also did power offsets with no other changes, and so on. Ultimately, although a few of the users who sent in their cards noted that these steps could improve stability temporarily for them, we were not able to reproduce this in any widespread fashion. There didn't seem to be a key solution here where we could just constantly downclock the memory and it would always work, or constantly downclock the core. When these things did work, there was no pattern to it, and we can't really draw any firm conclusions.

Just for good measure, we also took apart the cards and did a cursory look over board components. We were really only looking for anything extremely out of place, like a missing thermal pad, poor contact to the thermal pads, burned or damaged components, and so on. There's only so much we can do here. We don't have X-ray scanners and things like that. There was only one device that demonstrated any physical defect, so one defective unit out of all the cards, but it was unrelated to the issue of artifacting, and it's something we may discuss later. It's not related to this particular problem, so we're going to skip over it until we can understand what went wrong with that specific card. The thermal pad contact on all of the cards was fine. We can see indentations, so there's clear contact being made to the pads. We did some pressure paper testing, and it was fine. There's no visible damage to the components that we could see, although most component damage would not be visible; that is just the most obvious thing to rule out. Any defect is going to be something we don't have the tools or knowledge to see, like something inside the board or inside the
silicon. So really, what we've done primarily here is rule out a few things. Thermals, we strongly suspect, are not the issue. Now, we're leaving room, of course, for that to be the problem, because with a sample size of a bit over 10 units, it's hard to draw a 100% sure conclusion. There could absolutely be thermal problems, and we had one card that was genuinely running very hot for VRAM thermals (VRAM, not VRM). The problem was that when we ran cards without heatsinks at all, they didn't instantly artifact if they were known good. So it's not like you just hit 95 degrees and it's instantly artifacting, boom. It's potentially degradation, if it is thermal related, but we don't really have any reason to suspect that it is. And it's not something as simple as missing thermal pads. We thought perhaps the different types of thermal pads used might have varying heights, but that was not the case. In fact, all these devices use the same thermal pads in the same places, so that wasn't a problem.

Frequency tuning did not seem to have a patterned result where you can say for sure that dropping memory clock would work every time, because sometimes it just didn't; sometimes dropping core clock worked instead. Windows did not seem to be the issue, because we used Ubuntu Linux. Firmware also didn't appear to be the issue. And we spoke with some people in the industry as well who might be closer to this than we are.

The takeaway here is pretty simple. It seems like it's either a manufacturing-level issue with some kind of assembly problem, or it's some kind of in-silicon problem. Whether that silicon is GPU or memory, we're not sure, but it does not appear to be any of the other things that we walked through today. There is, of course, a possibility that it could be those things, but there is strong enough evidence here that it likely isn't, and that it's another issue that is more difficult to troubleshoot or solve for Nvidia. But it appears that they are now aware of the issue, because they did post something
on their forums talking about "test escapes," as they called them: cards that were dying and just shouldn't have made it out of the lab. I found it particularly comical, because all of these cards had a "QC passed" sticker on them, and all you had to do in most cases was plug them in and they would not pass. So perhaps the QC pass line should be revamped to include a plug-it-in step, not just a visual inspection step, as it may be right now.

Either way, hopefully that gives you some kind of conclusion to some different ends of the storylines. To tie a couple of those up: thermals was the big one we wanted to talk about, because people see one image and jump to the conclusion that it's overheating, that's the problem, that's it. If anything, this demonstrates that you really can't do that. You've got to step back, look at the greater picture, and allow for the fact that it's within spec and there's no strong evidence. All you have to do to disprove that simple theory is take the cooler off and run the card with only a GPU cooler. It's not like it just instantly artifacts at a certain temperature. So maybe degradation from thermals, but not an instant TjMax type of thing, which is what seemed to be suggested.

So anyway, we sent all the cards back, so thank you for loaning them to us, to those of you who did loan your cards. Hopefully you get it sorted soon. If you do not, obviously you have my information and I've got yours; let me know if I can help you expedite anything, but I think you all should be pretty well taken care of at this point. And we'll get you some merch for sending us the cards. Thank you for watching. Subscribe for more. Go to store.gamersnexus.net if you did not send a card and you would like to buy something instead of getting it like the others are, and patreon.com/gamersnexus to get some behind-the-scenes videos. I'll see you all next time.
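
The memory-temperature argument in the video comes down to simple arithmetic: add an assumed package-to-die delta to the thermocouple reading and compare the result against the roughly 95 degrees Celsius figure quoted for GDDR6. A minimal Python sketch follows, using only numbers quoted in the transcript. The function and constant names are hypothetical, and the 5-degree delta is the presenter's estimate from experience, not a datasheet value.

```python
# Figures quoted in the video; all names here are illustrative only.
GDDR6_SPEC_LIMIT_C = 95.0      # "under about 95 degrees Celsius for GDDR6"
PACKAGE_TO_DIE_DELTA_C = 5.0   # presenter's rough estimate, an assumption


def estimate_die_temp(package_temp_c, delta_c=PACKAGE_TO_DIE_DELTA_C):
    """Estimate internal die temperature from an external thermocouple reading."""
    return package_temp_c + delta_c


def spec_margin(package_temp_c, limit_c=GDDR6_SPEC_LIMIT_C):
    """Remaining headroom to the quoted limit (negative means over the limit)."""
    return limit_c - estimate_die_temp(package_temp_c)


# Thermocouple readings mentioned in the transcript (degrees Celsius).
readings = {
    "I1 FurMark, module near chokes": 84.0,
    "I1 FurMark, module near PCIe slot": 79.0,
    "known-good card, no memory cooling": 100.0,
}

for label, temp in readings.items():
    margin = spec_margin(temp)
    status = "within spec" if margin >= 0 else "over spec"
    print(f"{label}: die ~{estimate_die_temp(temp):.0f} C, "
          f"margin {margin:+.0f} C ({status})")
```

Under these assumptions, the hottest failing card keeps a few degrees of headroom, while the known-good card ran past the limit without artifacting, which is the comparison the video relies on when arguing that temperature alone does not trigger the failure.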

Gadgetory

2018-11-16