Gadgetory



Does Benchmark Duration Matter? One Year of Testing

2018-02-01
The short answer to the headline of this video is "sometimes," but it's more complicated than just FPS over time. To really address this question, we first have to explain the oddity that is FPS, frames per second: it's inherently an average. Frames per second is a collection of data over a period of time, yet it's presented on the spot; every millisecond, you get an FPS number. If I tell you something has a variable framerate but is presently at 60 FPS, what does that really mean? If we look at the framerate at any given millisecond, given that framerate is an average over a period of time, we have to acknowledge that deriving spot measurements in frames per second is inherently flawed. All this stated, the industry has accepted frames per second as a rating measure for game performance, and it's one of the most user-friendly ways to convey the actual underlying metric: frame time, or frame-to-frame intervals, measured in milliseconds. Today we're publicly releasing some internal test data that we've collected over the past year as we work to refresh our test methodology for 2018.

Before that: this video is brought to you by Thermal Grizzly, makers of the Conductonaut liquid metal that we recently used to drop 20 degrees off of our temperatures. Thermal Grizzly also makes traditional thermal compounds that we use on top of the IHS, like the Kryonaut and Hydronaut pastes. Learn more at the link below.

Before we publish benchmark data for games, we always research them heavily. This involves a long period of looking for flaws in potential testing methodology (and inevitably you find some later, then revise and improve it), but the immediate pre-publication goals are to determine things like: how long do we need to benchmark for accuracy? What settings are the most fair, or the most agnostic to all vendors? We also look at other aspects, like the load level of particular areas of the game, which areas are more intensive than others, and best- and worst-case performance scenarios for the game.
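To illustrate that point about FPS being an average, here is a minimal sketch; the frame-time list is hypothetical, not our test data:

```python
# Minimal sketch: FPS is an average over frame-to-frame intervals (frame
# times, in milliseconds). The interval list below is hypothetical.
frametimes_ms = [16.7, 16.7, 8.0, 40.0, 16.7, 16.7]

def avg_fps(intervals_ms):
    """Average FPS over a window: frames rendered / elapsed seconds."""
    return len(intervals_ms) / (sum(intervals_ms) / 1000.0)

def spot_fps(interval_ms):
    """The 'instantaneous' FPS implied by a single frame interval."""
    return 1000.0 / interval_ms

print(f"window average: {avg_fps(frametimes_ms):.1f} FPS")
print("per-frame spot FPS:", [round(spot_fps(ft), 1) for ft in frametimes_ms])
```

Here the window averages out to roughly 52 FPS even though individual intervals imply anywhere from 25 to 125 FPS, which is why spot FPS readings are misleading and frame times are the real underlying metric.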
We also consider the expected user scenarios, of course, so you have to balance all of those things. We research all of this prior to publishing data, but we tend to keep most of it internal, or at least a good amount of it; we do keep test methodology sections in the articles. Going forward, though, the current plan is, as we continue to iterate our methodology year over year (we tend to do that in January and February), to start releasing more of the behind-the-scenes information that drove the previous year's methods. That way, you can get a look at what we've been doing and how we're hoping to advance our testing in the year to come, because computers are complicated and there's always room to improve, and it's our goal to keep doing that.

Let's get into some examples of times we've published internal research. With Destiny 2's beta, we tested various parts of the game, including test durations spanning from 30 seconds to 20 minutes at a time. This allowed us to determine that most parts of the intro campaign performed equivalently, while a select few sections were highly demanding of the system. This also included multiplayer and single-player benchmarking of various durations, to determine what one could expect from both types of gameplay. We did this for games like For Honor as well, where we determined that the built-in benchmark wasn't at all representative of real-world gameplay, something that pushed us away from using the built-in option. We did it again for Mass Effect: Andromeda, where we discovered that, with early drivers on AMD cards, the game would stutter on the first test pass through the test area. The result was that we needed to include more test passes than normal, then present data both with and without the stutter included, because the stutter drags down the performance average significantly, but it's still important to show that information. This was something later resolved by AMD.

The point is that we do this for each game, and we often discover anomalous behaviors for each GPU vendor, for particular regions of the game, or even with particular graphics settings. Another example of this test data was when we discovered that dynamic reflections had the most significant impact in Overwatch, where frame time charts plotted differences jumping between 10 milliseconds and 16 milliseconds on the single tested device. We also tested how the graphics settings scale on a particular set of hardware, giving us an understanding of where devices may gain an unexpected lead over competing devices. With Destiny 2's beta again, for a recent example, this allowed us to determine that NVIDIA had a significant advantage only under the Highest settings, but that its advantage faded away when dropped down to High. AMD fixed this upon launch of the game, to the point where it was climbing back to compete with NVIDIA directly at Highest, and NVIDIA later leveraged the same thing to improve its own performance again, specifically under Highest settings. The point here is that there is significant performance impact between these two settings tiers, and knowing what causes it, why, and how to test it is important for reviews. We studied this again in Watch Dogs 2, where we built a CPU scalability chart across settings and framerates so that we could decide at what point we're hitting either a CPU or a GPU bottleneck depending on the game's graphics settings, all useful for our CPU and GPU benchmarks.

All of this is to say that it's important to work hard to understand what you're testing, which we do, and then work harder to create charts demonstrating why we test the scenarios that we do. The next big concern, though, is repeatability of the tests and how accurate they are test to test. This starts getting into standard deviation and test variance, which we'll have a separate video on soon, but with a benchmark you've really only got two options.
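The Mass Effect: Andromeda case can be sketched in a few lines. The pass-average FPS values below are hypothetical, but they show why we present data both with and without the anomalous first pass:

```python
# Sketch of the Mass Effect: Andromeda scenario: an anomalous first pass
# (driver stutter) drags the overall average down, so the average is
# reported both with and without it. Pass-average FPS values are hypothetical.
passes = [41.0, 55.2, 55.8, 54.9, 55.5]  # first pass hit by the stutter

avg_all = sum(passes) / len(passes)
avg_excl_first = sum(passes[1:]) / (len(passes) - 1)

print(f"with stutter:    {avg_all:.1f} FPS")
print(f"without stutter: {avg_excl_first:.1f} FPS")
```

With this hypothetical data, a single stuttering pass costs almost 3 FPS off the reported average, which is why more passes were run and both numbers were shown.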
You have highly repeatable (potentially verging on synthetic) or highly realistic. If you go for highly realistic, you can really only reasonably test maybe two devices in a head-to-head scenario, because if you're playing a game for a significant period of time, there's just too much variance, and it's not a good test at the end of the day. It's also not realistic to benchmark a real-world scenario with the exact same settings on, say, 14-plus devices with multiple test passes each, because other system variables can interfere as well. That's not to say that either highly repeatable or highly realistic is the best or the worst method; they're both very important, but it's important to illustrate for viewers, in our case, when we're using one versus the other and why. This is why synthetics exist and why game benchmarks exist: they achieve different things, or similar things in different ways, and that's the important part about them.

So our approach to benchmarking theory is to collect large data sets with accurate numbers that we can get again and again. Ultimately, what we care about is device scalability: the difference between device A and device B as a percentage, rather than the hard, absolute FPS difference. For example, it's kind of irrelevant whether you're hitting 70 or 75 FPS if all you're ultimately comparing is the relative performance versus the other device; if the other device is also doing 70 to 75 FPS, then you have 1-to-1 scaling, so it doesn't really matter. The time absolute FPS does matter is when you're trying to determine (or we think you're trying to determine) whether a particular device can play a game at a specific framerate, i.e., 4K, Highest settings, at 60 FPS. That's a very specific goal, and for those cases we publish standalone game benchmark guides, whereas for reviews of computer hardware, like video cards, we focus more on relative performance than absolute performance, though we do provide both sets of data.
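The relative-scaling comparison described above is simple to express. The FPS values here are hypothetical stand-ins, not our published data:

```python
# Sketch of the relative-scaling comparison: one device's performance
# expressed as a percentage of a baseline device, per test duration.
# FPS values are hypothetical stand-ins, not published data.
def relative_perf(device_fps, baseline_fps):
    """Device performance as a percentage of a baseline device."""
    return 100.0 * device_fps / baseline_fps

# (mid-range card, baseline card) average FPS at two test durations
runs = {"30s": (61.0, 84.0), "90s": (60.5, 87.8)}
for duration, (dev, base) in runs.items():
    print(duration, round(relative_perf(dev, base), 1))
```

If both devices scale equivalently with duration, the percentage stays roughly flat even as the absolute FPS moves, which is exactly why relative scaling is the metric that matters for hardware reviews.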
Now let's get into answering the question of what the optimal duration is for a benchmark, in a few different games. Please note that this data is not representative of every game or every device all the time; it represents the games and devices we tested here. It does, however, illustrate the point pretty well. We normally keep most of this information private, but now that we're completely revising our testing methodology for 2018 and moving forward with what we think are better methods, we thought it'd be a good time to share. If you're a content creator and you use this information, we ask that you mention Gamers Nexus; it was a lot of work.

All of these tests were conducted a minimum of four times and averaged, with test durations ranging from 30 seconds to 5 minutes depending on the game. Error bars are present to display the standard deviation between all test runs, and we have more information in the article linked in the description below. For the sake of timeliness and keeping the video relatively brief: we have a lot more information that we've collected over the past year or two at this point, but this set of data pretty much represents the whole, though there are anomalies, which we'll talk about later. As for the standard deviation piece, we'll get into test variance and confidence intervals versus repeated test results more in that upcoming video.

We're starting with the oldest benchmark title, as it's the easiest to configure for multiple test durations. We've historically proven Metro to be the single most consistent benchmark title under the right settings; it's not always, and it depends on whether you know what you're doing. Again, all the test methodology and components are in the article in the description below. Starting with only the GTX 1070 Gaming X, we see that the average FPS sits at 84 for a set of four 30-second passes. An 85 FPS average is what we get for four 60-second test passes, within test variance and error, and 87.8 is what we get for 90 seconds of testing, which exits margin of error and becomes a performance increase of 4.5% over baseline. What's relevant here is how this compares relatively to the RX 580, which we'll show next; if both scale equivalently over the test durations and all we care about is relative performance between devices, then the difference is irrelevant. Average FPS hovers at 87 for a 120-second run and 88 for a 150-second run, still on the GTX 1070, and overall this is exceptionally consistent: our total range is 4 FPS, for a total bottom-to-top increase of 4.8%. For frame times, the 1% and 0.1% lows are also relatively equal and largely within test-to-test variance.

Moving on to the RX 580 chart: this showed performance between 59 and 61 FPS throughout all tests, generally sitting at around 60 FPS. Vsync, of course, is disabled, though these numbers might lead you to believe otherwise; it just so happened to average 60. The card is taxed enough that the small performance swings exhibited by the GTX 1070 are not shown here. Here's a chart of relative performance using the average FPS at each iteration: the RX 580 is roughly equal to 68 to 69 percent of the GTX 1070's performance when tested at 60-, 90-, 120-, and 150-second durations, maintaining the same 68 to 69 percent throughout. The RX 580 is equal to 72 percent of the GTX 1070 when tested for the shorter duration, a result of operating 0.8 FPS faster on the 580 and a few percent slower on the 1070. Some of this is within variance, but minor differences do begin to emerge.

We next tested GTA V, for which we use scripted automation to complete the final plane scene, with a minimum of four passes per test. With the GTX 1070, we observed average framerates ranging from 104 to 112, a wider range than the previous test; from 30 seconds to 90 seconds, we're tracking a 7.5 percent performance uplift at 90 seconds. We observed this in 2015 as well, back when we started GTA V testing, and made an active decision to limit our test passes to 30 seconds for this title. If you look at the GTA V benchmark scene, once the plane nears the town, the framerate climbs and the load on the devices is no longer as high. Because we also tracked GTA V's unique performance behavior upon hitting 187.5 FPS (discussed in two previous videos, where we show severe stuttering on some CPUs), we wanted to limit testing to a more stressful and consistent part of the benchmark. As for lows, those remain relatively consistent between the two longer test passes; the shorter test pass exhibits better 0.1% low performance, but this difference is largely within test variance.

The RX 580 exhibits almost identical behavior to the 1070: we're at 73.4 FPS for the shorter test and 77 for the two longer tests, and those also exhibit similar behavior. General consistency is found here, within some variance, though minor differences do begin to emerge. That said, I used the words "almost identical" to the 1070; of course, that's not true in raw framerate, but what we care about here, again, is relative performance in terms of percentages. The RX 580 maintains almost precisely 70% of the performance of the 1070 across all three tests. In this regard, any of the three test durations would be valid for comparing these two devices to one another; relative performance is in lockstep with the GTX 1070, which means we derive the same conclusion of relative value at any of the three durations. The only difference is the absolute FPS number, which our publication considers to be of lesser value for purposes of reviewing computer hardware, despite considering it the highest value for standalone game benchmarking; we're trying to achieve two different things in those two different types of content.

Overwatch is next. We wrote an entire in-depth graphics optimization guide for this game, where we studied various performance behaviors versus graphics settings, something we can show on screen. That testing is also where we decided to move all testing for this game, including benchmarks for 2017, over to a five-minute test duration. We collect 10 times as much data per pass as in our more controlled built-in benchmarks, and this is strictly due to the huge variation that multiplayer games are subjected to. We also benchmark single-player bot matches, something we've previously found (again, something we can show on screen from our Overwatch graphics optimization guide) to be equal in performance to online multiplayer matches. They are effectively the same in performance, but there's far greater reliability and consistency with bot matches, because it's easier to stay alive and stay where you want to be. Our reasoning for going to 5-minute passes will be made clearer once we publish the standard deviation video.

At 30 seconds on the 1070, we observed 74 FPS averages, with 60 FPS 1% and 53 FPS 0.1% lows. We tracked marginally lower performance at 60 seconds, but our confidence interval is lower than average due to the variance in this game, so we can't confidently state whether the differences are significant. At 5 minutes, our confidence is high and our data looks good: a 72 FPS average, with 58 and 53 for the lows. Our RX 580 outputs similar behavior. The 30-second tests, as also shown on the 1070, tend to output slightly higher performance metrics due to a more even split between non-combat and combat, with non-combat rising higher. We believe this is unrealistic for Overwatch, as the game's most important moments revolve around combat. For this reason, our five-minute tests are conducted from the time the doors open through combat, and we remain alive and in combat for the entire five-minute duration; this gives the most important data. Relatively, the 580 maintains 65 to 67 percent of the 1070 in this test, and remember, the content isn't about the 1070 versus the 580; that's not what's relevant here. The two devices are just being used to illustrate scalability over duration. Although performance is ultimately similar in terms of relative performance, our confidence interval is significantly lower for the shorter test passes in Overwatch, so we opted for five minutes.

Ashes of the Singularity is next, and for this one we observe higher framerates over 30 seconds than over 60 and 90 seconds, resulting in a performance disparity of about 10.7 percent, the greatest we've yet observed. However, once again, we need to determine whether this has significance when looking at GPUs in a relative fashion rather than charting absolute FPS alone. Charted alone, the RX 580 looks about the same: 38 FPS for the shorter test and 34 for the longer tests, with lows dead-accurate. Relatively, however, the GPUs are identical: we see that the RX 580 maintains 70% of the 1070's performance in all three tests. This isn't to say, once again, that the cards are identical to each other; it's that the test passes at the different durations are identical with regard to relative scaling, so the 580 always equals about the same percentage of the 1070's performance. That pretty much shows why we care about relative versus absolute performance for purposes of reviewing a piece of hardware, while absolute performance is what we care about for purposes of determining which piece of hardware is best for playing game A or B at specific settings X and specific framerate Y, if you followed all of that.

Sniper Elite 4 is the next one. In this game, we see the 1070 operating at about 52 to 53 FPS on average, with the higher value stemming from our combat test during the 90-second run. This is also a highly optimized DirectX 12 game, so we wanted to include it. The other two tests, at 30 and 60 seconds, were conducted by running around the village, a geometrically complex area, without any combat, while the 90-second test included a lot of combat. The RX 580 exhibited more consistent performance, as it was more pinned for resources, but the relative performance gives us values of 74% to 78% of the 1070's performance. This is one of the wider ranges, and when we started testing Sniper a year ago, we chose to focus on walking through the more geometrically complex scenes rather than introducing the variance of combat, because it gave us higher confidence, and because, ultimately, if we're at 76% rather than 74% or 78% of the 1070 baseline, we're really looking at about the same thing at the end of the day.

We also tested Doom, For Honor, and a couple of other things in addition to these games; you can find links to those in the article below if you're curious to see more, but the short of it is that we saw about the same thing: performance scaling at different durations was roughly the same.

So why, then, choose one duration versus another if the values are roughly the same in terms of relative scaling? Well, there are occasionally games where that's not the case. For example, let's say we're using The Witcher 3, or some other game that has a specific element that one GPU vendor does not play well with while the other does. In something like The Witcher 3, you might have tessellation; let's go back to the original launch, before any patches came out. If you have a lot of tessellation and hair in that particular game, it's likely that one GPU vendor (NVIDIA) would handle it better than the other (AMD); it was an NVIDIA technology generating that hair, after all, so it makes sense. So if you're testing a scene of 30 seconds versus 90, and the first 30 seconds involves a lot of HairWorks, or a lot of anything that one of AMD or NVIDIA is good at but the other is not, while the next 60 seconds is neutral and agnostic in its graphics requirements, that's where you'll see a bigger difference in terms of relative performance. None of the games shown here have that kind of behavior, but we've encountered it, and that's why we always test behind the scenes before deciding how we want to go about running the next hundred hours of tests in that game over the course of the year. And it is hundreds of hours of testing per year, so it's pretty important to get the preliminary testing done, figure out how to do it, and find where it's fairest to each GPU or CPU vendor.

Of course, the nature of this kind of work is that computers are highly complex, and even if you're someone way above our level, like a CPU or GPU architect, you still aren't going to know every aspect of the behavior of the device you helped architect, because that's just the nature of software interacting with hardware; there are so many variables that it's unreasonable to assume anyone knows all of them. So we work on advancing these methods year over year, and since that happens around now each year, we're working on our next set of test methods for 2018, which means we can reveal some of the stuff we researched for 2017 in a more transparent fashion. Hopefully that's interesting to all of you; we're pretty excited for what's coming up next, and you'll see it implemented in reviews sometime this year.

Every game is different. What you saw here showed relative performance staying roughly the same, in most but not all of these games, at 30 to 150 seconds; that does not mean every game will be like that, so we test this on a game-by-game basis. It's not like we just collect this data once and then set a time scale for every single game ever; it's done game by game, because not all games are the same, and not all hardware treats the games the same. Then you also have confidence interval concerns in games like Overwatch, where, yes, 30 to 60 seconds may produce roughly the same relative performance between two devices as five minutes, but when you're testing, say, 14 devices with multiple passes each, and you have a decent amount of variance introduced by multiplayer, whether that's represented in the data or not, it impacts the tester's confidence interval. It's better for us to just lengthen the duration and mitigate the potential for bad data, or data that would cause us to rerun dozens of hours of testing.

So it's all done based on min-maxing how many tests you can fit for a given game in a given period of time, in order to publish a review while remaining profitable and also making sure you remain accurate. Being able to fit a million tests in is no good if they're inaccurate, but inversely, if you go overboard to a point where you're collecting data that no longer establishes a real difference because you're trying to be hyper-accurate, it becomes impossible to fit the devices that need to be tested into the period of time required to remain in business. That's the trade-off, and that's why we invest so much time behind the scenes testing everything prior to committing to a methodology that ends up going public: it's important not to have to redo everything every three months. We try to redo the core methodology on a yearly basis, then rerun tests as new drivers and game updates come out, so we don't have to redesign the entire methodology and relearn how to do it; we just rerun the tests on the new stuff.

That's it for this one. We'll have more in this series coming out; I think we're going to call it "Bench Theory" or something like that, so keep an eye out for it. We'll make a playlist or something, and subscribe for more. As always, if you're interested in this type of thing, please leave a comment below discussing any game requests you have for us to test next year, because now is the time to get those requests in if you want us to use a specific game for testing CPUs or GPUs. Go to patreon.com/gamersnexus to help us out; as always, that helps fund this type of research. Thank you for watching; I'll see you all next time.
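As a footnote to the standard deviation and confidence interval discussion above, here is a minimal sketch of the underlying statistics. The four pass-average FPS values are hypothetical, and the t-value assumes a two-tailed 95% interval for four passes (3 degrees of freedom):

```python
# Sketch: standard deviation (the error bars) and a rough 95% confidence
# interval for the mean across repeated test passes. Pass averages are
# hypothetical; t = 3.182 is the two-tailed 95% critical value for n = 4.
import statistics

passes = [72.1, 71.4, 73.0, 72.5]  # average FPS of each test pass
mean = statistics.mean(passes)
stdev = statistics.stdev(passes)   # sample standard deviation
t95 = 3.182
half_width = t95 * stdev / len(passes) ** 0.5

print(f"mean={mean:.2f} FPS, stdev={stdev:.2f}")
print(f"95% CI: ({mean - half_width:.2f}, {mean + half_width:.2f})")
```

A wider interval, as with short multiplayer passes, means less confidence that two devices' means actually differ, which is the rationale given above for lengthening the Overwatch passes to five minutes.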