Does Benchmark Duration Matter? One Year of Testing
2018-02-01
the short answer to the headline of this
video is sometimes but it's more
complicated than just FPS over time to
really address this question we have to
first explain the oddity that is FPS
frames per second it's inherently an
average frames per second is a
collection of data over a period of time
however it's presented on the spot every
millisecond you get an FPS number if I
tell you something is at a variable
framerate but is presently at 60 FPS what
does that really mean if we look at the
framerate at any given millisecond given
that framerate is an average of a period
of time we have to acknowledge that
deriving spot measurements in frames per
second is inherently flawed all this
stated the industry has accepted frames
per second as a rating measure for
performance of games and it's one of the
most user-friendly means to convey what
the actual underlying metric is frame
time or frame to frame intervals
measured in milliseconds today we're
publicly releasing some internal test
data that we've collected over the past
year as we work to refresh our test
methodology for 2018 before that this
video is brought to you by Thermal
Grizzly makers of the Conductonaut
liquid metal that we recently used to
drop 20 degrees off of our temperatures
Thermal Grizzly also makes traditional
thermal compounds we use on top of the
IHS like Kryonaut and Hydronaut pastes
learn more at the link below before we
publish benchmark data for games we
always research them heavily this
involves a long period of looking for
flaws in potential testing methodology
and inevitably you find some later and
revise it and improve it but the
immediate pre-publication goals are to
determine things like how long do we
need to benchmark for accuracy what
settings should we use that are the most
fair or the most agnostic to all vendors
and other aspects like load level of
particular areas of the game which ones
are more intensive than others and best
and worst-case performance scenarios for
the game as well and of course the
expected user scenarios as well so you
have to balance all those things we
research all this stuff prior to
publishing data but we tend to keep most
of it internal only at least a good
amount of it we do keep test methodology
sections in the articles that you can read
going forward though the current plan is
as we continue to iterate on our
methodology year over year which we tend to do
in January or February to start releasing
more of the behind-the-scenes
information that drove our previous
year's methods so that you can get a look
at what we've been doing and how we're
hoping to advance our testing for the
year to come because computers are
complicated and there's always room to
improve so it's our goal to continue to
do that let's get some examples of times
we've published internal research with
Destiny 2's beta we tested various
parts of the game including testing
durations spanning from 30 seconds to 20
minutes at a time this also allowed us
to determine that most parts of the
intro campaign performed equivalently
while a select few sections were highly
demanding of the system this also
included multiplayer benchmarking and
single player benchmarking of various
durations to determine what one could
expect from both multi and single player
gameplay and we also did this for games
like For Honor which we can show where
we determined that the built-in
benchmark wasn't at all representative
of real-world gameplay something that
pushed us away from using the built-in
option we did this again for Mass
Effect Andromeda where we discovered
that with early drivers on AMD cards the
game would stutter on the first test
pass through the test area the result
was that we needed to include more test
passes than normal then present data
both with and without the stutter
included because otherwise the stutter
drags down the performance average
significantly but it's still important
to show that information this is
something that was later resolved by AMD
the point is that we do this for each
game and often discover anomalous
behaviors for each GPU vendor or for
particular regions in the game or even
with particular graphics settings
another example of some of this test
data was when we discovered that dynamic
reflections had the most significant
impact for Overwatch in which we
discovered that frame time charts
plotting the difference jump between 10
milliseconds and 16 milliseconds on the
single tested device we also tested
graphics settings scaling on a
particular set of hardware giving us an
understanding for where devices may gain
an unexpected lead over competing
devices with Destiny 2's beta again
for a recent example this allowed us to
determine that Nvidia
had a significant advantage only when
under the highest settings but that its
advantage faded away when dropped down
to high AMD fixed this upon launch of
the game to the point where it was
really climbing back to compete with
Nvidia directly at highest and Nvidia
later leveraged the same thing to
improve its own performance again
specifically under highest settings the
point here is that there is significant
performance impact between these two and
knowing what causes it and why and
testing it is important for reviews and
we studied this again in Watch Dogs 2
where we demonstrated a CPU scalability
settings chart for framerate so that we
could then decide at what point we're
hitting either a CPU or GPU bottleneck
depending on the graphics settings of
the game all useful for our CPU or GPU
benchmarks all of this is to say that
it's important to work hard to
understand what you're testing which we
do and then work harder to create charts
demonstrating why we test the scenarios
that we do the next big concern though
is repeatability of the tests and how
accurate they are test to test so this
would start getting into standard
deviation and test variance which we'll
have a separate video on coming up soon
but with a benchmark you've really only
got two options you have highly
repeatable and potentially entirely
synthetic or highly realistic if you go
for highly realistic you can really only
reasonably test maybe two devices in a
head-to-head scenario because if you're
playing a game for a significant period
of time there's just too much variance
and it's just not a good test at the end
of the day also it's not realistic to be
able to benchmark a realistic scenario
in the exact same settings on say 14
plus devices with multiple test passes
each time because other system variables
can interfere as well that's not to say
that either highly repeatable or highly
realistic are the best methods or the
worst methods they're both very
important but it's important to
illustrate for viewers in our case when
we're using one versus the other and why
this is why synthetics exist and why
games exist they both achieve different
things or similar things I suppose in a
different way that's the important part
about them so our approach to
benchmarking theory is to collect large
data sets with accurate and
reproducible numbers that we can get again
and again and ultimately what we care
about is device scalability this would
be the difference between device a and B
as a percentage rather than the hard
absolute FPS difference for example it's
kind of irrelevant if you're hitting 70
versus 75 FPS if all you're comparing
ultimately is the relative performance
versus the other device if the other
device is also doing 70 to 75 FPS then
you have a one-to-one so it doesn't really
matter the time it does matter what your
absolute FPS is is if you are trying
to determine or we think you are trying
to determine if a particular device can
play a game at a specific frame rate i.e.
4k highest settings at 60 fps that's a
very specific goal for those we publish
standalone game benchmark guides
whereas for reviews of computer hardware
like video cards we focus more on
relative performance rather than
absolute performance though we do
provide both sets of data so now let's
get into answering the question of what
the optimal duration is for a benchmark
in a few different games please note
that this data is not representative of
every game or every device all the time
it only represents the games and
devices we tested here
however it illustrates the point pretty well
we normally keep most of this information
private but now that we are revising our
testing methodology completely for 2018
and moving forward with what we think
are better methods we thought it'd be a
good time to share if you're a content
creator and you use this information we
ask that you mention Gamers Nexus it
was a lot of work so all these tests
were conducted a minimum of 4 times and
averaged test durations range from 30
seconds to 5 minutes depending on the
game error bars are present to display
standard deviation between all test runs
and we have more information in the
article linked in the description below
for sake of timeliness for the video and
keeping it relatively brief for us we
have a lot more information on this that
we've collected over the past year or
two at this point but this set of data
pretty much represents the whole though
there are anomalies and we'll talk
about that later in the standard
deviation piece where we'll get into more of
test variance and confidence interval
versus repeatability of
a test and its results we're starting
with the oldest benchmark title as it's
the easiest to configure for multiple
test durations we've historically proven
Metro to also be the single most
consistent benchmark title under the
right settings it's not always it
depends on if you know what you're doing
all the test methodology again and the
components are in the article in the
description below starting with only the
GTX 1070 Gaming X we see that the
average FPS sits at 84 for a set of four
30-second passes 85 FPS average is
what we get for four 60-second test passes
within test variance and error and
87.8 FPS is what we get
for 90 seconds of testing which exits
margin of error and becomes a
performance increase of 4.5% over
baseline what's relevant here is how
this compares relatively to the RX 580
and we'll show that next
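
A minimal sketch of the percentage arithmetic behind the uplift figure quoted above, using the three GTX 1070 averages from this passage (84, 85, and 87.8 FPS). The function and variable names are illustrative only and are not part of any tooling described in the video.

```python
# Average FPS for the GTX 1070, keyed by test-pass length in seconds.
# These three figures are the ones quoted above for the first benchmark title.
avg_fps_by_duration = {30: 84.0, 60: 85.0, 90: 87.8}


def percent_change(value: float, baseline: float) -> float:
    """Return how far `value` sits above (+) or below (-) `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


baseline_fps = avg_fps_by_duration[30]  # the shortest pass is treated as the baseline
uplift_by_duration = {
    duration: round(percent_change(fps, baseline_fps), 1)
    for duration, fps in avg_fps_by_duration.items()
}

for duration, uplift in uplift_by_duration.items():
    print(f"{duration:>3} s pass: {uplift:+.1f}% versus the 30 s baseline")
```

The 90-second entry works out to roughly 4.5%, matching the figure stated for that pass.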
if both scale equivalently over both
test durations and all we care about is
relative performance between devices
then the difference is irrelevant
average FPS hovers at 87 for a 120
second run and 88 for a 150 second run
still on the gtx 1070 here and overall
this is exceptionally consistent our
total range is 4 FPS for a total
bottom-to-top increase of 4.8% for frame
times the 1% and 0.1% lows are also
relatively equal and are largely within
test-to-test variance moving on to the
RX 580 chart
this showed performance between 59 and
61 fps throughout all tests generally
sitting at around 60 FPS vsync of course
is disabled though these numbers might
lead you to believe otherwise it's just
that it just so happened to average at
60 the card is taxed enough that these
small performance swings exhibited by
the GTX 1070 are not shown here here's
a chart of relative performance to one
another using the average FPS at each
iteration the RX 580 is roughly equal to
68 to 69 percent of the GTX 1070's
performance when tested at 60 90 120 and
150 second durations it maintains the
same 68 to 69 percent performance of the
1070 the RX 580 is equal to 72 percent
of the GTX 1070 when tested for a
shorter duration a
result of operating 0.8 FPS faster
for the 580 and operating a few percent
slower on the 1070 some of this is
within variance but minor differences do
begin to emerge we next tested GTA 5 for
which we use scripted automation to
complete the final plane scene
a minimum of four times per test with the
gtx 1070 we observed average frame rates
ranging from 104 to 112 a wider range
than the previous test from 30 seconds
to 90 seconds we're tracking a 7.5
percent performance uplift at 90 seconds
and we observed this in 2015 as well
back when we started GTA 5 testing and
made an active decision to limit our
test passes to 30 seconds for this title
with this benchmark if you look at the
GTA 5 benchmark scene once the plane
nears the town frame rate climbs and
load on the devices is no longer as
high because we also tracked GTA 5's
unique performance behavior upon hitting
187.5 FPS discussed in two previous
videos where we show severe stuttering
on some CPUs we wanted to limit testing
to a more stressful and consistent part
of the benchmark as for lows those
remain relatively consistent between the
two longer test passes the shorter test
pass exhibits better 0.1% low
performance but this difference is
largely within test variance the RX 580
exhibits almost identical performance to
the 1070 we're at 73.4 FPS for the
shorter test 77 for the two longer tests
and those also exhibit similar behavior
general consistency is found here
overall with some variance
dictating emergent differences that said
I used the words almost identical in
performance to the 1070 of course that's
not true in raw framerate but what we
care about again here is relative
performance in terms of percentages the
RX 580 maintains almost precisely 70%
of the performance of the 1070 across
all three tests in this regard any of
these three test patterns of the three
different durations would be valid for
comparing these two devices to one
another relative performance is in
lockstep with the GTX 1070 which means
that we derive the same conclusion of
relative value at any of the three
durations the only difference is the
absolute FPS number which our
publication considers to be a lesser
value for purposes of review
of computer hardware despite considering
it the highest value for standalone game
benchmarking we are trying to achieve
two different things in those two
different types of content Overwatch is
next we wrote an entire in depth
graphics optimization guide for this
game where we studied various
performance behaviors versus graphics
settings something that we can show on
the screen that testing is also where we
decided to move all testing including
benchmarks for 2017 over to a
five-minute test duration for this game
we collect 10 times as much data per
pass as in our more controlled built-in
benchmarks and this is strictly due to
the huge variation that multiplayer
games are subjected to
we also benchmark single player bot
matches something we've previously found
to be equal in performance again
something we can show on screen from our
overwatch graphics optimization guide to
online multiplayer matches they are
effectively the same performance but
there's far greater reliability and
consistency with bot matches because
it's easier to stay alive and stay where
you want to be our reasoning for going
to 5-minute passes will be made more
clear once we publish the standard
deviation video at 30 seconds on the
1070 we observed 74 FPS averages with 60
FPS 1% and 53 0.1% lows
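
A rough sketch of how an average FPS figure and 1% / 0.1% low figures can be derived from a log of frame times. The definition of a "low" used here (the FPS equivalent of the mean frame time across the slowest fraction of frames) is an assumption for illustration, since the video does not spell out its exact calculation, and the capture data is hypothetical.

```python
def average_fps(frame_times_ms: list[float]) -> float:
    """Average FPS over a capture: frames rendered divided by elapsed seconds."""
    elapsed_seconds = sum(frame_times_ms) / 1000.0
    return len(frame_times_ms) / elapsed_seconds


def low_fps(frame_times_ms: list[float], fraction: float) -> float:
    """FPS equivalent of the mean frame time of the slowest `fraction` of frames.

    This definition is an assumption; other tools use a percentile instead.
    """
    slowest_first = sorted(frame_times_ms, reverse=True)
    count = max(1, int(len(slowest_first) * fraction))  # always keep at least one frame
    worst = slowest_first[:count]
    return 1000.0 / (sum(worst) / len(worst))


# Hypothetical capture: 990 frames at 10 ms plus 10 slow frames at 20 ms.
frame_times_ms = [10.0] * 990 + [20.0] * 10

print(f"average:  {average_fps(frame_times_ms):.1f} FPS")
print(f"1% low:   {low_fps(frame_times_ms, 0.01):.1f} FPS")
print(f"0.1% low: {low_fps(frame_times_ms, 0.001):.1f} FPS")
```

Note that the average is computed from total frames over total time rather than by averaging per-frame FPS values, which would overweight the fast frames.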
we tracked marginally lower performance
at 60 seconds but our confidence
interval is lower than average due to
the variance in this game so we can't
confidently state whether the
differences are significant at 5 minutes
our confidence is high and our data
looks good 72 FPS average 58 and 53 for
the lows the RX 580 outputs similar
performance the 30-second tests as also
shown on the 1070 tend to output
slightly higher performance metrics due
to a more even split between non-combat
and combat with non combat rising higher
we believe this is unrealistic for
Overwatch as the game's most
important moments revolve around combat
for this reason our five-minute tests
are conducted from the time the doors
open through combat and we remain alive
and in combat for the entire 5 minute
duration this gives the most important
data relatively the 580 maintains 65% to
67% of the 1070 in this test and
remember the content isn't about the
1070 versus the 580 that's
irrelevant the two devices are just being
used to illustrate scalability over
duration
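
As an illustration of scalability over duration, the sketch below expresses one device's average FPS as a percentage of another's at each test length. The GTX 1070 values are the Metro averages quoted earlier (84 FPS at 30 seconds, 87.8 FPS at 90 seconds). The RX 580 values are approximations inferred from the narration (about 60 FPS, and roughly 0.8 FPS higher in the shortest pass), so treat them as assumptions rather than published data.

```python
def relative_performance(device_fps: float, reference_fps: float) -> float:
    """Express `device_fps` as a percentage of `reference_fps`."""
    return device_fps / reference_fps * 100.0


# Average FPS keyed by test-pass length in seconds.
gtx_1070_fps = {30: 84.0, 90: 87.8}  # quoted in the Metro section
rx_580_fps = {30: 60.8, 90: 60.0}    # approximate, inferred from the narration

relative_by_duration = {
    duration: round(relative_performance(rx_580_fps[duration], gtx_1070_fps[duration]), 1)
    for duration in gtx_1070_fps
}
# How far the relative result moves purely because the test length changed.
spread = round(max(relative_by_duration.values()) - min(relative_by_duration.values()), 1)

for duration, pct in relative_by_duration.items():
    print(f"{duration:>3} s pass: RX 580 at {pct}% of the GTX 1070")
print(f"spread across durations: {spread} percentage points")
```

A small spread means the choice of duration barely changes the relative conclusion, even when the absolute FPS numbers shift.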
although performance is ultimately
similar in terms of relative performance
our confidence interval is significantly
lower for the shorter test passes in
overwatch so we opted for five minutes
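
To make the confidence point concrete, here is a minimal sketch comparing run-to-run spread for a short and a long test. It computes the mean, sample standard deviation, and the half-width of a 95% confidence interval using Student's t. The pass averages are hypothetical numbers chosen only to show the effect, and this is not necessarily the statistical method used for the published charts.

```python
import math
import statistics

# Two-sided 95% Student's t critical values, keyed by degrees of freedom (n - 1).
T_CRITICAL_95 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}


def summarize(pass_averages: list[float]) -> tuple[float, float, float]:
    """Return (mean, sample standard deviation, 95% CI half-width) for a set of passes."""
    n = len(pass_averages)
    mean = statistics.mean(pass_averages)
    stdev = statistics.stdev(pass_averages)  # sample standard deviation (n - 1)
    half_width = T_CRITICAL_95[n - 1] * stdev / math.sqrt(n)
    return mean, stdev, half_width


# Hypothetical average FPS from four passes each; not measured data.
short_passes = [70.0, 78.0, 73.0, 75.0]  # e.g. 30-second passes, noisier
long_passes = [71.5, 72.5, 72.0, 72.0]   # e.g. 5-minute passes, steadier

short_summary = summarize(short_passes)
long_summary = summarize(long_passes)

for label, (mean, stdev, half_width) in (("short", short_summary), ("long", long_summary)):
    print(f"{label}: {mean:.1f} FPS, stdev {stdev:.2f}, 95% CI +/- {half_width:.2f}")
```

With only four passes, a noisy short test produces a much wider interval than a steady long test, which is the practical meaning of having less confidence in the shorter passes.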
ashes of the singularity is next and for
this one we observe higher frame rates
over 30 seconds than 60 and 90 seconds
resulting in a performance disparity of
about ten point seven percent this is
the greatest we've yet observed however
once again we need to determine whether
this has significance when looking at
GPUs in a relative fashion rather than
looking at absolute fps
charted alone the RX 580 looks about
the same 38 FPS for the shorter test 34
for the longer tests with lows deadly
accurate relatively however the GPUs are
identical we see that the RX 580
maintains 70% of the 1070's performance
in all three tests this isn't to say
once again that they are identical to
each other it's that the test passes at
the different durations are identical
with regard to relative scaling so the
580 always is equaling about the same
percentage of the 1070's performance so
that pretty much shows why we care about
relative versus absolute performance for
purposes of reviewing a piece of
hardware and an absolute is what we care
about for purposes of determining which
piece of hardware is best for playing
game a or B at specific settings X and
specific frame rate Y if you follow all
of that Sniper Elite 4 is the next one
in this game we see the 1070 operating
at about 52 53 fps on average with the
higher value stemming from our combat
test during the 90 second run this game
is also a DirectX 12 game that's highly
optimized so we wanted to include it as
well the other two tests were conducted
by running around the village a
geometrically complex area without any
combat so that's 30 and 60 seconds while
a 90 second test included a lot of
combat the RX 580 exhibited more
consistent performance as it was more
pinned for resources but the relative
performance gives us values of 74% to
78% of the 1070's performance this
one is one of the wider ranges
and when we started testing Sniper a
year ago we chose to focus on walking
through the more geometrically complex
scenes rather than introducing the
variance of combat because it gave us
higher confidence and because ultimately
if we're at 76% rather than 74 or 78% of
the 1070 baseline we're really looking
at about the same thing at the end of
the day we also tested Doom, For Honor
and a couple other things in addition to
these games you can find links to those
in the article below if you're curious
to see more but the short of it is we
saw about the same thing their
performance scaling at different
durations was roughly the same so why
then do you choose one duration versus
another if the values are roughly the
same in terms of relative scaling well
there are occasionally games where
that's not the case for example if we
let's say we're using The Witcher 3 or
some other game that has a specific
element in it that one GPU vendor does
not play well with while the other does
in something like The Witcher 3 you
might have tessellation and let's go
back to the original launch before any
patches came out and things like that if
you have a lot of tessellation and hair
in that particular game it's likely that
one GPU vendor Nvidia would handle that
better than the other AMD and it was an
Nvidia technology generating that hair
after all so it makes sense so then if
you are testing a scene where maybe it's
30 seconds versus 90 and the first 30
seconds involves a lot of HairWorks or
involves a lot of something that AMD or
Nvidia is good at but the other is not
and then the next 60 seconds is neutral
agnostic in its graphics requirements
that's where you'll see a bigger
difference in terms of relative
performance none of the games shown here
have that kind of behavior but we've
encountered it and that's why we always
test it behind the scenes before
deciding how we want to go about running
the next hundred hours of tests in that
game over the course of the year and it
is hundreds of hours of testing per year
so pretty important to get the testing
done figure out how to do it and where
it's the fairest to each GPU or CPU
vendor of course the nature of this kind
of work is that computers are highly
complex and even if you're someone way
above our level like a CPU or GPU architect
you still aren't going to
know every aspect of behavior of the
device you helped architect because
that's just the nature of software
playing with hardware there's so many
variables it's unreasonable to assume
that anyone knows all of them so we do
work on advancing these methods year
over year
this tends to happen around now each year
we're working on our next set of test
methods for 2018 which means we can
reveal some of the stuff we researched for
2017 in a more transparent fashion so
hopefully that's interesting to all of
you and we're pretty excited for what's
coming up next so you'll see that
implemented in reviews eventually
sometime this year every game is
different because what you saw here
showed relative performance roughly the
same in most but not all of these games
at 30 to 150 seconds
that does not mean every game will be
like that so we test this on a game by
game basis it's not like we just collect
this data once and then set a time scale
for every single game ever so game by
game it's done not all games are the
same not all hardware is going to treat
the games the same and then you also
have confidence interval and things like
overwatch where yes 30 to 60 seconds may
roughly equal the same relative
performance between two devices as five
minutes but when you're testing say 14
devices for multiple passes each and you
have a decent amount of variance
introduced from multiplayer whether
that's represented in data or not that
impacts the tester's confidence interval
so it's better for us to just lengthen
that duration and then we mitigate
the potential for bad data or data that
can cause us to rerun dozens of hours of
testing so it's all done based on min-
maxing how many tests you can fit for a
given game in a given period of time in
order to publish a review while
remaining profitable and also making
sure you remain accurate so just
being able to fit a million tests in is
no good if they're inaccurate but
inversely if you go overboard to a point
where you're collecting data that is no
longer establishing a real difference in
reality but you're trying to be hyper
accurate it's just impossible to test
the devices that need to be tested in
the period of time required to remain in
business so that's the trade-off that's
why we invest so much time behind the
scenes testing everything prior to
committing to a methodology that ends up
going public because it's important to
not have to redo everything every three
months we try to redo everything
concerning methodology on a yearly basis
then we rerun tests as new drivers and
game updates come out so we don't have
to redesign the entire methodology and
learn how to do it just rerun the tests
on the new stuff so that's it for this
one we'll have more in this series
coming out I think we're gonna call this
bench theory or something like that for
the series name keep an eye out for it
we'll make a playlist or something and
subscribe for more as always if you're
interested in this type of thing please
leave a comment below discussing any
game requests you have for us to test
for the next year because now is the
time to get those requests in if you
want us to use a specific game for
testing CPUs or GPUs go to
patreon.com/gamersnexus to help us out as
always that helps us fund this type
of research and thank you for watching
I'll see you all next time