Gadgetory


Exploratory Multitasking Benchmark Issues - G4560 & R3 1200

2017-08-27
The point of this content is to show why multitasking benchmarks are wholly unreliable, difficult to execute, and ultimately uninformative unless controlled to the point of becoming a quote-unquote "normal" benchmark. There's something here we can work with for the future, maybe, but our first attempt at multitasking benchmarking largely demonstrates why no one really does this: mainly, it's not reliable, and variance between tests produces unexpected (read: wrong) results. We're looking at the G4560 and R3 1200 today, testing under conditions of gaming while running video playback and other bloatware.

Before getting to that, this coverage is brought to you by the EVGA 1080 Ti SC2 and NVIDIA Destiny 2 bundle, running through September 4th. The 1080 Ti SC2 comes with asynchronous fan control for its dual fans and 9 thermal sensors, and again includes Destiny 2. Learn more at the link in the description below.

Patrick headed up the test planning and execution for this content. I worked with him on a lot of it, and he'll be joining me at the end of this video to talk about some of the testing procedures he went through, things that we found inadequate, and things that we thought might potentially give us an option for the future. The thing is, this is the kind of testing we do behind the scenes to build a new test plan. We don't normally publish this type of data, but the idea here is to answer the very highly requested community question of "can you test multitasking?", which meant a lot of different things depending on which person you were speaking with. We introduced stream benchmarking alongside gaming, and that was a way of doing multitasking benchmarks, but for some folks that was too much multitasking. People wanted other things that were maybe more normal, so to speak: things like Discord or Chrome, or video playback or music playback, or running Steam, Uplay, Origin, and stuff like that. So that's what we're visiting today, and the point of publishing this is to demonstrate why it's
hard to trust these multitasking benchmarks and why we haven't quite figured it out yet, because there are so many variables when you start introducing the real-world scenarios that everyone requests. Turns out pretty much all the real-world scenarios involve internet-attached applications, which do whatever the heck they want a lot of the time, to the point of really throwing off results. We have examples of those where, in some cases, the Ryzen 3 CPU was just getting destroyed in numbers, but not because it was actually bad, just because some application decided to fire off some kind of background task while we were benchmarking. That's why you don't run tests with those things in the background normally. But you can try to, and if you're able to work with applications that don't just randomly spawn more services, or randomly start downloading stuff, or start doing whatever it is that they do in the background, if you can figure that out, then you might have a multitasking benchmark. But it starts looking more and more like a normal test as you stray from the internet-attached applications that everyone wants us to test, like, again, Steam, Battle.net, Chrome, all that type of thing.

So, in listening to the community questions and thoughts from a while ago and throughout the last several CPU reviews, we've noted that a lot of people think things like Discord plus gaming is multitasking, or that Skype plus gaming is multitasking. By the purest definition of it, it kind of is, but for the folks who say that they're seeing noticeable performance differences from running Discord or Skype while playing a game: there is something seriously wrong with your computer if you're seeing that, even on a G4560. That is not an intensive workload. So we're looking at something in the middle. We're not looking at just one application, Discord or Skype, because it's pointless, but if you look at stuff like a bunch of bloatware all at once, then you start
developing a real test. So we're doing that. We have another test where we tried to do video playback with a 4K video while gaming, and then tried logging the 4K video's playback performance while logging the gaming performance, because if you just take one of them, you don't get the full picture. Maybe the game was perfectly fine but the video's dropping frames. So we tried that as well, to have more to work with.

What we didn't do is the "I keep 100 tabs of Chrome open" test, because Chrome also seems to do whatever it wants. We looked at it. The thing is, Chrome does all kinds of caching, it deactivates tabs sometimes when they're not in use, and you have dynamic page elements like advertisements that just kind of fire off whenever. Sure, you can address some of these problems in testing by getting an ad blocker or by trying to avoid things like YouTube and Twitch, but then again we're straying from the real-world scenarios that people are interested in, so you really might as well just use a local application. But even those are problematic, as we'll show you. So, Chrome: not a good benchmark, for a lot of reasons. But as we looked through this other stuff, we learned that it's not alone in being not a good benchmark for a lot of reasons.

So here's what we've got for the bloatware test. This one includes game clients: Blizzard, or Battle.net, or whatever they call it now, mixed with Origin, Uplay, and Steam, all open simultaneously. Monitoring software, so HWiNFO64 and MSI Afterburner, with HWiNFO64 logging actively every two seconds. Chat clients, like Discord in a call, and Skype. Peripheral software: NZXT CAM, Corsair CUE, Logitech Gaming Software, and Overwolf, all of which had some sort of use, except Overwolf, because we don't like it. Other than that, we had VLC with MP3 playback looping. All the tests were performed for a period of about two minutes, with three passes for parity, just to figure out if this was even worth pursuing further for longer durations. So here's the problem: something
like half of these applications or more are internet-attached, either silently or noticeably. They might be doing things in the background that we don't know about unless you really sit there and look at it and monitor packets and things like that. So that's a problem. That means that you have a dynamic testing element to account for.

Another thing is, if we get these benchmarks working in a satisfactory manner, we'd also have to look more at loss from baseline performance rather than exact, absolute performance, because if you're looking at an R3 and a G4560, the performance ultimately will be different, because the baseline performance should be different in general. We picked a few games where it was about the same, but what you care about is which one does it more efficiently. So what's your performance loss? What's the delta between a baseline test and a bloated test? That's one of the things we're looking at.

This was our nuclear option. We ran it this way because just doing Discord and a game is not enough, and because people kept asking for internet-attached application testing while gaming. So we've done basically all of them at this point, and it should produce some kind of difference. That's the base testing plan. We'll start with the video stuff and talk about why we scrapped it, then we'll go through the bloatware.

Video testing was our starting point. The goal was to play a game while playing back one of our own 4K videos, hosted locally and playing back on a secondary monitor, then use software to log game framerate and video playback dropped frames at the same time. VLC was used initially because it has an expansive options menu and it can count dropped frames, but it also came with some issues out of the box. VLC was hitting nearly 100% CPU usage while watching a 4K video with these lower-end CPUs. Toggling some options in the codec section helped with that, but then the software was no longer stock, so you start straying from that scenario. In addition, there's
a sample that we've got here of what it took to log a single benchmark run, because it really wasn't trivial. Windowed mode is mentioned here because VLC was crashing when certain games were launched in full-screen. Logging the framerate of video playback was important as well, since one of the issues people experience is video choppiness, not just loss of framerate in-game. That meant two framerate loggers on two windows at the same time, or relying on VLC's own dropped frame meter. We considered switching to Windows Media Player, but detected playback of more than 60 frames per second on a 60 frames per second video, which indicates some kind of Microsoft or Windows Media Player shenanigans. After encountering so many issues with monitoring playback reliably while also playing back a video at 4K while gaming, video testing was shelved for the time being and we moved on.

The next option was one that Patrick dubbed the quote "nuclear option," including numerous viewer-requested game clients, peripheral clients, and other background software. This, we thought, would surely draw out any differences between the G4560 and R3 1200 in these multitasking benchmarks. Again, we started this thinking that we might have an actually valid benchmark to look at performance differences between the two CPUs, but we left with low confidence in multitasking benchmarking in general without using local, non-internet-attached software.

Let's start with Blizzard's client and why it is 100% unreliable for any type of benchmark while it's open. Beginning with the expectations: most of you would agree that it is reasonable to assume one of two outcomes here. Either there's really no difference with the Battle.net client open, and you have basically the same performance between the G4560 and the R3 1200, with no loss, baseline versus Battle.net being open. This isn't loss between the two CPUs against each other. It's what is baseline performance of the CPU versus baseline with Battle.net open. That's what we're looking at. So one
expectation is that there's no difference. The other expectation is that maybe there's a slight advantage for the R3 1200. But what you wouldn't expect is that the G4560 would perform 31% better than an R3 1200. I think we can all agree that's pretty unreasonable to assume, but that's what we saw. That's because this type of testing is unreliable.

Here's a chart. For anyone who skipped to this chart and bypassed all the stuff I just said: you're not going to understand any of the numbers, and you're going to post a comment calling us shills. If you skipped the last few minutes, stop now and go back, because you're not going to understand the context, which is that this is not a valid benchmark. That's the whole point of this video.

We've got a few main figures here. The G4560 baseline is 48.3 FPS, with lows at 33 and 27.5. The R3 1200 is remarkably close by, with the two parts more or less tied and within variance in the frametime department. When bloated with all of these applications, which we'll show on the screen once again, we lose 4 FPS on the G4560. That's the average dropping 8% in performance, and the hit to frametimes is a bit worse, illustrated by the 1% and 0.1% lows here. Here's a surprise, though: the R3 1200 drops 11 FPS, or 24%, and now suffers with 0.1% lows that align with more stutter. Clearly something went wrong here. We started disabling applications, ultimately finding that Battle.net was the culprit: it decided to do something in the background during the R3 tests that it did not do in the background during the G4560 tests. Solving for this bumps the R3 1200 up to G4560 performance once again, with the two more or less tied when we axe Battle.net from the R3.

Again, these are not definitive benchmarks. We're not telling you that one CPU is better than the other, or even that they're tied. What we're showing you is the results from different test passes with an R3 1200 and a G4560, and why it doesn't really make sense, some of the numbers you
get sometimes, and that's because of the variance. That's why we don't do these tests normally. I'd like to do them, because it is so heavily requested, but we're not going to be able to do these benchmarks with the applications everyone wants to see benchmarked. Battle.net's behavior here could also explain other weird frame deltas when you have Battle.net open while benchmarking something like Overwatch or any other game that they have on there. It's just kind of a weird application that throws issues to begin with.

Let's go to Metro: Last Light. Again, we're looping tests for about two minutes, with three loops each time. Here is Metro: Last Light's benchmark between the same two CPUs. Baseline, we've got the G4560 at about 84 FPS and the R3 1200 at about 85 FPS. We chose this benchmark because they were so close and performance was more or less equal. The G4560 drops to 75 FPS average with bloatware, or about 10% performance loss. The R3 1200 drops 31% of its average performance, again because of completely unpredictable internet-attached software in the form of Battle.net doing things in the background, and of course other software too. It's not just Battle.net. Disabling Battle.net gives a more proportional loss, as you can see in the "bloatware without Battle.net" results, and who knows what other software was responsible for the performance that we saw. We got lucky in disabling Blizzard's client, finding that it had issues, and kind of moving ahead. But the thing is, retesting the R3 1200 with Battle.net, sometimes its numbers look fine and other times it's a loss. Same is true for the G4560. It's just a matter of what was going on: what was Blizzard doing, what was the client doing, when you ran the test.

Here's Ashes: Escalation, where we see a 1 to 2 FPS loss with bloatware, though the G4560 ran Blizzard's Battle.net and the R3 did not in this test. So again, this isn't a test you can rely on for comparative data between the two, aside from showing that we saw a little performance loss in this
particular title with the applications doing whatever they were doing when we had them on in the background. This brings up another challenge with multitasking benchmarking, or what we're calling multitasking benchmarking. Because this game is so taxing on the CPU anyway, because it is commanding all of the CPU resources from AMD and Intel, we now have an issue where Ashes clearly is more or less the same performance, but what's happening in the background applications? What are they dropping that we don't know about to try and keep up with Ashes?

Well, the thing is, you can look at some of that. For example, logging applications: in the past, we've noticed that AIDA64 will drop log intervals when it is incapable of keeping up with the performance. For example, running FurMark with Prime95 and keeping AIDA64 at normal priority in Task Manager means that AIDA64, with the hardware we were testing on when we did this, will drop intervals. So instead of an interval every second, you'll get an interval at second 1, an interval at second 19, at 22, at 37. It's kind of random, and you're dropping performance in that application, AIDA64, but maybe not the other ones. So that's another really difficult thing you have to keep an eye out for. AIDA or HWiNFO, those are easier: you produce a log file, you check the interval average, and you know if it was logging the whole time or not. But what about things like video playback software? Now you need either another frame monitor that you know is accurate for video playback, or an application that logs dropped frames. OK, you can figure that out. What about the other software? What about music playback, or streaming playback if you're trying to work with Chrome or something like that? There are all these things. It's not just monitoring the game. It's monitoring everything else and trying to figure out where the performance loss is occurring, especially when you start accounting for things like Windows, which schedules stuff in ways that
should theoretically be beneficial to the user, but it's not the same between Intel and AMD, so that's hard too.

Here's one more: Rocket League, where we've again got roughly equal performance to start, which was an intentional choice, and then a gap of about 8% when we use the bloatware. Here's the thing with this one: we have no idea if we're getting game priority on the G4560 and some negative effect to the background software, or if the R3 1200 is genuinely just slower, or if there's some sporadic and unpredictable background process firing off as a result of all the game clients and internet-attached applications trying to do stuff. At the end of the day, this test means nothing. The data is not reliable. We could make AMD win or Intel win just by running the test enough times that one of them has some terrible thing going on in the background to tank the performance, as we saw with Ryzen in that particular set of tests with Battle.net.

The alternative to that is you could, as a tester, not know that something just happened, publish those results, and then end up with results that show a huge disparity between two components that might only exist because something started occurring, like maybe a download, or maybe some kind of auto video playback in one of the clients you're working with that starts tapping into the CPU for some kind of encoding process, whatever. Something like that could go on in one of the applications, depending on what you're using. So it requires very careful selection of applications, and then some way to monitor the performance of those applications as well, if it's video playback or music playback or something like that. So, a lot of trouble there. It's easy to overlook a difference caused by a background event that did not occur for the other product, and it's easy to just have applications that, for whatever reason, are scheduled differently between tests. So it's very difficult to do this kind of benchmarking, and ideally you do it with stuff that's not internet
attached. So the next step to this would be: we ditch all of these game clients, especially Battle.net, and start doing stuff with Excel. You can make Excel enumerate or iterate through some formula ad infinitum and just sit there and process a formula non-stop while you play a game. Then at the end of it, you look at how far Excel got and what the framerate of the game was, and you compare the two numbers. Unfortunately, that's not really something people do too much. Certainly there are people who do that, but our audience, probably not so much. Ultimately, the comments we would get for doing something like that would be, "this isn't a real-world test, we want real-world multitasking benchmarking." So you're back to the original problem: here's something that's kind of synthetic that we've created to simulate multitasking benchmarking, and although it is technically a real multitasking thing, does anyone do it or care about it? You could do that, but it doesn't satisfy the demand, which is for more normal applications.

So we've got Excel, where you make it iterate numbers. You look at some media player that's trustworthy in its dropped-frame logging, or has some other way to log, and doesn't crash like VLC was doing under certain conditions. You find something like that, and maybe there's something there to test. But it starts looking more and more like a normal test environment and not the scenario that people want to see, which is unfortunate, because we'd love to fill that demand. It's just not easy, and we don't want to publish numbers for something that clearly has so much variance when we have no confidence in what's going on in the background. If I could look at it and know something happened that didn't happen in the other test, and by "something happened" I mean one of the applications pinging a server or tapping into the CPU in a different way, if we could look at it and identify that reliably without investing an absurd amount of effort, and I mean, we're willing to invest effort, but there's a
reasonable amount that you can do, if that were the case, we'd run the tests. But until that point, there's really no point in running them. So this is why we don't test with that many variables. We tried. This content piece started out with the hopes that we would do an actual G4560 vs. R3 1200 overclocking, multitasking benchmark, we'll call it, but it just didn't turn out that way. It turned into a content piece of "this is why this thing's hard to do with the applications we used."

Besides, for most people who talk about feeling a faster or smoother response in just Windows after changing CPUs, it's probably because you reinstalled Windows, unless you went from something like a garbage Celeron up to literally anything else. So that's another thing to consider: a lot of this subjective user feeling of responsiveness comes from things like SSDs or, if that was already in there, then reinstalling Windows. But that's enough of my thoughts on it. We're going to get Patrick on for a minute to talk through some of his testing and see what he thinks about the future of trying to do something like this now that we've learned a bit, and then we'll close it out.

OK, so I've got Patrick on now. Patrick ran the tests and figured out how to do most of them. Let's start with that. Going back to the video stuff, what was the process to start and adequately execute all the logging of the video while gaming?

We wrote this down in the article in a little more detail, but basically what I was having to do is have the video open on one monitor and the game open on another monitor, switch the video to be the active window so that FRAPS would detect it, set the game as the process that PresentMon was detecting, hit a key combination to start logging with FRAPS and PresentMon, hit play on the video, tab back into the game, start benchmarking the game, and do all of that within a reasonable amount of time so that the benchmarks would be synced up and so that we wouldn't be logging framerates
before or after the benchmark, because then we'd get really weird 0.1% lows. It was just a really complicated process for something that isn't super important, really. I mean, we want to log the framerate of both the video and the game, because a lot of the issues that people have when multitasking are not just with the game, but also, if they're watching a video, with the video playing back badly or the audio skipping. So we want to test for that as well, but then adding that to the benchmark makes it exponentially more difficult.

Well, also, we could deal with the difficulty, but then you have stuff like: one, is FRAPS or PresentMon or whatever's monitoring the video application even accurate?

Yeah.

Is it even a frame output that is realistic for what's occurring on screen? And then also, they probably add some level of conflict with each other.

Yeah, I've definitely run tests in the past, accidentally, that we figured out were running incorrectly, where I've had PresentMon and FRAPS logging in the same game, and you'll see it because the numbers don't look right.

So that's a question we had. VLC's dropped frame output, too.

So VLC does have a feature where it will display dropped frames, and that's a pretty useful feature if what you're doing is watching VLC. But when you're doing multiple things at the same time and the video isn't the active window, we were getting playback that was stuttering but then not being reflected in the dropped frame counter. I think we pretty much dropped VLC after that point.

I don't know. Right now my feeling is I just generally don't trust multitasking benchmarks with a lot of the applications people want us to test, because, like, Battle.net is internet-attached.

Yeah.

And we saw a 30% performance loss for who knows what reason.

Yeah, there was some pretty weird stuff, and just the nature of the test makes the test difficult. I mean, people that are talking about difficulties with multitasking, they're talking about unpredictable behaviors, and weird
unpredictable behaviors aren't lab-friendly, right? They're hard to test for, they're hard to account for, and they're hard to reproduce.

This isn't us complaining, saying "this is hard, feel bad for us." This is saying, this is what everyone wants to see, and we want to see it too. That's why I just paid Patrick for like a week to try and do this: we want to see these tests. I'd like to do them. I'd love to be one of the only sites that has a way of doing them in, you know, a trustworthy fashion. Unfortunately, until we can really figure out key applications that are both representative of what users want and friendly to benchmarking processes, where you know what's going on and there's not variance, until we can figure those applications out and everyone agrees that they're good to test, we don't have a multitasking test for you that's anything beyond today's, which was, here's an exploratory look at it. Because otherwise, like I was saying before we cut in with Patrick, you can do stuff like we talked about, using Excel to just enumerate some formula.

Yeah.

That would be reliable. Does it count?

And we are just not comfortable with the level of accuracy that we would get from doing a casual benchmark with just Chrome, like a YouTube video playing. I mean, we could do that pretty easily, but then we wouldn't be comfortable publishing results and standing by them.

Yeah, it'd be trivial to do. How do you isolate the software from the network, from the OS, from the hardware? Because we want to isolate the hardware.

Yeah.

But network alone could be a big factor in things. So at the end of the day, we don't want to say that the 1200 is a better or worse CPU than the G4560 based on this.

I mean, yeah, it's just a really complex question, with really a lot of variance in the answers to that question.

Yep. And ultimately, if your form of multitasking is Discord, YouTube, your inbox, and playing a game, it's not going to matter which CPU you buy. It will matter for
other reasons, for which we have the R3 1200 review, so you can check that out if you want those reasons.

Well, that's all for this one. Thank you all for watching and for filing that request. Seriously, it's very interesting to look into. We like knowing these things, even if we didn't come to the conclusion that I wanted, which was a real test that shows real differences that we could trust. Didn't get that today, but we still got interesting content, and we have stuff we can work on for the future. So thank you for requesting it. If you want to leave comments with application suggestions, like for video playback, maybe you know of one that is really good at logging its performance, please leave them below. We'll look into it for next time, but I think we're good for now. So thank you for watching. Subscribe for more. Go to patreon.com/gamersnexus to help us out directly, and we will see you all next time.
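
Two of the checks described in the video, comparing loss from baseline instead of absolute FPS and verifying that a logging tool did not silently drop intervals, can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the FPS inputs are the approximate G4560 Metro: Last Light averages quoted in the video (about 84 FPS baseline, 75 FPS with bloatware), the timestamps are the example seconds mentioned for the dropped log intervals, and none of this is the tooling that was actually used for the testing.

```python
def percent_loss(baseline_fps: float, loaded_fps: float) -> float:
    """Return the performance lost relative to baseline, as a percentage."""
    if baseline_fps <= 0:
        raise ValueError("baseline_fps must be positive")
    return (baseline_fps - loaded_fps) / baseline_fps * 100.0


def dropped_intervals(timestamps_s, expected_interval_s=1.0, tolerance=0.5):
    """Return (previous, current) timestamp pairs whose gap exceeds the
    expected logging interval by more than the given fractional tolerance."""
    gaps = []
    for prev, cur in zip(timestamps_s, timestamps_s[1:]):
        if cur - prev > expected_interval_s * (1.0 + tolerance):
            gaps.append((prev, cur))
    return gaps


if __name__ == "__main__":
    # Approximate averages from the video: 84 FPS baseline, 75 FPS bloated.
    print(f"{percent_loss(84, 75):.1f}% loss from baseline")
    # Log entries at seconds 1, 19, 22, 37 instead of one entry per second.
    print(dropped_intervals([1, 19, 22, 37]))
```

A check like this only flags that a logger fell behind. It says nothing about why, which is the harder problem the video describes.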