and NVIDIA hasn't owned up to the recent order-of-ten puzzle that we solved (you can check the channel for that), but it's fairly obvious that it's from them, because it indirectly mentions Fermi, it directly mentions Kepler, and they're all related somehow to Maxwell or Pascal. So today I thought it'd be fun to talk about the GP100 architecture. This is the GPU for Pascal, the first GPU for Pascal, unveiled at GTC about a month ago in the Tesla P100 accelerator card. So this is not a video card for gamers; it's a scientific card,
but it's built on the Pascal GP100 architecture, and this Pascal architecture will be used in the gaming cards in some form. Now, they're going to strip things out, of course, that aren't necessary; they don't want to validate or build in features that increase cost when that cost can be reduced. So reducing cost by eliminating things like HBM2 on some of the lower-end cards makes sense, or by reducing DP (double-precision) hardware, or other things of that nature. We're going to talk about the SM architecture, or streaming multiprocessors, within GP100, then talk about the cache, unified memory, and HBM2 versus HBM1. Before we get into that, here's the spec table showing the GP100 Pascal specifications compared against the previous Kepler GK110 and Maxwell GM200 chips. There are various versions of Pascal, and in the version shown here the chip is actually cut down slightly compared to its maximum potential; we'll talk about that in this video.
Meet Pascal. This is the GP100 block diagram on screen now, the one we showed from GTC last month. Although the GeForce gaming chips will be a lot different in terms of their architecture and design at the high level, the low level is the same. It'll likely, at least in some parts, have the same memory subsystem, but other cards may have GDDR5 or GDDR5X; we don't know quite yet what the so-called GTX 1000-series cards will have, but Pascal overall is going to look like this. GP100 is the biggest GPU that NVIDIA has ever made, measuring 610mm² and using the new 16-nanometer FinFET process node from TSMC. All previous-generation architectures from both AMD and NVIDIA have been on a 28-nanometer process node,
so this die shrink has been years in the making. The process shrink champions an era of reduced wattage and more densely packed transistors, which continues the performance-per-watt gains that NVIDIA and AMD have been boasting about for the past few years. AMD began its performance-per-watt improvements with Fiji by liquid cooling the Fury X, which reduced leakage and resolved other potential issues, but Fiji didn't undergo the tremendous die shrink that Pascal and Polaris herald. Further performance and power efficiency gains are aided by the move to FinFET transistors, which means that power leakage becomes less significant. This also marks the EOL for planar FETs in GPUs, as all major silicon manufacturers have now transitioned to the FinFET design. FinFET transistors use a three-dimensional design which extrudes a fin to form the drain and source on the gate; the gate encircles the transistor's fins. GP100 has a transistor count totaling 15.3 billion across its 610mm² die. GP100 is rated for a 300W TDP and pushes 5.3 TFLOPS of FP64 (double-precision) compute performance and 10.6 TFLOPS of FP32. FP16 is also available natively at 21.2 TFLOPS on GP100,
but it's more critical than it might sound on paper or in a YouTube video. It's mostly there for deep learning applications, which we won't dive into here. Just as a quick primer: deep learning benefits because high precision is less necessary by nature of the backpropagation algorithm, so FP16 allows for reduced memory consumption and faster processing. Because the precision of FP64 or FP32 isn't required, those workloads can benefit from the speed instead, and then use redundancy and parity checks to make sure all that data is good.
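To make that a little more concrete, here's a minimal, hypothetical CUDA sketch (my own illustration, not NVIDIA sample code) of half-precision math using the half2 type from cuda_fp16.h; packing two FP16 values per 32-bit register is what lets GP100-class hardware run FP16 at roughly twice its FP32 rate.

```cuda
#include <cuda_fp16.h>

// Hypothetical example: add two half-precision vectors stored as half2 pairs.
// Each __hadd2 performs two FP16 additions in a single instruction.
__global__ void add_fp16(const __half2 *a, const __half2 *b, __half2 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __hadd2(a[i], b[i]);
}
```

That lines up with the memory-consumption point above, too: every value is half the size of an FP32 float, on top of the doubled math rate.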
The Tesla P100 Pascal card hosts 3584 CUDA cores for FP32 and 1792 CUDA cores capable of double precision at FP64. The full GP100 will have 1920 FP64 cores, if it's released. Pascal's P100 base clock rate, just for reference here, is 1328MHz, and it's 1480MHz when boosted; pretty fast for a GPU.
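For reference, those throughput figures fall straight out of core count, boost clock, and fused multiply-add counting as two operations per core per clock; here's my own back-of-envelope check, not an official NVIDIA formula:

```cuda
// Rough peak-throughput math for Tesla P100 at its 1480MHz boost clock.
// An FMA counts as 2 floating-point operations per core per clock.
constexpr double boost_ghz   = 1.480;
constexpr double fp32_tflops = 3584 * 2 * boost_ghz / 1000.0;  // ~10.6 TFLOPS
constexpr double fp64_tflops = 1792 * 2 * boost_ghz / 1000.0;  // ~5.3 TFLOPS
constexpr double fp16_tflops = fp32_tflops * 2.0;              // ~21.2 TFLOPS (packed half2)
```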
For the first part of this architecture deep dive, we'll start with the SM, or streaming multiprocessor, and talk about the Graphics Processing Clusters which enclose the SMs; then we'll get into unified memory and HBM2. GP100 hosts six Graphics Processing Clusters, or GPCs, each of which contains a set of Texture Processing Clusters, or TPCs, and for every one TPC in Pascal there are two streaming multiprocessors, or SMs. In total there are 60 SMs on GP100, with the Tesla P100 accelerator hosting 56. This is what an SM looks like in the Pascal architecture, on the screen right now. There are 10 SMs per GPC, accompanied by half that count of TPCs, or five total TPCs per GPC. Each SM contains 64 FP32 (single-precision) CUDA cores and 32 FP64 cores, or one FP64 core for every two FP32 cores. These stats are for a fully enabled GP100 GPU, so they will of course be different for the gaming GeForce cards when they come out; still, it's a good look at how things work architecturally.
This is a marked reduction in total core count per SM versus the Maxwell and Kepler architectures. Despite this, there is an overall higher per-GPU core count: the total is higher, with 3500-plus cores on Pascal GP100, but the per-SM core count is lower. Maxwell has 128 FP32 cores per SM, and predecessor Kepler had 192 FP32 cores per SM. The reduced CUDA core count per SM is because each GP100 SM has been effectively partitioned into two 32-core processing blocks, each of which contains independent instruction buffers, warp schedulers, and dispatch units. There's one warp scheduler and one instruction buffer per 32-core partition. Each partition also contains two independent dispatch units, for a total of four per SM. Each partition further contains a 32,768-entry, 32-bit register file.
The SM segments share a single instruction cache, a unified texture and L1 cache, four texture units (TMUs), and one 64KB shared memory block. Pascal has half the cores per SM of Maxwell, but the same register file size and comparable warp and thread counts, so GP100 can sustain more in-flight threads, warps, and blocks, partially because of the increased register access presented to the threads. Overall, the core count of Pascal GP100 is higher than GM200 even though the cores per SM are lower; that's just because there are more SMs. As a whole, this increases processing efficiency by changing the datapath configuration, and that's something that's only aided further by the move to a smaller process and the FinFET node. The datapath organization of Pascal requires less power for data transfer management, and Pascal schedules tasks with greater efficiency and consumes less die space than Maxwell. Also critical: each partition's warp scheduler can dispatch two warp instructions per clock, with one warp scheduled per block (the so-called partition, or segment, as we've been using those words). That's all shown in the block diagrams I've been showing on screen.
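Since the warp is the unit those schedulers actually issue, here's a minimal, hypothetical sketch (mine, not from NVIDIA's docs) of a warp-level sum; the 32 threads of one warp execute it in lockstep, which is exactly the granularity a per-partition scheduler works with.

```cuda
// Hypothetical illustration: sum a value across the 32 lanes of one warp.
// Each __shfl_down_sync pulls the value from the lane 'offset' positions
// higher, halving the stride until lane 0 holds the total.
__device__ float warp_sum(float v)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;  // lane 0 now holds the full 32-thread sum
}
```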
Each SM has four texture units, so there's a maximum possible count of 240 TMUs. L1 and texture cache are split for shared use. Unified memory most immediately benefits programmers: it reduces their manual workload by eliminating the need for explicit memory copies between the CPU and GPU memory pools. But it's still useful to the rest of us as well, as beneficiaries of that impact.
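As a rough illustration of what that saves on the programming side, here's a minimal managed-memory sketch; the kernel name and sizes are made up for the example, but the point is that there is no explicit cudaMemcpy between host and device anywhere in it.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, just to have something touch the data on the GPU.
__global__ void scale(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));  // one allocation, visible to CPU and GPU

    for (int i = 0; i < n; ++i) data[i] = 1.0f;   // CPU writes directly, no staging buffer

    scale<<<(n + 255) / 256, 256>>>(data, n);     // GPU works on the same pointer
    cudaDeviceSynchronize();                      // no cudaMemcpy calls anywhere

    cudaFree(data);
    return 0;
}
```

The driver and hardware handle the data migration behind that single pointer.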
With regard to Pascal specifically, L2 cache is unified into a single 4096KB pool, as opposed to GM200's 3072KB L2 cache, and that further reduces reliance upon DRAM, which is huge for speed. Pascal dedicates a single pool of 64KB of shared memory to each SM, eliminating the previous reliance on splitting memory utilization between L1 and shared pools.
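To show what that per-SM pool is for, here's a tiny hypothetical kernel (sizes are illustrative, not tuned for GP100) that stages a block's data in shared memory so the threads can reuse it without another trip out to L2 or DRAM:

```cuda
// Hypothetical sketch: stage one block's worth of data in the SM's shared
// memory, then let every thread in the block read from that on-chip tile.
// Assumes a 256-thread block and an element count that's a multiple of 256,
// purely for brevity.
__global__ void reverse_within_block(const float *in, float *out)
{
    __shared__ float tile[256];                   // lives in the SM's shared memory pool
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();                              // whole block sees the filled tile
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}
```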
For more detail on that and everything else we're talking about here, hit the article link in the description below; it has more on unified memory (sort of a GPU version of DMA) and on the SM architecture itself.
Now we're going to talk about HBM2 versus HBM1, which is of course very interesting to anyone who followed Fury and its launch. Fiji first introduced high-bandwidth memory on AMD's Fury X, which stacked memory vertically atop an interposer that in turn sits on top of the substrate. This reduced the physical distance between the GPU and memory, which reside on the same substrate, and joined them with a wider bus with reduced electrical and thermal requirements. Each stack of HBM1 uses a 1024-bit wide interface capable of producing 128GB/s of throughput per stack, so it's 128GB/s per stack, and that can add up to upwards of a terabyte per second depending on the GPU architecture you're looking at.
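As a sanity check on where that per-stack figure comes from, here's my own back-of-envelope arithmetic (approximate, not a spec-sheet breakdown):

```cuda
// Rough HBM1 bandwidth math, per stack and for a four-stack card like Fury X.
// First-gen HBM runs roughly 1Gbps effective per pin on a 1024-bit interface.
constexpr double bus_bytes      = 1024.0 / 8.0;        // 1024 bits = 128 bytes per transfer
constexpr double per_stack_gbs  = bus_bytes * 1.0;     // ~128GB/s per stack at ~1GHz effective
constexpr double four_stack_gbs = per_stack_gbs * 4.0; // ~512GB/s total on Fury X
```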
GDDR5 pushes a maximum theoretical throughput of about 8Gbps per die, with Micron's GDDR5X pushing 13 to 14Gbps per die. Fiji was limited to 4GB of total capacity due to yields and cost, but HBM2 will change the game in terms of capacity. HBM2, relevant to Pascal devices like GP100, will be able to maximally host 16GB of VRAM pursuant to the increased die density of the new version. HBM2 densities will run upwards of 8Gb, with 4 to 8 dies per GPU maximally enabling that 16GB capacity, although the initial P100 accelerator shipments will be 8GB. GP100's HBM2 is a composition of five dies presented as a coplanar surface to simplify heatsink mounting and hotspot cooling; basically it's flat, that's all that really means. The stack starts with the base die, then four DRAM modules are stacked vertically above it, and we have a photo of this taken under a microscope. GP100 has a 4096-bit wide interface for its HBM2, with each of its memory controllers running a 512-bit interface.
So that covers the initial Pascal GP100 launch. Of course, the GeForce cards (we're all very interested in those) will be here eventually; I don't know quite when just yet, but hopefully sometime soon. Of course Polaris is coming out from AMD, Polaris 10, and all of these GPUs are moving to the new process nodes: 14 nanometer in AMD's case, or 16 nanometer in NVIDIA's case for the FinFET process in NVIDIA's GP100, with TSMC being the manufacturer of that process and of the actual silicon itself. So that's the basics of Pascal, or the in-depth basics of Pascal. For more, as always, hit the article below; it has all the charts and diagrams and things like that, and more text, which should hopefully help you get through the dense information a bit more easily. Other than that, stay tuned, because we have plenty of other content coming up. This weekend we're out here in Austin for DreamHack, and we'll be talking about some of the goings-on there. We'll also very soon be talking about Computex, which starts in late May, so subscribe if you're not subscribed already and you'll catch all of our Computex Taipei factory tours and things like that. Patreon link is below the video, as always. Thank you for watching; I'll see you all next time.