So that leaves performance, which I honestly haven’t found good numbers for. If you have this, I’m very interested, but since RAM speed is rarely the bottleneck in a computer (unless you have specific workloads), I’m going to assume it to be a marginal improvement.
This is where you’re mistaken. There is one thing that integrated RAM enables that makes a huge difference for performance: unified memory. GPU code is almost always bandwidth-limited, which is why on a graphics card the RAM is soldered on and physically close to the GPU itself: the GPU’s high bandwidth requirements demand it.
By having everything in one package, CPU and GPU can share the same memory, which means that you eliminate any overhead of copying data to/from VRAM for GPGPU tasks. But there’s more to it than that: unified memory applies not just to the CPU and GPU, but also to the other accelerators that are part of the SoC. AI acceleration is becoming increasingly important, and UMA means the neural engine can access the same memory as the CPU and GPU, also with zero copy overhead.
This is why user-replaceable RAM and discrete GPUs are going to die out. The overhead and latency of copying all that data back and forth over the relatively slow PCIe bus is just not worth it.
The best I’ve found is benchmarks of Apple silicon vs Intel+dGPU, but that’s an apples to oranges comparison. And if I’m not mistaken, Apple made other changes like a larger bus to the memory chips, which again makes comparisons difficult.
I’ve heard about potential benefits, but without something tangible, I’m going to have to assume it’s not the main driver here. If the difference is significant, we’d see more servers and workstations running soldered RAM, but AFAIK that’s just not a thing.
I understand the scepticism, but without links of what you’ve found or which parts in particular you consider dubious claims (ram speed can be increased when soldered, higher speeds lead to better performance, etc) it comes across as “i don’t believe you, because i choose to not believe you”
Yes, and the result from that video (I assume; I skimmed it, but have watched similar videos) is that the difference is negligible (like 1-10FPS) and you’re usually better off spending that money on something else.
I look at the benchmarks between the Intel MacBook Pro and the M1 MacBook Pro, and both use soldered RAM, yet the M1 gets so much better performance, even on non-GPU tasks (e.g. memory-heavy unit tests at work went from 3-5min to 45-50sec from latest Intel to M1). Docker build times saw a similar drop. But it’s hard for me to know what the difference is between memory vs CPU changes. I’d have to check, but I’m guessing there’s also the DDR4 to DDR5 switch, which increases memory channels.
The claim is that proximity to the CPU explains it, but I have trouble quantifying that. For me, a 1-10FPS drop isn’t enough to reduce repairability and expandability. Maybe it is for others though, but if that’s the difference, that’s a lot less than the claims they seem to make.
The video has a short section on productivity (i.e. rendering or compiling). That part is probably the most relevant for most people. Check the chapter view in YouTube to jump directly to it.
I think a 2x performance improvement is plausible when comparing non-soldered ram to the Apple silicon, which goes even further and has the memory on the die itself. If, of course, ram is the limiting factor.
The advantages of upgradable, expandable RAM are obvious. But let’s face it: most people don’t need that capability, and even fewer use it.
Looks about the same as the rest. Big gains for handbrake, pretty much nothing for anything else. And that makes sense, because handbrake will be doing lots of roundtrips to the GPU for encoding.
has the memory on the die itself
On the package, not the die. But perhaps that’s what you meant. On die would be closer to a massive cache like on the X3D AMD chips.
The performance improvement seems to be that Apple has a massive iGPU, not anything to do with RAM next to the CPU. So in CPU-only benchmarks, I’d expect the lion’s share of the difference to be CPU design and process node, not the memory.
Also, unified memory isn’t particularly new, APUs have supported it for years. It’s just not well utilized by devs because most users have dGPUs. So I think the main innovation here is Apple committing to it and providing tooling for devs to utilize the unified memory better, like console manufacturers have done.
So I guess that brings a few more questions:
what performance improvements could we see if devs use unified memory in socketed LPDDR memory in laptops?
how would that compare to Apple’s on-package RAM (I think it’s also LPDDR, so more apples to apples?)?
how likely are AMD and Intel to push for massive APUs on laptops?
I guess we’re kind of seeing it with the gaming PC handhelds, like Steam Deck, Ayaneo, et al., so maybe that’ll become more mainstream.
The best I’ve found is benchmarks of Apple silicon vs Intel+dGPU, but that’s an apples to oranges comparison.
The thing with benchmarks is that they only show you the performance of the type of workload the benchmark is trying to emulate. That’s not very useful in this case. Current PC software is not built with this kind of architecture in mind, so it was never designed to take advantage of it. In fact, it’s the exact opposite: since transferring data to/from VRAM is a huge bottleneck, software will be designed to avoid it as much as possible.
For example: a GPU is extremely good at performing an identical operation on lots of data in parallel. The GPU can perform such an operation much, much faster than the CPU. However, copying the data to VRAM and back may add so much time that the whole job still finishes sooner on the CPU. A developer may then choose to run it on the CPU instead, even if the GPU was specifically designed to handle that kind of work. On a system with UMA you would absolutely run this on the GPU.
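To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. Every throughput and bandwidth figure in it is a made-up illustrative assumption, not a measurement of any real CPU, GPU, or PCIe link; the only point is the shape of the comparison (compute time plus a round trip over the link versus compute time alone).

```python
# Toy cost model: all rates below are hypothetical, chosen only for illustration.
GB = 1e9

def cpu_time(n_bytes, cpu_throughput):
    """Seconds to process the data in place on the CPU."""
    return n_bytes / cpu_throughput

def discrete_gpu_time(n_bytes, gpu_throughput, link_bandwidth):
    """Seconds on a discrete GPU: copy to VRAM, compute, copy the result back."""
    round_trip = 2 * n_bytes / link_bandwidth
    return round_trip + n_bytes / gpu_throughput

def unified_gpu_time(n_bytes, gpu_throughput):
    """Seconds on a GPU that shares memory with the CPU: no copy step."""
    return n_bytes / gpu_throughput

n_bytes = 1 * GB
cpu = cpu_time(n_bytes, cpu_throughput=20 * GB)        # assumed CPU rate
dgpu = discrete_gpu_time(n_bytes, gpu_throughput=200 * GB,
                         link_bandwidth=25 * GB)       # assumed GPU rate and link
uma = unified_gpu_time(n_bytes, gpu_throughput=200 * GB)

print(f"CPU: {cpu:.3f}s  dGPU incl. copies: {dgpu:.3f}s  UMA GPU: {uma:.3f}s")
```

With these assumed numbers the GPU is ten times faster at the actual work, yet the discrete GPU loses to the CPU once the copies are counted, while the unified-memory GPU wins comfortably. Different numbers will move the break-even point.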
The same thing goes for something like AI accelerators. What PC software exists that takes advantage of such a thing?
A good example of what happens if you design software around this kind of architecture can be found here. This is a post by a developer who worked on Affinity Photo. When they designed this software they anticipated that hardware would move towards a unified memory architecture and designed their software based on that assumption.
When they finally got their hands on UMA hardware in the form of an M1 Max that laptop chip beat the crap out of a $6000 W6900X.
We’re starting to see software taking advantage of these things on macOS, but the PC world still has some catching up to do. The hardware isn’t there yet, and the software always lags behind the hardware.
I’ve heard about potential benefits, but without something tangible, I’m going to have to assume it’s not the main driver here. If the difference is significant, we’d see more servers and workstations running soldered RAM, but AFAIK that’s just not a thing.
It’s coming, but Apple is ahead of the game by several years. The problem is that in the PC world no one has a good answer to this yet.
Nvidia makes big, hot, power hungry discrete GPUs. They don’t have an x86 core and Windows on ARM is a joke at this point. I expect them to focus on the server-side with custom high-end AI processors and slowly move out of the desktop space.
AMD is in the best position for desktop. They have a decent x86 core and GPU, and they already make APUs. Intel is trying to get into the GPU game but has some catching up to do.
Apple has been quietly working towards this for years. They have their UMA architecture in place, they are starting to put some serious effort into GPU performance, and rumor has it that with M4 they will make some big steps in AI acceleration as well. The PC world is held back by a lot of legacy hardware and software, but there will be a point where they will have to catch up or be left in the dust.
“unified memory” is an Apple marketing term for what everyone’s been doing for well over a decade. Every single integrated GPU in existence shares memory between the CPU and GPU; that’s how they work. It has nothing to do with soldering the RAM.
You’re right about the bandwidth though, current socketed RAM standards have severe bandwidth limitations which directly limit the performance of integrated GPUs. This again has little to do with being socketed though: LPCAMM supports up to 9.6GT/s, considerably faster than what ships with the latest macs.
This is why user-replaceable RAM and discrete GPUs are going to die out. The overhead and latency of copying all that data back and forth over the relatively slow PCIe bus is just not worth it.
The only way discrete GPUs can possibly be outcompeted is if DDR starts competing with GDDR and/or HBM in terms of bandwidth, and there’s zero indication of that ever happening. Apple needs to put a whole 128GB of LPDDR in their system to be comparable (in bandwidth) to literally 10 year old dedicated GPUs - the 780ti had over 300GB/s of memory bandwidth with a measly 3GB of capacity. DDR is simply not a good choice for GPUs.
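For anyone who wants to check bandwidth figures like these, peak memory bandwidth is just bus width times transfer rate. A small sketch follows; the bus widths and transfer rates are my own assumptions from public spec sheets (384-bit GDDR5 at 7 GT/s for the 780 Ti, a 128-bit LPCAMM2 module at 9.6 GT/s, and a 128-bit dual-channel DDR5-5600 setup as a stand-in for typical socketed RAM), so verify them before relying on the output.

```python
def peak_bandwidth_gb_s(bus_width_bits, giga_transfers_per_s):
    """Theoretical peak bandwidth in GB/s: bytes per transfer times GT/s."""
    return bus_width_bits / 8 * giga_transfers_per_s

# (bus width in bits, transfer rate in GT/s); all values are assumptions.
configs = {
    "GTX 780 Ti (GDDR5)": (384, 7.0),
    "LPCAMM2 module (LPDDR5X)": (128, 9.6),
    "Dual-channel DDR5-5600": (128, 5.6),
}

bandwidth = {name: peak_bandwidth_gb_s(*spec) for name, spec in configs.items()}

for name, gb_s in bandwidth.items():
    print(f"{name}: {gb_s:.1f} GB/s")
```

Under those assumptions the ten-year-old card comes out at 336 GB/s, which matches the "over 300GB/s" above, and a single fast socketed module lands at well under half of that. The wide bus is what does the heavy lifting, not the per-pin speed.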
“unified memory” is an Apple marketing term for what everyone’s been doing for well over a decade.
Wrong. Unified memory (UMA) is not an Apple marketing term, it’s a description of a computer architecture that has been in use since at least the 1970’s. For example, game consoles have always used UMA.
Every single integrated GPU in existence shares memory between the CPU and GPU; that’s how they work.
Again, wrong.
While iGPUs have existed for PCs for a long time, they did not use a unified memory architecture. What they did was reserve a portion of the system RAM for the GPU. For example on a PC with 512MB RAM and an iGPU, 64MB may have been reserved for the GPU. The CPU then had access to 512-64 = 448MB. While they shared the same physical memory chips, they both had a separate address space. If you wanted to make a texture available to the GPU, it still had to be copied to the special reserved RAM space for the GPU and the CPU could not access that directly.
With unified memory, both CPU and GPU share the same address space. Both can access the entire memory. No RAM is reserved purely for the GPU. If you want to make something available to the GPU, nothing needs to be copied, you just need to point to where it is in RAM. Likewise, anything done by the GPU is immediately accessible by the CPU.
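If the copy-versus-pointer distinction is hard to picture, here is a loose analogy in plain Python. This is not GPU code and says nothing about how any real driver works; `bytes(...)` merely stands in for copying into a separate reserved region, and `memoryview(...)` stands in for two processors addressing the same buffer. The variable names are made up for the illustration.

```python
# A pretend texture living in "system RAM".
texture = bytearray(16)

# Separate address space: the "GPU" gets its own copy of the data.
gpu_copy = bytes(texture)

# Shared address space: the "GPU" just gets a reference to the same bytes.
gpu_view = memoryview(texture)

# The "CPU" updates the texture after handing it over.
texture[0] = 255

print(gpu_copy[0])  # the copy is stale and would have to be re-uploaded
print(gpu_view[0])  # the shared view sees the change with no copy

# Writes made through the shared view are visible to the "CPU" side too.
gpu_view[1] = 7
print(texture[1])
```

The copy has to be redone every time the source changes, in both directions. The shared view never does, which is the whole argument for UMA in GPGPU workloads.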
Since there is one memory pool for both, you can use RAM more efficiently. If you have a discrete GPU with 16GB VRAM, and your app only needs 8GB VRAM, that other memory just sits there being useless. Alternatively, if your app needs 24GB VRAM, you can’t run it because your GPU only has 16GB, even if you have lots of system RAM available.
With UMA you can use all the RAM you have for whatever you need it for. On an M2 Ultra with 192GB RAM you can use almost all of that for the GPU (minus a little bit that’s used for the OS and any running apps). Even on a tricked out PC with a 4090 you can’t run anything that needs more than 24GB VRAM. Want to run something where the GPU needs 180GB of memory? No problem on an M2 Ultra.
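The capacity argument boils down to a simple comparison, sketched below. The 16GB, 24GB, 180GB and 192GB figures come from the comment above; the 8GB set aside for the OS and running apps is a number I made up, since "a little bit" is all we are told.

```python
def fits_on_discrete_gpu(required_gb, vram_gb):
    """A discrete GPU can only use its own fixed VRAM pool."""
    return required_gb <= vram_gb

def fits_in_unified_memory(required_gb, total_ram_gb, reserved_gb):
    """With one shared pool, the GPU can use whatever the rest of the system does not."""
    return required_gb <= total_ram_gb - reserved_gb

RESERVED_GB = 8  # hypothetical allowance for the OS and running apps

results = {
    "24GB job on 16GB card": fits_on_discrete_gpu(24, 16),
    "180GB job on 24GB card": fits_on_discrete_gpu(180, 24),
    "180GB job on 192GB unified": fits_in_unified_memory(180, 192, RESERVED_GB),
}

for case, ok in results.items():
    print(f"{case}: {'fits' if ok else 'does not fit'}")
```

The flip side is that the shared pool is also all the CPU has, so a job that fills it leaves little for anything else.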
It has nothing to do with soldering the RAM.
It has everything to do with soldering the RAM. One of the reasons iGPUs sucked, other than not using UMA, is that GPU performance is almost always limited by memory bandwidth. Compared to VRAM, standard system RAM has much, much less bandwidth, causing iGPUs to be slow.
A high-bandwidth memory bus, like a GPU needs, has a lot of connections and runs at high speeds. The only way to do this reliably is to physically place the RAM very close to the actual GPU. Why do you think GPUs do not have user-upgradable RAM?
Soldering the RAM makes it possible to integrate a CPU and a non-sucking GPU. Go look at the inside of a PS5 or XSX and you’ll see the same thing: an APU with the RAM chips soldered to the board very close to it.
This again has little to do with being socketed though: LPCAMM supports up to 9.6GT/s, considerably faster than what ships with the latest macs.
LPCAMM is a very recent innovation. Engineering samples weren’t available until late last year and the first products will only hit the market later this year. Maybe this will allow for Macs with user-upgradable RAM in the future.
The only way discrete GPUs can possibly be outcompeted is if DDR starts competing with GDDR and/or HBM in terms of bandwidth
What use is high bandwidth memory if it’s a discrete memory pool with only a super slow PCIe bus to access it?
Discrete VRAM is only really useful for gaming, where you can upload all the assets to VRAM in advance and data practically only flows from CPU to GPU and very little in the opposite direction. Games don’t matter to the majority of users. GPGPU is much more interesting to the general public.
Wrong. Unified memory (UMA) is not an Apple marketing term, it’s a description of a computer architecture that has been in use since at least the 1970’s. For example, game consoles have always used UMA.
Apologies, my google-fu seems to have failed me. Search results are filled with only apple-related results, but I was now able to find stuff from well before. Though nothing older than the 1990s.
While iGPUs have existed for PCs for a long time, they did not use a unified memory architecture.
Do you have an example? Every single one I look up has at least optional UMA support. The reserved RAM was a thing, but it wasn’t the GPU’s entire memory; it was reserved for the framebuffer. AFAIK iGPUs have always shared memory like they do today.
It has everything to do with soldering the RAM. One of the reason iGPUs sucked, other than not using UMA, is that GPUs performance is almost limited by memory bandwidth. Compared to VRAM, standard system RAM has much, much less bandwidth causing iGPUs to be slow.
I don’t disagree, I think we were talking past each other here.
LPCAMM is a very recent innovation. Engineering samples weren’t available until late last year and the first products will only hit the market later this year. Maybe this will allow for Macs with user-upgradable RAM in the future.
What use is high bandwidth memory if it’s a discrete memory pool with only a super slow PCIe bus to access it?
Discrete VRAM is only really useful for gaming, where you can upload all the assets to VRAM in advance and data practically only flows from CPU to GPU and very little in the opposite direction. Games don’t matter to the majority of users. GPGPU is much more interesting to the general public.
gestures broadly at every current use of dedicated GPUs. Most of the newfangled AI stuff runs on Nvidia DGX servers, which use dedicated GPUs. Games are a big enough industry for dGPUs to exist in the first place.
Do you have actual numbers to back that up?
LTT has made a comparison video on ram speeds: https://www.youtube.com/watch?v=b-WFetQjifc
Do you need proof that soldered ram can be made to run faster?
Here’s a link to buy some from Dell: https://www.dell.com/en-us/shop/dell-camm-memory-upgrade-128-gb-ddr5-3600-mt-s-not-interchangeable-with-sodimm/apd/370-ahfr/memory. Here’s the laptop it ships in: https://www.dell.com/en-au/shop/workstations/precision-7670-workstation/spd/precision-16-7670-laptop. Available since late 2022.