Handling Complex Memory Situations
Jérôme Glisse felt that the time had come for the Linux kernel to seriously address the issue of having many different types of memory installed on a single running system. There was main system memory, device-specific memory, and associated hierarchies governing which memory to use at which time and under which circumstances. This complicated new situation, Jérôme said, was actually now the norm, and it should be treated as such.
The physical connections between the various CPUs and devices and RAM chips—that is, the bus topology—also was relevant, because it could influence the various speeds of each of those components.
Jérôme wanted to be clear that his proposal went beyond existing efforts to handle heterogeneous RAM. He wanted to take account of the wide range of hardware and its topological relationships, to eke out the absolute highest performance from a given system. He said:
One of the reasons for radical change is the advance of accelerator like GPU or FPGA means that CPU is no longer the only piece where computation happens. It is becoming more and more common for an application to use a mix and match of different accelerator to perform its computation. So we can no longer satisfy our self with a CPU centric and flat view of a system like NUMA and NUMA distance.
He posted some patches to accomplish several different things. First, he wanted to expose the bus topology and memory variety to userspace as a clear API, so that both the kernel and user applications could make the best possible use of the particular hardware configuration on a given system. Part of this, he said, would have to take account of the fact that not all memory on the system would always be equally available to all devices, CPUs or users.
To accomplish all this, his patches first identified four basic elements that could be used to construct an arbitrarily complex graph of CPU, memory and bus topology on a given system.
These included "targets", which were any sort of memory; "initiators", which were CPUs or any other device that might access memory; "links", which were any sort of bus-type connection between a target and an initiator; and "bridges", which could connect groups of initiators to remote targets.
Aspects like bandwidth and latency would be associated with their relevant links and bridges. And, the whole graph of the system would be exposed to userspace via files in the sysfs hierarchy.
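Jérôme's actual structures aren't shown here, but as a rough, hypothetical sketch, the four elements could be modeled as a small graph along these lines (every name and field below is illustrative, not taken from the HMS patches):

    /* Illustrative sketch only: hypothetical names, not the actual
     * structures from the HMS patches.                               */
    #include <stdio.h>

    /* A "target" is any sort of memory. */
    struct target {
            const char    *name;            /* e.g. "main-ram", "gpu-vram"  */
            unsigned long  size_mb;         /* capacity in megabytes        */
            int            cache_coherent;  /* coherent with the CPUs?      */
    };

    /* An "initiator" is a CPU or any other device that accesses memory. */
    struct initiator {
            const char *name;               /* e.g. "cpu0", "gpu0", "fpga0" */
    };

    /* A "link" is a bus-type connection between an initiator and a target,
     * carrying attributes such as bandwidth and latency.  A "bridge"
     * (omitted here for brevity) would connect a group of initiators to
     * remote targets.                                                      */
    struct link {
            struct initiator *from;
            struct target    *to;
            unsigned long     bandwidth_mb_s;
            unsigned long     latency_ns;
    };

    int main(void)
    {
            struct target    ram  = { "main-ram", 64 * 1024, 1 };
            struct target    vram = { "gpu-vram", 16 * 1024, 0 };
            struct initiator cpu  = { "cpu0" };
            struct initiator gpu  = { "gpu0" };

            /* The CPU reaches system RAM directly; the GPU reaches its own
             * VRAM quickly and system RAM more slowly across the bus.      */
            struct link links[] = {
                    { &cpu, &ram,   20000, 100 },
                    { &gpu, &vram, 400000,  50 },
                    { &gpu, &ram,   16000, 500 },
            };

            for (unsigned int i = 0; i < sizeof(links) / sizeof(links[0]); i++)
                    printf("%s -> %s: %lu MB/s, %lu ns\n",
                           links[i].from->name, links[i].to->name,
                           links[i].bandwidth_mb_s, links[i].latency_ns);
            return 0;
    }

In the real patches, of course, this information would come from the hardware and firmware and be published under sysfs, rather than being hard-coded as it is in this toy example.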
In addition, Jérôme's patches provided a way to express memory policy. A system's memory policy is the mechanism it uses to decide which memory to use for a given task. For example, it might use faster memory first and slower memory only when the fast memory is full. The kernel's current memory policy was organized on a per-CPU basis, which Jérôme felt was not good enough. But he also acknowledged that trying to change that aspect of kernel infrastructure directly might break a lot of existing code. To deal with this, his patch added an entirely new memory policy API that new user code could take advantage of and old user code could simply ignore.
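For contrast, the existing policy interface speaks only in terms of NUMA nodes. Here is a minimal sketch of that pre-existing API, using libnuma and the set_mempolicy() system call; it is not the new HMS policy API from Jérôme's patches, whose exact shape isn't covered here:

    /* A minimal sketch of the existing, node-centric memory policy API,
     * using libnuma (build with: gcc policy.c -lnuma).  This is the
     * interface Jérôme argued cannot express device memory that is not
     * cache coherent, or not visible to the CPU at all.                */
    #include <stdio.h>
    #include <numa.h>
    #include <numaif.h>

    int main(void)
    {
            if (numa_available() < 0) {
                    fprintf(stderr, "NUMA is not available on this system\n");
                    return 1;
            }

            /* Prefer allocations from node 0; the kernel falls back to
             * other nodes when node 0 runs out of memory.              */
            struct bitmask *nodes = numa_allocate_nodemask();
            numa_bitmask_setbit(nodes, 0);
            if (set_mempolicy(MPOL_PREFERRED, nodes->maskp, nodes->size + 1) != 0)
                    perror("set_mempolicy");

            /* Explicitly place a 1MB buffer on node 0. */
            void *buf = numa_alloc_onnode(1 << 20, 0);
            if (buf)
                    numa_free(buf, 1 << 20);

            numa_free_nodemask(nodes);
            return 0;
    }

Everything in that interface is addressed by node number, which is exactly the CPU-and-node-centric view that Jérôme wanted to move beyond.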
Aneesh Kumar responded to all of this, in particular praising Jérôme's approach of keeping the new API separate from the old. But Aneesh said, "that is also the drawback isn't it? We now have multiple entities tracking cpu and memory."
Aneesh also wanted to confirm that "once we have these different types of targets, ideally the system should be able to place them in the ideal location based on the affinity of the access. ie. we should automatically place the memory such that initiator can access the target optimally."
Jérôme seemed to agree with this in principle, but he also seemed to feel that there was no guarantee any of it could be made automatic yet. The first step, he felt, was to expose the APIs and data structures, and then see what could be accomplished.
Meanwhile, Dave Hansen pointed out that there were existing elements of the kernel that dealt with heterogeneous memory. Dave said that HMAT (Heterogeneous Memory Attribute Table) existed in firmware specifically to express the topology to the kernel. Dave also said that NUMA (Non-Uniform Memory Access) was already part of the kernel, and wasn't lying fallow. Additionally, he pointed out that the ACPI (Advanced Configuration and Power Interface) specification had officially embraced NUMA, and there were Linux developers actively contributing patches to support this.
So, Dave was not immediately enthusiastic about ditching this ongoing momentum in one direction, in order to accept Jérôme's radical solution that went in an entirely new direction.
But, Jérôme replied that he was not trying to overthrow the existing work or any kernel patches that made use of HMAT. He said that all of that was still useful just on its own. But he added:
I do not see how to evolve NUMA to support device memory. [...] I can not expose device memory as NUMA node as device memory is not cache coherent on AMD and Intel platform today. [...] In some case that memory is not visible at all by the CPU which is not something you can express in the current NUMA node.
Somewhat mollified, Dave replied:
Yeah, our NUMA mechanisms are for managing memory that the kernel itself manages in the "normal" allocator and supports a full feature set on. That has a bunch of implications, like that the memory is cache coherent and accessible from everywhere.
The HMAT patches only comprehend this "normal" memory, which is why we're extending the existing /sys/devices/system/node infrastructure.
This series has a much more aggressive goal, which is comprehending the connections of every memory-target to every memory-initiator, no matter who is managing the memory, who can access it, or what it can be used for.
Theoretically, HMS could be used for everything that we're doing with /sys/devices/system/node, as long as it's tied back into the existing NUMA infrastructure somehow.
Jérôme agreed with all of the above, and Dave seemed to be on board with Jérôme's approach. But Dave did have some practical objections. For one thing, he said:
We support 1024 NUMA nodes on x86. The ACPI HMAT expresses the connections between each node. Let's suppose that each node has some CPUs and some memory.
That means we'll have 1024 target directories in sysfs, 1024 initiator directories in sysfs, and 1024*1024 link directories. Or, would the kernel be responsible for "compiling" the firmware-provided information down into a more manageable number of links?
Some idiot made the mistake of having one sysfs directory per 128MB of memory way back when, and now we have hundreds of thousands of /sys/devices/system/memory/memoryX directories. That sucks to manage. Isn't this potentially repeating that mistake?
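For scale, 1024 targets times 1024 initiators works out to 1,048,576 link directories, on top of the 2,048 target and initiator directories, which is the kind of combinatorial explosion Dave was worried about repeating.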
Dave also was worried that if Jérôme went forward with his patches, it could be four or five years before all the problems were solved. In the meantime, some portions of memory management would be bottlenecked waiting for those solutions, when they could have been addressed sooner using existing NUMA projects.
Additionally, Dave was curious how Jérôme's code would scale. He said, "It's quite easy to represent a laptop, but can this scale to the largest systems that we expect to encounter over the next 20 years that this ABI will live?"
At this point, Jérôme and Dave were joined by several other folks and began diving into the technical details, further objections and possible solutions that might come out of Jérôme's work. Ultimately, it seemed as if these patches did not represent a threat to existing approaches to memory, and that Jérôme would have support—or at least tolerance—from NUMA-related projects.
The things I love about this discussion are, first of all, that one developer got a big idea that seemed to go against current thinking, but that solved problems he saw as real. Second, that a developer on the other side of the issue was actually interested in the new approach and willing to take it seriously rather than tear it down.
Also, the whole direction of hardware resources is really becoming so strange. The kernel tries to eke out absolutely everything it can from the various devices on the system—even to the point of going beyond the ways those devices were intended to be used! And then once the kernel starts using them that way, other devices come out that use the kernel's new infrastructure. And so we end up with some kind of crazy situation requiring crazy solutions like what Jérôme has proposed.
Note: if you're mentioned above and want to post a response above the comment section, send a message with your response text to ljeditor@linuxjournal.com.