diff -u: What's New in Kernel Development
Nicolas Dichtel and Thierry Herbelot pointed out that the directories in the /proc filesystem used a linked list to identify their files. But, this would be slow when /proc directories started having lots of files, which, for example, might happen when the system needed lots of network sockets.
Nicolas and Thierry posted a patch to change the /proc implementation to use multiple linked lists instead of just one. Each subdirectory would have its own linked list, keyed to a hash of the directory's name. According to their benchmarks, the patch shaved 1/5 of the time needed to churn through all the entries of a given subdirectory.
Stephen Hemminger liked the speedup, but suggested that there already were implementations, like the hlist macro, that might simplify their hash table code.
Eric W. Biederman also liked the speedup and kicked himself for overlooking the /proc issue when doing other scalability work. But, he felt that the whole linked list concept was not the right approach. Especially, he felt that /proc/net/dev/snmp6 was the real target of Nicolas and Thierry's patch, and if no one actually needed the files in that directory (except people requiring extreme backward compatibility), it would be even more efficient to do away with them completely.
This, however, already had come up in an earlier thread, when David S. Miller had said that "It potentially breaks tools, it's a non-starter, sorry." So, reworking the user interface would not be allowed, which left the linked list speedup that Nicolas and Thierry proposed. But, Nicolas said he'd look into an rbtree implementation instead of a plain linked list, because rbtrees would potentially scale better.
Minchan Kim noticed that putting memory pressure on qemu-kvm under Linux 3.14 would cause a kernel stack overflow and crash the system. He dug into the code and tried to reduce his own stack usage, but he wasn't able to cut back enough to prevent the crash. And in any case, he said, trying to reduce everyone's stack usage was not very scalable. He proposed expanding the kernel stack from 8K to 16K, although he acknowledged that there possibly were good reasons not to do this that he wasn't aware of.
Dave Chinner remarked that "8k stacks were never large enough to fit the Linux IO architecture on x86-64, but nobody outside filesystem and IO developers has been willing to accept that argument as valid, despite regular stack overruns and filesystems having to add workaround after workaround to prevent stack overruns."
He added, "We're basically at the point where we have to push every XFS operation that requires block allocation off to another thread to get enough stack space for normal operation", and said "XFS has always been the stack usage canary and this issue is basically a repeat of the 4k stack on i386 kernel debacle."
Borislav Petkov pointed out that if they increased the kernel stack from 8K to 16K, there undoubtedly would come a time when 16K wouldn't be enough either. He wondered if there ever would be a limit, or if the kernel stack ultimately would grow to one megabyte and beyond.
Steven Rostedt said, "If [Minchan's patch] goes in, it should be a config option, or perhaps selected by those filesystems that need it. I hate to have 16K stacks on a box that doesn't have that much memory, but also just uses ext2."
Meanwhile, H. Peter Anvin said, "8K additional per thread is a huge hit. XFS has indeed always been a canary, or trouble spot, I suspect because it originally came from another kernel where this was not an optimization target."
At around this point, Linus Torvalds remarked that something like Minchan's fix probably would be necessary at some point, although the development cycle was already at -rc7, making it too late for that particular kernel version. Linus also pointed out that there was plenty of room to reduce stack usage in the stack trace Minchan had posted in his original e-mail. Linus remarked, "From a quick glance at the frame usage, some of it seems to be gcc being rather bad at stack allocation, but lots of it is just nasty spilling around the disgusting call-sites with tons or arguments. A lot of the stack slots are marked as '%sfp' (which is gcc-ese for 'spill frame pointer', afaik)."
There was a technical discussion about various ways to reduce stack usage in general (and some further consideration of ways in which GCC might be somewhat to blame), but with Linus willing to accept a patch to implement a larger stack, it seems like something along the lines of Minchan's patch will soon be part of the kernel. At one point, Linus summed up his position on the issue, saying, "Minchan's call trace and this thread has actually convinced me that yes, we really do need to make x86-64 have a 16kB stack. [...] The 8kB stack has been somewhat restrictive and painful for a while, and I'm ok with admitting that it is just getting too damn painful."