This article covers some recent changes in the SELinux kernel code, including a performance patch from Kaigai Kohei of NEC and related updates to the selinuxfs API. These changes are currently waiting in the -mm tree for merging into Linus' kernel in 2.6.10, and are also available in the development Fedora and RHEL4 Beta kernels. The article should be useful to sysadmins looking to understand SELinux performance, and more generally to the curious.
Until now, SELinux kernel development has been focused primarily on getting the code merged into the upstream kernel and on preparation for integration into mainstream Linux distributions. This has taken several years: from the initial release in 2000, when SELinux was a patch against the kernel, through the port to the LSM framework and the upstream merge, to integration work performed by various distributions including Debian, Red Hat and Gentoo.
During this phase of the project, performance tuning had not received much attention from developers. While the general base performance hit of SELinux is around 7% -- which is not too bad -- testing on large systems revealed serious scalability issues.
AVC operation and analysis
To understand this, here's a brief explanation of how the core of the SELinux code works. SELinux uses a series of hooks (via LSM) at strategic points within the kernel. These hooks are used to divert the normal flow of the kernel into a security checking mechanism called the Access Vector Cache (AVC). The AVC contains cached permission checks performed against the system's security policy. As the working set of permission checks for a given workload is typically quite small, having them cached allows SELinux to scale virtually independently to the size of the security policy. For the vast majority of cases, an SELinux permission check is a quick cache lookup rather than a search of the entire security policy database.
This sounds great, except that the AVC was protected by a global spinlock. Additionally, because SELinux hooks can be called from hard interrupt context, interrupts had to be disabled and the interrupt state saved before taking the lock, then restored on release (i.e. the spin_lock_irqsave/restore() variant was used). This is quite expensive, and the lock showed up clearly in profiling runs.
Further analysis revealed that it was trivially possible to generate hundreds of thousands of AVC lookups a second on reasonably modest hardware (e.g. by running ls -lR /usr twice), and the atomic operations associated with spinlocks alone were clearly hurting SMP performance.
A few attempts had been made at "fixing" or getting rid of the AVC lock, using reader/writer lock variants and per-cpu techniques. Neither worked very well. Kaigai's patch, however, utilized RCU, a mechanism which allows lockless reads and the elimination of atomic operations in read paths. As the AVC is overwhelmingly read-only on a typical system, RCU turned out to be an ideal method for removing the AVC lock completely.
With the RCU patch applied, SELinux scalability problems noticed in several benchmarks were eliminated. The benchmark results can be found in this post to lkml. For example, Kaigai measured linear scalability for the dbench benchmark on a 32-way system.
Some minor tradeoffs in baseline performance showed up in some benchmarks, although the scalability win definitely outweighs these (if they are even noticeable to anyone in the real world).
The RCU patch does not solve all of the SELinux performance issues. For example, some networking hooks do not use the AVC and effectively always fall through to security policy lookups, which involves the use of a global spinlock. Work is ongoing in these cases and they should be similarly resolved with caching and RCU reasonably soon.
During development and analysis, I instrumented the AVC code and enabled some of the internal debugging code to get a better look at what was really going on, and to see whether any further improvements could be made. (Much of this instrumentation is now available via selinuxfs, and is discussed below.)

Firstly, I was surprised to see that the AVC seemed to be thrashing. The cache had a maximum size of 410 entries (I don't know where this specific number comes from :-). When more entries than this were needed, the AVC code used an LRU policy to reclaim cache entries in chunks. It seemed that this would surely be bad for performance, given that I was seeing thousands of AVC entries in use in total. Additionally, there were 512 hash buckets for the cache, which meant that at least 102 buckets could never be used. My thought was to increase the maximum cache size to reduce the need to reclaim entries, and then to implement a periodic garbage collector (GC) to keep the hash chains short. Shortening the hash chains was potentially important, as some workloads (such as bootup) can create thousands of AVC entries which are then never used again, or are used only very infrequently.
The first step in implementing the GC was to add a last used timestamp to each AVC entry. This was accomplished by setting a field in the AVC node structure to the current value of jiffies, the global current tick count. Then, from a timer, the GC cleanup code would traverse the cache, looking for AVC entries older than a certain amount of time (really, ticks converted to time). As the GC would run extremely infrequently compared to the AVC lookups (many orders of magnitude), the GC run itself should not cause any undue performance problems. And it didn't, although it turned out that the GC system was still somehow hurting performance. The problem with the GC turned out to be reading the global jiffies variable on each AVC lookup. It is declared volatile (which kills some possible optimizations by the compiler) and as a constantly updated global value, reading it very frequently incurs significant overhead on a heavily loaded SMP system.
So, the GC idea was dropped, and the default value for the maximum size of the AVC cache was increased to allow it to fill up all of its hash buckets. Extensive benchmarking with various values showed 512 to be a good choice.
But what about the cache thrashing? After further analysis, this also turned out to be bogus. Using the instrumentation, I was able to observe that workloads tend to use a surprisingly small working set of AVC entries. Most of the cache entries were stale or infrequently used, so reclaiming them was not generally a problem. An idle system, with SSH and a monitoring script active, might keep as few as 20 or 30 AVC entries in its working set, and perhaps that number again when running a simple server workload such as Apache with static content. It was difficult to find a real workload which used more than a hundred AVC entries in its working set. With the original reclamation code, if the cache was populated with mostly stale entries (e.g. those created during a login or daemon startup) and a new workload was started, there would be an initial hit while the LRU-based code reclaimed enough entries for the new workload, but this cost would then be amortized very quickly over the life of the workload. So, the original design was fine all along. Lesson: be careful about second-guessing NSA code :-)
The new AVC API
One thing that did seem useful, though, was to allow the sysadmin to control the maximum size of the AVC, so that complex, multiple, or fast-changing workloads could be more readily accommodated if needed. This is now a tunable exposed via the selinuxfs API.
To view the current value:
    # cat /selinux/avc/cache_threshold
    512
To set a new value:
# echo 1024 > /selinux/avc/cache_threshold
I would suggest always verifying that this succeeds, as the new setsecparam permission is needed to write to the file.
It's not much use providing a way to modify the maximum cache size without some way to monitor the operation of the AVC. Two further files have been added under the avc directory:
    # cat /selinux/avc/hash_stats
    entries: 503
    buckets used: 269/512
    longest chain: 5
This is really there to help debug complicated performance problems, although it is worth explaining. The output shows the total number of AVC entries, the hash bucket usage, and the length of the longest hash chain. Maximum chain lengths should generally be quite short (less than 10 is typical on my systems). Significantly larger values indicate possible performance issues, as the AVC may be spending a lot of time traversing long hash chains on each lookup. In that case, performance may improve if you reduce the size of the cache, although it would be worth posting a report to one of the SELinux mailing lists if you see anything odd here.
    # cat /selinux/avc/cache_stats
    lookups    hits       misses   allocations   reclaims   frees
    933871     928520     5351     5486          1216       5124
    723798     719270     4528     4569          608        4483
    1677626    1675464    2162     2180          352        2051
    1390072    1388430    1642     1671          576        1744
These are per-cpu statistics for the various AVC operations, and hopefully fairly self-explanatory. The lookups field is a good indicator of the load being placed on the AVC. The field to watch is probably reclaims, a count of the total number of cache entries reclaimed so far. If it keeps changing, you may need to increase the cache size. Light reclaim activity from time to time is normal.
The avcstat utility
Note that you probably should not just cat this file directly. Use the avcstat utility, available here or in the latest libselinux package (Fedora users can simply update to the latest development packages via yum). This utility, similar in spirit to tools such as vmstat, adds up the per-cpu values and displays them in tabular format. You can display continuously updating values by including the number of seconds between updates on the command line:
    # avcstat 1
       lookups      hits    misses    allocs  reclaims     frees
       4782948   4769221     13727     14013      2752     13465
            81        81         0         0         0         0
            47        47         0         0         0         0
            53        53         0         0         0         0
    ^C
The first line always shows the current totals; subsequent lines show, by default, values relative to the previous line. The -c switch shows cumulative values instead, if desired.
The use of RCU solved a serious scalability problem with the AVC, thanks to the work of Kaigai and the RCU developers. It seems likely to be a useful approach for dealing with similar problems, specifically including some of the SELinux networking code as mentioned above. The new AVC API and the avcstat utility should prove useful to sysadmins and SELinux developers who need to analyze and tune performance. Feedback from users is welcome: these are new features which have not had wide exposure, and there are obviously going to be many workloads which I've not tested. If you have any interesting performance situations, please post some information and output from the AVC API to one of the SELinux mailing lists.