Control Group Subsystems in RHEL7

Control groups (cgroups) are a Linux kernel feature that enables you to allocate resources, such as CPU time, system memory, disk I/O, and network bandwidth, among hierarchically ordered groups of processes running on a system. Initially developed by Google engineers Paul Menage and Rohit Seth in 2006 under the name “process containers”, the feature was merged into kernel version 2.6.24 and has been extensively enhanced since then. RHEL6 was the first Red Hat distribution to support cgroups.

Cgroups provide system administrators with fine-grained control over allocating, prioritizing, denying, managing, and monitoring system resources. A cgroup is a collection of processes that are bound by the same criteria. These groups are typically hierarchical, where each group inherits limits from its parent group.

The problem with the traditional use of cgroups is summarized by the following excerpt from a Red Hat guide:

Control Groups provide a way to hierarchically group and label processes, and to apply resource limits to them. Traditionally, all processes received a similar amount of system resources that the administrator could modulate with the process niceness value. With this approach, applications that involved a large number of processes received more resources than applications with few processes, regardless of the relative importance of these applications.

In RHEL6, administrators had to build custom cgroup hierarchies to meet their application needs. In RHEL7, it is no longer necessary to build custom cgroups as resource management settings have moved from the process level to the application level via binding the system of cgroup hierarchies with the systemd unit tree. By default, systemd automatically creates a hierarchy of slices, scopes and services to provide a unified structure for the cgroup tree.
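
You can view the resulting tree with the systemd-cgls utility. The abbreviated output below is only a sketch; the exact slices, services, and PIDs will differ on your system:

# systemd-cgls
├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 21
├─user.slice
│ └─user-1000.slice
│   └─session-1.scope
│     └─2891 sshd: admin [priv]
└─system.slice
  ├─sshd.service
  │ └─1214 /usr/sbin/sshd -D
  └─crond.service
    └─1213 /usr/sbin/crond -n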

A resource controller, also called a cgroup subsystem, represents a single resource, such as CPU time or memory. The Linux kernel provides a range of resource controllers, which you can list by cat’ing /proc/cgroups:

# cat /proc/cgroups
#subsys_name	hierarchy	num_cgroups	enabled
cpuset	2	1	1
cpu	3	1	1
cpuacct	3	1	1
memory	4	1	1
devices	5	1	1
freezer	6	1	1
net_cls	7	1	1
blkio	8	1	1
perf_event	9	1	1
hugetlb	10	1	1


A quick explanation of each of the above cgroup subsystems:

  • cpuset: Assigns individual CPUs (on a multicore system) and memory nodes to tasks in a cgroup.
  • cpu: Uses the scheduler to provide cgroup tasks access to the CPU.
  • cpuacct: Generates automatic reports on CPU resources used by tasks in a cgroup.
  • memory: Sets limits on memory use by tasks in a cgroup, and generates automatic reports on memory resources used by those tasks.
  • devices: Allows or denies access to devices by tasks in a cgroup.
  • freezer: Suspends or resumes tasks in a cgroup.
  • net_cls: Tags network packets with a class identifier (classid) to enable the Linux traffic controller to identify packets originating from a particular cgroup task.
  • blkio: Sets limits on input/output access to and from block devices.
  • perf_event: Permits monitoring cgroups with the perf tool.
  • hugetlb: Enables large virtual memory pages and the enforcing of resource limits on these pages.
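
Because RHEL7 binds these controllers to the systemd unit tree, the usual way to apply a controller setting is through a unit property rather than by writing to the cgroup filesystem directly. As a sketch (httpd.service is just an illustrative unit), the following binds cpu and memory controller limits to a service and persists them across reboots:

# systemctl set-property httpd.service CPUShares=512 MemoryLimit=1G

CPUShares maps to the cpu controller’s cpu.shares file, and MemoryLimit maps to the memory controller’s memory.limit_in_bytes file.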

You can also use the lssubsys utility to view the control group subsystems:

# lssubsys
cpuset
cpu,cpuacct
memory
devices
freezer
net_cls
blkio
perf_event
hugetlb

# lssubsys -im
cpuset /sys/fs/cgroup/cpuset
cpu,cpuacct /sys/fs/cgroup/cpu,cpuacct
memory /sys/fs/cgroup/memory
devices /sys/fs/cgroup/devices
freezer /sys/fs/cgroup/freezer
net_cls /sys/fs/cgroup/net_cls
blkio /sys/fs/cgroup/blkio
perf_event /sys/fs/cgroup/perf_event
hugetlb /sys/fs/cgroup/hugetlb

# mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,seclabel,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)


If you install the kernel-doc RPM, you will find documentation on each of the above cgroup subsystems under /usr/share/doc/kernel-doc-<kernel_version>/Documentation/cgroups/.

OSTree

Project Atomic, a Red Hat sponsored project, features an interesting new update system for RPM-based Linux operating systems called OSTree (rpm-ostree), which has been developed by Colin Walters over the last couple of years. Evidently OSTree supports atomic updates to an OS, although I am not sure exactly how that works in practice, given the amount of marketing hype and buzzwords surrounding Project Atomic, Docker containers included.

In the default model, the RPMs are composed on a server into an OSTree repository, and client systems can replicate it in an image-like fashion, including incremental updates. However, unlike traditional Linux distribution update mechanisms, OSTree automatically keeps the previous version of the OS and makes it available for rollback.

The goal of Project Atomic is more than a new update mechanism:

Project Atomic integrates the tools and patterns of container-based application and service deployment with trusted operating system platforms to deliver an end-to-end hosting architecture that’s modern, reliable, and secure.

Upgrading a system is as easy as invoking rpm-ostree upgrade. The utility consults the repository URL specified in /ostree/repo/config to see whether an updated OS codebase is available. If a new version is found, it is first downloaded and then deployed. At that point, a three-way merge of configuration is performed, using the new /etc as a base and applying your specific changes on top. You then have to reboot your system for the update to take effect.
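
In practice, the upgrade path described above boils down to two commands (sketched here; a plain reboot works just as well as systemctl reboot):

# rpm-ostree upgrade
# systemctl reboot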

If you boot into the new tree and determine something is wrong, you can rollback to a previous tree (operating system snapshot) via a bootloader menu entry and invoke rpm-ostree rollback to permanently remove the OS update.
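
The rollback is equally terse; it too takes effect on the next boot:

# rpm-ostree rollback
# systemctl reboot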

OSTree supports the storing of multiple trees and easily switching between any of the trees.

Interestingly, OSTree functionality and use cases kind of remind me of Live Update in Solaris 10.

The Sunsetting of SHA-1

SHA-1 (Secure Hash Algorithm 1) is a 160-bit hash algorithm that has been at the heart of many web security protocols, such as Secure Sockets Layer (SSL) and Transport Layer Security (TLS), since shortly after it was developed by the NSA (National Security Agency) in 1995.

In 2005, researchers in China led by professor Xiaoyun Wang demonstrated a collision attack against the SHA-1 function that was significantly faster than brute force, suggesting that the algorithm might not be secure enough for ongoing use. Because of this, NIST immediately recommended that federal agencies begin moving away from SHA-1 toward stronger algorithms. In 2011, NIST mandated that many applications in federal agencies move away from SHA-1 by December 31, 2013. Other attacks on SHA-1 have since been found, and NIST estimated in 2013 that SHA-1 provides only 69 bits of security in digital signature applications.

A recent survey by Netcraft found that 98 percent of all the SSL certificates used on the web still use SHA-1 signatures, and less than 2 percent use SHA-256 signatures.
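
If you are wondering which camp your own certificates fall into, openssl will report the signature algorithm (cert.pem below is a placeholder for your certificate file):

$ openssl x509 -in cert.pem -noout -text | grep 'Signature Algorithm'
    Signature Algorithm: sha1WithRSAEncryption

A sha1WithRSAEncryption result means the certificate will be caught by the deprecation deadlines discussed below.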

Last year Microsoft announced that Windows will stop accepting SHA-1 certificates in SSL by 2017, and that code signing certificates with SHA-1 hashes will no longer be accepted by Windows on January 1, 2016. Google recently announced that Chrome will phase out acceptance of SHA-1 certificates in SSL by 2017. Similarly, Mozilla is planning to stop accepting SHA-1-based SSL certificates in Firefox by 2017.

So it looks like SHA-1 certificates will sunset by the end of 2017 in most major applications.

RHEL7 Does Not Support User Namespaces

The Linux kernel currently implements six (out of ten proposed) namespaces for process separation:

  • mnt – mount points, filesystems
  • pid – processes
  • net – network stack
  • ipc – System V IPC
  • uts – hostname, domainname
  • user – UIDs, GIDs
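
You can see which namespaces the running kernel exposes by listing a process’s namespace files. On a RHEL7 kernel you would expect something like the following, with no user entry, for the reasons discussed below:

$ ls /proc/self/ns
ipc  mnt  net  pid  uts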

The last Linux namespace to be fully implemented was the user namespace (CLONE_NEWUSER), whose implementation was finally completed in the 3.8 kernel after being started in the 2.6.23 kernel.

The current kernel in RHEL7 is 3.10.0-121. Unfortunately it does not include the user namespace. According to Dan Walsh of Red Hat:

We hope to add the user namespace support to a future Red Hat Enterprise Linux 7.

User namespaces are important for securely implementing Linux Containers. According to the Red Hat Enterprise Linux Blog:

User namespaces isolate the user and group ID number spaces, such that a process’s user and group IDs can be different inside and outside a user namespace. The most interesting case here is that a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace. This means that the process has full root privileges for operations inside the user namespace, but is unprivileged for operations outside the namespace. While very promising, there are still a few kinks to work out before it meets our standards for enabling it in Red Hat Enterprise Linux 7 Beta. Rest assured, we are working on stabilizing the code base and solving the issues so we can consider turning it on in the future when it becomes enterprise ready.
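
On a kernel where user namespaces are enabled (which, again, excludes RHEL7’s 3.10.0-121), the mapping described in the quote is easy to demonstrate with unshare(1), assuming a util-linux recent enough to provide the --map-root-user option:

$ id -u
1000
$ unshare --user --map-root-user id -u
0

The same user is uid 0, with full root privileges, inside the new namespace, while remaining an unprivileged uid 1000 outside it.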

This is a serious omission on the part of Red Hat and should be rectified as soon as possible. My guess is that user namespace support was held back because a number of filesystems are not yet namespace-aware.

Boycott Systemd

Finally people are beginning to wake up and understand that systemd, driven by Lennart Poettering of Red Hat, is a cancer that will destroy and splinter the Linux ecosystem.

According to a new movement, boycottsystemd.org:

It represents a monumental increase in complexity, an abhorrent and violent slap in the face to the Unix philosophy, and its inherent domineering and viral nature turns it into something akin to a “second kernel” that is spreading all across the Linux ecosystem.

I could not agree more.

systemd flies in the face of the Unix philosophy: “do one thing and do it well,” representing a complex collection of dozens of tightly coupled binaries. Its responsibilities grossly exceed that of an init system, as it goes on to handle power management, device management, mount points, cron, disk encryption, socket API/inetd, syslog, network configuration, login/session management, readahead, GPT partition discovery, container registration, hostname/locale/time management, and other things. Keep it simple, stupid.

Red Hat is now, via Poettering and others, driving Linux in a direction that is fundamentally at odds with the classical Unix philosophy Linux distributions have followed over the years. It started with the crazy NetworkManager daemon a few years ago and is now in full swing with systemd and other “enhancements”.

Systemd adds a serious level of complexity, risk, and cost to deploying RHEL7 in an enterprise. Binary logs, for example, mean that an enterprise will have to rework its existing logging infrastructure, which costs real time and money. For these reasons and more, I would not be surprised if uptake of RHEL7 among existing RHEL deployments is close to non-existent.

All hail Lennart Poettering!