The Dreaded Phantom Atari Partition

Introduction

This article explains a data corruption issue that recently surfaced in Rook. The root cause lies in an unexpected place and could affect systems other than Rook. What makes it interesting is that the root cause had existed for a long time, and only a recent series of coincidences caused Rook to get hit by it. I am sharing it as an example of the kind of thing that really does happen in the real world. I also wrote it up because I was amused to see the word "Atari" appear in a non-historical context in 2021.

In addition to the information provided in Rook's official documentation, this article reconstructs the details with added information for those who may not have prior knowledge of Rook.

Glossary

  • Ceph: An open-source distributed storage system.
  • Rook: An orchestrator for Ceph that runs on Kubernetes. This is also open-source.
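  • OSD (Object Storage Daemon): The Ceph daemon that stores data on a disk. In this article the term also refers to the data structure that resides on each disk making up Ceph.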
  • OSD on device: One of the methods for creating OSDs in Rook. You specify device paths and other settings directly in the Rook configuration. Details are described later.
  • Atari partition: A partition format used by the Atari ST computers from the past.

Problem Summary

  • Phenomenon
    • OSD data becomes corrupted.
  • Root Cause Summary
    • A disk where an OSD exists is misidentified as having an Atari partition table, and Rook ends up creating an OSD on that (non-existent) partition.
  • Conditions for Occurrence
    • Using Rook v1.6.0 through v1.6.7.
    • Creating an "OSD on device" on a disk that has not been partitioned.
  • Workaround
    • Update to Rook v1.6.8 or later, or to Rook v1.7.
  • Recovery Method from Data Corruption
    • None. Your only option is to recreate the OSD on the disk holding the corrupted OSD by following this procedure.

Mechanism

In this scenario, let's assume Rook attempts to create an OSD on a disk named /dev/sdb.

  1. Rook runs and creates an OSD on /dev/sdb.
  2. The next time Rook runs, it mistakenly concludes that /dev/sdb contains an Atari partition table rather than an OSD, and that several empty Atari partitions are available for OSD creation.
  3. It creates an OSD on one of the fake empty partitions from step 2, which corrupts the data of the existing OSD.

Problem Details

To understand the problem in detail, you need some knowledge about Rook, Ceph, and Atari partitions. Therefore, I will first explain some prerequisite knowledge before describing the flow of how the actual problem occurs.

How to configure OSD on device in Rook

When creating an OSD in Rook, you write settings in a CephCluster Custom Resource (CR) specifying the conditions under which you want the OSD to be created. To create an "OSD on device," you provide specifications such as:

  • The path of the device where you want to create the OSD (e.g., "/dev/sdb")
  • A regular expression to match devices (e.g., "/dev/sd.*")
  • The setting useAllDevices: true, which tells Rook to create OSDs on all unused devices

For more details, please refer to the official documentation.

When Rook runs, it creates OSDs on devices according to the CephCluster CR as follows:

  1. On each node that makes up the Rook cluster, Rook executes ceph-volume, a command provided by Ceph.
  2. The ceph-volume command lists the devices present on the system and reports, for each one, whether an OSD can be created on it (that is, whether it is empty).
  3. Rook creates an OSD on each device that is empty and matches the configuration in the CephCluster CR.
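
The selection in step 3 can be sketched roughly as follows. This is only an illustration in C of the decision described above, not Rook's actual code: the struct and function names are hypothetical, and the configuration is reduced to two of the options listed earlier (an explicit device path and the "all unused devices" setting).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one entry in the device report from step 2. */
struct device_report {
    const char *path;  /* e.g. "/dev/sdb" */
    int available;     /* 1 if the device was reported as empty */
};

/* Hypothetical, simplified model of the OSD settings in the CR. */
struct osd_config {
    int use_all_devices;      /* "create OSDs on all unused devices" */
    const char *device_path;  /* one explicit device path, or NULL */
};

/* Returns 1 if an OSD would be created on this device. */
int should_create_osd(const struct osd_config *cfg,
                      const struct device_report *dev)
{
    if (!dev->available)
        return 0;  /* devices reported as in use are never touched */
    if (cfg->use_all_devices)
        return 1;  /* any empty device qualifies */
    return cfg->device_path != NULL &&
           strcmp(cfg->device_path, dev->path) == 0;
}

/* Writes the selected paths to out (which must have room for n entries)
 * and returns how many devices were selected. */
size_t select_devices(const struct osd_config *cfg,
                      const struct device_report *devs, size_t n,
                      const char **out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (should_create_osd(cfg, &devs[i]))
            out[count++] = devs[i].path;
    }
    return count;
}
```

The point to notice is that the decision trusts the "available" flag completely. Anything that step 2 wrongly reports as an empty device becomes a candidate for a new OSD.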

OSD Formats in Ceph

When Ceph creates an OSD on a device, it writes OSD metadata onto that device. Ceph has two formats for OSDs, each writing metadata to a different location:

  • lvm mode OSD: Creates an LVM Volume Group (VG) on the device, creates a Logical Volume (LV) within it, and writes the OSD metadata to the beginning of that LV.
  • raw mode OSD: Writes the OSD metadata directly to the beginning area of the device.

Originally, only lvm mode existed, but recently the simpler and easier-to-manage raw mode OSD was introduced. In Rook as well, starting from v1.6.0, raw mode OSDs have been used for "OSD on device."

For more information on OSD modes, please also refer to this article.

How the Linux Kernel Recognizes Atari Partitions

The Linux kernel's check for Atari partitions is quite loose compared with its checks for other partition formats. To tell whether the fault lies with the kernel or with an unusual Atari partition specification, I would need to read that specification, but I could not find one, so my investigation is limited to the kernel source code.

The kernel decides that a disk has an Atari partition table if the disk's first sector contains at least one valid-looking partition entry. There can be up to four partitions[1], and each entry is checked with the VALID_PARTITION() macro.

https://github.com/torvalds/linux/blob/master/block/partitions/atari.c#L53-L70

	rs = read_part_sector(state, 0, &sect);
	if (!rs)
		return -1;

	/* Verify this is an Atari rootsector: */
	hd_size = get_capacity(state->disk);
	if (!VALID_PARTITION(&rs->part[0], hd_size) &&
	    !VALID_PARTITION(&rs->part[1], hd_size) &&
	    !VALID_PARTITION(&rs->part[2], hd_size) &&
	    !VALID_PARTITION(&rs->part[3], hd_size)) {
		/*
		 * if there's no valid primary partition, assume that no Atari
		 * format partition table (there's no reliable magic or the like
	         * :-()
		 */
		put_dev_sector(sect);
		return 0;
	}

As the comment says, the format appears to have no reliable "magic number" or similar marker of the kind that other partition table formats typically provide.

The definition of VALID_PARTITION() is as follows:

https://github.com/torvalds/linux/blob/master/block/partitions/atari.c#L19-L25

/* check if a partition entry looks valid -- Atari format is assumed if at
   least one of the primary entries is ok this way */
#define	VALID_PARTITION(pi,hdsiz)					     \
    (((pi)->flg & 1) &&							     \
     isalnum((pi)->id[0]) && isalnum((pi)->id[1]) && isalnum((pi)->id[2]) && \
     be32_to_cpu((pi)->st) <= (hdsiz) &&				     \
     be32_to_cpu((pi)->st) + be32_to_cpu((pi)->siz) <= (hdsiz))

It is quite frightening that a disk can be misidentified as having a partition table simply because its first sector satisfies such loose conditions.
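
To get a feel for how little it takes to pass this check, here is a minimal user-space sketch of the same test in C. It is not kernel code. The function names are hypothetical, the 0x1C6 offset of the four primary entries is my reading of the kernel's struct rootsector, and the C library's isalnum() stands in for the kernel's own ctype table, so treat all three as assumptions.

```c
#include <ctype.h>
#include <stdint.h>

#define SECTOR_SIZE     512
#define PART_TABLE_OFF  0x1C6  /* assumed offset of part[0] in the root sector */
#define PART_ENTRY_SIZE 12     /* flg (1) + id (3) + st (4) + siz (4) */
#define NUM_PRIMARY     4

/* Read a big-endian 32-bit value from a byte buffer. */
uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* User-space counterpart of VALID_PARTITION() for one 12-byte entry.
 * The sum is computed in 64 bits so that it cannot overflow in this
 * sketch; the kernel's exact integer widths are not reproduced. */
int entry_looks_valid(const uint8_t *e, uint64_t disk_sectors)
{
    uint64_t st = read_be32(e + 4);   /* start sector */
    uint64_t siz = read_be32(e + 8);  /* length in sectors */

    return (e[0] & 1) &&
           isalnum(e[1]) && isalnum(e[2]) && isalnum(e[3]) &&
           st <= disk_sectors &&
           st + siz <= disk_sectors;
}

/* Returns 1 if any of the four primary entries looks valid, which is
 * the condition under which the sector is treated as an Atari root
 * sector. */
int looks_like_atari_rootsector(const uint8_t *sector, uint64_t disk_sectors)
{
    for (int i = 0; i < NUM_PRIMARY; i++) {
        const uint8_t *e = sector + PART_TABLE_OFF + i * PART_ENTRY_SIZE;
        if (entry_looks_valid(e, disk_sectors))
            return 1;
    }
    return 0;
}
```

In other words, one set bit, three alphanumeric bytes, and two numbers that fit within the disk size are enough. On a large disk the last two conditions are easy to meet, so arbitrary binary data in the first sector has a realistic chance of passing.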

Flow Leading to the Problem

Let's consider again the case where Rook tries to create an OSD on a disk named /dev/sdb. For simplicity, assume that Rook is configured to create OSDs on all empty devices.

First, Rook runs and creates an OSD on /dev/sdb. Everything is fine up to this point. In versions v1.6.0 through v1.6.7, a raw mode OSD is created here, and the OSD metadata is written to the beginning of the disk. Unfortunately, the bit pattern of this OSD metadata is prone to being misidentified as an Atari Partition Table[2].

A misidentified device can be spotted in the output of the lsblk command, as in the following example (taken from a machine where the affected disk is vdb rather than sdb):

vdb    252:16   0    3T  0 disk 
├─vdb2 252:18   0   48G  0 part ★ phantom Atari Partition 
└─vdb3 252:19   0  6.1M  0 part ★ same as above

Interestingly, tools such as lsblk, blkid, udevadm, and parted do not recognize Atari partitions, so vdb2 and vdb3 may appear not to exist, or the partition table type may be reported as "unknown". This confused users and developers alike, and it is why these misidentified partitions are called "phantom" partitions.

Later, when Rook runs again for whatever reason, it executes the ceph-volume command. The command mistakenly concludes that /dev/sdb contains an Atari partition table rather than an OSD, and that the disk has several phantom partitions. Let's assume /dev/sdb2 is one of them. Because of this misidentification, ceph-volume reports that /dev/sdb is in use but that /dev/sdb2 is empty and can hold an OSD.

Finally, since Rook has been instructed by the user to "create OSDs on empty devices," it creates a new OSD on /dev/sdb2. This action destroys part of the OSD data that was originally on /dev/sdb.
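
The reason this destroys existing data is plain arithmetic: a partition is just a sector range inside its parent disk, so anything written to the phantom partition lands inside the area already owned by the original OSD. The sketch below illustrates this in C under the assumption of 512-byte sectors. The names are hypothetical, and the real start and length of a phantom partition depend on whatever bytes the OSD metadata happens to contain.

```c
#include <stdint.h>

#define SECTOR_SIZE 512u

/* Half-open byte range [begin, end) on the parent disk. */
struct byte_range {
    uint64_t begin;
    uint64_t end;
};

/* Byte range on the parent disk covered by a partition that starts at
 * start_sector and is num_sectors long. */
struct byte_range partition_range_on_disk(uint32_t start_sector,
                                          uint32_t num_sectors)
{
    struct byte_range r;
    r.begin = (uint64_t)start_sector * SECTOR_SIZE;
    r.end = r.begin + (uint64_t)num_sectors * SECTOR_SIZE;
    return r;
}

/* Returns 1 if the two ranges share at least one byte. */
int ranges_overlap(struct byte_range a, struct byte_range b)
{
    return a.begin < b.end && b.begin < a.end;
}
```

A raw mode OSD on /dev/sdb owns the whole disk, so every phantom partition accepted by the kernel overlaps it, and the metadata of the new OSD overwrites data somewhere in the middle of the old one.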

History of Addressing the Problem

This issue involves three parties: Rook, Ceph, and the Linux Kernel. In resolving such problems, considerations like "at which layer can it be fixed?" and "at which layer should it be fixed?" are crucial.

After various twists and turns, several workarounds were proposed. In the end it was decided that superficial measures would not suffice, and Rook was changed to use lvm mode OSDs when creating an OSD on device.

Separately, a fix that makes ceph-volume ignore phantom Atari partitions has already been submitted to Ceph and is expected to ship in the upcoming v16.2.6 release. Once v16.2.6 is out, I expect Rook will go back to raw mode for OSD on device when running with that version.

Additionally, regarding the Linux kernel, a fix was recently made in Ubuntu to disable Atari partition support in kernels for major cloud provider environments. Since the reason for this fix isn't explicitly stated, it might have emerged in a context unrelated to Rook or Ceph.

Conclusion

This was quite an intense problem, but I believe it is a great case study for learning which software to fix and how to communicate such issues. If you would like to know more, I recommend digging deeper by following the related issues and discussions.

Footnotes
  1. This may be a Linux restriction, but I'm not sure. ↩︎

  2. Fortunately, there have been no reports of cases where an lvm mode OSD was misidentified as an Atari partition. However, this is simply because in lvm mode, the beginning of the disk contains VG metadata, which happens to be a pattern that is not recognized as an Atari partition. ↩︎

Discussion

uchanuchan

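It reports that /dev/sdb is in use but that /dev/sdb2 is empty and can hold an OSD.

Finally, since Rook has been instructed by the user that it may create OSDs on empty devices, it creates a new OSD on /dev/sdb1.

The device reported as empty was /dev/sdb2, so why is /dev/sdb1 the one that gets used?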

satsat

That was a typo. I have fixed it.