DONG Yuxuan @ Sep 30, 2019 Asia/Shanghai
A very highly abstract guide to the file system of Liunx for application programmers, ignoring many difficult details. It aims to help application programmers not to make basic mistakes.
A disk is just a bag of bytes. It’s the file system makes it structured. It brings the concepts of file, directory, path, symbolic/hard link, and so on.
How does the file system work? This is not a well-defined question, because there’re many file systems. On Linux, the ext-family is most common. It includes ext2, ext3, ext4, and maybe more in the future. The ext2 is a good choice to learn, because it’s simple and has a good compatibility with the later versions.
However, if you search the web for ext2 you will find it’s very difficult. Most documents you can find contains many byte-by-byte analysis and horrible Linux kernel structures. Those documents/articles are written for system programmers who develop disk utilities or operating systems. At least, they are written for C programmers.
You will be scared if you’re an application programmer. Maybe you use Python for some high-level work like machine learning or web development and you happenly met a performance problem caused by your I/O operations. You just want to know an abstract model of the file system to avoid some basic mistakes.
Well, this article is written for you. It introduces a simplified model of ext2 which can help to build a basic understanding of the file system. The article first introduces the model, then explains some basic file operations with this model, at last the article introduces an interesting real-world application to show how to use the model in your work.
Even it’s a simplified model, basic programming knowledge like the time complexity and symbolic/hard links is required. The article also assumes you have good programming intuition and know some common programming tricks like bitmap.
The disk is divided into blocks by the file system. The size of a block is specified while initialization, usually 1KB or 4KB. This article assumes it’s 1KB unless otherwise stated.
A file’s data is stored in blocks and a block is the smallest storage unit. If the size of the file is 2.5KB, we need 3 blocks to store it. Blocks of a file’s data may not be continuous. Blocks storing file data are called data blocks. Not all blocks are data blocks. Some blocks are used to store metadata like the inode table which is introduced later.
Apparently, we need to know the metadata about which blocks are free, which are used. This metadata is stored as a bitmap called the block bitmap. When we write data to a file, we allocate free blocks as the data blocks of the file, then flag bits of these blocks as used in the block bitmap, then write data to blocks.
As blocks of a file may not be contiuous, how do we know which blocks belong to the file? Of course for each file we need maintain a data block list. The list is one of the file’s metadata. The data block list with other metadata like permissions together are stored in the structure called inode. The filename is not stroed in the data nor in the inode. It’s stored in the data of the directory containing the file. We will discuss this later.
As blocks have the block bitmap, inodes have the inode bitmap to indicate which is used, which is free.
As the inode doesn’t contain the filename, two files can have the same inode and this is how hard links work. If you create a hard link of the file. The link and the file itself will have one inode, so they have the same blocks, so they synchronize data and metadata like permissions with each other. Each inode maintains a reference count to record how many files are linked to it.
The inode is the entrance of a file. Most operations of the file need finding its inode as the first step. However, as an application programmer, we rarely face inodes directly. In the application level, the first step of visiting a file is alway opening it according to its filepath, or filename.
How does the file system find the inode by the filename? To understand it we must first understand what a directory really is.
A directory is also a file, its data is names and points to inodes of files and subdirectories in it. Each item (filename with inode pointer) is called a directory entry. You can run
vi /usr to see it more clearly.
The root directory has a fixed inode number, usually 2. If we want to visit
/usr/local/etc/kvdiff.txt, we find the inode of
/, as I said it’s fixed, and find data blocks of
/, then search the directory entries of
usr to find the inode of
/usr. Repeat this procedure until we find the inode of
/usr/local/etc/kvdiff.txt. This procedure is so important that I must give it a name “the locating algorithm” for later references in the article.
Now we have understood the baisc theory of file systems. We can use it to explain basic operations of files.
Use the locating algorithm to find its inode
Find data blocks by the inode
Read data from data blocks
As we can see, the size of the file and the position we reading from has a minimal effect on the performance.
Use the locating algorithm to find its inode
Allocate data blocks
Flag allocated blocks as used in the block bitmap
Write data to data blocks
The size of the file has a minimal effect on the performance.
Use the locating algorithm to find inodes and data blocks of the file and the directory containing the file
Removing the directory entry from the directory
Reduce the reference count of the inode by 1
If the reference count after reduced is not zero, the procedure is finished, else do the following.
Flag the inode of the file as free in the inode bitmap
Flag data blocks of the file as free in the block bitmap
Because we may need to flag data blocks in the block bitmap, the bigger the file is, the slower the deleting is.
Use the locating algorithm to find inodes and data blocks of the source directory and the destination directory
Remove the directory entry regarding to the file in the source directory
Add the directory entry regarding to the file in the destination directory
As shown above, moving inside a file system is just modifying directory entries of two directory files, so it is very fast and irrelevant to the size of the file to be moved.
Suppose you’re going to develop a web service to store users’ avatars. How should you arrange the directory structure? The most simple solution is putting all image files into one directory and naming files by their md5 values.
If you arrange so, how long will your system take to visit a file? The answer is \(\Theta(n)\), where \(n\) is the number of files, because to read or write a file, we must run the locating algorithm. It means to scan the directory entry list with length \(n\). If you have large number of files, it would be slow.
A smarter solution is to build a 32-levels structure. Each level represents a character in the 32-length md5 hex string. For example, the PNG file with md5
aaca0f5eb4d2d98a6ce6dffa99f8254b will be put in
In this structure, visiting a file will run 32 times of locating algorithm and each will scan at most 16 directory entries. So we do at most 16*32=512 times of filename comparations.