OSTEP Learnings [ Personal ]
This post is not for the general public, but a personal note to myself. Some things may not make sense, and my English is definitely broken in this one. If you want to inspect my notes, you are welcome.
The OS is the main program that interacts with the physical hardware, mainly the CPU, Memory, and Disk, as well as any other device connected to the computer. It has several critical jobs:
- Sharing the resources amongst user processes
- Securing the hardware from malicious processes
- Securing processes from other processes
- Making sure the hardware is utilized properly
- Abstracting away intricate details of working with hardware, so user programs can focus on their main goals
The shared resources are primarily the CPU and Memory. Though persistent storage is also shared, it is not treated the same way as the CPU and Memory, because it is not as constrained as the others. The OS virtualizes the CPU and Memory and implements some interesting mechanisms to share them amongst processes.
Sharing the CPU is about sharing time. The OS's key objective is to provide fair sharing of CPU cycles between processes. Though all the processes eventually get what they want, the difference lies in when. This is the responsibility of scheduler algorithms. There is no single best scheduler algorithm, and each has its pros and cons. Some of the important metrics scheduler algorithms try to balance are interactivity (response time), turnaround time, and fairness.
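To make sure I remember the definitions, a tiny C sketch with made-up jobs (my own example, not from the book): turnaround time is completion minus arrival, response time is first run minus arrival.

```c
#include <stdio.h>

/* Two common scheduling metrics on hypothetical jobs:
 * turnaround = completion - arrival, response = first_run - arrival. */
struct job { const char *name; int arrival, first_run, completion; };

int main(void) {
    struct job jobs[] = {
        { "A", 0,  0, 20 },
        { "B", 0, 20, 30 },
        { "C", 0, 30, 40 },
    };
    for (int i = 0; i < 3; i++) {
        struct job j = jobs[i];
        printf("%s: turnaround=%d response=%d\n",
               j.name, j.completion - j.arrival, j.first_run - j.arrival);
    }
    return 0;
}
```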
The OS and the scheduler manage processes via a mechanism called Limited Direct Execution. In this mechanism, the OS gives the CPU to the user program and steps aside. At that point, the only running program is the user program. The OS takes back control via interrupts: timer interrupts, system calls made by the user program, or traps raised when the user program tries to do something bad, such as accessing an illegal memory address. At that point, the OS takes control again and decides what to do next.
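The real mechanism lives in the kernel and needs hardware support, but a rough user-space analogy (my own, not from the book) is a periodic signal handler taking control away from a busy loop, the way a timer interrupt returns control to the OS:

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

/* User-space analogy of a timer interrupt: the signal handler plays the
 * role of the OS periodically regaining control from a busy "user
 * program".  This is an illustration, not how the kernel actually does it. */
static volatile sig_atomic_t ticks = 0;

static void on_timer(int sig) {
    (void)sig;
    ticks++;                 /* a real scheduler could switch processes here */
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_timer;
    sigaction(SIGALRM, &sa, NULL);

    struct itimerval tv;
    tv.it_value.tv_sec = 0;  tv.it_value.tv_usec = 10000;  /* first tick */
    tv.it_interval = tv.it_value;                          /* then every 10 ms */
    setitimer(ITIMER_REAL, &tv, NULL);

    while (ticks < 100) { }  /* the "user program" hogging the CPU */
    printf("control was taken back %d times\n", (int)ticks);
    return 0;
}
```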
I/O requests take a long time to complete compared to CPU operations. The moments when a program issues an I/O request and when that request completes are two critical decision points for scheduler algorithms. For example, switching to another process when the current process issues an I/O request is a very common technique to better utilize the CPU.
Switching between processes - a context switch - is a costly operation, and if it happens too frequently, valuable CPU cycles are wasted. During a context switch, the state of the current program is saved and the next program's state is loaded. This state includes the registers, the information describing the program's memory address space (e.g. the page-table base register), and maybe some other things.
A process needs to be started from some other process, and for this UNIX provides the fork and exec model. fork creates an (almost) exact copy of the parent process, and exec replaces the memory contents of the calling process with a new program. This abstraction seems strange at first, but it allows for some optimizations (the shell can, for example, set up redirection between fork and exec), can be used to build UNIX pipes, and is very simple to implement.
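The canonical shape of that model, roughly as the book presents it (here spawning ls -l as the new program):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();              /* create a copy of this process */
    if (pid < 0) {
        perror("fork");
        exit(1);
    } else if (pid == 0) {
        /* child: replace our memory image with a new program */
        char *argv[] = { "ls", "-l", NULL };
        execvp(argv[0], argv);
        perror("execvp");            /* reached only if exec fails */
        exit(1);
    } else {
        /* parent: wait for the child to finish */
        waitpid(pid, NULL, 0);
        printf("child %d finished\n", (int)pid);
    }
    return 0;
}
```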
Scheduling algorithms could be much better if they knew what the processes would do in the future, but such a thing is not possible. They can only rely on the past behavior of the processes. Therefore, some interesting algorithms have been developed, such as the multi-level feedback queue (MLFQ), ticket-based (lottery) scheduling, the Completely Fair Scheduler, and multi-queue scheduling.
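Lottery scheduling is the easiest one to sketch in a few lines. A minimal version, with made-up job names and ticket counts:

```c
#include <stdio.h>
#include <stdlib.h>

/* Minimal sketch of ticket-based (lottery) scheduling: each job holds
 * some tickets, and the winner of a random draw runs next. */
struct job { const char *name; int tickets; };

static const char *pick_next(const struct job *jobs, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) total += jobs[i].tickets;

    int winner = rand() % total;          /* draw a winning ticket */
    for (int i = 0; i < n; i++) {
        if (winner < jobs[i].tickets) return jobs[i].name;
        winner -= jobs[i].tickets;
    }
    return NULL;                          /* not reached */
}

int main(void) {
    struct job jobs[] = { { "A", 75 }, { "B", 20 }, { "C", 5 } };
    for (int slice = 0; slice < 10; slice++)
        printf("time slice %d -> job %s\n", slice, pick_next(jobs, 3));
    return 0;
}
```

Over many slices, job A gets roughly 75% of the CPU, which is the whole point: tickets are a proportional share.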
One last thing to remember about schedulers: the CPU has caches, and schedulers must take cache affinity into consideration to better utilize the CPU.
Other than that, there is also a multiprocessor scheduling chapter, but I don't remember much of it.
Memory was especially hard for me; I hope I can remember the details. It is different in nature compared to the CPU. Both are considered resources, but the CPU is a resource in the time dimension whereas Memory is in the space dimension. Apparently, sharing a time-based resource is much simpler than sharing a space-based one, at least in my opinion.
First of all, the OS must manage the address space and share it among processes, so it must keep track of which processes use which parts of the memory.
The physical address space of the Memory is only available to the OS; user processes never know about physical addresses. They can only use virtual address spaces.
Seemingly, each user process has a vast address space; in reality, the allocated physical space is smaller. When a program accesses memory, the access goes through several steps: assuming no page fault, the virtual address reaches the MMU, the MMU gets the Page Table Entry, and so on. Man, I am really lost in these parts; maybe I should check out https://medium.com/@karthix25/5-the-art-of-translation-how-virtual-addresses-become-physical-c935ed3069e4 or go through the relevant sections again.
Long story short, there is hardware support (the MMU and the TLB) that makes the OS's life much easier, but in the end, it is about caching the important mapping information along with the access controls. On a TLB miss, the hardware traps into the OS (at least on software-managed TLBs), the OS creates the required mapping or retrieves it from the page table, installs it, and hands control back to the hardware. The whole operation is a little complicated because memory management requires speed, flexibility, and security at the same time; there are just too many things to take care of.
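To re-derive the basic translation for myself, here is a toy linear page table in C. The sizes (4 KiB pages, 16 entries) are made up; real systems use multi-level tables with a TLB in front of this lookup:

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   4096u
#define OFFSET_BITS 12u
#define NUM_PAGES   16u

struct pte { int valid; uint32_t pfn; };           /* page table entry */

static int translate(const struct pte *pt, uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> OFFSET_BITS;        /* virtual page number */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);

    if (vpn >= NUM_PAGES || !pt[vpn].valid)
        return -1;                                 /* fault: the OS takes over */

    *paddr = (pt[vpn].pfn << OFFSET_BITS) | offset;
    return 0;
}

int main(void) {
    /* two valid mappings: VPN 0 -> frame 7, VPN 1 -> frame 3 */
    struct pte pt[NUM_PAGES] = { [0] = { 1, 7 }, [1] = { 1, 3 } };

    uint32_t paddr;
    if (translate(pt, 0x00001234, &paddr) == 0)    /* VPN 1, offset 0x234 */
        printf("virtual 0x1234 -> physical 0x%x\n", (unsigned)paddr);
    return 0;
}
```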
There is also the issue of fragmentation. I guess the OS can't do much about internal fragmentation, but it does deal with external fragmentation.
The OS itself is a mere program, and it needs to make sure it always has the memory it requires for itself. Also, when memory is full or running low, it can use the disk to extend the Memory - swap.
Everything up to this point is considered the first pillar of operating systems: virtualization. Some general ideas the book teaches are:
- Start with the simplest solution first,
- Think about the edge cases,
- Try to optimize the common case,
- If there are two good ideas, their combination is often better than either of them alone.
These are recurring themes within the book, and they are also prevalent in real life. Also, an OS is just a program; it can't do everything on its own, and at crucial steps it relies on hardware support - the two dance together.
The hardest takeaway is that some problems don't have a definitive solution. You can try to optimize for the general case or for some specific cases, but in the end you must learn to live with "good enough".
The operating system provides some mechanisms for handling concurrent programs. Threads are the native concurrency mechanism the OS provides, though there are other concurrency mechanisms too. For example, Node.js uses an event loop on a single-threaded process to provide concurrency. I'm not sure how Go works internally, but you can use goroutines ( "lightweight threads" ) as a concurrency mechanism.
Threads share the same address space and CPU cycles as the main program; therefore, they are open to various bugs, such as race conditions, deadlocks, livelocks, double frees, use after free, etc.
To help these programs run correctly, the OS together with the hardware again provides some mechanisms. Atomic operations (such as atomic increment and compare-and-swap), locks, and semaphores are some of those mechanisms.
The simplest lock can be built upon "compare and set" + "spin wait", though it is not efficient in most cases. On the other hand, a full-blown semaphore can be overkill in others; if the critical section is short, spinning is sometimes more performant.
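A minimal spin lock in that spirit, using C11 atomics (a sketch, not a production lock - real locks back off or put the thread to sleep instead of burning the CPU):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int lock_held = 0;   /* 0 = free, 1 = held */
static long counter = 0;

static void spin_lock(void) {
    int expected = 0;
    /* try to flip 0 -> 1 atomically; spin until we succeed */
    while (!atomic_compare_exchange_weak(&lock_held, &expected, 1))
        expected = 0;              /* CAS wrote the current value here; reset */
}

static void spin_unlock(void) { atomic_store(&lock_held, 0); }

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        spin_lock();
        counter++;                 /* critical section */
        spin_unlock();
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 200000)\n", counter);
    return 0;
}
```

Without the lock, the two threads race on counter and the result usually ends up less than 200000.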
A semaphore can be used for ordering operations, for limiting the number of concurrent operations on a resource, or as a basic lock - though in that last case, it is better to just use a lock.
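The ordering use case is the one that clicked for me: the parent waits until the child thread has done its work, regardless of which one the scheduler runs first. A small POSIX sketch (unnamed semaphores, so Linux):

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

/* Semaphore used purely for ordering: initialized to 0, so the parent's
 * sem_wait blocks until the child's sem_post. */
static sem_t done;

static void *child(void *arg) {
    (void)arg;
    printf("child: doing the work\n");
    sem_post(&done);                 /* signal: work is finished */
    return NULL;
}

int main(void) {
    sem_init(&done, 0, 0);           /* initial value 0 */

    pthread_t t;
    pthread_create(&t, NULL, child, NULL);

    sem_wait(&done);                 /* block until the child posts */
    printf("parent: child is done, continuing\n");

    pthread_join(t, NULL);
    sem_destroy(&done);
    return 0;
}
```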
These mechanisms are important when building a concurrent data structure, that is, a data structure that can be used safely by multiple threads, even ones running on different CPUs.
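The simplest recipe, roughly like the book's examples, is to wrap every operation in one big lock. A sketch of a concurrent linked-list insert (coarse-grained; finer-grained designs lock per node):

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct node { int key; struct node *next; };

struct list {
    struct node    *head;
    pthread_mutex_t lock;            /* one big lock for the whole list */
};

static void list_init(struct list *l) {
    l->head = NULL;
    pthread_mutex_init(&l->lock, NULL);
}

static int list_insert(struct list *l, int key) {
    struct node *n = malloc(sizeof(*n));
    if (!n) return -1;
    n->key = key;

    pthread_mutex_lock(&l->lock);    /* critical section: touch the list */
    n->next = l->head;
    l->head = n;
    pthread_mutex_unlock(&l->lock);
    return 0;
}

int main(void) {
    struct list l;
    list_init(&l);
    for (int i = 0; i < 5; i++) list_insert(&l, i);
    for (struct node *n = l.head; n; n = n->next) printf("%d ", n->key);
    printf("\n");
    return 0;
}
```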
The select and poll mechanisms provide another form of concurrency for I/O-bound operations: a single thread can wait on many file descriptors at once, which is the basis of event-based concurrency.
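A tiny event-loop sketch with select(), waiting on stdin with a timeout (just to see the shape of the API):

```c
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

int main(void) {
    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(STDIN_FILENO, &readfds);          /* watch stdin only */

        struct timeval tv = { 5, 0 };            /* 5-second timeout */
        int ready = select(STDIN_FILENO + 1, &readfds, NULL, NULL, &tv);

        if (ready < 0) { perror("select"); return 1; }
        if (ready == 0) { printf("timeout, no input yet\n"); continue; }

        char buf[256];
        ssize_t n = read(STDIN_FILENO, buf, sizeof(buf) - 1);
        if (n <= 0) break;                       /* EOF or error */
        buf[n] = '\0';
        printf("got: %s", buf);
    }
    return 0;
}
```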
To be honest, most of this concurrency piece was a reassurance of my experience at work. Spin locks and semaphores were new concepts to me, as well as how those concurrent data structures work internally. Other than that, coordination amongst processes and event-based concurrency mechanisms were already well understood by me.
The OS provides a persistent storage abstraction via the file system. It exposes some of the underlying hardware's behavior through calls such as fsync. File systems built without considering the underlying hardware perform badly - as expected - which is why a lot of file systems are designed specifically to better utilize HDDs or SSDs. It is also an important lesson in choosing the right abstractions, which the file system got right.
The superblock is the first block on the disk and contains information about the filesystem. An inode contains information about a stored file: its metadata and where the file's data blocks are. Inodes are fixed-size and small, so if a file is too large to be described by the pointers inside its inode, the inode points to indirect blocks, i.e. blocks that contain further pointers. It is a tree rather than a linked list: instead of each block referencing the next one, the inode references a block of, say, 5 pointers. In this one-level example the file can be as large as 5 x (the amount a single pointer can address). If the same thing happens at the next level, the file can be as large as 5 x 5 x (that amount), and so on. As a result, a filesystem chooses a maximum depth for this tree, which fixes the maximum file size possible in that filesystem. This is faster than a linked-list-like structure, and since most files are small, it optimizes for the common case.
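A back-of-the-envelope calculation of the maximum file size for an ext2-like layout (the numbers - 4 KiB blocks, 12 direct pointers, one single-indirect and one double-indirect pointer, 4-byte pointers - are my own example):

```c
#include <stdio.h>

int main(void) {
    const unsigned long long block   = 4096;        /* bytes per block    */
    const unsigned long long direct  = 12;          /* direct pointers    */
    const unsigned long long per_blk = block / 4;   /* pointers per block */

    unsigned long long blocks = direct              /* direct             */
                              + per_blk             /* single indirect    */
                              + per_blk * per_blk;  /* double indirect    */

    printf("max file size ~= %llu bytes (~%.2f GiB)\n",
           blocks * block, (double)(blocks * block) / (1ull << 30));
    return 0;
}
```

With these numbers a tree depth of 2 caps files at roughly 4 GiB, while small files stay cheap: just the 12 direct pointers.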
Fragmentation issues are also prevalent in persistent storage, so file systems also shuffle things around for better space utilization.
Data can become corrupted on the disk, either during writes or while it just sits there. Therefore, redundancy and regular integrity checks (e.g. checksums) are used. Also, the disk can fail altogether, which is why RAID systems exist. Each RAID configuration and redundancy mechanism comes with its own trade-offs.
For maximum disk utilization, file systems try to perform sequential writes and sequential reads as much as possible.
Some file systems use a log-based structure for all of their data (log-structured file systems) to better utilize the disk - again with trade-offs. But almost all file systems use an append-only circular log (a journal) to make sure a group of writes is atomic. E.g. they first write to the write-ahead log in the form: TRX START, D1, D2, ..., TRX END; only after that commit record is safely on disk do they perform the actual in-place data writes. I can't remember exactly how, but this mechanism also makes sure that the inode and the related metadata are updated atomically together.
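A very rough sketch of that write-ahead pattern for a single update (file names and record format made up; real journals work on blocks and batch many transactions):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int append(int fd, const char *rec) {
    return write(fd, rec, strlen(rec)) < 0 ? -1 : 0;
}

int main(void) {
    int log = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    int dat = open("data.db",     O_WRONLY | O_CREAT, 0644);
    if (log < 0 || dat < 0) { perror("open"); return 1; }

    /* 1. write the intended update into the log ... */
    append(log, "TXB 1\n");
    append(log, "D1: new-data\n");
    append(log, "TXE 1\n");
    fsync(log);                  /* commit point: the log is durable */

    /* 2. ... only then update the real data in place (checkpoint) */
    write(dat, "new-data\n", 9);
    fsync(dat);

    /* 3. the log space for tx 1 can now be reused (circular log) */
    close(log);
    close(dat);
    return 0;
}
```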
One important thing I noticed is that an abstraction/implementation can choose whether or not to be aware of the layer below it. Systems that are aware of the underlying protocol or hardware bring a lot of benefits, especially in performance, but are harder to implement.
I was shocked to learn that SSDs need to erase a block before writing to it. SSDs also need to distribute writes as evenly as possible amongst blocks, because blocks wear out after a certain number of write/erase operations. Interestingly, an SSD also moves data from one block to another to reclaim partially used blocks and to distribute the wear.
Also, HDDs (and maybe SSDs) contain both volatile and non-volatile memory; the non-volatile part protects data integrity against power loss.
Multiple computers can work together to create a distributed system. Distributed systems have their advantages, one of which is fault tolerance. But coordinating them and exposing their services to other computers is more complicated than accessing local resources.
One example is distributed file systems. A distributed file system is less performant than a local one, but it is easier to back up and manage, and a lot more fault tolerant than using local hard drives.
When a system crosses the single-computer boundary, a lot of the guarantees provided by the OS and hardware are lost, because you don't know what is going to happen to the bits and bytes that leave your computer, or to the ones coming in. Therefore, carefully designed protocols are very important, as are security and checking the consistency of the data.
The most problematic issue is cache consistency, because whatever you do, you are always dealing with potentially stale data. You can't know whether the local data you fetched earlier is still up to date until your next communication with the remote side. Also, some requests can be lost or replayed, due to network conditions or malicious actors. Such protocols need to define idempotent methods for critical operations, and can use idempotency keys for those methods.
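A sketch of what an idempotency key buys you on the server side: remember which request IDs have already been applied, so a retried (or replayed) request doesn't run the operation twice. The table, the IDs, and the withdraw example are all made up:

```c
#include <stdio.h>

#define MAX_SEEN 128

static long seen[MAX_SEEN];          /* request IDs already applied */
static int  nseen = 0;
static long balance = 100;

static long handle_withdraw(long request_id, long amount) {
    for (int i = 0; i < nseen; i++)
        if (seen[i] == request_id)
            return balance;          /* duplicate: don't apply again */

    balance -= amount;               /* apply the operation exactly once */
    if (nseen < MAX_SEEN) seen[nseen++] = request_id;
    return balance;
}

int main(void) {
    printf("first try: balance = %ld\n", handle_withdraw(42, 30));  /* 70 */
    printf("retry    : balance = %ld\n", handle_withdraw(42, 30));  /* still 70 */
    return 0;
}
```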