Amber Won
120 PB, 200,000 drives (each ~600 GB) - Chosen to have many smaller drives instead of fewer big drives because it gives better performance (more drives can work in parallel)
Striping means splitting a file into pieces stored on different disks, then using parallelism to access different parts of the file at the same time. This leads to higher throughput.
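As a small sketch (the disk count and stripe size here are made-up illustrative values), striping can be modeled as round-robin assignment of fixed-size pieces to disks:

```python
# Toy striping: split a file into fixed-size stripes and distribute them
# round-robin across disks. Stripe size and disk count are illustrative.
STRIPE_SIZE = 4   # bytes per stripe (tiny, for demonstration)
NUM_DISKS = 3

def stripe(data: bytes):
    """Return a list of per-disk buffers holding the file's stripes."""
    disks = [bytearray() for _ in range(NUM_DISKS)]
    for i in range(0, len(data), STRIPE_SIZE):
        disks[(i // STRIPE_SIZE) % NUM_DISKS] += data[i:i + STRIPE_SIZE]
    return disks

def unstripe(disks, length: int) -> bytes:
    """Reassemble the file; a real system would read the disks in parallel."""
    out = bytearray()
    offsets = [0] * NUM_DISKS
    d = 0
    while len(out) < length:
        out += disks[d][offsets[d]:offsets[d] + STRIPE_SIZE]
        offsets[d] += STRIPE_SIZE
        d = (d + 1) % NUM_DISKS
    return bytes(out)

data = b"The quick brown fox jumps over the lazy dog"
assert unstripe(stripe(data), len(data)) == data
```

Because consecutive stripes land on different disks, a large read can pull from all the disks at once instead of waiting on one.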
Metadata - contains info about file, but not the actual file contents (e.g. directories, timestamps).
Keeping all metadata on one central CPU can lead to bottlenecks, as different processes wait their turn to access it. GPFS removes this bottleneck by distributing the metadata across nodes.
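One simple way to distribute metadata (an illustrative assumption, not necessarily how GPFS actually does it) is to hash each path to one of several metadata servers, so no single node serializes every lookup:

```python
# Toy metadata distribution: hash a path to pick which server owns its
# metadata. Server count is a made-up example value.
import hashlib

SERVERS = 4

def metadata_server(path: str) -> int:
    """Deterministically map a path to one of SERVERS metadata servers."""
    digest = hashlib.sha256(path.encode()).digest()
    return digest[0] % SERVERS

owner = metadata_server("/home/amber/notes.txt")
assert 0 <= owner < SERVERS
```

Every client computes the same owner for the same path, so lookups spread across servers without any central coordinator.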
Network partition
The dotted line acts as a partition. In the image above, the right side of the partition represents a part of the network that is down. A client on the left side of the partition wants to continue operating on his or her files, which are maintained on the left side. With partition awareness, the user can still make progress under the assumption that all needed data is on the client's side of the partition.
A simplified example of an algorithm deciding whether a user can still access files when part of the network is down: if you are on the side with a majority of the nodes (the "big side"), you can access; if you are on the minority side, you cannot. Since there can be at most one majority, this guarantees the two sides never both make progress and diverge.
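A toy version of that majority rule, assuming each node knows the total cluster size:

```python
# Majority-quorum check: only the partition holding a strict majority of
# nodes may keep operating. Ties must lose, or both sides could write.
def may_proceed(nodes_reachable: int, cluster_size: int) -> bool:
    return nodes_reachable > cluster_size // 2

assert may_proceed(3, 5)        # big side of a 3/2 split keeps going
assert not may_proceed(2, 5)    # small side must stop
assert not may_proceed(2, 4)    # an even split: neither side proceeds
```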
Locking is needed to prevent different processes from writing over each other's data. The processes need to be able to see if a file is locked or unlocked in a timely manner. The locking information should be distributed so each process can check quickly.
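A minimal single-machine stand-in for such a lock table (in a real distributed file system, this table itself would have to be distributed, as the notes say):

```python
# Per-file lock table a process consults before writing. threading.Lock
# here stands in for whatever distributed locking a real system uses.
import threading

locks = {}  # filename -> threading.Lock

def acquire(name: str) -> bool:
    """Try to lock a file; return False immediately if someone holds it."""
    lock = locks.setdefault(name, threading.Lock())
    return lock.acquire(blocking=False)

def release(name: str):
    locks[name].release()

assert acquire("data.txt")       # first writer gets the lock
assert not acquire("data.txt")   # second writer sees it is taken
release("data.txt")
```

The non-blocking `acquire` matches the "timely manner" requirement: a process finds out instantly that a file is locked rather than waiting.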
Say you have many files in a directory. We still want to be able to find/access a file efficiently. Indexing is very important so that we do not have to look through every file to find the one we are looking for.
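A small sketch contrasting a linear directory scan with an index (the dict below stands in for whatever on-disk index structure the file system uses):

```python
# Directory with many entries: a linear scan touches every entry, while a
# hash index finds a name directly.
names = [f"file{i}" for i in range(1000)]

def linear_lookup(name):
    # O(n): compare against every entry until we hit the right one
    for i, n in enumerate(names):
        if n == name:
            return i
    return None

index = {n: i for i, n in enumerate(names)}  # O(1) average per lookup

assert linear_lookup("file999") == index["file999"] == 999
```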
Users still want to be able to access their files even during maintenance.
On a machine with 16 KiB RAM and a 700 GB disk
Eggert's file system contains a table at the start of the disk; the rest of the space is used to store data. The table keeps track of each file with a 12-byte entry that includes the file name, a pointer to where the file's contents are stored, and the size of the file.
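A toy model of this table-based layout, with `(name, start, size)` tuples standing in for the 12-byte entries, and first-fit as an assumed allocation policy (the notes don't specify one):

```python
# Contiguous-allocation sketch: a table of (name, start, size) entries,
# with each file's contents stored contiguously on a byte-array "disk".
DISK_SIZE = 64
disk = bytearray(DISK_SIZE)
table = []  # list of (name, start, size) entries

def create(name: str, data: bytes):
    """First-fit: find the first contiguous hole big enough for the file."""
    used = sorted((s, s + sz) for _, s, sz in table)
    start = 0
    for lo, hi in used:
        if lo - start >= len(data):
            break
        start = hi
    if DISK_SIZE - start < len(data):
        raise OSError("no contiguous hole large enough")
    disk[start:start + len(data)] = data
    table.append((name, start, len(data)))

def read(name: str) -> bytes:
    for n, s, sz in table:
        if n == name:
            return bytes(disk[s:s + sz])
    raise FileNotFoundError(name)

create("a", b"hello")
create("b", b"world!")
assert read("a") == b"hello" and read("b") == b"world!"
```

Reads are simple and fast (one table entry gives the whole extent), which previews the pros; the cons show up once files are deleted and holes appear.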
Pros:
Cons:
The biggest flaw is external fragmentation: free space ends up scattered across the disk in holes too small to use
This is the file system after numerous read, write, allocate, and remove operations. The black areas represent storage currently in use. After many such operations, the file system has many holes where data used to be. Now if a user wants to create a file bigger than any single hole, he or she can't, even though there is enough total free space for the file.
A potential solution is to compact the data once in a while so that the file system's data is contiguous and the holes disappear, making all of the available storage usable. However, this is expensive: it can mean copying much of the disk.
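A sketch of that compaction step, reusing the `(name, start, size)` table idea from above: slide each file left, in order of its start offset, so all free space merges into one hole at the end.

```python
# Compaction: move every file's contents left so used space is contiguous.
def compact(disk: bytearray, table):
    """Return a new table; afterward all free space is one hole at the end."""
    new_table = []
    next_free = 0
    for name, start, size in sorted(table, key=lambda e: e[1]):
        disk[next_free:next_free + size] = disk[start:start + size]
        new_table.append((name, next_free, size))
        next_free += size
    return new_table

disk = bytearray(b"..AA...BBB")          # two files with holes around them
table = [("A", 2, 2), ("B", 7, 3)]
table = compact(disk, table)
assert disk[:5] == b"AABBB"
assert table == [("A", 0, 2), ("B", 2, 3)]
```

The cost is visible even in the toy: every byte of every file may be copied, which on a large, mostly-full disk is a lot of I/O.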
FAT file system (1970s)
The FAT file system consists of a boot sector, a superblock, the next fields, and the data blocks. In the FAT file system, the available data storage is separated into fixed-size blocks. This removes the problem of external fragmentation. Each block has a next field, which is stored separately from the data blocks. The next field is used when a file needs more than one block to store its data: the file system chains the blocks together, following each next field's pointer to the next block until it reaches an EOF. If the next field is 0, that indicates an EOF. If it is 2^16-1, the block is free for use.
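A sketch of following such a chain, using the conventions from the notes (0 marks EOF, 2^16-1 marks a free block):

```python
# Walk a FAT chain to read a file whose blocks are scattered on disk.
FREE = 2**16 - 1   # next-field value meaning "this block is free"

def read_file(first_block: int, fat, blocks) -> bytes:
    """Concatenate a file's blocks by following next pointers until EOF."""
    data = b""
    b = first_block
    while True:
        data += blocks[b]
        if fat[b] == 0:      # 0 => end of file
            break
        b = fat[b]           # follow the next pointer
    return data

# Example: a file occupying blocks 3 -> 7 -> 5
fat = [FREE] * 10
fat[3], fat[7], fat[5] = 7, 5, 0
blocks = {3: b"he", 7: b"ll", 5: b"o!"}
assert read_file(3, fat, blocks) == b"hello!"
```

Note the trade-off this illustrates: any block can hold any part of any file (no external fragmentation), but reaching block N of a file requires N pointer hops.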
Pros:
Cons:
$ mv foo.c bar.c is easy. However, $ mv a/foo.c b/foo.c can give us potential problems because we are writing to two different directory blocks. If there is a crash somewhere in the middle (because someone pulled the plug, a power outage, etc.), there is the possibility that a/foo.c and b/foo.c both exist. In that case, the program was interrupted before a/foo.c could be deleted, and now we have two links to the same file when we wanted only one.
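The reason two links is the preferred failure mode (rather than zero links, which would lose the file) comes from ordering the two writes carefully; a sketch of that ordering, with directories modeled as dicts mapping names to file numbers:

```python
# Crash-safe ordering for a cross-directory mv: create the new directory
# entry FIRST, then remove the old one. A crash between the two steps
# leaves two links (recoverable), never zero links (data loss).
def mv(dirs, src_dir: str, dst_dir: str, name: str):
    inode = dirs[src_dir][name]
    dirs[dst_dir][name] = inode   # step 1: crash here => two links remain
    del dirs[src_dir][name]       # step 2: crash here => mv is complete

dirs = {"a": {"foo.c": 42}, "b": {}}
mv(dirs, "a", "b", "foo.c")
assert dirs == {"a": {}, "b": {"foo.c": 42}}
```

Doing the steps in the opposite order would risk a crash window where neither a/foo.c nor b/foo.c exists, which is strictly worse than the duplicate-link case described above.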