Lecture 16 Scribe Notes
5/28/13
Robustness, Parallelism, and NFS
Scribed by: William Schoellkopf and
Bradley Shoemaker
Let’s look at the following function call:
R = read(fd, buf, size);
Note how this is actually a syscall, but it LOOKS like a function call!
There are a lot of ways to do abstraction.
Abstraction via _________________
Let’s fill in the blank!
We are going to implement abstraction with RPC – Remote Procedure Call technology.
Here, we execute a function like read to get data out of a file, but it is NOT like a syscall!
Example: r = chmod(file, 0644);
RPC idea: implement NOT by a trap to the kernel, but ship off the request to someone else on some other server.
So it’s a change from ordinary function calls (or even syscalls)
BUT the caller and callee are on different machines so they DON’T share memory!
THEREFORE, no call by reference!
Example: r = chmod(“/etc/passwd”, 0644);
“/etc/passwd” is a char const* to a file name
We can’t have any call by reference
Therefore, we must only have call by value. However, there are potential efficiency issues because
large values will be slow: we have to ship them over the network to the other server.
This contrasts with normal function calls and syscalls, where caller and callee share memory, so large values can be handed over cheaply (for example, by pointer).
PRO: hard modularity is even better than with syscalls!
PRO/CON: Caller and callee might use different architectures
Example: Caller uses x86-64, Callee uses ARM
Problem: May use different data representations.
Example: 32-bit integers vs. 64-bit integers
We need to deal with converting data representations when making remote procedure calls.
There can also be problems with machines using big-endian vs. little-endian byte order.
SOLUTION: Have a network representation!
MARSHALLING
Marshalling is the process of figuring out how to represent internal data structures and put them over a wire.
It's called marshalling because the caller lines up the data like a field marshal lines up an army. The callee decodes the message into its internal format.
Marshalling is also known as pickling or serializing
The example diagram (from lecture) shows how the Caller and Callee each have their own different internal
representation for the data, but because of marshalling they are able to serialize their data to
each other over the wire, and then unmarshal it into their own preferred internal representation.
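For example, here is a minimal sketch of marshalling by hand in C, assuming a made-up wire layout (a 32-bit mode in network byte order, then a 32-bit name length, then the file name bytes). Real RPC systems generate this kind of code for you:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>   /* htonl/ntohl: host <-> network (big-endian) byte order */

    /* Marshal a chmod-like request (mode + file name) into a byte buffer.
       Returns the number of bytes written. */
    static size_t marshal_chmod(unsigned char *buf, uint32_t mode, const char *name)
    {
        uint32_t nmode = htonl(mode);                  /* convert to network byte order */
        uint32_t nlen  = htonl((uint32_t)strlen(name));
        size_t off = 0;
        memcpy(buf + off, &nmode, sizeof nmode); off += sizeof nmode;
        memcpy(buf + off, &nlen,  sizeof nlen);  off += sizeof nlen;
        memcpy(buf + off, name, strlen(name));   off += strlen(name);
        return off;
    }

    /* Unmarshal the same request on the callee's side, whatever its native byte order. */
    static void unmarshal_chmod(const unsigned char *buf, uint32_t *mode, char *name, size_t namesz)
    {
        uint32_t nmode, nlen;
        memcpy(&nmode, buf, sizeof nmode);
        memcpy(&nlen,  buf + sizeof nmode, sizeof nlen);
        *mode = ntohl(nmode);
        uint32_t len = ntohl(nlen);
        if (len >= namesz)
            len = (uint32_t)namesz - 1;                /* truncate defensively */
        memcpy(name, buf + sizeof nmode + sizeof nlen, len);
        name[len] = '\0';
    }

    int main(void)
    {
        unsigned char wire[256];
        size_t n = marshal_chmod(wire, 0644, "/etc/passwd");

        uint32_t mode;
        char name[128];
        unmarshal_chmod(wire, &mode, name, sizeof name);
        printf("sent %zu bytes; decoded chmod(\"%s\", %o)\n", n, name, (unsigned)mode);
        return 0;
    }

Because both sides agree on network (big-endian) byte order on the wire, it doesn't matter whether the caller or callee is big-endian or little-endian internally.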
Glue code is extremely annoying to write. It is generally just a bunch of stubs that pretend to be ordinary high-level calls but actually marshal the arguments and ship the request over the wire.
Glue code is now often generated automatically with “rpcgen”
RPC failure modes are different from the failure modes of ordinary function calls
PRO: Callee can’t trash caller’s data
CON: Messages can get lost (TCP: resend, UDP: app deals with it)
CON: Messages can get corrupted
CON: Network might be down (so ALL your messages are lost)
- Cosmic rays are a major source of memory errors: they can flip a bit, so it's risky to run your laptop outside
CON: Network might just be VERY slow
CON: Server might be down (or slow)
Glue Code should do the following:
If message corrupted: resend
If no response: retry, and keep retrying until it succeeds
This method gives AT-LEAST-ONCE RPC semantics (sketched in code below)
This works for idempotent requests, meaning it doesn't hurt to perform the operation three times even though
the caller only asked once.
Example: chmod is fine (setting the same permissions twice is harmless)
But this is BAD for transferring money between savings and checking accounts.
Alternative:
If no response: fail, return error
This method gives AT-MOST-ONCE RPC semantics
Example: works better for transactions
Ideal: we WANT exactly-once RPC, but that's too hard to implement in general.
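To make these semantics concrete, here is a minimal sketch of an at-least-once client, with a fake transport standing in for the network (fake_send_request is made up; a real client would use a real socket and timeouts):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Fake "network": the server always executes the request, but about half
       of the replies get lost on the way back to the client. */
    static int fake_send_request(const char *req)
    {
        printf("server: executed \"%s\"\n", req);
        if (rand() % 2 == 0) {
            printf("client: no reply (lost in the network)\n");
            return -1;                 /* client times out */
        }
        printf("client: got reply\n");
        return 0;
    }

    /* At-least-once RPC: keep retrying until a reply comes back.
       Safe only for idempotent requests (e.g. chmod), since the server
       may end up executing the request more than once. */
    static void rpc_at_least_once(const char *req)
    {
        /* A real client would wait for a timeout between attempts. */
        while (fake_send_request(req) != 0)
            continue;
    }

    int main(void)
    {
        srand((unsigned)time(NULL));
        rpc_at_least_once("chmod(\"/etc/passwd\", 0644)");
        return 0;
    }

An at-most-once client would instead give up and return an error to the caller after the first timeout, rather than retrying.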
The X protocol, for example, is designed to do common things efficiently.
Example: the X client calls the function XSetPixel(X, Y, R, G, B);
So it can set pixels on your display remotely (for example, to draw triangles) very quickly and efficiently.
HTTP Protocol
Connect to the HTTP server
Send "GET / HTTP/1.0\r\n\r\n"
Receive "HTTP/1.1 200 OK\r\n
Content-Length: 10243
..."
So the HTTP protocol works in an RPC-like way as well.
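Here is a small sketch of that exchange in C, treating one HTTP GET as one remote call over a TCP socket (example.com and the fixed request string are just placeholders):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>

    int main(void)
    {
        struct addrinfo hints = {0}, *res;
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo("example.com", "80", &hints, &res) != 0)
            return 1;

        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0)
            return 1;

        /* The "remote procedure call": one request message... */
        const char *req = "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n";
        write(fd, req, strlen(req));

        /* ...and one response message (status line, headers, body). */
        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            fwrite(buf, 1, (size_t)n, stdout);

        close(fd);
        freeaddrinfo(res);
        return 0;
    }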
Performance Problems with RPC:
Several small requests add up.
Example: X: 4 separate set-pixel requests
So the time it takes to send them over the network is even longer than doing the set-pixel operations themselves!
SOLUTIONS:
1. Could try to coalesce requests into a single request. Example: “Fill Rectangle”
Problems: not always possible to do this.
2. Asynchronous Calls
Split the call into two components
- Request
- Notification
In HTTP this is called pipelining.
Note how requests identify themselves with a tag, responses specify the request tag, and
responses can come back "out of order" (see the sketch below).
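Here is a minimal sketch of the tagging idea; the pending-request table and its layout are made up for illustration:

    #include <stdio.h>
    #include <string.h>

    #define MAX_PENDING 16

    /* One outstanding (sent but not yet answered) request. */
    struct pending {
        int  tag;            /* identifies the request on the wire */
        char what[32];       /* which call this was, for bookkeeping */
        int  in_use;
    };

    static struct pending table[MAX_PENDING];

    static void send_request(int tag, const char *what)
    {
        struct pending *p = &table[tag % MAX_PENDING];
        p->tag = tag;
        p->in_use = 1;
        strncpy(p->what, what, sizeof p->what - 1);
        p->what[sizeof p->what - 1] = '\0';
        printf("sent     #%d %s\n", tag, what);
    }

    static void got_response(int tag)
    {
        struct pending *p = &table[tag % MAX_PENDING];
        if (p->in_use && p->tag == tag) {
            printf("answered #%d (%s)\n", tag, p->what);
            p->in_use = 0;
        }
    }

    int main(void)
    {
        /* Pipeline several requests without waiting for their answers... */
        send_request(1, "LOOKUP");
        send_request(2, "READ");
        send_request(3, "GETATTR");

        /* ...and handle the responses even if they arrive out of order. */
        got_response(2);
        got_response(3);
        got_response(1);
        return 0;
    }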
PROBLEMS:
Dependent requests:
Create a file xyz;
Change permission xyz;
We MUST create the file first; otherwise the permission change gets a "file not found" error. And the create itself could fail, in which case the dependent request shouldn't be sent at all.
SOLUTIONS:
I. Asynchronous/Pipeline
II. Change API to send bigger data chunks
III. Cache recent answers to requests (collaborate with the server; see the cache sketch after this list)
For example: ls, open, readdir; cache the answers in the kernel
PROBLEM: stale cache problem
IV. SOLUTION: Prefetch Answers
Can guess what link the client will click on
Can hotwire the cache, but only for “read-only” actions
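Here is a minimal sketch of solution III, a tiny client-side cache with a timeout; the one-entry table, the TTL, and the fake_server function are all made up for illustration:

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    /* One cached answer to a recent request. */
    struct cache_entry {
        char   name[64];
        int    answer;        /* e.g. a file handle number */
        time_t fetched;       /* when we last asked the server */
    };

    #define TTL 3             /* seconds before we consider the entry stale */

    static struct cache_entry cache;   /* one-entry cache, just for illustration */

    static int lookup(const char *name, int (*ask_server)(const char *))
    {
        time_t now = time(NULL);
        if (strcmp(cache.name, name) == 0 && now - cache.fetched < TTL)
            return cache.answer;               /* cache hit: fast, but possibly stale */

        int ans = ask_server(name);            /* cache miss: a real network round trip */
        strncpy(cache.name, name, sizeof cache.name - 1);
        cache.name[sizeof cache.name - 1] = '\0';
        cache.answer = ans;
        cache.fetched = now;
        return ans;
    }

    static int fake_server(const char *name)
    {
        printf("asking server about \"%s\"\n", name);
        return 42;
    }

    int main(void)
    {
        printf("%d\n", lookup("passwd", fake_server));  /* miss: goes to the server */
        printf("%d\n", lookup("passwd", fake_server));  /* hit: no network traffic */
        return 0;
    }

The stale cache problem is visible here: for up to TTL seconds the client may happily return an answer the server has since changed.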
Linux Virtual File System (VFS)
Source: http://www.linux.it/~rubini/docs/vfs/vfs.html
In class we focused on the green struct file boxes, and the blue struct inode boxes.
NFS – Network File System
The NFS protocol's operations are very similar to the Linux virtual file system's, since NFS was designed for Unix.
MKDIR(dirfh, name, attr) -> returns fh + attrs
dirfh = directory file handle
name = file name, without any '/'
attr = attributes
LOOKUP(dirfh, name) -> returns fh + attrs. This is for file name resolution
CREATE(dirfh, name, attr) -> fh + attr
REMOVE(dirfh, name) -> status
READ
WRITE
So anything you can think of as a file system call gets mapped to one or more protocol requests (see the sketch below).
NFS just uses its own names.
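For example, resolving a path name turns into one LOOKUP per component. A minimal sketch (nfs_lookup here is a stand-in that just prints what would go over the wire; a real request carries a full file handle):

    #include <stdio.h>
    #include <string.h>

    /* A toy file handle; what really goes in one is discussed below. */
    struct fh { int id; };

    /* Stand-in for sending LOOKUP(dirfh, name) to the server and getting
       back the file handle of the named directory entry. */
    static struct fh nfs_lookup(struct fh dirfh, const char *name)
    {
        printf("LOOKUP(fh=%d, \"%s\")\n", dirfh.id, name);
        return (struct fh){ dirfh.id + 1 };   /* pretend answer */
    }

    /* Resolve a path the way an NFS client would: one LOOKUP per component. */
    static struct fh resolve(struct fh rootfh, const char *path)
    {
        char copy[256];
        strncpy(copy, path, sizeof copy - 1);
        copy[sizeof copy - 1] = '\0';

        struct fh cur = rootfh;
        for (char *name = strtok(copy, "/"); name; name = strtok(NULL, "/"))
            cur = nfs_lookup(cur, name);      /* note: names never contain '/' */
        return cur;
    }

    int main(void)
    {
        struct fh root = { 0 };               /* root handle, obtained at mount time */
        resolve(root, "/bin/sh");
        return 0;
    }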
We want to worry about reliability.
We want NFS to keep working even if the server crashes
DESIGN GOAL: clients should survive server crashes nicely
So there should be a "stateless server" with no important state (nothing the clients depend on lives only in the server's volatile memory)
But this can be slow, because the server must commit each write to stable storage before replying!
How to implement?
We could cheat and tell the client a write completed when it hasn't actually reached disk yet,
or we could add NVRAM (non-volatile RAM) so that pending writes survive a crash.
File handles like dirfh can NOT be file descriptor numbers, because whenever a server reboots its
file descriptor numbers go away. So we need a better way to UNIQUELY identify a file.
Therefore, we invent the file handle, which for now we will say consists of the following:
file handle = inode # + device #
This pair SHOULD uniquely identify a file. So if it all works this should be enough.
PROBLEM: There could still be a system crash.
EXAMPLE:
Client 1: REMOVE(3,12) where 3 is the inode #, 12 is the device #, the pair (3,12) is the file handle
Pretend that this call was sent, but then the response was lost due to a server crash
Now, client 2 makes the following call
Client 2: CREATE(…) -> (3,12)
Now client 1 thinks that the remove didn't happen, since it never got a response.
Client 1 then resends the remove operation, which it thinks is ok because remove is an idempotent operation.
Client 1: REMOVE(3,12)
PROBLEM!!! Now client 1 is removing the wrong file! It is removing the file that client 2 just created,
instead of its original file, which was already removed!
Therefore, we must add a third piece of identifying information, a serial number.
File handle = inode # and device # and serial #
Now we repeat the function calls, but with the additional serial number.
Client 1: REMOVE( (3,12), 70) where 70 is the serial number
Client 2: CREATE( (3,12), 71) so we're creating a brand new file, with serial number 71, even though
it has the same inode number and the same device number.
Client 1: REMOVE( (3,12), 70) so when client 1 performs the remove again, it doesn't affect client 2's
brand new file.
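So a file handle could be sketched as a struct like the following (the field names are mine, not NFS's actual wire format); two handles refer to the same file only if all three fields match:

    #include <stdio.h>

    /* Enough information to UNIQUELY identify a file across server reboots
       and across inode reuse. */
    struct file_handle {
        unsigned dev;       /* device (filesystem) number, e.g. 12 */
        unsigned ino;       /* inode number within that filesystem, e.g. 3 */
        unsigned serial;    /* bumped every time the inode is reused */
    };

    static int same_file(struct file_handle a, struct file_handle b)
    {
        return a.dev == b.dev && a.ino == b.ino && a.serial == b.serial;
    }

    int main(void)
    {
        struct file_handle old_fh = { 12, 3, 70 };   /* client 1's file (already removed) */
        struct file_handle new_fh = { 12, 3, 71 };   /* client 2's brand new file */

        /* Client 1's resent REMOVE carries serial 70, so the server can tell it
           refers to the old file and not to client 2's new one. */
        printf("resent REMOVE hits the new file? %s\n",
               same_file(old_fh, new_fh) ? "yes (BAD)" : "no (good)");
        return 0;
    }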
How do we implement this though?
If we JUST rebooted
LOOKUP(dirfh, name)
LOOKUP( (3, 12, 71), “bin”)
It is easy for us to find filesystem 12
However, how do we find inode number 3? There is no system call that lets a user process open a file by inode number.
SOLUTION: Put the NFS server code in the Linux kernel so that it can look up files by inode number directly!
Synchronization Issues
Process 1: write(fd, buf, bufsize)
Process 2: read(fd, buf, bufsize)
This is NOT guaranteed to work with NFS, because in NFS each process may be running on its own machine with its own cache.
So NFS does NOT have read-after-write synchronization.
But it DOES have close-to-open synchronization
Process 1: write, then close
Process 2: (after process 1's close) open, then read
This works because close and open are heavyweight operations, so NFS enforces that these two
are synchronized: dirty data is written back to the server on close, and open revalidates with the server.
But also, close can fail by saying that you ran out of disk space. This happens because
only when you close does NFS ACTUALLY push the data to the server and check the disk space.
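Here is a small sketch of the writer's side in C; the path is just a placeholder for a file on an NFS mount. The key point is that close() must be checked, because that is where NFS may first report errors such as running out of disk space:

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/mnt/nfs/example.txt";   /* placeholder NFS path */
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char *msg = "hello over NFS\n";
        if (write(fd, msg, strlen(msg)) < 0) { perror("write"); return 1; }

        /* With NFS, the data may only be pushed to the server here, so an
           out-of-disk-space error can show up at close() rather than at write().
           Ignoring this return value can silently lose data. */
        if (close(fd) < 0) { perror("close"); return 1; }

        /* A reader on another machine that open()s and read()s the file after
           this close is guaranteed to see the data (close-to-open semantics). */
        return 0;
    }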