Lecture 16 Scribe Notes
5/28/13
Robustness, Parallelism, and NFS
Scribed by: William Schoellkopf and
Bradley Shoemaker
Let’s look at the following function call:
R = read(fd, buf, size);
Note how this is actually a syscall, but it LOOKS like a function call!
There are a lot of ways to do abstraction.
Abstraction via _________________
Let’s fill in the blank!
We are going to implement abstraction with RPC – Remote Procedure Call technology.
Here, we execute a function like read to get data out of a file, but it is NOT like a syscall!
Example: r = chmod(file, 0644);
RPC idea: implement NOT by a trap to the kernel, but ship off the request to someone else on some other server.
So it’s a change from ordinary function calls (or even syscalls)
BUT the caller and callee are on different machines so they DON’T share memory!
THEREFORE, no call by reference!
Example: r = chmod(“/etc/passwd”, 0644);
“/etc/passwd” is a char const* to a file name
We can’t have any call by reference
Therefore, we must only have call by value. However, there are potential efficiency issues because
large values will be slow: we have to ship them over the network to the other server.
This contrasts with normal function calls and syscalls, where caller and callee share memory, so large values can be handed over cheaply (for example, by pointer).
PRO: hard modularity is even better than with syscalls!
PRO/CON: Caller and callee might use different architectures
Example: Caller uses x86-64, Callee uses ARM
Problem: May use different data representations.
Example: 32-bit integers vs. 64-bit integers
We need to deal with converting data representations when making remote procedure calls.
There can also be problems with machines using big-endian vs. little-endian byte order.
SOLUTION: Have a network representation!
MARSHALLING
Marshalling is the process of figuring out how to represent internal data structures and put them over a wire.
It's called marshalling because the caller lines up the data like a field marshal lines up an army. The callee decodes the message into its internal format.
Marshalling is also known as pickling or serializing
The example diagram (from lecture) shows how the Caller and Callee each have their own different internal
representation for the data, but because of marshalling they are able to serialize their data to
each other over the wire, and then unmarshal it into their own preferred internal representation.
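For example, here is a minimal sketch of marshalling by hand in C, assuming a made-up wire layout (a 32-bit mode in network byte order, then a 32-bit name length, then the file name bytes). Real RPC systems generate this kind of code for you:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>   /* htonl/ntohl: host <-> network (big-endian) byte order */

    /* Marshal a chmod-like request (mode + file name) into a byte buffer.
       Returns the number of bytes written. */
    static size_t marshal_chmod(unsigned char *buf, uint32_t mode, const char *name)
    {
        uint32_t nmode = htonl(mode);                  /* convert to network byte order */
        uint32_t nlen  = htonl((uint32_t)strlen(name));
        size_t off = 0;
        memcpy(buf + off, &nmode, sizeof nmode); off += sizeof nmode;
        memcpy(buf + off, &nlen,  sizeof nlen);  off += sizeof nlen;
        memcpy(buf + off, name, strlen(name));   off += strlen(name);
        return off;
    }

    /* Unmarshal the same request on the callee's side, whatever its native byte order. */
    static void unmarshal_chmod(const unsigned char *buf, uint32_t *mode, char *name, size_t namesz)
    {
        uint32_t nmode, nlen;
        memcpy(&nmode, buf, sizeof nmode);
        memcpy(&nlen,  buf + sizeof nmode, sizeof nlen);
        *mode = ntohl(nmode);
        uint32_t len = ntohl(nlen);
        if (len >= namesz)
            len = (uint32_t)namesz - 1;                /* truncate defensively */
        memcpy(name, buf + sizeof nmode + sizeof nlen, len);
        name[len] = '\0';
    }

    int main(void)
    {
        unsigned char wire[256];
        size_t n = marshal_chmod(wire, 0644, "/etc/passwd");

        uint32_t mode;
        char name[128];
        unmarshal_chmod(wire, &mode, name, sizeof name);
        printf("sent %zu bytes; decoded chmod(\"%s\", %o)\n", n, name, (unsigned)mode);
        return 0;
    }

Because both sides agree on network (big-endian) byte order on the wire, it doesn't matter whether the caller or callee is big-endian or little-endian internally.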
Glue code is extremely annoying to write. It is generally just a bunch of stubs that pretend to be ordinary high-level calls but actually marshal the arguments and ship the request over the wire.
Glue code is now often generated automatically with “rpcgen”
RPC failure modes are different from the failure modes of ordinary function calls
PRO: Callee can’t trash caller’s data
CON: Messages can get lost (TCP: resend, UDP: app deals with it)
CON: Messages can get corrupted
CON: Network might be down (so ALL your messages are lost)
- Cosmic rays are a major source of memory errors: they can flip a bit, so it's risky to run your laptop outside
CON: Network might just be VERY slow
CON: Server might be down (or slow)
Glue Code should do the following:
If message corrupted: resend
If no response: retry, and keep retrying until it succeeds
This method gives AT-LEAST-ONCE RPC semantics (sketched in code below)
This works for idempotent requests, meaning it doesn't hurt to perform the operation three times even though
the caller only asked once.
Example: chmod is fine (setting the same permissions twice is harmless)
But this is BAD for transferring money between savings and checking accounts.
Alternative:
If no response: fail, return error
This method gives AT-MOST-ONCE RPC semantics
Example: works better for transactions
Ideal: we WANT exactly-once RPC, but that's too hard to implement in general.
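To make these semantics concrete, here is a minimal sketch of an at-least-once client, with a fake transport standing in for the network (fake_send_request is made up; a real client would use a real socket and timeouts):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Fake "network": the server always executes the request, but about half
       of the replies get lost on the way back to the client. */
    static int fake_send_request(const char *req)
    {
        printf("server: executed \"%s\"\n", req);
        if (rand() % 2 == 0) {
            printf("client: no reply (lost in the network)\n");
            return -1;                 /* client times out */
        }
        printf("client: got reply\n");
        return 0;
    }

    /* At-least-once RPC: keep retrying until a reply comes back.
       Safe only for idempotent requests (e.g. chmod), since the server
       may end up executing the request more than once. */
    static void rpc_at_least_once(const char *req)
    {
        /* A real client would wait for a timeout between attempts. */
        while (fake_send_request(req) != 0)
            continue;
    }

    int main(void)
    {
        srand((unsigned)time(NULL));
        rpc_at_least_once("chmod(\"/etc/passwd\", 0644)");
        return 0;
    }

An at-most-once client would instead give up and return an error to the caller after the first timeout, rather than retrying.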
The X protocol, for example, is designed to do common things efficiently.
Example: the X client calls the function XSetPixel(X, Y, R, G, B);
So it can set pixels on your display remotely (for example, to draw triangles) very quickly and efficiently.
HTTP Protocol
Connect to the HTTP server
Send "GET / HTTP/1.0\r\n\r\n"
Receive "HTTP/1.1 200 OK\r\n
Content-Length: 10243
..."
So the HTTP protocol works in an RPC-like way as well.
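Here is a small sketch of that exchange in C, treating one HTTP GET as one remote call over a TCP socket (example.com and the fixed request string are just placeholders):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>

    int main(void)
    {
        struct addrinfo hints = {0}, *res;
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo("example.com", "80", &hints, &res) != 0)
            return 1;

        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0)
            return 1;

        /* The "remote procedure call": one request message... */
        const char *req = "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n";
        write(fd, req, strlen(req));

        /* ...and one response message (status line, headers, body). */
        char buf[4096];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            fwrite(buf, 1, (size_t)n, stdout);

        close(fd);
        freeaddrinfo(res);
        return 0;
    }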
Performance Problems with RPC:
Several small requests add up.
Example: X: 4 separate set-pixel requests
So the time it takes to send them over the network is even longer than doing the set-pixel operations themselves!
SOLUTIONS:
1. Could try to coalesce requests into a single request. Example: “Fill Rectangle”
Problems: not always possible to do this.
2. Asynchronous Calls
Split the call into two components
- Request
- Notification
In HTTP this is called pipelining.
Note how requests identify themselves with a tag, responses specify the request tag, and
responses can come back "out of order" (see the sketch below).
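Here is a minimal sketch of the tagging idea; the pending-request table and its layout are made up for illustration:

    #include <stdio.h>
    #include <string.h>

    #define MAX_PENDING 16

    /* One outstanding (sent but not yet answered) request. */
    struct pending {
        int  tag;            /* identifies the request on the wire */
        char what[32];       /* which call this was, for bookkeeping */
        int  in_use;
    };

    static struct pending table[MAX_PENDING];

    static void send_request(int tag, const char *what)
    {
        struct pending *p = &table[tag % MAX_PENDING];
        p->tag = tag;
        p->in_use = 1;
        strncpy(p->what, what, sizeof p->what - 1);
        p->what[sizeof p->what - 1] = '\0';
        printf("sent     #%d %s\n", tag, what);
    }

    static void got_response(int tag)
    {
        struct pending *p = &table[tag % MAX_PENDING];
        if (p->in_use && p->tag == tag) {
            printf("answered #%d (%s)\n", tag, p->what);
            p->in_use = 0;
        }
    }

    int main(void)
    {
        /* Pipeline several requests without waiting for their answers... */
        send_request(1, "LOOKUP");
        send_request(2, "READ");
        send_request(3, "GETATTR");

        /* ...and handle the responses even if they arrive out of order. */
        got_response(2);
        got_response(3);
        got_response(1);
        return 0;
    }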
PROBLEMS:
Dependent requests:
Create a file xyz;
Change permission xyz;
We MUST create the file first; otherwise the permission change gets a "file not found" error. And the create itself could fail, in which case the dependent request shouldn't be sent at all.
SOLUTIONS:
I. Asynchronous/Pipeline
II. Change API to send bigger data chunks
III. Cache recent answers to requests (collaborate with the server; see the cache sketch after this list)
For example: ls, open, readdir; cache the answers in the kernel
PROBLEM: stale cache problem
IV. SOLUTION: Prefetch Answers
Can guess what link the client will click on
Can hotwire the cache, but only for “read-only” actions
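Here is a minimal sketch of solution III, a tiny client-side cache with a timeout; the one-entry table, the TTL, and the fake_server function are all made up for illustration:

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    /* One cached answer to a recent request. */
    struct cache_entry {
        char   name[64];
        int    answer;        /* e.g. a file handle number */
        time_t fetched;       /* when we last asked the server */
    };

    #define TTL 3             /* seconds before we consider the entry stale */

    static struct cache_entry cache;   /* one-entry cache, just for illustration */

    static int lookup(const char *name, int (*ask_server)(const char *))
    {
        time_t now = time(NULL);
        if (strcmp(cache.name, name) == 0 && now - cache.fetched < TTL)
            return cache.answer;               /* cache hit: fast, but possibly stale */

        int ans = ask_server(name);            /* cache miss: a real network round trip */
        strncpy(cache.name, name, sizeof cache.name - 1);
        cache.name[sizeof cache.name - 1] = '\0';
        cache.answer = ans;
        cache.fetched = now;
        return ans;
    }

    static int fake_server(const char *name)
    {
        printf("asking server about \"%s\"\n", name);
        return 42;
    }

    int main(void)
    {
        printf("%d\n", lookup("passwd", fake_server));  /* miss: goes to the server */
        printf("%d\n", lookup("passwd", fake_server));  /* hit: no network traffic */
        return 0;
    }

The stale cache problem is visible here: for up to TTL seconds the client may happily return an answer the server has since changed.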
Linux Virtual File System (VFS)
Source: http://www.linux.it/~rubini/docs/vfs/vfs.html
In class we focused on the green struct file boxes, and the blue struct inode boxes.
NFS – Network File System
The NFS protocol's operations are very similar to the Linux virtual file system's, since NFS was designed for Unix.
MKDIR(dirfh, name, attr) -> returns fh + attrs
dirfh = directory file handle
name = file name, without any '/'
attr = attributes
LOOKUP(dirfh, name) -> returns fh + attrs. This is for file name resolution
CREATE(dirfh, name, attr) -> fh + attr
REMOVE(dirfh, name) -> status
READ
WRITE
So anything you can think of as a file system call gets mapped to one or more protocol requests (see the sketch below).
NFS just uses its own names.
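For example, resolving a path name turns into one LOOKUP per component. A minimal sketch (nfs_lookup here is a stand-in that just prints what would go over the wire; a real request carries a full file handle):

    #include <stdio.h>
    #include <string.h>

    /* A toy file handle; what really goes in one is discussed below. */
    struct fh { int id; };

    /* Stand-in for sending LOOKUP(dirfh, name) to the server and getting
       back the file handle of the named directory entry. */
    static struct fh nfs_lookup(struct fh dirfh, const char *name)
    {
        printf("LOOKUP(fh=%d, \"%s\")\n", dirfh.id, name);
        return (struct fh){ dirfh.id + 1 };   /* pretend answer */
    }

    /* Resolve a path the way an NFS client would: one LOOKUP per component. */
    static struct fh resolve(struct fh rootfh, const char *path)
    {
        char copy[256];
        strncpy(copy, path, sizeof copy - 1);
        copy[sizeof copy - 1] = '\0';

        struct fh cur = rootfh;
        for (char *name = strtok(copy, "/"); name; name = strtok(NULL, "/"))
            cur = nfs_lookup(cur, name);      /* note: names never contain '/' */
        return cur;
    }

    int main(void)
    {
        struct fh root = { 0 };               /* root handle, obtained at mount time */
        resolve(root, "/bin/sh");
        return 0;
    }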
We want to worry about reliability.
We want NFS to keep working even if the server crashes
DESIGN GOAL: clients should survive server crashes nicely
So there should be a "stateless server" with no important state (nothing the clients depend on lives only in the server's volatile memory)
But this can be slow, because the server must commit each write to stable storage before replying!
How to implement?
We could cheat and tell the client a write completed when it hasn't actually reached disk yet,
or we could add NVRAM (non-volatile RAM) so that pending writes survive a crash.
File handles like dirfh can NOT be file descriptor numbers, because whenever a server reboots its
file descriptor numbers go away. So we need a better way to UNIQUELY identify a file.
Therefore, we invent the file handle, which for now we will say consists of the following:
file handle = inode # + device #
This pair SHOULD uniquely identify a file. So if it all works this should be enough.
PROBLEM: There could still be a system crash.
EXAMPLE:
Client 1: REMOVE(3,12) where 3 is the inode #, 12 is the device #, the pair (3,12) is the file handle
Pretend that this call was sent, but then the response was lost due to a server crash
Now, client 2 makes the following call
Client 2: CREATE(…) -> (3,12)
Now client 1 thinks that the remove didn't happen, since it never got a response.
Client 1 then resends the remove operation, which it thinks is ok because remove is an idempotent operation.
Client 1: REMOVE(3,12)
PROBLEM!!! Now client 1 is removing the wrong file! It is removing the file that client 2 just created,
instead of its original file, which was already removed!
Therefore, we must add a third piece of identifying information, a serial number.
File handle = inode # and device # and serial #
Now we repeat the function calls, but with the additional serial number.
Client 1: REMOVE( (3,12), 70) where 70 is the serial number
Client 2: CREATE( (3,12), 71) so we're creating a brand new file, with serial number 71, even though
it has the same inode number and the same device number.
Client 1: REMOVE( (3,12), 70) so when client 1 performs the remove again, it doesn't affect client 2's
brand new file.
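So a file handle could be sketched as a struct like the following (the field names are mine, not NFS's actual wire format); two handles refer to the same file only if all three fields match:

    #include <stdio.h>

    /* Enough information to UNIQUELY identify a file across server reboots
       and across inode reuse. */
    struct file_handle {
        unsigned dev;       /* device (filesystem) number, e.g. 12 */
        unsigned ino;       /* inode number within that filesystem, e.g. 3 */
        unsigned serial;    /* bumped every time the inode is reused */
    };

    static int same_file(struct file_handle a, struct file_handle b)
    {
        return a.dev == b.dev && a.ino == b.ino && a.serial == b.serial;
    }

    int main(void)
    {
        struct file_handle old_fh = { 12, 3, 70 };   /* client 1's file (already removed) */
        struct file_handle new_fh = { 12, 3, 71 };   /* client 2's brand new file */

        /* Client 1's resent REMOVE carries serial 70, so the server can tell it
           refers to the old file and not to client 2's new one. */
        printf("resent REMOVE hits the new file? %s\n",
               same_file(old_fh, new_fh) ? "yes (BAD)" : "no (good)");
        return 0;
    }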
How do we implement this though?
If we JUST rebooted
LOOKUP(dirfh, name)
LOOKUP( (3, 12, 71), “bin”)
It is easy for us to find filesystem 12
However, how do we find inode number 3? There is no system call that lets a user process open a file by inode number.
SOLUTION: Put the NFS server code in the Linux kernel so that it can look up files by inode number directly!
Synchronization Issues
Process 1: write(fd, buf, bufsize)
Process 2: read(fd, buf, bufsize)
This is NOT guaranteed to work with NFS, because in NFS each process may be running on its own machine with its own cache.
So NFS does NOT have read-after-write synchronization.
But it DOES have close-to-open synchronization
Process 1: write, then close
Process 2: (after process 1's close) open, then read
This works because close and open are heavyweight operations, so NFS enforces that these two
are synchronized: dirty data is written back to the server on close, and open revalidates with the server.
But also, close can fail by saying that you ran out of disk space. This happens because
only when you close does NFS ACTUALLY push the data to the server and check the disk space.
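Here is a small sketch of the writer's side in C; the path is just a placeholder for a file on an NFS mount. The key point is that close() must be checked, because that is where NFS may first report errors such as running out of disk space:

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/mnt/nfs/example.txt";   /* placeholder NFS path */
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char *msg = "hello over NFS\n";
        if (write(fd, msg, strlen(msg)) < 0) { perror("write"); return 1; }

        /* With NFS, the data may only be pushed to the server here, so an
           out-of-disk-space error can show up at close() rather than at write().
           Ignoring this return value can silently lose data. */
        if (close(fd) < 0) { perror("close"); return 1; }

        /* A reader on another machine that open()s and read()s the file after
           this close is guaranteed to see the data (close-to-open semantics). */
        return 0;
    }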