The Google File System   no comments

It's been a little while since my last technically meaty update. One system that I've been looking at a fair bit recently is Hadoop, which is an open-source implementation of Google's MapReduce. For me, the interesting part is the large-scale distributed filesystem on which it runs called HDFS. It's well known that HDFS is based heavily on its Google equivalent.

In 2003 Google published a paper on their Google File System (GFS) at SOSP, the Symposium on Operating Systems Principles. This is the same venue at which Amazon published their Dynamo work, albeit four years earlier. One of the lecturers in my group tells me that SOSP is a venue where "interesting" is rated highly as a criterion for acceptance, over other more staid conferences. So what, if anything, was interesting about GFS? Read on for some details...
Read the rest of this entry »

Written by Henry on October 1st, 2008

Tagged with , , ,

Adding a system call to the Linux Kernel   no comments

Posted at 8:57 pm in Operating systems

So, despite ostensibly being a 'systems' guy, I haven't spent too much time in my life getting hands on with the Linux kernel. I've written tiny toy operating-system-like projects before, but haven't done much open-heart surgery on real life code.

I think this should change, so in my very limited spare time I'm doing some very simple projects to teach me more about the Linux kernel code layout so that if it should so happen in a job interview that someone asks me if I'm comfortable hacking at the kernel level I can answer `yes' with far more conviction (I would probably answer positively anyhow, because I'm arrogant enough to think that it's not beyond me, but there's a lot of metaphorical difference between having the book on your shelf and having read it :) ).
Read the rest of this entry »

Written by Henry on September 28th, 2008

Tagged with , ,

StackOverflow   no comments

Posted at 7:26 pm in Note, link

What do you know, StackOverflow has its useful moments.

The question in, er, question relates to an issue I had been having myself with the behaviour of lexical scoping in Python, but had been able to work around sufficiently easily to not devote time to finding a proper solution. This in itself is an unfortunate truth of work; there isn't enough time to investigate every interesting problem thoroughly. StackOverflow might just turn out to help with that niche: I presume that over time it's going to evolve into a Not-So-Frequently-Asked-But-Still-Interesting-Questions repository. I've put some effort myself into answering some questions about basic computer science and distributed systems. The reputation farming I can do without, the badges are quite a neat feature - they encourage participation passively. The problem is that there really is a blind-leading-the-blind feel to some question answers, especially in the areas of data structures and the ever popular and yet heavily abused 'big-O' notation. If people speak authoritatively enough then they will be taken as authoritative, garner reputation points which serve as a feedback loop, amplifying their authority on future occasions. Unfortunately, there seem to be more people willing to upvote correct sounding answers than those who know whether an answer is actually correct. Time will tell if that is a general truth or just an early adoption issue. I suspect, alas, that it might be the latter.

Written by Henry on September 28th, 2008

Tagged with ,

Pain and suffering and ffmpeg   1 comment

Posted at 11:55 am in Note

All I wanted to do was to transcode real media files from MIT OCW to iPod compatible mp4 on Linux. It shouldn't have been this difficult. As of now, I still don't have a satisfactory solution.

Problem 1: mplayer / mencoder read and play the stream correctly, but the mp4 files they produce when transcoding don't work on the iPod. In particular, they're not readable by any utilities I have such as Easytag and Amarok.

Problem 2: ffmpeg can't read rv30 files, so won't encode them. One possibility is to use mencoder to encode to something mutually acceptable, but that involves transcoding twice which is far below ideal.

As of now, I've found some of the videos I want to watch on Google Video, but that's not a guaranteed solution. Uploading them to Google Video just to download them again is also a non-starter.

Anyone have any ideas?

Written by Henry on September 19th, 2008

Tagged with , ,

The Real GoogleOS?   no comments

Posted at 2:43 pm in Conjecture, Note

So Google have announced Chrome, their entrant into the web browser circus. They are presenting Chrome as a complete reboot of the browser, which of course it isn't. It is interesting, however, to speculate wildly about Google's intentions. We shouldn't, of course, discount their stated intent of 'adding value for users'; a lot of features of Chrome are focused upon improving today's browsing experience. See, for example, pop-ups that are modal only in their own tab, which is something I have been wishing for for ages. However, looking at the big picture, even from a viewpoint far removed, is good for a laugh sometimes.

Read on for some rampant speculation.

Read the rest of this entry »

Written by Henry on September 2nd, 2008

Consistency and availability in Amazon’s Dynamo   4 comments

Posted at 12:50 pm in Distributed systems

There is a continuing and welcome trend amongst large, modern technology companies like Google, Yahoo and Amazon to publish details of their systems at academic conferences. One of the problems that researchers at universities have is making a convincing case that their ideas would work well in the real world, since no matter how many assumptions are made there really is no substitute for field testing, and the infrastructure, workloads and data just aren't available to do that effectively. However, companies have infrastructure to burn and a genuine use-case with genuine users. Using their experience and data to discover what does and doesn't work, and what is and is not really important provides an invaluable feedback loop to researchers.

More than that, large systems are built from a set of independent ideas. Most academic papers leave the construction of a practical real-world system as an exercise for the reader. Synthesising a set of disparate techniques often throws up lots of gotchas which no papers directly address. Companies with businesses to run have a much greater incentive to build a robust system that works.

At 2007's Symposium on Operating Systems Principles (SOSP), Amazon presented a paper about one of their real-world systems: "Dynamo: Amazon's Highly Available Key-value Store". It wound up winning, I think, the audience prize for best paper. In this post, I was planning to describe Dynamo 'inside-out', based on a reading group mandated close reading of the paper. However, trying to lucidly explain a dense 12 page paper leads to many more than 12 pages of explanation. So instead, I want to focus on one particular aspect of Dynamo which I think is the most interesting.

Read the rest of this entry »

Written by Henry on August 26th, 2008

Good survey of the important papers in distributed consensus   no comments

Posted at 4:11 pm in Distributed systems, link

This blog post is an excellent survey of the last thirty years of research into consensus problems.

Written by Henry on August 25th, 2008

Dijkstra award 2008 goes to ‘Sparse Partitions’   no comments

Posted at 7:59 pm in Distributed systems

Not sure why I didn't post this when I first found out - the Djikstra 2008 prize for an outstanding and influential paper in distributed systems has been awarded to David Peleg and Baruch Awerbuch for their 1990 FOCS paper Sparse Partitions (on-line copy available from MIT ad-hoc algorithms course here).

The citation does a much better job than I could of explaining the paper's relevance. The general idea is that the authors show that there are efficient ways of constructing clustered representations of graphs that remain within a small factor of the original in terms of route lengths. Further, the authors show that this can be done in a distributed manner. This has lots (and lots) of potential applications - the typical example is for a compact routing scheme, where nodes can store smaller routing tables between clusters rather than between nodes.

I've got the paper cued up on my list of walkthroughs to write, so expect a better explanation than the one above soon.

Written by Henry on August 17th, 2008

A Brief Tour of FLP Impossibility   1 comment

Posted at 11:30 am in Distributed systems, Paper Walkthrough

One of the most important results in distributed systems theory was published in April 1985 by Fischer, Lynch and Patterson. Their short paper 'Impossibility of Distributed Consensus with One Faulty Process', which eventually won the Dijkstra award given to the most influential papers in distributed computing, definitively placed an upper bound on what it is possible to achieve with distributed processes in an asynchronous environment.

This particular result, known as the 'FLP result', settled a dispute that had been ongoing in distributed systems for the previous five to ten years. The problem of consensus - that is, getting a distributed network of processors to agree on a common value - was known to be solvable in a synchronous setting, where processes could proceed in simultaneous steps. In particular, the synchronous solution was resilient to faults, where processors crash and take no further part in the computation. Informally, synchronous models allow failures to be detected by waiting one entire step length for a reply from a processor, and presuming that it has crashed if no reply is received.

This kind of failure detection is impossible in an asynchronous setting, where there are no bounds on the amount of time a processor might take to complete its work and then respond with a message. Therefore it's not possible to say whether a processor has crashed or is simply taking a long time to respond. The FLP result shows that in an asynchronous setting, where only one processor might crash, there is no distributed algorithm that solves the consensus problem.

In this post, I want to give a tour of the proof itself because, although it is quite subtle, it is short and profound. I'll start by introducing consensus, and then after describing some notation and assumptions I'll work through the main two lemmas in the paper.

If you want to follow along at home (highly, highly recommended) a copy of the paper is available here.

Read the rest of this entry »

Written by Henry on August 13th, 2008

Tagged with , ,

Binomial Heaps   no comments

Posted at 5:35 pm in Data structures

(The python code for this article is available here)

The standard binary heaps that everyone learns as part of a first algorithms course are very cool. They give guaranteed n sorting cost, can be stored compactly in memory since they're full binary trees and allow for very fast implementations of priority queues. However, there are a couple of operations that we might be interested in that binary trees don't give us, at least not cheaply.

In particular, we might be concerned with merging two heaps together. Say, for example, that we're shutting down a processor with its own priority queue for schedulable processes, and we want to merge the workload in with another processor. One way to do this would be to insert every item in the first processor's queue into the receiving processor's queue. However, this takes O(n) time - at least, depending on how the queues are implemented. We'd like to be able to do that more efficiently.

Step forward binomial heaps. Binomial heaps are rather different to binary heaps - although they share a few details in common. Binomial heaps allow us to merge two heaps together in O(\log n) time, in return for some extra cost when finding the minimum. However, extracting the minimum still takes O(\log n), which is the same as a binary heap.

Read the rest of this entry »

Written by Henry on July 11th, 2008