Great Unsolved Problems In Computer Science
You know what I think is actually the hardest distributed systems problem?
Because Microsoft has packaged together Word, Excel, Powerpoint, and Outlook into a single enterprise productivity suite, we tend to think of these applications as being approximately in parity with each other in terms of programming complexity. It turns out that this isn't really true.
Word processors are probably the easiest - people are constantly writing them and hundreds, perhaps thousands, exist. Excel is quite a bit harder - there aren't as many spreadsheet programs, and Excel can be fairly said to dominate its product niche out of sheer excellence. And Powerpoint's just silly.
But Outlook (or Exchange, the backend service) is another beast entirely. Computer people like to look down on "enterprise" software, thinking it largely straightforward manipulation of blandly-structured data according to "business logic rules" with none of the elegance of "real" computer science problems like compilers, networking and... distributed systems.
Well, I think calendaring software is probably the most difficult distributed systems software known to man. There are a lot of distributed systems problems out there, and many of them have fairly well-worked-out solutions that yield correct results. Academia and industry have gotten pretty good at farming out large computing tasks to multiple processors, recovering from stochastic node failures, peer-to-peer distributed storage and retrieval, and solving that problem where all the soldiers have to fire at the same time. But we've been working on calendaring for maybe 20 years now, and Outlook/Exchange still sucks.
Correction: not we, MICROSOFT. It turns out that calendaring is so hard that academic computer science departments can't even address it (not that they acknowledge it to be a "real" problem). The only company willing and able to make a go at it is the greatest software company in the world. The only other company willing and able to make a go at it is the next greatest software company in the world, Google.
That's two companies.
Bear in mind right now that as much as you think Exchange sucks, it's still the only calendaring system able to handle large-scale (tens of thousands) sets of calendars with any degree of correctness. You can point to open-source or other alternatives all you want, but today nothing still comes close to the scale of the largest Exchange installations.
Why is calendaring hard? Well, it becomes obvious once you just list out the product requirements and realize that almost every one of them requires some sort of ridiculous amount of distributed systems hackery around timeliness of updates, heterogenous network, consistency, graceful exception handling, and node availability semantics.
To start off, human calendars suck. They're irregular, have multiple holiday overlays, some involving the implementation of complicated formulae. This first part already cuts out the lower 90% of regular idiot programmers. Even in the case of great programmers - every set of great programmers I've ever worked with, when they've implemented something that has to handle time-based scheduling of anything, has neglected either DST transitions or leap years in their initial implementation.
Second, every node is not just unpredictable, but involves unpredictability at the human level. We are not talking about a network of thousands of homogenous machines like some nursery school distributed rendering cluster - no, we are talking about thousands of machines, each one run by a human, most of whom are in direct control of their own calendar and probably illiterate in computer skills and pretty much randomly clicking here and there willy-nilly.
Third: ridiculous levels of exception handling. How to handle recurring meetings? How do you handle exceptions to recurring meetings? How do you handle modifications to exceptions? What if some nodes didn't get the original exception but got the later modification? All the same consistency problems in distributed systems apply. They're just all made worse because humans are involved, so you can't even reliably apply consistency-checking rules - you need human-inconsistency-checking-checking. And as someone who's worked on internationalization, let me tell you - humans are very inconsistent.
It's telling that only the corporate pinnacle (Microsoft) of software development has sought to undertake this monumental task. It's notable that the only other company that has a decent chance at matching the accomplishment is its new rival, Google.
Everyone else, go back to your labs and run your Dynamo cluster, thinking you are a distributed systems badass because you were able to use a gossip system to ensure "eventual" consistency. That is nothing compared to Exchange. NOTHING.