As is becoming traditional for ramblings on train journeys, here's another technology fail. First, a bit of theory.
Recently I've been doing a lot of programming in Java. Java, unlike C, is a garbage collected language. What this means is that you don't have to keep track of memory usage - instead, a garbage collector runs periodically and frees unused objects for you. Garbage collected environments tend to have less predictable performance than non-garbage-collected ones (as you have little if any control over when and how the garbage collector runs), but on the flip side memory leaks and heap corruption tend to be impossible.
Java's particular garbage collection system is a mark-sweep system. This happens in two steps. To begin with, the garbage collector starts with a set of root objects and follows all the references from those, marking everything it finds. At the end of this step it has marked every object that can be reached from somewhere in the code. It then sweeps through all the objects, and anything that's not marked it throws away. The downside to a mark/sweep system is that there's no guarantee about when an object might be freed (so you can't do anything interesting when the object is destroyed), but on the plus side it deals with circular references and there's no way for an object to become lost.
So, on to the technology fail. This particular fail was a memory leak that occurred after a long soak test. Actually, a better term would be a memory squirrel - it's impossible for memory to be leaked and forever lost, but what can happen is that objects are held on for far too long. In this case it was a linked list, which very quickly showed up in a heap dump as holding far too many objects.
What was happening was an interaction between two pieces of code. The first was, about 300 times a second, adding an item to the list. The second was removing items, but had been throttled to remove at most 100 items per second. This was actually rather tricky to find, as the leak only showed up after the soak test had been running for a few days. If you ran a test for a day, then stopped the load, then firstly the list didn't get full enough to become noticeable (the rest of the code was chewing through objects at a crazily fast rate, with the garbage collector running several times a minute), and secondly the second part of the code kept running so the memory used was constantly going down!