After hunting for quite some time for a strange application behavior, I finally found the reason.
The Java application was behaving strangely in 4 out of 10 runs. It did not process all data available and assumed that the data input already ended. The application features several producer-consumer patterns, where one thread offers preprocessed data to the next one, passing it into a buffer where the next thread reads it from.
The consumer or producer fall into a wait state in case no data is available or the buffer is full. In case of a state change, the active threads notifies all waiting threads about the new data or the fact that all data is consumed.
On 2-core and 8-core machines, the application was running fine but when we moved it to 24-cores, it suddenly started to act in an unpredictable manner.
After a lot of debugging I found out that threads wake up without having been notified by their partner thread. In this case the consumer was woken up despite the fact that data was unavailable aka the producer has not delivered and therefore not notified anyone. But the consumer was awake…
The debugging nightmare finally revealed a rare behavior of any POSIX based application. This is the quote from the JDK6 doc:
A thread can also wake up without being notified, interrupted, or timing out, a so-called spurious wakeup. While this will rarely occur in practice, applications must guard against it by testing for the condition that should have caused the thread to be awakened, and continuing to wait if the condition is not satisfied. In other words, waits should always occur in loops.
So never trust your notification chains. Threads might wake up even though nobody directly notified it. Additionally you should never exclude the possibility that the move from a small box to a big box does not influence the application behavior. More cores mean more trouble.