The lockless queue used by 0mq has a notification mechanism which IMO is pretty nice. The writer notifies the reader only if the reader is sleeping or is going to sleep. The technique itself is quite simple, but mixed with the relatively complicated flush/rollback logic it looks less obvious than it actually is. I will explain it with a very simple example.

Writer:
new_wptr = wptr + 1;
if (cas(c, wptr, new_wptr) == 0) {
    // c is NULL, which implies the reader is sleeping or is going to sleep
    wptr = new_wptr;
    // the reader is sleeping or going to sleep, so it is safe to update c
    c.set(new_wptr);
    return WAKEUP;
}
// c is new_wptr now
wptr = new_wptr;
return DONOT_WAKEUP;

Reader:
if (cas(c, rptr, 0) == rptr)
    return SLEEP;
++rptr;
return DONOT_SLEEP;

## 0mq code reading (1): mailbox

mailbox.cpp/h:
mailbox is the inter-thread communication mechanism. In 0mq, whenever one thread wants to modify the state of another, it sends a message asking the other thread to do it. You don’t need any synchronization among threads since they don’t share anything.

mailbox is implemented on top of a socketpair. The data type communicated through the socketpair is defined as command_t (command.h/cpp). Other than the message type and message parameters, one important component of every message is the “destination”, which is an interface type defined as object_t (object.cpp/h). object_t has a function process_command(command_t& cmd); for each message type, a different virtual function is called to handle it.

In current 0mq, every iothread object and every socket owns a mailbox.

One important command for a 0mq socket is “Activate”. A thread calling a blocking read on an empty pipe blocks inside process_command (blocking on a read of the socketpair); later on, a thread writing to the pipe sends an “Activate” command to wake it up. The lockless pipe of 0mq adopts an interesting design: the writer knows exactly whether the reader is blocked or is going to be blocked, so it notifies only when necessary. This greatly improves performance.

## As hard as 10 lines of code…

bool ceq(const char *s, const char *t, std::size_t n) {
    int i = 0, j = 0, size = static_cast<int>(n), k;
    while (i < size && j < size) {
        for (k = 0; k < size && s[(i+k)%size] == t[(j+k)%size]; ++k)
            ;
        if (k == size) return true;
        if (s[(i+k)%size] > t[(j+k)%size])
            i += k + 1;
        else
            j += k + 1;
    }
    return false;
}
Given two strings $s$ and $t$, this function returns true if and only if $s$ and $t$ are equivalent under circular shift. $s$ and $t$ are equivalent under circular shift if there is an index $k$ such that for any integer $x\ge 0$, $s[x \mod n]=t[(k+x) \mod n]$.

It is easy to see that the worst case running time of this function is linear: The asymptotic running time is bounded by the total number of iterations executed by the inner “for” loop. In each iteration, the value of $k$ will be increased and ultimately, the increment will be accumulated into either variable $i$ or $j$. When either $i$ or $j$ reaches $n$, the function will return.

Why do I post this small piece of program? It is a perfect example of how a program can be short yet still hard to understand. It is also one of my favorite examples for convincing people that “loop invariant proof” is a good way to understand a program, if not the only way. You may load the program into your favorite debugger and check closely how each single line is executed; you may feed it with all kinds of test cases and verify every output. But you simply cannot get enough insight to convince yourself why the program is correct. This is what I believe in: programs should be proven correct; testing should only be used to catch typos.

How to prove this program is correct? Let us introduce several notations first.
1. Given a string $s$ of length $n>0$, define $s^{(i)} = s[i, i+1,\ldots, n-1, 0,\ldots, i-1]$.
2. Define $F(s)= \{s^{(i)} : 0\le i < n\}$. That is, the set of strings generated by shifting $s$.
3. For a string $s$ and a set of strings $T$, we say $s$ beats $T$ if there is a string $t\in T$ such that $s>t$.
4. For two sets of strings $S$ and $T$, we say $S$ beats $T$ if for any string $s\in S$, there exists a string $t\in T$ such that $s>t$.

Claim: $s$ and $t$ are equivalent if and only if neither $F(s)$ beats $F(t)$ nor $F(t)$ beats $F(s)$. (Hint: if $s$ and $t$ are equivalent, then $F(t)= F(s)$, so the minimum string in $F(s)$ cannot beat $F(t)$, and vice versa.)

Our program maintains the following invariant: at the beginning of each iteration of the “while” loop, for any $x < i$, $s^{(x)}$ beats $F(t)$, and for any $y < j$, $t^{(y)}$ beats $F(s)$. We will skip the proof of the invariant since it is trivial. With this invariant, when the while loop stops we have either:
1. k==size: in this case, the two strings are equivalent (return true).
2. i>=size or j>=size: in this case, either $F(s)$ beats $F(t)$ or $F(t)$ beats $F(s)$, so $s$ and $t$ are not equivalent by the claim (return false).

P.S. The function returns false for empty strings ($n=0$) and one may argue that true should be returned instead.

## MCS Spin Lock

MCS spin lock is named after its inventors, John Mellor-Crummey and Michael Scott. See “Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors”, ACM Trans. Comput. Syst. 9(1): 21-65 (1991). A quotation from the Edsger W. Dijkstra Prize in Distributed Computing, 2006:

“The key lesson this work has taught us is that one cannot underestimate the importance of reducing memory traffic in scalable synchronization algorithms. The “local spinning” technique used by the MCS algorithm has heavily influenced virtually all practical scalable synchronization algorithms since.”

struct lock {
    bool locked;
    struct lock *next;
};
lock *current = 0;

void acquire_lock(lock *l) {
    l->next = 0;
    lock *n = atomic_swap(&current, l);
    if (n) {
        // Someone holds the lock: enqueue ourselves behind the previous
        // tail and spin on our own "locked" flag only.
        l->locked = true;
        n->next = l;
        while (l->locked);
    }
}

void release_lock(lock *l) {
    if (!l->next) {
        // No known successor: try to mark the lock free.
        if (atomic_cas(&current, l, 0) == l)
            return;
        // A successor is in the middle of enqueueing; wait for its link.
        while (!l->next);
    }
    l->next->locked = false;
}

## A simple read/write spin lock

class RWSpinLock {
public:
    RWSpinLock();
    void readLock();
    void readUnlock();
    void writeLock();
    void writeUnlock();

private:
    static const uint32_t WRITING = 0x80000000u;
    volatile uint32_t state;

    RWSpinLock(const RWSpinLock&);
    RWSpinLock& operator= (const RWSpinLock&);
};

RWSpinLock::RWSpinLock() : state(0) {}

void RWSpinLock::readLock() {
    // If a writer holds the lock, the new count exceeds WRITING:
    // undo the increment and retry.
    while (atomic_inc_32_nv(&state) > WRITING)
        atomic_dec_32(&state);
}

void RWSpinLock::readUnlock() {
    atomic_dec_32(&state);
}

void RWSpinLock::writeLock() {
    // atomic_cas_32 returns the old value; succeed only when it was 0.
    while (atomic_cas_32(&state, 0, WRITING) != 0);
}

void RWSpinLock::writeUnlock() {
    // Cannot simply store 0: a would-be reader may have transiently
    // incremented the count, so subtract the writer bit instead.
    atomic_add_32(&state, -WRITING);
}

## Simple linear palindrome finding algorithm

A simple algorithm designed by Manacher in 1975. It finds the longest palindrome, scanning the string in a left-to-right fashion.
int palindrom_odd(const char *st, int n, int *p) {
    int a = 0, m = 1;
    p[0] = 1;
    for (int i = 1; i < n; ++i) {
        int w = 1;
        if (a + p[a] > i)
            w = std::min(p[a + a - i], a - i + p[a]);
        while (i >= w && i + w < n && st[i + w] == st[i - w])
            ++w;
        if (w + i > a + p[a])
            a = i;
        p[i] = w;
        m = std::max(w, m);
    }
    return m + m - 1;
}

The algorithm is quite straightforward. If you take away the two lines guarded by the test $a + p[a] > i$, you will immediately see that it becomes the naive palindrome finding algorithm. What are those two lines for?
Assume you are computing the radius of index $i$ and there is some index $a < i$ such that $p[a] + a > i$.
If the palindrome centered at index $2a-i$ is covered by the range $[a-p[a], a+p[a]]$, then by the symmetry property, the radius of index $i$ cannot be larger than $p[2a-i]$. So it is enough for us to compute $p[i]$ starting from $\min(p[a + a -i], a-i+p[a])$.
Why is the algorithm linear? Each comparison either extends the coverage of $a+p[a]$ (on success) or finishes the computation of $p[i]$ immediately.

Is this algorithm fast? Not really; a naive algorithm will most probably beat it in practice. The algorithm is good for interviews since it is simple and natural. It is much faster than a naive algorithm in some extreme cases (e.g. a lot of 0’s followed by a single 1; in that case the running time of a naive algorithm is $\Theta(n^2)$).