Affects Version/s: Lustre 2.7.0, Lustre 2.8.0
Fix Version/s: None
Environment: CentOS 6.5, current master or 2.7.0. Probably earlier versions as well.
Running obdfilter-survey causes GPF/null pointers
Current master on CentOS 6.5 crashes with a GPF or null-pointer dereference when obdfilter-survey is run on RAM disks, always in the same spot. If I've diagnosed this correctly, the RAM disk only matters because it makes the race condition more likely, and is not directly related.
The stack trace of the crash is always like this:
<4>RIP: 0010:[<ffffffff81054691>] [<ffffffff81054691>] __wake_up_common+0x31/0x90
<4>RSP: 0018:ffff88082f285d60 EFLAGS: 00010096
<4>RAX: 5a5a5a5a5a5a5a42 RBX: ffff880780bcb9b8 RCX: 0000000000000000
<4>RDX: 5a5a5a5a5a5a5a5a RSI: 0000000000000003 RDI: ffff880780bcb9b8
<4>RBP: ffff88082f285da0 R08: 0000000000000000 R09: 0000000000000000
<4>R10: 0000000000000001 R11: 000000000000000f R12: 0000000000000282
<4>R13: ffff880780bcb9c0 R14: 0000000000000000 R15: 0000000000000000
<4>FS: 0000000000000000(0000) GS:ffff880028380000(0000) knlGS:0000000000000000
<4>CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
<4>CR2: 00007f4a594a0060 CR3: 0000000784c0d000 CR4: 00000000000407e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process loop2 (pid: 1677, threadinfo ffff88082f284000, task ffff880831376080)
<4> ffff880780bcb9b8 0000000300000001 ffff88082e570f58 ffff880780bcb9b8
<4><d> 0000000000000282 0000000000000003 0000000000000001 0000000000000000
<4><d> ffff88082f285de0 ffffffff81058bc8 ffff88082d851cc0 ffff880833361800
<4> [<ffffffff81058bc8>] __wake_up+0x48/0x70
<4> [<ffffffffa1390e80>] dio_complete_routine+0x1a0/0x380 [osd_ldiskfs]
<4> [<ffffffff811c314d>] bio_endio+0x1d/0x40
<4> [<ffffffff8137d33f>] loop_thread+0xdf/0x270
<4> [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffff8137d260>] ? loop_thread+0x0/0x270
<4> [<ffffffff8109abf6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
Note the freed memory (the 0x5a slab-poison pattern) in RAX/RDX. These are sometimes other invalid pointers rather than freed memory.
Here's the race that's causing this.
In dio_complete_routine, which is called in one of the 'loop' threads, we have this sequence:
        if (atomic_dec_and_test(&iobuf->dr_numreqs))
                wake_up(&iobuf->dr_wait);
The corresponding wait (for one of the 'lctl' threads) is this:
        wait_event(iobuf->dr_wait,
                   atomic_read(&iobuf->dr_numreqs) == 0);
Once this wait has completed, things roll up, and, eventually, the struct 'osd_thread_info' containing this 'iobuf' is freed.
The bug (which is easily repeatable) is this:
'loop', in dio_complete_routine, arrives at 'atomic_dec_and_test' and successfully reduces numreqs to 0. The thread is then delayed for one reason or another.
'lctl', the other thread, arrives at the 'wait_event', and passes the condition, because dr_numreqs is already 0.
'lctl' then continues on and, eventually, the osd_thread_info containing the iobuf is freed.
Then the 'loop' thread resumes and calls 'wake_up', except that the wait queue has already been freed.
This very quickly leads to a GPF/null pointer exception.
I've given this a fair bit of thought, and I don't see a terribly clean way to fix it. The fundamental problem, I think, is that osd_thread_info is supposed to be thread-specific information, yet it is being used to hold a wait queue that another thread uses.
I've created a patch which allocates the iobuf separately from the osd_thread_info, and uses a spinlock in the thread info to prevent the race condition.
We've confirmed this fixes the problem in our testing, but I'm not sure I like the approach. I am very open to suggestions for different fixes.
I will also attach logs and a dmesg showing the crash (with some extra debug info in the logs). I can provide a dump if requested.
In the case in the attached logs, the two racing threads are 1677 - loop2 - and 5188, one of the lctl threads.
In the attached logs, thread 1677 logs "wakeup", which is printed before the wakeup call; then it is delayed for whatever reason. When it resumes, it is inside the actual wakeup call and triggers the kernel panic, since by then 5188 has freed the osd_thread_info struct containing the iobuf and wait queue.