[LU-7681] Deadlock on MDS around dqptr_sem Created: 18/Jan/16 Updated: 07/Jun/17 Resolved: 07/Jun/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Sebastien Piechurski | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | p4b | ||
| Environment: | RHEL 6 with Bull kernel based on 2.6.32-279.5.2 |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
An MDS has several times encountered a deadlock where a process seems to have acquired the superblock dqptr_sem semaphore and left without releasing it. When looking at a dump taken during this deadlock, we can see this:
PID: 1479  TASK: ffff880cefe9e7f0  CPU: 5  COMMAND: "mdt_279"
 #0 [ffff8810093538a0] schedule at ffffffff814965a5
 #1 [ffff881009353968] jbd2_log_wait_commit at ffffffffa00b9e55 [jbd2]
 #2 [ffff8810093539f8] jbd2_journal_stop at ffffffffa00b1b6b [jbd2]
 #3 [ffff881009353a58] __ldiskfs_journal_stop at ffffffffa05c7808 [ldiskfs]
 #4 [ffff881009353a88] osd_trans_stop at ffffffffa0e86b35 [osd_ldiskfs]
 #5 [ffff881009353ab8] mdd_trans_stop at ffffffffa0d8b4aa [mdd]
 #6 [ffff881009353ac8] mdd_attr_set at ffffffffa0d6aa5f [mdd]
 #7 [ffff881009353ba8] cml_attr_set at ffffffffa0ec3a86 [cmm]
 #8 [ffff881009353bd8] mdt_attr_set at ffffffffa0dfe418 [mdt]
 #9 [ffff881009353c28] mdt_reint_setattr at ffffffffa0dfea65 [mdt]
#10 [ffff881009353cb8] mdt_reint_rec at ffffffffa0df7cb1 [mdt]
#11 [ffff881009353cd8] mdt_reint_internal at ffffffffa0deeed4 [mdt]
#12 [ffff881009353d28] mdt_reint at ffffffffa0def2b4 [mdt]
#13 [ffff881009353d48] mdt_handle_common at ffffffffa0de3762 [mdt]
#14 [ffff881009353d98] mdt_regular_handle at ffffffffa0de4655 [mdt]
#15 [ffff881009353da8] ptlrpc_main at ffffffffa082e4e6 [ptlrpc]
#16 [ffff881009353f48] kernel_thread at ffffffff8100412a
We are at this point waiting in dquot_initialize() for the superblock dqptr_sem semaphore to be released. For reference, the semaphore is seen as follows:
crash> struct rw_semaphore ffff88086b5c1180
struct rw_semaphore {
count = -4294967296, # == 0xffffffff00000000
wait_lock = {
raw_lock = {
slock = 2653658667
}
},
wait_list = {
next = 0xffff880834bbb5c0,
prev = 0xffff880834bbb5c0
}
}
Given the definition and comments for rw_semaphore in include/linux/rwsem-spinlock.h below:
/*
* the rw-semaphore definition
* - if activity is 0 then there are no active readers or writers
* - if activity is +ve then that is the number of active readers
* - if activity is -1 then there is one active writer
* - if wait_list is not empty, then there are processes waiting for the semaphore
*/
struct rw_semaphore {
__s32 activity;
spinlock_t wait_lock;
struct list_head wait_list;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
#endif
};
this would mean that activity is -1 (one active writer). What did I miss there? |
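(Editorial sketch, not from the original ticket.) One plausible answer to the question above: on x86_64, a 2.6.32-era kernel builds the xchg-add rwsem variant (arch/x86/include/asm/rwsem.h), not the spinlock variant quoted from rwsem-spinlock.h, so the field in the crash dump is `count`, not `activity`, and it packs an active-locker count in the low 32 bits plus a waiting/write bias in the high bits. Assuming those 2.6.32 bias conventions (RWSEM_ACTIVE_BIAS = 1, RWSEM_WAITING_BIAS = -0x100000000), a small helper decodes the dumped value:

```python
# Sketch: decode an x86_64 rwsem `count` under assumed 2.6.32 conventions.
# Assumption: low 32 bits = active-locker count (signed), high bits = bias
# from waiters/writers; this matches arch/x86/include/asm/rwsem.h of that era.

RWSEM_ACTIVE_MASK = 0xFFFFFFFF

def decode_rwsem_count(count):
    """Split a signed 64-bit rwsem count into (active, high_word)."""
    # Low 32 bits, sign-extended: number of active lockers.
    active = count & RWSEM_ACTIVE_MASK
    if active >= 0x80000000:
        active -= 0x100000000
    # High 32 bits (arithmetic shift): negative when a waiting/write
    # bias has been applied, i.e. the wait_list is non-empty.
    high = count >> 32
    return active, high

# Value from the crash dump: -4294967296 == 0xffffffff00000000 as u64.
active, high = decode_rwsem_count(-4294967296)
print(active, high)  # -> 0 -1
```

Under these assumptions the dump reads as "zero active holders, but a waiting bias applied", i.e. not one active writer; which would be consistent with the non-empty wait_list shown above and with a lost release/wakeup rather than a live owner.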
| Comments |
| Comment by Bruno Faccini (Inactive) [ 18/Jan/16 ] |
|
Assigning to me since I have already started working with Seb on this problem from Bull office. |
| Comment by Bruno Faccini (Inactive) [ 18/Jan/16 ] |
|
Suspecting a deadlock problem around dqptr_sem in a quite old kernel ... |
| Comment by Bruno Faccini (Inactive) [ 19/Jan/16 ] |
|
Seb, is the crash-dump for this problem still available? If so, can you upload it along with the kernel-[common-]debuginfo and lustre-debuginfo RPMs? |
| Comment by Bruno Faccini (Inactive) [ 19/Jan/16 ] |
|
BTW, later Lustre versions (>= 2.4) use a kernel patch to avoid dqptr_sem usage, so it is very likely that this problem no longer exists there. |
| Comment by Sebastien Piechurski [ 20/Jan/16 ] |
|
A bundle with all the debuginfo packages and sources is currently uploading to ftp.whamcloud.com. |
| Comment by Bruno Faccini (Inactive) [ 21/Jan/16 ] |
|
Seb, |
| Comment by Sebastien Piechurski [ 21/Jan/16 ] |
|
Yes, the transfer failed; I had not noticed. |
| Comment by Sebastien Piechurski [ 23/Jan/16 ] |
|
The transfer finally succeeded with file |
| Comment by Bruno Faccini (Inactive) [ 17/Feb/16 ] |
|
Hello Seb, and sorry to be late on this. I have spent more time analyzing the crash-dump you provided. BTW, this looks like a new occurrence/crash-dump, different from the one we already worked on together, which means the same problem has re-occurred ... Also, can you give me some hints on how to use the lustre[-ldiskfs]-core RPMs you provided (and their embedded sets of patches), in order to get the full and exact source tree for this Lustre version? As a first thought, and despite the fact that I still have not identified the thread that presently owns dqptr_sem and the blocked situation looks a bit different, I wonder if my patch for |
| Comment by Sebastien Piechurski [ 07/Jun/17 ] |
|
I have not heard about this problem for quite a while, so I think it was solved by moving away from 2.1. Please close. |
| Comment by Peter Jones [ 07/Jun/17 ] |
|
Thanks |