[LU-5264] ASSERTION( info->oti_r_locks == 0 ) at OST umount Created: 27/Jun/14 Updated: 14/May/20 Resolved: 20/May/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.3 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Bruno Travouillon (Inactive) | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | p4b | ||
| Environment: |
RHEL6 w/ kernel 2.6.32-431.17.1.el6.x86_64 |
||
| Attachments: |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 14690 |
| Description |
|
While stopping a Lustre filesystem, the following LBUG occurred on an OSS:

----8<----
Call Trace:
Kernel panic - not syncing: LBUG
----8<----

There were 15 OSTs mounted on the OSS. One umount process had completed while the others were still running at crash time. All the umount processes were running in parallel, driven by Shine (shine stop -f scratch3 -n @io).

Backtrace of the process:
----8<----
----8<----

You can find attached:
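For context on the assertion itself: info here is the osd_thread_info value attached to a service thread's lu_context (see the crash analysis in the comments below), and the check fires when that per-thread state is torn down. A minimal sketch of the kind of key exit callback involved (reconstructed from the assertion text, not copied from the 2.4.3 tree; the callback name follows the osd-ldiskfs convention):

    /* lct_exit-style callback for the OSD per-thread info key (sketch) */
    static void osd_key_exit(const struct lu_context *ctx,
                             struct lu_context_key *key, void *data)
    {
            struct osd_thread_info *info = data;

            /* a thread leaving its context must not hold residual
             * read locks on OSD objects; this is the LASSERT that
             * turns into the reported LBUG */
            LASSERT(info->oti_r_locks == 0);
    }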
|
| Comments |
| Comment by Bruno Faccini (Inactive) [ 27/Jun/14 ] |
|
Hello Bruno, nice to hear from you, even in a JIRA!! |
| Comment by Bruno Travouillon (Inactive) [ 27/Jun/14 ] |
|
Hi Bruno, Yes, AFAIK, this is a one-shot occurrence. |
| Comment by Bruno Faccini (Inactive) [ 03/Jul/14 ] |
|
Hmm, after having a look at the assembly code of the routines in the panic/LBUG stack, it will be difficult/painful for me to spell out exactly which crash sub-commands will extract what I need from the crash dump... But let's try: first, I would like to get both the "bt -f" and "bt -F" output for the panic/LBUG task ("bt -f" dumps the raw per-frame stack data; "bt -F" additionally resolves stack addresses to symbols or known slab objects where possible). Is that possible? Also, this ticket duplicates LU-4776, which tracks similar failures seen in auto-tests... |
| Comment by Bruno Travouillon (Inactive) [ 11/Jul/14 ] |
|
Hi Bruno, You will find attached both 'bt -f' and 'bt -F' from the crash dump. HTH, Bruno |
| Comment by Bruno Faccini (Inactive) [ 22/Jul/14 ] |
|
Hello Bruno, can you get more information out of the crash dump on site? If yes, here is what I need:

crash> p/x lu_keys
crash> lu_env 0xffff88041e7c5d80

This last command will print the lu_env struct containing the context and the address of its lc_value array of pointers. And since the current index of interest, in both lu_keys[] and lc_value[], seems to be #21:

crash> p/x *lu_keys[21]
crash> rd <lc_value address> 40
crash> osd_thread_info <lc_value[21] pointer value>

Don't forget to first load the Lustre modules with their debuginfo, using "mod -S <[debuginfo,modules] root dir>".
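As a worked example of the indexing involved (an editorial aside, assuming 8-byte pointers on x86_64): each lc_value[] slot holds one pointer, so slot #21 lives at offset 21 * 8 = 168 = 0xa8 from the start of the array. Once the lu_env command has printed the lc_value address, that single slot can also be read directly:

crash> rd <lc_value address + 0xa8> 1

A value of 0 there means lc_value[21] is NULL in that context. |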
| Comment by Bruno Travouillon (Inactive) [ 18/Sep/14 ] |
|
Hi Bruno,
Here is the requested output. Sorry for the delay.
crash> p/x lu_keys
$8 = {0xffffffffa0ec53e0, 0xffffffffa0ec7aa0, 0xffffffffa0ec6280, 0xffffffffa0ecd720,
      0xffffffffa0eb2a20, 0xffffffffa10a39c0, 0xffffffffa0820ac0, 0xffffffffa090d320,
      0xffffffffa11b77a0, 0xffffffffa11beac0, 0xffffffffa11b9c60, 0xffffffffa1254960,
      0xffffffffa12549a0, 0xffffffffa12ca080, 0xffffffffa12ca0c0, 0xffffffffa1391f20,
      0xffffffffa1391f60, 0xffffffffa1393220, 0xffffffffa1393260, 0xffffffffa1438700,
      0xffffffffa14386c0, 0xffffffffa15197e0, 0xffffffffa1594f60, 0x0, 0x0, 0x0, 0x0,
      0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
crash> lu_env 0xffff88041e7c5d80
struct lu_env {
le_ctx = {
lc_tags = 2952790018,
lc_state = LCS_LEFT,
lc_thread = 0xffff88041227f140,
lc_value = 0xffff88041e7c8e00,
lc_remember = {
next = 0xffff881057a58ac0,
prev = 0xffff880848277d58
},
lc_version = 25,
lc_cookie = 6
},
le_ses = 0x0
}
crash> p/x *lu_keys[21]
$9 = {
lct_tags = 0x400000c3,
lct_init = 0xffffffffa14e0330,
lct_fini = 0xffffffffa14db1a0,
lct_exit = 0xffffffffa14da620,
lct_index = 0x15,
lct_used = {
counter = 0x2
},
lct_owner = 0xffffffffa1526680,
lct_reference = {<No data fields>}
}
crash> rd 0xffff88041e7c8e00 40
ffff88041e7c8e00: ffff88041e7c8c00 0000000000000000 ..|.............
ffff88041e7c8e10: ffff88041e7c9800 0000000000000000 ..|.............
ffff88041e7c8e20: 0000000000000000 ffff88041e7cac00 ..........|.....
ffff88041e7c8e30: ffff880418c30d40 ffff88041e7c5cc0 @........\|.....
ffff88041e7c8e40: ffff88041e7c7dc0 0000000000000000 .}|.............
ffff88041e7c8e50: ffff88041e7c8a00 0000000000000000 ..|.............
ffff88041e7c8e60: 0000000000000000 0000000000000000 ................
ffff88041e7c8e70: 0000000000000000 0000000000000000 ................
ffff88041e7c8e80: 0000000000000000 0000000000000000 ................
ffff88041e7c8e90: 0000000000000000 0000000000000000 ................
ffff88041e7c8ea0: 0000000000000000 0000000000000000 ................
ffff88041e7c8eb0: 0000000000000000 0000000000000000 ................
ffff88041e7c8ec0: 0000000000000000 0000000000000000 ................
ffff88041e7c8ed0: 0000000000000000 0000000000000000 ................
ffff88041e7c8ee0: 0000000000000000 0000000000000000 ................
ffff88041e7c8ef0: 0000000000000000 0000000000000000 ................
ffff88041e7c8f00: 0000000000000000 0000000000000000 ................
ffff88041e7c8f10: 0000000000000000 0000000000000000 ................
ffff88041e7c8f20: 0000000000000000 0000000000000000 ................
ffff88041e7c8f30: 0000000000000000 0000000000000000 ................
There is no <lc_value[21] pointer value> here... Did I miss something? We had a new occurrence of this bug last week on the same OSS. I will try to get the output from the new dump.
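Applying the offset arithmetic from the earlier comment to this dump (an editorial aside): lu_keys[21] reports lct_index = 0x15 = 21, so slot #21 is indeed the one to inspect, and lc_value[21] lives at 0xffff88041e7c8e00 + 21 * 8 = 0xffff88041e7c8ea8. The rd output above shows that quadword as 0000000000000000, so lc_value[21] is NULL in this context, which is consistent with the observation that there is no pointer value to feed to the osd_thread_info command. |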
| Comment by Bruno Faccini (Inactive) [ 01/Oct/14 ] |
|
Yes, this is strange, because if lc_value[21] was really NULL then reading info->oti_r_locks should have crashed with a kernel Oops/BUG() on a NULL dereference instead of an LBUG... |
| Comment by Bruno Faccini (Inactive) [ 12/Dec/14 ] |
|
I have spent some time this week doing more analysis on my own crash dump, and I should be able to push a patch soon to fix the suspected race during umounts. BTW, can you check in your on-site crash dump/logs whether this was the last OST to be unmounted? |
| Comment by Gerrit Updater [ 17/Dec/14 ] |
|
Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/13103 |
| Comment by Bruno Faccini (Inactive) [ 17/Dec/14 ] |
|
After more debugging on my own crash dump, I think the problem comes from the fact that upon umount, presumably of the last device using the same OSD back-end, lu_context_key_quiesce() is run to prepare for module unload; it removes all of the module's key references from every context linked on the lu_context_remembered list. Threads must therefore protect against this traversal when exiting their own contexts.
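To make the suspected race concrete, here is a minimal sketch of the two code paths involved (illustrative pseudocode in the style of lustre/obdclass/lu_object.c; the names match the source, but the bodies are simplified):

    /* umount path: quiesce a module's key before module unload */
    void lu_context_key_quiesce(struct lu_context_key *key)
    {
            struct lu_context *ctx;

            spin_lock(&lu_keys_guard);
            /* walk every remembered context and free this key's value */
            list_for_each_entry(ctx, &lu_context_remembered, lc_remember)
                    key_fini(ctx, key->lct_index);  /* frees ctx->lc_value[i] */
            spin_unlock(&lu_keys_guard);
    }

    /* service-thread path: the thread leaves its own context */
    void lu_context_exit(struct lu_context *ctx)
    {
            unsigned int i;

            ctx->lc_state = LCS_LEFT;
            for (i = 0; i < ARRAY_SIZE(lu_keys); ++i) {
                    /* without holding lu_keys_guard here, lc_value[i] can be
                     * freed by lu_context_key_quiesce() between this check
                     * and the lct_exit() call, so lct_exit() (e.g. the OSD
                     * key's LASSERT on oti_r_locks) runs on stale memory */
                    if (ctx->lc_value[i] != NULL &&
                        lu_keys[i]->lct_exit != NULL)
                            lu_keys[i]->lct_exit(ctx, lu_keys[i],
                                                 ctx->lc_value[i]);
            }
    }

One way to close this race, and roughly the direction the patch above takes, is to serialize lu_context_exit() with the quiesce traversal (for example by taking lu_keys_guard around the lct_exit() loop), so a context's values cannot be freed from under a thread that is still leaving the context. |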
| Comment by Bruno Travouillon (Inactive) [ 17/Dec/14 ] |
|
Hi Bruno, I was just looking at the dumps at the customer site. For the first crash, there were still 14 umount processes running for 15 OSTs. For the second, 3 umount processes were remaining for 8 OSTs. |
| Comment by Bruno Travouillon (Inactive) [ 17/Dec/14 ] |
|
I mean 14 umount processes still running for 15 previously mounted OSTs, i.e., 1 OST had been successfully unmounted at the first crash, and 5 OSTs had been successfully unmounted at the second one. |
| Comment by Bruno Faccini (Inactive) [ 18/Dec/14 ] |
|
Thanks Bruno, but I believe this could be an effect of the Shine tool running all the umounts in parallel... |
| Comment by Bruno Travouillon (Inactive) [ 11/Feb/15 ] |
|
Bruno, Can we go ahead and backport the patch to b2_5? |
| Comment by Bruno Faccini (Inactive) [ 12/Feb/15 ] |
|
Hmm, the master patch version has successfully passed auto-test, but it definitely needs to pass the review step. I will try to get more involvement from the reviewers. |
| Comment by Gerrit Updater [ 03/Mar/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13103/ |
| Comment by Peter Jones [ 20/May/15 ] |
|
Landed for 2.8 |
| Comment by Gerrit Updater [ 20/Jul/15 ] |
|
Grégoire Pichon (gregoire.pichon@bull.net) uploaded a new patch: http://review.whamcloud.com/15647 |
| Comment by Bruno Travouillon (Inactive) [ 02/Sep/15 ] |
|
Grégoire, maybe you should abandon this backport to b2_5 because of