running truncated executable causes spewing of lock debug messages
(LU-1299)
|
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.1 |
| Fix Version/s: | None |
| Type: | Technical task | Priority: | Major |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: | |||
| Rank (Obsolete): | 9747 |
| Description |
|
Hit the following LBUG: 2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) lock@ffff8802615024d8[4 2 0 0 0 00000005] P(0):[0, 18446744073709551615]@[0x16b9ac4cbe:0xc:0x0] {
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) vvp@ffff88025946e9e8:
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) lov@ffff880431e80cf8: 2
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) 0 0: ---
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) 1 1: lock@ffff8803c465faf8[1 3 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x100a80000:0x1b3b212:0x0] {
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) lovsub@ffff8801f6d645a0: [1 ffff880431e80cf8 P(0):[0, 18446744073709551615]@[0x16b9ac4cbe:0xc:0x0]]
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) osc@ffff8803c4694b50: ffff88010cd766c0 00101001 0xe12a56a3ad7ca7fa 3 ffff880428397e48 size: 0 mtime: 1335700029 atime: 1335700029 ctime: 1335700029 blocks: 0
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) } lock@ffff8803c465faf8
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one())
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) } lock@ffff8802615024d8
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) Delete CLS_HELD lock
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:383:lovsub_lock_delete_one()) Impossible state: 2
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:384:lovsub_lock_delete_one()) LBUG
|
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 30/Apr/12 ] |
|
Do you have a stack trace for this? |
| Comment by Ned Bass [ 30/Apr/12 ] |
|
Here's the stack trace. crash> bt
PID: 72532 TASK: ffff880431796ae0 CPU: 3 COMMAND: "rm"
#0 [ffff880332151920] machine_kexec at ffffffff8103216b
#1 [ffff880332151980] crash_kexec at ffffffff810b8c12
#2 [ffff880332151a50] panic at ffffffff814ee6cf
#3 [ffff880332151ad0] lbug_with_loc at ffffffffa03e2e1b [libcfs]
#4 [ffff880332151af0] lovsub_lock_delete at ffffffffa08bfff2 [lov]
#5 [ffff880332151b50] cl_lock_delete0 at ffffffffa055a665 [obdclass]
#6 [ffff880332151b80] cl_lock_hold_release at ffffffffa055a816 [obdclass]
#7 [ffff880332151ba0] cl_lock_unhold at ffffffffa055bf73 [obdclass]
#8 [ffff880332151bd0] lov_sublock_release at ffffffffa08b7e78 [lov]
#9 [ffff880332151c20] lov_lock_cancel at ffffffffa08b8ed8 [lov]
#10 [ffff880332151c90] cl_lock_cancel0 at ffffffffa0559715 [obdclass]
#11 [ffff880332151cc0] cl_lock_hold_release at ffffffffa055a7f5 [obdclass]
#12 [ffff880332151ce0] cl_lock_release at ffffffffa055beb2 [obdclass]
#13 [ffff880332151d10] cl_glimpse_lock at ffffffffa09722d6 [lustre]
#14 [ffff880332151d70] cl_glimpse_size at ffffffffa0972914 [lustre]
#15 [ffff880332151dc0] ll_inode_revalidate_it at ffffffffa093ab8a [lustre]
#16 [ffff880332151e00] ll_getattr_it at ffffffffa093acb9 [lustre]
#17 [ffff880332151e30] ll_getattr at ffffffffa093ae17 [lustre]
#18 [ffff880332151e70] vfs_getattr at ffffffff8117cb21
#19 [ffff880332151eb0] vfs_fstatat at ffffffff8117cbb8
#20 [ffff880332151ef0] sys_newfstatat at ffffffff8117cbf4
#21 [ffff880332151f80] system_call_fastpath at ffffffff8100b0f2
RIP: 00002aaaaada549a RSP: 00007fffffffd2d8 RFLAGS: 00010246
RAX: 0000000000000106 RBX: ffffffff8100b0f2 RCX: 3144324c2f303144
RDX: 000000000060e600 RSI: 000000000060d0f0 RDI: ffffffffffffff9c
RBP: 000000000060e600 R8: 0000000000000100 R9: 0000000000000000
R10: 0000000000000100 R11: 0000000000000246 R12: 000000000060d060
R13: 000000000060e590 R14: 0000000000000004 R15: 000000000060d060
ORIG_RAX: 0000000000000106 CS: 0033 SS: 002b
|
| Comment by Prakash Surya (Inactive) [ 30/Apr/12 ] |
|
Sure. This was hit on many nodes, this one printed the stack trace to the console: 2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) lock@ffff880320159b38[4 2 0 0 0 00000005] P(0):[0, 18446744073709551615]@[0x1bb30df888:0x4:0x0] {
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) vvp@ffff8803515befb0:
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) lov@ffff88037119c6e0: 2
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) 0 0: ---
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) 1 1: lock@ffff88031bb1eeb8[1 3 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x100950000:0x1a45bfa:0x0] {
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) lovsub@ffff8804276ae3a0: [1 ffff88037119c6e0 P(0):[0, 18446744073709551615]@[0x1bb30df888:0x4:0x0]]
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) osc@ffff880425f1c2b0: ffff8801354436c0 00101001 0x92774b263242bbe4 3 ffff880812898b10 size: 0 mtime: 1335712511 atime: 1335712511 ctime: 1335712511 blocks: 0
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) } lock@ffff88031bb1eeb8
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one())
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) } lock@ffff880320159b38
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) Delete CLS_HELD lock
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:383:lovsub_lock_delete_one()) Impossible state: 2
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:384:lovsub_lock_delete_one()) LBUG
2012-04-29 12:15:02 Pid: 72532, comm: rm
2012-04-29 12:15:02
2012-04-29 12:15:02 Call Trace:
2012-04-29 12:15:02 [<ffffffffa03e2895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2012-04-29 12:15:02 [<ffffffffa03e2dc7>] lbug_with_loc+0x47/0xb0 [libcfs]
2012-04-29 12:15:02 [<ffffffffa08bfff2>] lovsub_lock_delete+0x1a2/0x370 [lov]
2012-04-29 12:15:02 [<ffffffffa055a665>] cl_lock_delete0+0xb5/0x150 [obdclass]
2012-04-29 12:15:02 [<ffffffffa055a816>] cl_lock_hold_release+0x116/0x170 [obdclass]
2012-04-29 12:15:02 [<ffffffffa055bf73>] cl_lock_unhold+0x33/0xc0 [obdclass]
2012-04-29 12:15:02 [<ffffffffa08b7e78>] lov_sublock_release+0x198/0x260 [lov]
2012-04-29 12:15:02 [<ffffffffa08b8ed8>] lov_lock_cancel+0xe8/0x230 [lov]
2012-04-29 12:15:02 [<ffffffffa0559715>] cl_lock_cancel0+0x75/0xf0 [obdclass]
2012-04-29 12:15:02 [<ffffffffa055a7f5>] cl_lock_hold_release+0xf5/0x170 [obdclass]
2012-04-29 12:15:02 [<ffffffffa055beb2>] cl_lock_release+0x82/0x110 [obdclass]
2012-04-29 12:15:02 [<ffffffffa09722d6>] cl_glimpse_lock+0x196/0x330 [lustre]
2012-04-29 12:15:02 [<ffffffffa0972914>] cl_glimpse_size+0x124/0x130 [lustre]
2012-04-29 12:15:02 [<ffffffffa093ab8a>] ll_inode_revalidate_it+0x5a/0x140 [lustre]
2012-04-29 12:15:02 [<ffffffff811843e5>] ? putname+0x45/0x50
2012-04-29 12:15:02 [<ffffffffa093acb9>] ll_getattr_it+0x49/0x170 [lustre]
2012-04-29 12:15:02 [<ffffffffa093ae17>] ll_getattr+0x37/0x40 [lustre]
2012-04-29 12:15:02 [<ffffffff8117cb21>] vfs_getattr+0x51/0x80
2012-04-29 12:15:02 [<ffffffff810d152f>] ? audit_putname+0x3f/0x50
2012-04-29 12:15:02 [<ffffffff8117cbb8>] vfs_fstatat+0x68/0x80
2012-04-29 12:15:02 [<ffffffff8117cbf4>] sys_newfstatat+0x24/0x50
2012-04-29 12:15:02 [<ffffffff8100c5b5>] ? math_state_restore+0x45/0x60
2012-04-29 12:15:02 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
2012-04-29 12:15:02
2012-04-29 12:15:02 Apr 29 12:15:02 Kernel panic - not syncing: LBUG
2012-04-29 12:15:02 cab283 kernel: LPid: 72532, comm: rm Not tainted 2.6.32-220.13.1.2chaos.ch5.x86_64 #1
2012-04-29 12:15:02 ustreError: 7253Call Trace:
2012-04-29 12:15:02 2:0:(lovsub_lock [<ffffffff814ee6c8>] ? panic+0x78/0x143
2012-04-29 12:15:02 .c:384:lovsub_lo [<ffffffffa03e2e1b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
2012-04-29 12:15:02 ck_delete_one()) [<ffffffffa08bfff2>] ? lovsub_lock_delete+0x1a2/0x370 [lov]
2012-04-29 12:15:02 LBUG
2012-04-29 12:15:02 Apr 29 12: [<ffffffffa055a665>] ? cl_lock_delete0+0xb5/0x150 [obdclass]
2012-04-29 12:15:02 15:02 cab283 ker [<ffffffffa055a816>] ? cl_lock_hold_release+0x116/0x170 [obdclass]
2012-04-29 12:15:02 nel: Kernel pani [<ffffffffa055bf73>] ? cl_lock_unhold+0x33/0xc0 [obdclass]
2012-04-29 12:15:02 c - not syncing: [<ffffffffa08b7e78>] ? lov_sublock_release+0x198/0x260 [lov]
2012-04-29 12:15:02 LBUG
2012-04-29 12:15:02 [<ffffffffa08b8ed8>] ? lov_lock_cancel+0xe8/0x230 [lov]
2012-04-29 12:15:02 [<ffffffffa0559715>] ? cl_lock_cancel0+0x75/0xf0 [obdclass]
2012-04-29 12:15:02 [<ffffffffa055a7f5>] ? cl_lock_hold_release+0xf5/0x170 [obdclass]
2012-04-29 12:15:02 [<ffffffffa055beb2>] ? cl_lock_release+0x82/0x110 [obdclass]
2012-04-29 12:15:02 [<ffffffffa09722d6>] ? cl_glimpse_lock+0x196/0x330 [lustre]
2012-04-29 12:15:02 [<ffffffffa0972914>] ? cl_glimpse_size+0x124/0x130 [lustre]
2012-04-29 12:15:02 [<ffffffffa093ab8a>] ? ll_inode_revalidate_it+0x5a/0x140 [lustre]
2012-04-29 12:15:03 [<ffffffff811843e5>] ? putname+0x45/0x50
2012-04-29 12:15:03 [<ffffffffa093acb9>] ? ll_getattr_it+0x49/0x170 [lustre]
2012-04-29 12:15:03 [<ffffffffa093ae17>] ? ll_getattr+0x37/0x40 [lustre]
2012-04-29 12:15:03 [<ffffffff8117cb21>] ? vfs_getattr+0x51/0x80
2012-04-29 12:15:03 [<ffffffff810d152f>] ? audit_putname+0x3f/0x50
2012-04-29 12:15:03 [<ffffffff8117cbb8>] ? vfs_fstatat+0x68/0x80
2012-04-29 12:15:03 [<ffffffff8117cbf4>] ? sys_newfstatat+0x24/0x50
2012-04-29 12:15:03 [<ffffffff8100c5b5>] ? math_state_restore+0x45/0x60
2012-04-29 12:15:03 [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
We also have a core dump which I can poke at if you need me to. |
| Comment by Ned Bass [ 30/Apr/12 ] |
|
Here is the patch stack that was running at the time. |
| Comment by Ned Bass [ 01/May/12 ] |
|
Hi Jinshan, Do you think this may be a regression in our recent 2.1.1-10chaos tag? If not, do you have any ideas about the nature of the bug? |
| Comment by Jinshan Xiong (Inactive) [ 01/May/12 ] |
|
No, this is a pre-existing bug. How often do you hit this one? I've found the root cause of this one and am working on 1356. |
| Comment by Ned Bass [ 01/May/12 ] |
|
That's great news! We hit it very often. I just unintentionally reproduced it running ls on a login node. I wonder if we landed something that makes it more likely to hit since we never saw it before. |
| Comment by Jinshan Xiong (Inactive) [ 02/May/12 ] |
|
I need more time on this bug. Ned, if you can reproduce this issue quite often, is it possible to collect a debug log at the client side. I need DLMTRACE to be set. |
| Comment by Jinshan Xiong (Inactive) [ 02/May/12 ] |
|
I confirm that this bug was imported by |
| Comment by Ned Bass [ 02/May/12 ] |
|
Do you still need the debug log? Now that I'm trying to reproduce it, of course it is being elusive. |
| Comment by Jinshan Xiong (Inactive) [ 02/May/12 ] |
|
Actually my previous comment was inaccurate. This bug is concealed by Please check patch: http://review.whamcloud.com/2632 for a fix. This is still about signal handling problem. |
| Comment by Jinshan Xiong (Inactive) [ 02/May/12 ] |
|
Ned, I don't need debug log since I've known every details of this bug. Thanks. |
| Comment by Ned Bass [ 02/May/12 ] |
|
Thanks, we'll give the patch a try. I'm not sure that your comment about Although, now that I write that, I'm realizing that we have a local bug open about OOMs occurring despite the application is using little real memory. So perhaps we have unknowingly been hitting |
| Comment by Jinshan Xiong (Inactive) [ 02/May/12 ] |
|
Yes, original code can trigger fake OOM because wrong error code was returned by ll_fault(), so I guess OOM you have seen should go away after From what I have seen from log, it looks very like that a glimpse of file size was interrupted by a signal. Now that you mentioned it was hit in a normal file system usage like ls, it would exist another path to have the same back trace because ls won't issue signals afaik. Anyway, please you apply this patch and try to reproduce it again, this way we can get more information and move steps forward. |
| Comment by Ned Bass [ 02/May/12 ] |
|
Also note the originally reported occurrences involved rm. |
| Comment by Jinshan Xiong (Inactive) [ 04/May/12 ] |
|
Fixed in |