running truncated executable causes spewing of lock debug messages (LU-1299)

[LU-1355] LBUG in lovsub_lock_delete_one -- Impossible state: 2 (CLS_ENQUEUED) Created: 30/Apr/12  Updated: 04/May/12  Resolved: 04/May/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.1
Fix Version/s: None

Type: Technical task Priority: Major
Reporter: Prakash Surya (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

https://github.com/chaos/lustre


Rank (Obsolete): 9747

 Description   

Hit the following LBUG:

2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) lock@ffff8802615024d8[4 2 0 0 0 00000005] P(0):[0, 18446744073709551615]@[0x16b9ac4cbe:0xc:0x0] {
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one())     vvp@ffff88025946e9e8: 
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one())     lov@ffff880431e80cf8: 2
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one())     0 0: ---
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one())     1 1: lock@ffff8803c465faf8[1 3 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x100a80000:0x1b3b212:0x0] {
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one())     lovsub@ffff8801f6d645a0: [1 ffff880431e80cf8 P(0):[0, 18446744073709551615]@[0x16b9ac4cbe:0xc:0x0]] 
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one())     osc@ffff8803c4694b50: ffff88010cd766c0 00101001 0xe12a56a3ad7ca7fa 3 ffff880428397e48 size: 0 mtime: 1335700029 atime: 1335700029 ctime: 1335700029 blocks: 0
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) } lock@ffff8803c465faf8
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) 
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) } lock@ffff8802615024d8
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) Delete CLS_HELD lock
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:383:lovsub_lock_delete_one()) Impossible state: 2
2012-04-29 07:43:52 LustreError: 83833:0:(lovsub_lock.c:384:lovsub_lock_delete_one()) LBUG


 Comments   
Comment by Jinshan Xiong (Inactive) [ 30/Apr/12 ]

Do you have a stack trace for this?

Comment by Ned Bass [ 30/Apr/12 ]

Here's the stack trace.

crash> bt
PID: 72532  TASK: ffff880431796ae0  CPU: 3   COMMAND: "rm"
 #0 [ffff880332151920] machine_kexec at ffffffff8103216b
 #1 [ffff880332151980] crash_kexec at ffffffff810b8c12
 #2 [ffff880332151a50] panic at ffffffff814ee6cf
 #3 [ffff880332151ad0] lbug_with_loc at ffffffffa03e2e1b [libcfs]
 #4 [ffff880332151af0] lovsub_lock_delete at ffffffffa08bfff2 [lov]
 #5 [ffff880332151b50] cl_lock_delete0 at ffffffffa055a665 [obdclass]
 #6 [ffff880332151b80] cl_lock_hold_release at ffffffffa055a816 [obdclass]
 #7 [ffff880332151ba0] cl_lock_unhold at ffffffffa055bf73 [obdclass]
 #8 [ffff880332151bd0] lov_sublock_release at ffffffffa08b7e78 [lov]
 #9 [ffff880332151c20] lov_lock_cancel at ffffffffa08b8ed8 [lov]
#10 [ffff880332151c90] cl_lock_cancel0 at ffffffffa0559715 [obdclass]
#11 [ffff880332151cc0] cl_lock_hold_release at ffffffffa055a7f5 [obdclass]
#12 [ffff880332151ce0] cl_lock_release at ffffffffa055beb2 [obdclass]
#13 [ffff880332151d10] cl_glimpse_lock at ffffffffa09722d6 [lustre]
#14 [ffff880332151d70] cl_glimpse_size at ffffffffa0972914 [lustre]
#15 [ffff880332151dc0] ll_inode_revalidate_it at ffffffffa093ab8a [lustre]
#16 [ffff880332151e00] ll_getattr_it at ffffffffa093acb9 [lustre]
#17 [ffff880332151e30] ll_getattr at ffffffffa093ae17 [lustre]
#18 [ffff880332151e70] vfs_getattr at ffffffff8117cb21
#19 [ffff880332151eb0] vfs_fstatat at ffffffff8117cbb8
#20 [ffff880332151ef0] sys_newfstatat at ffffffff8117cbf4
#21 [ffff880332151f80] system_call_fastpath at ffffffff8100b0f2
    RIP: 00002aaaaada549a  RSP: 00007fffffffd2d8  RFLAGS: 00010246
    RAX: 0000000000000106  RBX: ffffffff8100b0f2  RCX: 3144324c2f303144
    RDX: 000000000060e600  RSI: 000000000060d0f0  RDI: ffffffffffffff9c
    RBP: 000000000060e600   R8: 0000000000000100   R9: 0000000000000000
    R10: 0000000000000100  R11: 0000000000000246  R12: 000000000060d060
    R13: 000000000060e590  R14: 0000000000000004  R15: 000000000060d060
    ORIG_RAX: 0000000000000106  CS: 0033  SS: 002b
Comment by Prakash Surya (Inactive) [ 30/Apr/12 ]

Sure. This was hit on many nodes, this one printed the stack trace to the console:

2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) lock@ffff880320159b38[4 2 0 0 0 00000005] P(0):[0, 18446744073709551615]@[0x1bb30df888:0x4:0x0] {
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one())     vvp@ffff8803515befb0: 
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one())     lov@ffff88037119c6e0: 2
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one())     0 0: ---
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one())     1 1: lock@ffff88031bb1eeb8[1 3 0 1 1 00000000] R(1):[0, 18446744073709551615]@[0x100950000:0x1a45bfa:0x0] {
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one())     lovsub@ffff8804276ae3a0: [1 ffff88037119c6e0 P(0):[0, 18446744073709551615]@[0x1bb30df888:0x4:0x0]] 
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one())     osc@ffff880425f1c2b0: ffff8801354436c0 00101001 0x92774b263242bbe4 3 ffff880812898b10 size: 0 mtime: 1335712511 atime: 1335712511 ctime: 1335712511 blocks: 0
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) } lock@ffff88031bb1eeb8
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) 
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) } lock@ffff880320159b38
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:381:lovsub_lock_delete_one()) Delete CLS_HELD lock
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:383:lovsub_lock_delete_one()) Impossible state: 2
2012-04-29 12:15:02 LustreError: 72532:0:(lovsub_lock.c:384:lovsub_lock_delete_one()) LBUG
2012-04-29 12:15:02 Pid: 72532, comm: rm
2012-04-29 12:15:02 
2012-04-29 12:15:02 Call Trace:
2012-04-29 12:15:02  [<ffffffffa03e2895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2012-04-29 12:15:02  [<ffffffffa03e2dc7>] lbug_with_loc+0x47/0xb0 [libcfs]
2012-04-29 12:15:02  [<ffffffffa08bfff2>] lovsub_lock_delete+0x1a2/0x370 [lov]
2012-04-29 12:15:02  [<ffffffffa055a665>] cl_lock_delete0+0xb5/0x150 [obdclass]
2012-04-29 12:15:02  [<ffffffffa055a816>] cl_lock_hold_release+0x116/0x170 [obdclass]
2012-04-29 12:15:02  [<ffffffffa055bf73>] cl_lock_unhold+0x33/0xc0 [obdclass]
2012-04-29 12:15:02  [<ffffffffa08b7e78>] lov_sublock_release+0x198/0x260 [lov]
2012-04-29 12:15:02  [<ffffffffa08b8ed8>] lov_lock_cancel+0xe8/0x230 [lov]
2012-04-29 12:15:02  [<ffffffffa0559715>] cl_lock_cancel0+0x75/0xf0 [obdclass]
2012-04-29 12:15:02  [<ffffffffa055a7f5>] cl_lock_hold_release+0xf5/0x170 [obdclass]
2012-04-29 12:15:02  [<ffffffffa055beb2>] cl_lock_release+0x82/0x110 [obdclass]
2012-04-29 12:15:02  [<ffffffffa09722d6>] cl_glimpse_lock+0x196/0x330 [lustre]
2012-04-29 12:15:02  [<ffffffffa0972914>] cl_glimpse_size+0x124/0x130 [lustre]
2012-04-29 12:15:02  [<ffffffffa093ab8a>] ll_inode_revalidate_it+0x5a/0x140 [lustre]
2012-04-29 12:15:02  [<ffffffff811843e5>] ? putname+0x45/0x50
2012-04-29 12:15:02  [<ffffffffa093acb9>] ll_getattr_it+0x49/0x170 [lustre]
2012-04-29 12:15:02  [<ffffffffa093ae17>] ll_getattr+0x37/0x40 [lustre]
2012-04-29 12:15:02  [<ffffffff8117cb21>] vfs_getattr+0x51/0x80
2012-04-29 12:15:02  [<ffffffff810d152f>] ? audit_putname+0x3f/0x50
2012-04-29 12:15:02  [<ffffffff8117cbb8>] vfs_fstatat+0x68/0x80
2012-04-29 12:15:02  [<ffffffff8117cbf4>] sys_newfstatat+0x24/0x50
2012-04-29 12:15:02  [<ffffffff8100c5b5>] ? math_state_restore+0x45/0x60
2012-04-29 12:15:02  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
2012-04-29 12:15:02 
2012-04-29 12:15:02 Apr 29 12:15:02 Kernel panic - not syncing: LBUG
2012-04-29 12:15:02 cab283 kernel: LPid: 72532, comm: rm Not tainted 2.6.32-220.13.1.2chaos.ch5.x86_64 #1
2012-04-29 12:15:02 ustreError: 7253Call Trace:
2012-04-29 12:15:02 2:0:(lovsub_lock [<ffffffff814ee6c8>] ? panic+0x78/0x143
2012-04-29 12:15:02 .c:384:lovsub_lo [<ffffffffa03e2e1b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
2012-04-29 12:15:02 ck_delete_one()) [<ffffffffa08bfff2>] ? lovsub_lock_delete+0x1a2/0x370 [lov]
2012-04-29 12:15:02  LBUG
2012-04-29 12:15:02 Apr 29 12: [<ffffffffa055a665>] ? cl_lock_delete0+0xb5/0x150 [obdclass]
2012-04-29 12:15:02 15:02 cab283 ker [<ffffffffa055a816>] ? cl_lock_hold_release+0x116/0x170 [obdclass]
2012-04-29 12:15:02 nel: Kernel pani [<ffffffffa055bf73>] ? cl_lock_unhold+0x33/0xc0 [obdclass]
2012-04-29 12:15:02 c - not syncing: [<ffffffffa08b7e78>] ? lov_sublock_release+0x198/0x260 [lov]
2012-04-29 12:15:02  LBUG
2012-04-29 12:15:02  [<ffffffffa08b8ed8>] ? lov_lock_cancel+0xe8/0x230 [lov]
2012-04-29 12:15:02  [<ffffffffa0559715>] ? cl_lock_cancel0+0x75/0xf0 [obdclass]
2012-04-29 12:15:02  [<ffffffffa055a7f5>] ? cl_lock_hold_release+0xf5/0x170 [obdclass]
2012-04-29 12:15:02  [<ffffffffa055beb2>] ? cl_lock_release+0x82/0x110 [obdclass]
2012-04-29 12:15:02  [<ffffffffa09722d6>] ? cl_glimpse_lock+0x196/0x330 [lustre]
2012-04-29 12:15:02  [<ffffffffa0972914>] ? cl_glimpse_size+0x124/0x130 [lustre]
2012-04-29 12:15:02  [<ffffffffa093ab8a>] ? ll_inode_revalidate_it+0x5a/0x140 [lustre]
2012-04-29 12:15:03  [<ffffffff811843e5>] ? putname+0x45/0x50
2012-04-29 12:15:03  [<ffffffffa093acb9>] ? ll_getattr_it+0x49/0x170 [lustre]
2012-04-29 12:15:03  [<ffffffffa093ae17>] ? ll_getattr+0x37/0x40 [lustre]
2012-04-29 12:15:03  [<ffffffff8117cb21>] ? vfs_getattr+0x51/0x80
2012-04-29 12:15:03  [<ffffffff810d152f>] ? audit_putname+0x3f/0x50
2012-04-29 12:15:03  [<ffffffff8117cbb8>] ? vfs_fstatat+0x68/0x80
2012-04-29 12:15:03  [<ffffffff8117cbf4>] ? sys_newfstatat+0x24/0x50
2012-04-29 12:15:03  [<ffffffff8100c5b5>] ? math_state_restore+0x45/0x60
2012-04-29 12:15:03  [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b

We also have a core dump which I can poke at if you need me to.

Comment by Ned Bass [ 30/Apr/12 ]

Here is the patch stack that was running at the time.

https://github.com/chaos/lustre/commits/2.1.1-10chaos

Comment by Ned Bass [ 01/May/12 ]

Hi Jinshan,

Do you think this may be a regression in our recent 2.1.1-10chaos tag? If not, do you have any ideas about the nature of the bug?

Comment by Jinshan Xiong (Inactive) [ 01/May/12 ]

No, this is a pre-existing bug. How often do you hit this one? I've found the root cause of this one and am working on 1356.

Comment by Ned Bass [ 01/May/12 ]

That's great news! We hit it very often. I just unintentionally reproduced it running ls on a login node. I wonder if we landed something that makes it more likely to hit since we never saw it before.

Comment by Jinshan Xiong (Inactive) [ 02/May/12 ]

I need more time on this bug.

Ned, if you can reproduce this issue quite often, is it possible to collect a debug log at the client side. I need DLMTRACE to be set.

Comment by Jinshan Xiong (Inactive) [ 02/May/12 ]

I confirm that this bug was imported by LU-1299. I will cook a patch tomorrow.

Comment by Ned Bass [ 02/May/12 ]

Do you still need the debug log? Now that I'm trying to reproduce it, of course it is being elusive.

Comment by Jinshan Xiong (Inactive) [ 02/May/12 ]

Actually my previous comment was inaccurate. This bug is concealed by LU-1299 because LU-1299 should be hit first if it was not fixed.

Please check patch: http://review.whamcloud.com/2632 for a fix.

This is still about signal handling problem.

Comment by Jinshan Xiong (Inactive) [ 02/May/12 ]

Ned, I don't need debug log since I've known every details of this bug. Thanks.

Comment by Ned Bass [ 02/May/12 ]

Thanks, we'll give the patch a try.

I'm not sure that your comment about LU-1299 concealing this bug is consistent with what we've seen. In particular, AFAIK we've only hit LU-1299 by running a truncated executable, which should be rare. On the other hand we run into this bug during normal filesystem usage.

Although, now that I write that, I'm realizing that we have a local bug open about OOMs occurring despite the application is using little real memory. So perhaps we have unknowingly been hitting LU-1299 after all.

Comment by Jinshan Xiong (Inactive) [ 02/May/12 ]

Yes, original code can trigger fake OOM because wrong error code was returned by ll_fault(), so I guess OOM you have seen should go away after LU-1299 is applied.

From what I have seen from log, it looks very like that a glimpse of file size was interrupted by a signal. Now that you mentioned it was hit in a normal file system usage like ls, it would exist another path to have the same back trace because ls won't issue signals afaik.

Anyway, please you apply this patch and try to reproduce it again, this way we can get more information and move steps forward.

Comment by Ned Bass [ 02/May/12 ]

Also note the originally reported occurrences involved rm.

Comment by Jinshan Xiong (Inactive) [ 04/May/12 ]

Fixed in LU-1299

Generated at Sat Feb 10 01:15:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.