[LU-6334] racer test_1: ldlm_lock_destroy_internal() LBUG Created: 05/Mar/15  Updated: 07/Nov/16  Resolved: 07/Nov/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Di Wang
Resolution: Duplicate Votes: 0
Labels: dne2
Environment:

OpenSFS cluster running Lustre 2.7.0-RC-4 build #29, with two MDSs with two MDTs each, three OSSs with two OSTs each, and three clients.


Issue Links:
Related
is related to LU-7173 ldlm_lock_destroy_internal() LBUG enc... Resolved
Severity: 3
Rank (Obsolete): 17740

 Description   

Racer test 1 hangs and the second MDS crashes. From the client test log:

== racer test 1: racer on clients: c11,c12,c13 DURATION=900 == 01:00:24 (1425546024)
racers pids: 24447 24448 24449 24450 24452 24455 24457 24463
c12: ./file_exec.sh: line 12: 32546 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c13: ./file_exec.sh: line 12:  3014 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 22190 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c13: ./file_exec.sh: line 12: 20868 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c13: ./file_exec.sh: line 12: 27709 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12:  4393 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c13: ./file_exec.sh: line 12: 30807 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c13: ./file_exec.sh: line 12: 27894 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c13: ./file_exec.sh: line 12:  6553 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c13: ./file_exec.sh: line 12:  8077 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c13: ./file_exec.sh: line 12: 15490 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 25110 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 22466 Bus error               $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 13403 Bus error               $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 31697 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12:  4156 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 13081 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 13991 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 31225 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12:  5779 Bus error               $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 23907 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 14579 Segmentation fault      $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 13126 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 12056 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 12121 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 21271 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 26034 Bus error               $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 24871 Bus error               $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 27385 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12:  2404 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12:  5912 Bus error               $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12:  6463 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12:  8118 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 10024 Bus error               $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 10552 Bus error               $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 17440 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 21595 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12:  3143 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12:  9341 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 18994 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12:  2735 Bus error               $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_mknod.sh: line 11:  5348 Terminated              $MCREATE $file 2> /dev/null
c11: ./file_mknod.sh: line 11:  5349 Terminated              $MCREATE $file 2> /dev/null
c11: ./file_mknod.sh: line 11:  5323 Terminated              $MCREATE $file 2> /dev/null
…

The second MDS crashed with the following on the console:

Message from syslogd@mds02 at Mar  5 01:02:21 ...
 kernel:LustreError: 24677:0:(ldlm_lock.c:364:ldlm_lock_destroy_internal()) LBUG

In the dmesg before the crash:

<4>Lustre: DEBUG MARKER: == racer test 1: racer on clients: c11,c12,c13 DURATION=900 == 01:00:24 (1425546024)
<3>LustreError: 25276:0:(mdt_reint.c:1575:mdt_reint_migrate_internal()) scratch-MDT0001: can not migrate striped dir [0x3c0002b17:0xa:0x0]: rc = -1
<3>LustreError: 10002:0:(mdd_dir.c:4034:mdd_migrate()) scratch-MDD0003: [0x4400013a9:0x4:0x0]0 is already opened count 1: rc = -16
<3>LustreError: 9999:0:(mdt_reint.c:1536:mdt_reint_migrate_internal()) scratch-MDT0001: parent [0x3c0002b10:0x41e:0x0] is still on the same MDT, which should be migrated first: rc = -1
<3>LustreError: 9993:0:(mdt_reint.c:1173:mdt_reint_link()) scratch-MDT0001: source inode [0x400002347:0x25:0x0] on remote MDT from [0x3c0002b12:0x7:0x0]
<3>LustreError: 9993:0:(mdd_dir.c:4034:mdd_migrate()) scratch-MDD0001: [0x3c0002b17:0x6:0x0]13 is already opened count 1: rc = -16
<3>LustreError: 25282:0:(mdt_reint.c:1536:mdt_reint_migrate_internal()) scratch-MDT0003: parent [0x4400013a0:0xa:0x0] is still on the same MDT, which should be migrated first: rc = -1
<3>LustreError: 27158:0:(mdt_reint.c:1173:mdt_reint_link()) scratch-MDT0003: source inode [0x400002348:0x3c:0x0] on remote MDT from [0x4400013aa:0x1:0x0]
<3>LustreError: 27158:0:(mdt_reint.c:1173:mdt_reint_link()) Skipped 1 previous similar message
…

and call trace:

<3>LustreError: 24677:0:(ldlm_lock.c:363:ldlm_lock_destroy_internal()) ### lock still on resource ns: mdt-scratch-MDT0003_UUID lock: ffff8806b331dd00/0x652e27c35471591f lrc: 3/0,0 mode: PR/PR res: [0x4400013a5:0x1e94:0x0].0 bits 0x20 rrc: 1 type: IBT flags: 0x50200000000000 nid: 192.168.2.111@o2ib remote: 0xe6ac309763edc3fe expref: 30 pid: 9995 timeout: 0 lvb_type: 0
<3>LustreError: 9997:0:(mdt_open.c:1564:mdt_cross_open()) scratch-MDT0001: [0x3c0002b14:0x3d8:0x0] doesn't exist!: rc = -14
<3>LustreError: 9997:0:(mdt_open.c:1564:mdt_cross_open()) Skipped 6 previous similar messages
<0>LustreError: 24677:0:(ldlm_lock.c:364:ldlm_lock_destroy_internal()) LBUG
<4>Pid: 24677, comm: mdt00_001
<4>
<4>Call Trace:
<4> [<ffffffffa0565895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0565e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa089b2c1>] ldlm_lock_destroy_internal+0x251/0x2c0 [ptlrpc]
<4> [<ffffffffa089cef5>] ldlm_lock_destroy+0x35/0x130 [ptlrpc] 
<4> [<ffffffffa089d525>] ldlm_lock_enqueue+0x155/0x9d0 [ptlrpc]
<4> [<ffffffffa08c964b>] ldlm_handle_enqueue0+0x51b/0x13f0 [ptlrpc]
<4> [<ffffffffa094a2f1>] tgt_enqueue+0x61/0x230 [ptlrpc]
<4> [<ffffffffa094af3e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
<4> [<ffffffffa08fa9b1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
<4> [<ffffffffa08f9b70>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
<4> [<ffffffff8109abf6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 24677, comm: mdt00_001 Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
<4> [<ffffffffa0565eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa089b2c1>] ? ldlm_lock_destroy_internal+0x251/0x2c0 [ptlrpc]
<4> [<ffffffffa089cef5>] ? ldlm_lock_destroy+0x35/0x130 [ptlrpc]
<4> [<ffffffffa089d525>] ? ldlm_lock_enqueue+0x155/0x9d0 [ptlrpc]
<4> [<ffffffffa08c964b>] ? ldlm_handle_enqueue0+0x51b/0x13f0 [ptlrpc]
<4> [<ffffffffa094a2f1>] ? tgt_enqueue+0x61/0x230 [ptlrpc]
<4> [<ffffffffa094af3e>] ? tgt_request_handle+0x8be/0x1000 [ptlrpc]
<4> [<ffffffffa08fa9b1>] ? ptlrpc_main+0xe41/0x1960 [ptlrpc]
<4> [<ffffffffa08f9b70>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
<4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
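
For reference, the assertion that fires here is the "lock still on resource" sanity check in ldlm_lock_destroy_internal(), reached from the enqueue path shown in the trace. The sketch below paraphrases the 2.7-era lustre/ldlm/ldlm_lock.c check; it is approximate and for context only, not an exact copy of the shipped source:

static int ldlm_lock_destroy_internal(struct ldlm_lock *lock)
{
        if (lock->l_readers || lock->l_writers) {
                LDLM_ERROR(lock, "lock still has references");
                LBUG();
        }

        /* The branch hit in this crash: the lock being torn down is still
         * linked on its resource's granted/waiting list. */
        if (!list_empty(&lock->l_res_link)) {
                LDLM_ERROR(lock, "lock still on resource");  /* ldlm_lock.c:363 */
                LBUG();                                      /* ldlm_lock.c:364 */
        }
        …
}

In other words, ldlm_lock_enqueue() ends up destroying a lock that is still attached to its resource, which the LDLM treats as a fatal inconsistency.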

I will upload vmcore and node logs when they are available.



 Comments   
Comment by James Nunez (Inactive) [ 05/Mar/15 ]

vmcore and vmcore-dmesg.txt are at uploads/LU-6334.

Comment by Jian Yu [ 20/May/15 ]

Hi Dustin,
The original failure reported in this ticket was detected under a DNE configuration. Did the similar failure found by ORNL also occur under a DNE configuration? If not, could you please create a new Jira ticket to track that failure? Thank you.

Comment by Andreas Dilger [ 31/Aug/15 ]

Di, is this something you still see on DNE racer, or could it be closed?

Comment by Di Wang [ 31/Aug/15 ]

No, I have not seen this for a long time; yes, it can be closed for now.

Comment by Peter Jones [ 31/Aug/15 ]

Thanks Di!

Comment by James A Simmons [ 15/Sep/15 ]

We just hit this bug on our 2.5.4 servers during our test shot with 2.8 clients.

Comment by Di Wang [ 16/Sep/15 ]

James: Do you have a vmcore dump or debug logs for the LBUG? If you can tell me how to reproduce the problem, that would be great. Thanks!

Comment by James A Simmons [ 16/Sep/15 ]

Yes, I do have vmcore dumps but no lctl debug logs. We believe it happened when a bunch of MPI-IO jobs were executed at various scales on Titan at slightly varying start times. Once we set the debug log levels, the problem went away. Later we went back to the original debug level, plus ldlm, to see if we could duplicate the problem. No luck duplicating it, so we don't know what exact workload caused it.

Oleg suggested I open a new ticket and link it to this one.

Comment by Di Wang [ 16/Sep/15 ]

Hmm, did this happen on MDT or OST?

Comment by James A Simmons [ 16/Sep/15 ]

The OOM happened on the MDT.

Comment by James A Simmons [ 16/Sep/15 ]

Moved to LU-7173

Comment by Andreas Dilger [ 07/Nov/16 ]

Reopen to change resolution.
