Details
- Type: Bug
- Resolution: Duplicate
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.7.0
- Environment: OpenSFS cluster running Lustre 2.7.0-RC-4 build #29 with two MDSs with two MDTs each, three OSSs with two OSTs each, and three clients
- Severity: 3
- 17740
Description
Racer test 1 hangs and the second MDS crashes. From the client test log:
== racer test 1: racer on clients: c11,c12,c13 DURATION=900 == 01:00:24 (1425546024)
racers pids: 24447 24448 24449 24450 24452 24455 24457 24463
c12: ./file_exec.sh: line 12: 32546 Segmentation fault (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c13: ./file_exec.sh: line 12: 3014 Segmentation fault (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 22190 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c13: ./file_exec.sh: line 12: 20868 Segmentation fault (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c13: ./file_exec.sh: line 12: 27709 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 4393 Segmentation fault (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c13: ./file_exec.sh: line 12: 30807 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c13: ./file_exec.sh: line 12: 27894 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c13: ./file_exec.sh: line 12: 6553 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c13: ./file_exec.sh: line 12: 8077 Segmentation fault (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c13: ./file_exec.sh: line 12: 15490 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 25110 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 22466 Bus error $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 13403 Bus error $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 31697 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 4156 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 13081 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 13991 Segmentation fault (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 31225 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 5779 Bus error $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 23907 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c12: ./file_exec.sh: line 12: 14579 Segmentation fault $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 13126 Segmentation fault (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 12056 Segmentation fault (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 12121 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 21271 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 26034 Bus error $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 24871 Bus error $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 27385 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 2404 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 5912 Bus error $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 6463 Segmentation fault (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 8118 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 10024 Bus error $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 10552 Bus error $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 17440 Segmentation fault (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 21595 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 3143 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 9341 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 18994 Bus error (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_exec.sh: line 12: 2735 Bus error $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
c11: ./file_mknod.sh: line 11: 5348 Terminated $MCREATE $file 2> /dev/null
c11: ./file_mknod.sh: line 11: 5349 Terminated $MCREATE $file 2> /dev/null
c11: ./file_mknod.sh: line 11: 5323 Terminated $MCREATE $file 2> /dev/null
…
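The Segmentation fault / Bus error lines above are presumably just racer noise from file_exec.sh executing files that other racer threads are rewriting concurrently. For reference, a minimal reproduction sketch using the racer.sh wrapper in lustre/tests; the MDT/OST counts and client list are inferred from the environment description above, and the exact test-framework variables and config depend on the local setup, so treat them as assumptions:

# Hypothetical reproduction sketch; values are assumptions, not copied from the actual test config.
cd /usr/lib64/lustre/tests
MDSCOUNT=4 OSTCOUNT=6 CLIENTS=c11,c12,c13 DURATION=900 ONLY=1 bash racer.sh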
The second MDS crashed with the following on the console:
Message from syslogd@mds02 at Mar 5 01:02:21 ... kernel:LustreError: 24677:0:(ldlm_lock.c:364:ldlm_lock_destroy_internal()) LBUG
In the dmesg before the crash:
<4>Lustre: DEBUG MARKER: == racer test 1: racer on clients: c11,c12,c13 DURATION=900 == 01:00:24 (1425546024)
<3>LustreError: 25276:0:(mdt_reint.c:1575:mdt_reint_migrate_internal()) scratch-MDT0001: can not migrate striped dir [0x3c0002b17:0xa:0x0]: rc = -1
<3>LustreError: 10002:0:(mdd_dir.c:4034:mdd_migrate()) scratch-MDD0003: [0x4400013a9:0x4:0x0]0 is already opened count 1: rc = -16
<3>LustreError: 9999:0:(mdt_reint.c:1536:mdt_reint_migrate_internal()) scratch-MDT0001: parent [0x3c0002b10:0x41e:0x0] is still on the same MDT, which should be migrated first: rc = -1
<3>LustreError: 9993:0:(mdt_reint.c:1173:mdt_reint_link()) scratch-MDT0001: source inode [0x400002347:0x25:0x0] on remote MDT from [0x3c0002b12:0x7:0x0]
<3>LustreError: 9993:0:(mdd_dir.c:4034:mdd_migrate()) scratch-MDD0001: [0x3c0002b17:0x6:0x0]13 is already opened count 1: rc = -16
<3>LustreError: 25282:0:(mdt_reint.c:1536:mdt_reint_migrate_internal()) scratch-MDT0003: parent [0x4400013a0:0xa:0x0] is still on the same MDT, which should be migrated first: rc = -1
<3>LustreError: 27158:0:(mdt_reint.c:1173:mdt_reint_link()) scratch-MDT0003: source inode [0x400002348:0x3c:0x0] on remote MDT from [0x4400013aa:0x1:0x0]
<3>LustreError: 27158:0:(mdt_reint.c:1173:mdt_reint_link()) Skipped 1 previous similar message
…
and call trace:
<3>LustreError: 24677:0:(ldlm_lock.c:363:ldlm_lock_destroy_internal()) ### lock still on resource ns: mdt-scratch-MDT0003_UUID lock: ffff8806b331dd00/0x652e27c35471591f lrc: 3/0,0 mode: PR/PR res: [0x4400013a5:0x1e94:0x0].0 bits 0x20 rrc: 1 type: IBT flags: 0x50200000000000 nid: 192.168.2.111@o2ib remote: 0xe6ac309763edc3fe expref: 30 pid: 9995 timeout: 0 lvb_type: 0
<3>LustreError: 9997:0:(mdt_open.c:1564:mdt_cross_open()) scratch-MDT0001: [0x3c0002b14:0x3d8:0x0] doesn't exist!: rc = -14
<3>LustreError: 9997:0:(mdt_open.c:1564:mdt_cross_open()) Skipped 6 previous similar messages
<0>LustreError: 24677:0:(ldlm_lock.c:364:ldlm_lock_destroy_internal()) LBUG
<4>Pid: 24677, comm: mdt00_001
<4>
<4>Call Trace:
<4> [<ffffffffa0565895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0565e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa089b2c1>] ldlm_lock_destroy_internal+0x251/0x2c0 [ptlrpc]
<4> [<ffffffffa089cef5>] ldlm_lock_destroy+0x35/0x130 [ptlrpc]
<4> [<ffffffffa089d525>] ldlm_lock_enqueue+0x155/0x9d0 [ptlrpc]
<4> [<ffffffffa08c964b>] ldlm_handle_enqueue0+0x51b/0x13f0 [ptlrpc]
<4> [<ffffffffa094a2f1>] tgt_enqueue+0x61/0x230 [ptlrpc]
<4> [<ffffffffa094af3e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
<4> [<ffffffffa08fa9b1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
<4> [<ffffffffa08f9b70>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
<4> [<ffffffff8109abf6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 24677, comm: mdt00_001 Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
<4> [<ffffffffa0565eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa089b2c1>] ? ldlm_lock_destroy_internal+0x251/0x2c0 [ptlrpc]
<4> [<ffffffffa089cef5>] ? ldlm_lock_destroy+0x35/0x130 [ptlrpc]
<4> [<ffffffffa089d525>] ? ldlm_lock_enqueue+0x155/0x9d0 [ptlrpc]
<4> [<ffffffffa08c964b>] ? ldlm_handle_enqueue0+0x51b/0x13f0 [ptlrpc]
<4> [<ffffffffa094a2f1>] ? tgt_enqueue+0x61/0x230 [ptlrpc]
<4> [<ffffffffa094af3e>] ? tgt_request_handle+0x8be/0x1000 [ptlrpc]
<4> [<ffffffffa08fa9b1>] ? ptlrpc_main+0xe41/0x1960 [ptlrpc]
<4> [<ffffffffa08f9b70>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
<4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
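The mdt_reint_migrate_internal()/mdd_migrate() errors in the dmesg excerpt above appear to come from directory-migration operations issued during the racer run racing with opens, links, and renames on striped directories. For context, a rough sketch of the kind of client-side commands behind those code paths; the mount point, directory name, and MDT index are hypothetical, and the exact migration syntax differs between releases:

# Illustrative only -- paths and indices are made up for this sketch.
lfs mkdir -c 2 /mnt/scratch/racer/dir0      # create a directory striped across two MDTs
lfs migrate -m 1 /mnt/scratch/racer/dir0    # request migration of the directory to MDT0001
                                            # (older clients expressed this as "lfs mv -M 1")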
I will upload vmcore and node logs when they are available.
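Once the vmcore is attached, the panic context can be confirmed with the crash utility; the paths below are placeholders for the matching debuginfo vmlinux and the uploaded dump, not actual attachment locations:

# Placeholder paths -- substitute the real debuginfo vmlinux and vmcore.
crash /usr/lib/debug/lib/modules/2.6.32-431.29.2.el6_lustre.x86_64/vmlinux vmcore
crash> bt 24677          # backtrace of the mdt00_001 thread that hit the LBUG
crash> log | tail -50    # console messages leading up to the panic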
Attachments
Issue Links
- is related to LU-7173 ldlm_lock_destroy_internal() LBUG encountered during 2.8 large scale testing (Resolved)