Details
-
Bug
-
Resolution: Not a Bug
-
Minor
-
None
-
Lustre 2.6.0
-
None
-
13399
Description
Sanity-hsm test_12c failed autotest during review-dne-part-2 testing. Test results are at https://maloo.whamcloud.com/test_sessions/7f5a57b6-bb27-11e3-b43b-52540035b04c .
From the test log, we see
CMD: shadow-50vm5 lhsmtool_posix --daemon --hsm-root /home/autotest/.autotest/shared_dir/2014-04-02/205646-69871073139300/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2014-04-02/lustre-reviews-el6-x86_64--review-dne-part-2--2_9_1__22860__-69871073139300-205642/sanity-hsm.test_12c.copytool_log.shadow-50vm5.log 2>&1
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 0.872036 s, 6.0 MB/s
CMD: shadow-50vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x400000400:0x1c:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
shadow-50vm3: error: get_param: /proc/{fs,sys}/{lnet,lustre}/mdt/lustre-MDT0000/hsm/actions: Found no match
CMD: shadow-50vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x400000400:0x1c:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
shadow-50vm3: error: get_param: /proc/{fs,sys}/{lnet,lustre}/mdt/lustre-MDT0000/hsm/actions: Found no match
Waiting 100 secs for update
...
CMD: shadow-50vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x400000400:0x1c:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
shadow-50vm3: error: get_param: /proc/{fs,sys}/{lnet,lustre}/mdt/lustre-MDT0000/hsm/actions: Found no match
Update not seen after 100s: wanted 'SUCCEED' got ''
sanity-hsm test_12c: @@@@@@ FAIL: request on 0x400000400:0x1c:0x0 is not SUCCEED on mds1
On the agent, we see similar problems contacting the MDS:
02:45:27:Lustre: DEBUG MARKER: lhsmtool_posix --daemon --hsm-root /home/autotest/.autotest/shared_dir/2014-04-02/205646-69871073139300/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2014-04-02/lustre-reviews-el6-x86_64--review-dne-part-2--2_9_1__22860__-69871073139300- 03:35:34:Lustre: 6635:0:(client.c:1912:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1396514713/real 1396514713] req@ffff88007b0e6800 x1464332483480120/t0(0) o33->lustre-MDT0000-mdc-ffff880065d19800@10.1.6.66@tcp:12/10 lens 448/1032 e 0 to 1 dl 1396514720 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 03:35:34:Lustre: 6635:0:(client.c:1912:ptlrpc_expire_one_request()) Skipped 1 previous similar message 03:35:34:Lustre: lustre-MDT0000-mdc-ffff880065d19800: Connection to lustre-MDT0000 (at 10.1.6.66@tcp) was lost; in progress operations using this service will wait for recovery to complete
On the MDS console,
02:45:20:Lustre: DEBUG MARKER: == sanity-hsm test 12c: Restore a file with stripe of 2 == 01:45:10 (1396514710) 02:45:20:LustreError: 14737:0:(mdt_lib.c:670:mdt_fix_reply()) ASSERTION( md_size > md_packed ) failed: 02:45:20:LustreError: 14737:0:(mdt_lib.c:670:mdt_fix_reply()) LBUG 02:45:20:Pid: 14737, comm: mdt00_003 02:45:20: 02:45:20:Call Trace: 02:45:20: [<ffffffffa0483895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 02:45:20: [<ffffffffa0483e97>] lbug_with_loc+0x47/0xb0 [libcfs] 02:45:20: [<ffffffffa0ec4970>] mdt_fix_reply+0x5e0/0x6b0 [mdt] 02:45:20: [<ffffffffa0ec27a0>] mdt_getattr+0x1e0/0x8b0 [mdt] 02:45:20: [<ffffffffa08889ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc] 02:45:20: [<ffffffffa083798a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc] 02:45:20: [<ffffffffa0836c70>] ? ptlrpc_main+0x0/0x1980 [ptlrpc] 02:45:20: [<ffffffff8109aee6>] kthread+0x96/0xa0 02:45:20: [<ffffffff8100c20a>] child_rip+0xa/0x20 02:45:20: [<ffffffff8109ae50>] ? kthread+0x0/0xa0 02:45:20: [<ffffffff8100c200>] ? child_rip+0x0/0x20 02:45:20: 02:45:21:Kernel panic - not syncing: LBUG 02:45:21:Pid: 14737, comm: mdt00_003 Not tainted 2.6.32-431.5.1.el6_lustre.ge2b666c.x86_64 #1 02:45:21:Call Trace: 02:45:21: [<ffffffff81527983>] ? panic+0xa7/0x16f 02:45:21: [<ffffffffa0483eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs] 02:45:21: [<ffffffffa0ec4970>] ? mdt_fix_reply+0x5e0/0x6b0 [mdt] 02:45:21: [<ffffffffa0ec27a0>] ? mdt_getattr+0x1e0/0x8b0 [mdt] 02:45:21: [<ffffffffa08889ac>] ? tgt_request_handle+0x23c/0xac0 [ptlrpc] 02:45:21: [<ffffffffa083798a>] ? ptlrpc_main+0xd1a/0x1980 [ptlrpc] 02:45:21: [<ffffffffa0836c70>] ? ptlrpc_main+0x0/0x1980 [ptlrpc] 02:45:21: [<ffffffff8109aee6>] ? kthread+0x96/0xa0 02:45:21: [<ffffffff8100c20a>] ? child_rip+0xa/0x20 02:45:21: [<ffffffff8109ae50>] ? kthread+0x0/0xa0 02:45:21: [<ffffffff8100c200>] ? child_rip+0x0/0x20
After sanity-hsm test 12c fails, test 12d times out.