[LU-4859] sanity-hsm test_12c failed: ASSERTION( md_size > md_packed ) failed Created: 03/Apr/14 Updated: 06/Apr/14 Resolved: 06/Apr/14 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Rank (Obsolete): | 13399 |
| Description |
|
Sanity-hsm test_12c failed autotest during review-dne-part-2 testing. Test results are at https://maloo.whamcloud.com/test_sessions/7f5a57b6-bb27-11e3-b43b-52540035b04c . From the test log, we see CMD: shadow-50vm5 lhsmtool_posix --daemon --hsm-root /home/autotest/.autotest/shared_dir/2014-04-02/205646-69871073139300/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2014-04-02/lustre-reviews-el6-x86_64--review-dne-part-2--2_9_1__22860__-69871073139300-205642/sanity-hsm.test_12c.copytool_log.shadow-50vm5.log 2>&1
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 0.872036 s, 6.0 MB/s
CMD: shadow-50vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x400000400:0x1c:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
shadow-50vm3: error: get_param: /proc/{fs,sys}/{lnet,lustre}/mdt/lustre-MDT0000/hsm/actions: Found no match
CMD: shadow-50vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x400000400:0x1c:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
shadow-50vm3: error: get_param: /proc/{fs,sys}/{lnet,lustre}/mdt/lustre-MDT0000/hsm/actions: Found no match
Waiting 100 secs for update
...
CMD: shadow-50vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x400000400:0x1c:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
shadow-50vm3: error: get_param: /proc/{fs,sys}/{lnet,lustre}/mdt/lustre-MDT0000/hsm/actions: Found no match
Update not seen after 100s: wanted 'SUCCEED' got ''
sanity-hsm test_12c: @@@@@@ FAIL: request on 0x400000400:0x1c:0x0 is not SUCCEED on mds1
On the agent, we see similar problems contacting the MDS: 02:45:27:Lustre: DEBUG MARKER: lhsmtool_posix --daemon --hsm-root /home/autotest/.autotest/shared_dir/2014-04-02/205646-69871073139300/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2014-04-02/lustre-reviews-el6-x86_64--review-dne-part-2--2_9_1__22860__-69871073139300- 03:35:34:Lustre: 6635:0:(client.c:1912:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1396514713/real 1396514713] req@ffff88007b0e6800 x1464332483480120/t0(0) o33->lustre-MDT0000-mdc-ffff880065d19800@10.1.6.66@tcp:12/10 lens 448/1032 e 0 to 1 dl 1396514720 ref 2 fl Rpc:X/0/ffffffff rc 0/-1 03:35:34:Lustre: 6635:0:(client.c:1912:ptlrpc_expire_one_request()) Skipped 1 previous similar message 03:35:34:Lustre: lustre-MDT0000-mdc-ffff880065d19800: Connection to lustre-MDT0000 (at 10.1.6.66@tcp) was lost; in progress operations using this service will wait for recovery to complete On the MDS console, 02:45:20:Lustre: DEBUG MARKER: == sanity-hsm test 12c: Restore a file with stripe of 2 == 01:45:10 (1396514710) 02:45:20:LustreError: 14737:0:(mdt_lib.c:670:mdt_fix_reply()) ASSERTION( md_size > md_packed ) failed: 02:45:20:LustreError: 14737:0:(mdt_lib.c:670:mdt_fix_reply()) LBUG 02:45:20:Pid: 14737, comm: mdt00_003 02:45:20: 02:45:20:Call Trace: 02:45:20: [<ffffffffa0483895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 02:45:20: [<ffffffffa0483e97>] lbug_with_loc+0x47/0xb0 [libcfs] 02:45:20: [<ffffffffa0ec4970>] mdt_fix_reply+0x5e0/0x6b0 [mdt] 02:45:20: [<ffffffffa0ec27a0>] mdt_getattr+0x1e0/0x8b0 [mdt] 02:45:20: [<ffffffffa08889ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc] 02:45:20: [<ffffffffa083798a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc] 02:45:20: [<ffffffffa0836c70>] ? ptlrpc_main+0x0/0x1980 [ptlrpc] 02:45:20: [<ffffffff8109aee6>] kthread+0x96/0xa0 02:45:20: [<ffffffff8100c20a>] child_rip+0xa/0x20 02:45:20: [<ffffffff8109ae50>] ? kthread+0x0/0xa0 02:45:20: [<ffffffff8100c200>] ? child_rip+0x0/0x20 02:45:20: 02:45:21:Kernel panic - not syncing: LBUG 02:45:21:Pid: 14737, comm: mdt00_003 Not tainted 2.6.32-431.5.1.el6_lustre.ge2b666c.x86_64 #1 02:45:21:Call Trace: 02:45:21: [<ffffffff81527983>] ? panic+0xa7/0x16f 02:45:21: [<ffffffffa0483eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs] 02:45:21: [<ffffffffa0ec4970>] ? mdt_fix_reply+0x5e0/0x6b0 [mdt] 02:45:21: [<ffffffffa0ec27a0>] ? mdt_getattr+0x1e0/0x8b0 [mdt] 02:45:21: [<ffffffffa08889ac>] ? tgt_request_handle+0x23c/0xac0 [ptlrpc] 02:45:21: [<ffffffffa083798a>] ? ptlrpc_main+0xd1a/0x1980 [ptlrpc] 02:45:21: [<ffffffffa0836c70>] ? ptlrpc_main+0x0/0x1980 [ptlrpc] 02:45:21: [<ffffffff8109aee6>] ? kthread+0x96/0xa0 02:45:21: [<ffffffff8100c20a>] ? child_rip+0xa/0x20 02:45:21: [<ffffffff8109ae50>] ? kthread+0x0/0xa0 02:45:21: [<ffffffff8100c200>] ? child_rip+0x0/0x20 After sanity-hsm test 12c fails, test 12d times out. |
| Comments |
| Comment by Oleg Drokin [ 06/Apr/14 ] |
|
This issue was only present in patch #1 for lu4008, but fixed in patch #2, so I merged the two to avoid this. |