[LU-4859] sanity-hsm test_12c failed: ASSERTION( md_size > md_packed ) failed Created: 03/Apr/14  Updated: 06/Apr/14  Resolved: 06/Apr/14

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Not a Bug Votes: 0
Labels: None

Rank (Obsolete): 13399

 Description   

Sanity-hsm test_12c failed autotest during review-dne-part-2 testing. Test results are at https://maloo.whamcloud.com/test_sessions/7f5a57b6-bb27-11e3-b43b-52540035b04c .

From the test log, we see

CMD: shadow-50vm5 lhsmtool_posix  --daemon --hsm-root /home/autotest/.autotest/shared_dir/2014-04-02/205646-69871073139300/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2014-04-02/lustre-reviews-el6-x86_64--review-dne-part-2--2_9_1__22860__-69871073139300-205642/sanity-hsm.test_12c.copytool_log.shadow-50vm5.log 2>&1
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 0.872036 s, 6.0 MB/s
CMD: shadow-50vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x400000400:0x1c:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
shadow-50vm3: error: get_param: /proc/{fs,sys}/{lnet,lustre}/mdt/lustre-MDT0000/hsm/actions: Found no match
CMD: shadow-50vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x400000400:0x1c:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
shadow-50vm3: error: get_param: /proc/{fs,sys}/{lnet,lustre}/mdt/lustre-MDT0000/hsm/actions: Found no match
Waiting 100 secs for update
...
CMD: shadow-50vm3 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x400000400:0x1c:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
shadow-50vm3: error: get_param: /proc/{fs,sys}/{lnet,lustre}/mdt/lustre-MDT0000/hsm/actions: Found no match
Update not seen after 100s: wanted 'SUCCEED' got ''
 sanity-hsm test_12c: @@@@@@ FAIL: request on 0x400000400:0x1c:0x0 is not SUCCEED on mds1 

On the agent, we see similar problems contacting the MDS:

02:45:27:Lustre: DEBUG MARKER: lhsmtool_posix  --daemon --hsm-root /home/autotest/.autotest/shared_dir/2014-04-02/205646-69871073139300/arc1 --bandwidth 1 /mnt/lustre < /dev/null > /logdir/test_logs/2014-04-02/lustre-reviews-el6-x86_64--review-dne-part-2--2_9_1__22860__-69871073139300-
03:35:34:Lustre: 6635:0:(client.c:1912:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1396514713/real 1396514713]  req@ffff88007b0e6800 x1464332483480120/t0(0) o33->lustre-MDT0000-mdc-ffff880065d19800@10.1.6.66@tcp:12/10 lens 448/1032 e 0 to 1 dl 1396514720 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
03:35:34:Lustre: 6635:0:(client.c:1912:ptlrpc_expire_one_request()) Skipped 1 previous similar message
03:35:34:Lustre: lustre-MDT0000-mdc-ffff880065d19800: Connection to lustre-MDT0000 (at 10.1.6.66@tcp) was lost; in progress operations using this service will wait for recovery to complete

On the MDS console,

02:45:20:Lustre: DEBUG MARKER: == sanity-hsm test 12c: Restore a file with stripe of 2 == 01:45:10 (1396514710)
02:45:20:LustreError: 14737:0:(mdt_lib.c:670:mdt_fix_reply()) ASSERTION( md_size > md_packed ) failed: 
02:45:20:LustreError: 14737:0:(mdt_lib.c:670:mdt_fix_reply()) LBUG
02:45:20:Pid: 14737, comm: mdt00_003
02:45:20:
02:45:20:Call Trace:
02:45:20: [<ffffffffa0483895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
02:45:20: [<ffffffffa0483e97>] lbug_with_loc+0x47/0xb0 [libcfs]
02:45:20: [<ffffffffa0ec4970>] mdt_fix_reply+0x5e0/0x6b0 [mdt]
02:45:20: [<ffffffffa0ec27a0>] mdt_getattr+0x1e0/0x8b0 [mdt]
02:45:20: [<ffffffffa08889ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
02:45:20: [<ffffffffa083798a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
02:45:20: [<ffffffffa0836c70>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
02:45:20: [<ffffffff8109aee6>] kthread+0x96/0xa0
02:45:20: [<ffffffff8100c20a>] child_rip+0xa/0x20
02:45:20: [<ffffffff8109ae50>] ? kthread+0x0/0xa0
02:45:20: [<ffffffff8100c200>] ? child_rip+0x0/0x20
02:45:20:
02:45:21:Kernel panic - not syncing: LBUG
02:45:21:Pid: 14737, comm: mdt00_003 Not tainted 2.6.32-431.5.1.el6_lustre.ge2b666c.x86_64 #1
02:45:21:Call Trace:
02:45:21: [<ffffffff81527983>] ? panic+0xa7/0x16f
02:45:21: [<ffffffffa0483eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
02:45:21: [<ffffffffa0ec4970>] ? mdt_fix_reply+0x5e0/0x6b0 [mdt]
02:45:21: [<ffffffffa0ec27a0>] ? mdt_getattr+0x1e0/0x8b0 [mdt]
02:45:21: [<ffffffffa08889ac>] ? tgt_request_handle+0x23c/0xac0 [ptlrpc]
02:45:21: [<ffffffffa083798a>] ? ptlrpc_main+0xd1a/0x1980 [ptlrpc]
02:45:21: [<ffffffffa0836c70>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
02:45:21: [<ffffffff8109aee6>] ? kthread+0x96/0xa0
02:45:21: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
02:45:21: [<ffffffff8109ae50>] ? kthread+0x0/0xa0
02:45:21: [<ffffffff8100c200>] ? child_rip+0x0/0x20

After sanity-hsm test 12c fails, test 12d times out.



 Comments   
Comment by Oleg Drokin [ 06/Apr/14 ]

This issue was only present in patch #1 for lu4008, but fixed in patch #2, so I merged the two to avoid this.

Generated at Sat Feb 10 01:46:28 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.