[LU-9991] sanity-hsm test 31c fails with 'copytools failed to stop' Created: 14/Sep/17  Updated: 15/Dec/17  Resolved: 15/Dec/17

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-9845 ost-pools test_22 hangs with ‘WARNING... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-hsm test_31c fails to archive the newly created file:

Update not seen after 200s: wanted 'SUCCEED' got 'STARTED'
 sanity-hsm test_31c: @@@@@@ FAIL: request on 0x200000402:0x38:0x0 is not SUCCEED on mds1 
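For context, the check that times out here is just a poll of the MDT's HSM action list; a minimal sketch of it in bash (the real helper is wait_request_state() in sanity-hsm.sh, and do_facet/error come from test-framework.sh; the exact lctl parameter path may differ between versions):

# Minimal sketch of the state poll behind "Update not seen after 200s".
fid="0x200000402:0x38:0x0"
for i in $(seq 200); do
    state=$(do_facet mds1 "$LCTL get_param -n mdt.*.hsm.actions" |
        awk -v fid="$fid" '$0 ~ fid && /action=ARCHIVE/ {
            match($0, /status=[A-Z]+/)
            print substr($0, RSTART + 7, RLENGTH - 7) }')
    [ "$state" = "SUCCEED" ] && break
    sleep 1
done
[ "$state" = "SUCCEED" ] ||
    error "request on $fid is not SUCCEED on mds1"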

and then we cannot shut down the copytool:

copytools still running on trevis-51vm6
CMD: trevis-51vm6 pgrep -x lhsmtool_posix
trevis-51vm6: 6036
copytools still running on trevis-51vm6
CMD: trevis-51vm6 echo 1 >/proc/sys/kernel/sysrq ;  echo t >/proc/sysrq-trigger
copytools failed to stop in 200s
 sanity-hsm test_31c: @@@@@@ FAIL: copytools failed to stop 
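The shutdown check that produces these messages is a kill-then-poll loop along the following lines (a rough sketch of the test framework's copytool cleanup, not its exact code):

# Send SIGTERM to the copytool (matching 'exiting: Terminated' in its
# log below), then poll for up to 200s for the process to go away.
pkill -TERM -x lhsmtool_posix
for i in $(seq 200); do
    pgrep -x lhsmtool_posix >/dev/null || break
    echo "copytools still running on $(hostname)"
    sleep 1
done
if pgrep -x lhsmtool_posix >/dev/null; then
    # On timeout the framework dumps all kernel stacks via sysrq,
    # which is where the traces in the first comment come from.
    echo 1 >/proc/sys/kernel/sysrq
    echo t >/proc/sysrq-trigger
    echo "copytools failed to stop in 200s"
fi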

From the copytool logs, the archive was progressing and doesn't look like it exceeded 200 seconds (the copy itself completed in about 33 seconds), unless the copytool was hung afterwards:

1505294662.538461 lhsmtool_posix[6038]: '[0x200000402:0x38:0x0]' action ARCHIVE reclen 72, cookie=0x59b8f942
1505294662.539849 lhsmtool_posix[6038]: processing file 'd31c.sanity-hsm/f31c.sanity-hsm'
1505294662.555950 lhsmtool_posix[6038]: archiving '/mnt/lustre2/.lustre/fid/0x200000402:0x38:0x0' to '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp'
1505294662.556901 lhsmtool_posix[6038]: saving stripe info of '/mnt/lustre2/.lustre/fid/0x200000402:0x38:0x0' in /tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp.lov
1505294662.558302 lhsmtool_posix[6038]: start copy of 34603008 bytes from '/mnt/lustre2/.lustre/fid/0x200000402:0x38:0x0' to '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp'
1505294692.836074 lhsmtool_posix[6038]: %90 
1505294692.838435 lhsmtool_posix[6038]: bandwith control: 1048576B/s excess=1048576 sleep for 1.000000000s
1505294695.842294 lhsmtool_posix[6038]: copied 34603008 bytes in 33.285317 seconds
1505294695.896192 lhsmtool_posix[6038]: data archiving for '/mnt/lustre2/.lustre/fid/0x200000402:0x38:0x0' to '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' done
1505294695.896379 lhsmtool_posix[6038]: attr file for '/mnt/lustre2/.lustre/fid/0x200000402:0x38:0x0' saved to archive '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp'
1505294695.897327 lhsmtool_posix[6038]: fsetxattr of 'trusted.hsm' on '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' rc=0 (Success)
1505294695.897360 lhsmtool_posix[6038]: fsetxattr of 'trusted.version' on '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' rc=0 (Success)
1505294695.897402 lhsmtool_posix[6038]: fsetxattr of 'trusted.link' on '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' rc=0 (Success)
1505294695.897432 lhsmtool_posix[6038]: fsetxattr of 'trusted.lov' on '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' rc=0 (Success)
1505294695.897476 lhsmtool_posix[6038]: fsetxattr of 'trusted.lma' on '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' rc=0 (Success)
1505294695.897515 lhsmtool_posix[6038]: fsetxattr of 'lustre.lov' on '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp' rc=-1 (Operation not supported)
1505294695.897531 lhsmtool_posix[6038]: xattr file for '/mnt/lustre2/.lustre/fid/0x200000402:0x38:0x0' saved to archive '/tmp/arc1/shsm/0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0_tmp'
1505294695.898946 lhsmtool_posix[6038]: symlink '/tmp/arc1/shsm/shadow/d31c.sanity-hsm/f31c.sanity-hsm' to '../../0038/0000/0402/0000/0002/0000/0x200000402:0x38:0x0' done
exiting: Terminated
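For reference, the sequence the copytool log shows can be reproduced by hand along these lines (paths, archive ID, and file size are illustrative, chosen to match the log above):

# Start a POSIX copytool with a 1MB/s bandwidth cap on the second
# mount, then archive a 33MB file; at 1MB/s the copy should take
# roughly 33s, matching 'copied ... in 33.285317 seconds' above.
lhsmtool_posix --daemon --hsm-root /tmp/arc1 --archive 1 \
    --bandwidth 1 /mnt/lustre2
dd if=/dev/zero of=/mnt/lustre/f31c bs=1M count=33
lfs hsm_archive --archive 1 /mnt/lustre/f31c
while ! lfs hsm_state /mnt/lustre/f31c | grep -q archived; do
    sleep 5
done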

So far, we've only seen this 'copytools failed to stop' error in one patch test session. Logs for this failure are at
https://testing.hpdd.intel.com/test_sets/3e2f0620-9872-11e7-b775-5254006e85c2

We've seen the failure to archive a file several times for this test, and it has at least once been attributed to LU-7988.



 Comments   
Comment by Bruno Faccini (Inactive) [ 14/Sep/17 ]

James,
In the autotest failure logs you reported, we can see that the copytool's threads are stuck trying to close the Lustre file:

09:31:59:[12891.943987] lhsmtool_posix  S 000000000000f908     0  6036      1 0x00000080
09:31:59:[12891.945640]  ffff88006c5b3b30 0000000000000086 ffff88005e4d3f40 ffff88006c5b3fd8
09:31:59:[12891.947482]  ffff88006c5b3fd8 ffff88006c5b3fd8 ffff88005e4d3f40 ffffffff81e87180
09:31:59:[12891.949277]  ffff88006c5b3b60 0000000100c07565 ffffffff81e87180 000000000000f908
09:31:59:[12891.950973] Call Trace:
09:31:59:[12891.952275]  [<ffffffff816a94c9>] schedule+0x29/0x70
09:31:59:[12891.953752]  [<ffffffff816a6f14>] schedule_timeout+0x174/0x2c0
09:31:59:[12891.955315]  [<ffffffff81098b20>] ? internal_add_timer+0x70/0x70
09:31:59:[12891.956858]  [<ffffffffc09adec0>] ptlrpc_set_wait+0x4c0/0x910 [ptlrpc]
09:31:59:[12891.958477]  [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
09:31:59:[12891.959988]  [<ffffffffc09ae38d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
09:31:59:[12891.961644]  [<ffffffffc0ac75cc>] mdc_close+0x1bc/0x8a0 [mdc]
09:31:59:[12891.963168]  [<ffffffffc0b6f426>] lmv_close+0x216/0x540 [lmv]
09:31:59:[12891.964760]  [<ffffffffc0bb4b48>] ll_close_inode_openhandle+0x2f8/0xe20 [lustre]
09:31:59:[12891.966402]  [<ffffffffc0bb8270>] ll_md_real_close+0xf0/0x1e0 [lustre]
09:31:59:[12891.968043]  [<ffffffffc0bb899d>] ll_file_release+0x63d/0xa70 [lustre]
09:31:59:[12891.969592]  [<ffffffffc0ba424f>] ll_dir_release+0x2f/0xd0 [lustre]
09:31:59:[12891.971179]  [<ffffffff81202fb9>] __fput+0xe9/0x260
09:31:59:[12891.972581]  [<ffffffff8120321e>] ____fput+0xe/0x10
09:31:59:[12891.974049]  [<ffffffff810ad247>] task_work_run+0xa7/0xf0
09:31:59:[12891.975494]  [<ffffffff8102ab62>] do_notify_resume+0x92/0xb0
09:31:59:[12891.977059]  [<ffffffff816b527d>] int_signal+0x12/0x17
09:31:59:[12891.978503] lhsmtool_posix  S 000000000000f908     0  6038      1 0x00000080
09:31:59:[12891.980173]  ffff88005d3b3b50 0000000000000086 ffff880063d1bf40 ffff88005d3b3fd8
09:31:59:[12891.981897]  ffff88005d3b3fd8 ffff88005d3b3fd8 ffff880063d1bf40 ffff88007c100000
09:31:59:[12891.983599]  ffff88005d3b3b80 0000000100c04b03 ffff88007c100000 000000000000f908
09:31:59:[12891.985316] Call Trace:
09:31:59:[12891.986518]  [<ffffffff816a94c9>] schedule+0x29/0x70
09:31:59:[12891.987982]  [<ffffffff816a6f14>] schedule_timeout+0x174/0x2c0
09:31:59:[12891.989450]  [<ffffffff81098b20>] ? internal_add_timer+0x70/0x70
09:31:59:[12891.991030]  [<ffffffffc09adec0>] ptlrpc_set_wait+0x4c0/0x910 [ptlrpc]
09:31:59:[12891.992610]  [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
09:31:59:[12891.994084]  [<ffffffffc09ae38d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
09:31:59:[12891.995683]  [<ffffffffc0ac75cc>] mdc_close+0x1bc/0x8a0 [mdc]
09:31:59:[12891.997162]  [<ffffffffc0b6f426>] lmv_close+0x216/0x540 [lmv]
09:31:59:[12891.998691]  [<ffffffffc0bb4b48>] ll_close_inode_openhandle+0x2f8/0xe20 [lustre]
09:31:59:[12892.000295]  [<ffffffffc0bb8270>] ll_md_real_close+0xf0/0x1e0 [lustre]
09:31:59:[12892.001882]  [<ffffffffc0bb899d>] ll_file_release+0x63d/0xa70 [lustre]
09:31:59:[12892.003395]  [<ffffffff81202fb9>] __fput+0xe9/0x260
09:31:59:[12892.004846]  [<ffffffff8120321e>] ____fput+0xe/0x10
09:31:59:[12892.006206]  [<ffffffff810ad247>] task_work_run+0xa7/0xf0
09:31:59:[12892.007663]  [<ffffffff8102ab62>] do_notify_resume+0x92/0xb0
09:31:59:[12892.009093]  [<ffffffff816b527d>] int_signal+0x12/0x17

This explains why the archive operation is still in STARTED state rather than SUCCEED (the copytool could not send the end-of-archive notification), and also why the copytool processes can't be killed/stopped during sub-test exit.
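Note that the same blocked stacks can also be read on a live node without going through sysrq, assuming /proc/<pid>/stack is available on the test kernel:

# Dump the kernel stack of each copytool task (pids 6036/6038 above);
# a task stuck in mdc_close -> ptlrpc_set_wait will show here too.
for pid in $(pgrep -x lhsmtool_posix); do
    echo "== pid $pid =="
    cat /proc/$pid/stack
done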

And on the MDS/MDT side, we can find the following error message after the archive operation has started:

09:25:01:[12442.007026] WARNING: Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and has been suspended.

So my guess is that the problem is on the ZFS/media side.
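If so, checking the pool on the MDS should confirm it (pool name taken from the warning above):

# A suspended pool shows up immediately in zpool status; 'zpool
# events' usually records the underlying I/O errors as well.
zpool status -x lustre-mdt1
zpool events -v | tail -50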

Comment by John Hammond [ 15/Dec/17 ]

Closing as a duplicate of LU-9845.
