[LU-12632] sanity-hsm test_90: FAIL: requests did not complete Created: 06/Aug/19  Updated: 19/Nov/20  Resolved: 19/Nov/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.5
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-12870 sanity-hsm test 9A fails with “uuid D... Resolved
Severity: 3

 Description   

This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/0c4bfdc4-b860-11e9-a1bd-52540065bddc

test_90 failed with the following error:

CMD: trevis-34vm4 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
CMD: trevis-34vm4 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
Update not seen after 100s: wanted '' got 'lrh=[type=10680000 len=136 idx=1/748] fid=[0x200001b71:0x229:0x0] dfid=[0x200001b71:0x229:0x0] compound/cookie=0x0/0x5d494bd8 action=RESTORE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=1/749] fid=[0x200001b71:0x22a:0x0] dfid=[0x200001b71:0x22a:0x0] compound/cookie=0x0/0x5d494bd9 action=RESTORE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]
lrh=[type=10680000 len=136 idx=1/750] fid=[0x200001b71:0x22b:0x0] dfid=[0x200001b71:0x22b:0x0] compound/cookie=0x0/0x5d494bda action=RESTORE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=STARTED data=[]'
 sanity-hsm test_90: @@@@@@ FAIL: requests did not complete 
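
The check that failed here amounts to filtering the `hsm.actions` records for any whose status is still WAITING or STARTED. A minimal Python sketch of that filter follows. It assumes the record layout seen in the output above (space-separated `key=value` fields, with bracketed groups such as `lrh=[...]`). The names `parse_action` and `pending_actions` are hypothetical and are not part of the Lustre test framework.

```python
import re

# Matches key=value pairs; a value is either a bracketed group or a run of
# non-space characters. Keys may contain '/' and '#' (e.g. "compound/cookie",
# "archive#"), as seen in the hsm.actions output quoted in this ticket.
FIELD_RE = re.compile(r"([\w/#]+)=(\[[^\]]*\]|\S+)")


def parse_action(line):
    """Parse one hsm.actions record into a dict of field name -> raw value."""
    return dict(FIELD_RE.findall(line))


def pending_actions(text):
    """Return records whose status field is WAITING or STARTED.

    This approximates the `egrep 'WAITING|STARTED'` used by the test, but
    matches on the parsed status field rather than on the whole line.
    """
    records = [parse_action(l) for l in text.splitlines() if l.strip()]
    return [r for r in records if r.get("status") in ("WAITING", "STARTED")]


# Sample record copied from the failure output in this ticket.
sample = (
    "lrh=[type=10680000 len=136 idx=1/748] fid=[0x200001b71:0x229:0x0] "
    "dfid=[0x200001b71:0x229:0x0] compound/cookie=0x0/0x5d494bd8 action=RESTORE "
    "archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 "
    "status=STARTED data=[]"
)
for rec in pending_actions(sample):
    print(rec["fid"], rec["action"], rec["status"])
```

An empty result from `pending_actions` corresponds to the `wanted ''` condition the test was waiting for.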

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-hsm test_90 - requests did not complete



 Comments   
Comment by Jian Yu [ 23/Aug/19 ]

+1 on master branch: https://testing.whamcloud.com/test_sets/39afca24-c5ef-11e9-98c8-52540065bddc

Comment by Emoly Liu [ 27/Aug/19 ]

+1 on master branch: https://testing.whamcloud.com/test_sets/b8aec0e0-c843-11e9-a25b-52540065bddc

Comment by Peter Jones [ 28/Aug/19 ]

Hongchao

Can you please investigate?

Thanks

Peter

Comment by Hongchao Zhang [ 04/Sep/19 ]

According to the logs, the HSM_RESTORE operation was slow and did not complete within 100 seconds,
but there are no obvious issues in the logs.
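
The timeout described above behaves like a poll-until-empty loop: the test repeatedly reads the pending HSM actions and fails if any remain after the deadline. A rough Python sketch of that pattern follows. It is not the actual test-framework code; `wait_until_empty`, its parameters, and the injectable `clock` and `sleep` callables are hypothetical names chosen for illustration.

```python
import time


def wait_until_empty(get_pending, timeout=100.0, interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll get_pending() until it returns '' or `timeout` seconds pass.

    Returns (True, '') on success, or (False, last_value) on timeout.
    """
    deadline = clock() + timeout
    while True:
        value = get_pending()
        if value == "":
            return True, value
        if clock() >= deadline:
            return False, value
        sleep(interval)


# Usage with a fake source that reports a pending RESTORE for the first
# three polls and then reports nothing pending.
polls = iter(["status=STARTED"] * 3 + [""])
ok, last = wait_until_empty(lambda: next(polls), timeout=100.0,
                            sleep=lambda seconds: None)
print(ok, repr(last))
```

In the real test, `get_pending` would correspond to running the `lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'` pipeline shown in the description.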

Comment by Gerrit Updater [ 08/Sep/19 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36101
Subject: LU-12632 test: debug patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: faeb584eeb8a2d8c51dbe834f5d8c79ee3781783

Comment by Hongchao Zhang [ 08/Sep/19 ]

There are two kinds of "requests did not complete" failures.
1. On LDiskFS
The related HSM archive operations are not started, which could be caused by the absence of "libtool":

CMD: onyx-34vm7 libtool --mode=e pkill -x lhsmtool_posix
onyx-34vm7: sh: libtool: command not found
CMD: onyx-34vm7 rm -rf /tmp/arc1/sanity-hsm.test_90/

As a result, the previous copy tool cannot be killed, which affects the following copy tool.

2. On ZFS
Some failures have the same "libtool" issue as on LDiskFS.
Others were caused by slow HSM restore operations, which started to appear on Jan 10th, 2019:
https://testing.whamcloud.com/sub_tests/f4a00a5a-14fc-11e9-b7d4-52540065bddc (zfs 0.7.9, only 1 time)
https://testing.whamcloud.com/sub_tests/b4fe93f2-528d-11e9-a256-52540065bddc (zfs 0.7.12, only 1 time)
https://testing.whamcloud.com/sub_tests/e6a8aa78-70fb-11e9-a6f9-52540065bddc (zfs 0.7.13)
https://testing.whamcloud.com/sub_tests/bfa06a40-c91d-11e9-90ad-52540065bddc (zfs 0.8.1)
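
The two failure modes described above can be told apart from the test log. The sketch below is one possible way to triage a log in Python. The `classify_failure` helper and its category labels are hypothetical, and the match strings are simply taken from the excerpts quoted in this ticket, so other logs may need different patterns.

```python
def classify_failure(log_text):
    """Roughly classify a 'requests did not complete' log.

    The patterns are assumptions based only on the excerpts in this ticket.
    """
    if "libtool: command not found" in log_text:
        # The copy tool could not be killed via libtool, so a stale
        # copy tool may interfere with the next one.
        return "copytool-cleanup-failed"
    if "action=RESTORE" in log_text and "status=STARTED" in log_text:
        # RESTORE requests were started but did not finish in time.
        return "slow-restore"
    return "unknown"


# Excerpt quoted in this ticket for the missing-libtool case.
libtool_log = (
    "CMD: onyx-34vm7 libtool --mode=e pkill -x lhsmtool_posix\n"
    "onyx-34vm7: sh: libtool: command not found\n"
)
# Abbreviated form of the timeout message from the description; the "..."
# stands for fields omitted here for brevity.
restore_log = (
    "Update not seen after 100s: wanted '' got "
    "'... action=RESTORE ... status=STARTED data=[]'"
)
print(classify_failure(libtool_log))
print(classify_failure(restore_log))
```

Checking for the missing-libtool message first reflects the observation that this issue appears on both backends, while the slow restore was reported only on ZFS.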

Comment by James A Simmons [ 08/Sep/19 ]

How? It's in the spec and deb files under BuildRequires and Requires for the tests. How are the images being constructed in the Maloo environment?

Comment by Gerrit Updater [ 11/Sep/19 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36152
Subject: LU-12632 tests: stop running sanity-hsm 90 for ZFS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5df30a16128b23977ef92de3ca06ec59a5f531de

Comment by Li Xi [ 16/Sep/19 ]

Example on master branch: https://testing.whamcloud.com/test_sets/d52265ac-d409-11e9-9fc9-52540065bddc

Comment by Emoly Liu [ 26/Nov/19 ]

+1 on master: https://testing.whamcloud.com/test_sets/a70981de-0fb2-11ea-bbc3-52540065bddc

Comment by Jian Yu [ 21/Jan/20 ]

+1 on master branch: https://testing.whamcloud.com/test_sets/016d71b8-3c30-11ea-b1e8-52540065bddc

Comment by Jian Yu [ 26/Jan/20 ]

Still failed on master branch: https://testing.whamcloud.com/test_sets/13b3b0f2-406e-11ea-ac52-52540065bddc

Comment by Emoly Liu [ 21/Apr/20 ]

More on master:

https://testing.whamcloud.com/test_sets/7d2382e8-70e4-4b1f-b455-0e9c9b1e1d1b

https://testing.whamcloud.com/test_sets/404f16d2-3b8a-4bd2-8339-5142f548b657

Comment by Nikitas Angelinas [ 05/Aug/20 ]

+1 on master https://testing.whamcloud.com/test_sets/5ce24eb6-6fb9-404c-8bb0-0c87e9f2cede

Comment by Nikitas Angelinas [ 05/Aug/20 ]

+1 on master https://testing.whamcloud.com/test_sets/070d81f7-e5a2-4fa3-8b8c-67a7820e29be

Comment by Chris Horn [ 12/Aug/20 ]

+1 on master https://testing.whamcloud.com/test_sets/8f0417bf-f011-4a15-a624-3c06ff83bc1d

Comment by James A Simmons [ 18/Oct/20 ]

Still a problem?

Comment by Gerrit Updater [ 23/Oct/20 ]

John L. Hammond (jhammond@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40387
Subject: LU-12632 mdt: wakeup HSM coordinator more oftenly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 45f978d85b91e0abe4d9983ad721b1c65f7eb820

Comment by Gerrit Updater [ 19/Nov/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40387/
Subject: LU-12632 hsm: wait longer in sanity-hsm test_90()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bde0ebd6a3e88de5b8d7681efe5d67b98e7fe6a0

Comment by Peter Jones [ 19/Nov/20 ]

Landed for 2.14
