[LU-17330] sanity-hsm test_52: FAIL: /usr/bin/lfs hsm_state /mnt/lustre/d52.sanity-hsm/f52.sanity-hsm failed (run as root)
Created: 01/Dec/23
Updated: 19/Dec/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.4
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Minh Diep <mdiep@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/f4647538-d534-4e60-ba2d-e05022c7a504

test_52 failed with the following error:

hsm flags on /mnt/lustre/d52.sanity-hsm/f52.sanity-hsm are Dumping lctl log to /autotest/autotest-2/2023-11-27/lustre-b2_15_full-part-2_77_2_8c79fa06-c4f7-4d94-b060-d917bd91a9e3//sanity-hsm.test_52.*.1701141122.log != 0x0000000b

Test session details:
clients: https://build.whamcloud.com/job/lustre-b2_15/77 - 4.18.0-477.15.1.el8_8.x86_64
servers: https://build.whamcloud.com/job/lustre-b2_15/77 - 4.18.0-477.15.1.el8_lustre.x86_64

== sanity-hsm test 52: Opened for write file on an evicted client should be set dirty ========================================================== 03:11:58 (1701141118)
CMD: trevis-43vm5 mkdir -p /tmp/arc1/sanity-hsm.test_52/
Starting copytool 'agt1' on 'trevis-43vm5' with cmdline 'lhsmtool_posix --hsm-root=/tmp/arc1/sanity-hsm.test_52/ --archive-format=v2 --daemon --pid-file=/var/run/lhsmtool_posix.pid "/mnt/lustre2"'
CMD: trevis-43vm5 lhsmtool_posix --hsm-root=/tmp/arc1/sanity-hsm.test_52/ --archive-format=v2 --daemon --pid-file=/var/run/lhsmtool_posix.pid "/mnt/lustre2" < /dev/null > "/autotest/autotest-2/2023-11-27/lustre-b2_15_full-part-2_77_2_8c79fa06-c4f7-4d94-b060-d917bd91a9e3//sanity-hsm.test_52.copytool_log.trevis-43vm5.log" 2>&1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.190796 s, 5.5 MB/s
CMD: trevis-43vm6 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | awk '/'0x200009872:0x1f1:0x0'.*action='ARCHIVE'/ {print \$13}' | cut -f2 -d=
multiop /mnt/lustre/d52.sanity-hsm/f52.sanity-hsm vO_c
TMPPIPE=/tmp/multiop_open_wait_pipe.1027228
CMD: trevis-43vm6 /usr/sbin/lctl set_param -n mdt.lustre-MDT0000.evict_client 1c44fc66-7282-4cc3-a207-01f519105f5b
can't get hsm state for /mnt/lustre/d52.sanity-hsm/f52.sanity-hsm: Cannot send after transport endpoint shutdown
sanity-hsm test_52: @@@@@@ FAIL: /usr/bin/lfs hsm_state /mnt/lustre/d52.sanity-hsm/f52.sanity-hsm failed (run as root)
Trace dump:

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-hsm test_52 - hsm flags on /mnt/lustre/d52.sanity-hsm/f52.sanity-hsm are Dumping lctl log to /autotest/autotest-2/2023-11-27/lustre-b2_15_full-part-2_77_2_8c79fa06-c4f7-4d94-b060-d917bd91a9e3//sanity-hsm.test_52.*.1701141122.log != 0x0000000b



 Comments   
Comment by Andreas Dilger [ 09/Dec/23 ]

It isn't clear what this test is trying to do. The client is deliberately evicted, and then the HSM command fails because the client is evicted. This could be a race condition, or perhaps a patch related to client eviction or HSM landed recently. This subtest has failed only twice in 500+ runs, both times on b2_15 in full-part-2, out of about 30 sessions that ran in that configuration.
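
If the failure is a race between the eviction and the client reconnecting, one way to probe that would be to retry the failing command a few times instead of failing on the first error. The error text in the log, "Cannot send after transport endpoint shutdown", is the standard message for ESHUTDOWN. The sketch below is only an illustration under that assumption: `retry_cmd` is a hypothetical helper, not a function from sanity-hsm.sh or test-framework.sh, and the ticket does not establish that retrying is the right fix for test_52.

```shell
#!/bin/bash
# retry_cmd MAX_TRIES DELAY CMD [ARGS...]
# Hypothetical helper (not part of the Lustre test framework):
# run CMD until it succeeds or MAX_TRIES attempts have been made,
# sleeping DELAY seconds between attempts.
retry_cmd() {
	local max_tries=$1 delay=$2
	shift 2
	local try
	for ((try = 1; try <= max_tries; try++)); do
		# Stop as soon as the command succeeds.
		if "$@"; then
			return 0
		fi
		# Only sleep if another attempt will follow.
		if ((try < max_tries)); then
			sleep "$delay"
		fi
	done
	return 1
}
```

On a Lustre client, the check from the log could then be wrapped as `retry_cmd 10 1 lfs hsm_state /mnt/lustre/d52.sanity-hsm/f52.sanity-hsm`; the retry count and delay here are arbitrary placeholders.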

Generated at Sat Feb 10 03:34:33 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.