[LU-13543] hsm.actions file is broken on RHEL 8.2 Created: 10/May/20  Updated: 03/Dec/20  Resolved: 03/Dec/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: Lustre 2.14.0

Type: Bug
Priority: Critical
Reporter: Jian Yu
Assignee: Mikhail Pershin
Resolution: Fixed
Votes: 0
Labels: None
Environment: RHEL 8.2

Issue Links:
Duplicate
is duplicated by LU-13985 seq_file next function must change *pos Resolved
is duplicated by LU-14086 sanity-hsm test 260c fails with 'requ... Resolved
is duplicated by LU-13544 sanity-hsm test_104: Data field in re... Closed
is duplicated by LU-13545 sanity-hsm test_260c: FAIL: request o... Closed
Severity: 3

 Description   

This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/5690066f-e39c-4964-86b6-ceb2f961126c

test_90 failed with the following error:

CMD: trevis-66vm4 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
pdsh@trevis-66vm1: trevis-66vm4: ssh exited with exit code 1
CMD: trevis-66vm4 /usr/sbin/lctl get_param -n mdt.lustre-MDT0000.hsm.actions | egrep 'WAITING|STARTED'
pdsh@trevis-66vm1: trevis-66vm4: ssh exited with exit code 1
Cannot send HSM request (use of /mnt/lustre/d90.sanity-hsm/f90.sanity-hsm.1): Operation not permitted
 sanity-hsm test_90: @@@@@@ FAIL: cannot release a file list 

https://testing.whamcloud.com/test_sets/6badc2b6-c8a6-40db-a740-205913d2f371

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
sanity-hsm test_90 - cannot release a file list



 Comments   
Comment by Jian Yu [ 10/May/20 ]

The following error message is RHEL 8.2 specific:

Cannot send HSM request (use of /mnt/lustre/d90.sanity-hsm/f90.sanity-hsm.1): Operation not permitted
Comment by Jian Yu [ 14/May/20 ]

Hi Minh,
The error message showed that:

pdsh@trevis-66vm1: trevis-66vm4: ssh exited with exit code 1

In lustre-initialization, MY_PDSH was defined as:

MY_PDSH='pdsh -t 120 -S -Rssh -w'

Do you know if the ssh connection is reliable in the RHEL 8.2 test environment?

Comment by Ben Evans (Inactive) [ 19/May/20 ]

I think this is either the ssh issue mentioned above, or a timing issue.  I was able to run the test by hand with no issues.

Comment by Jian Yu [ 20/May/20 ]

Hi Ben,
It's likely a timing issue, because autotest ran the whole test session over ssh and only those three sub-tests failed regularly. Could you please improve the test scripts to resolve the issues?

Comment by Ben Evans (Inactive) [ 20/May/20 ]

Yes, I'm working on that now.

Comment by John Hammond [ 20/May/20 ]

These tests (90, 104, 260c) consistently:

  1. Pass with el8.1 client and el8.1 server: https://review.whamcloud.com/#/c/38675/
  2. Fail with el8.1 client and el8.2 server: https://review.whamcloud.com/#/c/38676/

Test 52 consistently passes with 8.1/8.1, but failed in 1 of 4 runs with 8.1/8.2.

Comment by John Hammond [ 22/May/20 ]

This is likely due to the inclusion of https://github.com/torvalds/linux/commit/1f4aace60b0edc2d885aaa263abf4df42c8c65a8 in el8.2. Reading the debugfs file directly while capturing a Lustre trace and strace output shows that hsm_actions_show_cb() is called many times, but read() returns 0. When the log contains many records, read() returns none of them.
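
For illustration only, here is a small userspace C model of the contract named in the eventual patch subject ("update *pos in seq_file .next functions") and in LU-13985 ("seq_file next function must change *pos"). It is not Lustre or kernel code: the demo_* names, the record table, and the chunked reader are all hypothetical, and the model does not reproduce the exact symptom reported here (read() returning 0). It only sketches why a reader that resumes from *pos on each read() depends on .next advancing it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record table standing in for the entries behind a seq_file. */
static const int records[] = { 10, 20, 30 };
#define NRECORDS (sizeof(records) / sizeof(records[0]))

/*
 * Simplified callback type. The real kernel hook is
 *   void *(*next)(struct seq_file *m, void *v, loff_t *pos);
 */
typedef const int *(*next_fn)(const int *cur, long long *pos);

/* .start-style lookup: return the record at *pos, or NULL past the end. */
static const int *demo_start(const long long *pos)
{
	if (*pos < 0 || (size_t)*pos >= NRECORDS)
		return NULL;
	return &records[*pos];
}

/* Well-behaved .next: always advances *pos, even when it returns NULL. */
static const int *demo_next_good(const int *cur, long long *pos)
{
	(void)cur;
	++*pos;
	return demo_start(pos);
}

/* Buggy .next: walks to the successor but leaves *pos untouched. */
static const int *demo_next_bad(const int *cur, long long *pos)
{
	(void)pos;
	if (cur + 1 >= records + NRECORDS)
		return NULL;
	return cur + 1;
}

/* Model of one read() call: emit up to max records, resuming at *pos. */
static size_t demo_read(next_fn next, long long *pos, int *out, size_t max)
{
	size_t n = 0;
	const int *p = demo_start(pos);

	while (p != NULL && n < max) {
		out[n++] = *p;
		p = next(p, pos);
	}
	return n;
}

/*
 * Drain the table with repeated demo_read() calls.
 * Returns the number of records seen before end-of-file,
 * or -1 if end-of-file was not reached within max_calls.
 */
static long demo_drain(next_fn next, size_t chunk, int max_calls)
{
	long long pos = 0;
	long total = 0;
	int buf[8];

	assert(chunk > 0 && chunk <= 8);
	for (int i = 0; i < max_calls; i++) {
		size_t n = demo_read(next, &pos, buf, chunk);

		if (n == 0)
			return total;
		total += (long)n;
	}
	return -1;
}
```

In this model, demo_drain(demo_next_good, 2, 10) sees each of the three records once and then hits end-of-file, while demo_drain(demo_next_bad, 2, 10) restarts from position 0 on every call and never reaches end-of-file. How a stale *pos shows up in a real seq_file reader depends on the kernel version and on the file's .start/.show implementation.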

Comment by Gerrit Updater [ 25/Nov/20 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40757
Subject: LU-13543 lustre: update *pos in seq_file .next functions
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 31b0a733dd6bc6eae6bd8b1a606da7e6d274c910

Comment by Gerrit Updater [ 03/Dec/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40757/
Subject: LU-13543 lustre: update *pos in seq_file .next functions
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d5d0ff24a84f64e5196341f5ce946952d7fff8b7

Comment by Peter Jones [ 03/Dec/20 ]

Landed for 2.14
