[LU-11788] sanity test 104a fails with ‘lfs df failed’ Created: 15/Dec/18  Updated: 24/Aug/23  Resolved: 24/Aug/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Xinliang Liu
Resolution: Cannot Reproduce Votes: 0
Labels: arm

Issue Links:
Related
is related to LU-10300 Can the Lustre 2.10.x clients support... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test_104a fails with ‘lfs df failed’ for ARM clients. We’ve seen this only once in the past four months: https://testing.whamcloud.com/test_sets/ec527b2c-fdef-11e8-b837-52540065bddc . After test 104a fails, a series of subsequent tests also fail: 107, 118k, 118i, 119c, 119d, 120a, 123a, 124a, 124b, 129, 130a/b/d/e, 131a/d/e, and 133a/b/c/d, and test 133g hangs.

It’s clear from the suite_log that something is wrong with some of the MDTs/MDSs at the beginning of the test:

== sanity test 104a: lfs df [-ih] [path] test ======================================================== 19:18:31 (1544469511)
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID      1165900       22572     1040132   2% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID : Input/output error
lustre-MDT0002_UUID : Input/output error
lustre-MDT0003_UUID : Input/output error
lustre-OST0000_UUID      1933276       34688     1777348   2% /mnt/lustre[OST:0]
lustre-OST0001_UUID      1933276       45700     1766336   3% /mnt/lustre[OST:1]
lustre-OST0002_UUID      1933276       37288     1774748   2% /mnt/lustre[OST:2]
lustre-OST0003_UUID      1933276       31024     1781012   2% /mnt/lustre[OST:3]
lustre-OST0004_UUID      1933276       30168     1781868   2% /mnt/lustre[OST:4]
lustre-OST0005_UUID      1933276       40068     1771968   2% /mnt/lustre[OST:5]
lustre-OST0006_UUID      1933276       41116     1770920   2% /mnt/lustre[OST:6]
lustre-OST0007_UUID      1933276       32700     1779336   2% /mnt/lustre[OST:7]

filesystem_summary:     15466208      292752    14203536   2% /mnt/lustre

 sanity test_104a: @@@@@@ FAIL: lfs df failed 

So, it’s no surprise that ‘lfs df’ failed.
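For context, when a target's statfs request fails, `lfs df` prints a per-target line of the form `<UUID> : <error message>` instead of the usual statistics columns, as seen above for MDT0001 through MDT0003. A minimal Python sketch of screening such output for failed targets (the `find_failed_targets` helper is hypothetical, not part of Lustre; it only keys off the output shape shown in this ticket):

```python
# Hypothetical helper (not part of Lustre): scan captured `lfs df` output
# and return the UUIDs of targets whose line reports an error rather than
# usage statistics, e.g. "lustre-MDT0001_UUID : Input/output error".

def find_failed_targets(lfs_df_output: str) -> list[str]:
    """Return UUIDs of targets whose `lfs df` line reports an error."""
    failed = []
    for line in lfs_df_output.splitlines():
        # Healthy target lines are whitespace-separated columns; error
        # lines contain " : " between the UUID and the error string.
        if "_UUID" in line and " : " in line:
            uuid, _, _err = line.partition(" : ")
            failed.append(uuid.strip())
    return failed

sample = """\
lustre-MDT0000_UUID      1165900       22572     1040132   2% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID : Input/output error
lustre-MDT0002_UUID : Input/output error
lustre-OST0000_UUID      1933276       34688     1777348   2% /mnt/lustre[OST:0]
"""

print(find_failed_targets(sample))  # ['lustre-MDT0001_UUID', 'lustre-MDT0002_UUID']
```

In the failing run above, such a check would flag MDT0001, MDT0002, and MDT0003, matching the suite_log excerpt.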

sanity test 104a does deactivate an OST and should expect to see that the OST is in a “FULL” state, but, in this test session’s logs, we instead see connection restored messages from the MDTs. From MDS 2 and 4 (vm5), we see

[ 8408.485447] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 104a: lfs df [-ih] [path] test ======================================================== 19:18:31 \(1544469511\)
[ 8409.091430] Lustre: DEBUG MARKER: == sanity test 104a: lfs df [-ih] [path] test ======================================================== 19:18:31 (1544469511)
[ 8409.525890] Lustre: lustre-MDT0001: Connection restored to 310dfc65-ad7d-537f-6815-c0fd7f0fb43b (at 10.9.8.38@tcp)
[ 8409.940081] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_104a: @@@@@@ FAIL: lfs df failed 

with a similar message from MDS 1 and 3 (vm4):

[ 8408.748317] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 104a: lfs df [-ih] [path] test ======================================================== 19:18:31 \(1544469511\)
[ 8409.370355] Lustre: DEBUG MARKER: == sanity test 104a: lfs df [-ih] [path] test ======================================================== 19:18:31 (1544469511)
[ 8409.831727] Lustre: lustre-MDT0002: Connection restored to 56904d5f-959b-023e-bc98-099190cbfba6 (at 10.9.8.38@tcp)
[ 8409.832746] Lustre: Skipped 2 previous similar messages
[ 8410.186046] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_104a: @@@@@@ FAIL: lfs df failed 


 Comments   
Comment by Xinliang Liu [ 08/Dec/21 ]

This issue is hard to reproduce; I ran the test 100 times in a local test environment and all runs passed. We also haven't seen it in CI again.

Comment by James A Simmons [ 14/Jun/22 ]

Does LU-15467 help this?

Comment by James A Simmons [ 23/Aug/23 ]

Is this still a problem?

Comment by Xinliang Liu [ 24/Aug/23 ]

I don't think it is a problem now. We can't see this problem in our Arm CI, on either the master branch or b2_15.

Comment by James A Simmons [ 24/Aug/23 ]

We should close it, then.

Generated at Sat Feb 10 02:46:56 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.