[LU-11725] replay-single test 41 fails with 'dd on client failed' Created: 02/Dec/18  Updated: 06/Dec/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

failover test group configuration


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

replay-single test_41 fails for failover test sessions with the error message 'dd on client failed'.

Looking at the client test_log for https://testing.whamcloud.com/test_sets/57dc5c3c-ee96-11e8-86c0-52540065bddc, we see:

== replay-single test 41: read from a valid osc while other oscs are invalid == 17:33:32 (1542908012)
error on ioctl 0x4008669a for '/mnt/lustre/f41.replay-single' (3): No space left on device
error: setstripe: create striped file '/mnt/lustre/f41.replay-single' failed: No space left on device
CMD: trevis-34vm1.trevis.whamcloud.com dd if=/dev/zero of=/mnt/lustre/f41.replay-single bs=4k count=1
dd: opening `/mnt/lustre/f41.replay-single': No space left on device
 replay-single test_41: @@@@@@ FAIL: dd on client failed 

Most likely, no OST is actually full because, looking at the suite_log, we see that there is free space on every OST in test 39:

== replay-single test 39: test recovery from unlink llog (test llog_gen_rec) == 17:32:13 (1542907933)
total: 800 open/close in 1.19 seconds: 674.75 ops/second
CMD: trevis-34vm8 sync; sync; sync
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID      5825660       47228     5255600   1% /mnt/lustre[MDT:0]
lustre-OST0000_UUID      1933276       25792     1786244   1% /mnt/lustre[OST:0]
lustre-OST0001_UUID      1933276       25792     1786244   1% /mnt/lustre[OST:1]
lustre-OST0002_UUID      1933276       25784     1786028   1% /mnt/lustre[OST:2]
lustre-OST0003_UUID      1933276       25784     1786028   1% /mnt/lustre[OST:3]
lustre-OST0004_UUID      1933276       25784     1786028   1% /mnt/lustre[OST:4]
lustre-OST0005_UUID      1933276       25784     1786028   1% /mnt/lustre[OST:5]
lustre-OST0006_UUID      1933276       25832     1786204   1% /mnt/lustre[OST:6]

filesystem_summary:     13532932      180552    12502804   1% /mnt/lustre

Test 40 is skipped, and both tests 41 and 42 fail with errors indicating that an OST is full. Then test 43 again shows that no OST is full:

== replay-single test 43: mds osc import failure during recovery; don't LBUG == 17:33:37 (1542908017)
CMD: trevis-34vm7 sync; sync; sync
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID      5825660       47292     5255536   1% /mnt/lustre[MDT:0]
lustre-OST0000_UUID      1933276       25792     1786244   1% /mnt/lustre[OST:0]
lustre-OST0001_UUID      1933276       25792     1786244   1% /mnt/lustre[OST:1]
lustre-OST0002_UUID      1933276       25784     1786028   1% /mnt/lustre[OST:2]
lustre-OST0003_UUID      1933276       25784     1786028   1% /mnt/lustre[OST:3]
lustre-OST0004_UUID      1933276       25784     1786028   1% /mnt/lustre[OST:4]
lustre-OST0005_UUID      1933276       25784     1786028   1% /mnt/lustre[OST:5]
lustre-OST0006_UUID      1933276       25832     1786204   1% /mnt/lustre[OST:6]

filesystem_summary:     13532932      180552    12502804   1% /mnt/lustre
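
Since the df output in tests 39 and 43 shows plenty of free space on every OST, one possibility (not verified in this session) is that the ENOSPC comes from the MDS having no usable OSP devices at allocation time rather than from a full OST. A minimal sketch of how this could be checked on the MDS, assuming the usual lustre-OST*-osc-MDT0000 device naming (names and parameters below are illustrative, not taken from these logs):

# List the OSP devices on the MDS and their state
lctl dl | grep osp
# 'active' = 0 means the MDT will not allocate new objects on that OST
lctl get_param osp.lustre-OST*-osc-MDT0000.active
# A non-zero prealloc_status (e.g. -28/ENOSPC) also blocks object allocation
lctl get_param osp.lustre-OST*-osc-MDT0000.prealloc_status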

Looking at the MDS (vm7) console log, we see the following for both test 41 and test 42:

[  128.628474] Lustre: DEBUG MARKER: == replay-single test 41: read from a valid osc while other oscs are invalid == 17:33:32 (1542908012)
[  128.639586] LustreError: 2271:0:(lod_qos.c:1354:lod_alloc_specific()) can't lstripe objid [0x200050929:0x643:0x0]: have 0 want 1
[  128.874693] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_41: @@@@@@ FAIL: dd on client failed 
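
The "have 0 want 1" message from lod_alloc_specific() indicates that the MDS found zero usable OSTs when asked to allocate a single stripe, which surfaces on the client as ENOSPC regardless of actual OST usage. A hedged reproduction sketch (not taken from the test script; device and file names are illustrative) showing how the same symptom can be produced on a healthy filesystem by deactivating the OSPs on the MDS:

# On the MDS: deactivate the OSP for each OST (repeat for every OST)
lctl --device lustre-OST0000-osc-MDT0000 deactivate
# On the client: creating a new 1-stripe file now fails with ENOSPC
lfs setstripe -c 1 /mnt/lustre/f41.test
dd if=/dev/zero of=/mnt/lustre/f41.test bs=4k count=1
# On the MDS: reactivate the OSPs when done
lctl --device lustre-OST0000-osc-MDT0000 activate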

There are several JIRA tickets with similar failures and the same message in the MDS console log. For example, this looks like LU-10613, but here we also see an error on the setstripe ioctl, which is not seen in LU-10613.

We have seen this failure and these error messages in other test sessions:
https://testing.whamcloud.com/test_sets/0ea8f4f0-d350-11e8-b589-52540065bddc
https://testing.whamcloud.com/test_sets/5892b56a-ba69-11e8-9df3-52540065bddc



 Comments   
Comment by Andreas Dilger [ 06/Dec/18 ]

It would be useful if "lfs df" returned an OS_STATFS_OFFLINE state when the OSP on the MDS is inactive or offline, so that this condition is easier to see on the client.
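
As a rough illustration of why the current output is hard to interpret (commands are illustrative, not captured from this session): the client-side tools only reflect the client's own OSC imports, so an OST whose OSP is deactivated on the MDS can still look healthy and non-full from the client, and the MDS-side state has to be checked separately.

# On the client: reports the client's own OSC imports, which stay active/normal
# even when the MDS-side OSP for the same OST is deactivated
lfs df /mnt/lustre
lctl get_param osc.lustre-OST*-osc-*.active
# On the MDS: the state that actually governs object allocation
lctl get_param osp.lustre-OST*-osc-MDT0000.active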
