[LU-11182] parallel-scale test_cascading_rw fails with 'cascading_rw failed! 1' Created: 26/Jul/18  Updated: 05/Jan/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for James Nunez <james.a.nunez@intel.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/f888d9f2-8d67-11e8-87f3-52540065bddc

test_cascading_rw failed with the following error:

cascading_rw failed! 1

In this failure, cascading_rw completes several write-to-file iterations (104 in this case), then a write() call fails and returns -1. From the test_log, we see

23:41:23: Running test #/usr/lib64/lustre/tests/cascading_rw(iter 104)
23:41:23: Process 0 (trevis-9vm1.trevis.whamcloud.com)
	FAILED in cascading_rw.c:150:rw_file()
write of file /mnt/lustre/d0.cascading_rw/cascading_rw return -1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 

The only interesting output in the console or dmesg logs is in the logs for the client running the test. In the client console log, we see the following message, but this shouldn’t be causing any issues ... should it?

[88550.129323] Lustre: DEBUG MARKER: == parallel-scale test cascading_rw: cascading_rw ==================================================== 23:40:10 (1532216410)
[88550.536177] Lustre: cascading_rw: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x20006bacf:0x10e05:0x0], use llapi_layout_get_by_path()
[88623.058629] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  parallel-scale test_cascading_rw: @@@@@@ FAIL: cascading_rw failed! 1 

The initial thought is that we are filling the file system, so we need to add some debug logging to confirm this. Once that is confirmed, we can clean up the message in functions.sh/run_cascading_rw():

    # FIXME
    # Need space estimation here.

It’s hard to tell exactly when this started, but the failures appear to begin around 2018-07-19.

Here are a few other logs for this failure:
https://testing.whamcloud.com/test_sets/c36b7bf8-8b55-11e8-9028-52540065bddc
https://testing.whamcloud.com/test_sets/e22eba50-8dad-11e8-87f3-52540065bddc

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
parallel-scale test_cascading_rw - cascading_rw failed! 1



 Comments   
Comment by Sarah Liu [ 27/Jul/18 ]

Another failure, this time on DNE:

https://testing.whamcloud.com/test_sets/e1d0517a-908e-11e8-a9f7-52540065bddc

Comment by James Nunez (Inactive) [ 05/Jan/21 ]

We’ve seen this issue again when testing master (future 2.14.0). The test runs fine for a few iterations and then fails when writing to a file. There are two places in cascading_rw.c where the write fails.

Test sessions
https://testing.whamcloud.com/test_sets/f875b05d-e429-45f3-8758-b2d6ebeb7d3a
https://testing.whamcloud.com/test_sets/a5a22ad6-247c-454b-8109-65abdc0acdd1
https://testing.whamcloud.com/test_sets/b7808f0b-a807-4eff-8128-d82478827fb4
https://testing.whamcloud.com/test_sets/f65dd7d1-1e27-4931-a897-e5ade60ccb92

Fails at line 127:

 123                 off = 0;
 124                 fill_stride(buf, stride, 0, off);
 125                 rc = write(fd, buf, stride);
 126                 if (rc != stride)
 127                         FAILF("write of file %s return %d", filename, rc);

Test sessions
https://testing.whamcloud.com/test_sets/2a9b53c0-edfc-4822-8638-737a3ed89a2d
https://testing.whamcloud.com/test_sets/49a2f62a-dc3d-42fb-95cf-8d482f3af9ac

Fails at line 187:

 182                 if (rank == i) {
 183                         fill_stride(buf, stride, i, off);
 184                         rc = write(fd, buf, stride);
 185                         if (rc != stride)
 186                                 FAILF("write of file %s return %d",
 187                                       filename, rc);
 188                 }
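
Both failure sites print only the return value of write(), which is -1 in these logs, so the output does not say why the write failed. The following is a minimal sketch, assuming a hypothetical wrapper named checked_write() that is not part of cascading_rw.c. It shows how errno could be captured and reported so that ENOSPC can be told apart from other errors and from a short write.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical wrapper, not part of cascading_rw.c.
 * Writes `len` bytes from `buf` to `fd`. Returns 0 on a full write.
 * Otherwise returns -1 and stores a diagnostic in `msg` that names
 * either the errno value or the short byte count.
 */
static int checked_write(int fd, const void *buf, size_t len,
			 const char *filename, char *msg, size_t msglen)
{
	ssize_t rc;
	int saved_errno;

	assert(buf != NULL && filename != NULL && msg != NULL);

	rc = write(fd, buf, len);
	saved_errno = errno; /* capture before another call can change it */

	if (rc == (ssize_t)len) {
		if (msglen > 0)
			msg[0] = '\0';
		return 0;
	}

	if (rc < 0)
		snprintf(msg, msglen, "write of file %s failed: %s (errno %d)",
			 filename, strerror(saved_errno), saved_errno);
	else
		snprintf(msg, msglen, "short write of file %s: %zd of %zu bytes",
			 filename, rc, len);
	return -1;
}
```

If the file system is in fact filling up, a message built this way would be expected to show the ENOSPC error text instead of a bare -1.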
Generated at Sat Feb 10 02:41:39 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.