[LU-11182] parallel-scale test_cascading_rw fails with 'cascading_rw failed! 1' Created: 26/Jul/18 Updated: 05/Jan/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for James Nunez <james.a.nunez@intel.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/f888d9f2-8d67-11e8-87f3-52540065bddc

test_cascading_rw failed with the following error:

    cascading_rw failed! 1

In this failure, cascading_rw runs several write-to-file iterations (104 in this case), hits a problem, and returns -1. From the test_log, we see

    23:41:23: Running test #/usr/lib64/lustre/tests/cascading_rw(iter 104)
    23:41:23: Process 0 (trevis-9vm1.trevis.whamcloud.com) FAILED in cascading_rw.c:150:rw_file() write of file /mnt/lustre/d0.cascading_rw/cascading_rw return -1
    --------------------------------------------------------------------------
    MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD

The only interesting output in the console or dmesg logs is in the logs for the client running the test. In the client console log, we see a message, but this shouldn’t be causing any issues ... should it?

    [88550.129323] Lustre: DEBUG MARKER: == parallel-scale test cascading_rw: cascading_rw ==================================================== 23:40:10 (1532216410)
    [88550.536177] Lustre: cascading_rw: using old ioctl(LL_IOC_LOV_GETSTRIPE) on [0x20006bacf:0x10e05:0x0], use llapi_layout_get_by_path()
    [88623.058629] Lustre: DEBUG MARKER: /usr/sbin/lctl mark parallel-scale test_cascading_rw: @@@@@@ FAIL: cascading_rw failed! 1

The initial thought is that we are filling the file system. So, we need to add some debug logging to see if this is correct, and then we can clean up the message in functions.sh/run_cascading_rw():

    # FIXME
    # Need space estimation here.

Although it’s hard to tell when this started, this issue looks like it started around 2018-07-19.

Here are a few other logs for this failure

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
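To make the "Need space estimation here" FIXME concrete, the following is a minimal sketch of a free-space check, written in C rather than in the shell of functions.sh. The helper name `have_space_for` and the `needed_bytes` parameter are hypothetical; how much space cascading_rw actually needs is not stated in this ticket and would have to be worked out from the test parameters. Note also that `statvfs()` on a Lustre mount reports aggregate space, so an individual OST could still fill up while this check passes.

```c
#include <stdint.h>
#include <sys/statvfs.h>

/*
 * Hypothetical helper, not part of the Lustre test suite.
 * Returns 1 if the file system holding `path` has at least `needed_bytes`
 * available to unprivileged users, 0 if it does not, and -1 if statvfs()
 * itself fails (errno is left set by statvfs()).
 */
static int have_space_for(const char *path, uint64_t needed_bytes)
{
    struct statvfs st;

    if (statvfs(path, &st) != 0)
        return -1;

    /* f_bavail counts free blocks in units of f_frsize bytes. */
    uint64_t avail = (uint64_t)st.f_bavail * (uint64_t)st.f_frsize;

    return avail >= needed_bytes ? 1 : 0;
}
```

A caller could run this against the mount point (for example /mnt/lustre) before starting the test and skip or log when it returns 0.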
| Comments |
| Comment by Sarah Liu [ 27/Jul/18 ] |
|
Another one on DNE: https://testing.whamcloud.com/test_sets/e1d0517a-908e-11e8-a9f7-52540065bddc |
| Comment by James Nunez (Inactive) [ 05/Jan/21 ] |
|
We’ve seen this issue again when testing master, the future 2.14.0. The test runs fine for a few iterations and then fails when writing to a file. There are two places in cascading_rw.c where the write fails.

Test sessions fail at line 127

    123         off = 0;
    124         fill_stride(buf, stride, 0, off);
    125         rc = write(fd, buf, stride);
    126         if (rc != stride)
    127                 FAILF("write of file %s return %d", filename, rc);

Test sessions fail at line 187

    182         if (rank == i) {
    183                 fill_stride(buf, stride, i, off);
    184                 rc = write(fd, buf, stride);
    185                 if (rc != stride)
    186                         FAILF("write of file %s return %d",
    187                               filename, rc);
    188         } |
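Both excerpts report only the return value of write(), which is -1 for any error. As a possible form of the debug logging mentioned in the description, here is a minimal sketch that also reports errno, so that a full file system (ENOSPC) can be told apart from other failures. The helper name `write_stride_verbose` is hypothetical and is not part of cascading_rw.c; the real test would presumably keep using FAILF.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of cascading_rw.c.
 * Writes `len` bytes from `buf` to `fd`. On failure it prints errno and
 * its description to stderr. Returns 0 on success, a negative errno value
 * if write() fails, or -EIO on a short write.
 */
static int write_stride_verbose(int fd, const char *buf, size_t len,
                                const char *filename)
{
    ssize_t rc = write(fd, buf, len);

    if (rc < 0) {
        int err = errno; /* save before fprintf() can change it */

        fprintf(stderr, "write of file %s failed: errno=%d (%s)\n",
                filename, err, strerror(err));
        return -err;
    }

    if ((size_t)rc != len) {
        fprintf(stderr, "short write of file %s: %zd of %zu bytes\n",
                filename, rc, len);
        return -EIO;
    }

    return 0;
}
```

If the failing iterations returned -ENOSPC here, that would support the theory that the test is filling the file system; any other errno would point elsewhere.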