[LU-9367] parallel-scale test_cascading_rw: cascading_rw failed! 1 Created: 19/Apr/17 Updated: 12/Jun/17 Resolved: 12/Jun/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/45d48942-2507-11e7-9de9-5254006e85c2.

The sub-test test_cascading_rw failed with the following error:

cascading_rw failed! 1

server/client: lustre-master #3558 ldiskfs el7

test log

+ su mpiuser sh -c "/usr/lib64/compat-openmpi16/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh -machinefile /tmp/parallel-scale.machines -np 4 /usr/lib64/lustre/tests/cascading_rw -g -d /mnt/lustre/d0.cascading_rw -n 300 "
--------------------------------------------------------------------------
A deprecated MCA parameter value was specified in an MCA parameter file.
Deprecated MCA parameters should be avoided; they may disappear in future releases.
Deprecated parameter: plm_rsh_agent
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A deprecated MCA parameter value was specified in an MCA parameter file.
Deprecated MCA parameters should be avoided; they may disappear in future releases.
Deprecated parameter: plm_rsh_agent
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A deprecated MCA parameter value was specified in an MCA parameter file.
Deprecated MCA parameters should be avoided; they may disappear in future releases.
Deprecated parameter: plm_rsh_agent
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A deprecated MCA parameter value was specified in an MCA parameter file.
Deprecated MCA parameters should be avoided; they may disappear in future releases.
Deprecated parameter: plm_rsh_agent
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A deprecated MCA parameter value was specified in an MCA parameter file.
Deprecated MCA parameters should be avoided; they may disappear in future releases.
Deprecated parameter: plm_rsh_agent
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A deprecated MCA parameter value was specified in an MCA parameter file.
Deprecated MCA parameters should be avoided; they may disappear in future releases.
Deprecated parameter: plm_rsh_agent
--------------------------------------------------------------------------
/usr/lib64/lustre/tests/cascading_rw is running with 4 process(es) in DEBUG mode
23:47:45: Running test #/usr/lib64/lustre/tests/cascading_rw(iter 0)
[trevis-55vm1:21694] *** Process received signal ***
[trevis-55vm1:21694] Signal: Floating point exception (8)
[trevis-55vm1:21694] Signal code: Integer divide-by-zero (1)
[trevis-55vm1:21694] Failing at address: 0x4024c8
[trevis-55vm1:21694] [ 0] /lib64/libpthread.so.0(+0xf370) [0x7fdf9fad6370]
[trevis-55vm1:21694] [ 1] /usr/lib64/lustre/tests/cascading_rw() [0x4024c8]
[trevis-55vm1:21694] [ 2] /usr/lib64/lustre/tests/cascading_rw() [0x402be0]
[trevis-55vm1:21694] [ 3] /usr/lib64/lustre/tests/cascading_rw() [0x40158e]
[trevis-55vm1:21694] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5) [0x7fdf9f727b35]
[trevis-55vm1:21694] [ 5] /usr/lib64/lustre/tests/cascading_rw() [0x40169d]
[trevis-55vm1:21694] *** End of error message ***
[trevis-55vm1.trevis.hpdd.intel.com][[36239,1],2][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[trevis-55vm2.trevis.hpdd.intel.com][[36239,1],1][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 21694 on node trevis-55vm1.trevis.hpdd.intel.com exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------
parallel-scale test_cascading_rw: @@@@@@ FAIL: cascading_rw failed! 1
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4905:error()
= /usr/lib64/lustre/tests/functions.sh:734:run_cascading_rw()
= /usr/lib64/lustre/tests/parallel-scale.sh:130:test_cascading_rw()
= /usr/lib64/lustre/tests/test-framework.sh:5181:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:5220:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:5067:run_test()
= /usr/lib64/lustre/tests/parallel-scale.sh:132:main() |
| Comments |
| Comment by James Nunez (Inactive) [ 27/Apr/17 ] |
|
parallel-scale test_cascading_rw started failing on April 11, 2017 and has failed 59 times since that time. The failures are all for the 'full' test group, but no clear configuration is always failing; there are some interop failures, zfs and ldiskfs failures, and SLES11 SP4, CentOS 6.8 or 7 failures. The logs for the first three failures are at: |
| Comment by Andreas Dilger [ 28/Apr/17 ] |
|
This looks like it started right after the PFL feature landed (April 8), so it probably makes sense for Bobijam to take a look at it. |
| Comment by Zhenyu Xu [ 29/Apr/17 ] |
|
commit fafe6b4d4a6fa63cedff3bd44e6578009578b3d7 changes ll_lov_setstripe():

 static int ll_lov_setstripe(struct inode *inode, struct file *file,
                             unsigned long arg)
@@ -1694,14 +1703,6 @@ static int ll_lov_setstripe(struct inode *inode, struct file *file,
         lum_size = rc;
         rc = ll_lov_setstripe_ea_info(inode, file, flags, klum, lum_size);
-        if (rc == 0) {
-                __u32 gen;
-
-                put_user(0, &lum->lmm_stripe_count);
-
-                ll_layout_refresh(inode, &gen);
-                rc = ll_file_getstripe(inode, (struct lov_user_md __user *)arg);
-        }
         OBD_FREE(klum, lum_size);
         RETURN(rc);

The thinking behind it is that ll_lov_setstripe() should only use the lum to set the file's stripe, and should not have the side effect of copying the instantiated stripe info back into the lum.

lustre/tests/mpi/cascading_rw.c relies on exactly that side effect: it expects the stripe info to be retrieved during the stripe-setting call.

void rw_file(char *name, long stride, unsigned int seed)
        if (rank == 0) {
                remove_file_or_dir(filename);

                lum.lmm_magic = LOV_USER_MAGIC;
                lum.lmm_stripe_size = 0;
                lum.lmm_stripe_count = 0;
                lum.lmm_stripe_offset = -1;

                fd = open(filename, O_CREAT | O_RDWR | O_LOV_DELAY_CREATE,
                          FILEMODE);
                if (fd == -1) {
                        sprintf(errmsg, "open of file %s", filename);
                        FAIL(errmsg);
                }

                rc = ioctl(fd, LL_IOC_LOV_SETSTRIPE, &lum);
                if (rc == -1) {
                        sprintf(errmsg, "ioctl SETSTRIPE of file %s", filename);
                        FAIL(errmsg);
                }

                if (close(fd) == -1) {
                        sprintf(errmsg, "close of file %s", filename);
                        FAIL(errmsg);
                }
        }

        MPI_Barrier(MPI_COMM_WORLD);

        if (stride < 0) {
                if (rank == 0) {
                        srandom(seed);
                        while (stride < page_size/2) {
                                stride = random();
                                stride -= stride % 16;
                                if (stride < 0)
                                        stride = -stride;
                                stride %= 2 * lum.lmm_stripe_size; // DIVIDE BY ZERO EXCEPTION HERE
                        }
                }

                MPI_Barrier(MPI_COMM_WORLD);

Since the lum is no longer filled in by SETSTRIPE, lum.lmm_stripe_size is still the 0 that the test wrote into it, and the modulo divides by zero. We can fix cascading_rw by calling ioctl(fd, LL_IOC_LOV_GETSTRIPE, &lum) after the SETSTRIPE call, so that lum.lmm_stripe_size holds the real stripe size instead of 0. |
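The failure mode described in the comment above can be illustrated with a minimal, Lustre-independent sketch of the stride selection. The helper name pick_stride() and its return convention are hypothetical and do not appear in cascading_rw.c; the sketch only assumes the loop shape quoted in the comment, and adds a guard that refuses to compute a stride when the stripe size was never read back.

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>

/*
 * Hypothetical helper (not part of cascading_rw.c): pick a random stride
 * that is at least page_size / 2 and below 2 * stripe_size, following the
 * loop quoted from rw_file(). Returns -1 instead of dividing when
 * stripe_size is 0 or when no valid stride can exist.
 */
static long pick_stride(long page_size, unsigned int stripe_size,
                        unsigned int seed)
{
        long limit = 2L * (long)stripe_size;
        long stride = -1;

        /* stripe_size == 0 means the layout was never read back. */
        if (stripe_size == 0)
                return -1;

        /* No value in [page_size / 2, limit) exists; do not loop forever. */
        if (limit <= page_size / 2)
                return -1;

        srandom(seed);
        while (stride < page_size / 2) {
                stride = random();
                stride -= stride % 16;  /* round down to a multiple of 16 */
                stride %= limit;        /* safe: limit is non-zero here */
        }
        return stride;
}
```

With a guard like this, a caller that forgot to fetch the layout would get -1 back and could report a test error rather than dying on SIGFPE. Whether the test should guard, call GETSTRIPE, or rely on SETSTRIPE returning the layout is the question discussed in the comments that follow.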
| Comment by Gerrit Updater [ 02/May/17 ] |
|
Bobi Jam (bobijam@hotmail.com) uploaded a new patch: https://review.whamcloud.com/26915 |
| Comment by Andreas Dilger [ 02/May/17 ] |
|
I'm not sure why you consider this to be a side effect? There has been code to explicitly call getstripe and return this to userspace forever. Why not just restore the call to ll_file_getstripe()? |
| Comment by Zhenyu Xu [ 02/May/17 ] |
|
I am fine with restoring the call to ll_file_getstripe(). I have always thought of get/set methods as having one-directional data flow, and with that in mind, ll_lov_setstripe() passing the lum back to the caller looks like a side effect of that function to me. |
| Comment by Zhenyu Xu [ 02/May/17 ] |
|
Also, the definition #define LL_IOC_LOV_SETSTRIPE _IOW('f', 154, long) shows that LL_IOC_LOV_SETSTRIPE should be a write-only ioctl(). It also seems that there is an error in the LL_IOC_LOV_GETSTRIPE interface definition. |
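To make the point about the _IOW encoding concrete, here is a small sketch that decodes the direction bits of a command number. It assumes a Linux system with <linux/ioctl.h>. EXAMPLE_SETSTRIPE and the two helper functions are hypothetical names that only copy the encoding quoted above; they are not taken from the Lustre headers.

```c
#include <linux/ioctl.h>

/* Hypothetical stand-in that copies the encoding quoted in the comment. */
#define EXAMPLE_SETSTRIPE _IOW('f', 154, long)

/* Returns 1 if the encoding says userspace passes data to the kernel. */
static int ioctl_takes_data(unsigned long cmd)
{
        return (_IOC_DIR(cmd) & _IOC_WRITE) != 0;
}

/* Returns 1 if the encoding says the kernel copies data back to userspace. */
static int ioctl_returns_data(unsigned long cmd)
{
        return (_IOC_DIR(cmd) & _IOC_READ) != 0;
}
```

For EXAMPLE_SETSTRIPE only the write direction is set, which matches the "write only" reading in the comment. Note that these bits are a naming convention encoded in the command number: a handler can still copy data back to the caller, which is what the removed ll_file_getstripe() call did.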
| Comment by James Casper [ 03/May/17 ] |
|
This might be related to LU-9429. The parallel_grouplock subtest started failing tag testing after 2017-04-05. In each case, parallel_grouplock times out after cascading_rw fails. tag 55 test (b3550): 2017-04-05 (parallel_grouplock 100% passing) |
| Comment by Gerrit Updater [ 12/Jun/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26915/ |
| Comment by Peter Jones [ 12/Jun/17 ] |
|
Landed for 2.10 |