[LU-1824] Test failure on test suite obdfilter-survey, subtest test_2a Created: 04/Sep/12 Updated: 10/Sep/12 Resolved: 10/Sep/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Jian Yu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
server: lustreb2_3-tag2.2.94 RHEL6 |
||
| Issue Links: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 4454 | ||||||||||||||||
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com> This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/20e050dc-f42c-11e1-b3b2-52540035b04c. The sub-test test_2a failed with the following error:
== obdfilter-survey test 2a: Stripe F/S over the Network ============================================= 00:18:06 (1346483886) CMD: client-25vm8 lctl dl | grep obdfilter client-25vm8: error: dl: No such file or directory opening /proc/fs/lustre/devices client-25vm8: opening /dev/obd failed: No such device client-25vm8: hint: the kernel modules may not be loaded client-25vm8: Error getting device list: No such device: check dmesg. CMD: client-25vm8 /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d '@' client-25vm8: opening /dev/lnet failed: No such device client-25vm8: hint: the kernel modules may not be loaded client-25vm8: IOC_LIBCFS_GET_NI error 19: No such device + NETTYPE=tcp thrlo=8 nobjhi=1 thrhi=16 size=1024 case=netdisk rslt_loc=/tmp targets="" /usr/bin/obdfilter-survey Sat Sep 1 00:18:08 PDT 2012 Obdfilter-survey for case=netdisk from client-25vm2.lab.whamcloud.com Found LBUG in sub test_1c OST log: 22:20:13:LustreError: 8181:0:(filter_io_26.c:483:filter_direct_io()) LBUG 22:20:13:Pid: 8181, comm: lctl 22:20:13: 22:20:13:Call Trace: 22:20:14: [<ffffffffa0939905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 22:20:14: [<ffffffffa0939f17>] lbug_with_loc+0x47/0xb0 [libcfs] 22:20:14: [<ffffffffa0bcf147>] filter_alloc_iobuf+0x1287/0x12d0 [obdfilter] 22:20:14: [<ffffffffa0bd044c>] filter_commitrw_write+0x12bc/0x2eb8 [obdfilter] 22:20:14: [<ffffffffa0bc44f6>] ? filter_preprw_write+0xf06/0x2320 [obdfilter] 22:20:14: [<ffffffffa0950702>] ? cfs_hash_rw_lock+0x12/0x30 [libcfs] 22:20:14: [<ffffffffa0552c7b>] ? cl_env_put+0x1cb/0x370 [obdclass] 22:20:14: [<ffffffffa0bc351d>] filter_commitrw+0x27d/0x2a0 [obdfilter] 22:20:14: [<ffffffffa0bc6710>] ? filter_preprw+0x80/0xa0 [obdfilter] 22:20:14: [<ffffffffa07a41da>] echo_client_brw_ioctl+0x82a/0x13e0 [obdecho] 22:20:14: [<ffffffff811633b5>] ? __kmalloc+0xe5/0x220 22:20:14: [<ffffffffa054d78b>] ? lu_context_init+0xab/0x260 [obdclass] 22:20:14: [<ffffffffa07ac6a0>] echo_client_iocontrol+0xbb0/0x2950 [obdecho] 22:20:14: [<ffffffff81134425>] ? shmem_getpage+0x225/0x9f0 22:20:14: [<ffffffffa093abe0>] ? cfs_alloc+0x30/0x60 [libcfs] 22:20:14: [<ffffffffa04fbd8a>] ? obd_ioctl_getdata+0x13a/0x1160 [obdclass] 22:20:14: [<ffffffffa050f54f>] class_handle_ioctl+0x12ff/0x1ed0 [obdclass] 22:20:15: [<ffffffff8113ff34>] ? handle_mm_fault+0x1e4/0x2b0 22:20:15: [<ffffffffa04fb2ab>] obd_class_ioctl+0x4b/0x190 [obdclass] 22:20:15: [<ffffffff8118dff2>] vfs_ioctl+0x22/0xa0 22:20:15: [<ffffffff81039678>] ? pvclock_clocksource_read+0x58/0xd0 22:20:15: [<ffffffff8118e194>] do_vfs_ioctl+0x84/0x580 22:20:15: [<ffffffff8118e711>] sys_ioctl+0x81/0xa0 22:20:15: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b 22:20:15: 22:20:15:Kernel panic - not syncing: LBUG 22:20:15:Pid: 8181, comm: lctl Not tainted 2.6.32-279.5.1.el6_lustre.x86_64 #1 22:20:15:Call Trace: 22:20:15: [<ffffffff814fd58a>] ? panic+0xa0/0x168 22:20:15: [<ffffffffa0939f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs] 22:20:15: [<ffffffffa0bcf147>] ? filter_alloc_iobuf+0x1287/0x12d0 [obdfilter] 22:20:16: [<ffffffffa0bd044c>] ? filter_commitrw_write+0x12bc/0x2eb8 [obdfilter] 22:20:16: [<ffffffffa0bc44f6>] ? filter_preprw_write+0xf06/0x2320 [obdfilter] 22:20:16: [<ffffffffa0950702>] ? cfs_hash_rw_lock+0x12/0x30 [libcfs] 22:20:16: [<ffffffffa0552c7b>] ? cl_env_put+0x1cb/0x370 [obdclass] 22:20:16: [<ffffffffa0bc351d>] ? filter_commitrw+0x27d/0x2a0 [obdfilter] 22:20:17: [<ffffffffa0bc6710>] ? filter_preprw+0x80/0xa0 [obdfilter] 22:20:17: [<ffffffffa07a41da>] ? echo_client_brw_ioctl+0x82a/0x13e0 [obdecho] 22:20:17: [<ffffffff811633b5>] ? __kmalloc+0xe5/0x220 22:20:17: [<ffffffffa054d78b>] ? lu_context_init+0xab/0x260 [obdclass] 22:20:17: [<ffffffffa07ac6a0>] ? echo_client_iocontrol+0xbb0/0x2950 [obdecho] 22:20:17: [<ffffffff81134425>] ? shmem_getpage+0x225/0x9f0 22:20:17: [<ffffffffa093abe0>] ? cfs_alloc+0x30/0x60 [libcfs] 22:20:17: [<ffffffffa04fbd8a>] ? obd_ioctl_getdata+0x13a/0x1160 [obdclass] 22:20:17: [<ffffffffa050f54f>] ? class_handle_ioctl+0x12ff/0x1ed0 [obdclass] 22:20:17: [<ffffffff8113ff34>] ? handle_mm_fault+0x1e4/0x2b0 22:20:17: [<ffffffffa04fb2ab>] ? obd_class_ioctl+0x4b/0x190 [obdclass] 22:20:17: [<ffffffff8118dff2>] ? vfs_ioctl+0x22/0xa0 22:20:17: [<ffffffff81039678>] ? pvclock_clocksource_read+0x58/0xd0 22:20:17: [<ffffffff8118e194>] ? do_vfs_ioctl+0x84/0x580 22:20:17: [<ffffffff8118e711>] ? sys_ioctl+0x81/0xa0 22:20:17: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b 22:20:17:Initializing cgroup subsys cpuset |
| Comments |
| Comment by Sarah Liu [ 04/Sep/12 ] |
|
another failure: https://maloo.whamcloud.com/test_sets/9e55d668-f440-11e1-8032-52540035b04c |
| Comment by Jodi Levi (Inactive) [ 04/Sep/12 ] |
|
I have made this a blocker given the % of failure and feedback from Sarah. |
| Comment by Peter Jones [ 06/Sep/12 ] |
|
Yujian Could you please look into this one Peter |
| Comment by Jian Yu [ 07/Sep/12 ] |
|
Per report https://maloo.whamcloud.com/test_sets/20e050dc-f42c-11e1-b3b2-52540035b04c, the test failed at sub-test 1c: 22:19:04:== obdfilter-survey test 1c: Object Storage Targets survey, big batch ================================ 22:16:34 (1346476594) LBUG on OSS was: 22:20:13:LustreError: 8181:0:(filter_io_26.c:483:filter_direct_io()) ASSERTION( iobuf->dr_npages > 0 ) failed: This issue was introduced by commit a959779: http://review.whamcloud.com/3446 and hidden before commit 88214c0: http://review.whamcloud.com/3692 was landed. The analysis and patch will be uploaded later. |
| Comment by Jian Yu [ 07/Sep/12 ] |
|
The assertion failure occurred in filter_direct_io(): filter_direct_io() // if (rw == OBD_BRW_READ) { //…… } else { LASSERTF(rw == OBD_BRW_WRITE, "%x\n", rw); LASSERT(iobuf->dr_npages > 0); //…… } The iobuf->dr_npages was 0, which triggered the failure. By looking into filter_commitrw_write() (which called filter_direct_io()), we can see the value of iobuf->dr_npages was increased inside filter_iobuf_add_page(): filter_iobuf_add_page() //
LASSERT(iobuf->dr_npages < iobuf->dr_max_pages);
iobuf->dr_pages[iobuf->dr_npages++] = page;
However, the code did not go into filter_iobuf_add_page(), because all of the lnb->rc values were not 0 while checking them in filter_commitrw_write(): filter_commitrw_write() // for (i = 0, lnb = res; i < niocount; i++, lnb++) { //...... if (lnb->rc) { /* ENOSPC, network RPC error, etc. */ CDEBUG(D_INODE, "Skipping [%d] == %d\n", i, lnb->rc); continue; } //...... err = filter_iobuf_add_page(obd, iobuf, inode, lnb->page); //...... } Why all of the lnb->rc values were not 0? The initial value of lnb->rc was 0 which was set inside filter_map_remote_to_local(): // for (i = 0, rnb = nb, lnb = res; i < obj->ioo_bufcnt; i++, rnb++) { //...... lnb->offset = offset; lnb->len = plen; lnb->flags = rnb->flags; lnb->page = NULL; lnb->rc = 0; lnb->lnb_grant_used = 0; lnb++; //...... } And the value of lnb->rc was changed to -ENOSPC inside filter_grant_check() while there was no space for the write operation: filter_grant_check() // for (obj = 0; obj < objcount; obj++) { for (i = 0; i < fso[obj].fso_bufcnt; i++, n++) { //...... lnb[n].rc = -ENOSPC; lnb[n].flags &= ~OBD_BRW_GRANTED; CDEBUG(D_CACHE,"%s: cli %s/%p idx %d no space for %d\n", exp->exp_obd->obd_name, exp->exp_client_uuid.uuid, exp, n, bytes); } } The filter_grant_check() was called by filter_preprw_write(). Before patch http://review.whamcloud.com/3446 was landed, the filter_grant_check() was only called once. After the patch was landed, if there was no space for the write operation, the pending journals would be committed to free more space, and then filter_grant_check() would be called the second time. The issue was that all of the lnb[n].rc values were set to -ENOSPC inside filter_grant_check() the first time it was called, but at the second time, although there was some space and the function returned 0 not -ENOSPC, the lnb[n].rc values were not reset, all of them were still -ENOSPC, which caused the above failure inside filter_commitrw_write(). A patch is being created to reset lnb[n].rc inside filter_grant_check(). |
| Comment by Jian Yu [ 10/Sep/12 ] |
|
Patch for master branch: http://review.whamcloud.com/3913 |
| Comment by Peter Jones [ 10/Sep/12 ] |
|
Landed for 2.3 and 2.4 |
| Comment by Andreas Dilger [ 10/Sep/12 ] |
|
I verified that this patch also let me pass racer.sh on my local test machine. |