[LU-2598] obdfilter-survey LBUG ASSERTION( iobuf->dr_npages < iobuf->dr_max_pages ) failed Created: 09/Jan/13 Updated: 15/Oct/13 Resolved: 15/Oct/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.1.3 |
| Fix Version/s: | Lustre 2.4.0, Lustre 2.1.5, Lustre 2.1.6, Lustre 2.5.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Malcolm Cowe (Inactive) | Assignee: | Jian Yu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LB | ||
| Environment: |
VirtualBox 4.2.6 VM |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 6053 | ||||||||
| Description |
|
On execution of obdfilter-survey with rszlo="2048" and rszhi="2048" (non-default params) against a single OST, an LBUG is generated:
LustreError: 3202:0:(filter_io_26.c:297:filter_iobuf_add_page()) ASSERTION( iobuf->dr_npages < iobuf->dr_max_pages ) failed:
The LBUG is consistent and reproducible on my test VM cluster using this command line for obdfilter-survey:
size="512" rszlo="2048" rszhi="2048" nobjlo="2" thrlo="2" nobjhi="32" thrhi="32" case="disk" rslt_loc="/root/obdres" obdfilter-survey |
| Comments |
| Comment by Andreas Dilger [ 11/Jan/13 ] |
|
Yes, the maximum IO size is 1MB, but the code shouldn't crash if some larger IO size is specified. The code should return an error in this case, or handle the larger IO by submitting multiple IO requests. This may also be fixed by the 4MB RPC patch in |
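A minimal sketch of the "return an error instead of crashing" behaviour described above. The names `page_buf` and `page_buf_add` are hypothetical stand-ins, not the actual Lustre structures or functions. Assuming 4 KiB pages, a 1MB buffer holds 256 pages while a 2048KB request needs 512, so the 257th add must be refused rather than asserted on.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the server's per-request page buffer. */
struct page_buf {
	size_t npages;     /* pages queued so far */
	size_t max_pages;  /* capacity fixed when the buffer was allocated */
	void **pages;      /* caller-provided array with max_pages slots */
};

/*
 * Queue one page. Returns 0 on success, or -E2BIG when the buffer is
 * already full, so the caller can fail the request (or submit what it
 * has and start a new batch) instead of tripping an assertion.
 */
static int page_buf_add(struct page_buf *buf, void *page)
{
	if (buf->npages >= buf->max_pages)
		return -E2BIG;

	buf->pages[buf->npages++] = page;
	return 0;
}
```

A caller that sees `-E2BIG` could either propagate the error or flush the queued pages and continue with a fresh batch, which corresponds to the two options named in the comment.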
| Comment by Andreas Dilger [ 15/Jan/13 ] |
|
The http://review.whamcloud.com/4993 patch looks like it will resolve this problem in osd-ldiskfs/osd-io.c:
- bio = bio_alloc(GFP_NOIO, max(BIO_MAX_PAGES,
+ bio = bio_alloc(GFP_NOIO, min(BIO_MAX_PAGES, |
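The quoted diff is cut off before the second argument, so the following is only a sketch of why `min` rather than `max` matters. It assumes the missing argument is the number of pages still to be submitted, and it uses an illustrative cap of 256 in place of the kernel's `BIO_MAX_PAGES`. With `max` the requested size can never be below the cap and may exceed it; with `min` it can never exceed the cap, and a larger IO is split across several bios.

```c
#include <stddef.h>

/* Assumed value for illustration; the real limit comes from the kernel headers. */
#define SKETCH_BIO_MAX_PAGES 256

/* Page slots to request for the next bio: never more than the cap. */
static size_t bio_slots_to_request(size_t pages_remaining)
{
	return pages_remaining < SKETCH_BIO_MAX_PAGES ?
	       pages_remaining : SKETCH_BIO_MAX_PAGES;
}

/* Number of bios a request of total_pages needs when each one is capped. */
static size_t bios_needed(size_t total_pages)
{
	return (total_pages + SKETCH_BIO_MAX_PAGES - 1) / SKETCH_BIO_MAX_PAGES;
}
```

Under these assumptions a 512-page request asks for 256 slots at a time and is carried by two bios.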
| Comment by Jian Yu [ 21/Jan/13 ] |
|
If the patches for Test disk case support maximum 1024KB IO data (rszhi=xxxx is too big) please use a smaller value. Here is the test result on the current master branch with rszhi=2048:
== obdfilter-survey test 1a: Object Storage Targets survey == 00:16:37 (1358756197)
+ NETTYPE=tcp rszlo=2048 rszhi=2048 nobjlo=2 thrlo=2 nobjhi=1 thrhi=4 size=512 case=disk rslt_loc=/tmp targets="10.10.4.209:lustre-OST0000 10.10.4.209:lustre-OST0001 10.10.4.209:lustre-OST0002 10.10.4.209:lustre-OST0003 10.10.4.209:lustre-OST0004 10.10.4.209:lustre-OST0005 10.10.4.209:lustre-OST0006" /usr/bin/obdfilter-survey
Test disk case support maximum 1024KB IO data (rszhi=2048 is too big) please use a smaller value.
Resetting fail_loc on all nodes...done.
PASS 1a (1s)
Maloo report: https://maloo.whamcloud.com/test_sets/2d9fdba8-63a3-11e2-824c-52540035b04c |
| Comment by Andreas Dilger [ 21/Jan/13 ] |
|
I now see that the http://review.whamcloud.com/1741 patch is at best a workaround for the real problem. The check should be done in the kernel instead of the script, since kernels may be configured differently, and even if the script is changed it should not be possible to cause a kernel oops. |
| Comment by Jian Yu [ 05/Feb/13 ] |
|
On the latest master branch, after commenting out the "rszhi" check from obdfilter-survey, running the test with rszhi="2048" hit the following assertion failure: LustreError: 16216:0:(ofd_internal.h:524:ofd_info_init()) ASSERTION( info->fti_exp == ((void *)0) ) failed: |
| Comment by Andreas Dilger [ 20/Feb/13 ] |
|
Yu Jian, can you please retest manually, now that http://review.whamcloud.com/1741 has landed. Even better would be to write a small sanity test that runs a very short test manually with a huge blocksize (e.g. 32MB) to check that this no longer LASSERTs. |
| Comment by Jian Yu [ 22/Feb/13 ] |
With http://review.whamcloud.com/1741 on master branch, running obdfilter-survey with rszhi="2048" will always get:
Test disk case support maximum 1024KB IO data (rszhi=2048 is too big) please use a smaller value.
Now that the 4MB RPC patch http://review.whamcloud.com/4993 has landed on master branch, I commented out the change from http://review.whamcloud.com/1741 and ran the obdfilter-survey test with rszhi="2048". It passed without any assertion failures.
Lustre master build: http://build.whamcloud.com/job/lustre-master/1269/
+ NETTYPE=tcp rszlo=2048 rszhi=2048 nobjlo=2 thrlo=2 nobjhi=32 thrhi=32 size=512 case=disk rslt_loc=/tmp targets="10.10.4.209:lustre-OST0000 10.10.4.209:lustre-OST0001" /usr/bin/obdfilter-survey
Fri Feb 22 04:52:38 PST 2013 Obdfilter-survey for case=disk from client-12vm1
ost 2 sz 1048576K rsz 2048K obj 4 thr 4 write 38.53 [ 14.00, 24.00] rewrite 44.82 [ 19.99, 32.00] read 6816.48 SHORT
ost 2 sz 1048576K rsz 2048K obj 4 thr 8 write 42.19 [ 12.00, 31.99] rewrite 50.90 [ 18.00, 31.99] read 6419.65 SHORT
ost 2 sz 1048576K rsz 2048K obj 4 thr 16 write 44.38 [ 6.00, 33.99] rewrite 54.78 [ 15.99, 33.99] read 6656.54 SHORT
ost 2 sz 1048576K rsz 2048K obj 4 thr 32 write 39.14 [ 0.00, 40.00] rewrite 51.76 [ 6.00, 41.99] read 6373.85 SHORT
ost 2 sz 1048576K rsz 2048K obj 4 thr 64 write 44.55 [ 0.00, 47.99] rewrite 55.59 [ 6.00, 51.99] read 5994.27 SHORT
ost 2 sz 1048576K rsz 2048K obj 8 thr 8 write 35.73 [ 8.00, 27.99] rewrite 46.46 [ 17.99, 27.99] read 6606.66 SHORT
ost 2 sz 1048576K rsz 2048K obj 8 thr 16 write 39.02 [ 8.00, 31.99] rewrite 50.27 [ 10.00, 37.99] read 6695.94 SHORT
ost 2 sz 1048576K rsz 2048K obj 8 thr 32 write 43.88 [ 0.00, 39.99] rewrite 50.43 [ 0.00, 39.99] read 6350.12 SHORT
ost 2 sz 1048576K rsz 2048K obj 8 thr 64 write 46.22 [ 0.00, 63.98] rewrite 55.43 [ 0.00, 59.98] read 6055.04 SHORT
ost 2 sz 1048576K rsz 2048K obj 16 thr 16 write 35.33 [ 0.00, 31.99] rewrite 44.84 [ 6.00, 32.00] read 6597.55 SHORT
ost 2 sz 1048576K rsz 2048K obj 16 thr 32 write 39.48 [ 4.00, 35.99] rewrite 44.83 [ 0.00, 35.99] read 6348.42 SHORT
ost 2 sz 1048576K rsz 2048K obj 16 thr 64 write 43.35 [ 0.00, 61.98] rewrite 52.14 [ 0.00, 49.99] read 6024.13 SHORT
ost 2 sz 1048576K rsz 2048K obj 32 thr 32 write 36.63 [ 0.00, 37.99] rewrite 47.39 [ 8.00, 39.99] read 6045.66 SHORT
ost 2 sz 1048576K rsz 2048K obj 32 thr 64 write 40.83 [ 0.00, 49.98] rewrite 50.22 [ 4.00, 51.99] read 5918.83 SHORT
ost 2 sz 1048576K rsz 2048K obj 64 thr 64 write 39.22 [ 0.00, 45.99] rewrite 49.17 [ 0.00, 51.99] read 6107.23 SHORT
done!
Maloo report: https://maloo.whamcloud.com/test_sets/24a44ca0-7cf3-11e2-a108-52540035b04c
I'll change the limit of 1024 to 4096 in obdfilter-survey.
OK, will do. |
| Comment by Andreas Dilger [ 22/Feb/13 ] |
Note that this would cause LBUG if new obdfilter-survey script is run against an old OST... It should be conditional upon the remote Lustre version being used. |
| Comment by Jian Yu [ 14/Mar/13 ] |
|
Patch for Lustre b2_1 branch is in http://review.whamcloud.com/5715. |
| Comment by Jian Yu [ 20/Mar/13 ] |
|
Patch for Lustre master branch is in http://review.whamcloud.com/5783. |
| Comment by Jay Lan (Inactive) [ 29/Mar/13 ] |
|
Hmm, "Note that this would cause LBUG if new obdfilter-survey script is run against an old OST... It should be conditional upon the remote Lustre version being used." Thanks! I kicked off a regression test yesterday and found my OSS crashed! Ah, just want to be sure... This problem only affects sanity test-180c, right? It |
| Comment by Oleg Drokin [ 29/Mar/13 ] |
|
Yes, this will only happen if you run obdfilter-survey, which you don't really do in production. |
| Comment by Jian Yu [ 31/Mar/13 ] |
The sanity test 180c on Lustre b2_1 branch needs to be improved to interoperate with the servers (version < 2.1.5, and 2.2.0 <= version < 2.4.0) which do not have the patch fixing the assertion failure. |
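The version ranges above can be sketched as a predicate that a test could consult before issuing an oversized IO. This is an illustrative sketch only: `version_code` and `server_has_large_io_fix` are hypothetical names, and the packing scheme is an assumption chosen so that plain integer comparison orders versions correctly. It encodes exactly the ranges stated in the comment: servers older than 2.1.5, and servers from 2.2.0 up to but not including 2.4.0, lack the fix.

```c
#include <stdbool.h>

/* Pack a three-part version so that integer comparison orders it (hypothetical helper). */
static unsigned int version_code(unsigned int major, unsigned int minor,
				 unsigned int patch)
{
	return (major << 16) | (minor << 8) | patch;
}

/* True when the server version falls outside the ranges known to lack the fix. */
static bool server_has_large_io_fix(unsigned int code)
{
	if (code < version_code(2, 1, 5))
		return false;	/* version < 2.1.5 */
	if (code >= version_code(2, 2, 0) && code < version_code(2, 4, 0))
		return false;	/* 2.2.0 <= version < 2.4.0 */
	return true;
}
```

A test would skip the large-IO case, rather than run it, whenever the predicate returns false for the server it is talking to.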
| Comment by Jian Yu [ 01/Apr/13 ] |
|
Patch for Lustre b2_1 branch to resolve the interop issues: http://review.whamcloud.com/5902 |
| Comment by Andreas Dilger [ 23/May/13 ] |
|
Patch for b2_4 at http://review.whamcloud.com/6394. |
| Comment by Jian Yu [ 27/May/13 ] |
|
Patches have landed on Lustre b2_1, b2_4 and master branches. |
| Comment by Jodi Levi (Inactive) [ 15/Oct/13 ] |
|
Added 2.5.0 FixVersion |