[LU-2550] sanity test 122 hung Created: 30/Dec/12 Updated: 17/Feb/13 Resolved: 17/Feb/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.1.4, Lustre 2.1.5, Lustre 1.8.8 |
| Fix Version/s: | Lustre 2.4.0, Lustre 2.1.5, Lustre 1.8.9 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LB | ||
| Environment: |
Lustre Branch: b1_8 |
||
| Severity: | 3 |
| Rank (Obsolete): | 5974 |
| Description |
|
sanity test 122 hung as follows:

== sanity test 122: fail client bulk callback (shouldn't LBUG) ========= 23:10:22 (1356678622)
fail_loc=0x508
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00131226 s, 390 kB/s

Console log on the client node showed that:

23:13:02:INFO: task sync:19343 blocked for more than 120 seconds.
23:13:02:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
23:13:02:sync D 0000000000000002 0 19343 19221 0x00000080
23:13:02: ffff880302affc98 0000000000000082 0000000000000000 0000000000000000
23:13:02: 0000000000000000 0000000000000000 ffff880302affc68 ffffffff810097cc
23:13:02: ffff88031fb71af8 ffff880302afffd8 000000000000fb88 ffff88031fb71af8
23:13:02:Call Trace:
23:13:02: [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
23:13:02: [<ffffffff811141f0>] ? sync_page+0x0/0x50
23:13:02: [<ffffffff814fe0f3>] io_schedule+0x73/0xc0
23:13:02: [<ffffffff8111422d>] sync_page+0x3d/0x50
23:13:02: [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
23:13:02: [<ffffffff81114463>] wait_on_page_bit+0x73/0x80
23:13:02: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
23:13:02: [<ffffffff8112a965>] ? pagevec_lookup_tag+0x25/0x40
23:13:02: [<ffffffff811148db>] wait_on_page_writeback_range+0xfb/0x190
23:13:02: [<ffffffff814fe47c>] ? wait_for_common+0x14c/0x180
23:13:02: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
23:13:02: [<ffffffff8111499f>] filemap_fdatawait+0x2f/0x40
23:13:02: [<ffffffff811a4874>] sync_inodes_sb+0x114/0x190
23:13:02: [<ffffffff811aa312>] __sync_filesystem+0x82/0x90
23:13:02: [<ffffffff811aa418>] sync_filesystems+0xf8/0x130
23:13:02: [<ffffffff811aa4b1>] sys_sync+0x21/0x40
23:13:02: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

Maloo report: https://maloo.whamcloud.com/test_sets/154e712e-514f-11e2-b56e-52540035b04c |
| Comments |
| Comment by Jian Yu [ 30/Dec/12 ] |
|
This is a regression on the Lustre b1_8 branch. |
| Comment by Jian Yu [ 06/Jan/13 ] |
|
The issue can be reproduced on Lustre b1_8 build #236: |
| Comment by Jian Yu [ 06/Jan/13 ] |
|
Another instance on Lustre b1_8 branch: |
| Comment by Peter Jones [ 06/Jan/13 ] |
|
Niu, could you please look into this one? Thanks, Peter |
| Comment by Niu Yawei (Inactive) [ 07/Jan/13 ] |
|
It looks like something is wrong in the resend path: osc_brw_redo_request() rebuilds the request on every resend but never increments aa_resends for the new request, so the I/O request is resent indefinitely if it always gets -EIO (as it does in this test). I'm wondering why it happened only on b1_8; b2_1 and master should have the same problem. |
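The resend problem described above can be illustrated with a small standalone model. This is only a sketch and is not the Lustre code: `struct request`, `send_request()`, `rebuild_request()`, `run_io()`, the `MAX_RESENDS` cap, and the `safety_limit` parameter are all hypothetical names invented for illustration. Only the field name `aa_resends` and the -EIO failure come from the comment. The sketch contrasts a rebuild that resets the counter with one that carries it forward.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bulk I/O request; only the resend counter is modeled. */
struct request {
    int aa_resends;
};

/* Assumed cap on resends, chosen arbitrarily for this sketch. */
enum { MAX_RESENDS = 10 };

/* Models the fault-injected test: every send fails with -EIO. */
static int send_request(const struct request *req)
{
    (void)req;
    return -EIO;
}

/*
 * Build a replacement request for a resend.
 * carry_counter == false models the reported bug: the new request starts at 0.
 * carry_counter == true models the fix: the count is carried over and bumped.
 */
static struct request rebuild_request(const struct request *old, bool carry_counter)
{
    struct request new_req = { .aa_resends = 0 };

    assert(old != NULL);
    if (carry_counter)
        new_req.aa_resends = old->aa_resends + 1;
    return new_req;
}

/*
 * Returns the number of sends attempted before giving up, or -1 if
 * safety_limit sends were made without giving up (standing in for the hang).
 */
static int run_io(bool carry_counter, int safety_limit)
{
    struct request req = { .aa_resends = 0 };
    int sends = 0;

    while (sends < safety_limit) {
        int rc = send_request(&req);

        sends++;
        if (rc != -EIO)
            return sends;            /* success or a different error */
        if (req.aa_resends >= MAX_RESENDS)
            return sends;            /* resend budget exhausted: give up */
        req = rebuild_request(&req, carry_counter);
    }
    return -1;
}
```

In this model, `run_io(true, 1000)` gives up after the original send plus `MAX_RESENDS` resends, while `run_io(false, 1000)` never reaches the cap because each rebuilt request starts again from zero.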
| Comment by Niu Yawei (Inactive) [ 07/Jan/13 ] |
|
The test in master (and b2_1) doesn't run 'sync' after dd; I think that's why this bug isn't triggered in master. |
| Comment by Niu Yawei (Inactive) [ 07/Jan/13 ] |
|
patch for b1_8: http://review.whamcloud.com/4964 |
| Comment by Niu Yawei (Inactive) [ 13/Jan/13 ] |
|
patch for master: http://review.whamcloud.com/5012 |
| Comment by Jian Yu [ 17/Jan/13 ] |
|
Lustre Branch: b1_8
The issue still occurred: https://maloo.whamcloud.com/test_sets/327f6338-5ffb-11e2-84d4-52540035b04c
The above build contains patch http://review.whamcloud.com/4964. |
| Comment by Niu Yawei (Inactive) [ 17/Jan/13 ] |
|
OBD_FAIL_PTLRPC_CLIENT_BULK_CB results in rq_net_error being set on the request, which triggers an immediate reconnect... I think the test script needs to be improved as well. |
| Comment by Niu Yawei (Inactive) [ 17/Jan/13 ] |
| Comment by Niu Yawei (Inactive) [ 27/Jan/13 ] |
|
patch for b2_1: http://review.whamcloud.com/5184 |
| Comment by Niu Yawei (Inactive) [ 17/Feb/13 ] |
|
patches landed. |