[LU-2550] sanity test 122 hung Created: 30/Dec/12  Updated: 17/Feb/13  Resolved: 17/Feb/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.4, Lustre 2.1.5, Lustre 1.8.8
Fix Version/s: Lustre 2.4.0, Lustre 2.1.5, Lustre 1.8.9

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: LB
Environment:

Lustre Branch: b1_8
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/236/
Distro/Arch: RHEL5.8/x86_64(server), RHEL6.3/x86_64(client)
Network: IB (in-kernel OFED)


Severity: 3
Rank (Obsolete): 5974

 Description   

sanity test 122 hung as follows:

== sanity test 122: fail client bulk callback (shouldn't LBUG) ========= 23:10:22 (1356678622)
fail_loc=0x508
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00131226 s, 390 kB/s

Console log on the client node showed that:

23:13:02:INFO: task sync:19343 blocked for more than 120 seconds.
23:13:02:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
23:13:02:sync          D 0000000000000002     0 19343  19221 0x00000080
23:13:02: ffff880302affc98 0000000000000082 0000000000000000 0000000000000000
23:13:02: 0000000000000000 0000000000000000 ffff880302affc68 ffffffff810097cc
23:13:02: ffff88031fb71af8 ffff880302afffd8 000000000000fb88 ffff88031fb71af8
23:13:02:Call Trace:
23:13:02: [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
23:13:02: [<ffffffff811141f0>] ? sync_page+0x0/0x50
23:13:02: [<ffffffff814fe0f3>] io_schedule+0x73/0xc0
23:13:02: [<ffffffff8111422d>] sync_page+0x3d/0x50
23:13:02: [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
23:13:02: [<ffffffff81114463>] wait_on_page_bit+0x73/0x80
23:13:02: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
23:13:02: [<ffffffff8112a965>] ? pagevec_lookup_tag+0x25/0x40
23:13:02: [<ffffffff811148db>] wait_on_page_writeback_range+0xfb/0x190
23:13:02: [<ffffffff814fe47c>] ? wait_for_common+0x14c/0x180
23:13:02: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
23:13:02: [<ffffffff8111499f>] filemap_fdatawait+0x2f/0x40
23:13:02: [<ffffffff811a4874>] sync_inodes_sb+0x114/0x190
23:13:02: [<ffffffff811aa312>] __sync_filesystem+0x82/0x90
23:13:02: [<ffffffff811aa418>] sync_filesystems+0xf8/0x130
23:13:02: [<ffffffff811aa4b1>] sys_sync+0x21/0x40
23:13:02: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

Maloo report: https://maloo.whamcloud.com/test_sets/154e712e-514f-11e2-b56e-52540035b04c



 Comments   
Comment by Jian Yu [ 30/Dec/12 ]

This is a regression on the Lustre b1_8 branch.

Comment by Jian Yu [ 06/Jan/13 ]

The issue can be reproduced on Lustre b1_8 build #236:
https://maloo.whamcloud.com/test_sessions/c8649180-57a3-11e2-8b17-52540035b04c

Comment by Jian Yu [ 06/Jan/13 ]

Another instance on Lustre b1_8 branch:
https://maloo.whamcloud.com/test_sets/00323a48-5774-11e2-8772-52540035b04c

Comment by Peter Jones [ 06/Jan/13 ]

Niu

Could you please look into this one?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 07/Jan/13 ]

It looks like something is wrong in the resend path: osc_brw_redo_request() rebuilds the request each time but never increases aa_resends for the new request, so the I/O request is resent indefinitely if it always gets -EIO (as it does in this test). I wonder why this happens only on b1_8; b2_1 and master should have the same problem.
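The resend behaviour described above can be illustrated with a small standalone model. This is only a sketch and not the Lustre code: struct brw_args, redo_request(), send_request(), count_sends(), MAX_RESENDS and ITERATION_CAP are all hypothetical names and values chosen for illustration; only the aa_resends field name and the -EIO result come from the comment. It assumes the resend limit is checked against the counter of the current request.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the async args attached to a bulk I/O request. */
struct brw_args {
    int aa_resends; /* number of times this I/O has been resent so far */
};

#define MAX_RESENDS   10   /* hypothetical resend limit */
#define ITERATION_CAP 1000 /* safety cap so the buggy variant terminates here */

/* Model of a send that always fails, as in the fault-injection test. */
static int send_request(const struct brw_args *args)
{
    (void)args;
    return -EIO;
}

/*
 * Rebuild a request for resend. With carry_count == false the new request
 * starts from a zeroed counter (the behaviour described in the comment);
 * with carry_count == true the old counter is carried over and incremented.
 */
static struct brw_args redo_request(const struct brw_args *old, bool carry_count)
{
    struct brw_args fresh = { 0 };

    assert(old != NULL);
    if (carry_count)
        fresh.aa_resends = old->aa_resends + 1;
    return fresh;
}

/* Returns how many times the I/O was sent before giving up (or hitting the cap). */
static int count_sends(bool carry_count)
{
    struct brw_args args = { 0 };
    int sends = 0;

    while (sends < ITERATION_CAP) {
        sends++;
        if (send_request(&args) != -EIO)
            break;                       /* success: nothing to resend */
        if (args.aa_resends >= MAX_RESENDS)
            break;                       /* limit reached: give up */
        args = redo_request(&args, carry_count);
    }
    return sends;
}
```

In this model, count_sends(true) stops after the initial send plus MAX_RESENDS resends, while count_sends(false) never sees the counter grow and only stops at ITERATION_CAP, which stands in for the infinite resend that leaves sync waiting on page writeback.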

Comment by Niu Yawei (Inactive) [ 07/Jan/13 ]

The test in master (and b2_1) doesn't run 'sync' after dd; I think that's why this bug isn't triggered on master.

Comment by Niu Yawei (Inactive) [ 07/Jan/13 ]

patch for b1_8: http://review.whamcloud.com/4964

Comment by Niu Yawei (Inactive) [ 13/Jan/13 ]

patch for master: http://review.whamcloud.com/5012

Comment by Jian Yu [ 17/Jan/13 ]

Lustre Branch: b1_8
Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/242

The issue still occurred: https://maloo.whamcloud.com/test_sets/327f6338-5ffb-11e2-84d4-52540035b04c

The above build contains patch http://review.whamcloud.com/4964.

Comment by Niu Yawei (Inactive) [ 17/Jan/13 ]

OBD_FAIL_PTLRPC_CLIENT_BULK_CB results in rq_net_error being set on the request, which triggers a reconnect immediately... I think the test script needs to be improved as well.

Comment by Niu Yawei (Inactive) [ 17/Jan/13 ]

http://review.whamcloud.com/5050

Comment by Niu Yawei (Inactive) [ 27/Jan/13 ]

patch for b2_1: http://review.whamcloud.com/5184

Comment by Niu Yawei (Inactive) [ 17/Feb/13 ]

Patches landed.

Generated at Sat Feb 10 01:26:09 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.