    Description

      sanity test 122 hung as follows:

      == sanity test 122: fail client bulk callback (shouldn't LBUG) ========= 23:10:22 (1356678622)
      fail_loc=0x508
      1+0 records in
      1+0 records out
      512 bytes (512 B) copied, 0.00131226 s, 390 kB/s
      

      The console log on the client node showed the following:

      23:13:02:INFO: task sync:19343 blocked for more than 120 seconds.
      23:13:02:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      23:13:02:sync          D 0000000000000002     0 19343  19221 0x00000080
      23:13:02: ffff880302affc98 0000000000000082 0000000000000000 0000000000000000
      23:13:02: 0000000000000000 0000000000000000 ffff880302affc68 ffffffff810097cc
      23:13:02: ffff88031fb71af8 ffff880302afffd8 000000000000fb88 ffff88031fb71af8
      23:13:02:Call Trace:
      23:13:02: [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
      23:13:02: [<ffffffff811141f0>] ? sync_page+0x0/0x50
      23:13:02: [<ffffffff814fe0f3>] io_schedule+0x73/0xc0
      23:13:02: [<ffffffff8111422d>] sync_page+0x3d/0x50
      23:13:02: [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
      23:13:02: [<ffffffff81114463>] wait_on_page_bit+0x73/0x80
      23:13:02: [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
      23:13:02: [<ffffffff8112a965>] ? pagevec_lookup_tag+0x25/0x40
      23:13:02: [<ffffffff811148db>] wait_on_page_writeback_range+0xfb/0x190
      23:13:02: [<ffffffff814fe47c>] ? wait_for_common+0x14c/0x180
      23:13:02: [<ffffffff81060250>] ? default_wake_function+0x0/0x20
      23:13:02: [<ffffffff8111499f>] filemap_fdatawait+0x2f/0x40
      23:13:02: [<ffffffff811a4874>] sync_inodes_sb+0x114/0x190
      23:13:02: [<ffffffff811aa312>] __sync_filesystem+0x82/0x90
      23:13:02: [<ffffffff811aa418>] sync_filesystems+0xf8/0x130
      23:13:02: [<ffffffff811aa4b1>] sys_sync+0x21/0x40
      23:13:02: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
      

      Maloo report: https://maloo.whamcloud.com/test_sets/154e712e-514f-11e2-b56e-52540035b04c

        Activity

          [LU-2550] sanity test 122 hung

          niu Niu Yawei (Inactive) added a comment - patches landed.
          niu Niu Yawei (Inactive) added a comment - patch for b2_1: http://review.whamcloud.com/5184
          niu Niu Yawei (Inactive) added a comment - http://review.whamcloud.com/5050

          niu Niu Yawei (Inactive) added a comment - OBD_FAIL_PTLRPC_CLIENT_BULK_CB results in rq_net_error being set on the request, which triggers an immediate reconnect... I think the test script needs to be improved as well.
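
          For reference, a minimal, self-contained C sketch of the failure-injection behaviour described above. fail_loc, the 0x508 value, and rq_net_error come from the logs and comments in this ticket; the structure, macro, and callback below are simplified stand-ins, not the actual ptlrpc code.

            #include <stdbool.h>
            #include <stdio.h>

            /* Simplified stand-in for the fail point the test arms. */
            #define OBD_FAIL_PTLRPC_CLIENT_BULK_CB 0x508

            static unsigned int fail_loc;    /* sanity test 122 sets this to 0x508 */

            struct ptlrpc_request_sketch {
                    bool rq_net_error;       /* "network failed" flag on a request */
            };

            /* Client-side bulk completion callback: when the fail point is
             * armed it reports a (fake) network error, and a request with
             * rq_net_error set is treated as if the connection broke, so the
             * client reconnects immediately. */
            static void client_bulk_callback(struct ptlrpc_request_sketch *req)
            {
                    if (fail_loc == OBD_FAIL_PTLRPC_CLIENT_BULK_CB)
                            req->rq_net_error = true;
            }

            int main(void)
            {
                    struct ptlrpc_request_sketch req = { .rq_net_error = false };

                    fail_loc = 0x508;        /* as in "fail_loc=0x508" above */
                    client_bulk_callback(&req);

                    if (req.rq_net_error)
                            printf("rq_net_error set -> immediate reconnect\n");
                    return 0;
            }
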
          yujian Jian Yu added a comment - Lustre Branch: b1_8. Lustre Build: http://build.whamcloud.com/job/lustre-b1_8/242. The issue still occurred: https://maloo.whamcloud.com/test_sets/327f6338-5ffb-11e2-84d4-52540035b04c. The above build contains patch http://review.whamcloud.com/4964.
          niu Niu Yawei (Inactive) added a comment - patch for master: http://review.whamcloud.com/5012
          niu Niu Yawei (Inactive) added a comment - patch for b1_8: http://review.whamcloud.com/4964

          niu Niu Yawei (Inactive) added a comment - The test in master (and b2_1) doesn't have a 'sync' after the dd; I think that's why this bug isn't triggered on master.

          niu Niu Yawei (Inactive) added a comment - It looks like there is something wrong in the resend path: in osc_brw_redo_request() we rebuild the request each time, but we never increase aa_resends for the new request, so the I/O request will be resent indefinitely if it always gets -EIO (as in this test). I'm wondering why it happened only on b1_8; b2_1 and master should have the same problem.
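
          As an illustration of the resend accounting described above, here is a minimal, self-contained C sketch. aa_resends, -EIO, and the rebuild-on-resend pattern come from the comment; the struct, the MAX_RESENDS limit, and the helper functions are simplified stand-ins for the osc_brw_redo_request() machinery, not actual Lustre code.

            #include <errno.h>
            #include <stdio.h>

            #define MAX_RESENDS 10           /* illustrative retry limit */

            struct brw_request_sketch {
                    int aa_resends;          /* how many times this I/O was resent */
            };

            /* Stand-in for the I/O path: the injected failure in this test
             * makes every bulk transfer fail, so the request always sees -EIO. */
            static int send_request(struct brw_request_sketch *req)
            {
                    (void)req;
                    return -EIO;
            }

            /* The reported bug: the request is rebuilt for resend, but the old
             * resend count is dropped, so a retry limit can never be reached. */
            static struct brw_request_sketch rebuild_buggy(struct brw_request_sketch *old)
            {
                    (void)old;
                    return (struct brw_request_sketch){ .aa_resends = 0 };
            }

            /* The fix: carry the counter over (and bump it) on every resend. */
            static struct brw_request_sketch rebuild_fixed(struct brw_request_sketch *old)
            {
                    return (struct brw_request_sketch){ .aa_resends = old->aa_resends + 1 };
            }

            int main(void)
            {
                    struct brw_request_sketch req = { .aa_resends = 0 };
                    int attempts = 0;

                    /* With the fixed rebuild, the loop gives up once MAX_RESENDS
                     * is exceeded. */
                    while (send_request(&req) == -EIO && req.aa_resends <= MAX_RESENDS) {
                            req = rebuild_fixed(&req);
                            attempts++;
                    }
                    printf("fixed rebuild: gave up after %d resends\n", attempts);

                    /* With the buggy rebuild, aa_resends is reset on every pass,
                     * so the same check never fires; cap the demo at 1000
                     * iterations to show it would otherwise spin forever, which
                     * is the hang seen in the 'sync' stack trace above. */
                    req = (struct brw_request_sketch){ .aa_resends = 0 };
                    attempts = 0;
                    while (send_request(&req) == -EIO && req.aa_resends <= MAX_RESENDS
                           && attempts < 1000) {
                            req = rebuild_buggy(&req);
                            attempts++;
                    }
                    printf("buggy rebuild: still retrying after %d resends\n", attempts);
                    return 0;
            }
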
          pjones Peter Jones added a comment - Niu, could you please look into this one? Thanks, Peter

          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: yujian Jian Yu
