
LU-1582: replay-ost-single.sh test_8b: client stuck with -115 (-EINPROGRESS)


Details

    • Type: Technical task
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.3.0
    • Fix Version/s: Lustre 2.3.0
    • Labels: None
    • Environment: Local x86-64 dual-core node testing on build: v2_2_57-g84dfd30-CHANGED-2.6.32-220.el6_lustre.ge487264.x86_64

    Description

      A run of acceptance-small.sh got stuck for more than a day in replay-ost-single.sh test_8b() "Verify redo io: redo io should success after recovery", which was added for LU-904 "ptlrpc: redo io on -EINPROGRESS" (http://review.whamcloud.com/1962).

      The console message "cfs_fail_loc=230" keeps repeating, and the RPC is resent each time. It looks like the fail_loc is not being cleared in this test, perhaps because the "dd" finishes too quickly?
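
      For illustration, here is a self-contained toy simulation in plain C of the presumed interaction (names and messages are stand-ins, not Lustre code): while fail_loc 0x230 stays armed on the OST, every write RPC fails with -EINPROGRESS and the client's redo logic resends it:

      #include <errno.h>
      #include <stdio.h>

      static unsigned long fail_loc = 0x230;  /* armed by the test, never cleared */

      static int ost_write(void)
      {
              if (fail_loc == 0x230) {
                      printf("*** cfs_fail_loc=230, val=0***\n");
                      return -EINPROGRESS;    /* "recoverable", so the client retries */
              }
              return 0;                       /* normal write path */
      }

      int main(void)
      {
              int rc, resends = 0;

              /* analogue of osc_brw_redo_request(): resend while -EINPROGRESS.
               * Nothing here ever clears fail_loc, so only the resend cap
               * below keeps this demo from spinning forever. */
              while ((rc = ost_write()) == -EINPROGRESS && ++resends < 5)
                      printf("redo for recoverable error rc %d\n", rc);

              return rc ? 1 : 0;
      }

      The real client apparently has no such cap for -EINPROGRESS, so if nothing ever clears fail_loc the resend loop never terminates, which would match the console logs below.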

      Test output:

      == replay-ost-single test 8b: Verify redo io: redo io should success after recovery ================== 18:29:58 (1340670598)
      1280+0 records in
      1280+0 records out
      5242880 bytes (5.2 MB) copied, 1.78424 s, 2.9 MB/s
      fail_loc=0x230
      Failing ost1 on node sookie-gig.adilger.int
      

      Console logs:

      Jun 25 18:29:59 sookie-gig kernel: Lustre: DEBUG MARKER: == replay-ost-single test 8b: Verify redo io: redo io should success after recovery ================== 18:29:58 (1340670598)
      Jun 25 18:30:00 sookie-gig kernel: Lustre: *** cfs_fail_loc=230, val=0***
      Jun 25 18:30:00 sookie-gig kernel: LustreError: 6126:0:(osc_request.c:1662:osc_brw_redo_request()) @@@ redo for recoverable error  req@ffff8800afb6a800 x1405794707837937/t0(0) o4->testfs-OST0000-osc-ffff8800ae6e4c00@0@lo:6/4 lens 488/416 e 0 to 0 dl 1340670627 ref 2 fl Interpret:R/0/0 rc -115/-115
      Jun 25 18:30:21 sookie-gig kernel: Lustre: *** cfs_fail_loc=230, val=0***
      Jun 25 18:30:21 sookie-gig kernel: Lustre: Skipped 5 previous similar messages
      Jun 25 18:30:21 sookie-gig kernel: LustreError: 6126:0:(osc_request.c:1662:osc_brw_redo_request()) @@@ redo for recoverable error  req@ffff8800c5397800 x1405794707837961/t0(0) o4->testfs-OST0000-osc-ffff8800ae6e4c00@0@lo:6/4 lens 488/416 e 0 to 0 dl 1340670648 ref 2 fl Interpret:R/0/0 rc -115/-115
      Jun 25 18:30:21 sookie-gig kernel: LustreError: 6126:0:(osc_request.c:1662:osc_brw_redo_request()) Skipped 5 previous similar messages
      Jun 25 18:30:55 sookie-gig kernel: Lustre: *** cfs_fail_loc=230, val=0***
      Jun 25 18:30:55 sookie-gig kernel: Lustre: Skipped 3 previous similar messages
      Jun 25 18:30:55 sookie-gig kernel: LustreError: 6126:0:(osc_request.c:1662:osc_brw_redo_request()) @@@ redo for recoverable error  req@ffff8800c5398000 x1405794707838010/t0(0) o4->testfs-OST0000-osc-ffff8800ae6e4c00@0@lo:6/4 lens 488/416 e 0 to 0 dl 1340670682 ref 2 fl Interpret:R/0/0 rc -115/-115
      Jun 25 18:30:55 sookie-gig kernel: LustreError: 6126:0:(osc_request.c:1662:osc_brw_redo_request()) Skipped 3 previous similar messages
      Jun 25 18:31:45 sookie-gig kernel: LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_write operation failed with -115
      Jun 25 18:31:45 sookie-gig kernel: LustreError: Skipped 15 previous similar messages
      Jun 25 18:32:00 sookie-gig kernel: Lustre: *** cfs_fail_loc=230, val=0***
      Jun 25 18:32:00 sookie-gig kernel: Lustre: Skipped 4 previous similar messages
      Jun 25 18:32:00 sookie-gig kernel: LustreError: 6126:0:(osc_request.c:1662:osc_brw_redo_request()) @@@ redo for recoverable error  req@ffff8800c5397400 x1405794707838108/t0(0) o4->testfs-OST0000-osc-ffff8800ae6e4c00@0@lo:6/4 lens 488/416 e 0 to 0 dl 1340670747 ref 2 fl Interpret:R/0/0 rc -115/-115
      Jun 25 18:32:00 sookie-gig kernel: LustreError: 6126:0:(osc_request.c:1662:osc_brw_redo_request()) Skipped 4 previous similar messages
      Jun 25 18:33:14 sookie-gig kernel: INFO: task sync:9577 blocked for more than 120 seconds.
      Jun 25 18:33:14 sookie-gig kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Jun 25 18:33:14 sookie-gig kernel: sync          D 0000000000000000     0  9577   9576 0x00000080
      Jun 25 18:33:14 sookie-gig kernel: ffff8800ab3ebc98 0000000000000082 0000000000000000 ffff880037759500
      Jun 25 18:33:14 sookie-gig kernel: ffff8800c4ff40c0 0000000000000000 0000000000000001 00000000ffffffff
      Jun 25 18:33:14 sookie-gig kernel: ffff880037759ab8 ffff8800ab3ebfd8 000000000000f4e8 ffff880037759ab8
      Jun 25 18:33:14 sookie-gig kernel: Call Trace:
      Jun 25 18:33:14 sookie-gig kernel: [<ffffffff8109b809>] ? ktime_get_ts+0xa9/0xe0
      Jun 25 18:33:14 sookie-gig kernel: [<ffffffff81110b10>] ? sync_page+0x0/0x50
      Jun 25 18:33:14 sookie-gig kernel: [<ffffffff814ed583>] io_schedule+0x73/0xc0
      Jun 25 18:33:14 sookie-gig kernel: [<ffffffff81110b4d>] sync_page+0x3d/0x50
      Jun 25 18:33:14 sookie-gig kernel: [<ffffffff814edf3f>] __wait_on_bit+0x5f/0x90
      Jun 25 18:33:14 sookie-gig kernel: [<ffffffff81110d03>] wait_on_page_bit+0x73/0x80
      Jun 25 18:33:14 sookie-gig kernel: [<ffffffff81090c30>] ? wake_bit_function+0x0/0x50
      Jun 25 18:33:14 sookie-gig kernel: [<ffffffff811271a5>] ? pagevec_lookup_tag+0x25/0x40
      Jun 25 18:33:14 sookie-gig kernel: [<ffffffff8111111b>] wait_on_page_writeback_range+0xfb/0x190
      Jun 25 18:33:14 sookie-gig kernel: [<ffffffff814ed7b0>] ? _cond_resched+0x30/0x40
      Jun 25 18:33:14 sookie-gig kernel: [<ffffffff814ed7fc>] ? wait_for_common+0x3c/0x180
      Jun 25 18:33:14 sookie-gig kernel: [<ffffffff811111df>] filemap_fdatawait+0x2f/0x40
      Jun 25 18:33:14 sookie-gig kernel: [<ffffffff8119fbb4>] sync_inodes_sb+0x114/0x190
      Jun 25 18:33:14 sookie-gig kernel: [<ffffffff811a5682>] __sync_filesystem+0x82/0x90
      Jun 25 18:33:14 sookie-gig kernel: [<ffffffff811a5788>] sync_filesystems+0xf8/0x130
      Jun 25 18:33:14 sookie-gig kernel: [<ffffffff811a5821>] sys_sync+0x21/0x40
      Jun 25 18:33:14 sookie-gig kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
      Jun 25 18:34:13 sookie-gig kernel: Lustre: *** cfs_fail_loc=230, val=0***
      Jun 25 18:34:13 sookie-gig kernel: Lustre: Skipped 6 previous similar messages
      Jun 25 18:34:13 sookie-gig kernel: LustreError: 6126:0:(osc_request.c:1662:osc_brw_redo_request()) @@@ redo for recoverable error  req@ffff8800c5398800 x1405794707838312/t0(0) o4->testfs-OST0000-osc-ffff8800ae6e4c00@0@lo:6/4 lens 488/416 e 0 to 0 dl 1340670880 ref 2 fl Interpret:R/0/0 rc -115/-115
      "cfs_fail_loc=230" and "redo for recoverable error" repeat forever
      

      While it was useful for debugging this problem, does it make sense to print every osc_brw_redo_request() message to the console for -EINPROGRESS? Since this is a somewhat "normal" condition that can happen and does not need any attention, it shouldn't be printed to the console via D_ERROR (unlike data checksum errors). Possibly it should only be printed if the resend has been going on for a long time (e.g. ((++aa->aa_resends & 31) == 0) or something similar).
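
      A minimal user-space sketch of that throttle, as self-contained plain C rather than the actual osc_request.c code (the struct name and message are stand-ins; in the real client the common case would presumably drop to a CDEBUG() debug level, with D_ERROR reserved for the escalation):

      #include <stdio.h>

      struct async_args { unsigned int aa_resends; };  /* stand-in for the OSC aa */

      static void redo_log(struct async_args *aa)
      {
              /* '==' binds tighter than '&' in C, so the mask needs its own
               * parentheses: ((x & 31) == 0), not (x & 31 == 0). */
              if ((++aa->aa_resends & 31) != 0)
                      return;  /* common case: debug-level logging only */

              /* every 32nd resend: escalate to a console-level message */
              fprintf(stderr, "redo still looping after %u resends\n",
                      aa->aa_resends);
      }

      int main(void)
      {
              struct async_args aa = { 0 };
              int i;

              for (i = 0; i < 100; i++)
                      redo_log(&aa);  /* prints at resend counts 32, 64, 96 */
              return 0;
      }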

    People

      Assignee: Niu Yawei (Inactive)
      Reporter: Andreas Dilger
      Votes: 1
      Watchers: 3
