Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2099

osd_init_iobuf()) ASSERTION( iobuf->dr_elapsed_valid == 0 ) sanity test 63a

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.6.0, Lustre 2.5.2
    • Lustre 2.4.0, Lustre 2.5.0, Lustre 2.6.0
    • None
    • single node, 4 cpu cores. lustre files in /tmp that is tmpfs
    • 3
    • 4386

    Description

      Just hit this when testing with USE_OFD=yes

      [  929.726136] Lustre: DEBUG MARKER: == sanity test 63a: Verify oig_wait interruption does not crash ========= 03:34:59 (1349595299)
      [  964.438007] LustreError: 28767:0:(osd_io.c:107:osd_init_iobuf()) ASSERTION( iobuf->dr_elapsed_valid == 0 ) failed: 
      [  964.439525] LustreError: 28767:0:(osd_io.c:107:osd_init_iobuf()) LBUG
      [  964.441022] Pid: 28767, comm: ll_ost_io00_004
      [  964.442579] 
      [  964.442580] Call Trace:
      [  964.445464]  [<ffffffffa03f4915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [  964.446900]  [<ffffffffa03f4f27>] lbug_with_loc+0x47/0xb0 [libcfs]
      [  964.448333]  [<ffffffffa0c41a82>] osd_init_iobuf+0xd2/0xe0 [osd_ldiskfs]
      [  964.449770]  [<ffffffffa0c44b0b>] osd_write_prep+0x17b/0x420 [osd_ldiskfs]
      [  964.451208]  [<ffffffffa0c43976>] ? osd_bufs_get+0x206/0x390 [osd_ldiskfs]
      [  964.452627]  [<ffffffffa0e26daa>] ofd_preprw+0x7da/0x11c0 [ofd]
      [  964.454077]  [<ffffffffa0781cf4>] ? sptlrpc_svc_alloc_rs+0x74/0x2d0 [ptlrpc]
      [  964.455474]  [<ffffffffa0d557fc>] obd_preprw+0x12c/0x3d0 [ost]
      [  964.456890]  [<ffffffffa0d5c932>] ost_brw_write+0x882/0x15f0 [ost]
      [  964.458284]  [<ffffffff8127a646>] ? vsnprintf+0x2b6/0x5f0
      [  964.459696]  [<ffffffffa0754cbc>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
      [  964.461128]  [<ffffffffa0754e18>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
      [  964.462513]  [<ffffffffa0d621f0>] ost_handle+0x3120/0x4550 [ost]
      [  964.463899]  [<ffffffffa0401464>] ? libcfs_id2str+0x74/0xb0 [libcfs]
      [  964.465329]  [<ffffffffa0762883>] ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
      [  964.466745]  [<ffffffffa03f566e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      [  964.468163]  [<ffffffffa075b571>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
      [  964.469568]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
      [  964.470993]  [<ffffffffa076541a>] ptlrpc_main+0xb9a/0x1960 [ptlrpc]
      [  964.472399]  [<ffffffffa0764880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      [  964.473806]  [<ffffffff8100c14a>] child_rip+0xa/0x20
      [  964.475201]  [<ffffffffa0764880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      [  964.476610]  [<ffffffffa0764880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      [  964.477937]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      [  964.479240] 
      [  964.481682] Kernel panic - not syncing: LBUG
      [  964.482986] Pid: 28767, comm: ll_ost_io00_004 Not tainted 2.6.32-debug #6
      [  964.484296] Call Trace:
      [  964.485627]  [<ffffffff814f75e4>] ? panic+0xa0/0x168
      [  964.486964]  [<ffffffffa03f4f7b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      [  964.488261]  [<ffffffffa0c41a82>] ? osd_init_iobuf+0xd2/0xe0 [osd_ldiskfs]
      [  964.489572]  [<ffffffffa0c44b0b>] ? osd_write_prep+0x17b/0x420 [osd_ldiskfs]
      [  964.490866]  [<ffffffffa0c43976>] ? osd_bufs_get+0x206/0x390 [osd_ldiskfs]
      [  964.492151]  [<ffffffffa0e26daa>] ? ofd_preprw+0x7da/0x11c0 [ofd]
      [  964.493512]  [<ffffffffa0781cf4>] ? sptlrpc_svc_alloc_rs+0x74/0x2d0 [ptlrpc]
      [  964.494817]  [<ffffffffa0d557fc>] ? obd_preprw+0x12c/0x3d0 [ost]
      [  964.496116]  [<ffffffffa0d5c932>] ? ost_brw_write+0x882/0x15f0 [ost]
      [  964.497432]  [<ffffffff8127a646>] ? vsnprintf+0x2b6/0x5f0
      [  964.498774]  [<ffffffffa0754cbc>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
      [  964.500109]  [<ffffffffa0754e18>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
      [  964.501424]  [<ffffffffa0d621f0>] ? ost_handle+0x3120/0x4550 [ost]
      [  964.502716]  [<ffffffffa0401464>] ? libcfs_id2str+0x74/0xb0 [libcfs]
      [  964.504014]  [<ffffffffa0762883>] ? ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
      [  964.505319]  [<ffffffffa03f566e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      [  964.506593]  [<ffffffffa075b571>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
      [  964.507817]  [<ffffffff81057d60>] ? default_wake_function+0x0/0x20
      [  964.509104]  [<ffffffffa076541a>] ? ptlrpc_main+0xb9a/0x1960 [ptlrpc]
      [  964.510345]  [<ffffffffa0764880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      [  964.511510]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
      [  964.512795]  [<ffffffffa0764880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      [  964.513993]  [<ffffffffa0764880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      [  964.515105]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      

      I have a crashdump for this instance too in case you need something from there.

      Attachments

        1. dmesg.txt
          34 kB
        2. log.gz
          0.2 kB

        Activity

          [LU-2099] osd_init_iobuf()) ASSERTION( iobuf->dr_elapsed_valid == 0 ) sanity test 63a
          pjones Peter Jones added a comment -

          Landed for 2.6

          pjones Peter Jones added a comment - Landed for 2.6

          it looks good in the initial testing

          jay Jinshan Xiong (Inactive) added a comment - it looks good in the initial testing

          please, try with http://review.whamcloud.com/#/c/10305/
          I'm still doing local testing, but I'd appreciate your help given you can reproduce this easily.
          thanks in advance.

          bzzz Alex Zhuravlev added a comment - please, try with http://review.whamcloud.com/#/c/10305/ I'm still doing local testing, but I'd appreciate your help given you can reproduce this easily. thanks in advance.

          Log and dmesg.

          I have a crash dump please let me know if you want it

          jay Jinshan Xiong (Inactive) added a comment - Log and dmesg. I have a crash dump please let me know if you want it

          could you grab full debug log for the case please? with the following patch:

          — a/lustre/osd-ldiskfs/osd_io.c
          +++ b/lustre/osd-ldiskfs/osd_io.c
          @@ -1139,7 +1139,9 @@ static int osd_write_commit(const struct lu_env *env, struct dt_object *dt,
          rc = osd_do_bio(osd, inode, iobuf);
          /* we don't do stats here as in read path because

          • write is async: we'll do this in osd_put_bufs() */
          • }
            + } else { + CERROR("NO IO: %d\n", rc); + }

          if (unlikely(rc != 0)) {
          /* if write fails, we should drop pages from the cache */

          bzzz Alex Zhuravlev added a comment - could you grab full debug log for the case please? with the following patch: — a/lustre/osd-ldiskfs/osd_io.c +++ b/lustre/osd-ldiskfs/osd_io.c @@ -1139,7 +1139,9 @@ static int osd_write_commit(const struct lu_env *env, struct dt_object *dt, rc = osd_do_bio(osd, inode, iobuf); /* we don't do stats here as in read path because write is async: we'll do this in osd_put_bufs() */ } + } else { + CERROR("NO IO: %d\n", rc); + } if (unlikely(rc != 0)) { /* if write fails, we should drop pages from the cache */

          I can't continue my testing since I'm keeping seeing this issue on 2.5 and 2.6

          jay Jinshan Xiong (Inactive) added a comment - I can't continue my testing since I'm keeping seeing this issue on 2.5 and 2.6

          Seeing this issue again on latest master.

          loop: Write error at byte offset 243236864, length 4096.
          LustreError: 15444:0:(osd_io.c:71:__osd_init_iobuf()) ASSERTION( iobuf->dr_elapsed_valid == 0 ) failed: iobuf ffff880548d92c00, reqs 0, rw 1, line 1079
          LustreError: 15444:0:(osd_io.c:71:__osd_init_iobuf()) LBUG
          Pid: 15444, comm: ll_ost_io01_021

          Call Trace:
          [<ffffffffa09e8895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          [<ffffffffa09e8e97>] lbug_with_loc+0x47/0xb0 [libcfs]
          [<ffffffffa054c8b8>] __osd_init_iobuf+0x338/0x350 [osd_ldiskfs]
          [<ffffffffa1232b83>] ? lu_context_init+0xa3/0x240 [obdclass]
          [<ffffffffa0550a42>] osd_write_commit+0x152/0x610 [osd_ldiskfs]
          [<ffffffffa08821fa>] ofd_commitrw_write+0x59a/0x11e0 [ofd]
          [<ffffffffa088336a>] ofd_commitrw+0x52a/0x8c0 [ofd]
          [<ffffffffa1210411>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
          [<ffffffffa14c1c8d>] obd_commitrw.clone.0+0x11d/0x390 [ptlrpc]
          [<ffffffffa14c8eae>] tgt_brw_write+0xc7e/0x1530 [ptlrpc]
          [<ffffffff81059216>] ? enqueue_task+0x66/0x80
          [<ffffffff8105571d>] ? check_preempt_curr+0x6d/0x90
          [<ffffffff81065c5e>] ? try_to_wake_up+0x24e/0x3e0
          [<ffffffffa14247c0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
          [<ffffffffa14c79ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
          [<ffffffffa146635c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
          [<ffffffffa147698a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
          [<ffffffff81068370>] ? pick_next_task_fair+0xd0/0x130
          [<ffffffff81527e08>] ? schedule+0x178/0x3b2
          [<ffffffffa1475c70>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
          [<ffffffff8109aee6>] kthread+0x96/0xa0
          [<ffffffff8100c20a>] child_rip+0xa/0x20
          [<ffffffff8109ae50>] ? kthread+0x0/0xa0
          [<ffffffff8100c200>] ? child_rip+0x0/0x20

          jay Jinshan Xiong (Inactive) added a comment - Seeing this issue again on latest master. loop: Write error at byte offset 243236864, length 4096. LustreError: 15444:0:(osd_io.c:71:__osd_init_iobuf()) ASSERTION( iobuf->dr_elapsed_valid == 0 ) failed: iobuf ffff880548d92c00, reqs 0, rw 1, line 1079 LustreError: 15444:0:(osd_io.c:71:__osd_init_iobuf()) LBUG Pid: 15444, comm: ll_ost_io01_021 Call Trace: [<ffffffffa09e8895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [<ffffffffa09e8e97>] lbug_with_loc+0x47/0xb0 [libcfs] [<ffffffffa054c8b8>] __osd_init_iobuf+0x338/0x350 [osd_ldiskfs] [<ffffffffa1232b83>] ? lu_context_init+0xa3/0x240 [obdclass] [<ffffffffa0550a42>] osd_write_commit+0x152/0x610 [osd_ldiskfs] [<ffffffffa08821fa>] ofd_commitrw_write+0x59a/0x11e0 [ofd] [<ffffffffa088336a>] ofd_commitrw+0x52a/0x8c0 [ofd] [<ffffffffa1210411>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass] [<ffffffffa14c1c8d>] obd_commitrw.clone.0+0x11d/0x390 [ptlrpc] [<ffffffffa14c8eae>] tgt_brw_write+0xc7e/0x1530 [ptlrpc] [<ffffffff81059216>] ? enqueue_task+0x66/0x80 [<ffffffff8105571d>] ? check_preempt_curr+0x6d/0x90 [<ffffffff81065c5e>] ? try_to_wake_up+0x24e/0x3e0 [<ffffffffa14247c0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] [<ffffffffa14c79ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc] [<ffffffffa146635c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc] [<ffffffffa147698a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc] [<ffffffff81068370>] ? pick_next_task_fair+0xd0/0x130 [<ffffffff81527e08>] ? schedule+0x178/0x3b2 [<ffffffffa1475c70>] ? ptlrpc_main+0x0/0x1980 [ptlrpc] [<ffffffff8109aee6>] kthread+0x96/0xa0 [<ffffffff8100c20a>] child_rip+0xa/0x20 [<ffffffff8109ae50>] ? kthread+0x0/0xa0 [<ffffffff8100c200>] ? child_rip+0x0/0x20

          from what I heard last time, no more hits happened with the patch.

          bzzz Alex Zhuravlev added a comment - from what I heard last time, no more hits happened with the patch.
          green Oleg Drokin added a comment -

          I've yet to hit it again with the patch applied, though there was one strange problem hit in the same path once.

          green Oleg Drokin added a comment - I've yet to hit it again with the patch applied, though there was one strange problem hit in the same path once.
          bzzz Alex Zhuravlev added a comment - http://review.whamcloud.com/#change,4216

          People

            bzzz Alex Zhuravlev
            green Oleg Drokin
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: