
dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed:

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.8.0
    • Lustre 2.8.0

    Description

      Found this during a 24-hour failover test on the OpenSFS cluster.

      Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
      Lustre: DEBUG MARKER: mds7 has failed over 1 times, and counting...
      Lustre: 9887:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436839484/real 1436839484]  req@ffff88062c53a080 x1506632146392040/t0(0) o400->lustre-MDT0006-osp-MDT0001@192.168.2.127@o2ib:24/4 lens 224/224 e 1 to 1 dl 1436839486 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
      Lustre: 9887:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      Lustre: 9887:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1436839503/real 1436839503]  req@ffff88062c53a080 x1506632146392208/t0(0) o400->lustre-MDT0006-osp-MDT0001@192.168.2.127@o2ib:24/4 lens 224/224 e 1 to 1 dl 1436839505 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
      Lustre: 9887:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      Lustre: lustre-MDT0006-osp-MDT0001: Connection restored to lustre-MDT0006 (at 192.168.2.127@o2ib)
      LustreError: 12030:0:(dt_object.c:512:dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed: 
      LustreError: 12030:0:(dt_object.c:512:dt_record_write()) LBUG
      Pid: 12030, comm: mdt_out03_005
      
      Call Trace:
       [<ffffffffa0506875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa0506e77>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa065978f>] dt_record_write+0xbf/0x130 [obdclass]
       [<ffffffffa08f4d0e>] out_tx_write_exec+0x7e/0x300 [ptlrpc]
       [<ffffffffa08ed30a>] out_tx_end+0xda/0x5d0 [ptlrpc]
       [<ffffffffa08f1e7b>] out_handle+0xd9b/0x17e0 [ptlrpc]
       [<ffffffffa083afb0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
       [<ffffffffa08ea212>] tgt_request_handle+0xa42/0x1230 [ptlrpc]
       [<ffffffffa0892891>] ptlrpc_main+0xe41/0x1920 [ptlrpc]
       [<ffffffffa0891a50>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
       [<ffffffff8109abf6>] kthread+0x96/0xa0
       [<ffffffff8100c20a>] child_rip+0xa/0x20
       [<ffffffff8109ab60>] ? kthread+0x0/0xa0
       [<ffffffff8100c200>] ? child_rip+0x0/0x20
      
      LustreError: dumping log to /tmp/lustre-log.1436839511.12030
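      The assertion text in the trace, `dt->do_body_ops->dbo_write`, suggests that the write method in the object's body-operations table was unset when `dt_record_write()` was called. The user-space sketch below illustrates that failure pattern only. Every type and function name in it (`struct obj`, `struct body_ops`, `record_write_strict`, `record_write_checked`) is a hypothetical stand-in, not the Lustre definitions, and it does not represent the fix that landed for this ticket.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real Lustre types. */
struct obj;

struct body_ops {
	/* Write method; NULL when the object type cannot be written. */
	long (*write)(struct obj *o, const char *buf, size_t len);
};

struct obj {
	const struct body_ops *ops;
	size_t bytes_written;
};

/* Example write method that only counts the bytes it is given. */
static long count_write(struct obj *o, const char *buf, size_t len)
{
	(void)buf;
	o->bytes_written += len;
	return (long)len;
}

static const struct body_ops writable_ops = { .write = count_write };
static const struct body_ops readonly_ops = { .write = NULL };

/* Strict variant: aborts the process if the method is missing,
 * which is the user-space analogue of an assertion followed by LBUG. */
static long record_write_strict(struct obj *o, const char *buf, size_t len)
{
	assert(o != NULL && o->ops != NULL);
	assert(o->ops->write != NULL);
	return o->ops->write(o, buf, len);
}

/* Checked variant: reports a missing method as an error code instead. */
static long record_write_checked(struct obj *o, const char *buf, size_t len)
{
	if (o == NULL || o->ops == NULL || o->ops->write == NULL)
		return -EOPNOTSUPP;
	return o->ops->write(o, buf, len);
}
```

      Calling `record_write_strict()` on an object that uses `readonly_ops` would abort, much as the MDT thread did here, while `record_write_checked()` would return `-EOPNOTSUPP`. Which behavior is appropriate depends on whether a missing method is a programming error or an expected runtime condition.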
      
      

          Activity

            [LU-6846] dt_record_write()) ASSERTION( dt->do_body_ops->dbo_write ) failed:

            gerrit Gerrit Updater added a comment -

            Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/17616
            Subject: LU-6846 llog: combine cancel rec and destroy
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0e22e861ff702c7515f8716b29770874d1e230e1

            gerrit Gerrit Updater added a comment -

            Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/17579
            Subject: LU-6846 llog: create remote llog synchronously
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b66f9d956c27b7f3b1ee7289ef2e230de93b71e4
            pjones Peter Jones added a comment -

            Landed for 2.8


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16333/
            Subject: LU-6846 llog: create remote llog synchronously
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 36f59f94c06c74887a32f2a7757e7c962c6cf8dd

            simmonsja James A Simmons added a comment -

            All I have is what I posted to LU-6831 for the trace. The node just reboots with no kdump or anything.
            di.wang Di Wang added a comment -

            Hmm, I guess something in ldiskfs still complains about this very large credit reservation. Do you have a stack trace from when the panic happens?

            simmonsja James A Simmons added a comment - - edited

            Yes, it is in my series. Ah, I haven't updated my local test scripts. It still blows up the node with 300k.

            di.wang Di Wang added a comment -

            Hmm, I only saw 300g and 300h fail in your log.

            Sep 10 12:49:43 spoon17.ccs.ornl.gov kernel: [  239.577103] Lustre: DEBUG MARKER: == sanity test 300g: check default striped directory for normal directory == 12:49:43 (1441903783)
            Sep 10 12:49:44 spoon17.ccs.ornl.gov kernel: [  240.291185] Lustre: DEBUG MARKER: sanity test_300g: @@@@@@ FAIL: stripe count 1 != 0 for /lustre/lustre/d300g.sanity/normal_dir/test1
            Sep 10 12:49:45 spoon17.ccs.ornl.gov kernel: [  241.048817] Lustre: DEBUG MARKER: == sanity test 300h: check default striped directory for striped directory == 12:49:45 (1441903785)
            Sep 10 12:49:45 spoon17.ccs.ornl.gov kernel: [  241.837071] Lustre: DEBUG MARKER: sanity test_300h: @@@@@@ FAIL: stripe count 1 != 0 for 
            

            Is http://review.whamcloud.com/#/c/16130/ in your series? I think there is a fix in that patch (see test_300_check_default_striped_dir) to deal with this failure.


            simmonsja James A Simmons added a comment -

            I posted the full log under ticket LU-6831. You can see it here: https://jira.hpdd.intel.com/secure/attachment/18890/kern-09-10-2015.log
            di.wang Di Wang added a comment -

            Hi James, this is a known issue: the credits required to create a big striped directory are larger than the maximum credits here. I believe you can see this on current master as well.
            But it is just a warning for now, and it should not cause 300k to fail. Did you see any other error messages there? Thanks.

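            The credit warning described above can be pictured with a small user-space sketch. It assumes, purely for illustration, that the credits needed grow linearly with the stripe count. The function names (`striped_dir_credits`, `check_credits`), the per-stripe cost, and the limits used are all hypothetical and are not taken from Lustre or ldiskfs.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical estimate: credits needed to create a striped directory,
 * assuming (for illustration only) a fixed cost per stripe. */
static unsigned long striped_dir_credits(unsigned long per_stripe,
					 unsigned long stripe_count)
{
	return per_stripe * stripe_count;
}

/* Warn-only check: returns 1 and prints a warning when the request
 * exceeds the maximum, 0 otherwise. The request itself is not changed. */
static int check_credits(unsigned long wanted, unsigned long max_credits)
{
	assert(max_credits > 0);
	if (wanted > max_credits) {
		fprintf(stderr, "warning: credits %lu exceed maximum %lu\n",
			wanted, max_credits);
		return 1;
	}
	return 0;
}
```

            With these made-up numbers, a directory with few stripes stays under the limit, while a wide one trips the warning without failing the operation, which matches the "just a warning for now" behavior described in the comment.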

            People

              di.wang Di Wang
              di.wang Di Wang