Lustre / LU-7975

"(lod_object.c:700:lod_ah_init()) ASSERTION( lc->ldo_stripenr == 0 )" LBUG/Assert on MDS

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.9.0
    • None
    • 3

    Description

      A site has encountered multiple crashes with the same signature (stack trace and messages), as follows:

      LustreError: 89879:0:(osp_precreate.c:1222:osp_object_truncate()) can't punch object: -11
      Lustre: composit-OST0009-osc-MDT0000: Connection to composit-OST0009 (at 10.0.14.31@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      LustreError: 89879:0:(lod_object.c:700:lod_ah_init()) ASSERTION( lc->ldo_stripenr == 0 ) failed: 
      LustreError: 89879:0:(lod_object.c:700:lod_ah_init()) LBUG
      Pid: 89879, comm: mdt01_006
      
      Call Trace:
       [<ffffffffa057e895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
       [<ffffffffa057ee97>] lbug_with_loc+0x47/0xb0 [libcfs]
       [<ffffffffa266c0af>] lod_ah_init+0x58f/0x5d0 [lod]
       [<ffffffffa26c7ad3>] mdd_object_make_hint+0x83/0xa0 [mdd]
       [<ffffffffa26d4502>] mdd_create_data+0x332/0x7d0 [mdd]
       [<ffffffffa25a93f0>] mdt_finish_open+0x1350/0x19a0 [mdt]
       [<ffffffffa257e5f4>] ? mdt_object_lock+0x14/0x20 [mdt]
       [<ffffffffa25a9fbd>] mdt_open_by_fid_lock+0x57d/0x910 [mdt]
       [<ffffffffa25aabac>] mdt_reint_open+0x56c/0x21a0 [mdt]
       [<ffffffffa059b14c>] ? upcall_cache_get_entry+0x29c/0x890 [libcfs]
       [<ffffffffa0983930>] ? lu_ucred+0x20/0x30 [obdclass]
       [<ffffffffa2572945>] ? mdt_ucred+0x15/0x20 [mdt]
       [<ffffffffa258f8ec>] ? mdt_root_squash+0x2c/0x410 [mdt]
       [<ffffffffa123bad6>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
       [<ffffffffa2593ab1>] mdt_reint_rec+0x41/0xe0 [mdt]
       [<ffffffffa2578f83>] mdt_reint_internal+0x4c3/0x780 [mdt]
       [<ffffffffa257950e>] mdt_intent_reint+0x1ee/0x520 [mdt]
       [<ffffffffa2576cee>] mdt_intent_policy+0x3ae/0x770 [mdt]
       [<ffffffffa11ca2f5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
       [<ffffffffa11f43fb>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
       [<ffffffffa25771b6>] mdt_enqueue+0x46/0xe0 [mdt]
       [<ffffffffa257c84a>] mdt_handle_common+0x52a/0x1470 [mdt]
       [<ffffffffa25b98f5>] mds_regular_handle+0x15/0x20 [mdt]
       [<ffffffffa12238d5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
       [<ffffffffa05904fa>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
       [<ffffffffa121c289>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
       [<ffffffff81057849>] ? __wake_up_common+0x59/0x90
       [<ffffffffa122605d>] ptlrpc_main+0xaed/0x1780 [ptlrpc]
       [<ffffffffa1225570>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
       [<ffffffff8109e78e>] kthread+0x9e/0xc0
       [<ffffffff8100c28a>] child_rip+0xa/0x20
       [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
       [<ffffffff8100c280>] ? child_rip+0x0/0x20
      

      From existing tickets, this kind of problem has already been (partially?) addressed in LU-4260, LU-4791 and LU-5346.
      Since the fixes for both LU-4260 and LU-4791 are already integrated, we must be hitting a new situation during OST object pre-creation. It is likely caused by a specific file metadata pattern, which I have identified as use of the "deferred layout" feature, open(, ...|O_LOV_DELAY_CREATE|...,), followed by a non-zero truncate() that triggers object preallocation. This leads to a case similar to the one described in LU-5346, on an error return path that is still not fixed.
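      The access pattern described above can be sketched from the client side as follows. This is only an illustration, not a confirmed reproducer: the helper name open_delayed_then_truncate is hypothetical, and the numeric value of O_LOV_DELAY_CREATE has to come from the Lustre user headers of the installed release, so it is passed in as a parameter. With the default of 0 the sketch runs on any file system, without the delayed-layout behaviour.

```python
import os


def open_delayed_then_truncate(path, size, delay_create_flag=0):
    """Create `path` with an extra open flag, then truncate it to `size` bytes.

    `delay_create_flag` stands in for O_LOV_DELAY_CREATE; its value is an
    assumption that the caller must supply from the Lustre headers.
    Returns the resulting file size in bytes.
    """
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | delay_create_flag, 0o644)
    try:
        # A non-zero truncate on a file whose layout creation was deferred
        # is what the description identifies as triggering preallocation.
        os.ftruncate(fd, size)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)
```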

      BTW, I have also determined that these MDT asserts always occur just after an OSS crash, hence the -EAGAIN/EWOULDBLOCK error in the "(osp_precreate.c:1222:osp_object_truncate()) can't punch object: -11" message immediately preceding the assert.
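      For readers less used to raw kernel return codes, the mapping from "-11" to EAGAIN can be checked with a small helper. This is a sketch assuming Linux errno numbering; the names LINUX_ERRNO_NAMES, decode_kernel_rc and rc_from_log_line are hypothetical, and the table lists only a few values for illustration.

```python
import re

# A few Linux errno values (assumption: Linux numbering, as on the MDS).
LINUX_ERRNO_NAMES = {
    5: "EIO",
    11: "EAGAIN",      # same numeric value as EWOULDBLOCK on Linux
    12: "ENOMEM",
    28: "ENOSPC",
    107: "ENOTCONN",
    110: "ETIMEDOUT",
}


def decode_kernel_rc(rc):
    """Translate a negative kernel return code into its errno name."""
    if rc >= 0:
        return "OK"
    return LINUX_ERRNO_NAMES.get(-rc, "errno %d" % -rc)


def rc_from_log_line(line):
    """Extract the trailing ': -N' return code from a log line, or None."""
    match = re.search(r":\s*(-?\d+)\s*$", line)
    return int(match.group(1)) if match else None
```

      Applied to the message quoted above, rc_from_log_line() yields -11, which decode_kernel_rc() names as EAGAIN.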

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.9


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19302/
            Subject: LU-7975 lod: fix delayed stripe error path & Client resend
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 047dfe489966c8816cbead1a3abbbb1564fdb7db

            bfaccini Bruno Faccini (Inactive) added a comment -

            http://review.whamcloud.com/19301 has been abandoned in favor of http://review.whamcloud.com/19302, following the reviewers' comments and their choice between the two solutions.

            bfaccini Bruno Faccini (Inactive) added a comment -

            The patch at http://review.whamcloud.com/19301 fixes the cleanup in the delayed stripe error path and also implements a resend mechanism on the MDS side. Doing so may keep an MDS thread busy for some time.

            The patch at http://review.whamcloud.com/19302 is another way to fix this. It also does the necessary cleanup in the delayed stripe error path, but offloads the resend mechanism to the Client side, which may be less intrusive.
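            The interaction between the error-path cleanup and a client-side resend can be illustrated with a toy model. This is not Lustre code and not the content of either patch: ToyLodObject, create_data and client_open are hypothetical names, and the model only assumes what the ticket states, namely that lod_ah_init() asserts the stripe count is zero and that a truncate failing with -EAGAIN previously left that state uncleaned.

```python
EAGAIN = 11  # Linux errno value, matching the -11 in the log


class ToyLodObject:
    """Toy stand-in for per-object striping state (assumption, not Lustre)."""

    def __init__(self):
        self.stripenr = 0

    def ah_init(self):
        # Same shape as the failing assertion: state must be clean on entry.
        assert self.stripenr == 0, "ASSERTION( stripenr == 0 ) failed"

    def create_data(self, ost_up, cleanup_on_error):
        self.ah_init()
        self.stripenr = 1              # stripes prepared for the delayed layout
        if not ost_up:
            if cleanup_on_error:
                self.stripenr = 0      # error path releases the stripe state
            return -EAGAIN             # punch could not reach the OST
        return 0                       # layout is now instantiated


def client_open(obj, ost_states, cleanup_on_error, max_resends=3):
    """Resend while the server answers -EAGAIN.

    `ost_states` gives one boolean per attempt (True means the OST is up).
    """
    for _, ost_up in zip(range(max_resends + 1), ost_states):
        rc = obj.create_data(ost_up, cleanup_on_error)
        if rc != -EAGAIN:
            return rc
    return -EAGAIN
```

            In this model, a second attempt on the same object trips the assertion when cleanup is skipped, and succeeds once the OST is back when the error path resets the state. Keeping the retry loop in the client, rather than in the server, is what frees the server thread between attempts.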

            gerrit Gerrit Updater added a comment -

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/19302
            Subject: LU-7975 lod: fix delayed stripe error path & Client resend
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bb8a376656f6b3626a47c32cbc1855045a47d929

            gerrit Gerrit Updater added a comment -

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/19301
            Subject: LU-7975 lod: fix delayed stripe error path & MDS resend
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 43a47da70a47d027f9ac199ed4ff6fdc4fe91614

            People

              bfaccini Bruno Faccini (Inactive)
              bfaccini Bruno Faccini (Inactive)