
LU-4260: ASSERTION( lc->ldo_stripenr == 0 ) failed:

Details


    Description

      Servers and clients are 2.4.1, configured for active-active failover.

      One of the two clients is 1.8.9.

      As soon as I wrote a file to a remote directory, the second MDS crashed.

      Nov 15 09:00:06 lustre-mds-0-1 kernel: LustreError: 20726:0:(lod_lov.c:554:lod_generate_and_set_lovea()) rhino-MDT0001-mdtlov: Can not locate [0x640000bd0:0x22:0x0]: rc = -5
      Nov 15 09:00:06 lustre-mds-0-1 kernel: LustreError: 20726:0:(lod_object.c:704:lod_ah_init()) ASSERTION( lc->ldo_stripenr == 0 ) failed:
      Nov 15 09:00:06 lustre-mds-0-1 kernel: LustreError: 20726:0:(lod_object.c:704:lod_ah_init()) LBUG
      Nov 15 09:00:06 lustre-mds-0-1 kernel: Pid: 20726, comm: mdt03_000
      Nov 15 09:00:06 lustre-mds-0-1 kernel:
      Nov 15 09:00:06 lustre-mds-0-1 kernel: Call Trace:
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0349895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0349e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0e3f78f>] lod_ah_init+0x57f/0x5c0 [lod]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0b73a83>] mdd_object_make_hint+0x83/0xa0 [mdd]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0b7feb2>] mdd_create_data+0x332/0x7d0 [mdd]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0d9cc2c>] mdt_finish_open+0x125c/0x18a0 [mdt]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0d984f8>] ? mdt_object_open_lock+0x1c8/0x510 [mdt]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0d9ee8d>] mdt_reint_open+0x115d/0x20c0 [mdt]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa036682e>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa071fdcc>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0d89911>] mdt_reint_rec+0x41/0xe0 [mdt]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0d6eae3>] mdt_reint_internal+0x4c3/0x780 [mdt]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0d6f06d>] mdt_intent_reint+0x1ed/0x520 [mdt]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0d6cf1e>] mdt_intent_policy+0x39e/0x720 [mdt]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa06d7831>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa06fe1ef>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0d6d3a6>] mdt_enqueue+0x46/0xe0 [mdt]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0d73a97>] mdt_handle_common+0x647/0x16d0 [mdt]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0720bac>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0dad3f5>] mds_regular_handle+0x15/0x20 [mdt]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa07303c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa034a5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa035bd9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0727729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa073175e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0730c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0730c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffffa0730c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      Nov 15 09:00:06 lustre-mds-0-1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      Nov 15 09:00:06 lustre-mds-0-1 kernel:
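      For context, the assertion fires while the MDS builds allocation hints for the file being created. Below is a minimal sketch of what the check means, paraphrasing lod_ah_init() from the trace above; the exact code in lod_object.c differs, and any name not in the log is illustrative.

      /* "lc" is the in-memory lod object for the child file being created.
       * At hint-initialization time the child must not yet carry striping:
       * ldo_stripenr only becomes non-zero once a layout (LOV EA) has been
       * generated for the object. */
      void lod_ah_init_sketch(struct lod_object *lc)
      {
              /* If an earlier layout attempt failed (e.g. the "Can not
               * locate [0x640000bd0:0x22:0x0]: rc = -5" error above)
               * without cleaning up, ldo_stripenr is left non-zero and
               * this assertion takes the whole MDS down with an LBUG. */
              LASSERT(lc->ldo_stripenr == 0);

              /* ... choose default stripe count/size for the new file ... */
      }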

      Activity

            jay Jinshan Xiong (Inactive) added a comment - edited

            Not sure if this is the same issue, but I'm still seeing the crash with exactly the same call stack on latest master.

            LustreError: 20580:0:(lod_object.c:1475:lod_ah_init()) ASSERTION( lc->ldo_stripenr == 0 ) failed: 
            LustreError: 20580:0:(lod_object.c:1475:lod_ah_init()) LBUG
            Pid: 20580, comm: mdt01_008
            
            Call Trace:
             [<ffffffffa03a3895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
             [<ffffffffa03a3e97>] lbug_with_loc+0x47/0xb0 [libcfs]
             [<ffffffffa0e098de>] lod_ah_init+0xaae/0xb80 [lod]
             [<ffffffffa0ce1b09>] mdd_object_make_hint+0x139/0x180 [mdd]
             [<ffffffffa0cd12f9>] mdd_create_data+0x359/0x7f0 [mdd]
             [<ffffffffa075c4e0>] ? lustre_swab_mdt_body+0x0/0x140 [ptlrpc]
             [<ffffffffa0d4afdb>] mdt_mfd_open+0xc8b/0xf10 [mdt]
             [<ffffffffa0e033a3>] ? lod_xattr_get+0x153/0x420 [lod]
             [<ffffffffa0d4c253>] mdt_finish_open+0x553/0xc20 [mdt]
             [<ffffffffa0d46383>] ? mdt_object_open_lock+0x2f3/0x9c0 [mdt]
             [<ffffffffa0d4e76f>] mdt_reint_open+0x12af/0x2130 [mdt]
             [<ffffffffa03c11c6>] ? upcall_cache_get_entry+0x296/0x880 [libcfs]
             [<ffffffffa0550310>] ? lu_ucred+0x20/0x30 [obdclass]
             [<ffffffffa0d36851>] mdt_reint_rec+0x41/0xe0 [mdt]
             [<ffffffffa0d1be13>] mdt_reint_internal+0x4c3/0x7c0 [mdt]
             [<ffffffffa0d1c308>] mdt_intent_reint+0x1f8/0x520 [mdt]
             [<ffffffffa0d1a9e9>] mdt_intent_policy+0x499/0xca0 [mdt]
             [<ffffffff81168742>] ? kmem_cache_alloc+0x182/0x190
             [<ffffffffa070f809>] ldlm_lock_enqueue+0x359/0x920 [ptlrpc]
             [<ffffffffa0738c6f>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
             [<ffffffff8103c7d8>] ? pvclock_clocksource_read+0x58/0xd0
             [<ffffffffa07bb022>] tgt_enqueue+0x62/0x1d0 [ptlrpc]
             [<ffffffffa07bb3cc>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
             [<ffffffffa075a01c>] ? lustre_msg_get_transno+0x8c/0x100 [ptlrpc]
             [<ffffffffa076a5ca>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
             [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
             [<ffffffff8150e600>] ? thread_return+0x4e/0x76e
             [<ffffffffa07698b0>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
             [<ffffffff81096a36>] kthread+0x96/0xa0
             [<ffffffff8100c0ca>] child_rip+0xa/0x20
             [<ffffffff810969a0>] ? kthread+0x0/0xa0
             [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
            

            I have a core dump so feel free to ask for it.

            pjones Peter Jones added a comment -

            Landed for 2.6

            di.wang Di Wang added a comment -

            Those OST sequences seem to have been added accidentally during an incorrect upgrade process (though I do not know how to reproduce it). After we removed seq_srv on the OST (i.e. the local sequence files on the OST, which are only used for DNE) and restarted the OST, the OST re-acquired a new metadata sequence from MDT0 and everything worked fine.

            We still need to clean up the stripe_info of the lod object when an error happens, though, to avoid the LBUG. Here are the patches:

            http://review.whamcloud.com/8325 (b2_4)
            http://review.whamcloud.com/8324 (master)
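            The gist, as a hedged sketch (not the actual diff from the reviews above; the signature of lod_generate_and_set_lovea() is approximated and helpers marked _sketch are hypothetical), is to reset the in-memory striping state on the error path so a later call cannot reach lod_ah_init() with stale fields:

            static int lod_set_layout_sketch(const struct lu_env *env,
                                             struct lod_object *lo,
                                             struct thandle *th)
            {
                    int rc;

                    rc = lod_generate_and_set_lovea(env, lo, th);
                    if (rc != 0) {
                            /* Drop the partially built stripe info and zero
                             * ldo_stripenr instead of leaving garbage in
                             * memory for the next lod_ah_init() to trip on. */
                            lod_free_striping_sketch(lo);
                            lo->ldo_stripenr = 0;
                    }
                    return rc;
            }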

            mdiep Minh Diep added a comment -

            Andreas, clients < 2.4.0 can still mount and use MDT0; only 2.4.0+ clients can access all MDTs. This is hit when a 2.4.1 client accesses the remote directory.

            di.wang Di Wang added a comment -

            Hmm, I checked the debug log here (Minh collected it for me); it seems the newly created OST sequence is somehow not being inserted into the FLDB, which leaves garbage stripe_info (of the lod object) in memory, and then the LBUG is hit (sketched below). So we need to:

            1. Clean up the stripe_info of the lod object when an error happens.
            2. Figure out why the OST sequence is not being inserted into the FLDB during the upgrade process.
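            For clarity, the failure chain as I read it from the logs, sketched as commented pseudo-C (function names are from the logs; the signature is approximated):

            /* 1. A client opens a new file under the remote directory and
             *    the MDS generates a layout for it. */
            rc = lod_generate_and_set_lovea(env, lo, th);
            /* 2. Building the layout requires resolving the stripe FID's
             *    sequence, but that OST sequence was never inserted into
             *    the FLDB, so the lookup fails with -EIO:
             *    "Can not locate [0x640000bd0:0x22:0x0]: rc = -5". */
            /* 3. The error path returns without clearing the partially set
             *    up stripe info, leaving lo->ldo_stripenr != 0 in memory. */
            /* 4. When lod_ah_init() later runs against the same object (in
             *    the log, the same open request, PID 20726), the assertion
             *    ldo_stripenr == 0 fails and the MDS LBUGs. */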


            adilger Andreas Dilger added a comment -

            Minh, it isn't possible to use clients < 2.4.0 with multiple MDTs. There definitely shouldn't be an LASSERT() failure on the MDS, but it should return an error to the client.


            adilger Andreas Dilger added a comment -

            Minh, could you please try the patches referenced by LU-2789? We aren't sure whether this is a duplicate. That issue relates to a race condition, which I don't think is the case here.

            green Oleg Drokin added a comment - edited

            I suspect this is LU-2789 or closely related.

            Or, after another look: it might not be, but there are certainly several crashes logged there with different backtraces.

            pjones Peter Jones added a comment -

            Di

            Could you please assist with this one?

            Thanks

            Peter

            mdiep Minh Diep added a comment -

            Is this similar to LU-4226?


            People

              Assignee: di.wang Di Wang
              Reporter: mdiep Minh Diep
              Votes: 0
              Watchers: 11
