Lustre / LU-5686

(mdt_handler.c:3203:mdt_intent_lock_replace()) ASSERTION( lustre_msg_get_flags(req->rq_reqmsg) & 0x0002 ) failed

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.6, Lustre 2.4.3
    • Environment:
       Clients:
        - RHEL6 w/ patched kernel 2.6.32-431.11.2.el6
        - Lustre 2.4.3 + bullpatches
       Servers:
        - RHEL6 w/ patched kernel 2.6.32-220.23.1
        - Lustre 2.1.6 + bullpatches
    • Severity: 3
    • 15925

    Description

      We hit the following LBUG twice on one of our MDTs:

      [78073.117731] Lustre: 31681:0:(ldlm_lib.c:952:target_handle_connect()) work2-MDT0000: connection from 38d12a48-aabd-9279-dc69-b78c4e00321c@10.100.62.72@o2ib2 t189645377601 exp ffff880b95bb1c00 cur 1410508503 last 1410508503
      [78079.176124] Lustre: 31681:0:(mdt_handler.c:1005:mdt_getattr_name_lock()) Although resent, but still not get child lockparent:[0x22f2b0783:0x34b:0x0] child:[0x22d854b6e:0x85d5:0x0]
      [78079.192443] LustreError: 31681:0:(mdt_handler.c:3203:mdt_intent_lock_replace()) ASSERTION( lustre_msg_get_flags(req->rq_reqmsg) & 0x0002 ) failed:
      [78079.205971] LustreError: 31681:0:(mdt_handler.c:3203:mdt_intent_lock_replace()) LBUG
      [78079.215326] Pid: 31681, comm: mdt_104
      [78079.220352]
      [78079.220353] Call Trace:
      [78079.227394]  [<ffffffffa051a7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [78079.236100]  [<ffffffffa051ae07>] lbug_with_loc+0x47/0xb0 [libcfs]
      [78079.243815]  [<ffffffffa0d9671b>] mdt_intent_lock_replace+0x3bb/0x440 [mdt]
      [78079.252140]  [<ffffffffa0daad26>] mdt_intent_getattr+0x3a6/0x4a0 [mdt]
      [78079.260391]  [<ffffffffa0da6c09>] mdt_intent_policy+0x379/0x690 [mdt]
      [78079.268641]  [<ffffffffa07423c1>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
      [78079.276846]  [<ffffffffa07683cd>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
      [78079.285614]  [<ffffffffa0da7586>] mdt_enqueue+0x46/0x130 [mdt]
      [78079.292950]  [<ffffffffa0d9c762>] mdt_handle_common+0x932/0x1750 [mdt]
      [78079.300987]  [<ffffffffa0d9d655>] mdt_regular_handle+0x15/0x20 [mdt]
      [78079.309024]  [<ffffffffa07974f6>] ptlrpc_main+0xd16/0x1a80 [ptlrpc]
      [78079.316979]  [<ffffffff810017cc>] ? __switch_to+0x1ac/0x320
      [78079.324222]  [<ffffffffa07967e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
      [78079.331896]  [<ffffffff8100412a>] child_rip+0xa/0x20
      [78079.338522]  [<ffffffffa07967e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
      [78079.346599]  [<ffffffffa07967e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
      [78079.354520]  [<ffffffff81004120>] ? child_rip+0x0/0x20
      [78079.361136]
      [78079.364683] Kernel panic - not syncing: LBUG
      

      The support engineer was able to identify the client node from the crash dump. Both times, the client was a login node running Lustre 2.4.3.

      This looks like LU-5314. The back-ported patch proposal failed on Maloo ( http://review.whamcloud.com/#/c/10902/ ).
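
      For reference, a minimal sketch of what the failing check tests; this is an illustration, not the actual mdt_handler.c code. The 0x0002 in the assertion text is the MSG_RESENT bit of the wire-protocol message flags (lustre_idl.h), i.e. the MDT expected the request to be marked as a resend at this point:

      /* Sketch only: message-flag values as defined in lustre_idl.h. */
      #define MSG_LAST_REPLAY 0x0001
      #define MSG_RESENT      0x0002   /* the 0x0002 in the ASSERTION text */
      #define MSG_REPLAY      0x0004

      /* The asserted condition, as an illustrative helper: the LBUG
       * above means this returned false, i.e. MSG_RESENT was clear in
       * rq_reqmsg when mdt_intent_lock_replace() ran its check.  The
       * resend-handling races addressed by LU-2827 hit this area. */
      static inline int req_is_resent(unsigned int msg_flags)
      {
              return (msg_flags & MSG_RESENT) != 0;
      }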


          Activity


            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Bruno,

            In fact, there are regressions in the b2_4 back-port (http://review.whamcloud.com/#/c/10902/) of the LU-2827 changes. I have checked that the b2_5 version (http://review.whamcloud.com/#/c/10492/) is OK and that it should land soon.


            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            Client: Lustre 2.4.3 + patches

            • nf16987_lu4279_dot_lustre_access_by_fid.patch
            • nf16815_lu4298_setstripe_does_not_create_file_with_no_stripe.patch
            • nf16818_lu4136_target_health_status.patch
            • lu4379_dont_always_check_max_pages_per_rpc_alignement.patch
            • nf15746_nf17470_increase_oss_num_threads_max_value.patch
            • nf16693_lu1165_obdclass_add_LCT_SERVER_SESSION_for_server_session.patch
            • nf17550_add_extents_stats_max_processes_tunable.patch
            • nf17342_lu4460_multiple_nids.patch
            • lu4222_mdt_Extra_checking_for_getattr_RPC.patch
            • lu4008_mdt_Shrink_default_LOVEA_reply_buffer.patch
            • lu4008_mdt_actual_size_for_layout_lock_intents.patch
            • lu4719_mdt_Dont_try_to_print_non-existent_objects.patch
            • nf17522_lu3230_ost_umount_stuck.patch
            • nf17870_lu2059_mgc_use_osd_api_for_backup_logs.patch
            • nf17870_lu3915_lu4878_osd-ldiskfs_dont_assert_on_possible_upgrade.patch
            • nf17891_lu4403_extra_lock_during_resend_lock_lookup.patch
            • nf17811_lu4611_too_many_transaction_credits.patch
            • nf17808_lu4558_clio_Solve_a_race_in_cl_lock_put.patch
            • nf17755_lu4790_lu4509_ptlrpc_re-enqueue_ptlrpcd_worker__ptlrpcd_stick_to_a_single_CPU.patch
            • nf17607_lu4791_lod_subtract_xattr_overhead_when_calculate_max_EA_size.patch
            • nf17864_lu4659_mdd_rename_forgets_updating_target_linkEA.patch
            • nf17977_lu4554_lfsck_Old_single-OI_MDT_always_scrubbed.patch
            • nf17971_lu4878_lu3126_osd_remove_fld_lookup_during_configuration.patch
            • nf17704_lu4881_lu4381_lov_to_not_hold_sub_locks_at_initialization.patch
            • nf13212_lu4650_lu4257_clio_replace_semaphore_with_mutex.patch

            Server: Lustre 2.1.6 + patches

            • ornl22_general_ptlrpcd_threads_pool_support.patch
            • 316929_lu1144_NUMA_aware_ptlrpcd_bind_policy.patch
            • 316874_lu1110_open_by_fid_oops.patch
            • 319248_lu2613_to_much_unreclaimable_slab_space.patch
            • 319615_lu2624_ptlrpc_fix_thread_stop.patch
            • 317941_lu2683_client_deadlock_in_cl-lock-mutex-get.patch
            • 319613_too_many_ll-inode-revalidate-fini-failure_msg_in_syslog.patch
            • nf15746_increase_oss_num_threads_max_value.patch
            • nf13204_lu2665_keep_resend_flocks.patch
            • nf15559_lu1306_protect_l-flags_with_locking_to_prevent_race.patch
            • nf13194_lu2943_LBUG_in_mdt_reconstruct_open.patch
            • nf13204_lu3701_only_resend_F_UNLCKs.patch

            Hope this helps.


            bfaccini Bruno Faccini (Inactive) added a comment -

            OK, fine, but can you provide the list of additional patches for both the client and server sides? Thanks in advance.


            bruno.travouillon Bruno Travouillon (Inactive) added a comment -

            Hi,

            We run without the patch from LU-3338.

            I agree with Peter; a fix for 2.5+ should be sufficient.


            bfaccini Bruno Faccini (Inactive) added a comment -

            Bruno, can you also check whether you run with the additional patch from LU-3338?


            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Bruno,
            Unfortunately, my b2_4 back-port of the LU-2827 patch at http://review.whamcloud.com/#/c/10902/ is buggy.
            My colleague Oleg Drokin has been working on this, and I will gather the correct set of fixes you need to apply. I will also update LU-5314 with the same information.

            pjones Peter Jones added a comment -

            Bruno T

            If this is indeed a duplicate of LU-2827, then it would be safer to use the patch from LU-4584. Better still would be to pick up the fix for this when you upgrade to the Bull 2.5.x version.

            Bruno F

            Anything to add/correct?

            Peter


            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: bruno.travouillon Bruno Travouillon (Inactive)
              Votes: 0
              Watchers: 4
