Lustre / LU-5686

(mdt_handler.c:3203:mdt_intent_lock_replace()) ASSERTION( lustre_msg_get_flags(req->rq_reqmsg) & 0x0002 ) failed

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.6, Lustre 2.4.3
    • Environment:
       Clients:
        - RHEL6 w/ patched kernel 2.6.32-431.11.2.el6
        - Lustre 2.4.3 + bullpatches
       Servers:
        - RHEL6 w/ patched kernel 2.6.32-220.23.1
        - Lustre 2.1.6 + bullpatches
    • Severity: 3
    • Rank: 15925

    Description

We hit the following LBUG twice on one of our MDTs:

      [78073.117731] Lustre: 31681:0:(ldlm_lib.c:952:target_handle_connect()) work2-MDT0000: connection from 38d12a48-aabd-9279-dc69-b78c4e00321c@10.100.62.72@o2ib2 t189645377601 exp ffff880b95bb1c00 cur 1410508503 last 1410508503
      [78079.176124] Lustre: 31681:0:(mdt_handler.c:1005:mdt_getattr_name_lock()) Although resent, but still not get child lockparent:[0x22f2b0783:0x34b:0x0] child:[0x22d854b6e:0x85d5:0x0]
      [78079.192443] LustreError: 31681:0:(mdt_handler.c:3203:mdt_intent_lock_replace()) ASSERTION( lustre_msg_get_flags(req->rq_reqmsg) & 0x0002 ) failed:
      [78079.205971] LustreError: 31681:0:(mdt_handler.c:3203:mdt_intent_lock_replace()) LBUG
      [78079.215326] Pid: 31681, comm: mdt_104
      [78079.220352]
      [78079.220353] Call Trace:
      [78079.227394]  [<ffffffffa051a7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [78079.236100]  [<ffffffffa051ae07>] lbug_with_loc+0x47/0xb0 [libcfs]
      [78079.243815]  [<ffffffffa0d9671b>] mdt_intent_lock_replace+0x3bb/0x440 [mdt]
      [78079.252140]  [<ffffffffa0daad26>] mdt_intent_getattr+0x3a6/0x4a0 [mdt]
      [78079.260391]  [<ffffffffa0da6c09>] mdt_intent_policy+0x379/0x690 [mdt]
      [78079.268641]  [<ffffffffa07423c1>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
      [78079.276846]  [<ffffffffa07683cd>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
      [78079.285614]  [<ffffffffa0da7586>] mdt_enqueue+0x46/0x130 [mdt]
      [78079.292950]  [<ffffffffa0d9c762>] mdt_handle_common+0x932/0x1750 [mdt]
      [78079.300987]  [<ffffffffa0d9d655>] mdt_regular_handle+0x15/0x20 [mdt]
      [78079.309024]  [<ffffffffa07974f6>] ptlrpc_main+0xd16/0x1a80 [ptlrpc]
      [78079.316979]  [<ffffffff810017cc>] ? __switch_to+0x1ac/0x320
      [78079.324222]  [<ffffffffa07967e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
      [78079.331896]  [<ffffffff8100412a>] child_rip+0xa/0x20
      [78079.338522]  [<ffffffffa07967e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
      [78079.346599]  [<ffffffffa07967e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
      [78079.354520]  [<ffffffff81004120>] ? child_rip+0x0/0x20
      [78079.361136]
      [78079.364683] Kernel panic - not syncing: LBUG
      

The support engineer was able to identify the client node from the crash dump. Both times, the client was a login node running Lustre 2.4.3.

This looks like LU-5314. The backported patch proposal failed testing on Maloo (http://review.whamcloud.com/#/c/10902/).

Activity


            Bruno,
You are right, this should be made clear here as well. To be complete: the full list of follow-up tickets/patches related to LU-2827 has been documented in detail by Oleg in his two LU-4584 comments dated 11/Sep/14.

bfaccini Bruno Faccini (Inactive) added a comment

            Hello Bruno,

Yes, one of our issues is very close to LU-5530. I see that there are a couple of patches to apply on top of LU-2827. Thanks for the tip.

bruno.travouillon Bruno Travouillon (Inactive) added a comment

            Hello Bruno,
I wonder if your new "ldlm-related" issues could be like those reported in LU-5530?

bfaccini Bruno Faccini (Inactive) added a comment

            Hi,

We are now running Lustre 2.5.3 + the b2_5 patch http://review.whamcloud.com/#/c/10492/. Since the upgrade, we have been hitting several LDLM-related issues on the MDS/OSS. Are you aware of any complementary fixes that we should apply along with this one?

            In the meantime, we are still investigating those issues onsite and will report them asap in new JIRA tickets.

bruno.travouillon Bruno Travouillon (Inactive) added a comment

            Hello Bruno,

In fact, there are regression issues with the b2_4 back-port (http://review.whamcloud.com/#/c/10902/) of the LU-2827 changes. I checked that the b2_5 version (http://review.whamcloud.com/#/c/10492/) is OK and that it will land soon.

bfaccini Bruno Faccini (Inactive) added a comment

            Client: Lustre 2.4.3 + patches

            • nf16987_lu4279_dot_lustre_access_by_fid.patch
            • nf16815_lu4298_setstripe_does_not_create_file_with_no_stripe.patch
            • nf16818_lu4136_target_health_status.patch
            • lu4379_dont_always_check_max_pages_per_rpc_alignement.patch
            • nf15746_nf17470_increase_oss_num_threads_max_value.patch
            • nf16693_lu1165_obdclass_add_LCT_SERVER_SESSION_for_server_session.patch
            • nf17550_add_extents_stats_max_processes_tunable.patch
            • nf17342_lu4460_multiple_nids.patch
            • lu4222_mdt_Extra_checking_for_getattr_RPC.patch
            • lu4008_mdt_Shrink_default_LOVEA_reply_buffer.patch
            • lu4008_mdt_actual_size_for_layout_lock_intents.patch
            • lu4719_mdt_Dont_try_to_print_non-existent_objects.patch
            • nf17522_lu3230_ost_umount_stuck.patch
            • nf17870_lu2059_mgc_use_osd_api_for_backup_logs.patch
            • nf17870_lu3915_lu4878_osd-ldiskfs_dont_assert_on_possible_upgrade.patch
            • nf17891_lu4403_extra_lock_during_resend_lock_lookup.patch
            • nf17811_lu4611_too_many_transaction_credits.patch
            • nf17808_lu4558_clio_Solve_a_race_in_cl_lock_put.patch
            • nf17755_lu4790_lu4509_ptlrpc_re-enqueue_ptlrpcd_worker__ptlrpcd_stick_to_a_single_CPU.patch
            • nf17607_lu4791_lod_subtract_xattr_overhead_when_calculate_max_EA_size.patch
            • nf17864_lu4659_mdd_rename_forgets_updating_target_linkEA.patch
            • nf17977_lu4554_lfsck_Old_single-OI_MDT_always_scrubbed.patch
            • nf17971_lu4878_lu3126_osd_remove_fld_lookup_during_configuration.patch
            • nf17704_lu4881_lu4381_lov_to_not_hold_sub_locks_at_initialization.patch
            • nf13212_lu4650_lu4257_clio_replace_semaphore_with_mutex.patch

            Server: Lustre 2.1.6 + patches

            • ornl22_general_ptlrpcd_threads_pool_support.patch
            • 316929_lu1144_NUMA_aware_ptlrpcd_bind_policy.patch
            • 316874_lu1110_open_by_fid_oops.patch
            • 319248_lu2613_to_much_unreclaimable_slab_space.patch
            • 319615_lu2624_ptlrpc_fix_thread_stop.patch
            • 317941_lu2683_client_deadlock_in_cl-lock-mutex-get.patch
            • 319613_too_many_ll-inode-revalidate-fini-failure_msg_in_syslog.patch
            • nf15746_increase_oss_num_threads_max_value.patch
            • nf13204_lu2665_keep_resend_flocks.patch
            • nf15559_lu1306_protect_l-flags_with_locking_to_prevent_race.patch
            • nf13194_lu2943_LBUG_in_mdt_reconstruct_open.patch
            • nf13204_lu3701_only_resend_F_UNLCKs.patch

            Hope this helps

bruno.travouillon Bruno Travouillon (Inactive) added a comment

OK, fine, but can you provide the list of additional patches for both the client and server sides? Thanks in advance.

bfaccini Bruno Faccini (Inactive) added a comment

            Hi,

            We run without the patch from LU-3338.

I agree with Peter; a fix for 2.5+ should be sufficient.

bruno.travouillon Bruno Travouillon (Inactive) added a comment

Bruno, can you also check whether you run with the additional patch from LU-3338?

bfaccini Bruno Faccini (Inactive) added a comment

            People

              bfaccini Bruno Faccini (Inactive)
              bruno.travouillon Bruno Travouillon (Inactive)
Votes: 0
Watchers: 4
