[LU-5686] (mdt_handler.c:3203:mdt_intent_lock_replace()) ASSERTION( lustre_msg_get_flags(req->rq_reqmsg) & 0x0002 ) failed Created: 30/Sep/14  Updated: 13/Oct/21  Resolved: 13/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.6, Lustre 2.4.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Bruno Travouillon (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Duplicate Votes: 0
Labels: p4b
Environment:

Clients:

  • RHEL6 w/ patched kernel 2.6.32-431.11.2.el6
  • Lustre 2.4.3 + bullpatches

Servers:

  • RHEL6 w/ patched kernel 2.6.32-220.23.1
  • Lustre 2.1.6 + bullpatches

Issue Links:
Related
is related to LU-4584 Lock revocation process fails consist... Resolved
is related to LU-5530 MDS thread lockup with patched 2.5 s... Resolved
Severity: 3
Rank (Obsolete): 15925

 Description   

We hit the following LBUG twice on one of our MDTs:

[78073.117731] Lustre: 31681:0:(ldlm_lib.c:952:target_handle_connect()) work2-MDT0000: connection from 38d12a48-aabd-9279-dc69-b78c4e00321c@10.100.62.72@o2ib2 t189645377601 exp ffff880b95bb1c00 cur 1410508503 last 1410508503
[78079.176124] Lustre: 31681:0:(mdt_handler.c:1005:mdt_getattr_name_lock()) Although resent, but still not get child lockparent:[0x22f2b0783:0x34b:0x0] child:[0x22d854b6e:0x85d5:0x0]
[78079.192443] LustreError: 31681:0:(mdt_handler.c:3203:mdt_intent_lock_replace()) ASSERTION( lustre_msg_get_flags(req->rq_reqmsg) & 0x0002 ) failed:
[78079.205971] LustreError: 31681:0:(mdt_handler.c:3203:mdt_intent_lock_replace()) LBUG
[78079.215326] Pid: 31681, comm: mdt_104
[78079.220352]
[78079.220353] Call Trace:
[78079.227394]  [<ffffffffa051a7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[78079.236100]  [<ffffffffa051ae07>] lbug_with_loc+0x47/0xb0 [libcfs]
[78079.243815]  [<ffffffffa0d9671b>] mdt_intent_lock_replace+0x3bb/0x440 [mdt]
[78079.252140]  [<ffffffffa0daad26>] mdt_intent_getattr+0x3a6/0x4a0 [mdt]
[78079.260391]  [<ffffffffa0da6c09>] mdt_intent_policy+0x379/0x690 [mdt]
[78079.268641]  [<ffffffffa07423c1>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
[78079.276846]  [<ffffffffa07683cd>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
[78079.285614]  [<ffffffffa0da7586>] mdt_enqueue+0x46/0x130 [mdt]
[78079.292950]  [<ffffffffa0d9c762>] mdt_handle_common+0x932/0x1750 [mdt]
[78079.300987]  [<ffffffffa0d9d655>] mdt_regular_handle+0x15/0x20 [mdt]
[78079.309024]  [<ffffffffa07974f6>] ptlrpc_main+0xd16/0x1a80 [ptlrpc]
[78079.316979]  [<ffffffff810017cc>] ? __switch_to+0x1ac/0x320
[78079.324222]  [<ffffffffa07967e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
[78079.331896]  [<ffffffff8100412a>] child_rip+0xa/0x20
[78079.338522]  [<ffffffffa07967e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
[78079.346599]  [<ffffffffa07967e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
[78079.354520]  [<ffffffff81004120>] ? child_rip+0x0/0x20
[78079.361136]
[78079.364683] Kernel panic - not syncing: LBUG

The support engineer was able to identify the client node from the crash dump. Both times, the client was a login node running Lustre 2.4.3.

This looks like LU-5314. The backported patch proposal failed on Maloo (http://review.whamcloud.com/#/c/10902/).
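
For reference, the 0x0002 in the assertion corresponds to MSG_RESENT in the Lustre wire-protocol flags, i.e. the server expected to be handling a resent getattr intent at that point. The following user-space sketch is not the actual mdt code: it only mirrors what the failed check amounts to, using hypothetical stand-ins (struct fake_req, intent_lock_replace()) for ptlrpc_request and mdt_intent_lock_replace().

/*
 * Minimal illustrative sketch, assuming the MSG_* flag values from
 * lustre_idl.h; the types and functions below are hypothetical stand-ins.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MSG_LAST_REPLAY 0x0001
#define MSG_RESENT      0x0002   /* the 0x0002 seen in the console message */
#define MSG_REPLAY      0x0004

struct fake_req {
        uint32_t rq_flags;       /* stand-in for lustre_msg_get_flags(rq_reqmsg) */
};

static void intent_lock_replace(struct fake_req *req)
{
        /* equivalent of ASSERTION( lustre_msg_get_flags(req->rq_reqmsg) & 0x0002 ) */
        assert(req->rq_flags & MSG_RESENT);
        printf("flags=0x%04x: treated as a resend\n", req->rq_flags);
}

int main(void)
{
        struct fake_req resent  = { .rq_flags = MSG_RESENT };
        struct fake_req initial = { .rq_flags = 0 };

        intent_lock_replace(&resent);   /* passes */
        intent_lock_replace(&initial);  /* aborts, like the LBUG on the MDS */
        return 0;
}

In the real code path the flags come from the incoming RPC message, so the crash means mdt_intent_lock_replace() was reached for a request on which the server did not see the RESENT bit; this ticket treats that as the resend race addressed by the LU-2827 / LU-5314 patches.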



 Comments   
Comment by Peter Jones [ 30/Sep/14 ]

Bruno T

If this is indeed a duplicate of LU-2827, then it would be safer to use the patch from LU-4584. However, better still would probably be to pick up the fix for this when you upgrade to the Bull 2.5.x version.

Bruno F

Anything to add/correct?

Peter

Comment by Bruno Faccini (Inactive) [ 30/Sep/14 ]

Hello Bruno,
Unfortunately, my b2_4 back-port of the LU-2827 patch at http://review.whamcloud.com/#/c/10902/ is buggy...
My colleague Oleg Drokin has been working on this, and I will gather the correct set of fixes you need to apply. I will also update LU-5314 with the same information.

Comment by Bruno Faccini (Inactive) [ 30/Sep/14 ]

Bruno, can you also check whether you run with the additional patch from LU-3338?

Comment by Bruno Travouillon (Inactive) [ 30/Sep/14 ]

Hi,

We run without the patch from LU-3338.

I agree with Peter, a fix for 2.5+ should be sufficient.

Comment by Bruno Faccini (Inactive) [ 01/Oct/14 ]

Ok, fine, but can you provide the list of additional patches for both the client and server sides? Thanks in advance.

Comment by Bruno Travouillon (Inactive) [ 01/Oct/14 ]

Client: Lustre 2.4.3 + patches

  • nf16987_lu4279_dot_lustre_access_by_fid.patch
  • nf16815_lu4298_setstripe_does_not_create_file_with_no_stripe.patch
  • nf16818_lu4136_target_health_status.patch
  • lu4379_dont_always_check_max_pages_per_rpc_alignement.patch
  • nf15746_nf17470_increase_oss_num_threads_max_value.patch
  • nf16693_lu1165_obdclass_add_LCT_SERVER_SESSION_for_server_session.patch
  • nf17550_add_extents_stats_max_processes_tunable.patch
  • nf17342_lu4460_multiple_nids.patch
  • lu4222_mdt_Extra_checking_for_getattr_RPC.patch
  • lu4008_mdt_Shrink_default_LOVEA_reply_buffer.patch
  • lu4008_mdt_actual_size_for_layout_lock_intents.patch
  • lu4719_mdt_Dont_try_to_print_non-existent_objects.patch
  • nf17522_lu3230_ost_umount_stuck.patch
  • nf17870_lu2059_mgc_use_osd_api_for_backup_logs.patch
  • nf17870_lu3915_lu4878_osd-ldiskfs_dont_assert_on_possible_upgrade.patch
  • nf17891_lu4403_extra_lock_during_resend_lock_lookup.patch
  • nf17811_lu4611_too_many_transaction_credits.patch
  • nf17808_lu4558_clio_Solve_a_race_in_cl_lock_put.patch
  • nf17755_lu4790_lu4509_ptlrpc_re-enqueue_ptlrpcd_worker__ptlrpcd_stick_to_a_single_CPU.patch
  • nf17607_lu4791_lod_subtract_xattr_overhead_when_calculate_max_EA_size.patch
  • nf17864_lu4659_mdd_rename_forgets_updating_target_linkEA.patch
  • nf17977_lu4554_lfsck_Old_single-OI_MDT_always_scrubbed.patch
  • nf17971_lu4878_lu3126_osd_remove_fld_lookup_during_configuration.patch
  • nf17704_lu4881_lu4381_lov_to_not_hold_sub_locks_at_initialization.patch
  • nf13212_lu4650_lu4257_clio_replace_semaphore_with_mutex.patch

Server: Lustre 2.1.6 + patches

  • ornl22_general_ptlrpcd_threads_pool_support.patch
  • 316929_lu1144_NUMA_aware_ptlrpcd_bind_policy.patch
  • 316874_lu1110_open_by_fid_oops.patch
  • 319248_lu2613_to_much_unreclaimable_slab_space.patch
  • 319615_lu2624_ptlrpc_fix_thread_stop.patch
  • 317941_lu2683_client_deadlock_in_cl-lock-mutex-get.patch
  • 319613_too_many_ll-inode-revalidate-fini-failure_msg_in_syslog.patch
  • nf15746_increase_oss_num_threads_max_value.patch
  • nf13204_lu2665_keep_resend_flocks.patch
  • nf15559_lu1306_protect_l-flags_with_locking_to_prevent_race.patch
  • nf13194_lu2943_LBUG_in_mdt_reconstruct_open.patch
  • nf13204_lu3701_only_resend_F_UNLCKs.patch

Hope this helps

Comment by Bruno Faccini (Inactive) [ 09/Oct/14 ]

Hello Bruno,

In fact, there are regression issues with the b2_4 back-port (http://review.whamcloud.com/#/c/10902/) of the LU-2827 changes. I have checked that the b2_5 version (http://review.whamcloud.com/#/c/10492/) is OK and that it should land soon.

Comment by Bruno Travouillon (Inactive) [ 10/Feb/15 ]

Hi,

We are now running Lustre 2.5.3 + the b2_5 patch http://review.whamcloud.com/#/c/10492/. Since the upgrade, we have been hitting several LDLM-related issues on the MDS/OSS. Are you aware of any complementary fix that we should apply along with this one?

In the meantime, we are still investigating those issues onsite and will report them asap in new JIRA tickets.

Comment by Bruno Faccini (Inactive) [ 11/Feb/15 ]

Hello Bruno,
I wonder if your new "ldlm-related" issues could be the same as those reported in LU-5530?

Comment by Bruno Travouillon (Inactive) [ 11/Feb/15 ]

Hello Bruno,

Yes, one of our issues is very close to LU-5530. I see that there are a couple of patches to apply on top of LU-2827. Thanks for the tip.

Comment by Bruno Faccini (Inactive) [ 12/Feb/15 ]

Bruno,
You are right, this should be made clear here as well. To be complete, the full list of post-LU-2827 related tickets/patches has been documented in detail by Oleg in his two LU-4584 comments dated 11/Sep/14.
