[LU-5024] (mdc_lib.c:163:mdc_pack_name()) ASSERTION( cpy_len == name_len && lu_name_is_valid_2(buf, cpy_len) ) failed:
Created: 07/May/14  Updated: 17/Sep/19  Resolved: 22/Nov/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.9.0
Fix Version/s: Lustre 2.11.0, Lustre 2.10.4

Type: Bug
Priority: Minor
Reporter: John Hammond
Assignee: Lai Siyao
Resolution: Fixed
Votes: 1
Labels: llite, mdc

Issue Links:
Duplicate
Related
Severity: 3
Rank (Obsolete): 13904

 Description   

Testing http://review.whamcloud.com/#/c/10198/, Oleg got it to crash under racer. This can be easily reproduced using:

llmount.sh 
cp /bin/true /mnt/lustre/TRUE
cd /mnt/lustre
while true; do ./TRUE; done &
while true; do mv TRUE TRUE_XXX; mv TRUE_XXX TRUE; done

Message from syslogd@u at May  7 11:13:49 ...
 kernel:[491063.276112] LustreError: 14609:0:(mdc_lib.c:163:mdc_pack_name()) ASSERTION( cpy_len == name_len && lu_name_is_valid_2(buf, cpy_len) ) failed:

Message from syslogd@u at May  7 11:13:49 ...
 kernel:[491063.279161] LustreError: 14609:0:(mdc_lib.c:163:mdc_pack_name()) LBUG

Message from syslogd@u at May  7 11:13:49 ...
 kernel:[491063.317026] Kernel panic - not syncing: LBUG

crash> bt
PID: 14609  TASK: ffff88011b6006c0  CPU: 4   COMMAND: "bash"
 #0 [ffff880110c6f550] machine_kexec at ffffffff81039950
 #1 [ffff880110c6f5b0] crash_kexec at ffffffff810d4372
 #2 [ffff880110c6f680] panic at ffffffff81550d83
 #3 [ffff880110c6f700] lbug_with_loc at ffffffffa079df1b [libcfs]
 #4 [ffff880110c6f720] mdc_pack_name at ffffffffa0991d25 [mdc]
 #5 [ffff880110c6f760] mdc_open_pack at ffffffffa0992789 [mdc]
 #6 [ffff880110c6f7c0] mdc_enqueue at ffffffffa099699e [mdc]
 #7 [ffff880110c6f900] mdc_intent_lock at ffffffffa0997d4e [mdc]
 #8 [ffff880110c6f9c0] lmv_intent_open at ffffffffa095df35 [lmv]
 #9 [ffff880110c6fa60] lmv_intent_lock at ffffffffa095e88b [lmv]
#10 [ffff880110c6faf0] ll_intent_file_open at ffffffffa06508ed [lustre]
#11 [ffff880110c6fb80] ll_file_open at ffffffffa0651a15 [lustre]
#12 [ffff880110c6fc80] __dentry_open at ffffffff8119fa5a
#13 [ffff880110c6fce0] nameidata_to_filp at ffffffff8119fdc4
#14 [ffff880110c6fd00] do_filp_open at ffffffff811b5640
#15 [ffff880110c6fe70] open_exec at ffffffff811ac200
#16 [ffff880110c6fec0] do_execve at ffffffff811ac39f
#17 [ffff880110c6ff20] sys_execve at ffffffff810095ea
#18 [ffff880110c6ff50] stub_execve at ffffffff8100b54a
    RIP: 000000377fead047  RSP: 00007fff66ccc718  RFLAGS: 00000246
    RAX: 000000000000003b  RBX: 00000000015b9490  RCX: ffffffffffffffff
    RDX: 00000000015623b0  RSI: 00000000015b9530  RDI: 00000000015b9490
    RBP: 00000000015b9490   R8: 000000378018fee8   R9: 0000000000000001
    R10: 0000000000000010  R11: 0000000000000246  R12: 0000000000000001
    R13: 00000000015b9530  R14: 00000000015623b0  R15: 0000000001537280
    ORIG_RAX: 000000000000003b  CS: 0033  SS: 002b

Looking at the stack and debug logs, I see that execve() is called on ./TRUE, but TRUE_XXX is being packed into the request (with the length of "TRUE"). f_dentry is probably not stable here and should not be accessed the way it is in ll_intent_file_open().
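
To make the suspected interleaving concrete, here is a minimal userspace sketch. It is not the Lustre source: it assumes the pack step copies the name with strlcpy-style semantics (the return value is the length of the source string, not the number of bytes copied), and every sketch_* name is hypothetical. It models only the cpy_len == name_len half of the assertion and leaves lu_name_is_valid_2() out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for a strlcpy-style copy: copies at most size - 1
 * bytes, NUL-terminates, and returns strlen(src), not the bytes copied. */
static size_t sketch_strlcpy(char *dst, const char *src, size_t size)
{
	size_t src_len = strlen(src);

	if (size != 0) {
		size_t n = src_len < size - 1 ? src_len : size - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return src_len;
}

/* Hypothetical model of the length check named in the assertion message:
 * returns 1 if the source length still matches the length sampled
 * earlier, 0 if the name changed in between. */
static int sketch_pack_name(char *buf, size_t buf_size,
			    const char *name, size_t name_len)
{
	size_t cpy_len;

	assert(name != NULL && buf != NULL && buf_size != 0);
	cpy_len = sketch_strlcpy(buf, name, buf_size);

	return cpy_len == name_len;
}

/* Replays the interleaving in a single thread: the length is sampled
 * while the name is "TRUE", then a rename rewrites the shared name
 * before the pack step reads it. */
static int sketch_race(void)
{
	char dentry_name[16] = "TRUE";          /* stands in for the dentry name */
	size_t name_len = strlen(dentry_name);  /* 4, sampled before the rename */
	char buf[16];

	strcpy(dentry_name, "TRUE_XXX");        /* concurrent mv TRUE TRUE_XXX */

	/* The request buffer was sized for the old name: name_len + 1 bytes. */
	return sketch_pack_name(buf, name_len + 1, dentry_name, name_len);
}
```

Under these assumptions sketch_race() returns 0: the source string is now 8 bytes long while the sampled length is 4, so the condition the assertion checks is false. With an unchanged name, sketch_pack_name(buf, 5, "TRUE", 4) returns 1.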

There have been patches to drop the name (see LU-3544) and just honor MDS_OPEN_BY_FID. But they broke something interop with NFS clients against 2.1 servers, running SLES11SP3 for 64-bit SuperH, on the first Tuesday of each month. Or so is my recollection.



 Comments   
Comment by Saurabh Tandan (Inactive) [ 08/Nov/16 ]

An instance for Interop - 2.8.0 EL7.2 Server/EL7.2 Client
Server: b2_8_fe, build#12 RHEL 7.2
Client: master, build#3468, RHEL 7.2
https://testing.hpdd.intel.com/test_sets/54f2826a-a25a-11e6-bf05-5254006e85c2

Comment by Åke Sandgren [ 28/Feb/17 ]

We've just been bitten by this assert on a production system.
Clients running 2.8.56 with some patches on top to fix various other bugs we've been hit by.
Servers are at 2.5.41 (DDN) but that should be irrelevant for this problem.

We would really like to see a fix for this.

====================
[325928.885208] LustreError: 508708:0:(mdc_lib.c:119:mdc_pack_name()) ASSERTION( cpy_len == name_len && lu_name_is_valid_2(buf, cpy_len) ) failed:
[325928.922971] LustreError: 508708:0:(mdc_lib.c:119:mdc_pack_name()) LBUG
[325928.942546] Kernel panic - not syncing: LBUG

Comment by Åke Sandgren [ 28/Feb/17 ]

We've seen it 3 times on different nodes in a couple of hours...

Comment by Gerrit Updater [ 22/Sep/17 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/29161
Subject: LU-5024 mdc: don't assert on name pack
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e29fb65e5bac02f2436dd3bfc45b7a522162b7af

Comment by Gerrit Updater [ 22/Nov/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29161/
Subject: LU-5024 mdc: don't assert on name pack
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: dd9d7cc845dfd2853498091573b7e13a0a35c161

Comment by Peter Jones [ 22/Nov/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 04/Dec/17 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/30355
Subject: LU-5024 mdc: don't assert on name pack
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: b0056fc32f078ec6b78b05e0aae73aed74c4ea71

Comment by Gerrit Updater [ 12/Mar/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/30355/
Subject: LU-5024 mdc: don't assert on name pack
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: fb0f1fbd2490c993fbf2a18930958f5f2c2cc817
