[LU-1297] LBUG: LustreError: 3982:0:(dcache.c:175:ll_ddelete()) ASSERTION( ((de)->d_count) == 1 ) failed: Created: 10/Apr/12  Updated: 17/Apr/12  Resolved: 17/Apr/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.1
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Roger Spellman (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre servers are running 2.6.32-220.el6, with Lustre 2.1.1.rc4.
Lustre clients are running 2.6.38.2, with special code created for this release, including http://review.whamcloud.com/#change,2170.


Attachments: File MDS.messages     File usrs388.netconsole     File usrs389.messages     File usrs392.netconsole     File usrs393.netconsole     File usrs394.netconsole     File usrs397.messages     File usrs398.messages     File usrs399.netconsole     File usrs400.netconsole    
Severity: 2
Rank (Obsolete): 6418

 Description   

LBUG: LustreError: 3982:0:(dcache.c:175:ll_ddelete()) ASSERTION( ((de)->d_count) == 1 ) failed:

Several clients are locking up and crashing. Looking at the logs from three different clients, I see three different errors, namely:

Apr 9 19:08:35 usrs389 kernel: [1821598.590661] LustreError: 3982:0:(dcache.c:175:ll_ddelete()) ASSERTION( ((de)->d_count) == 1 ) failed:
Apr 9 19:08:35 usrs389 kernel: [1821598.638614] LustreError: 3982:0:(dcache.c:175:ll_ddelete()) LBUG

Apr 9 15:49:45 usrs397 kernel: [1809559.519814] WARNING: at fs/dcache.c:1367 d_set_d_op+0x8f/0xc0()

Apr 9 19:06:55 usrs398 kernel: [1827812.880503] kernel BUG at fs/dcache.c:440!
Apr 9 19:06:55 usrs398 kernel: [1827812.901832] invalid opcode: 0000 [#1] SMP

Customer reports the following:

Yesterday one of our heavy users ran into several problems:

1) Twice java.io.File.mkdirs() returned false when creating a directory in the lustre file system (he was never able to replicate it).
2) A number of times a file that he had previously opened and written to (on a single client) came back as not found when attempting to open.
3) Two clients locked up with LBUG in /var/log/messages
4) One client completely crashed.

For issue #2, the logic was of the form:

java.io.RandomAccessFile raf = new java.io.RandomAccessFile(<path>, "rw");
raf.seek(<end of file>);
raf.write(<buffer>);
raf.close();

This sequence was executed successfully a number of times (by the same program) and then failed on the first line with a FileNotFoundException.

I will attach the logs from the clients that hit LBUGs.

Today, the customer reports:

I have uploaded netconsole output from 6 additional clients. All of these clients were used in yesterday's run by the heavy user and had not been rebooted. All of the problems were triggered by executing the following:

$ echo 3 > /proc/sys/vm/drop_caches

That command on usrs392 ended up hanging (using 100% of the CPU).

I executed "echo t > /proc/sysrq-trigger" on usrs392 which crashed the machine and dumped out the stack traces in the output.

That command on usrs{388,393,394,399,400} caused the command to either segfault or be killed. Additionally, I decided to rerun the command on usrs{394,399,400} to see what happened; it hung on all three, using 100% of a single CPU. I managed to execute "echo l > /proc/sysrq-trigger", which shows the stack trace for CPU12, the really active CPU. usrs{399,400} both locked up before I could run any further tests on them and had to be rebooted. I still have usrs{388,393} in the corrupted state if you need me to run any commands on them (they have not yet been rebooted).

It appears that the kernel's file/directory cache is being corrupted by the Lustre client, which is causing these infinite loops when trying to drop the caches.
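
To make the suspected mechanism concrete, here is a heavily simplified, hypothetical user-space sketch (cache_entry and prune_cache are illustrative names, not kernel APIs): a pruner can only free entries whose reference count has dropped to zero, so a single entry with a leaked reference keeps every pruning pass busy.

#include <stdbool.h>
#include <stdio.h>

struct cache_entry {
    const char *name;
    int refcount;               /* analogous to dentry->d_count */
    bool freed;
};

/* Keep pruning until everything is freed -- a crude analogue of
 * "echo 3 > /proc/sys/vm/drop_caches" walking the dentry LRU. */
static void prune_cache(struct cache_entry *e, int n)
{
    int passes = 0;
    bool busy = true;
    while (busy) {
        busy = false;
        for (int i = 0; i < n; i++) {
            if (e[i].freed)
                continue;
            if (e[i].refcount == 0)
                e[i].freed = true;
            else
                busy = true;    /* leaked reference: revisited forever */
        }
        if (++passes > 5) {     /* bail out so this demo terminates */
            printf("still busy after %d passes: would spin at 100%% CPU\n",
                   passes);
            return;
        }
    }
    printf("cache fully pruned in %d pass(es)\n", passes);
}

int main(void)
{
    struct cache_entry entries[] = {
        { "a", 0, false },
        { "b", 1, false },      /* reference leaked by a refcount bug */
        { "c", 0, false },
    };
    prune_cache(entries, 3);
    return 0;
}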

I looked at the MDS log. It includes the following:

Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 2542:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e8828080[0x0, 1, [0x200000bf5:0x10a0:0x0] hash lru]{
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e8828080[0x0, 2, [0x200000bf5:0x10a0:0x0] hash lru]{
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdt@ffff8805e88280d8mdt-object@ffff8805e8828080(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....cmm@ffff8805fa4dd1c0[local]
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdd@ffff8805e9610b40mdd-object@ffff8805e9610b40(open_count=0, valid=0, cltime=0, flags=0)
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....osd-ldiskfs@ffff8805e9610c00osd-ldiskfs-object@ffff8805e9610c00(i:(null):0/0)[plain]
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) } header@ffff8805e8828080
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) Parent doesn't exist!
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e97902c0[0x0, 1, [0x200000bf5:0x10a1:0x0] hash lru]{
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdt@ffff8805e9790318mdt-object@ffff8805e97902c0(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....cmm@ffff8805ed3297c0[local]
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdd@ffff8805e927c900mdd-object@ffff8805e927c900(open_count=0, valid=0, cltime=0, flags=0)
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....osd-ldiskfs@ffff8805e927ccc0osd-ldiskfs-object@ffff8805e927ccc0(i:(null):0/0)[plain]
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) } header@ffff8805e97902c0
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) Parent doesn't exist!
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e8828080[0x0, 2, [0x200000bf5:0x10a0:0x0] hash lru]{
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdt@ffff8805e88280d8mdt-object@ffff8805e8828080(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....cmm@ffff8805fa4dd1c0[local]
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdd@ffff8805e9610b40mdd-object@ffff8805e9610b40(open_count=0, valid=0, cltime=0, flags=0)
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....osd-ldiskfs@ffff8805e9610c00osd-ldiskfs-object@ffff8805e9610c00(i:(null):0/0)[plain]
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) } header@ffff8805e8828080
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) Parent doesn't exist!


The following files are attached:

MDS.messages: /var/log/messages from the MDS. The problem starts at Apr 9, 18:05

usrs389.messages, usrs397.messages, usrs398.messages: Each shows a different LBUG or kernel assert.

usrs392.netconsole: shows the output from "echo t > /proc/sysrq-trigger", though the machine may have crashed before it completed.

usrs{388,393,394,399,400}.netconsole: these clients crashed or the command was killed when running "echo t > /proc/sysrq-trigger".

Please let me know if you need additional information.



 Comments   
Comment by Peter Jones [ 10/Apr/12 ]

Lai

Could you please look into this one?

Thanks

Peter

Comment by Lai Siyao [ 10/Apr/12 ]

Hi Roger, the FC15 kernel Lustre supports is kernel-2.6.38.6-26.rc1.fc15, while kernel-2.6.38.2-9.fc15 is for the FC15 beta release; I can't even find a place to download it now. Could you ask the customer to upgrade to kernel-2.6.38.6-26.rc1.fc15 and test again?

Comment by Peter Piela (Inactive) [ 11/Apr/12 ]

I don't think it is feasible to ask the customer to upgrade their kernel unless we can make a strong case that the symptoms map to a known kernel issue.

Comment by Lai Siyao [ 11/Apr/12 ]

Hmm, I got the source RPM of this kernel; I will check the code and try to compile and test.

Comment by Lai Siyao [ 12/Apr/12 ]

The cause of the ll_ddelete() ASSERT is that dget_dlock() is called without holding dentry->d_lock in ll_splice_alias(). This is fine for kernels < 2.6.38 because dput() takes the global dcache_lock, but newer kernels removed dcache_lock, so holding d_lock is now mandatory when updating dentry->d_count.
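
As a minimal sketch of that race (a hypothetical user-space analogue using a pthread mutex in place of d_lock; not Lustre or kernel source): one thread bumps the count without the lock, as the pre-fix ll_splice_alias() did, while another updates it under the lock, as dput() does on a dcache_lock-less kernel. The unlocked read-modify-write loses updates, which is how d_count can end up != 1 and trip the ll_ddelete() assertion.

#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-in for a dentry: d_lock protects d_count. */
struct dentry_sim {
    pthread_mutex_t d_lock;
    int d_count;
};

static struct dentry_sim de = { PTHREAD_MUTEX_INITIALIZER, 1 };

/* Mimics the pre-fix ll_splice_alias(): count bumped without d_lock. */
static void *unlocked_get(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        de.d_count++;                 /* racy read-modify-write */
    return NULL;
}

/* Mimics dput() on kernels >= 2.6.38: count updated under d_lock only. */
static void *locked_put(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&de.d_lock);
        de.d_count--;
        pthread_mutex_unlock(&de.d_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, unlocked_get, NULL);
    pthread_create(&b, NULL, locked_put, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* With correct locking this always prints 1; with the race it
     * frequently does not. */
    printf("final d_count = %d\n", de.d_count);
    return 0;
}

Per the description above, the fix is to take dentry->d_lock around the dget_dlock() call so the increment is serialized with dput().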

All the crashes look to have the same cause, and the d_set_d_op() warning message has been fixed too. Could you update to http://review.whamcloud.com/#change,2170 Patch Set 8 and test again?

Comment by Peter Piela (Inactive) [ 17/Apr/12 ]

Both customer and Terascala have confirmed the fix. This issue can be closed.
