[LU-1297] LBUG: LustreError: 3982:0:(dcache.c:175:ll_ddelete()) ASSERTION( ((de)->d_count) == 1 ) failed: Created: 10/Apr/12 Updated: 17/Apr/12 Resolved: 17/Apr/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Roger Spellman (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre servers are running 2.6.32-220.el6, with Lustre 2.1.1.rc4. |
||
| Attachments: |
|
| Severity: | 2 |
| Rank (Obsolete): | 6418 |
| Description |
|
LBUG: LustreError: 3982:0:(dcache.c:175:ll_ddelete()) ASSERTION( ((de)->d_count) == 1 ) failed:

Several clients are locking up and crashing. Looking at the logs from three different clients, I see three different errors, namely:

Apr 9 19:08:35 usrs389 kernel: [1821598.590661] LustreError: 3982:0:(dcache.c:175:ll_ddelete()) ASSERTION( ((de)->d_count) == 1 ) failed:
Apr 9 15:49:45 usrs397 kernel: [1809559.519814] WARNING: at fs/dcache.c:1367 d_set_d_op+0x8f/0xc0()
Apr 9 19:06:55 usrs398 kernel: [1827812.880503] kernel BUG at fs/dcache.c:440!

The customer reports the following:

Yesterday one of our heavy users ran into several problems:

1) Twice java.io.File.mkdirs() returned false when creating a directory in the Lustre file system (he was never able to replicate it).

2) The logic was of the form:

    java.io.RandomAccessFile raf = new RandomAccessFile(<path>, "rw");
    raf.seek(<end of file>);

This sequence was executed successfully a number of times (by the same program) and then failed on the first line with a FileNotFoundException (a C sketch of this open-then-seek sequence appears below). I will attach the logs from the clients that hit LBUGs.

Today, the customer reports:

I have uploaded netconsole output from 6 additional clients. All of these clients were used in yesterday's run by the heavy user and had not been rebooted. All of the problems were triggered by executing the following:

    $ echo 3 > /proc/sys/vm/drop_caches

On usrs392 that command ended up hanging, using 100% of the CPU. I executed "echo t > /proc/sysrq-trigger" on usrs392, which crashed the machine and dumped out the stack traces in the output. On usrs{388,393,394,399,400} the command either segfaulted or was killed. Additionally, I decided to rerun the command on usrs{394,399,400} to see what happened; it hung on all three, using 100% of a single CPU. I managed to execute "echo l > /proc/sysrq-trigger", which shows the stack trace for CPU 12, the really active CPU. usrs{399,400} both locked up before I could run any further tests on them and had to be rebooted. I still have usrs{388,393} in the corrupted state if you need me to run any commands on them (they have not yet been rebooted).

It appears that the file/directory cache in the kernel is being corrupted by the Lustre client, which is causing these infinite loops when trying to drop the caches.
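For reference, a minimal C sketch of the failing open-then-seek sequence above. This is a translation of the Java logic, not the customer's actual program; the path, loop count, and error handling are placeholders:

    /* Hypothetical reproducer for the open-then-seek sequence above.
     * The path and iteration count are placeholders, not taken from
     * the customer's actual program. */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/mnt/lustre/testfile";  /* hypothetical mount point */
        int i;

        for (i = 0; i < 1000; i++) {
            /* Equivalent of: new RandomAccessFile(path, "rw") */
            int fd = open(path, O_RDWR | O_CREAT, 0644);
            if (fd < 0) {
                perror("open");   /* the Java program saw FileNotFoundException here */
                exit(1);
            }
            /* Equivalent of: raf.seek(<end of file>) */
            if (lseek(fd, 0, SEEK_END) == (off_t)-1) {
                perror("lseek");
                exit(1);
            }
            close(fd);
        }
        return 0;
    }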
I looked at the MDS log. It includes the following:

Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 2542:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e8828080[0x0, 1, [0x200000bf5:0x10a0:0x0] hash lru]{
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e8828080[0x0, 2, [0x200000bf5:0x10a0:0x0] hash lru]{
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdt@ffff8805e88280d8mdt-object@ffff8805e8828080(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....cmm@ffff8805fa4dd1c0[local]
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdd@ffff8805e9610b40mdd-object@ffff8805e9610b40(open_count=0, valid=0, cltime=0, flags=0)
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....osd-ldiskfs@ffff8805e9610c00osd-ldiskfs-object@ffff8805e9610c00(i:(null):0/0)[plain]
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) } header@ffff8805e8828080
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) Parent doesn't exist!
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e97902c0[0x0, 1, [0x200000bf5:0x10a1:0x0] hash lru]{
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdt@ffff8805e9790318mdt-object@ffff8805e97902c0(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....cmm@ffff8805ed3297c0[local]
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdd@ffff8805e927c900mdd-object@ffff8805e927c900(open_count=0, valid=0, cltime=0, flags=0)
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....osd-ldiskfs@ffff8805e927ccc0osd-ldiskfs-object@ffff8805e927ccc0(i:(null):0/0)[plain]
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) } header@ffff8805e97902c0
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) Parent doesn't exist!
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e8828080[0x0, 2, [0x200000bf5:0x10a0:0x0] hash lru]{
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdt@ffff8805e88280d8mdt-object@ffff8805e8828080(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....cmm@ffff8805fa4dd1c0[local]
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdd@ffff8805e9610b40mdd-object@ffff8805e9610b40(open_count=0, valid=0, cltime=0, flags=0)
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....osd-ldiskfs@ffff8805e9610c00osd-ldiskfs-object@ffff8805e9610c00(i:(null):0/0)[plain]
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) } header@ffff8805e8828080
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) Parent doesn't exist!
The following files are attached:

- MDS.messages: /var/log/messages from the MDS. The problem starts at Apr 9, 18:05.
- usrs389.messages, usrs397.messages, usrs398.messages: each shows a different LBUG or kernel assert.
- usrs392.netconsole: shows the output from "echo t > /proc/sysrq-trigger", though the machine may have crashed before it was completed.

usrs{388,393,394,399,400} crashed or were killed when running "echo t > /proc/sysrq-trigger".

Please let me know if you need additional information. |
| Comments |
| Comment by Peter Jones [ 10/Apr/12 ] |
|
Lai, could you please look into this one? Thanks, Peter |
| Comment by Lai Siyao [ 10/Apr/12 ] |
|
Hi Roger, the FC15 kernel that Lustre supports is kernel-2.6.38.6-26.rc1.fc15, while kernel-2.6.38.2-9.fc15 is for the FC15 beta release; I can't even find a place to download it now. Could you ask the customer to upgrade to kernel-2.6.38.6-26.rc1.fc15 and test again? |
| Comment by Peter Piela (Inactive) [ 11/Apr/12 ] |
|
I don't think it is feasible to ask the customer to upgrade their kernel unless we can make a strong case that the symptoms map to a known kernel issue. |
| Comment by Lai Siyao [ 11/Apr/12 ] |
|
Hmm, I got the source RPM of this kernel; I will check the code and try to compile and test. |
| Comment by Lai Siyao [ 12/Apr/12 ] |
|
The cause of the ll_ddelete() assertion is that dget_dlock() is called without holding dentry->d_lock in ll_splice_alias(). This is fine for kernels earlier than 2.6.38 because dput() takes dcache_lock, but newer kernels removed dcache_lock, so d_lock must now be held to update dentry->d_count. All the crashes look to have the same cause. The d_set_d_op() warning message has been fixed as well. Could you update to http://review.whamcloud.com/#change,2170 Patch Set 8 and test again?
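To illustrate the locking rule described above, here is a generic user-space sketch. Every name in it (fake_dentry, fake_dget_dlock, take_ref_racy, take_ref_safe) is an invented stand-in, not code from the actual patch:

    /* Generic illustration of the rule above: with the global dcache_lock
     * gone (kernels >= 2.6.38), the per-dentry d_lock must be held while
     * changing d_count.  All names here are hypothetical stand-ins; see
     * http://review.whamcloud.com/#change,2170 for the real patch. */
    #include <pthread.h>

    struct fake_dentry {
        pthread_mutex_t d_lock;   /* stands in for dentry->d_lock */
        int             d_count;  /* stands in for dentry->d_count */
    };

    /* Analogous to dget_dlock(): assumes the caller already holds d_lock. */
    static void fake_dget_dlock(struct fake_dentry *de)
    {
        de->d_count++;
    }

    /* Broken pattern: the count is bumped with no lock held, racing
     * against a concurrent dput() that decrements it under d_lock. */
    static void take_ref_racy(struct fake_dentry *de)
    {
        fake_dget_dlock(de);              /* unprotected update */
    }

    /* Fixed pattern: take d_lock around the increment, mirroring what
     * the patch does inside ll_splice_alias(). */
    static void take_ref_safe(struct fake_dentry *de)
    {
        pthread_mutex_lock(&de->d_lock);
        fake_dget_dlock(de);
        pthread_mutex_unlock(&de->d_lock);
    }

    int main(void)
    {
        struct fake_dentry de = { PTHREAD_MUTEX_INITIALIZER, 0 };
        take_ref_racy(&de);   /* single-threaded here, so no actual race */
        take_ref_safe(&de);
        return de.d_count == 2 ? 0 : 1;
    }

The sketch only mirrors the before/after pattern with a pthread mutex; per the comment above, the actual fix is to hold dentry->d_lock around the dget_dlock() call in ll_splice_alias(). |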
| Comment by Peter Piela (Inactive) [ 17/Apr/12 ] |
|
Both the customer and Terascala have confirmed the fix. This issue can be closed. |