Details
- Bug
- Resolution: Fixed
- Critical
- None
- Lustre 2.1.1
- None
- Lustre servers are running 2.6.32-220.el6, with Lustre 2.1.1.rc4. Lustre clients are running 2.6.38.2, with special code created for this release, with http://review.whamcloud.com/#change,2170.
- 2
- 6418
Description
LBUG: LustreError: 3982:0:(dcache.c:175:ll_ddelete()) ASSERTION( ((de)->d_count) == 1 ) failed:
Several clients are locking up and crashing. Looking at the logs from three different clients, I see three different errors, namely:
Apr 9 19:08:35 usrs389 kernel: [1821598.590661] LustreError: 3982:0:(dcache.c:175:ll_ddelete()) ASSERTION( ((de)->d_count) == 1 ) failed:
Apr 9 19:08:35 usrs389 kernel: [1821598.638614] LustreError: 3982:0:(dcache.c:175:ll_ddelete()) LBUG
Apr 9 15:49:45 usrs397 kernel: [1809559.519814] WARNING: at fs/dcache.c:1367 d_set_d_op+0x8f/0xc0()
Apr 9 19:06:55 usrs398 kernel: [1827812.880503] kernel BUG at fs/dcache.c:440!
Apr 9 19:06:55 usrs398 kernel: [1827812.901832] invalid opcode: 0000 [#1] SMP
Customer reports the following:
Yesterday one of our heavy users ran into several problems:
1) Twice java.io.File.mkdirs() returned false when creating a directory in the Lustre file system (he was never able to replicate it); a call sketch follows this list.
2) A number of times a file that he had previously opened and written to (on a single client) came back as not found when attempting to open.
3) Two clients locked up with LBUG in /var/log/messages
4) One client completely crashed.
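As a point of reference for issue #1, here is a minimal sketch of the kind of call the user describes; the path is a placeholder, not the user's actual directory, and File.mkdirs() simply returns false (with no exception) when the directory could not be created:
import java.io.File;

public class MkdirsCheck {
    public static void main(String[] args) {
        // Hypothetical path on the Lustre mount; the real path was not reported.
        File dir = new File("/lustre/scratch/user/newdir");

        // mkdirs() creates the directory and any missing parents; it returns
        // false without throwing if creation fails. The reported problem was
        // this call returning false on a Lustre file system.
        if (!dir.mkdirs() && !dir.isDirectory()) {
            System.err.println("mkdirs() failed for " + dir);
        }
    }
}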
For issue #2, the logic was of the form:
java.io.RandomAccessFile raf = new RandomAccessFile(<path>, "rw");
raf.seek(<end of file>);
raf.write(<buffer>);
raf.close();
This sequence was executed successfully a number of times (by the same program) and then failed on the first line with a FileNotFoundException.
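For reference, a minimal, self-contained sketch of that open/seek/write/close cycle; the path, record contents, and iteration count are placeholders, not the user's actual values:
import java.io.IOException;
import java.io.RandomAccessFile;

public class AppendRepro {
    public static void main(String[] args) throws IOException {
        // Hypothetical file on the Lustre mount; the real path was not reported.
        String path = "/lustre/scratch/user/output.dat";
        byte[] buffer = "record\n".getBytes();

        // Repeat the cycle; the reported failure was a FileNotFoundException
        // thrown by the RandomAccessFile constructor after a number of
        // successful iterations.
        for (int i = 0; i < 1000; i++) {
            RandomAccessFile raf = new RandomAccessFile(path, "rw");
            raf.seek(raf.length()); // seek to end of file
            raf.write(buffer);
            raf.close();
        }
    }
}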
I will attach the logs from the clients that had LBUGs.
Today, the customer reports:
I have uploaded netconsole output from 6 additional clients. All of these clients were used in yesterday's run by the heavy user and had not been rebooted. All of the problems were triggered by executing the following:
$ echo 3 > /proc/sys/vm/drop_caches
That command on usrs392 ended up hanging (using 100% of the CPU).
I executed "echo t > /proc/sysrq-trigger" on usrs392, which crashed the machine and dumped the stack traces into the output.
That command on usrs{388,393,394,399,400} either segfaulted or was killed. Additionally, I decided to rerun the command on usrs{394,399,400} to see what happened; it hung on all three, using 100% of a single CPU. I managed to execute "echo l > /proc/sysrq-trigger", which shows the stack trace for CPU 12, the heavily loaded CPU. usrs{399,400} both locked up before I could run any further tests on them and had to be rebooted. I still have usrs{388,393} in the corrupted state if you need me to run any commands on them (they have not yet been rebooted).
It appears that the file/directory cache in the kernel is being corrupted by the Lustre client, which is causing these infinite loops when trying to drop the caches.
I looked at the MDS log. It includes the following:
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 2542:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e8828080[0x0, 1, [0x200000bf5:0x10a0:0x0] hash lru]{
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e8828080[0x0, 2, [0x200000bf5:0x10a0:0x0] hash lru]{
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdt@ffff8805e88280d8mdt-object@ffff8805e8828080(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....cmm@ffff8805fa4dd1c0[local]
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdd@ffff8805e9610b40mdd-object@ffff8805e9610b40(open_count=0, valid=0, cltime=0, flags=0)
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....osd-ldiskfs@ffff8805e9610c00osd-ldiskfs-object@ffff8805e9610c00(i:(null):0/0)[plain]
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) } header@ffff8805e8828080
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) Parent doesn't exist!
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e97902c0[0x0, 1, [0x200000bf5:0x10a1:0x0] hash lru]{
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdt@ffff8805e9790318mdt-object@ffff8805e97902c0(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....cmm@ffff8805ed3297c0[local]
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdd@ffff8805e927c900mdd-object@ffff8805e927c900(open_count=0, valid=0, cltime=0, flags=0)
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....osd-ldiskfs@ffff8805e927ccc0osd-ldiskfs-object@ffff8805e927ccc0(i:(null):0/0)[plain]
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) } header@ffff8805e97902c0
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) Parent doesn't exist!
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e8828080[0x0, 2, [0x200000bf5:0x10a0:0x0] hash lru]{
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdt@ffff8805e88280d8mdt-object@ffff8805e8828080(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....cmm@ffff8805fa4dd1c0[local]
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdd@ffff8805e9610b40mdd-object@ffff8805e9610b40(open_count=0, valid=0, cltime=0, flags=0)
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....osd-ldiskfs@ffff8805e9610c00osd-ldiskfs-object@ffff8805e9610c00(i:(null):0/0)[plain]
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) } header@ffff8805e8828080
Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) Parent doesn't exist!
The following files are attached:
MDS.messages: /var/log/messages from the MDS. The problem starts at Apr 9, 18:05
usrs389.messages, usrs397.messages, usrs398.messages: each shows a different LBUG or kernel assert.
usrs392.netconsole: shows the output from sysrq-T ("echo t > /proc/sysrq-trigger"), though the machine may have crashed before it completed.
usrs{388,393,394,399,400}: crashed or were killed when running sysrq-T.
Please let me know if you need additional information.