LU-1297: LBUG: LustreError: 3982:0:(dcache.c:175:ll_ddelete()) ASSERTION( ((de)->d_count) == 1 ) failed:


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • None
    • Affects Version: Lustre 2.1.1
    • None
    • Environment: Lustre servers are running 2.6.32-220.el6 with Lustre 2.1.1.rc4.
      Lustre clients are running 2.6.38.2 with special code created for this release, including http://review.whamcloud.com/#change,2170.
    • 2
    • 6418

    Description

      LBUG: LustreError: 3982:0:(dcache.c:175:ll_ddelete()) ASSERTION( ((de)->d_count) == 1 ) failed:

      Several clients are locking up and crashing. Looking at the logs from three different clients, I see three different errors, namely:

      Apr 9 19:08:35 usrs389 kernel: [1821598.590661] LustreError: 3982:0:(dcache.c:175:ll_ddelete()) ASSERTION( ((de)->d_count) == 1 ) failed:
      Apr 9 19:08:35 usrs389 kernel: [1821598.638614] LustreError: 3982:0:(dcache.c:175:ll_ddelete()) LBUG

      Apr 9 15:49:45 usrs397 kernel: [1809559.519814] WARNING: at fs/dcache.c:1367 d_set_d_op+0x8f/0xc0()

      Apr 9 19:06:55 usrs398 kernel: [1827812.880503] kernel BUG at fs/dcache.c:440!
      Apr 9 19:06:55 usrs398 kernel: [1827812.901832] invalid opcode: 0000 [#1] SMP

      Customer reports the following:

      Yesterday one of our heavy users ran into several problems:

      1) Twice, java.io.File.mkdirs() returned false when creating a directory in the Lustre file system (he was never able to replicate it); a minimal sketch of this call follows this list.
      2) A number of times, a file that he had previously opened and written to (on a single client) came back as not found when he attempted to open it again.
      3) Two clients locked up with an LBUG in /var/log/messages.
      4) One client completely crashed.
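
      The mkdirs() failure in problem #1 can be illustrated with a minimal, self-contained Java sketch; the directory path below is a hypothetical placeholder on the Lustre mount, not taken from the report:

      import java.io.File;

      public class MkdirsCheck {
          public static void main(String[] args) {
              // Hypothetical path on the Lustre client mount.
              File dir = new File("/mnt/lustre/user/output/run42");
              // java.io.File.mkdirs() returns false when the directory was not
              // created; the report says this happened twice on Lustre.
              if (!dir.mkdirs() && !dir.isDirectory()) {
                  System.err.println("mkdirs() failed for " + dir.getAbsolutePath());
              }
          }
      }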

      For issue #2, the logic was of the form:

      java.io.RandomAccessFile raf = new java.io.RandomAccessFile(<path>, "rw");
      raf.seek(<end of file>);
      raf.write(<buffer>);
      raf.close();

      This sequence was executed successfully a number of times (by the same program) and then failed on the first line with a FileNotFoundException.
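
      For reference, here is a compilable sketch of that sequence; the file path and payload are hypothetical placeholders, and java.io.RandomAccessFile is assumed since java.io has no RandomFile class. Its constructor is the "first line" that failed with FileNotFoundException:

      import java.io.FileNotFoundException;
      import java.io.IOException;
      import java.io.RandomAccessFile;

      public class AppendRepro {
          public static void main(String[] args) throws IOException {
              byte[] buffer = "example payload\n".getBytes();   // hypothetical data
              try {
                  // Hypothetical path on the Lustre client mount; on the failing runs
                  // this constructor threw FileNotFoundException for a file that had
                  // just been written by the same program.
                  RandomAccessFile raf = new RandomAccessFile("/mnt/lustre/user/output.dat", "rw");
                  raf.seek(raf.length());   // position at end of file
                  raf.write(buffer);
                  raf.close();
              } catch (FileNotFoundException e) {
                  System.err.println("open failed: " + e.getMessage());
              }
          }
      }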

      I will attach the logs from the clients that had LBUGs.

      Today, the customer reports:

      I have uploaded netconsole output from 6 additional clients. All of these clients were used in yesterday's run by the heavy user and had not been rebooted. All of the problems were triggered by executing the following:

      $ echo 3 > /proc/sys/vm/drop_caches

      That command on usrs392 ended up hanging (using 100% of the CPU).

      I executed "echo t > /proc/sysrq-trigger" on usrs392, which crashed the machine and dumped the stack traces in the output.

      On usrs{388,393,394,399,400}, the command either segfaulted or was killed. Additionally, I decided to rerun the command on usrs{394,399,400} to see what happened; the command hung on all three, using 100% of a single CPU. I managed to execute "echo l > /proc/sysrq-trigger", which shows the stack trace for CPU 12, which was the really active CPU. usrs{399,400} both locked up before I could run any further tests on them and had to be rebooted. I still have usrs{388,393} in the corrupted state if you need me to run any commands on them (they have not yet been rebooted).

      It appears that the file/directory cache in the kernel is being corrupted by the Lustre client, which is causing these infinite loops when trying to drop the caches.

      I looked at the MDS log. It includes the following:

      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 2542:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e8828080[0x0, 1, [0x200000bf5:0x10a0:0x0] hash lru]{
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e8828080[0x0, 2, [0x200000bf5:0x10a0:0x0] hash lru]{
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdt@ffff8805e88280d8mdt-object@ffff8805e8828080(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....cmm@ffff8805fa4dd1c0[local]
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdd@ffff8805e9610b40mdd-object@ffff8805e9610b40(open_count=0, valid=0, cltime=0, flags=0)
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....osd-ldiskfs@ffff8805e9610c00osd-ldiskfs-object@ffff8805e9610c00(i:(null):0/0)[plain]
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) } header@ffff8805e8828080
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) Parent doesn't exist!
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e97902c0[0x0, 1, [0x200000bf5:0x10a1:0x0] hash lru]{
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdt@ffff8805e9790318mdt-object@ffff8805e97902c0(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....cmm@ffff8805ed3297c0[local]
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdd@ffff8805e927c900mdd-object@ffff8805e927c900(open_count=0, valid=0, cltime=0, flags=0)
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....osd-ldiskfs@ffff8805e927ccc0osd-ldiskfs-object@ffff8805e927ccc0(i:(null):0/0)[plain]
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) } header@ffff8805e97902c0
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) Parent doesn't exist!
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) header@ffff8805e8828080[0x0, 2, [0x200000bf5:0x10a0:0x0] hash lru]{
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdt@ffff8805e88280d8mdt-object@ffff8805e8828080(ioepoch=0 flags=0x0, epochcount=0, writecount=0)
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....cmm@ffff8805fa4dd1c0[local]
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....mdd@ffff8805e9610b40mdd-object@ffff8805e9610b40(open_count=0, valid=0, cltime=0, flags=0)
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) ....osd-ldiskfs@ffff8805e9610c00osd-ldiskfs-object@ffff8805e9610c00(i:(null):0/0)[plain]
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) } header@ffff8805e8828080
      Apr 9 18:05:49 ts-xxxxxxxx-01 kernel: Lustre: 10206:0:(mdt_handler.c:887:mdt_getattr_name_lock()) Parent doesn't exist!


      The following files are attached:

      MDS.messages: /var/log/messages from the MDS. The problem starts at Apr 9, 18:05.

      usrs389.messages, usrs397.messages, usrs398.messages: Each shows a different LBUG or kernel assert.

      usrs392.netconsole shows the output from sysrq-trigger-T, though the machine may have crashed before it was completed.

      usrs{388,393,394,399,400}.netconsole: these clients crashed or were killed when running sysrq-trigger-T.

      Please let me know if you need additional information.

      Attachments

        1. MDS.messages
          90 kB
        2. usrs388.netconsole
          4 kB
        3. usrs389.messages
          28 kB
        4. usrs392.netconsole
          46 kB
        5. usrs393.netconsole
          4 kB
        6. usrs394.netconsole
          13 kB
        7. usrs397.messages
          33 kB
        8. usrs398.messages
          5 kB
        9. usrs399.netconsole
          5 kB
        10. usrs400.netconsole
          4 kB

          People

            Assignee: Lai Siyao
            Reporter: Roger Spellman (Inactive)