
[LU-6683] OSS crash when starting lfsck layout check

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Environment: file system with 1 MDT, 6 OSTs, 2 OSS servers; installed as 1.6, upgraded to 1.8, then 2.5, now 2.7
    • Severity: 3
    • 9223372036854775807

    Description

      When starting the lfsck layout check on our test file system, the OSS servers immediately crash with something like the following on the console (or in vmcore-dmesg.txt). I also discovered that I can't stop the lfsck at this stage (lctl lfsck_stop just hangs, even after recovering the OSTs), and when failing over the MDT in this state, the lfsck is restarted when the MDT is mounted on the other MDS, crashing the OSS nodes again. The output below was collected after the crash triggered by the MDT failover mount.

      ------------[ cut here ]------------
      kernel BUG at fs/jbd2/transaction.c:1030!
      Lustre: play01-OST0001: deleting orphan objects from 0x0:51613818 to 0x0:5161388
      Lustre: play01-OST0003: deleting orphan objects from 0x0:77539134 to 0x0:7753920
      Lustre: play01-OST0005: deleting orphan objects from 0x0:44598982 to 0x0:4459905
      invalid opcode: 0000 [#1] SMP 
      last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:0c:00.0/host7/target7
      CPU 2 
      Modules linked in: osp(U) ofd(U) lfsck(U) ipmi_si ost(U) mgc(U) osd_ldiskfs(U) a
      
      Pid: 25013, comm: lfsck Not tainted 2.6.32-504.8.1.el6_lustre.x86_64 #1 Dell Inc
      RIP: 0010:[<ffffffffa039179d>]  [<ffffffffa039179d>] jbd2_journal_dirty_metadata
      RSP: 0018:ffff8801fa26da00  EFLAGS: 00010246
      RAX: ffff88043b4aa680 RBX: ffff880202e1f498 RCX: ffff880226a866e0
      RDX: 0000000000000000 RSI: ffff880226a866e0 RDI: 0000000000000000
      RBP: ffff8801fa26da20 R08: ffff880226a866e0 R09: 0000000000000018
      R10: 0000000000480403 R11: 0000000000000001 R12: ffff880202e386d8
      R13: ffff880226a866e0 R14: ffff880239208800 R15: 0000000000000000
      FS:  00007fdff3fff700(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 00007feb2ce760a0 CR3: 000000043b4d1000 CR4: 00000000000007e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process lfsck (pid: 25013, threadinfo ffff8801fa26c000, task ffff8801f78b3540)
      Stack:
       ffff880202e1f498 ffffffffa0fd7710 ffff880226a866e0 0000000000000000
      <d> ffff8801fa26da60 ffffffffa0f9600b ffff8801fa26daa0 ffffffffa0fd2af3
      <d> ffff8802159f3000 ffff8803f12396e0 ffff8803f1239610 ffff8801fa26db28
      Call Trace:
       [<ffffffffa0f9600b>] __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
       [<ffffffffa0fd2af3>] ? ldiskfs_xattr_set_entry+0x4e3/0x4f0 [ldiskfs]
       [<ffffffffa0fa1d9a>] ldiskfs_mark_iloc_dirty+0x52a/0x630 [ldiskfs]
       [<ffffffffa0fd4abc>] ldiskfs_xattr_set_handle+0x33c/0x560 [ldiskfs]
       [<ffffffffa0fd4ddc>] ldiskfs_xattr_set+0xfc/0x1a0 [ldiskfs]
       [<ffffffffa0fd500e>] ldiskfs_xattr_trusted_set+0x2e/0x30 [ldiskfs]
       [<ffffffff811b4722>] generic_setxattr+0xa2/0xb0
       [<ffffffffa0d4690d>] __osd_xattr_set+0x8d/0xe0 [osd_ldiskfs]
       [<ffffffffa0d4e005>] osd_xattr_set+0x3a5/0x4b0 [osd_ldiskfs]
       [<ffffffffa0a3f446>] lfsck_master_oit_engine+0x14c6/0x1ef0 [lfsck]
       [<ffffffffa0a4094e>] lfsck_master_engine+0xade/0x13e0 [lfsck]
       [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
       [<ffffffffa0a3fe70>] ? lfsck_master_engine+0x0/0x13e0 [lfsck]
       [<ffffffff8109e66e>] kthread+0x9e/0xc0
       [<ffffffff8100c20a>] child_rip+0xa/0x20
       [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
       [<ffffffff8100c200>] ? child_rip+0x0/0x20
      Code: c6 9c 03 00 00 4c 89 f7 e8 91 bf 19 e1 48 8b 33 ba 01 00 00 00 4c 89 e7 e 
      RIP  [<ffffffffa039179d>] jbd2_journal_dirty_metadata+0x10d/0x150 [jbd2]
       RSP <ffff8801fa26da00>
      

      We've got a vmcore file on one of the servers, which we can upload if this is required.

      After failing over the MDT and recovering the OSTs, I can stop the lfsck layout check.
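
      The RIP above is in jbd2_journal_dirty_metadata(), and a common cause of this BUG is a journal handle dirtying more metadata blocks than it reserved credits for when it was started. A minimal sketch of that mechanism, using the stock jbd2 kernel API (illustrative only, not Lustre code, error handling trimmed):

      /*
       * Illustrative only: a handle started with one credit that then
       * dirties two metadata blocks trips jbd2's credit check, which is
       * what "kernel BUG at fs/jbd2/transaction.c:1030" reports above.
       */
      #include <linux/err.h>
      #include <linux/jbd2.h>

      static int sketch_two_updates_one_credit(journal_t *journal,
                                               struct buffer_head *bh1,
                                               struct buffer_head *bh2)
      {
              /* Reserve journal credits for a single metadata block. */
              handle_t *handle = jbd2_journal_start(journal, 1);

              if (IS_ERR(handle))
                      return PTR_ERR(handle);

              jbd2_journal_get_write_access(handle, bh1);
              jbd2_journal_dirty_metadata(handle, bh1);  /* uses the only credit */

              jbd2_journal_get_write_access(handle, bh2);
              /*
               * No credits are left, so the first modification of bh2 in
               * this transaction hits the credit check and the kernel BUGs.
               */
              jbd2_journal_dirty_metadata(handle, bh2);

              return jbd2_journal_stop(handle);
      }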

      Attachments

        Activity

          [LU-6683] OSS crash when starting lfsck layout check
          pjones Peter Jones added a comment -

          Landed for 2.8


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/15361/
          Subject: LU-6683 osd: declare enough credits for generating LMA
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 3675d14de7ffcd761eca1448aab950f80773412a

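          The patch subject refers to reserving journal credits for generating the LMA ("trusted.lma") xattr. As a rough sketch of the idea only, with purely hypothetical names rather than the actual fs/lustre-release code: when an xattr is set on an object that does not yet carry an LMA (for example one created before the upgrade), the osd's declare step must reserve credits for writing the LMA as well as for the xattr it was asked to set.

          /* All identifiers below are hypothetical, for illustration only. */
          struct sketch_thandle {
                  int th_declared_credits;   /* journal credits reserved for this handle */
          };

          #define SKETCH_CREDITS_PER_XATTR 3 /* placeholder per-xattr credit cost */

          static void sketch_declare_xattr_set(struct sketch_thandle *th,
                                               int lma_missing)
          {
                  /* credits for the xattr the caller asked to set */
                  th->th_declared_credits += SKETCH_CREDITS_PER_XATTR;

                  /* plus credits for the LMA the osd may have to generate on the fly */
                  if (lma_missing)
                          th->th_declared_credits += SKETCH_CREDITS_PER_XATTR;
          }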

          gerrit Gerrit Updater added a comment -

          Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/15361
          Subject: LU-6683 osd: declare enough credits for generating LMA
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 71d360f5a2201aa666382a0b7da1b89860596777


          yong.fan nasf (Inactive) added a comment -

          Yes, master needs the patch also.


          adilger Andreas Dilger added a comment -

          Is this patch needed for master?


          yong.fan nasf (Inactive) added a comment -

          Thanks, Frederik, for the update. The patch for b2_7 has been replaced by the patch for b2_7_fe: http://review.whamcloud.com/#/c/15133/


          ferner Frederik Ferner (Inactive) added a comment -

          Thanks for confirming regarding the build failure.

          I have now updated our test file system to include the patch and can confirm that this fixed the crash for us.


          yong.fan nasf (Inactive) added a comment -

          The failure is related to the build system, not the patch, so please go ahead with the patch. Thanks!


          ferner Frederik Ferner (Inactive) added a comment -

          I noticed on the review page that the builds are marked as failed, but this seems to be RHEL7 only. I'll certainly try the patch ASAP.


          People

            Assignee: yong.fan nasf (Inactive)
            Reporter: ferner Frederik Ferner (Inactive)
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: