LU-6683: OSS crash when starting lfsck layout check


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Environment: file system with 1 MDT, 6 OSTs, 2 OSS; installed as 1.6, upgraded to 1.8, then 2.5, now 2.7
    • Severity: 3

    Description

      When starting the lfsck layout check on our test file system, both OSS servers immediately crash with something like the following on the console (or in vmcore-dmesg.txt). I also discovered that I can't stop the lfsck (lctl lfsck_stop just hangs) at this stage (even after recovering the OSTs), and when failing over the MDT in this state, the lfsck is re-started when the MDT is mounted on the other MDS, crashing the OSS nodes again. The output below was collected after the crash triggered by the MDT failover mount.
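      For reference, the commands involved are sketched below; the play01-MDT0000 device name is inferred from the fsname in the console output, not quoted from our shell history:

      # start the layout check on the MDS (device name inferred)
      lctl lfsck_start -M play01-MDT0000 -t layout

      # the stop command that hangs at this stage
      lctl lfsck_stop -M play01-MDT0000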

      ------------[ cut here ]------------
      kernel BUG at fs/jbd2/transaction.c:1030!
      Lustre: play01-OST0001: deleting orphan objects from 0x0:51613818 to 0x0:5161388
      Lustre: play01-OST0003: deleting orphan objects from 0x0:77539134 to 0x0:7753920
      Lustre: play01-OST0005: deleting orphan objects from 0x0:44598982 to 0x0:4459905
      invalid opcode: 0000 [#1] SMP 
      last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:0c:00.0/host7/target7
      CPU 2 
      Modules linked in: osp(U) ofd(U) lfsck(U) ipmi_si ost(U) mgc(U) osd_ldiskfs(U) a
      
      Pid: 25013, comm: lfsck Not tainted 2.6.32-504.8.1.el6_lustre.x86_64 #1 Dell Inc
      RIP: 0010:[<ffffffffa039179d>]  [<ffffffffa039179d>] jbd2_journal_dirty_metadata
      RSP: 0018:ffff8801fa26da00  EFLAGS: 00010246
      RAX: ffff88043b4aa680 RBX: ffff880202e1f498 RCX: ffff880226a866e0
      RDX: 0000000000000000 RSI: ffff880226a866e0 RDI: 0000000000000000
      RBP: ffff8801fa26da20 R08: ffff880226a866e0 R09: 0000000000000018
      R10: 0000000000480403 R11: 0000000000000001 R12: ffff880202e386d8
      R13: ffff880226a866e0 R14: ffff880239208800 R15: 0000000000000000
      FS:  00007fdff3fff700(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 00007feb2ce760a0 CR3: 000000043b4d1000 CR4: 00000000000007e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process lfsck (pid: 25013, threadinfo ffff8801fa26c000, task ffff8801f78b3540)
      Stack:
       ffff880202e1f498 ffffffffa0fd7710 ffff880226a866e0 0000000000000000
      <d> ffff8801fa26da60 ffffffffa0f9600b ffff8801fa26daa0 ffffffffa0fd2af3
      <d> ffff8802159f3000 ffff8803f12396e0 ffff8803f1239610 ffff8801fa26db28
      Call Trace:
       [<ffffffffa0f9600b>] __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
       [<ffffffffa0fd2af3>] ? ldiskfs_xattr_set_entry+0x4e3/0x4f0 [ldiskfs]
       [<ffffffffa0fa1d9a>] ldiskfs_mark_iloc_dirty+0x52a/0x630 [ldiskfs]
       [<ffffffffa0fd4abc>] ldiskfs_xattr_set_handle+0x33c/0x560 [ldiskfs]
       [<ffffffffa0fd4ddc>] ldiskfs_xattr_set+0xfc/0x1a0 [ldiskfs]
       [<ffffffffa0fd500e>] ldiskfs_xattr_trusted_set+0x2e/0x30 [ldiskfs]
       [<ffffffff811b4722>] generic_setxattr+0xa2/0xb0
       [<ffffffffa0d4690d>] __osd_xattr_set+0x8d/0xe0 [osd_ldiskfs]
       [<ffffffffa0d4e005>] osd_xattr_set+0x3a5/0x4b0 [osd_ldiskfs]
       [<ffffffffa0a3f446>] lfsck_master_oit_engine+0x14c6/0x1ef0 [lfsck]
       [<ffffffffa0a4094e>] lfsck_master_engine+0xade/0x13e0 [lfsck]
       [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
       [<ffffffffa0a3fe70>] ? lfsck_master_engine+0x0/0x13e0 [lfsck]
       [<ffffffff8109e66e>] kthread+0x9e/0xc0
       [<ffffffff8100c20a>] child_rip+0xa/0x20
       [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
       [<ffffffff8100c200>] ? child_rip+0x0/0x20
      Code: c6 9c 03 00 00 4c 89 f7 e8 91 bf 19 e1 48 8b 33 ba 01 00 00 00 4c 89 e7 e 
      RIP  [<ffffffffa039179d>] jbd2_journal_dirty_metadata+0x10d/0x150 [jbd2]
       RSP <ffff8801fa26da00>
      

      We've got a vmcore file on one of the servers, which we can upload if this is required.
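      Should the backtrace need to be re-extracted, the vmcore can be opened with the crash utility; a sketch, assuming the matching debuginfo vmlinux for the 2.6.32-504.8.1.el6_lustre.x86_64 kernel is installed in the standard location:

      # open the dump against the debug vmlinux for the crashed kernel
      crash /usr/lib/debug/lib/modules/2.6.32-504.8.1.el6_lustre.x86_64/vmlinux vmcore

      # inside the crash session: backtrace of the panic task, then the console log
      crash> bt
      crash> log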

      After failing over the MDT and recovering the OSTs, I can stop the lfsck layout check.
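      For completeness, the layout lfsck state can be checked on the MDS before and after the stop attempt; a sketch, again assuming the play01-MDT0000 device name:

      # query the layout lfsck status (reports e.g. scanning-phase1, stopped, completed)
      lctl get_param -n mdd.play01-MDT0000.lfsck_layout

      # after OST recovery completes, the stop succeeds
      lctl lfsck_stop -M play01-MDT0000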


    People

      Assignee: yong.fan nasf (Inactive)
      Reporter: ferner Frederik Ferner (Inactive)
      Votes: 0
      Watchers: 6
