Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version: Lustre 2.7.0
- Labels: None
- Environment: file system with 1 MDT, 6 OSTs, 2 OSS nodes; originally installed as 1.6, upgraded to 1.8, then 2.5, now 2.7
- Severity: 3
- Rank: 9223372036854775807
Description
When starting the lfsck layout check on our test file system, the OSS servers immediately crash with something like the following on the console (or in vmcore-dmesg.txt). I also discovered that I cannot stop the lfsck at this stage (after recovering the OSTs, lctl lfsck_stop just hangs), and when failing over the MDT in this state, the lfsck is restarted when the MDT is mounted on the other MDS, crashing the OSS nodes again. The output below was collected after the crash triggered by mounting the MDT on the failover node; the commands involved are sketched after the console output.
------------[ cut here ]------------
kernel BUG at fs/jbd2/transaction.c:1030!
Lustre: play01-OST0001: deleting orphan objects from 0x0:51613818 to 0x0:5161388
Lustre: play01-OST0003: deleting orphan objects from 0x0:77539134 to 0x0:7753920
Lustre: play01-OST0005: deleting orphan objects from 0x0:44598982 to 0x0:4459905
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:0c:00.0/host7/target7
CPU 2
Modules linked in: osp(U) ofd(U) lfsck(U) ipmi_si ost(U) mgc(U) osd_ldiskfs(U) a
Pid: 25013, comm: lfsck Not tainted 2.6.32-504.8.1.el6_lustre.x86_64 #1 Dell Inc
RIP: 0010:[<ffffffffa039179d>] [<ffffffffa039179d>] jbd2_journal_dirty_metadata
RSP: 0018:ffff8801fa26da00 EFLAGS: 00010246
RAX: ffff88043b4aa680 RBX: ffff880202e1f498 RCX: ffff880226a866e0
RDX: 0000000000000000 RSI: ffff880226a866e0 RDI: 0000000000000000
RBP: ffff8801fa26da20 R08: ffff880226a866e0 R09: 0000000000000018
R10: 0000000000480403 R11: 0000000000000001 R12: ffff880202e386d8
R13: ffff880226a866e0 R14: ffff880239208800 R15: 0000000000000000
FS: 00007fdff3fff700(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007feb2ce760a0 CR3: 000000043b4d1000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process lfsck (pid: 25013, threadinfo ffff8801fa26c000, task ffff8801f78b3540)
Stack:
 ffff880202e1f498 ffffffffa0fd7710 ffff880226a866e0 0000000000000000
<d> ffff8801fa26da60 ffffffffa0f9600b ffff8801fa26daa0 ffffffffa0fd2af3
<d> ffff8802159f3000 ffff8803f12396e0 ffff8803f1239610 ffff8801fa26db28
Call Trace:
 [<ffffffffa0f9600b>] __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
 [<ffffffffa0fd2af3>] ? ldiskfs_xattr_set_entry+0x4e3/0x4f0 [ldiskfs]
 [<ffffffffa0fa1d9a>] ldiskfs_mark_iloc_dirty+0x52a/0x630 [ldiskfs]
 [<ffffffffa0fd4abc>] ldiskfs_xattr_set_handle+0x33c/0x560 [ldiskfs]
 [<ffffffffa0fd4ddc>] ldiskfs_xattr_set+0xfc/0x1a0 [ldiskfs]
 [<ffffffffa0fd500e>] ldiskfs_xattr_trusted_set+0x2e/0x30 [ldiskfs]
 [<ffffffff811b4722>] generic_setxattr+0xa2/0xb0
 [<ffffffffa0d4690d>] __osd_xattr_set+0x8d/0xe0 [osd_ldiskfs]
 [<ffffffffa0d4e005>] osd_xattr_set+0x3a5/0x4b0 [osd_ldiskfs]
 [<ffffffffa0a3f446>] lfsck_master_oit_engine+0x14c6/0x1ef0 [lfsck]
 [<ffffffffa0a4094e>] lfsck_master_engine+0xade/0x13e0 [lfsck]
 [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
 [<ffffffffa0a3fe70>] ? lfsck_master_engine+0x0/0x13e0 [lfsck]
 [<ffffffff8109e66e>] kthread+0x9e/0xc0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Code: c6 9c 03 00 00 4c 89 f7 e8 91 bf 19 e1 48 8b 33 ba 01 00 00 00 4c 89 e7 e
RIP [<ffffffffa039179d>] jbd2_journal_dirty_metadata+0x10d/0x150 [jbd2]
 RSP <ffff8801fa26da00>
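For reference, this is roughly how the layout check was started and how we tried to stop it, as a minimal sketch run on the MDS. The MDT target name play01-MDT0000 is an assumption inferred from the OST names in the log above and is not confirmed in this report.

# Start the layout LFSCK on the MDS; this is the step that triggers the OSS crashes
lctl lfsck_start -M play01-MDT0000 -t layout

# Attempt to stop it again; in the state described above this command just hangs
lctl lfsck_stop -M play01-MDT0000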
We have a vmcore file on one of the servers, which we can upload if required.
After failing over the MDT and recovering the OSTs, I can stop the lfsck layout check.
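A minimal sketch of how the layout LFSCK state can be checked after the stop, again assuming the MDT target name play01-MDT0000:

# Query the layout LFSCK state on the MDS; a status of "stopped" or "completed"
# indicates the scan is no longer running
lctl get_param -n mdd.play01-MDT0000.lfsck_layout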