Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version: Lustre 2.7.0
- Labels: None
- Environment: file system with 1 MDT, 6 OSTs, 2 OSS nodes; originally installed as 1.6, upgraded to 1.8, then 2.5, now 2.7
- Severity: 3
- Rank: 9223372036854775807
Description
When starting the lfsck layout check on our test file system, the OSS servers immediately crash with something like the following on the console (or in vmcore-dmesg.txt). I also discovered that I cannot stop the lfsck at this stage (after recovering the OSTs, lctl lfsck_stop just hangs), and when failing over the MDT in this state, the lfsck is restarted when the MDT is mounted on the other MDS, crashing the OSS nodes again. The output below was collected after the crash triggered by mounting the MDT on the failover node; the commands involved are sketched after the console output.
------------[ cut here ]------------
kernel BUG at fs/jbd2/transaction.c:1030!
Lustre: play01-OST0001: deleting orphan objects from 0x0:51613818 to 0x0:5161388
Lustre: play01-OST0003: deleting orphan objects from 0x0:77539134 to 0x0:7753920
Lustre: play01-OST0005: deleting orphan objects from 0x0:44598982 to 0x0:4459905
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:07.0/0000:0c:00.0/host7/target7
CPU 2
Modules linked in: osp(U) ofd(U) lfsck(U) ipmi_si ost(U) mgc(U) osd_ldiskfs(U) a
Pid: 25013, comm: lfsck Not tainted 2.6.32-504.8.1.el6_lustre.x86_64 #1 Dell Inc
RIP: 0010:[<ffffffffa039179d>] [<ffffffffa039179d>] jbd2_journal_dirty_metadata
RSP: 0018:ffff8801fa26da00 EFLAGS: 00010246
RAX: ffff88043b4aa680 RBX: ffff880202e1f498 RCX: ffff880226a866e0
RDX: 0000000000000000 RSI: ffff880226a866e0 RDI: 0000000000000000
RBP: ffff8801fa26da20 R08: ffff880226a866e0 R09: 0000000000000018
R10: 0000000000480403 R11: 0000000000000001 R12: ffff880202e386d8
R13: ffff880226a866e0 R14: ffff880239208800 R15: 0000000000000000
FS: 00007fdff3fff700(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007feb2ce760a0 CR3: 000000043b4d1000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process lfsck (pid: 25013, threadinfo ffff8801fa26c000, task ffff8801f78b3540)
Stack:
 ffff880202e1f498 ffffffffa0fd7710 ffff880226a866e0 0000000000000000
<d> ffff8801fa26da60 ffffffffa0f9600b ffff8801fa26daa0 ffffffffa0fd2af3
<d> ffff8802159f3000 ffff8803f12396e0 ffff8803f1239610 ffff8801fa26db28
Call Trace:
 [<ffffffffa0f9600b>] __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
 [<ffffffffa0fd2af3>] ? ldiskfs_xattr_set_entry+0x4e3/0x4f0 [ldiskfs]
 [<ffffffffa0fa1d9a>] ldiskfs_mark_iloc_dirty+0x52a/0x630 [ldiskfs]
 [<ffffffffa0fd4abc>] ldiskfs_xattr_set_handle+0x33c/0x560 [ldiskfs]
 [<ffffffffa0fd4ddc>] ldiskfs_xattr_set+0xfc/0x1a0 [ldiskfs]
 [<ffffffffa0fd500e>] ldiskfs_xattr_trusted_set+0x2e/0x30 [ldiskfs]
 [<ffffffff811b4722>] generic_setxattr+0xa2/0xb0
 [<ffffffffa0d4690d>] __osd_xattr_set+0x8d/0xe0 [osd_ldiskfs]
 [<ffffffffa0d4e005>] osd_xattr_set+0x3a5/0x4b0 [osd_ldiskfs]
 [<ffffffffa0a3f446>] lfsck_master_oit_engine+0x14c6/0x1ef0 [lfsck]
 [<ffffffffa0a4094e>] lfsck_master_engine+0xade/0x13e0 [lfsck]
 [<ffffffff81064b90>] ? default_wake_function+0x0/0x20
 [<ffffffffa0a3fe70>] ? lfsck_master_engine+0x0/0x13e0 [lfsck]
 [<ffffffff8109e66e>] kthread+0x9e/0xc0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Code: c6 9c 03 00 00 4c 89 f7 e8 91 bf 19 e1 48 8b 33 ba 01 00 00 00 4c 89 e7 e
RIP [<ffffffffa039179d>] jbd2_journal_dirty_metadata+0x10d/0x150 [jbd2]
 RSP <ffff8801fa26da00>
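For reference, this is roughly how the layout check was started and how we tried to stop it, as a minimal sketch run on the MDS. The MDT target name play01-MDT0000 is an assumption inferred from the OST names in the log above and is not confirmed in this report.

# Start the layout LFSCK on the MDS; this is the step that triggers the OSS crashes
lctl lfsck_start -M play01-MDT0000 -t layout

# Attempt to stop it again; in the state described above this command just hangs
lctl lfsck_stop -M play01-MDT0000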
We have a vmcore file on one of the servers, which we can upload if required.
After failing over the MDT and recovering the OSTs, I can stop the lfsck layout check.
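A minimal sketch of how the layout LFSCK state can be checked after the stop, again assuming the MDT target name play01-MDT0000:

# Query the layout LFSCK state on the MDS; a status of "stopped" or "completed"
# indicates the scan is no longer running
lctl get_param -n mdd.play01-MDT0000.lfsck_layout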