Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.4.0
-
Lustre Branch: master
Lustre Build: http://build.whamcloud.com/job/lustre-master/1396/
Distro/Arch: RHEL6.3/x86_64
Test Group: failover
FAILURE_MODE=HARD
-
3
-
7697
Description
While running recovery-random-scale (which fails one random client and then fails the MDS), the dd operation on the other live client (wtm-4vm5) hung and the client crashed:
2013-04-11 03:01:35: dd run starting
+ mkdir -p /mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com
+ /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com
+ cd /mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com
++ /usr/bin/lfs df /mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com
+ FREE_SPACE=97381848
+ BLKS=21910915
+ echo 'Free disk space is 97381848, 4k blocks to dd is 21910915'
+ load_pid=2634
+ wait 2634
+ dd bs=4k count=21910915 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com/dd-file
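For reference, the block count in the trace above is consistent with dd being asked to write roughly 90% of the reported free space in 4 KB blocks. The formula in this sketch is an assumption inferred from the two numbers in the trace (with FREE_SPACE taken to be in KB, the default unit of lfs df); it is not quoted from the test script.

```shell
# Value copied from the trace above; assumed to be in KB.
FREE_SPACE=97381848

# Assumed formula: 90% of free space, divided into 4 KB blocks,
# i.e. FREE_SPACE * 0.9 / 4, done in integer arithmetic as * 9 / 40.
BLKS=$((FREE_SPACE * 9 / 40))

echo "Free disk space is $FREE_SPACE, 4k blocks to dd is $BLKS"
```

With these inputs the integer division yields 21910915, matching the count passed to dd, so the test was attempting to fill most of the filesystem when the client crashed.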
Console log on wtm-4vm5:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
IP: [<ffffffffa05b4b4e>] cl_object_top+0xe/0x150 [obdclass]
PGD 7b01e067 PUD 7b016067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/possible
CPU 0
Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) nfsd exportfs autofs4 nfs lockd fscache nfs_acl auth_rpcgss sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
Pid: 2683, comm: flush-lustre-1 Not tainted 2.6.32-279.19.1.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffffa05b4b4e>] [<ffffffffa05b4b4e>] cl_object_top+0xe/0x150 [obdclass]
RSP: 0018:ffff88004c3c5980 EFLAGS: 00010282
RAX: ffff88007bc75800 RBX: ffff88007d1c21e8 RCX: 0000000000000098
RDX: ffff88003e2bb200 RSI: ffffffffa0602400 RDI: 0000000000000098
RBP: ffff88004c3c5990 R08: 0000000000000001 R09: 0000000000000000
R10: 000000000000000f R11: 000000000000000f R12: ffff88007d1bc3d0
R13: 0000000000000004 R14: 0000000000000098 R15: ffff88007bc75800
FS: 0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000098 CR3: 000000007b02d000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process flush-lustre-1 (pid: 2683, threadinfo ffff88004c3c4000, task ffff88007b730080)
Stack:
 ffff88004c3c5990 ffff88007d1c21e8 ffff88004c3c59d0 ffffffffa05c575d
<d> 0000000000000000 ffff8800430e3c00 ffff88007d1bc378 0000000000000000
<d> ffff88007d1bf768 ffff88007b020e80 ffff88004c3c5a30 ffffffffa09d3488
Call Trace:
 [<ffffffffa05c575d>] cl_io_sub_init+0x3d/0xc0 [obdclass]
 [<ffffffffa09d3488>] lov_sub_get+0x218/0x690 [lov]
 [<ffffffffa09d5116>] lov_io_iter_init+0xd6/0x480 [lov]
 [<ffffffffa05c279d>] cl_io_iter_init+0x5d/0x110 [obdclass]
 [<ffffffffa05c6d3c>] cl_io_loop+0x4c/0x1b0 [obdclass]
 [<ffffffffa0a5233b>] cl_sync_file_range+0x2fb/0x4e0 [lustre]
 [<ffffffffa0a7ba7f>] ll_writepages+0x6f/0x1a0 [lustre]
 [<ffffffff811255d1>] do_writepages+0x21/0x40
 [<ffffffff8119fe8d>] writeback_single_inode+0xdd/0x290
 [<ffffffff811a029e>] writeback_sb_inodes+0xce/0x180
 [<ffffffff811a03fb>] writeback_inodes_wb+0xab/0x1b0
 [<ffffffff811a079b>] wb_writeback+0x29b/0x3f0
 [<ffffffff814e9c50>] ? thread_return+0x4e/0x76e
 [<ffffffff8107d572>] ? del_timer_sync+0x22/0x30
 [<ffffffff811a0a89>] wb_do_writeback+0x199/0x240
 [<ffffffff811a0b93>] bdi_writeback_task+0x63/0x1b0
 [<ffffffff81090857>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff81134170>] ? bdi_start_fn+0x0/0x100
 [<ffffffff811341f6>] bdi_start_fn+0x86/0x100
 [<ffffffff81134170>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81090626>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff81090590>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Code: 04 00 00 00 04 00 e8 52 b7 e8 ff 48 c7 c7 60 2b 60 a0 e8 16 b3 e7 ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 <48> 8b 07 0f 1f 80 00 00 00 00 48 89 c2 48 8b 80 88 00 00 00 48
RIP [<ffffffffa05b4b4e>] cl_object_top+0xe/0x150 [obdclass]
 RSP <ffff88004c3c5980>
Maloo report: https://maloo.whamcloud.com/test_sets/c1c906c6-a294-11e2-81ba-52540035b04c
The console logs in the above Maloo report were not gathered completely due to TT-1107. Please refer to the attachment for the full console logs.