[LU-3160] recovery-random-scale test_fail_client_mds: RIP: cl_object_top+0xe/0x150 [obdclass]

Details


    Description

      While running recovery-random-scale (failing one random client and then failing the MDS), the dd operation on the other live client (wtm-4vm5) hung and the client crashed:

      2013-04-11 03:01:35: dd run starting
      + mkdir -p /mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com
      + /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com
      + cd /mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com
      ++ /usr/bin/lfs df /mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com
      + FREE_SPACE=97381848
      + BLKS=21910915
      + echo 'Free disk space is 97381848, 4k blocks to dd is 21910915'
      + load_pid=2634
      + wait 2634
      + dd bs=4k count=21910915 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com/dd-file
      

      Console log on wtm-4vm5:

      BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
      IP: [<ffffffffa05b4b4e>] cl_object_top+0xe/0x150 [obdclass]
      PGD 7b01e067 PUD 7b016067 PMD 0
      Oops: 0000 [#1] SMP
      last sysfs file: /sys/devices/system/cpu/possible
      CPU 0
      Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) nfsd exportfs autofs4 nfs lockd fscache nfs_acl auth_rpcgss sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
      
      Pid: 2683, comm: flush-lustre-1 Not tainted 2.6.32-279.19.1.el6.x86_64 #1 Red Hat KVM
      RIP: 0010:[<ffffffffa05b4b4e>]  [<ffffffffa05b4b4e>] cl_object_top+0xe/0x150 [obdclass]
      RSP: 0018:ffff88004c3c5980  EFLAGS: 00010282
      RAX: ffff88007bc75800 RBX: ffff88007d1c21e8 RCX: 0000000000000098
      RDX: ffff88003e2bb200 RSI: ffffffffa0602400 RDI: 0000000000000098
      RBP: ffff88004c3c5990 R08: 0000000000000001 R09: 0000000000000000
      R10: 000000000000000f R11: 000000000000000f R12: ffff88007d1bc3d0
      R13: 0000000000000004 R14: 0000000000000098 R15: ffff88007bc75800
      FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 0000000000000098 CR3: 000000007b02d000 CR4: 00000000000006f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process flush-lustre-1 (pid: 2683, threadinfo ffff88004c3c4000, task ffff88007b730080)
      Stack:
       ffff88004c3c5990 ffff88007d1c21e8 ffff88004c3c59d0 ffffffffa05c575d
      <d> 0000000000000000 ffff8800430e3c00 ffff88007d1bc378 0000000000000000
      <d> ffff88007d1bf768 ffff88007b020e80 ffff88004c3c5a30 ffffffffa09d3488
      Call Trace:
       [<ffffffffa05c575d>] cl_io_sub_init+0x3d/0xc0 [obdclass]
       [<ffffffffa09d3488>] lov_sub_get+0x218/0x690 [lov]
       [<ffffffffa09d5116>] lov_io_iter_init+0xd6/0x480 [lov]
       [<ffffffffa05c279d>] cl_io_iter_init+0x5d/0x110 [obdclass]
       [<ffffffffa05c6d3c>] cl_io_loop+0x4c/0x1b0 [obdclass]
       [<ffffffffa0a5233b>] cl_sync_file_range+0x2fb/0x4e0 [lustre]
       [<ffffffffa0a7ba7f>] ll_writepages+0x6f/0x1a0 [lustre]
       [<ffffffff811255d1>] do_writepages+0x21/0x40
       [<ffffffff8119fe8d>] writeback_single_inode+0xdd/0x290
       [<ffffffff811a029e>] writeback_sb_inodes+0xce/0x180
       [<ffffffff811a03fb>] writeback_inodes_wb+0xab/0x1b0
       [<ffffffff811a079b>] wb_writeback+0x29b/0x3f0
       [<ffffffff814e9c50>] ? thread_return+0x4e/0x76e
       [<ffffffff8107d572>] ? del_timer_sync+0x22/0x30
       [<ffffffff811a0a89>] wb_do_writeback+0x199/0x240
       [<ffffffff811a0b93>] bdi_writeback_task+0x63/0x1b0
       [<ffffffff81090857>] ? bit_waitqueue+0x17/0xd0
       [<ffffffff81134170>] ? bdi_start_fn+0x0/0x100
       [<ffffffff811341f6>] bdi_start_fn+0x86/0x100
       [<ffffffff81134170>] ? bdi_start_fn+0x0/0x100
       [<ffffffff81090626>] kthread+0x96/0xa0
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffff81090590>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      Code: 04 00 00 00 04 00 e8 52 b7 e8 ff 48 c7 c7 60 2b 60 a0 e8 16 b3 e7 ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 <48> 8b 07 0f 1f 80 00 00 00 00 48 89 c2 48 8b 80 88 00 00 00 48
      RIP  [<ffffffffa05b4b4e>] cl_object_top+0xe/0x150 [obdclass]
       RSP <ffff88004c3c5980>
      

      Maloo report: https://maloo.whamcloud.com/test_sets/c1c906c6-a294-11e2-81ba-52540035b04c

      The console logs in the above Maloo report were not gathered completely due to TT-1107. Please refer to the attachment for the full console logs.

      Attachments

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.4


          jay Jinshan Xiong (Inactive) added a comment -

          This should be a problem with force umount: it causes a failure when refreshing the layout. I have updated the patch; please take a look.

          jay Jinshan Xiong (Inactive) added a comment - edited

          Do you know why we didn't flush dirty data on layout lock revocation in the beginning? Thanks.

          We don't want to introduce a cascading problem: if one of the OSTs is inaccessible, the client will be evicted by the MDT because it can't cancel the layout lock in time.

          If what you said were true, we should have seen this problem in migration, because we're using an orphan object to restripe. I will take a look at this.

          niu Niu Yawei (Inactive) added a comment -

          It looks like we chose to flush dirty data when changing the layout but not on layout lock revocation. I'm not sure that was the best choice, but the problem can be worked around another way: ignore the -ENOENT error on layout refresh for IO.
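
          For illustration only, a minimal userspace sketch of the error-handling pattern proposed above (this is not Lustre code; refresh_layout() and flush_dirty_pages() are made-up stand-ins): if the layout refresh fails with -ENOENT because the file was already unlinked on the MDS, there is nothing left worth flushing, so the writeback path treats it as a no-op instead of failing or asserting.

          #include <errno.h>
          #include <stdio.h>

          /* Stand-in for fetching the layout from the server; here it always
           * reports that the file is gone, as after an unlink on the MDS. */
          static int refresh_layout(void)
          {
                  return -ENOENT;
          }

          /* Stand-in for the writeback path: -ENOENT from the layout refresh
           * means the file no longer exists, so drop the dirty data and report
           * success instead of failing (or asserting in) the flush. */
          static int flush_dirty_pages(void)
          {
                  int rc = refresh_layout();

                  if (rc == -ENOENT) {
                          printf("layout gone (-ENOENT): discarding dirty pages\n");
                          return 0;
                  }
                  return rc;
          }

          int main(void)
          {
                  return flush_dirty_pages() ? 1 : 0;
          }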

          niu Niu Yawei (Inactive) added a comment -

          Hi, Xiong

          Why don't we flush dirty data before losing the layout lock (see ll_md_blocking_ast())?

          This patch caused an LBUG when running conf-sanity test_0:

          • the client writes data to a file, then unlinks it;
          • the layout lock is revoked, but the dirty data is not flushed;
          • on client umount, the kernel tries to flush the dirty data back by calling ll_writepages(), which now verifies the layout (what this patch added), but the layout fetch will definitely fail because the file has already been removed on the MDS;
          • the dirty data is never flushed back in the end, and the LBUG is triggered.

          Do you know why we didn't flush dirty data on layout lock revocation in the beginning? Thanks.

          niu Niu Yawei (Inactive) added a comment -

          It looks like the patch revealed another problem: conf-sanity test_0 will always hit an LBUG (see LU-3230). I'm looking into it.

          niu Niu Yawei (Inactive) added a comment -

          http://review.whamcloud.com/6154

          jay Jinshan Xiong (Inactive) added a comment -

          Yes, this is exactly what we're doing. However, there are some cases where we need to start an IO at the OSC layer from a context that may already hold conf_lock, so taking the layout lock again could deadlock. ci_ignore_layout was introduced to indicate whether the IO needs to hold the layout lock to start. Usually, if the IO operates on pages we're pretty safe, because a layout change has to clean up the pages first.

          For the cl_sync_file_range() case, I set ci_ignore_layout to 1 because I thought that if dirty pages existed, the layout would not be changed. Obviously I was wrong, because there is a race.

          For this specific problem, it can be fixed by setting ci_ignore_layout to 0. Please also check the other callers of cl_sync_file_range() to make sure this holds.
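
          As an aside, a minimal userspace sketch of the deadlock concern described above (not Lustre code: conf_lock here is just a pthread rwlock, and toy_io/ignore_layout are made-up analogues of the per-IO flag): an IO started from a context that already holds the configuration lock must skip re-taking it, which is what a flag like this allows.

          #include <pthread.h>
          #include <stdio.h>

          static pthread_rwlock_t conf_lock = PTHREAD_RWLOCK_INITIALIZER;

          struct toy_io {
                  int ignore_layout;      /* 1: caller already holds conf_lock */
          };

          static void io_start(struct toy_io *io)
          {
                  if (!io->ignore_layout)
                          pthread_rwlock_rdlock(&conf_lock);      /* normal IO path */

                  printf("io running, ignore_layout=%d\n", io->ignore_layout);

                  if (!io->ignore_layout)
                          pthread_rwlock_unlock(&conf_lock);
          }

          /* A layout change holds the lock for write and then starts a nested IO;
           * re-taking this non-recursive lock from the same thread could
           * self-deadlock, so the nested IO sets the flag instead. */
          static void layout_change(void)
          {
                  struct toy_io nested = { .ignore_layout = 1 };

                  pthread_rwlock_wrlock(&conf_lock);
                  io_start(&nested);
                  pthread_rwlock_unlock(&conf_lock);
          }

          int main(void)
          {
                  struct toy_io normal = { .ignore_layout = 0 };

                  io_start(&normal);
                  layout_change();
                  return 0;
          }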

          niu Niu Yawei (Inactive) added a comment -

          Hi, Xiong

          I don't see why we didn't use lo_type_guard to protect the layout in the first place; it looks simpler and safer than the current way (active_ios & waitq).

          • Take the read lock for IO: call lov_conf_freeze() in lov_io_init_raid0() and lov_conf_thaw() in lov_io_fini();
          • Take the write lock for a layout change: call lov_conf_lock() & lov_conf_unlock() in lov_layout_change() to make the "layout delete -> layout reinstall" sequence atomic.

          Did I miss anything?
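
          For reference, a minimal userspace sketch of the rwlock scheme proposed in the comment above (not Lustre code; type_guard, io_thread() and layout_change() are only analogues of lo_type_guard and the lov_conf_*() calls): IO holds the lock for read across its lifetime, and a layout change holds it for write so the "layout delete -> layout reinstall" step is atomic with respect to running IOs.

          #include <pthread.h>
          #include <stdio.h>

          static pthread_rwlock_t type_guard = PTHREAD_RWLOCK_INITIALIZER;
          static const char *layout = "raid0, 3 stripes";

          /* IO path: analogue of lov_conf_freeze() in lov_io_init_raid0() and
           * lov_conf_thaw() in lov_io_fini(). While the read lock is held the
           * layout can neither change nor disappear underneath the IO. */
          static void *io_thread(void *arg)
          {
                  (void)arg;
                  for (int i = 0; i < 5; i++) {
                          pthread_rwlock_rdlock(&type_guard);
                          printf("io %d sees layout: %s\n", i, layout);
                          pthread_rwlock_unlock(&type_guard);
                  }
                  return NULL;
          }

          /* Layout change: analogue of lov_conf_lock()/lov_conf_unlock() in
           * lov_layout_change(); readers never observe the intermediate
           * "deleted" state. */
          static void layout_change(const char *new_layout)
          {
                  pthread_rwlock_wrlock(&type_guard);
                  layout = NULL;          /* "layout delete" ... */
                  layout = new_layout;    /* ... "layout reinstall" */
                  pthread_rwlock_unlock(&type_guard);
          }

          int main(void)
          {
                  pthread_t t;

                  pthread_create(&t, NULL, io_thread, NULL);
                  layout_change("raid0, 4 stripes");
                  pthread_join(t, NULL);
                  return 0;
          }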

          People

            Assignee:
            niu Niu Yawei (Inactive)
            Reporter:
            yujian Jian Yu
            Votes:
            0
            Watchers:
            12
