[LU-3160] recovery-random-scale test_fail_client_mds: RIP: cl_object_top+0xe/0x150 [obdclass] Created: 12/Apr/13  Updated: 02/May/13  Resolved: 02/May/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: MB
Environment:

Lustre Branch: master
Lustre Build: http://build.whamcloud.com/job/lustre-master/1396/
Distro/Arch: RHEL6.3/x86_64
Test Group: failover
FAILURE_MODE=HARD


Attachments: File recovery-random-scale.test_fail_client_mds.console.tar.bz2    
Severity: 3
Rank (Obsolete): 7697

 Description   

While running recovery-random-scale (failing one random client and then failing the MDS), the dd operation on the other live client (wtm-4vm5) hung and the client crashed:

2013-04-11 03:01:35: dd run starting
+ mkdir -p /mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com
+ /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com
+ cd /mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com
++ /usr/bin/lfs df /mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com
+ FREE_SPACE=97381848
+ BLKS=21910915
+ echo 'Free disk space is 97381848, 4k blocks to dd is 21910915'
+ load_pid=2634
+ wait 2634
+ dd bs=4k count=21910915 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-wtm-4vm5.rosso.whamcloud.com/dd-file

Console log on wtm-4vm5:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
IP: [<ffffffffa05b4b4e>] cl_object_top+0xe/0x150 [obdclass]
PGD 7b01e067 PUD 7b016067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/possible
CPU 0
Modules linked in: lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) nfsd exportfs autofs4 nfs lockd fscache nfs_acl auth_rpcgss sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]

Pid: 2683, comm: flush-lustre-1 Not tainted 2.6.32-279.19.1.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffffa05b4b4e>]  [<ffffffffa05b4b4e>] cl_object_top+0xe/0x150 [obdclass]
RSP: 0018:ffff88004c3c5980  EFLAGS: 00010282
RAX: ffff88007bc75800 RBX: ffff88007d1c21e8 RCX: 0000000000000098
RDX: ffff88003e2bb200 RSI: ffffffffa0602400 RDI: 0000000000000098
RBP: ffff88004c3c5990 R08: 0000000000000001 R09: 0000000000000000
R10: 000000000000000f R11: 000000000000000f R12: ffff88007d1bc3d0
R13: 0000000000000004 R14: 0000000000000098 R15: ffff88007bc75800
FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000098 CR3: 000000007b02d000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process flush-lustre-1 (pid: 2683, threadinfo ffff88004c3c4000, task ffff88007b730080)
Stack:
 ffff88004c3c5990 ffff88007d1c21e8 ffff88004c3c59d0 ffffffffa05c575d
<d> 0000000000000000 ffff8800430e3c00 ffff88007d1bc378 0000000000000000
<d> ffff88007d1bf768 ffff88007b020e80 ffff88004c3c5a30 ffffffffa09d3488
Call Trace:
 [<ffffffffa05c575d>] cl_io_sub_init+0x3d/0xc0 [obdclass]
 [<ffffffffa09d3488>] lov_sub_get+0x218/0x690 [lov]
 [<ffffffffa09d5116>] lov_io_iter_init+0xd6/0x480 [lov]
 [<ffffffffa05c279d>] cl_io_iter_init+0x5d/0x110 [obdclass]
 [<ffffffffa05c6d3c>] cl_io_loop+0x4c/0x1b0 [obdclass]
 [<ffffffffa0a5233b>] cl_sync_file_range+0x2fb/0x4e0 [lustre]
 [<ffffffffa0a7ba7f>] ll_writepages+0x6f/0x1a0 [lustre]
 [<ffffffff811255d1>] do_writepages+0x21/0x40
 [<ffffffff8119fe8d>] writeback_single_inode+0xdd/0x290
 [<ffffffff811a029e>] writeback_sb_inodes+0xce/0x180
 [<ffffffff811a03fb>] writeback_inodes_wb+0xab/0x1b0
 [<ffffffff811a079b>] wb_writeback+0x29b/0x3f0
 [<ffffffff814e9c50>] ? thread_return+0x4e/0x76e
 [<ffffffff8107d572>] ? del_timer_sync+0x22/0x30
 [<ffffffff811a0a89>] wb_do_writeback+0x199/0x240
 [<ffffffff811a0b93>] bdi_writeback_task+0x63/0x1b0
 [<ffffffff81090857>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff81134170>] ? bdi_start_fn+0x0/0x100
 [<ffffffff811341f6>] bdi_start_fn+0x86/0x100
 [<ffffffff81134170>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81090626>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff81090590>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Code: 04 00 00 00 04 00 e8 52 b7 e8 ff 48 c7 c7 60 2b 60 a0 e8 16 b3 e7 ff 66 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 <48> 8b 07 0f 1f 80 00 00 00 00 48 89 c2 48 8b 80 88 00 00 00 48
RIP  [<ffffffffa05b4b4e>] cl_object_top+0xe/0x150 [obdclass]
 RSP <ffff88004c3c5980>

Maloo report: https://maloo.whamcloud.com/test_sets/c1c906c6-a294-11e2-81ba-52540035b04c

The console logs in the above Maloo report were not gathered completely due to TT-1107. Please refer to the attachment for the full console logs.



 Comments   
Comment by Jian Yu [ 12/Apr/13 ]

The issue was hit consistently on the master branch:
https://maloo.whamcloud.com/test_sets/70214812-a1bb-11e2-bdac-52540035b04c

Console log on client client-32vm5:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
IP: [<ffffffffa05204ce>] cl_object_top+0xe/0x150 [obdclass]
PGD 7cd87067 PUD 79952067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/module/lockd/initstate
CPU 0
Modules linked in: nfs fscache lmv(U) mgc(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic libcfs(U) nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]

Pid: 2532, comm: flush-lustre-1 Not tainted 2.6.32-279.19.1.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffffa05204ce>]  [<ffffffffa05204ce>] cl_object_top+0xe/0x150 [obdclass]
RSP: 0018:ffff880079f91980  EFLAGS: 00010282
RAX: ffff88006e9e5000 RBX: ffff88007a179140 RCX: 0000000000000098
RDX: ffff88002a683000 RSI: ffffffffa056db40 RDI: 0000000000000098
RBP: ffff880079f91990 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007a173560
R13: 0000000000000004 R14: 0000000000000098 R15: ffff88006e9e5000
FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000098 CR3: 000000007a3d6000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process flush-lustre-1 (pid: 2532, threadinfo ffff880079f90000, task ffff88007a34aae0)
Stack:
 ffff880079f91990 ffff88007a179140 ffff880079f919d0 ffffffffa05310dd
<d> 0000000000000000 ffff880027ceca00 ffff88007a173508 0000000000000000
<d> ffff8800511257a8 ffff88007a70ee80 ffff880079f91a30 ffffffffa093aab8
Call Trace:
 [<ffffffffa05310dd>] cl_io_sub_init+0x3d/0xc0 [obdclass]
 [<ffffffffa093aab8>] lov_sub_get+0x218/0x690 [lov]
 [<ffffffffa093c746>] lov_io_iter_init+0xd6/0x480 [lov]
 [<ffffffffa052e11d>] cl_io_iter_init+0x5d/0x110 [obdclass]
 [<ffffffffa05326bc>] cl_io_loop+0x4c/0x1b0 [obdclass]
 [<ffffffffa09b933b>] cl_sync_file_range+0x2fb/0x4e0 [lustre]
 [<ffffffffa09e277f>] ll_writepages+0x6f/0x1a0 [lustre]
 [<ffffffff811255d1>] do_writepages+0x21/0x40
 [<ffffffff8119fe8d>] writeback_single_inode+0xdd/0x290
 [<ffffffff811a029e>] writeback_sb_inodes+0xce/0x180
 [<ffffffff811a03fb>] writeback_inodes_wb+0xab/0x1b0
 [<ffffffff811a079b>] wb_writeback+0x29b/0x3f0
 [<ffffffff814e9c50>] ? thread_return+0x4e/0x76e
 [<ffffffff8107d572>] ? del_timer_sync+0x22/0x30
 [<ffffffff811a0a89>] wb_do_writeback+0x199/0x240
 [<ffffffff811a0b93>] bdi_writeback_task+0x63/0x1b0
 [<ffffffff81090857>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff81134170>] ? bdi_start_fn+0x0/0x100
 [<ffffffff811341f6>] bdi_start_fn+0x86/0x100
 [<ffffffff81134170>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81090626>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff81090590>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Comment by Nathaniel Clark [ 12/Apr/13 ]

This crash comes from inside an LASSERT:

int cl_io_sub_init(const struct lu_env *env, struct cl_io *io,
                   enum cl_io_type iot, struct cl_object *obj)
{
        struct cl_thread_info *info = cl_env_info(env);

        LASSERT(obj != cl_object_top(obj));
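        /* The oops fires in the LASSERT above: cl_object_top() immediately
         * dereferences obj->co_lu.lo_header, so a near-NULL obj (0x98 here)
         * faults before the comparison is ever made. */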
        if (info->clt_current_io == NULL)
                info->clt_current_io = io;
        return cl_io_init0(env, io, iot, obj);
}

obj is either NULL or very close to NULL (a small field offset computed from a struct pointer that was NULL).

Comment by Zhenyu Xu [ 15/Apr/13 ]
00000000000617d0 <cl_object_top>:
cl_object_top():
/root/work/lustre/lustre/obdclass/cl_object.c:162
   617d0:       55                      push   %rbp
   617d1:       48 89 e5                mov    %rsp,%rbp
   617d4:       53                      push   %rbx
   617d5:       48 83 ec 08             sub    $0x8,%rsp
   617d9:       e8 00 00 00 00          callq  617de <cl_object_top+0xe>
luh2coh():
/root/work/lustre/lustre/include/cl_object.h:2691
   617de:       48 8b 07                mov    (%rdi),%rax              // <==============
   617e1:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

which is

static inline struct cl_object_header *luh2coh(const struct lu_object_header *h)
{
        return container_of0(h, struct cl_object_header, coh_lu);
}

comes from

 
static inline
struct cl_object_header *cl_object_header(const struct cl_object *obj)
{
        return luh2coh(obj->co_lu.lo_header);
}

The RDI content is 0000000000000098, which does not look like a use-after-free; it looks more like memory corruption.

Comment by Jian Yu [ 16/Apr/13 ]

Lustre Branch: master
Lustre Build: http://build.whamcloud.com/job/lustre-master/1409/
Distro/Arch: RHEL6.4/x86_64
Test Group: failover
FAILURE_MODE=HARD

recovery-mds-scale test_failover_mds also failed with the same issue:
https://maloo.whamcloud.com/test_sets/f41cb000-a668-11e2-9b48-52540035b04c

Comment by Jian Yu [ 19/Apr/13 ]

Lustre Branch: master
Lustre Build: http://build.whamcloud.com/job/lustre-master/1411/
Distro/Arch: RHEL6.3/x86_64
Test Group: failover
FAILURE_MODE=HARD

The issue consistently occurred while running recovery-random-scale test:
https://maloo.whamcloud.com/test_sets/6b1ef8d6-a702-11e2-90ad-52540035b04c
https://maloo.whamcloud.com/test_sets/3827c09a-a87c-11e2-ba78-52540035b04c

Comment by Jian Yu [ 19/Apr/13 ]

Lustre Branch: master
Lustre Build: http://build.whamcloud.com/job/lustre-master/1406/
Distro/Arch: RHEL6.3/x86_64
Test Group: failover
FAILURE_MODE=HARD

The issue still occurred while running recovery-mds-scale test_failover_mds:
https://maloo.whamcloud.com/test_sets/97ebea9c-a890-11e2-9f50-52540035b04c

The console logs in the above report are complete.

The corresponding vmcore file is /scratch/logs/2.4.0/LU-3160/127.0.0.1-2013-04-18-08:47:21/wtm-77-vmcore on brent node.

Comment by Nathaniel Clark [ 23/Apr/13 ]

A snippet from lov_io.c::lov_io_sub_init():

                sub_obj = lovsub2cl(lov_r0(lov)->lo_sub[stripe]);
                sub_io  = sub->sub_io;

                sub_io->ci_obj    = sub_obj;
                sub_io->ci_result = 0;

                sub_io->ci_parent  = io;
                sub_io->ci_lockreq = io->ci_lockreq;
                sub_io->ci_type    = io->ci_type;
                sub_io->ci_no_srvlock = io->ci_no_srvlock;

                lov_sub_enter(sub);
                result = cl_io_sub_init(sub->sub_env, sub_io,
                                        io->ci_type, sub_obj);

The code lov_r0(lov)->lo_sub[stripe] returns NULL, then lovsub2cl() adds 0x98 to it, producing the bad pointer that cl_object_top dereferences.

So it is sub_obj that is bad, which means lov_r0(lov)->lo_sub[stripe] was never initialized (or has not yet been initialized).
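
To illustrate the mechanism, here is a standalone sketch; the struct layouts and the sub2cl() helper are made up for illustration, and the real lovsub_object/cl_object differ:

#include <stddef.h>
#include <stdio.h>

struct cl_object { void *lo_header; };            /* stand-in layout */
struct lovsub_object {
        char             pad[0x98];               /* assume 0x98 bytes precede the member */
        struct cl_object lso_cl;                  /* embedded cl_object */
};

/* container_of-style conversion, as lovsub2cl() does: it computes
 * base + offset with no NULL check, so a NULL sub-object becomes
 * (struct cl_object *)0x98 rather than NULL */
static struct cl_object *sub2cl(struct lovsub_object *los)
{
        return (struct cl_object *)((char *)los +
                                    offsetof(struct lovsub_object, lso_cl));
}

int main(void)
{
        struct lovsub_object *sub = NULL;         /* lov_r0(lov)->lo_sub[stripe] */
        struct cl_object *obj = sub2cl(sub);

        printf("obj = %p\n", (void *)obj);        /* prints 0x98 */
        /* dereferencing obj->lo_header would now fault at address 0x98,
         * matching the CR2/RDI value in the oops */
        return 0;
}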

Comment by Peter Jones [ 23/Apr/13 ]

Niu

Can you please help with this one?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 24/Apr/13 ]

I suspect this is caused by a race between layout change and kernel writeback. Although we cancel locks and flush dirty pages before destroying the layout on a layout change (to avoid concurrent IO), the kernel writeback thread can still try to write back the inode, since it was already on the writeback list. This writeback will not flush any dirty pages (they should all have been flushed before the layout change), but the call chain ll_writepages() -> cl_sync_file_range() -> cl_io_loop() -> cl_io_iter_init() is enough to trigger the assertion.
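
Schematically, the suspected interleaving would look like this (a sketch reconstructed from the traces, not taken from a log):

dd thread (changing layout)             flush-lustre thread (writeback)
---------------------------             -------------------------------
cancel locks, flush dirty pages         inode is already queued on the
lov_conf_set():                         bdi writeback list
  delete old layout;                    wb_writeback()
  lo_sub[stripe] becomes NULL             -> ll_writepages()
(new layout not yet installed)            -> cl_sync_file_range()
                                          -> lov_io_iter_init() reads
                                             lo_sub[stripe] == NULL -> oops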

Unfortunately there is no client log at all, so we don't know whether the layout was changed during testing.

Comment by Jinshan Xiong (Inactive) [ 24/Apr/13 ]

yujian, scratch is not accessible, please copy the vmcore to my home directory at brent if you have one.

Comment by Jian Yu [ 24/Apr/13 ]

yujian, scratch is not accessible, please copy the vmcore to my home directory at brent if you have one.

Hi Jinshan, the logs are copied to /home/jay/test_logs/LU-3160.

Comment by Jinshan Xiong (Inactive) [ 24/Apr/13 ]

Niu, I think you're absolutely right about this. I extracted the stack traces from the vmcore; here is another process:

PID: 11897  TASK: ffff8808365a3500  CPU: 25  COMMAND: "dd"
 #0 [ffff88044e527e90] crash_nmi_callback at ffffffff81029796
 #1 [ffff88044e527ea0] notifier_call_chain at ffffffff814ef745
 #2 [ffff88044e527ee0] atomic_notifier_call_chain at ffffffff814ef7aa
 #3 [ffff88044e527ef0] notify_die at ffffffff810969ae
 #4 [ffff88044e527f20] do_nmi at ffffffff814ed3c3
 #5 [ffff88044e527f50] nmi at ffffffff814eccd0
    [exception RIP: strrchr+23]
    RIP: ffffffff81270227  RSP: ffff8808344b3268  RFLAGS: 00000202
    RAX: ffffffffa05c21bb  RBX: ffff880418403200  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 000000000000002f  RDI: ffffffffa05c21b0
    RBP: ffff8808344b3268   R8: 0000000000000073   R9: 00000000fffffffc
    R10: 0000000000000001  R11: 000000000000000f  R12: ffff880833a92f08
    R13: ffff88041d764f38  R14: ffff880418403200  R15: ffffffffa060ed80
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff8808344b3268] strrchr at ffffffff81270227
 #7 [ffff8808344b3270] libcfs_debug_vmsg2 at ffffffffa045c76a [libcfs]
 #8 [ffff8808344b33e0] libcfs_debug_msg at ffffffffa045d2c1 [libcfs]
 #9 [ffff8808344b3440] cl_page_put at ffffffffa05a056d [obdclass]
#10 [ffff8808344b34b0] cl_page_delete0 at ffffffffa05a10eb [obdclass]
#11 [ffff8808344b34f0] cl_page_delete at ffffffffa05a1412 [obdclass]
#12 [ffff8808344b3510] ll_invalidatepage at ffffffffa0a759bd [lustre]
#13 [ffff8808344b3550] vvp_page_discard at ffffffffa0a87cdc [lustre]
#14 [ffff8808344b3580] cl_page_invoid at ffffffffa059cf98 [obdclass]
#15 [ffff8808344b35d0] cl_page_discard at ffffffffa059d0a3 [obdclass]
#16 [ffff8808344b35e0] discard_cb at ffffffffa05a4a84 [obdclass]
#17 [ffff8808344b3620] cl_page_gang_lookup at ffffffffa05a1f14 [obdclass]
#18 [ffff8808344b36d0] cl_lock_discard_pages at ffffffffa05a489e [obdclass]
#19 [ffff8808344b3720] osc_lock_flush at ffffffffa091aa6f [osc]
#20 [ffff8808344b3780] osc_lock_cancel at ffffffffa091acd7 [osc]
#21 [ffff8808344b37d0] cl_lock_cancel0 at ffffffffa05a2735 [obdclass]
#22 [ffff8808344b3800] cl_lock_cancel at ffffffffa05a32db [obdclass]
#23 [ffff8808344b3820] cl_locks_prune at ffffffffa05a6bd3 [obdclass]
#24 [ffff8808344b38c0] lov_delete_raid0 at ffffffffa09aa7dc [lov]
#25 [ffff8808344b3970] lov_conf_set at ffffffffa09ab1fb [lov]
#26 [ffff8808344b39e0] cl_conf_set at ffffffffa059b298 [obdclass]
#27 [ffff8808344b3a10] ll_layout_conf at ffffffffa0a2e1b8 [lustre]
#28 [ffff8808344b3a50] ll_layout_lock_set at ffffffffa0a3bced [lustre]
#29 [ffff8808344b3b40] ll_layout_refresh at ffffffffa0a3f61b [lustre]
#30 [ffff8808344b3c90] vvp_io_init at ffffffffa0a8b61f [lustre]
#31 [ffff8808344b3cd0] cl_io_init0 at ffffffffa05a98e8 [obdclass]
#32 [ffff8808344b3d10] cl_io_init at ffffffffa05ac6a4 [obdclass]
#33 [ffff8808344b3d50] cl_io_rw_init at ffffffffa05adf64 [obdclass]
#34 [ffff8808344b3da0] ll_file_io_generic at ffffffffa0a31598 [lustre]
#35 [ffff8808344b3e20] ll_file_aio_write at ffffffffa0a32c12 [lustre]
#36 [ffff8808344b3e80] ll_file_write at ffffffffa0a32efc [lustre]
#37 [ffff8808344b3ef0] vfs_write at ffffffff81176588
#38 [ffff8808344b3f30] sys_write at ffffffff81176e81
#39 [ffff8808344b3f80] system_call_fastpath at ffffffff8100b072
    RIP: 00000030690dae60  RSP: 00007fffd66dfbc8  RFLAGS: 00000206
    RAX: 0000000000000001  RBX: ffffffff8100b072  RCX: 00000030690dae60
    RDX: 0000000000001000  RSI: 00000000019e3000  RDI: 0000000000000001
    RBP: 00000000019e3000   R8: 000000306938eee8   R9: 0000000000000001
    R10: 0000000000003003  R11: 0000000000000246  R12: 00000000019e2fff
    R13: 0000000000000000  R14: 0000000000001000  R15: 0000000000000000
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

It is changing the layout. However, I don't know why this happened, and unfortunately I can't extract the Lustre log.

Comment by Niu Yawei (Inactive) [ 24/Apr/13 ]

Hi, Xiong

I don't see why we didn't use lo_type_guard to protect the layout from the beginning; it looks simpler and safer than the current approach (active_ios & waitq).

  • Get a read lock for IO: call lov_conf_freeze() in lov_io_init_raid0() and lov_conf_thaw() in lov_io_fini();
  • Get a write lock for layout change: call lov_conf_lock() and lov_conf_unlock() in lov_layout_change() to make the "layout delete -> layout reinstall" sequence atomic (sketched below).
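
In sketch form (my reading of the proposal above; everything except the locking calls is elided):

IO path:                                 layout change path:
  lov_io_init_raid0():                     lov_layout_change():
    lov_conf_freeze(lov);  /* read */        lov_conf_lock(lov);    /* write */
    ... run io over lov_r0(lov) ...          delete old layout;
  lov_io_fini():                             install new layout;
    lov_conf_thaw(lov);                      lov_conf_unlock(lov);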

Did I miss anything?

Comment by Jinshan Xiong (Inactive) [ 24/Apr/13 ]

Yes, this is exactly what we're doing. However, there are some cases where we need to start an IO at the OSC layer while conf_lock may already be held, which could cause a deadlock. ci_ignore_layout was introduced to indicate whether the layout lock must be held to start the IO. Usually, if the IO operates on pages we're fairly safe, because a layout change has to clean the pages up first.

For the cl_sync_file_range() case, I set ci_ignore_layout to 1 because I thought that if dirty pages existed, the layout would not be changed. Obviously I was wrong, because there is a race.

This specific problem can be fixed by changing ci_ignore_layout to 0. Please also check the other callers of cl_sync_file_range() to make sure this holds for them as well.
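
A minimal diff-style sketch of that direction, assuming cl_sync_file_range() currently sets the flag to 1 (the actual change is the patch linked in the next comment):

 int cl_sync_file_range(struct inode *inode, loff_t start, loff_t end,
                        enum cl_fsync_mode mode)
 {
         ...
-        io->ci_ignore_layout = 1;
+        io->ci_ignore_layout = 0;  /* fsync must take/verify the layout lock
+                                    * rather than assuming dirty pages pin
+                                    * the layout */
         ...
 }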

Comment by Niu Yawei (Inactive) [ 25/Apr/13 ]

http://review.whamcloud.com/6154

Comment by Niu Yawei (Inactive) [ 28/Apr/13 ]

It looks like the patch revealed another problem: conf-sanity test_0 now always hits an LBUG (see LU-3230). I'm looking into it.

Comment by Niu Yawei (Inactive) [ 28/Apr/13 ]

Hi, Xiong

Why don't we flush dirty pages before losing the layout lock (see ll_md_blocking_ast())?

This patch causes an LBUG when running conf-sanity test_0:

  • The client writes data to a file, then unlinks it;
  • The layout lock is revoked, but the dirty pages are not flushed;
  • On client umount, the kernel tries to flush the dirty pages by calling ll_writepages(), which now verifies the layout (what this patch added), but the layout fetch will definitely fail because the file has been removed on the MDS;
  • The dirty pages are never flushed back, and the LBUG is triggered.

Do you know why we didn't flush dirty pages on layout lock revocation in the beginning? Thanks.

Comment by Niu Yawei (Inactive) [ 28/Apr/13 ]

It looks like we chose to flush dirty pages when changing the layout but not on layout lock revocation. I'm not sure that was the best choice, but the problem can be worked around another way (ignore the -ENOENT error on layout refresh for IO).
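
Sketched (hypothetical placement in the IO-init path; only the error handling is the point):

        /* on layout refresh for IO, tolerate a vanished file: it may have
         * been unlinked on the MDS while the client still has dirty pages
         * that must be cleaned up */
        rc = ll_layout_refresh(inode, &gen);
        if (rc == -ENOENT)
                rc = 0;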

Comment by Jinshan Xiong (Inactive) [ 28/Apr/13 ]

Do you know why we didn't flush dirty pages on layout lock revocation in the beginning? Thanks.

We don't want to introduce a cascading problem: if one of the OSTs is inaccessible, the client will be evicted by the MDT because it can't cancel the layout lock in time.

If what you said were true, we should have seen this problem in migration, because we use an orphan object to restripe. I will take a look at this.

Comment by Jinshan Xiong (Inactive) [ 28/Apr/13 ]

This appears to be a problem with force umount causing the layout refresh to fail. I have updated the patch; please take a look.

Comment by Peter Jones [ 02/May/13 ]

Landed for 2.4
