
[LU-1582] replay-ost-single.sh test_8b: client stuck with -115 (-EINPROGRESS) Created: 28/Jun/12  Updated: 27/Nov/12  Resolved: 13/Sep/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: Lustre 2.3.0

Type: Technical task Priority: Minor
Reporter: Andreas Dilger Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 1
Labels: None
Environment:

Local x86-64 dual-core node testing on build: v2_2_57-g84dfd30-CHANGED-2.6.32-220.el6_lustre.ge487264.x86_64


Issue Links:
Duplicate
is duplicated by LU-2389 move sync out of facet_failover Resolved
Related
is related to LU-924 Test failure on test suite recovery-s... Resolved
Rank (Obsolete): 4533

 Description   

An acceptance-small.sh run got stuck for more than a day in replay-ost-single.sh test_8b() "Verify redo io: redo io should success after recovery", which was added for LU-904 "ptlrpc: redo io on -EINPROGRESS" (http://review.whamcloud.com/1962).

The console keeps repeating the cfs_fail_loc=230 message, and the RPC is resent. It looks like the fail_loc is not being cleared in this test, perhaps because the "dd" finished too quickly?

Test output:

== replay-ost-single test 8b: Verify redo io: redo io should success after recovery ================== 18:29:58 (1340670598)
1280+0 records in
1280+0 records out
5242880 bytes (5.2 MB) copied, 1.78424 s, 2.9 MB/s
fail_loc=0x230
Failing ost1 on node sookie-gig.adilger.int

Console logs:

Jun 25 18:29:59 sookie-gig kernel: Lustre: DEBUG MARKER: == replay-ost-single test 8b: Verify redo io: redo io should success after recovery ================== 18:29:58 (1340670598)
Jun 25 18:30:00 sookie-gig kernel: Lustre: *** cfs_fail_loc=230, val=0***
Jun 25 18:30:00 sookie-gig kernel: LustreError: 6126:0:(osc_request.c:1662:osc_brw_redo_request()) @@@ redo for recoverable error  req@ffff8800afb6a800 x1405794707837937/t0(0) o4->testfs-OST0000-osc-ffff8800ae6e4c00@0@lo:6/4 lens 488/416 e 0 to 0 dl 1340670627 ref 2 fl Interpret:R/0/0 rc -115/-115
Jun 25 18:30:21 sookie-gig kernel: Lustre: *** cfs_fail_loc=230, val=0***
Jun 25 18:30:21 sookie-gig kernel: Lustre: Skipped 5 previous similar messages
Jun 25 18:30:21 sookie-gig kernel: LustreError: 6126:0:(osc_request.c:1662:osc_brw_redo_request()) @@@ redo for recoverable error  req@ffff8800c5397800 x1405794707837961/t0(0) o4->testfs-OST0000-osc-ffff8800ae6e4c00@0@lo:6/4 lens 488/416 e 0 to 0 dl 1340670648 ref 2 fl Interpret:R/0/0 rc -115/-115
Jun 25 18:30:21 sookie-gig kernel: LustreError: 6126:0:(osc_request.c:1662:osc_brw_redo_request()) Skipped 5 previous similar messages
Jun 25 18:30:55 sookie-gig kernel: Lustre: *** cfs_fail_loc=230, val=0***
Jun 25 18:30:55 sookie-gig kernel: Lustre: Skipped 3 previous similar messages
Jun 25 18:30:55 sookie-gig kernel: LustreError: 6126:0:(osc_request.c:1662:osc_brw_redo_request()) @@@ redo for recoverable error  req@ffff8800c5398000 x1405794707838010/t0(0) o4->testfs-OST0000-osc-ffff8800ae6e4c00@0@lo:6/4 lens 488/416 e 0 to 0 dl 1340670682 ref 2 fl Interpret:R/0/0 rc -115/-115
Jun 25 18:30:55 sookie-gig kernel: LustreError: 6126:0:(osc_request.c:1662:osc_brw_redo_request()) Skipped 3 previous similar messages
Jun 25 18:31:45 sookie-gig kernel: LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_write operation failed with -115
Jun 25 18:31:45 sookie-gig kernel: LustreError: Skipped 15 previous similar messages
Jun 25 18:32:00 sookie-gig kernel: Lustre: *** cfs_fail_loc=230, val=0***
Jun 25 18:32:00 sookie-gig kernel: Lustre: Skipped 4 previous similar messages
Jun 25 18:32:00 sookie-gig kernel: LustreError: 6126:0:(osc_request.c:1662:osc_brw_redo_request()) @@@ redo for recoverable error  req@ffff8800c5397400 x1405794707838108/t0(0) o4->testfs-OST0000-osc-ffff8800ae6e4c00@0@lo:6/4 lens 488/416 e 0 to 0 dl 1340670747 ref 2 fl Interpret:R/0/0 rc -115/-115
Jun 25 18:32:00 sookie-gig kernel: LustreError: 6126:0:(osc_request.c:1662:osc_brw_redo_request()) Skipped 4 previous similar messages
Jun 25 18:33:14 sookie-gig kernel: INFO: task sync:9577 blocked for more than 120 seconds.
Jun 25 18:33:14 sookie-gig kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 25 18:33:14 sookie-gig kernel: sync          D 0000000000000000     0  9577   9576 0x00000080
Jun 25 18:33:14 sookie-gig kernel: ffff8800ab3ebc98 0000000000000082 0000000000000000 ffff880037759500
Jun 25 18:33:14 sookie-gig kernel: ffff8800c4ff40c0 0000000000000000 0000000000000001 00000000ffffffff
Jun 25 18:33:14 sookie-gig kernel: ffff880037759ab8 ffff8800ab3ebfd8 000000000000f4e8 ffff880037759ab8
Jun 25 18:33:14 sookie-gig kernel: Call Trace:
Jun 25 18:33:14 sookie-gig kernel: [<ffffffff8109b809>] ? ktime_get_ts+0xa9/0xe0
Jun 25 18:33:14 sookie-gig kernel: [<ffffffff81110b10>] ? sync_page+0x0/0x50
Jun 25 18:33:14 sookie-gig kernel: [<ffffffff814ed583>] io_schedule+0x73/0xc0
Jun 25 18:33:14 sookie-gig kernel: [<ffffffff81110b4d>] sync_page+0x3d/0x50
Jun 25 18:33:14 sookie-gig kernel: [<ffffffff814edf3f>] __wait_on_bit+0x5f/0x90
Jun 25 18:33:14 sookie-gig kernel: [<ffffffff81110d03>] wait_on_page_bit+0x73/0x80
Jun 25 18:33:14 sookie-gig kernel: [<ffffffff81090c30>] ? wake_bit_function+0x0/0x50
Jun 25 18:33:14 sookie-gig kernel: [<ffffffff811271a5>] ? pagevec_lookup_tag+0x25/0x40
Jun 25 18:33:14 sookie-gig kernel: [<ffffffff8111111b>] wait_on_page_writeback_range+0xfb/0x190
Jun 25 18:33:14 sookie-gig kernel: [<ffffffff814ed7b0>] ? _cond_resched+0x30/0x40
Jun 25 18:33:14 sookie-gig kernel: [<ffffffff814ed7fc>] ? wait_for_common+0x3c/0x180
Jun 25 18:33:14 sookie-gig kernel: [<ffffffff811111df>] filemap_fdatawait+0x2f/0x40
Jun 25 18:33:14 sookie-gig kernel: [<ffffffff8119fbb4>] sync_inodes_sb+0x114/0x190
Jun 25 18:33:14 sookie-gig kernel: [<ffffffff811a5682>] __sync_filesystem+0x82/0x90
Jun 25 18:33:14 sookie-gig kernel: [<ffffffff811a5788>] sync_filesystems+0xf8/0x130
Jun 25 18:33:14 sookie-gig kernel: [<ffffffff811a5821>] sys_sync+0x21/0x40
Jun 25 18:33:14 sookie-gig kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Jun 25 18:34:13 sookie-gig kernel: Lustre: *** cfs_fail_loc=230, val=0***
Jun 25 18:34:13 sookie-gig kernel: Lustre: Skipped 6 previous similar messages
Jun 25 18:34:13 sookie-gig kernel: LustreError: 6126:0:(osc_request.c:1662:osc_brw_redo_request()) @@@ redo for recoverable error  req@ffff8800c5398800 x1405794707838312/t0(0) o4->testfs-OST0000-osc-ffff8800ae6e4c00@0@lo:6/4 lens 488/416 e 0 to 0 dl 1340670880 ref 2 fl Interpret:R/0/0 rc -115/-115
"cfs_fail_loc=230" and "redo for recoverable error" repeat forever

While it was useful for debugging this problem, does it make sense to print every osc_brw_redo_request() message to the console for EINPROGRESS? Since this is a somewhat "normal" message that can happen and does not need any attention, it should not be printed to the console via D_ERROR (unlike data checksum errors). Possibly it could be printed only if it has been going on for a long time (e.g. when ((++aa->aa_resends & 31) == 0) or something similar).



 Comments   
Comment by Niu Yawei (Inactive) [ 28/Jun/12 ]

I don't see why 'dd' finishing too quickly could cause the fail_loc not to be cleared; maybe the 'do_facet ost1 lctl set_param fail_loc=0' failed because of some network problem?
Can this be reproduced?

I agree with you that we'd better limit the redo io error message; after -EINPROGRESS was introduced, there could be many more such messages.

Comment by Jinshan Xiong (Inactive) [ 28/Jun/12 ]

In case you're scared by the backtrace of sync_page(): the sync process waited too long for the writeback bit because there were so many retries of the RPC. I will take a look at why the OST kept returning EINPROGRESS.

Comment by Andreas Dilger [ 29/Jun/12 ]

This was a single-node test, so "do_facet" was really "eval". After I manually ran "lctl set_param fail_loc=0", the test continued running:

Stopping /mnt/ost1 (opts:) on sookie-gig.adilger.int
affected facets: ost1
Failover ost1 to sookie-gig.adilger.int
17:12:36 (1340925156) waiting for sookie-gig.adilger.int network 900 secs ...
17:12:36 (1340925156) network interface is UP
Starting ost1: -o noauto  /dev/vg_sookie/lvost1 /mnt/ost1
Started testfs-OST0000
fail_loc=0
1280+0 records in
1280+0 records out
5242880 bytes (5.2 MB) copied, 254571 s, 0.0 kB/s
Resetting fail_loc on all nodes...done.
PASS 8b (254573s)

It seems my initial comment that "dd" completed too soon was incorrect. It was only the first "dd of=$verify" that completed. As can be seen above, the second "dd of=$TDIR/$tfile" was stuck for the entire 3 days this test was left running.

Comment by Andreas Dilger [ 29/Jun/12 ]

It seems that this problem is actually in the test-framework.sh::facet_failover() code:

    # Make sure the client data is synced to disk. LU-924
    #
    # We don't write client data synchronously (to avoid flooding sync writes
    # when there are many clients connecting), so if the server reboots before
    # the client data reaches disk, the client data will be lost and the client
    # will be evicted after recovery, which is not what we expected.
    do_facet $facet "sync; sync; sync"

This will be stuck on single-node tests, because it will also cause the client mountpoint to be sync'd, and that will block forever with fail_loc=0x230.
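
A minimal single-node sketch of the hang (the file path and sizes are illustrative; fail_loc 0x230 is the EINPROGRESS injection used by test_8b): with the injection active, the OST keeps answering writes with -EINPROGRESS, the client keeps resending, and any global sync of dirty client data blocks until the injection is cleared.

    lctl set_param fail_loc=0x230
    # dirty some client pages; with oflag=sync this dd itself blocks on the resends
    dd if=/dev/zero of=/mnt/lustre/f8b-repro bs=4096 count=1280 oflag=sync &
    # hangs here, just like facet_failover()'s "sync; sync; sync"
    sync
    # from a second shell, clearing the injection releases both processes:
    #     lctl set_param fail_loc=0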

Comment by Andreas Dilger [ 29/Jun/12 ]

This problem appears more related to the test-framework change that was made for LU-924 in http://review.whamcloud.com/1888.

While using "sync" can force the test to pass by flushing all client data to disk, it does not represent what happens in real life when OSS nodes crash, or OSTs get remounted read-only due to detected corruption.

The core problem of the test that was originally failing is that the client was initially connected to an OST to which it had not done any IO, and was then evicted by that OST during the test. This is an unlikely scenario in real life (i.e. an OST rebooting immediately after mounting for the first time, while a client has already submitted writes to it). Instead of "sync" (which flushes everything), it would be better to submit a single modifying RPC to each affected OST (to trigger the exp_need_sync behaviour) in just recovery-small.sh test_105(), rather than doing the sync for every target before it is "failed".
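
A rough shell illustration of that idea only, not the actual patch in http://review.whamcloud.com/3239; it assumes test-framework.sh is sourced (with the usual $LFS, $DIR and $tdir variables), and the helper name sync_ost_export() is hypothetical.

    sync_ost_export() {
        local ostidx=$1

        mkdir -p $DIR/$tdir
        # a single one-stripe file on the target OST, written synchronously,
        # sends one modifying RPC to that OST (the exp_need_sync case above)
        $LFS setstripe -c 1 -i $ostidx $DIR/$tdir/sync_ost$ostidx || return $?
        dd if=/dev/zero of=$DIR/$tdir/sync_ost$ostidx bs=4k count=1 oflag=sync
    }

    # e.g. in recovery-small.sh test_105(), before failing the OST:
    #     sync_ost_export 0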

I've uploaded change http://review.whamcloud.com/3239 to hopefully fix this problem in a proper manner. I was able to verify that replay-ost-single.sh test_8b() works correctly with this fix, but recovery-small.sh test_105() needs multiple client nodes.

Comment by Niu Yawei (Inactive) [ 29/Jun/12 ]

I don't see why there was dirty data in the cache while running replay-ost-single test_8b. Even if we don't do 'sync' in facet_failover(), won't it be blocked when unmounting the filesystem (the OST is stopped by umount in facet_failover())?

Comment by Johann Lombardi (Inactive) [ 04/Jul/12 ]

The dd has produced dirty data that hasn't been flushed yet (because of the resends), so the syncs wait for the BRW to complete, which won't happen because of the fail_loc.

Here is what I see in the process list when the issue happens:

root     11668 11639  0 01:08 pts/0    00:00:00 dd if=/tmp/verify-8528 of=/mnt/lustre/d0.replay-ost-single/f8b bs=4096 count=1280 oflag=sync
root     11680 11679  0 01:08 pts/0    00:00:00 sh -c sync; sync; sync
Comment by Niu Yawei (Inactive) [ 04/Jul/12 ]

Right, the dirty data is produced by dd, and since it's client-side dirty data, unmounting the OST won't be blocked. Thanks.

Comment by Jodi Levi (Inactive) [ 05/Jul/12 ]

If you aren't the right person to assign this to, please reassign.

Comment by Andreas Dilger [ 06/Jul/12 ]

Niu, to clarify: this problem is only hit with a single-node test, where the client "dd" is running on the same node as the OST. The "sync; sync; sync" is blocked forever on the client mountpoint due to EINPROGRESS, so the script never gets as far as unmounting the OST and then clearing fail_loc.

More important than fixing this particular test for the single-node case is the fact that calling "sync" on the OST before "failover" is not what happens in the real world when an OST fails. Instead of doing "sync" for every failover, the case where the client is newly mounted at the time of failover should be fixed. According to the original bug, there are no other tests that have this requirement. If there are a small number of such tests, they could be fixed individually, but since my patch passed testing there are probably not many tests affected.

If there was a need to fix this case in a general manner, it would be better to do a "sync" immediately after mount (which is like real life where clients are mounted for more than a few seconds before the OST fails), rather than after the clients have done operations that need to be recovered.
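
A sketch of that general alternative, for illustration only (not a landed change); it assumes test-framework.sh is sourced and that $CLIENTS holds the usual client node list.

    # mount servers and clients as usual, then flush once right away, so a
    # later facet_failover() never needs a global sync while a fail_loc
    # injection is active on the combined client/OST node
    setupall
    do_nodes $CLIENTS "sync"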

Comment by Niu Yawei (Inactive) [ 13/Sep/12 ]

The patch proposed by Andreas has landed.
