[LU-8101] sanity test_118c fails with 'No page in writeback, writeback=0' Created: 04/May/16  Updated: 19/Mar/23  Resolved: 19/Mar/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0, Lustre 2.13.0, Lustre 2.10.7, Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

autotest review-zfs-part-1


Issue Links:
Duplicate
is duplicated by LU-16515 sanity test_118c test_118d: No page i... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test 118c fails in review-zfs-part-1 with

'No page in writeback, writeback=0' 

In the OST console logs, when sanity test 118c passes there is only one call to cfs_fail_timeout_set(). When test 118c fails, the OST console logs contain multiple calls to cfs_fail_timeout_set():

10:48:43:Lustre: DEBUG MARKER: == sanity test 118c: Fsync blocks on EROFS until dirty pages are flushed ============ 10:48:38 (1462297718)
10:48:43:Lustre: DEBUG MARKER: lctl set_param fail_val=0 fail_loc=0x216
10:48:43:LustreError: 4699:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 216 sleeping for 1000ms
10:48:43:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_118c: @@@@@@ FAIL: No page in writeback, writeback=0 
10:48:43:Lustre: DEBUG MARKER: sanity test_118c: @@@@@@ FAIL: No page in writeback, writeback=0
10:48:43:Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2016-05-03/lustre-reviews-el6_7-x86_64--review-zfs-part-1--1_8_1__38579__-70130514263840-074815/sanity.test_118c.debug_log.$(hostname -s).1462297721.log;
10:48:43:         dmesg > /logdir/test_logs/2016-05-03/lustre-reviews-el
10:48:43:LustreError: 4699:0:(fail.c:137:__cfs_fail_timeout_set()) cfs_fail_timeout id 216 awake
10:50:24:LustreError: 4699:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 216 sleeping for 1000ms
10:50:24:LustreError: 4699:0:(fail.c:137:__cfs_fail_timeout_set()) cfs_fail_timeout id 216 awake
10:50:24:LustreError: 4699:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 216 sleeping for 1000ms
10:50:24:LustreError: 4699:0:(fail.c:137:__cfs_fail_timeout_set()) cfs_fail_timeout id 216 awake
10:50:24:Lustre: *** cfs_fail_loc=216, val=0***
10:50:24:Lustre: Skipped 2 previous similar messages
10:50:24:Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null || true
10:50:24:Lustre: DEBUG MARKER: for i in $(find /proc/spl -type f); do
10:50:24:				echo ====${i}=======================;
10:50:24:				cat ${i};
10:50:24:				done
10:50:24:Lustre: DEBUG MARKER: echo ====zpool events -v=======================;
10:50:24:				zpool events -v
10:50:24:Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 118d: Fsync validation inject a delay of the bulk ============ 10:48:54 \(1462297734\)
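The pass/fail distinction described above (one cfs_fail_timeout_set() event versus several) suggests a quick triage check against a collected OST console log. The helper below is a hypothetical sketch, not part of the Lustre test suite; it simply counts how many times fail_loc 0x216 fired, using the message format from the excerpt above, and demonstrates it against an inline sample.

```shell
#!/bin/sh
# Hypothetical triage helper (not in sanity.sh): count how many times
# fail_loc 0x216 started a timeout in an OST console log. Per this
# ticket, a passing test_118c run logs the timeout once; failing runs
# log it several times.
count_fail_timeouts() {
    grep -c 'cfs_fail_timeout id 216 sleeping' "$1"
}

# Demo against an inline sample resembling the failing log above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
LustreError: 4699:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 216 sleeping for 1000ms
LustreError: 4699:0:(fail.c:137:__cfs_fail_timeout_set()) cfs_fail_timeout id 216 awake
LustreError: 4699:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 216 sleeping for 1000ms
EOF
n=$(count_fail_timeouts "$sample")
echo "fail_timeout events: $n"   # prints 2 for this sample; >1 matches the failure pattern
rm -f "$sample"
```

Run against a real console log (e.g. `count_fail_timeouts sanity.test_118c.console.log`), a count greater than one would match the failing pattern shown in the excerpt.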

Similarly, on the client, there are multiple calls to osc_brw_redo_request() in the console logs for the failed test:

10:49:59:Lustre: DEBUG MARKER: == sanity test 118c: Fsync blocks on EROFS until dirty pages are flushed ============ 10:48:38 (1462297718)
10:49:59:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_118c: @@@@@@ FAIL: No page in writeback, writeback=0 
10:49:59:Lustre: DEBUG MARKER: sanity test_118c: @@@@@@ FAIL: No page in writeback, writeback=0
10:49:59:Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2016-05-03/lustre-reviews-el6_7-x86_64--review-zfs-part-1--1_8_1__38579__-70130514263840-074815/sanity.test_118c.debug_log.$(hostname -s).1462297721.log;
10:49:59:         dmesg > /logdir/test_logs/2016-05-03/lustre-reviews-el
10:49:59:LustreError: 11-0: lustre-OST0001-osc-ffff8800623bc800: operation ost_write to node 10.2.5.40@tcp failed: rc = -30
10:49:59:LustreError: Skipped 1 previous similar message
10:49:59:LustreError: 4786:0:(osc_request.c:1527:osc_brw_redo_request()) @@@ redo for recoverable error -30  req@ffff880077ddac80 x1533326821172416/t0(0) o4->lustre-OST0001-osc-ffff8800623bc800@10.2.5.40@tcp:6/4 lens 608/192 e 0 to 0 dl 1462297765 ref 2 fl Interpret:RM/0/0 rc -30/-30
10:49:59:LustreError: 4786:0:(osc_request.c:1527:osc_brw_redo_request()) Skipped 1 previous similar message
10:49:59:LustreError: 4786:0:(osc_request.c:1527:osc_brw_redo_request()) @@@ redo for recoverable error -30  req@ffff88007d2f1cc0 x1533326821172448/t0(0) o4->lustre-OST0001-osc-ffff8800623bc800@10.2.5.40@tcp:6/4 lens 608/192 e 0 to 0 dl 1462297767 ref 2 fl Interpret:RM/0/0 rc -30/-30
10:49:59:LustreError: 4786:0:(osc_request.c:1527:osc_brw_redo_request()) @@@ redo for recoverable error -30  req@ffff880077ddac80 x1533326821172464/t0(0) o4->lustre-OST0001-osc-ffff8800623bc800@10.2.5.40@tcp:6/4 lens 608/192 e 0 to 0 dl 1462297770 ref 2 fl Interpret:RM/0/0 rc -30/-30
10:49:59:Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null || true
10:49:59:Lustre: DEBUG MARKER: /usr/sbin/lctl mark == sanity test 118d: Fsync validation inject a delay of the bulk ============ 10:48:54 \(1462297734\)

Logs for this failure are at:
https://testing.hpdd.intel.com/test_sets/f12f9b3e-1184-11e6-b5f1-5254006e85c2
https://testing.hpdd.intel.com/test_sets/721631e0-f399-11e5-b32f-5254006e85c2
https://testing.hpdd.intel.com/test_sets/f0c2aa46-c7f2-11e5-80b3-5254006e85c2



 Comments   
Comment by Minh Diep [ 26/Oct/17 ]

seen on b2_10 https://testing.hpdd.intel.com/test_sessions/f1b86689-3404-4a90-af08-5efd68c6bceb

Comment by James Nunez (Inactive) [ 14/Mar/19 ]

We see a very similar issue for sanity test 118d for 2.10.7 RC1 testing, but no errors in the client logs. Logs are at https://testing.whamcloud.com/test_sets/389bf13a-4363-11e9-9646-52540065bddc .

Note: this is an ldiskfs test session, not ZFS.

Comment by Chris Horn [ 22/Oct/19 ]

+1 on master https://testing.whamcloud.com/test_sets/9e1b9298-f4db-11e9-a0ba-52540065bddc

Comment by James Nunez (Inactive) [ 10/Dec/21 ]

A couple more recent failures for DNE/ZFS:
https://testing.whamcloud.com/test_sets/0651d843-48ed-4de9-8795-61ff90b775d5
https://testing.whamcloud.com/test_sets/b7e33a0e-80c1-4a3d-9f15-80efd98c7be3

Comment by Jian Yu [ 27/Jan/23 ]

+1 on master: https://testing.whamcloud.com/test_sets/169b0755-6281-4add-bbb5-38717d9807de

Comment by Qian Yingjin [ 02/Feb/23 ]

+1 on master: https://testing.whamcloud.com/test_sets/004d8e6c-7e7f-4692-a4ee-be276f3879d0

Comment by Chris Horn [ 14/Feb/23 ]

+1 on master https://testing.whamcloud.com/test_sets/4ecfbfd1-4b0c-4659-91d1-2c5526930839

Comment by Serguei Smirnov [ 22/Feb/23 ]

+1 on master: https://testing.whamcloud.com/test_sets/b890d117-9879-43bb-aa3e-2ba79c5f2066

Comment by Arshad Hussain [ 25/Feb/23 ]

+1 on Master (https://testing.whamcloud.com/test_sets/74277446-d066-41aa-9bc8-b9f7a20f8172)

Comment by Andreas Dilger [ 19/Mar/23 ]

Track this under LU-16515, since the original failure in this 7-year-old ticket was likely fixed and the new failures have a different cause.

Generated at Sat Feb 10 02:14:39 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.