[LU-5526] recovery-mds-scale test failover_mds: dd: No space left on device Created: 20/Aug/14  Updated: 05/Aug/20  Resolved: 05/Aug/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Jian Yu Assignee: Hongchao Zhang
Resolution: Cannot Reproduce Votes: 0
Labels: 22pl, mq414, p4hc, zfs
Environment:

Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/80/
Distro/Arch: RHEL6.5/x86_64
Test Group: failover
FSTYPE=zfs


Issue Links:
Related
is related to LU-3326 recovery-mds-scale test_failover_ost:... Resolved
is related to LU-4846 Failover test failure on test suite r... Open
is related to LU-6493 sanity test 42b failure on sync Open
is related to LU-6200 Failover recovery-mds-scale test_fail... Resolved
is related to LU-7309 replay-single test_70b: no space left... Resolved
is related to LU-7387 Failover: recovery-random-scale test_... Resolved
Severity: 3
Rank (Obsolete): 15384

 Description   

While running the recovery-mds-scale failover_mds test (the MDS failed over 1 time), the client load on one of the clients failed as follows:

2014-08-16 20:44:59: dd run starting
+ mkdir -p /mnt/lustre/d0.dd-shadow-41vm5.shadow.whamcloud.com
+ /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-shadow-41vm5.shadow.whamcloud.com
+ cd /mnt/lustre/d0.dd-shadow-41vm5.shadow.whamcloud.com
+ sync
++ /usr/bin/lfs df /mnt/lustre/d0.dd-shadow-41vm5.shadow.whamcloud.com
++ awk '/filesystem summary:/ {print $5}'
+ FREE_SPACE=14195328
+ BLKS=1596974
+ echoerr 'Total free disk space is 14195328, 4k blocks to dd is 1596974'
+ echo 'Total free disk space is 14195328, 4k blocks to dd is 1596974'
Total free disk space is 14195328, 4k blocks to dd is 1596974
+ load_pid=3715
+ wait 3715
+ dd bs=4k count=1596974 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-shadow-41vm5.shadow.whamcloud.com/dd-file
dd: writing `/mnt/lustre/d0.dd-shadow-41vm5.shadow.whamcloud.com/dd-file': No space left on device
1213957+0 records in
1213956+0 records out
+ '[' 1 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2014-08-16 20:52:06: dd failed'
+ echo '2014-08-16 20:52:06: dd failed'
2014-08-16 20:52:06: dd failed

Console log on the client:

LustreError: 3715:0:(vvp_io.c:1081:vvp_io_commit_write()) Write page 1213956 of inode ffff88007ccedb78 failed -28

Maloo report: https://testing.hpdd.intel.com/test_sets/acafc288-26a6-11e4-84f2-5254006e85c2

This failure was previously reported in LU-3326. However, Lustre b2_5 build #80 already contains patch http://review.whamcloud.com/11425, so additional fixes are needed to resolve the failure.
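For reference, error -28 in the console log above is -ENOSPC. Below is a minimal C sketch of the load's sizing arithmetic as inferred from the trace (the exact run_dd.sh formula is an assumption); it shows dd was asked to write only about 45% of the reported free space, so the ENOSPC does not come from genuine space exhaustion:

#include <stdio.h>

/* Sizing arithmetic reconstructed from the trace above; the formula is an
 * assumption inferred from the printed numbers (14195328 KB free ->
 * 1596974 4 KiB blocks). */
int main(void)
{
        long free_kb = 14195328;               /* "filesystem summary" free space in 1 KB blocks */
        long blks = free_kb / 4 * 9 / 10 / 2;  /* 4 KiB blocks handed to dd */

        printf("4k blocks to dd: %ld (~%.1f%% of free space)\n",
               blks, 100.0 * blks * 4 / free_kb);
        return 0;
}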



 Comments   
Comment by Hongchao Zhang [ 21/Aug/14 ]

The patch has been updated and is tracked at http://review.whamcloud.com/#/c/8671/

Comment by Jian Yu [ 21/Aug/14 ]

The failure occurred consistently on the Lustre b2_5 branch with FSTYPE=zfs: https://testing.hpdd.intel.com/test_sets/15e2479e-2904-11e4-9362-5254006e85c2
The FSTYPE=ldiskfs failover test session is blocked by LU-5483.

Hi Hongchao,
While creating the patch for the Lustre b2_5 branch, could you please add the following test parameters to verify the change in an FSTYPE=zfs failover test session?

Test-Parameters: alwaysuploadlogs \
envdefinitions=SLOW=yes,ENABLE_QUOTA=yes,REQFAIL=100,DURATION=28800 \
mdtfilesystemtype=zfs mdsfilesystemtype=zfs ostfilesystemtype=zfs \
clientdistro=el6 ossdistro=el6 mdsdistro=el6 clientarch=x86_64 \
ossarch=x86_64 mdsarch=x86_64 clientcount=4 osscount=2 mdscount=2 \
mdssizegb=10 ostsizegb=10 austeroptions=-R failover=true iscsi=1 \
testlist=recovery-mds-scale
Comment by Jian Yu [ 31/Aug/14 ]

Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/86/ (2.5.3 RC1)
FSTYPE=zfs

recovery-mds-scale test hit the same failure: https://testing.hpdd.intel.com/test_sets/2f3a6d92-30f8-11e4-a455-5254006e85c2

Comment by Sarah Liu [ 12/Sep/14 ]

Hit the same error in the lustre-master build #2639 ZFS failover test:

https://testing.hpdd.intel.com/test_sets/2e7e0c1a-37c4-11e4-a2a6-5254006e85c2

Comment by Hongchao Zhang [ 30/Sep/14 ]

The patch is tracked at http://review.whamcloud.com/#/c/12126/

Comment by Jian Yu [ 30/Oct/14 ]

More instances on Lustre b2_5 branch with FSTYPE=zfs:
https://testing.hpdd.intel.com/test_sets/66474710-5fb0-11e4-a865-5254006e85c2
https://testing.hpdd.intel.com/test_sets/10035164-7a72-11e4-be53-5254006e85c2

Comment by Sarah Liu [ 03/Feb/15 ]

This also affects lustre-master build #2835 replay-single:

https://testing.hpdd.intel.com/test_sets/42a5bd28-a81c-11e4-93dd-5254006e85c2

Comment by Hongchao Zhang [ 06/Feb/15 ]

This issue is caused by slow object creation on the OSTs after MDT recovery, which makes "osp_precreate_reserve" fail with -ETIMEDOUT; LOD returns -ENOSPC in that case.

I tried doubling the wait time in "osp_precreate_reserve", but it still failed, so I created another patch that waits for the creation until it completes or returns an error. The patch is still tracked at http://review.whamcloud.com/#/c/12126/
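A hypothetical C sketch of the two behaviors described above (the function names are illustrative, not the real osp/lod code): currently a timed-out wait for precreated OST objects is reported upward as -ENOSPC, whereas the proposed patch keeps waiting until the precreate either completes or fails with a hard error:

#include <errno.h>

/* Hypothetical illustration only; the real logic lives in
 * osp_precreate_reserve() and the LOD/llite layers. */

/* Current behavior: a reservation that times out waiting for precreated
 * OST objects is reported as "no space", which the dd load sees as ENOSPC. */
int reserve_with_timeout(int wait_rc)
{
        if (wait_rc == -ETIMEDOUT)
                return -ENOSPC;  /* transient slowness surfaces as a space error */
        return wait_rc;
}

/* Proposed behavior: keep waiting until the precreate completes or fails
 * with a real error instead of giving up on a timeout. */
int reserve_until_done(int (*wait_once)(void))
{
        int rc;

        do {
                rc = wait_once();  /* one bounded wait for precreated objects */
        } while (rc == -ETIMEDOUT);

        return rc;
}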

Comment by Hongchao Zhang [ 02/Mar/15 ]

The requests sent to the OSTs after MDS failover took too long to complete: https://testing.hpdd.intel.com/test_logs/a9f3b136-bfe8-11e4-b6c3-5254006e85c2

Lustre: 2810:0:(client.c:1949:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1425098817/real 1425098820]  req@ffff88006c4bc9c0 x1494324416086096/t0(0) o38->lustre-MDT0000-lwp-MDT0000@10.2.4.167@tcp:12/10 lens 400/544 e 0 to 1 dl 1425098822 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
Lustre: lustre-MDT0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted.
LustreError: 3165:0:(osp_precreate.c:1393:osp_precreate_reserve()) wait recovery to complete!
LustreError: 2811:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,37 [sent 1425098819/real 1425098819], current 1425098856  req@ffff88006c4bccc0 x1494324416086132/t4294967824(4294967824) o2->lustre-OST0001-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 2 to 0 dl 1425098877 ref 2 fl Interpret:R/0/0 rc 0/0
LustreError: 3165:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,50 [sent 1425098822/real 1425098822], current 1425098872  req@ffff88006fccb980 x1494324416086168/t4294967885(4294967885) o2->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 1 to 0 dl 1425098893 ref 2 fl Interpret:R/0/0 rc 0/0
LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) Skipped 2 previous similar messages
LustreError: 2811:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,57 [sent 1425098819/real 1425098819], current 1425098876  req@ffff88006ea53380 x1494324416086152/t4294967855(4294967855) o2->lustre-OST0001-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 3 to 0 dl 1425098902 ref 2 fl Interpret:R/0/0 rc 0/0

...


LustreError: 3165:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,77 [sent 1425098822/real 1425098822], current 1425098899  req@ffff88007d574680 x1494324416086196/t4294967962(4294967962) o2->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 2 to 0 dl 1425098918 ref 2 fl Interpret:R/0/0 rc 0/0
LNet: Service thread pid 3165 completed after 61.52s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) Skipped 5 previous similar messages
LustreError: 3165:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
Lustre: wanted 1, found 0
LustreError: 3165:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
Lustre: wanted 1, found 0
LustreError: 2839:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
Lustre: wanted 1, found 0
LustreError: 2839:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
Lustre: wanted 1, found 0
LustreError: 2841:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
Lustre: wanted 1, found 0
Lustre: wanted 1, found 0
LustreError: 2840:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
LustreError: 2840:0:(osp_precreate.c:1384:osp_precreate_reserve()) Skipped 1 previous similar message
Lustre: wanted 1, found 0
Lustre: wanted 1, found 0
Lustre: Skipped 1 previous similar message
LustreError: 2867:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,289 [sent 1425098846/real 1425098846], current 1425099135  req@ffff88006c7fb080 x1494324416086280/t0(0) o5->lustre-OST0001-osc-MDT0000@10.2.4.168@tcp:28/4 lens 432/400 e 9 to 0 dl 1425099148 ref 2 fl Interpret:RN/0/0 rc 0/0
LustreError: 3168:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
LustreError: 3168:0:(osp_precreate.c:1384:osp_precreate_reserve()) Skipped 3 previous similar messages
LustreError: 2865:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,345 [sent 1425098899/real 1425098899], current 1425099244  req@ffff88006fccb680 x1494324416086820/t0(0) o5->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 432/400 e 5 to 0 dl 1425099251 ref 2 fl Interpret:RN/0/0 rc 0/0
Lustre: wanted 1, found 0
Lustre: Skipped 1 previous similar message
LustreError: 2865:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,26 [sent 1425099257/real 1425099257], current 1425099283  req@ffff88006bba46c0 x1494324416088404/t0(0) o5->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 432/400 e 0 to 0 dl 1425099701 ref 2 fl Interpret:RN/0/0 rc 0/0
LustreError: 2865:0:(client.c:1866:ptlrpc_check_set()) Skipped 1 previous similar message
LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,26 [sent 1425099383/real 1425099383], current 1425099409  req@ffff88006bca3980 x1494324416091456/t4294972478(4294972478) o2->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 0 to 0 dl 1425099827 ref 2 fl Interpret:R/4/0 rc 0/0
LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) Skipped 1 previous similar message
Comment by Gerrit Updater [ 26/May/15 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/14934
Subject: LU-5526 llite: retry object creation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cde90a2316328562d4c95babe09ab3276f18fdc2

Comment by Gerrit Updater [ 03/Jun/15 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/15122
Subject: LU-5526 llite: retry object creation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ddd9319367f2373dc32c491e3ff2f1fd7ea2d617

Comment by Saurabh Tandan (Inactive) [ 11/Dec/15 ]

master, build #3264, tag 2.7.64
Hard Failover: EL7 Server/Client
https://testing.hpdd.intel.com/test_sets/56d3054e-9e20-11e5-91b0-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ]

Another instance found for hard failover: EL6.7 Server/Client - ZFS
https://testing.hpdd.intel.com/test_sets/df3903c8-bbd7-11e5-8506-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 09/Feb/16 ]

Another instance found for hard failover: EL7 Server/Client - ZFS, tag 2.7.66, master build 3314
https://testing.hpdd.intel.com/test_sessions/f0dd9616-ca6e-11e5-9609-5254006e85c2

Generated at Sat Feb 10 01:52:14 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.