[LU-5526] recovery-mds-scale test failover_mds: dd: No space left on device Created: 20/Aug/14 Updated: 05/Aug/20 Resolved: 05/Aug/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0, Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jian Yu | Assignee: | Hongchao Zhang |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | 22pl, mq414, p4hc, zfs | ||
| Environment: | Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/80/ |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 15384 |
| Description |
|
While running recovery-mds-scale test failover_mds (MDS failed over 1 time), the client load on one of the clients failed as follows:
2014-08-16 20:44:59: dd run starting
+ mkdir -p /mnt/lustre/d0.dd-shadow-41vm5.shadow.whamcloud.com
+ /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-shadow-41vm5.shadow.whamcloud.com
+ cd /mnt/lustre/d0.dd-shadow-41vm5.shadow.whamcloud.com
+ sync
++ /usr/bin/lfs df /mnt/lustre/d0.dd-shadow-41vm5.shadow.whamcloud.com
++ awk '/filesystem summary:/ {print $5}'
+ FREE_SPACE=14195328
+ BLKS=1596974
+ echoerr 'Total free disk space is 14195328, 4k blocks to dd is 1596974'
+ echo 'Total free disk space is 14195328, 4k blocks to dd is 1596974'
Total free disk space is 14195328, 4k blocks to dd is 1596974
+ load_pid=3715
+ wait 3715
+ dd bs=4k count=1596974 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-shadow-41vm5.shadow.whamcloud.com/dd-file
dd: writing `/mnt/lustre/d0.dd-shadow-41vm5.shadow.whamcloud.com/dd-file': No space left on device
1213957+0 records in
1213956+0 records out
+ '[' 1 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2014-08-16 20:52:06: dd failed'
+ echo '2014-08-16 20:52:06: dd failed'
2014-08-16 20:52:06: dd failed
Console log on the client:
LustreError: 3715:0:(vvp_io.c:1081:vvp_io_commit_write()) Write page 1213956 of inode ffff88007ccedb78 failed -28
Maloo report: https://testing.hpdd.intel.com/test_sets/acafc288-26a6-11e4-84f2-5254006e85c2
The failure was reported in |
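For reference, the sizing in the trace is consistent with the load script writing about 45% of the free space reported by `lfs df`, in 4 KiB blocks (14195328 / 4 * 45 / 100 = 1596974). A minimal sketch of that arithmetic follows; the exact formula used by the run_dd.sh load script is an assumption, reconstructed from the numbers in the log:

```shell
# Hypothetical reconstruction of the dd sizing seen in the trace above.
# In the real load script FREE_SPACE comes from:
#   lfs df $TESTDIR | awk '/filesystem summary:/ {print $5}'
FREE_SPACE=14195328                  # free space in KB, taken from the log
BLKS=$((FREE_SPACE / 4 * 45 / 100))  # 4 KiB blocks, ~45% of free space
echo "Total free disk space is $FREE_SPACE, 4k blocks to dd is $BLKS"
```

Even with that margin, dd hit ENOSPC after only 1213956 of 1596974 blocks, which suggests the `lfs df` free-space figure exceeded what the ZFS-backed OSTs could actually allocate.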
| Comments |
| Comment by Hongchao Zhang [ 21/Aug/14 ] |
|
The patch has been updated and is tracked at http://review.whamcloud.com/#/c/8671/ |
| Comment by Jian Yu [ 21/Aug/14 ] |
|
The failure occurred consistently on the Lustre b2_5 branch with FSTYPE=zfs: https://testing.hpdd.intel.com/test_sets/15e2479e-2904-11e4-9362-5254006e85c2

Hi Hongchao,

Test-Parameters: alwaysuploadlogs \
envdefinitions=SLOW=yes,ENABLE_QUOTA=yes,REQFAIL=100,DURATION=28800 \
mdtfilesystemtype=zfs mdsfilesystemtype=zfs ostfilesystemtype=zfs \
clientdistro=el6 ossdistro=el6 mdsdistro=el6 clientarch=x86_64 \
ossarch=x86_64 mdsarch=x86_64 clientcount=4 osscount=2 mdscount=2 \
mdssizegb=10 ostsizegb=10 austeroptions=-R failover=true iscsi=1 \
testlist=recovery-mds-scale |
| Comment by Jian Yu [ 31/Aug/14 ] |
|
Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/86/ (2.5.3 RC1) recovery-mds-scale test hit the same failure: https://testing.hpdd.intel.com/test_sets/2f3a6d92-30f8-11e4-a455-5254006e85c2 |
| Comment by Sarah Liu [ 12/Sep/14 ] |
|
Hit the same error in lustre-master build #2639, ZFS failover test: https://testing.hpdd.intel.com/test_sets/2e7e0c1a-37c4-11e4-a2a6-5254006e85c2 |
| Comment by Hongchao Zhang [ 30/Sep/14 ] |
|
The patch is tracked at http://review.whamcloud.com/#/c/12126/ |
| Comment by Jian Yu [ 30/Oct/14 ] |
|
More instances on the Lustre b2_5 branch with FSTYPE=zfs: |
| Comment by Sarah Liu [ 03/Feb/15 ] |
|
This also affects lustre-master build #2835, replay-single: https://testing.hpdd.intel.com/test_sets/42a5bd28-a81c-11e4-93dd-5254006e85c2 |
| Comment by Hongchao Zhang [ 06/Feb/15 ] |
|
This issue is caused by slow object creation on the OST after MDT recovery, which makes "osp_precreate_reserve" fail with -ETIMEDOUT. I tried doubling the wait time in "osp_precreate_reserve", but it still failed, so I have created another patch that waits for the creation until completion or error. |
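The change described here, giving up after a fixed deadline versus waiting until the precreate completes or returns a hard error, can be illustrated with a small sketch. This is not the actual osp_precreate.c code (which is kernel C); the function names, poll count, and return code below are illustrative only:

```shell
# Illustrative model of the two wait strategies (not Lustre code).
# precreate_done simulates "the OST has produced a precreated object";
# here it succeeds on the 5th poll.
attempts=0
precreate_done() { attempts=$((attempts + 1)); [ "$attempts" -ge 5 ]; }

# Old behavior: bounded wait -- give up after a fixed number of tries.
bounded_wait() {
    tries=$1
    while [ "$tries" -gt 0 ]; do
        if precreate_done; then return 0; fi
        tries=$((tries - 1))
    done
    return 110   # mimics -ETIMEDOUT
}

# Patched behavior: keep waiting until completion (or a hard error).
wait_until_done() {
    until precreate_done; do :; done
    return 0
}
```

With a load that needs 5 polls, a 3-try bounded wait times out (the failure mode seen after MDT recovery, when OST creation is slow), while the unbounded wait eventually succeeds.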
| Comment by Hongchao Zhang [ 02/Mar/15 ] |
|
The requests sent to the OST after MDS failover took too long to complete: https://testing.hpdd.intel.com/test_logs/a9f3b136-bfe8-11e4-b6c3-5254006e85c2
Lustre: 2810:0:(client.c:1949:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1425098817/real 1425098820] req@ffff88006c4bc9c0 x1494324416086096/t0(0) o38->lustre-MDT0000-lwp-MDT0000@10.2.4.167@tcp:12/10 lens 400/544 e 0 to 1 dl 1425098822 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
Lustre: lustre-MDT0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted.
LustreError: 3165:0:(osp_precreate.c:1393:osp_precreate_reserve()) wait recovery to complete!
LustreError: 2811:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,37 [sent 1425098819/real 1425098819], current 1425098856 req@ffff88006c4bccc0 x1494324416086132/t4294967824(4294967824) o2->lustre-OST0001-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 2 to 0 dl 1425098877 ref 2 fl Interpret:R/0/0 rc 0/0
LustreError: 3165:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,50 [sent 1425098822/real 1425098822], current 1425098872 req@ffff88006fccb980 x1494324416086168/t4294967885(4294967885) o2->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 1 to 0 dl 1425098893 ref 2 fl Interpret:R/0/0 rc 0/0
LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) Skipped 2 previous similar messages
LustreError: 2811:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,57 [sent 1425098819/real 1425098819], current 1425098876 req@ffff88006ea53380 x1494324416086152/t4294967855(4294967855) o2->lustre-OST0001-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 3 to 0 dl 1425098902 ref 2 fl Interpret:R/0/0 rc 0/0
...
LustreError: 3165:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,77 [sent 1425098822/real 1425098822], current 1425098899 req@ffff88007d574680 x1494324416086196/t4294967962(4294967962) o2->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 2 to 0 dl 1425098918 ref 2 fl Interpret:R/0/0 rc 0/0
LNet: Service thread pid 3165 completed after 61.52s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) Skipped 5 previous similar messages
LustreError: 3165:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
Lustre: wanted 1, found 0
LustreError: 3165:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
Lustre: wanted 1, found 0
LustreError: 2839:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
Lustre: wanted 1, found 0
LustreError: 2839:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
Lustre: wanted 1, found 0
LustreError: 2841:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
Lustre: wanted 1, found 0
Lustre: wanted 1, found 0
LustreError: 2840:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
LustreError: 2840:0:(osp_precreate.c:1384:osp_precreate_reserve()) Skipped 1 previous similar message
Lustre: wanted 1, found 0
Lustre: wanted 1, found 0
Lustre: Skipped 1 previous similar message
LustreError: 2867:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,289 [sent 1425098846/real 1425098846], current 1425099135 req@ffff88006c7fb080 x1494324416086280/t0(0) o5->lustre-OST0001-osc-MDT0000@10.2.4.168@tcp:28/4 lens 432/400 e 9 to 0 dl 1425099148 ref 2 fl Interpret:RN/0/0 rc 0/0
LustreError: 3168:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
LustreError: 3168:0:(osp_precreate.c:1384:osp_precreate_reserve()) Skipped 3 previous similar messages
LustreError: 2865:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,345 [sent 1425098899/real 1425098899], current 1425099244 req@ffff88006fccb680 x1494324416086820/t0(0) o5->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 432/400 e 5 to 0 dl 1425099251 ref 2 fl Interpret:RN/0/0 rc 0/0
Lustre: wanted 1, found 0
Lustre: Skipped 1 previous similar message
LustreError: 2865:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,26 [sent 1425099257/real 1425099257], current 1425099283 req@ffff88006bba46c0 x1494324416088404/t0(0) o5->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 432/400 e 0 to 0 dl 1425099701 ref 2 fl Interpret:RN/0/0 rc 0/0
LustreError: 2865:0:(client.c:1866:ptlrpc_check_set()) Skipped 1 previous similar message
LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,26 [sent 1425099383/real 1425099383], current 1425099409 req@ffff88006bca3980 x1494324416091456/t4294972478(4294972478) o2->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 0 to 0 dl 1425099827 ref 2 fl Interpret:R/4/0 rc 0/0
LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) Skipped 1 previous similar message |
| Comment by Gerrit Updater [ 26/May/15 ] |
|
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/14934 |
| Comment by Gerrit Updater [ 03/Jun/15 ] |
|
Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/15122 |
| Comment by Saurabh Tandan (Inactive) [ 11/Dec/15 ] |
|
Hit on master, build #3264, tag 2.7.64 |
| Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ] |
|
Another instance found for hardfailover: EL6.7 Server/Client - ZFS |
| Comment by Saurabh Tandan (Inactive) [ 09/Feb/16 ] |
|
Another instance found for hardfailover: EL7 Server/Client - ZFS, tag 2.7.66, master build 3314 |