Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5526

recovery-mds-scale test failover_mds: dd: No space left on device

Details

    • 3
    • 15384

    Description

      While running recovery-mds-scale test failover_mds (MDS failed over 1 time), client load on one of the clients failed as follows:

      2014-08-16 20:44:59: dd run starting
      + mkdir -p /mnt/lustre/d0.dd-shadow-41vm5.shadow.whamcloud.com
      + /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-shadow-41vm5.shadow.whamcloud.com
      + cd /mnt/lustre/d0.dd-shadow-41vm5.shadow.whamcloud.com
      + sync
      ++ /usr/bin/lfs df /mnt/lustre/d0.dd-shadow-41vm5.shadow.whamcloud.com
      ++ awk '/filesystem summary:/ {print $5}'
      + FREE_SPACE=14195328
      + BLKS=1596974
      + echoerr 'Total free disk space is 14195328, 4k blocks to dd is 1596974'
      + echo 'Total free disk space is 14195328, 4k blocks to dd is 1596974'
      Total free disk space is 14195328, 4k blocks to dd is 1596974
      + load_pid=3715
      + wait 3715
      + dd bs=4k count=1596974 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-shadow-41vm5.shadow.whamcloud.com/dd-file
      dd: writing `/mnt/lustre/d0.dd-shadow-41vm5.shadow.whamcloud.com/dd-file': No space left on device
      1213957+0 records in
      1213956+0 records out
      + '[' 1 -eq 0 ']'
      ++ date '+%F %H:%M:%S'
      + echoerr '2014-08-16 20:52:06: dd failed'
      + echo '2014-08-16 20:52:06: dd failed'
      2014-08-16 20:52:06: dd failed
      

      Console log on the client:

      LustreError: 3715:0:(vvp_io.c:1081:vvp_io_commit_write()) Write page 1213956 of inode ffff88007ccedb78 failed -28
      

      Maloo report: https://testing.hpdd.intel.com/test_sets/acafc288-26a6-11e4-84f2-5254006e85c2

      The failure was reported in LU-3326 before. However, Lustre b2_5 build #80 already contained patch http://review.whamcloud.com/11425. So, more fixup is needed to resolve the failure.

      Attachments

        Issue Links

          Activity

            [LU-5526] recovery-mds-scale test failover_mds: dd: No space left on device

            Another instance found for hardfailover : EL7 Server/Client - ZFS, tag 2.7.66, master build 3314
            https://testing.hpdd.intel.com/test_sessions/f0dd9616-ca6e-11e5-9609-5254006e85c2

            standan Saurabh Tandan (Inactive) added a comment - Another instance found for hardfailover : EL7 Server/Client - ZFS, tag 2.7.66, master build 3314 https://testing.hpdd.intel.com/test_sessions/f0dd9616-ca6e-11e5-9609-5254006e85c2

            Another instance found for hardfailover: EL6.7 Server/Client - ZFS
            https://testing.hpdd.intel.com/test_sets/df3903c8-bbd7-11e5-8506-5254006e85c2

            standan Saurabh Tandan (Inactive) added a comment - Another instance found for hardfailover: EL6.7 Server/Client - ZFS https://testing.hpdd.intel.com/test_sets/df3903c8-bbd7-11e5-8506-5254006e85c2

            master, build# 3264, 2.7.64 tag
            Hard Failover: EL7 Server/Client
            https://testing.hpdd.intel.com/test_sets/56d3054e-9e20-11e5-91b0-5254006e85c2

            standan Saurabh Tandan (Inactive) added a comment - master, build# 3264, 2.7.64 tag Hard Failover: EL7 Server/Client https://testing.hpdd.intel.com/test_sets/56d3054e-9e20-11e5-91b0-5254006e85c2

            Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/15122
            Subject: LU-5526 llite: retry object creation
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ddd9319367f2373dc32c491e3ff2f1fd7ea2d617

            gerrit Gerrit Updater added a comment - Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/15122 Subject: LU-5526 llite: retry object creation Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: ddd9319367f2373dc32c491e3ff2f1fd7ea2d617

            Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/14934
            Subject: LU-5526 llite: retry object creation
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: cde90a2316328562d4c95babe09ab3276f18fdc2

            gerrit Gerrit Updater added a comment - Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/14934 Subject: LU-5526 llite: retry object creation Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: cde90a2316328562d4c95babe09ab3276f18fdc2

            the requests sent to OST after MDS failover spent too much time to complete, https://testing.hpdd.intel.com/test_logs/a9f3b136-bfe8-11e4-b6c3-5254006e85c2

            Lustre: 2810:0:(client.c:1949:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1425098817/real 1425098820]  req@ffff88006c4bc9c0 x1494324416086096/t0(0) o38->lustre-MDT0000-lwp-MDT0000@10.2.4.167@tcp:12/10 lens 400/544 e 0 to 1 dl 1425098822 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect
            Lustre: lustre-MDT0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted.
            LustreError: 3165:0:(osp_precreate.c:1393:osp_precreate_reserve()) wait recovery to complete!
            LustreError: 2811:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,37 [sent 1425098819/real 1425098819], current 1425098856  req@ffff88006c4bccc0 x1494324416086132/t4294967824(4294967824) o2->lustre-OST0001-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 2 to 0 dl 1425098877 ref 2 fl Interpret:R/0/0 rc 0/0
            LustreError: 3165:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
            LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,50 [sent 1425098822/real 1425098822], current 1425098872  req@ffff88006fccb980 x1494324416086168/t4294967885(4294967885) o2->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 1 to 0 dl 1425098893 ref 2 fl Interpret:R/0/0 rc 0/0
            LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) Skipped 2 previous similar messages
            LustreError: 2811:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,57 [sent 1425098819/real 1425098819], current 1425098876  req@ffff88006ea53380 x1494324416086152/t4294967855(4294967855) o2->lustre-OST0001-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 3 to 0 dl 1425098902 ref 2 fl Interpret:R/0/0 rc 0/0
            
            ...
            
            
            LustreError: 3165:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
            LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,77 [sent 1425098822/real 1425098822], current 1425098899  req@ffff88007d574680 x1494324416086196/t4294967962(4294967962) o2->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 2 to 0 dl 1425098918 ref 2 fl Interpret:R/0/0 rc 0/0
            LNet: Service thread pid 3165 completed after 61.52s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
            LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) Skipped 5 previous similar messages
            LustreError: 3165:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
            Lustre: wanted 1, found 0
            LustreError: 3165:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
            Lustre: wanted 1, found 0
            LustreError: 2839:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
            Lustre: wanted 1, found 0
            LustreError: 2839:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
            Lustre: wanted 1, found 0
            LustreError: 2841:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
            Lustre: wanted 1, found 0
            Lustre: wanted 1, found 0
            LustreError: 2840:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
            LustreError: 2840:0:(osp_precreate.c:1384:osp_precreate_reserve()) Skipped 1 previous similar message
            Lustre: wanted 1, found 0
            Lustre: wanted 1, found 0
            Lustre: Skipped 1 previous similar message
            LustreError: 2867:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,289 [sent 1425098846/real 1425098846], current 1425099135  req@ffff88006c7fb080 x1494324416086280/t0(0) o5->lustre-OST0001-osc-MDT0000@10.2.4.168@tcp:28/4 lens 432/400 e 9 to 0 dl 1425099148 ref 2 fl Interpret:RN/0/0 rc 0/0
            LustreError: 3168:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects
            LustreError: 3168:0:(osp_precreate.c:1384:osp_precreate_reserve()) Skipped 3 previous similar messages
            LustreError: 2865:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,345 [sent 1425098899/real 1425098899], current 1425099244  req@ffff88006fccb680 x1494324416086820/t0(0) o5->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 432/400 e 5 to 0 dl 1425099251 ref 2 fl Interpret:RN/0/0 rc 0/0
            Lustre: wanted 1, found 0
            Lustre: Skipped 1 previous similar message
            LustreError: 2865:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,26 [sent 1425099257/real 1425099257], current 1425099283  req@ffff88006bba46c0 x1494324416088404/t0(0) o5->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 432/400 e 0 to 0 dl 1425099701 ref 2 fl Interpret:RN/0/0 rc 0/0
            LustreError: 2865:0:(client.c:1866:ptlrpc_check_set()) Skipped 1 previous similar message
            LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,26 [sent 1425099383/real 1425099383], current 1425099409  req@ffff88006bca3980 x1494324416091456/t4294972478(4294972478) o2->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 0 to 0 dl 1425099827 ref 2 fl Interpret:R/4/0 rc 0/0
            LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) Skipped 1 previous similar message
            
            hongchao.zhang Hongchao Zhang added a comment - the requests sent to OST after MDS failover spent too much time to complete, https://testing.hpdd.intel.com/test_logs/a9f3b136-bfe8-11e4-b6c3-5254006e85c2 Lustre: 2810:0:(client.c:1949:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1425098817/real 1425098820] req@ffff88006c4bc9c0 x1494324416086096/t0(0) o38->lustre-MDT0000-lwp-MDT0000@10.2.4.167@tcp:12/10 lens 400/544 e 0 to 1 dl 1425098822 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 2 clients reconnect Lustre: lustre-MDT0000: Recovery over after 0:01, of 2 clients 2 recovered and 0 were evicted. LustreError: 3165:0:(osp_precreate.c:1393:osp_precreate_reserve()) wait recovery to complete! LustreError: 2811:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,37 [sent 1425098819/real 1425098819], current 1425098856 req@ffff88006c4bccc0 x1494324416086132/t4294967824(4294967824) o2->lustre-OST0001-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 2 to 0 dl 1425098877 ref 2 fl Interpret:R/0/0 rc 0/0 LustreError: 3165:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,50 [sent 1425098822/real 1425098822], current 1425098872 req@ffff88006fccb980 x1494324416086168/t4294967885(4294967885) o2->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 1 to 0 dl 1425098893 ref 2 fl Interpret:R/0/0 rc 0/0 LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) Skipped 2 previous similar messages LustreError: 2811:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,57 [sent 1425098819/real 1425098819], current 1425098876 req@ffff88006ea53380 x1494324416086152/t4294967855(4294967855) o2->lustre-OST0001-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 3 to 0 dl 1425098902 ref 2 fl Interpret:R/0/0 rc 0/0 ... LustreError: 3165:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,77 [sent 1425098822/real 1425098822], current 1425098899 req@ffff88007d574680 x1494324416086196/t4294967962(4294967962) o2->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 2 to 0 dl 1425098918 ref 2 fl Interpret:R/0/0 rc 0/0 LNet: Service thread pid 3165 completed after 61.52s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) Skipped 5 previous similar messages LustreError: 3165:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects Lustre: wanted 1, found 0 LustreError: 3165:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects Lustre: wanted 1, found 0 LustreError: 2839:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects Lustre: wanted 1, found 0 LustreError: 2839:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects Lustre: wanted 1, found 0 LustreError: 2841:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects Lustre: wanted 1, found 0 Lustre: wanted 1, found 0 LustreError: 2840:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects LustreError: 2840:0:(osp_precreate.c:1384:osp_precreate_reserve()) Skipped 1 previous similar message Lustre: wanted 1, found 0 Lustre: wanted 1, found 0 Lustre: Skipped 1 previous similar message LustreError: 2867:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,289 [sent 1425098846/real 1425098846], current 1425099135 req@ffff88006c7fb080 x1494324416086280/t0(0) o5->lustre-OST0001-osc-MDT0000@10.2.4.168@tcp:28/4 lens 432/400 e 9 to 0 dl 1425099148 ref 2 fl Interpret:RN/0/0 rc 0/0 LustreError: 3168:0:(osp_precreate.c:1384:osp_precreate_reserve()) expired on waiting for precreated objects LustreError: 3168:0:(osp_precreate.c:1384:osp_precreate_reserve()) Skipped 3 previous similar messages LustreError: 2865:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,345 [sent 1425098899/real 1425098899], current 1425099244 req@ffff88006fccb680 x1494324416086820/t0(0) o5->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 432/400 e 5 to 0 dl 1425099251 ref 2 fl Interpret:RN/0/0 rc 0/0 Lustre: wanted 1, found 0 Lustre: Skipped 1 previous similar message LustreError: 2865:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,26 [sent 1425099257/real 1425099257], current 1425099283 req@ffff88006bba46c0 x1494324416088404/t0(0) o5->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 432/400 e 0 to 0 dl 1425099701 ref 2 fl Interpret:RN/0/0 rc 0/0 LustreError: 2865:0:(client.c:1866:ptlrpc_check_set()) Skipped 1 previous similar message LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) @@@ Request cost too much time,26 [sent 1425099383/real 1425099383], current 1425099409 req@ffff88006bca3980 x1494324416091456/t4294972478(4294972478) o2->lustre-OST0000-osc-MDT0000@10.2.4.168@tcp:28/4 lens 560/400 e 0 to 0 dl 1425099827 ref 2 fl Interpret:R/4/0 rc 0/0 LustreError: 2812:0:(client.c:1866:ptlrpc_check_set()) Skipped 1 previous similar message

            this issue is caused by the slow creation at OST after the recovery of MDT, which caused "osp_precreate_reserve" failed with "-ETIMEDOUT",
            and LOD return "-ENOSPC" in this case.

            I have tried to double the wait time in "sop_precreate_reserve", but it still failed, and have created another patch to wait the creation until completion or error,
            the patch is still tracked at http://review.whamcloud.com/#/c/12126/

            hongchao.zhang Hongchao Zhang added a comment - this issue is caused by the slow creation at OST after the recovery of MDT, which caused "osp_precreate_reserve" failed with "-ETIMEDOUT", and LOD return "-ENOSPC" in this case. I have tried to double the wait time in "sop_precreate_reserve", but it still failed, and have created another patch to wait the creation until completion or error, the patch is still tracked at http://review.whamcloud.com/#/c/12126/
            sarah Sarah Liu added a comment -

            this also affect lustre-master build # 2835 replay-single

            https://testing.hpdd.intel.com/test_sets/42a5bd28-a81c-11e4-93dd-5254006e85c2

            sarah Sarah Liu added a comment - this also affect lustre-master build # 2835 replay-single https://testing.hpdd.intel.com/test_sets/42a5bd28-a81c-11e4-93dd-5254006e85c2
            yujian Jian Yu added a comment - - edited More instances on Lustre b2_5 branch with FSTYPE=zfs: https://testing.hpdd.intel.com/test_sets/66474710-5fb0-11e4-a865-5254006e85c2 https://testing.hpdd.intel.com/test_sets/10035164-7a72-11e4-be53-5254006e85c2
            hongchao.zhang Hongchao Zhang added a comment - the patch is tracked at http://review.whamcloud.com/#/c/12126/

            People

              hongchao.zhang Hongchao Zhang
              yujian Jian Yu
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: