Lustre / LU-7309

replay-single test_70b: no space left on device

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.8.0
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/4f0d6d92-718c-11e5-bffb-5254006e85c2.

      The sub-test test_70b failed with the following error in the client test log:

      shadow-52vm5: [11429] open ./clients/client0/~dmtmp/PWRPNT/PPTC112.TMP failed for handle 12322 (No space left on device)
      shadow-52vm5: (11430) ERROR: handle 12322 was not found
      shadow-52vm5: Child failed with status 1
      shadow-52vm1: [11429] open ./clients/client0/~dmtmp/PWRPNT/PPTC112.TMP failed for handle 12322 (No space left on device)
      shadow-52vm1: (11430) ERROR: handle 12322 was not found
      shadow-52vm1: Child failed with status 1
      

      Please provide additional information about the failure here.

      Info required for matching: replay-single 70b

          Activity

            jgmitter Joseph Gmitter (Inactive) added a comment:

            Landed for 2.8

            gerrit Gerrit Updater added a comment:

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17839/
            Subject: LU-7309 lod: notify client retry creation
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5ebc00ec79565ad62e978af65b023343ad360675

            gerrit Gerrit Updater added a comment:

            Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/17839
            Subject: LU-7309 lod: wait OSP to connect for creation
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d4500300ad8d94d574fe9c87bc91910e54dd9828

            hongchao.zhang Hongchao Zhang added a comment:

            Okay, I'll try to create a corresponding patch that waits for the connections to the OSTs before handling the creation request.

            adilger Andreas Dilger added a comment:

            I don't think the client should be retrying, since it is the MDS's job to handle recovery properly. I think the MDS should hold the create request until the connections to the OSTs are available after the MDS restarts.
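The difference between failing the create at once and holding it until an OST connection is available can be illustrated with a toy model. This is only a sketch and not Lustre code: the `Target` class, the `create` function, the tick-based clock, and the `max_ticks` limit are all hypothetical names and simplifications introduced here for illustration.

```python
import errno


class Target:
    """Hypothetical stand-in for one MDS-to-OST connection (not a Lustre type)."""

    def __init__(self, name, connects_at):
        self.name = name
        self.connects_at = connects_at  # tick at which the connection comes up

    def connected(self, now):
        return now >= self.connects_at


def create(targets, now, hold, max_ticks=10):
    """Pick a target for a new object.

    hold=False: fail at once with -ENOSPC when no target is connected.
    hold=True:  keep the request pending, advancing the clock one tick at a
                time, until a target connects or max_ticks runs out.
    Returns (result, tick), where result is a target name or a negative errno.
    """
    for tick in range(now, now + max_ticks + 1):
        usable = [t for t in targets if t.connected(tick)]
        if usable:
            return usable[0].name, tick
        if not hold:
            break
    return -errno.ENOSPC, tick


# Both connections are still down at tick 0, as after an MDS restart.
targets = [Target("OST0000", connects_at=3), Target("OST0001", connects_at=5)]
print(create(targets, now=0, hold=False))  # fails immediately
print(create(targets, now=0, hold=True))   # succeeds once OST0000 connects
```

In this model the immediate failure surfaces to the caller as "No space left on device" even though no target is actually full, which matches the symptom in the test log; holding the request trades that error for added latency bounded by `max_ticks`.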

            hongchao.zhang Hongchao Zhang added a comment:

            I have analyzed several failed tests. Some of them did not contain obvious error logs related to "ENOSPC", but they did indicate that the problem
            was caused by object creation at the MDT, which is similar to the issue in LU-4846 and LU-5526.

            In https://testing.hpdd.intel.com/test_logs/bdd52dbe-a080-11e5-85ed-5254006e85c2/show_text:

            00020000:01000000:0.0:1449883124.461043:0:14270:0:(lod_qos.c:218:lod_statfs_and_check()) lustre-OST0000-osc-MDT0000: turns inactive
            00020000:01000000:0.0:1449883124.461045:0:14270:0:(lod_qos.c:218:lod_statfs_and_check()) lustre-OST0001-osc-MDT0000: turns inactive
            00020000:01000000:0.0:1449883124.461046:0:14270:0:(lod_qos.c:218:lod_statfs_and_check()) lustre-OST0002-osc-MDT0000: turns inactive
            00020000:01000000:0.0:1449883124.461048:0:14270:0:(lod_qos.c:218:lod_statfs_and_check()) lustre-OST0003-osc-MDT0000: turns inactive
            00020000:01000000:0.0:1449883124.461048:0:14270:0:(lod_qos.c:218:lod_statfs_and_check()) lustre-OST0004-osc-MDT0000: turns inactive
            00020000:01000000:0.0:1449883124.461049:0:14270:0:(lod_qos.c:218:lod_statfs_and_check()) lustre-OST0005-osc-MDT0000: turns inactive

            The OSP devices were turned inactive due to "-ENOTCONN" right between the failed requests from the client.
            The problem here is a little different from that in LU-5526, but it is still better to let the client retry the creation request,
            just as in LU-5526/LU-4846.
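When triaging similar failures, the "turns inactive" lines can be pulled out of a debug log mechanically. The following is a minimal sketch that assumes the log lines have exactly the shape quoted above; the regular expression's group names (`subsys`, `mask`, `cpu`, and so on) and the `inactive_devices` helper are labels chosen for this sketch, not names taken from Lustre.

```python
import re

# Pattern for lines shaped like the ones quoted above.
# The group names are this sketch's own labels for the colon-separated fields.
LINE_RE = re.compile(
    r"^(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):"
    r"(?P<cpu>[\d.]+):(?P<timestamp>\d+\.\d+):"
    r"\d+:(?P<pid>\d+):\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+"
    r"(?P<device>\S+): turns inactive$"
)


def inactive_devices(lines):
    """Return (timestamp, device) string pairs for every 'turns inactive' line."""
    found = []
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if m:
            found.append((m.group("timestamp"), m.group("device")))
    return found


SAMPLE = """\
00020000:01000000:0.0:1449883124.461043:0:14270:0:(lod_qos.c:218:lod_statfs_and_check()) lustre-OST0000-osc-MDT0000: turns inactive
00020000:01000000:0.0:1449883124.461049:0:14270:0:(lod_qos.c:218:lod_statfs_and_check()) lustre-OST0005-osc-MDT0000: turns inactive
""".splitlines()

for ts, dev in inactive_devices(SAMPLE):
    print(ts, dev)
```

Timestamps are kept as strings so that the microsecond digits are preserved exactly; comparing them against the time of the failed client request shows whether the OSPs went inactive around the failure.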

            jgmitter Joseph Gmitter (Inactive) added a comment:

            Hongchao,
            James reports that this problem is hitting master, 2.x, and 3.0.0 for the failover test group. This issue is not being seen in autotest. In most of these failure scenarios, several other tests also fail with the same error. He strongly believes the issue is not directly related to test_70b, but rather is a larger issue. He also believes there is evidence that it is a duplicate of LU-4846.
            Can you have a look please?
            Thanks.
            Joe

            standan Saurabh Tandan (Inactive) added a comment:

            Server: 2.5.5, b2_5_fe/62
            Client: Master, Build# 3266, Tag 2.7.64
            https://testing.hpdd.intel.com/test_sets/a061b2ba-a04a-11e5-a33d-5254006e85c2

            sarah Sarah Liu added a comment (edited):

            More instances:

            master tag-2.7.63
            client/server: RHEL6.7
            https://testing.hpdd.intel.com/test_sets/805b3852-947d-11e5-95f7-5254006e85c2
            server: RHEL6.7, client SLES11 SP3
            https://testing.hpdd.intel.com/test_sessions/54c65392-9131-11e5-ad50-5254006e85c2
            client/server: RHEL7 zfs
            https://testing.hpdd.intel.com/test_sessions/e41920d2-945c-11e5-b268-5254006e85c2

            master tag-2.7.64
            client/server: RHEL6.7
            https://testing.hpdd.intel.com/test_sets/80a20678-9edd-11e5-87a9-5254006e85c2
            server: RHEL7, client SLES11 SP3
            https://testing.hpdd.intel.com/test_sets/a8b3fb9e-a077-11e5-8d69-5254006e85c2
            RHEL7 zfs
            https://testing.hpdd.intel.com/test_sets/5ba6d7bc-9e20-11e5-91b0-5254006e85c2

            jgmitter Joseph Gmitter (Inactive) added a comment:

            James,
            Per triage discussion, can you take a look to gather a bit more info?
            Thanks.
            Joe

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 9
