[LU-7309] replay-single test_70b: no space left on device Created: 16/Oct/15  Updated: 28/Nov/16  Resolved: 28/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: p4hc

Issue Links:
Related
is related to LU-5526 recovery-mds-scale test failover_mds:... Resolved
is related to LU-6844 replay-single test 70b failure: 'rund... Resolved
is related to LU-4846 Failover test failure on test suite r... Open
is related to LU-7352 conf-sanity test_78: no space left on... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/4f0d6d92-718c-11e5-bffb-5254006e85c2.

The sub-test test_70b failed with the following error in the client test log:

shadow-52vm5: [11429] open ./clients/client0/~dmtmp/PWRPNT/PPTC112.TMP failed for handle 12322 (No space left on device)
shadow-52vm5: (11430) ERROR: handle 12322 was not found
shadow-52vm5: Child failed with status 1
shadow-52vm1: [11429] open ./clients/client0/~dmtmp/PWRPNT/PPTC112.TMP failed for handle 12322 (No space left on device)
shadow-52vm1: (11430) ERROR: handle 12322 was not found
shadow-52vm1: Child failed with status 1

Please provide additional information about the failure here.

Info required for matching: replay-single 70b



 Comments   
Comment by Joseph Gmitter (Inactive) [ 19/Oct/15 ]

James,
Per triage discussion, can you take a look to gather a bit more info?
Thanks.
Joe

Comment by Sarah Liu [ 15/Dec/15 ]

More instances:
master tag-2.7.63 client/server: RHEL6.7
https://testing.hpdd.intel.com/test_sets/805b3852-947d-11e5-95f7-5254006e85c2
server: RHEL6.7, client SLES11 SP3
https://testing.hpdd.intel.com/test_sessions/54c65392-9131-11e5-ad50-5254006e85c2
client/server: RHEL7 zfs
https://testing.hpdd.intel.com/test_sessions/e41920d2-945c-11e5-b268-5254006e85c2
master tag-2.7.64 client/server: RHEL6.7
https://testing.hpdd.intel.com/test_sets/80a20678-9edd-11e5-87a9-5254006e85c2
server: RHEL7, client SLES11 SP3
https://testing.hpdd.intel.com/test_sets/a8b3fb9e-a077-11e5-8d69-5254006e85c2
RHEL7 zfs
https://testing.hpdd.intel.com/test_sets/5ba6d7bc-9e20-11e5-91b0-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 16/Dec/15 ]

Server: 2.5.5, b2_5_fe/62
Client: Master, Build# 3266, Tag 2.7.64
https://testing.hpdd.intel.com/test_sets/a061b2ba-a04a-11e5-a33d-5254006e85c2

Comment by Joseph Gmitter (Inactive) [ 04/Jan/16 ]

Hongchao,
James reports that this problem is hitting master, 2.x, and 3.0.0 for the failover test group. This issue is not being seen in autotest. In most of these failure scenarios, several other tests will also fail with the same error. He strongly believes the issue is not directly related to test_70b, but rather is a larger issue. He also believes there is evidence to say that it is a duplicate of LU-4846.
Can you have a look please?
Thanks.
Joe

Comment by Hongchao Zhang [ 05/Jan/16 ]

I have analyzed several failed tests. Some didn't contain obvious error logs related to "ENOSPC", but they did indicate the problem
was caused by the object creation at the MDT, which is similar to the issue in LU-4846 and LU-5526.

in https://testing.hpdd.intel.com/test_logs/bdd52dbe-a080-11e5-85ed-5254006e85c2/show_text

00020000:01000000:0.0:1449883124.461043:0:14270:0:(lod_qos.c:218:lod_statfs_and_check()) lustre-OST0000-osc-MDT0000: turns inactive
00020000:01000000:0.0:1449883124.461045:0:14270:0:(lod_qos.c:218:lod_statfs_and_check()) lustre-OST0001-osc-MDT0000: turns inactive
00020000:01000000:0.0:1449883124.461046:0:14270:0:(lod_qos.c:218:lod_statfs_and_check()) lustre-OST0002-osc-MDT0000: turns inactive
00020000:01000000:0.0:1449883124.461048:0:14270:0:(lod_qos.c:218:lod_statfs_and_check()) lustre-OST0003-osc-MDT0000: turns inactive
00020000:01000000:0.0:1449883124.461048:0:14270:0:(lod_qos.c:218:lod_statfs_and_check()) lustre-OST0004-osc-MDT0000: turns inactive
00020000:01000000:0.0:1449883124.461049:0:14270:0:(lod_qos.c:218:lod_statfs_and_check()) lustre-OST0005-osc-MDT0000: turns inactive

The OSP devices were turned inactive due to "-ENOTCONN" just before the failed request from the client.
The problem here is a little different from that in LU-5526, but it is still better to let the client retry the creation request,
as in LU-5526/LU-4846.
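The log excerpt above shows every OST's OSP device being flagged inactive on the MDT. A minimal sketch (in C, since Lustre is C code) of why that surfaces to the client as ENOSPC: object allocation on the MDT scans the OST targets and, when none is active, has no target left to allocate from, even though the OSTs themselves still have free space. The struct and function names here are illustrative, not the actual Lustre API (the real selection logic lives in lod_qos.c):

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified model of MDT-side OST selection. */
struct ost_target {
	int active;           /* cleared when statfs hits -ENOTCONN */
	unsigned long bavail; /* free blocks last reported by statfs */
};

/* Pick the first active OST with free space; -ENOSPC when none qualify.
 * This mirrors the failure mode in this ticket: during MDS failover all
 * OSPs were briefly inactive, so allocation found no target and the
 * client saw "No space left on device". */
int pick_ost(const struct ost_target *osts, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (osts[i].active && osts[i].bavail > 0)
			return (int)i;
	return -ENOSPC;
}
```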

Comment by Andreas Dilger [ 05/Jan/16 ]

I don't think the client should be retrying, since it is the MDS's job to handle the recovery properly. I think the MDS should hold the create request until the connections to the OSTs are available after it restarts.

Comment by Hongchao Zhang [ 06/Jan/16 ]

Okay, I'll create the corresponding patch so that the creation request waits for the connections to the OSTs.
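The approach being discussed can be sketched as follows. This is an illustrative model, not the landed patch (whose final subject was "LU-7309 lod: notify client retry creation"): before allocating objects for a create, the MDS polls the OSP connection state a bounded number of times; if no OSP comes back before the limit, it returns a retryable status (-EAGAIN here) so the client resends the request instead of failing hard with -ENOSPC. The names `osp_snapshot` and `wait_for_osp` are invented for this sketch:

```c
#include <errno.h>
#include <stddef.h>

/* One observation of the OSP connection state per polling attempt. */
struct osp_snapshot {
	int n_active; /* number of OSPs currently connected */
};

/* Return 0 as soon as a poll sees at least one active OSP, or -EAGAIN
 * after max_polls attempts, signalling the client to retry the create
 * rather than receiving a spurious -ENOSPC during MDS failover. */
int wait_for_osp(const struct osp_snapshot *polls, size_t max_polls)
{
	size_t i;

	for (i = 0; i < max_polls; i++)
		if (polls[i].n_active > 0)
			return 0;
	return -EAGAIN;
}
```

In the real server the polling would be an interruptible wait with a timeout rather than a fixed array of snapshots; the array here just makes the control flow easy to exercise.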

Comment by Gerrit Updater [ 06/Jan/16 ]

Hongchao Zhang (hongchao.zhang@intel.com) uploaded a new patch: http://review.whamcloud.com/17839
Subject: LU-7309 lod: wait OSP to connect for creation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d4500300ad8d94d574fe9c87bc91910e54dd9828

Comment by Gerrit Updater [ 28/Jan/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17839/
Subject: LU-7309 lod: notify client retry creation
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5ebc00ec79565ad62e978af65b023343ad360675

Comment by Joseph Gmitter (Inactive) [ 28/Jan/16 ]

Landed for 2.8

Generated at Sat Feb 10 02:07:46 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.