[LU-1475] lov_update_create_set() error creating fid 0xf99b sub-object on OST idx 0/1: rc = -5 Created: 05/Jun/12  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.1.2, Lustre 2.1.3
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jian Yu Assignee: Oleg Drokin
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Lustre Tag: v2_1_2_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/86/
Distro/Arch: RHEL6.2/x86_64
Network: IB (in-kernel OFED)


Issue Links:
Duplicate
is duplicated by LU-1454 1.8.8-wc1<->2.1.2 interop: Test failu... Resolved
Severity: 3
Rank (Obsolete): 8533

 Description   

While using mdsrate to create a large number of files to use up the free inodes on a 48TB OST on the Juelich test cluster, the following EIO error occurred:

+ su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun -mca orte_rsh_agent rsh:ssh -np 25 -machinefile /tmp/large-lun.machines /usr/lib64/lustre/tests/mdsrate --create --verbose --ndirs 25 --dirfmt '/mnt/lustre/mdsrate/dir%d' --nfiles 1993129 --filefmt 'file%%d' "
<~snip~>
Rank 2: 71.24 creates/sec 140.38 secs (total: 260000 creates 3334.84 secs)
Rank 21: 71.24 creates/sec 140.38 secs (total: 260000 creates 3334.84 secs)
rank 4: open(file264701) error: Input/output error

Dmesg on the MDS node showed that:

LustreError: 4988:0:(lov_request.c:572:lov_update_create_set()) error creating fid 0xf99b sub-object on OST idx 0/1: rc = -5

Maloo report:
https://maloo.whamcloud.com/test_sets/379a475c-aee9-11e1-b0f7-52540035b04c

The same issue also occurred on a 1TB OST: https://maloo.whamcloud.com/test_sets/f900e0aa-aee8-11e1-b0f7-52540035b04c



 Comments   
Comment by Peter Jones [ 05/Jun/12 ]

Oleg could you please look into this one? Thanks!

Comment by Jian Yu [ 05/Jun/12 ]

From http://wiki.whamcloud.com/display/ENG/Lustre+2.1.1+release+testing+tracker, I found that the large-number-of-inodes testing had passed on Lustre 2.1.1 RC1 against a 128TB OST on a DDN SFA10KE storage appliance:
https://maloo.whamcloud.com/test_sets/22872264-aef7-11e1-b0f7-52540035b04c

The test was run on a single node at that time.

Now, on the Juelich cluster with separate Lustre client, MDS and OSS nodes, I performed the same test on Lustre 2.1.1 RC4, and the EIO error also occurred while testing against a 1TB OST:
https://maloo.whamcloud.com/test_sets/bbce36f6-aee8-11e1-b0f7-52540035b04c

Comment by Oleg Drokin [ 05/Jun/12 ]

From the logs, the MDS has a strong belief that it got disconnected from the OSTs, even though there are no messages supporting this state (evictions or the like).

Comment by Oleg Drokin [ 05/Jun/12 ]

I imagine a higher debug level might shed some light on the situation, though I am not sure how high you can reasonably go.

Ideally, if you increase debug_mb to 1024 and the debug level to -1, that would probably be best.

Comment by Jian Yu [ 06/Jun/12 ]

Ideally, if you increase debug_mb to 1024 and the debug level to -1, that would probably be best.

OK.
It seems there is some hardware issue on the MDS node n002. It's disconnected. I've been trying to power-cycle it and even re-provision it, but it still does not start up.

I'm provisioning another MDS node n001...

Comment by Jian Yu [ 06/Jun/12 ]

Neither n001 nor n002 worked. Finally, I provisioned n003 and booted it up. The testing is ongoing now.

Comment by Jian Yu [ 06/Jun/12 ]

Hi Oleg,
I reproduced the failure and gathered the debug logs at the -1 level. Please find /scratch/logs/2.1.2/large-lun-145447.tar.bz2 on the brent node. Thanks!

Comment by Oleg Drokin [ 07/Jun/12 ]

Thanks!
This log is pretty useful.

I think I see the problem now, introduced by WangDi's patch for bug 21379, change eb7c28ff977f4e0a280558aa74e23f2a9ab0ea0c

In lov_check_and_create_object() we start with rc = -EIO for the early-exit case of no available OSCs,
and then we try to see whether there are any objects available anywhere:

+        /* check if objects has been created on this ost */
+        for (stripe = 0; stripe < lsm->lsm_stripe_count; stripe++) {
+                /* already have object at this stripe */
+                if (ost_idx == lsm->lsm_oinfo[stripe]->loi_ost_idx)
+                        break;
+        }

If we did find something (the result of a racing create completion that we can clearly see in the logs), then the next if statement is not triggered, and as such we exit via the following RETURN(rc) with rc still being the initial -EIO.

I think the fix should just be to either add "else rc = 0" to the last if statement, or to add rc = 0 before the last if statement.
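
For illustration, a minimal sketch of the "else rc = 0" variant suggested above, reusing the names from the quoted excerpt (hypothetical code, not the actual patch; the create path and the rest of the function are elided):

/* Hypothetical sketch only -- not the real lov_check_and_create_object(). */
rc = -EIO;      /* early-exit value when no OSC is available */

/* check if an object has already been created on this ost */
for (stripe = 0; stripe < lsm->lsm_stripe_count; stripe++) {
        /* already have an object at this stripe */
        if (ost_idx == lsm->lsm_oinfo[stripe]->loi_ost_idx)
                break;
}

if (stripe == lsm->lsm_stripe_count) {
        /* no object on this OST yet: go and create one (elided) */
} else {
        /* a racing create already supplied the object on this stripe,
         * so do not leave rc at the initial -EIO */
        rc = 0;
}

RETURN(rc);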

Wangdi, what do you think?

Comment by Jian Yu [ 07/Jun/12 ]

Lustre Tag: v2_1_2_RC2
Lustre Build: http://build.whamcloud.com/job/lustre-b2_1/87/
Distro/Arch: RHEL6.2/x86_64(server), RHEL6.2/i686(client)
Network: TCP (1GigE)
ENABLE_QUOTA=yes

sanity test 220 failed with the same issue: https://maloo.whamcloud.com/test_sets/b0aeb296-b07d-11e1-99c6-52540035b04c

I found this failure occurred before and was marked as LU-782.

Comment by Di Wang [ 07/Jun/12 ]

Wangdi, what do you think?

Yes, it should reset EIO to 0 if it found some objects on the stripe.

Comment by Oleg Drokin [ 07/Jun/12 ]

patch at http://review.whamcloud.com/3058

Comment by Sebastien Piechurski [ 06/Jul/12 ]

I uploaded debug logs in LU-782 for a similar issue, except that the return code is ENOTCONN; the issue we are currently seeing is actually a mix of ENOTCONN and EIO.

What is the status of the patch? I see there was a regression introduced with the current version, but can you see why?

Comment by Oleg Drokin [ 25/Jul/12 ]

Hm, I just dug some more into the logs as part of trying to improve my original patch, and came to the conclusion that my initial analysis was wrong (and the "fix" is wrong too).

What really happens (in the log file I have at hand):
The OST starts to feel really bad (the logs have multiple messages that I interpret as I/O errors, like

sd 6:0:15:0: attempting task abort! scmd(ffff88060e198d80)
sd 6:0:15:0: [sdo] CDB: Write(10): 2a 00 01 cb 62 c0 00 00 08 00
scsi target6:0:15: handle(0x001d), sas_address(0x5000c50040cf2bc1), phy(18)
scsi target6:0:15: enclosure_logical_id(0x500304800137caff), slot(6)
sd 6:0:15:0: task abort: SUCCESS scmd(ffff88060e198d80)

)
A precreate request is sent as we start to deplete our precreated objects.
This precreate request gets stuck on the OST somewhere. The OST knows it is slow, so it sends periodic early replies to let us know it is still working on the request.
Eventually we exhaust all the precreated objects. A fake request is created for the purpose of waiting for some more objects to appear, and it is placed in the ptlrpcd queue.

No more objects are created (since the OST is in trouble).
Ten seconds later, ptlrpc decides that the fake request has been sitting in the queue for way too long and should be timed out,
and this sets off the whole error path that is eventually propagated to userspace (because we don't have any other OSTs to allocate from).
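
As a rough, purely hypothetical illustration of that last step (generic C, not ptlrpc code): a caller waits on a shared "objects available" condition with a deadline, and when the deadline expires with no refill, the wait turns into the -EIO that userspace eventually sees.

#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical illustration only -- names and structure are invented. */
struct precreate_pool {
        pthread_mutex_t lock;
        pthread_cond_t  refilled;
        int             available;      /* precreated objects left */
};

/* Wait up to timeout_sec for the pool to be refilled; return 0 on
 * success or -EIO when the wait times out, mirroring the error that
 * reaches userspace in the scenario described above. */
static int wait_for_objects(struct precreate_pool *pool, int timeout_sec)
{
        struct timespec deadline;
        int rc = 0;

        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += timeout_sec;

        pthread_mutex_lock(&pool->lock);
        while (pool->available == 0 && rc == 0)
                rc = pthread_cond_timedwait(&pool->refilled, &pool->lock,
                                            &deadline);
        if (pool->available > 0) {
                pool->available--;      /* consume one precreated object */
                rc = 0;
        } else {
                rc = -EIO;              /* no refill before the deadline */
        }
        pthread_mutex_unlock(&pool->lock);
        return rc;
}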

Sebastien: the LU-782 log excerpt you posted there shows the same sign of the fake request timing out (o-1), so likely something very similar is going on in that case.

Anyway, for this bug we need to figure out why the OST is unhappy.
YuJian, is this still repeatable? On other nodes too? Every time with scsi complaining about errors?

Comment by Sebastien Piechurski [ 26/Jul/12 ]

Oleg,

Looking back at the logs on the OSS from when the problem first appeared, I don't see any obvious problems. However, the problem arose after a maintenance window in which a writeconf was done on all the targets, which were then all remounted in parallel (around 200 targets at once).
One of our experts suspected that something had gone wrong at re-registration time and that the configs were somehow corrupted.
Reapplying the writeconf and re-registering each target one by one made the problem disappear. It is still difficult to understand why.

Comment by Jodi Levi (Inactive) [ 26/Jul/12 ]

Yu Jian,
Can you let us know if this is still happening often? We are trying to determine if this should still be a blocker for 2.3 Release.

Comment by Jian Yu [ 01/Aug/12 ]

Hi Oleg and Jodi,

YuJian, is this still repeatable? On other nodes too? Every time with scsi complaining about errors?

Can you let us know if this is still happening often? We are trying to determine if this should still be a blocker for 2.3 Release.

I have not run mdsrate on the b2_1 branch since Lustre 2.1.2 was released. I'll find some Toro test nodes to see whether I can reproduce it on the latest b2_1 and master branches.

However, I found that LU-1454 (which is a duplicate of this ticket) consistently occurred in autotest runs:
http://tinyurl.com/crz74xd

Comment by Oleg Drokin [ 06/Aug/12 ]

So LU-1454 was caused by the same problem, but it is really a combination of a faulty test and the default behavior.

What happens there is that we have an MDT that has just started, and then we restart the (only) OST.
Next we connect the client and try to touch a file.
The MDT does not yet know that the OST restarted, and it happens not to have any precreated objects for this OST. The attempt to get more objects results in finding out that the OST was restarted and that we no longer have a connection to it. There are no other OSTs to try, so we return an I/O error.

For the Bull-reported failures the situation is less clear. On the other hand, we have had multiple complaints from other sites that registering many targets on the MGS at the same time (many = more than 30 or so at a time) results in strange problems, MGS crashes and such. This warrants its own bug, I guess.

Comment by Jian Yu [ 13/Aug/12 ]

RHEL6.3/x86_64 (2.1.3 Server + 1.8.8-wc1 Client):
https://maloo.whamcloud.com/test_sets/bc2dffb8-e427-11e1-b6d3-52540035b04c

Comment by Jian Yu [ 21/Aug/12 ]

Another instance:
https://maloo.whamcloud.com/test_sets/d98970d4-eabb-11e1-b137-52540035b04c

Comment by Jian Yu [ 15/Oct/12 ]

RHEL6.3/x86_64 (2.3.0 RC3 Server + 1.8.8-wc1 Client):
https://maloo.whamcloud.com/test_sets/e309f73e-169c-11e2-962d-52540035b04c

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.
