[LU-1797] MDS deactivates/reactivates OSTs Created: 28/Aug/12  Updated: 09/Jan/20  Resolved: 09/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Cliff White (Inactive) Assignee: Oleg Drokin
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Lustre 2.2.93, LLNL/Hyperion


Attachments: File failure.txt.gz    
Severity: 3
Rank (Obsolete): 8543

 Description   

Running the SWL test suite (a mixture of IO jobs spread across all clients), I am seeing this on multiple OSTs.

Aug 28 05:11:41 ehyperion-dit34 kernel: Lustre: lustre-OST000d: Client lustre-MDT0000-mdtlov_UUID (at 192.168.127.6@o2ib1) reconnecting
Aug 28 05:11:41 ehyperion-dit34 kernel: Lustre: lustre-OST000d: received MDS connection from 192.168.127.6@o2ib1
Aug 28 05:11:41 ehyperion-rst6 kernel: Lustre: 3947:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346155795/real 1346155795]  req@ffff880093a2d000 x1411378033913049/t0(0) o5->lustre-OST000d-osc-MDT0000@192.168.127.65@o2ib1:7/4 lens 432/432 e 0 to 1 dl 1346155901 ref 1 fl Rpc:RXN/0/ffffffff rc 0/-1
Aug 28 05:11:41 ehyperion-rst6 kernel: Lustre: lustre-OST000d-osc-MDT0000: Connection to lustre-OST000d (at 192.168.127.65@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
Aug 28 05:11:41 ehyperion-rst6 kernel: Lustre: lustre-OST000d-osc-MDT0000: Connection restored to lustre-OST000d (at 192.168.127.65@o2ib1)
Aug 28 05:11:41 ehyperion-rst6 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST000d_UUID now active, resetting orphans
Aug 28 05:11:55 ehyperion-rst6 kernel: LustreError: 3947:0:(osc_create.c:169:osc_interpret_create()) @@@ Unknown rc -107 from async create: failing oscc  req@ffff880093a2d000 x1411378033913049/t0(0) o5->lustre-OST000d-osc-MDT0000@192.168.127.65@o2ib1:7/4 lens 432/432 e 0 to 1 dl 1346155901 ref 1 fl Interpret:RXN/0/ffffffff rc -107/-1
Aug 28 05:12:31 ehyperion-rst6 kernel: LustreError: 12495:0:(mds_lov.c:883:__mds_lov_synchronize()) lustre-OST000d_UUID failed at mds_lov_clear_orphans: -5
Aug 28 05:12:31 ehyperion-rst6 kernel: LustreError: 12495:0:(mds_lov.c:903:__mds_lov_synchronize()) lustre-OST000d_UUID sync failed -5, deactivating
Aug 28 05:23:07 ehyperion-rst6 kernel: Lustre: 3948:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346156476/real 1346156476]  req@ffff8801fa613400 x1411378034235068/t0(0) o400->lustre-OST000d-osc-MDT0000@192.168.127.65@o2ib1:28/4 lens 224/224 e 0 to 1 dl 1346156587 ref 1 fl Rpc:RXN/0/ffffffff rc 0/-1
Aug 28 05:23:07 ehyperion-rst6 kernel: Lustre: lustre-OST000d-osc-MDT0000: Connection to lustre-OST000d (at 192.168.127.65@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
Aug 28 05:23:07 ehyperion-dit34 kernel: Lustre: lustre-OST000d: Client lustre-MDT0000-mdtlov_UUID (at 192.168.127.6@o2ib1) reconnecting
Aug 28 05:23:07 ehyperion-dit34 kernel: Lustre: lustre-OST000d: received MDS connection from 192.168.127.6@o2ib1
Aug 28 05:23:07 ehyperion-rst6 kernel: Lustre: lustre-OST000d-osc-MDT0000: Connection restored to lustre-OST000d (at 192.168.127.65@o2ib1)
Aug 28 05:23:07 ehyperion-rst6 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST000d_UUID now active, resetting orphans

In this case, the OST recovered without intervention. In other cases 'lctl --device NN recover' has fixed the problem, resulting in messages such as:

Aug 28 07:56:04 ehyperion-dit33 kernel: Lustre: 7426:0:(llog_net.c:162:llog_receptor_accept()) changing the import ffff8802dbaaf800 - ffff8802cf6ff000
Aug 28 07:56:04 ehyperion-dit33 kernel: Lustre: 7426:0:(llog_net.c:162:llog_receptor_accept()) Skipped 1 previous similar message
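
For reference, a minimal sketch of that manual workaround, run on the MDS (NN is a placeholder for the device index of the affected import):

# list configured devices and note the index of the deactivated OSC, e.g. lustre-OST000d-osc-MDT0000
lctl dl
# force recovery of that import
lctl --device NN recover
# or, if the MDS still shows the import as inactive, re-enable it explicitly
lctl --device NN activate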

This does not appear to be causing SWL tests to fail at this time.



 Comments   
Comment by Cliff White (Inactive) [ 28/Aug/12 ]

The OSS node does not report errors, nor does it inactivate connections. There are no client-side errors reported.

Comment by Peter Jones [ 28/Aug/12 ]

Oleg

Could you please look into this one?

Peter

Comment by Oleg Drokin [ 28/Aug/12 ]

Hm, it almost looks like we go through orphan deletion fine, progress further into precreate, get disconnected there again for some reason, and then the check for the OSC being in recovery fails and we bail out with EIO.
It would be useful to have an lctl dk from the MDT when this happens (with +ha enabled in the debug level).

The other, separate question is: how come the OST did not reply to the create request? Even if the OST is busy on the backend, at least early replies should still be flowing. It would be good to get a matching lctl dk from the OST as well, but I guess under this kind of load the buffer rotates quite quickly and it would be useless.
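
A rough sketch of how such a trace might be captured on the MDT (exact parameter syntax may vary by version; the dump path is arbitrary):

# add the 'ha' flag to the debug mask and enlarge the trace buffer so it survives until the dump
lctl set_param debug=+ha
lctl set_param debug_mb=512
# ...after the deactivation messages appear, dump (and clear) the kernel debug log
lctl dk /tmp/lu-1797-mdt.dk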

Comment by Cliff White (Inactive) [ 28/Aug/12 ]

Attached is lctl dk from a failure (lustre-OST0032)

Comment by Cliff White (Inactive) [ 28/Aug/12 ]

I have uploaded lu-1797-1538.txt.gz to FTP. That should be an lctl dk with all the calls included.
Associated syslog errors:

Aug 28 15:35:40 ehyperion-rst6 kernel: Lustre: 3942:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346193234/real 1346193234]  req@ffff880014297400 x1411378058874282/t0(0) o5->lustre-OST000d-osc-MDT0000@192.168.127.65@o2ib1:7/4 lens 432/432 e 0 to 1 dl 1346193340 ref 1 fl Rpc:RXN/0/ffffffff rc 0/-1
Aug 28 15:35:40 ehyperion-rst6 kernel: Lustre: lustre-OST000d-osc-MDT0000: Connection to lustre-OST000d (at 192.168.127.65@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
Aug 28 15:35:40 ehyperion-rst6 kernel: Lustre: lustre-OST000d-osc-MDT0000: Connection restored to lustre-OST000d (at 192.168.127.65@o2ib1)
Aug 28 15:35:40 ehyperion-rst6 kernel: Lustre: Skipped 1 previous similar message
Aug 28 15:35:40 ehyperion-rst6 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST000d_UUID now active, resetting orphans
Aug 28 15:35:51 ehyperion-rst6 kernel: LustreError: 3942:0:(osc_create.c:169:osc_interpret_create()) @@@ Unknown rc -107 from async create: failing oscc  req@ffff880014297400 x1411378058874282/t0(0) o5->lustre-OST000d-osc-MDT0000@192.168.127.65@o2ib1:7/4 lens 432/432 e 0 to 1 dl 1346193340 ref 1 fl Interpret:RXN/0/ffffffff rc -107/-1
Aug 28 15:36:30 ehyperion-rst6 kernel: LustreError: 15560:0:(lov_obd.c:1063:lov_clear_orphans()) error in orphan recovery on OST idx 13/60: rc = -5
Aug 28 15:36:30 ehyperion-rst6 kernel: LustreError: 15560:0:(mds_lov.c:883:__mds_lov_synchronize()) lustre-OST000d_UUID failed at mds_lov_clear_orphans: -5
Aug 28 15:36:31 ehyperion-rst6 kernel: LustreError: 15560:0:(mds_lov.c:903:__mds_lov_synchronize()) lustre-OST000d_UUID sync failed -5, deactivating
Aug 28 15:38:20 ehyperion-rst6 kernel: cannot allocate a tage (1)
Aug 28 15:38:20 ehyperion-rst6 kernel: cannot allocate a tage (4)
Aug 28 15:39:06 ehyperion-rst6 kernel: Lustre: 3942:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1346193440/real 1346193440]  req@ffff8801f8cca800 x1411378059090858/t0(0) o400->lustre-OST000d-osc-MDT0000@192.168.127.65@o2ib1:28/4 lens 224/224 e 0 to 1 dl 1346193546 ref 1 fl Rpc:RXN/0/ffffffff rc 0/-1
Aug 28 15:39:06 ehyperion-rst6 kernel: Lustre: lustre-OST000d-osc-MDT0000: Connection to lustre-OST000d (at 192.168.127.65@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
Aug 28 15:39:06 ehyperion-rst6 kernel: Lustre: lustre-OST000d-osc-MDT0000: Connection restored to lustre-OST000d (at 192.168.127.65@o2ib1)
Aug 28 15:39:06 ehyperion-rst6 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST000d_UUID now active, resetting orphans

Comment by Oleg Drokin [ 28/Aug/12 ]

The latest logs shed some light on what's going on.

The OST becomes very overloaded and creates start to take a long time (due to other disk activity, I guess).
The MDT sends a precreate request that eventually times out.
The MDT reconnects (which goes pretty fast, as long as we don't need to write anything to disk synchronously or read anything that isn't cached).
After the reconnect the MDT tries to repopulate its precreate pool and sends a precreate request again; there is a hard timeout for this activity at 1/2 obd_timeout (not converted to AT).
Once that timeout passes, the mds_lov portion of the code gets an error and decides that the OST is in too bad a shape to use, so it disables it (which might not be such a bad idea if the OST really is overloaded).

Later on, in the original set of logs for case #1, we see that a ping times out for some reason (network packet loss or something), which triggers a reconnect and clears the inactive flag.

So in the end I imagine that for the "fix" we need to schedule a later reconnect (or even precreate) attempt for such busy OSTs and, once the load eases, enable them back. Some of the messages would need to be silenced as well.
This code is redone in 2.4, so it needs to be inspected separately. Create-on-write would also fix this by eliminating precreates.
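
For illustration, the window involved can be inspected on the MDS with the tunable below, assuming the hard limit really is obd_timeout / 2 as described above (with the default obd_timeout of 100s that would be roughly 50s for the orphan-cleanup/precreate resync). Widening obd_timeout would only be a band-aid, not the rescheduling fix sketched above:

# current obd_timeout in seconds
lctl get_param timeout
# temporarily raise it on this node only (cluster-wide it would normally be set via lctl conf_param on the MGS)
lctl set_param timeout=300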

Comment by Peter Jones [ 30/Aug/12 ]

Dropping priority as this appears to be a long-existing issue

Comment by Cliff White (Inactive) [ 30/Aug/12 ]

Could not replicate with 1-3 OSSes and 2-24 OSTs. Returned to normal testing; in between parallel-scale runs, one OST dropped. Obtained an lctl dk from the OST, but I suspect it was too old.
However, this time both client and OSS errors were observed:
Client:

LustreError: 26333:0:(mdc_request.c:1429:mdc_quotactl()) ptlrpc_queue_wait failed, rc: -114
OSS

192.168.127.63: Lustre: DEBUG MARKER: == parallel-scale test iorfpp: iorfpp == 18:30:49 (1346376649)
192.168.127.63: LustreError: 9253:0:(ldlm_resource.c:1101:ldlm_resource_get()) lvbo_init failed for resource 165782: rc -2
192.168.127.63: LustreError: 9175:0:(ldlm_resource.c:1101:ldlm_resource_get()) lvbo_init failed for resource 165740: rc -2
192.168.127.63: LustreError: 9175:0:(ldlm_resource.c:1101:ldlm_resource_get()) Skipped 33 previous similar messages
192.168.127.63: LustreError: 9253:0:(ldlm_resource.c:1101:ldlm_resource_get()) Skipped 5 previous similar messages
192.168.127.63: Lustre: DEBUG MARKER: == parallel-scale parallel-scale.sh test complete, duration 6104 sec == 18:59:05 (1346378345)
192.168.127.63: Lustre: lustre-OST0033: Client lustre-MDT0000-mdtlov_UUID (at 192.168.127.6@o2ib1) reconnecting
192.168.127.63: Lustre: lustre-OST0033: received MDS connection from 192.168.127.6@o2ib1
192.168.127.63: Lustre: Skipped 1 previous similar message
192.168.127.63: Lustre: 9424:0:(llog_net.c:162:llog_receptor_accept()) changing the import ffff8802cc298800 - ffff8801c4c4a000
192.168.127.63: Lustre: 9424:0:(llog_net.c:162:llog_receptor_accept()) changing the import ffff8802cc298800 - ffff8801c4c4a000

Comment by Cliff White (Inactive) [ 30/Sep/12 ]

We are hitting this bug again with the lustre-review 9573 build

hyperion-rst6 login: Lustre: 4171:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1349040375/real 1349040375]  req@ffff88020629a000 x1414547145605102/t0(0) o5->lustre-OST0034-osc-MDT0000@192.168.127.64@o2ib1:7/4 lens 432/432 e 0 to 1 dl 1349040481 ref 1 fl Rpc:RXN/0/ffffffff rc 0/-1
Lustre: 4171:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 85 previous similar messages
Lustre: lustre-OST0034-osc-MDT0000: Connection to lustre-OST0034 (at 192.168.127.64@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
Lustre: Skipped 6 previous similar messages
Lustre: lustre-OST0034-osc-MDT0000: Connection restored to lustre-OST0034 (at 192.168.127.64@o2ib1)
Lustre: Skipped 1 previous similar message
Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0034_UUID now active, resetting orphans
Lustre: Skipped 19 previous similar messages
LustreError: 4171:0:(osc_create.c:169:osc_interpret_create()) @@@ Unknown rc -107 from async create: failing oscc  req@ffff88020629a000 x1414547145605102/t0(0) o5->lustre-OST0034-osc-MDT0000@192.168.127.64@o2ib1:7/4 lens 432/432 e 0 to 1 dl 1349040481 ref 1 fl Interpret:RXN/0/ffffffff rc -107/-1
LustreError: 7487:0:(lov_obd.c:1063:lov_clear_orphans()) error in orphan recovery on OST idx 52/60: rc = -5
LustreError: 7487:0:(mds_lov.c:883:__mds_lov_synchronize()) lustre-OST0034_UUID failed at mds_lov_clear_orphans: -5
LustreError: 7487:0:(mds_lov.c:903:__mds_lov_synchronize()) lustre-OST0034_UUID sync failed -5, deactivating
Comment by Alexandre Louvet [ 31/Jan/13 ]

One of our systems is showing this behaviour with a high reproduction rate: about one disconnection every 5 minutes for the past 2 days, and I fail to find any evidence of high load on the OSSes or the MDS.

The filesystem is made of 540 OSTs equally distributed across 36 OSSes. The disconnected OSTs are not concentrated in any particular location. The interconnect between nodes and servers doesn't report problems, and there aren't any messages in the logs other than the ones reported in this LU.

The pattern is always the same: one request expires for slow reply after obd_timeout / 2 (i.e. 200s for me), then a reconnect fails during orphan recovery. If I force a recover it works like a charm, even when the forced recover is issued immediately after the deactivation. That is surprising: I can accept that something went wrong at some point, and that obd_timeout / 2 seconds later it was still wrong and the reconnection failed, but I have trouble convincing myself that I'm lucky enough for the problem to disappear one second later when I issue a forced recover. Am I wrong?

The site is running Lustre 2.1.3 + patches. We upgraded the patch list last week to add LU-1650 + LU-1039 + LU-2170, but we have another cluster running the same software which doesn't have this problem.
Looking back, I have also found a trace of the same problem 3 months ago, so I tend to think that those patches are not responsible for this problem. At that time, it seems the cluster was rebooted and the problem disappeared.

The site is classified, so exporting data is not that easy. What can I check to try to make progress?
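
For what it's worth, a minimal sketch of what could be checked from the MDS while waiting for the next occurrence (parameter names as exposed by lctl on 2.x; target names and NN are placeholders):

# which OST imports the MDS currently considers inactive (0 = deactivated)
lctl get_param osc.*.active
# connection state and history for a suspect import
lctl get_param osc.lustre-OST*.import
# and, as above, force recovery of a deactivated import by device index
lctl dl | grep osc
lctl --device NN recover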

Thanks,

Comment by Alexandre Louvet [ 08/Feb/13 ]

While I wasn't able to find any evidence of the OSTs being overloaded, and after having lost the connection to over 400 OSTs, we rebooted everything. Now everything is back to normal.
The only particularity is that we made a filesystem extension just before moving to production. I don't know if it makes any difference; it's just a fact.

Comment by Oleg Drokin [ 21/Feb/13 ]

Well, I suspect the filesystem extension hints that you had a filesystem that was mostly full (inode-wise), so in order to satisfy precreates a lot of inode groups had to be traversed, which took a lot of time since they were not cached.
An immediate reconnect then succeeds because the same traversal is now faster, as all those blocks have already been read and are in cache (but they would be thrown out again later under load).

Extending the FS added some nearly empty inode groups where a single block read is enough to satisfy an entire precreate request, which completes a lot faster.

You can probably confirm this by looking at the debugfs output; I suspect you would observe inode groups very sparsely populated with free inodes right up to about the pre-extension point.
A df -i taken pre-extension would probably have shown those OSTs as very low on free inodes too?
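
A rough sketch of how that could be checked against an OST's backing ldiskfs device (dumpe2fs is shown as an alternative to debugfs; the device path is a placeholder, and the OST should ideally be quiescent):

# per-group free-inode counts; counts near zero up to the pre-extension point would support this theory
dumpe2fs /dev/mapper/ost0000 2>/dev/null | grep -E '^Group|Free inodes'
# overall inode usage per OST, as seen from a client
lfs df -i /mnt/lustre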

Comment by Andreas Dilger [ 09/Jan/20 ]

Close old bug
