Details
- Bug
- Resolution: Cannot Reproduce
- Critical
- None
- None
- 3
- 12530
Description
Some users have reported to us that the "rm" command is taking a long time. Some investigation revealed that at least the first "rm" in a directory takes just over 100 seconds, which of course sounds like OBD_TIMEOUT_DEFAULT.
This isn't necessarily the simplest reproducer, but the following one reproduces the delay every time:
- set the directory's default stripe count to 48
- touch a file on client A
- rm the file on client B
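The last step of the reproducer can be timed with a small helper. This is only a sketch: the function names, the 100-second threshold constant, and the example path are illustrative assumptions and are not taken from the ticket or from Lustre itself. It assumes the first two steps were already done by hand (for example, `lfs setstripe -c 48 <dir>` on the directory, then `touch` on client A).

```python
import os
import time

# Assumption: roughly the delay reported above; this value is hard-coded here,
# not read from any Lustre tunable.
SUSPECT_DELAY_SECONDS = 100.0


def timed_unlink(path):
    """Remove `path` and return the wall-clock seconds the unlink took."""
    start = time.monotonic()
    os.unlink(path)
    return time.monotonic() - start


def looks_like_timeout(elapsed, threshold=SUSPECT_DELAY_SECONDS):
    """Return True when the unlink took at least as long as the suspected timeout."""
    return elapsed >= threshold
```

On client B this would be called as `timed_unlink("/mnt/lustre/testdir/file")`, where the path is hypothetical; an elapsed time just over 100 seconds would match the behaviour described here.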
The clients are running 2.4.0-19chaos; the servers are at 2.4.0-21chaos and use ZFS as the backend.
I have some Lustre logs that I will share and discuss in additional posts to this ticket. Essentially, it looks like the server always times out on an AST to client A, which explains the 100 second delay. It is not yet clear to me why that happens, because client A appears to be completely responsive. My current suspicion is that the MDT is to blame.
Issue Links
- duplicates
  - LU-4963 client eviction during IOR test - lock callback timer expired (Closed)
- is related to
  - LU-5525 ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed (Resolved)
  - LU-5632 ldlm_lock_addref()) ASSERTION( lock != ((void *)0) ) (Resolved)
  - LU-5686 (mdt_handler.c:3203:mdt_intent_lock_replace()) ASSERTION( lustre_msg_get_flags(req->rq_reqmsg) & 0x0002 ) failed (Resolved)
- is related to
  - LU-2827 mdt_intent_fixup_resent() cannot find the proper lock in hash (Resolved)
For LU-2827, we have:
- LU-2827 itself: http://review.whamcloud.com/5978 and http://review.whamcloud.com/#/c/10378
- LU-5266: http://review.whamcloud.com/10903
- LU-5496: http://review.whamcloud.com/11469 and http://review.whamcloud.com/#/c/11644
- LU-5579: http://review.whamcloud.com/#/c/11839/
- and the final nail in this bug, LU-5530: http://review.whamcloud.com/#/c/11841/1
Some of those patches do not cherry-pick cleanly into b2_5, but I know James Simmons from ORNL ported them. All of this is now in testing, after which they will hopefully publish their tree with the backports.
Another easy way to get rid of all of these problems (at least of them occurring this frequently, since a resend might also happen if a reply from the server was genuinely lost on the network) is to drop the LU-3338 patch, which is not part of standard b2_5 (it only landed in b2_6).