Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Critical
- Fix Version/s: None
- Affects Version/s: None
- 3
- 12530
Description
Some users have reported to us that the "rm" command is taking a long time. Some investigation revealed that at least the first "rm" in a directory takes just over 100 seconds, which of course sounds like OBD_TIMEOUT_DEFAULT.
This isn't necessarily the simplest reproducer, but the following reproducer is completely consistent:
- set directory striping default count to 48
- touch a file on client A
- rm file on client B
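The reproducer above can be sketched as a shell session. This is a hedged illustration only: the mount point `/mnt/lustre`, directory name `testdir`, and file name `foo` are assumptions, not from the original report; the `lfs setstripe -c 48` invocation is the standard way to set a directory's default stripe count.

```shell
# Assumed mount point and names for illustration only.
# On any client: set the directory's default stripe count to 48.
lfs setstripe -c 48 /mnt/lustre/testdir

# On client A: create a file in that directory.
touch /mnt/lustre/testdir/foo

# On client B: remove it. Per the report, the first rm in the
# directory stalls for just over 100 seconds (OBD_TIMEOUT_DEFAULT).
time rm /mnt/lustre/testdir/foo
```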
The clients are running 2.4.0-19chaos, servers are at 2.4.0-21chaos. The servers are using zfs as the backend.
I have some Lustre logs that I will share and discuss in additional posts to this ticket. But essentially it looks like the server always times out on an AST to client A (explaining the 100-second delay). It is not yet clear to me why that happens, because client A appears to be completely responsive. My current suspicion is that the MDT is to blame.
Issue Links
- duplicates
  - LU-4963 client eviction during IOR test - lock callback timer expired (Closed)
- is related to
  - LU-5525 ASSERTION( new_lock->l_readers + new_lock->l_writers == 0 ) failed (Resolved)
  - LU-5632 ldlm_lock_addref()) ASSERTION( lock != ((void *)0) ) (Resolved)
  - LU-5686 (mdt_handler.c:3203:mdt_intent_lock_replace()) ASSERTION( lustre_msg_get_flags(req->rq_reqmsg) & 0x0002 ) failed (Resolved)
- is related to
  - LU-2827 mdt_intent_fixup_resent() cannot find the proper lock in hash (Resolved)
Actually, our Cray clients don't have the LU-3338 patch, and the problems of LU-5530 still showed up, so these patches are still needed. We plan to do a full-scale test shot Tuesday next week with all the above patches plus a few others. If it goes well, we will leave this system running 2.5 clients and 2.5 servers. We still need this work because another file system running at the lab will be stuck at 2.4 for the time being, but some of our clients will be moving to 2.5, so these problems will show up.