Details
Bug
Resolution: Fixed
Major
Lustre 2.10.3
Description
I have found a weird problem on our Lustre system when we try to move a file from a different file system (here /tmp) onto the Lustre file server. The problem only affects mv; a cp works fine. The mv hangs forever and the process cannot be killed. When I ran strace on the mv, it showed the program hanging in fchown.
strace mv /tmp/simon.small.txt /mnt/lustre/projects/pMOSP/simon
<stuff>
write(4, "1\n", 2) = 2
read(3, "", 4194304) = 0
utimensat(4, NULL, [{1530777797, 478293939}, {1530777797, 478293939}], 0) = 0
fchown(4, 10001, 10025
If you look at dmesg, you see these errors start appearing at the same time. They don't stop, as we can't kill the 'mv' process:
[Thu Jul 5 18:08:43 2018] Lustre: lustre-MDT0000-mdc-ffff88351771f000: Connection restored to 172.16.231.50@o2ib (at 172.16.231.50@o2ib)
[Thu Jul 5 18:08:43 2018] Lustre: Skipped 140105 previous similar messages
[Thu Jul 5 18:09:47 2018] Lustre: lustre-MDT0000-mdc-ffff88351771f000: Connection to lustre-MDT0000 (at 172.16.231.50@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[Thu Jul 5 18:09:47 2018] Lustre: Skipped 285517 previous similar messages
[Thu Jul 5 18:09:47 2018] Lustre: lustre-MDT0000-mdc-ffff88351771f000: Connection restored to 172.16.231.50@o2ib (at 172.16.231.50@o2ib)
[Thu Jul 5 18:09:47 2018] Lustre: Skipped 285516 previous similar messages
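To get a rough feel for how fast the client is cycling between "lost" and "restored", the dmesg lines can be tallied with a few lines of Python. This is only a sketch: the function name summarize and the regular expressions are my own and are matched against the message wording quoted above, not against any official Lustre log format.

```python
import re

# Patterns written against the message text quoted in this ticket (an assumption,
# not a documented Lustre log format).
SKIPPED = re.compile(r"Skipped (\d+) previous similar messages")
LOST = re.compile(r"Connection to \S+ \(at \S+\) was lost")
RESTORED = re.compile(r"Connection restored to \S+")


def summarize(lines):
    """Count lost/restored messages and sum the rate-limited 'Skipped N' counters."""
    counts = {"lost": 0, "restored": 0, "skipped": 0}
    for line in lines:
        skipped = SKIPPED.search(line)
        if skipped:
            counts["skipped"] += int(skipped.group(1))
        elif LOST.search(line):
            counts["lost"] += 1
        elif RESTORED.search(line):
            counts["restored"] += 1
    return counts
```

It could be fed the saved output of dmesg, for example summarize(open("dmesg.txt")), where dmesg.txt is a hypothetical capture file. The large "skipped" totals between two timestamps about a minute apart suggest the connection is flapping continuously rather than failing once.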
We have the following OFED drivers, which I believe have a known problem with connecting to Lustre servers:
ofed_info | head -1
MLNX_OFED_LINUX-4.2-1.2.0.0 (OFED-4.2-1.2.0):
Could you run your reproducer and wait until the client blocks in fchown(), then run the attached script stack1 on the MDT (to collect and consolidate the kernel stack traces) and attach the output here?
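If it helps to narrow things down, the syscall sequence from the strace can be replayed without mv. A cross-filesystem mv cannot use rename(), so it copies the data and then restores timestamps and ownership, which is why it reaches fchown() while a plain cp does not. The sketch below is an assumption-laden stand-in, not what mv does internally line for line: the function name copy_like_mv is made up, it does not remove the source file, and it prints a marker before each step so the blocking call is obvious.

```python
import os


def copy_like_mv(src, dst, uid=None, gid=None):
    """Replay the copy, utimensat and fchown steps seen in the strace."""
    st = os.stat(src)
    # Default to the source file's owner and group, as a preserving copy would.
    uid = st.st_uid if uid is None else uid
    gid = st.st_gid if gid is None else gid
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(4 * 1024 * 1024)  # 4194304-byte reads, as in the strace
            if not chunk:
                break
            fout.write(chunk)
        fout.flush()  # push data out before touching timestamps
        fd = fout.fileno()
        print("copy done, setting timestamps", flush=True)
        os.utime(fd, ns=(st.st_atime_ns, st.st_mtime_ns))
        print("timestamps set, calling fchown", flush=True)
        os.fchown(fd, uid, gid)  # the step that blocks in this report
        print("fchown returned", flush=True)
    return dst
```

With the paths from the description this would be called as copy_like_mv("/tmp/simon.small.txt", "/mnt/lustre/projects/pMOSP/simon/simon.small.txt"). If the last line printed is "timestamps set, calling fchown", the client is blocked at the same point as the original mv and the stack traces can be collected on the MDT.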