Details
Bug
Resolution: Fixed
Major
None
Lustre 2.10.3
None
3
Description
I have found a weird problem on our Lustre system when we try to move a file from a different file system (here /tmp) onto the Lustre file server. The problem only affects mv; a cp works fine. The mv hangs forever, and the process cannot be killed. When I ran strace on the mv, it showed the program hanging in fchown.
strace mv /tmp/simon.small.txt /mnt/lustre/projects/pMOSP/simon
<stuff>
write(4, "1\n", 2) = 2
read(3, "", 4194304) = 0
utimensat(4, NULL, [{1530777797, 478293939}, {1530777797, 478293939}], 0) = 0
fchown(4, 10001, 10025

If you look at dmesg, you see these errors start appearing at the same time. The errors don't stop, as we can't kill the 'mv' process:

[Thu Jul 5 18:08:43 2018] Lustre: lustre-MDT0000-mdc-ffff88351771f000: Connection restored to 172.16.231.50@o2ib (at 172.16.231.50@o2ib)
[Thu Jul 5 18:08:43 2018] Lustre: Skipped 140105 previous similar messages
[Thu Jul 5 18:09:47 2018] Lustre: lustre-MDT0000-mdc-ffff88351771f000: Connection to lustre-MDT0000 (at 172.16.231.50@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[Thu Jul 5 18:09:47 2018] Lustre: Skipped 285517 previous similar messages
[Thu Jul 5 18:09:47 2018] Lustre: lustre-MDT0000-mdc-ffff88351771f000: Connection restored to 172.16.231.50@o2ib (at 172.16.231.50@o2ib)
[Thu Jul 5 18:09:47 2018] Lustre: Skipped 285516 previous similar messages
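For anyone who wants to reproduce this without mv, the syscall sequence in the strace output above (copy the data, set the timestamps, then fchown the destination) can be approximated with a short Python sketch. This is only an illustration of what a cross-file-system mv appears to do, not the coreutils implementation; the function name move_like_mv is hypothetical, and the 4 MiB chunk size is taken from the read() call in the trace.

```python
import os


def move_like_mv(src, dst):
    """Approximate a cross-file-system mv: copy, preserve times/owner, unlink.

    Hypothetical helper for reproducing the hang; not the real mv code.
    """
    st = os.stat(src)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # Copy the file contents in 4 MiB chunks (matches read(3, ..., 4194304)).
        while True:
            chunk = fin.read(4 * 1024 * 1024)
            if not chunk:
                break
            fout.write(chunk)
        fout.flush()
        # Preserve access/modification times (the utimensat step in the trace).
        os.utime(fout.fileno(), ns=(st.st_atime_ns, st.st_mtime_ns))
        # Preserve ownership; this is the call that never returns in the trace.
        os.fchown(fout.fileno(), st.st_uid, st.st_gid)
    # Only remove the source once the destination is complete.
    os.unlink(src)
```

If the problem is what the trace suggests, calling move_like_mv("/tmp/simon.small.txt", "/mnt/lustre/projects/pMOSP/simon/simon.small.txt") should block at the os.fchown line, whereas a plain cp (without -p) does not try to change ownership, which would explain why it completes.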
We have the following OFED drivers, which I believe have a known problem when connecting to Lustre servers:
ofed_info | head -1
MLNX_OFED_LINUX-4.2-1.2.0.0 (OFED-4.2-1.2.0):