[LU-11119] A 'mv' of a file from a local file system to a lustre file system hangs - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.10.3
Labels:
None
Environment:

Hide
uname -a
Linux monarch-login2 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)

lctl lustre_build_version
Lustre version: 2.10.3

Show
uname -a Linux monarch-login2 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux cat /etc/redhat-release CentOS Linux release 7.4.1708 (Core) lctl lustre_build_version Lustre version: 2.10.3

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

I have found a weird problem on our Lustre system when we try to move a file from a different file system (here /tmp) onto the lustre file server. This problem only affects a mv. A cp works ok. The problem is that the 'mv' hangs forever, and the process can not be a killed WHen I did a strace on the mv, the program hangs on fchown.

strace mv /tmp/simon.small.txt  /mnt/lustre/projects/pMOSP/simon
<stuff>
write(4, "1\n", 2)                      = 2
read(3, "", 4194304)                    = 0
utimensat(4, NULL, [{1530777797, 478293939}, {1530777797, 478293939}], 0) = 0
fchown(4, 10001, 10025 

If you look at demsg, you see these multiple errors start appearing at the same time:
The errors don't stop as we can't kill the 'mv' process

Thu Jul  5 18:08:43 2018] Lustre: lustre-MDT0000-mdc-ffff88351771f000: Connection restored to 172.16.231.50@o2ib (at 172.16.231.50@o2ib)
[Thu Jul  5 18:08:43 2018] Lustre: Skipped 140105 previous similar messages
[Thu Jul  5 18:09:47 2018] Lustre: lustre-MDT0000-mdc-ffff88351771f000: Connection to lustre-MDT0000 (at 172.16.231.50@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[Thu Jul  5 18:09:47 2018] Lustre: Skipped 285517 previous similar messages
[Thu Jul  5 18:09:47 2018] Lustre: lustre-MDT0000-mdc-ffff88351771f000: Connection restored to 172.16.231.50@o2ib (at 172.16.231.50@o2ib)
[Thu Jul  5 18:09:47 2018] Lustre: Skipped 285516 previous similar messages

We have the following ofed drivers, which I believe have a known problem with connecting to Lustre servers

ofed_info | head -1
MLNX_OFED_LINUX-4.2-1.2.0.0 (OFED-4.2-1.2.0):

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

chgrp-dk-wed18july.out
3.44 MB
18/Jul/18 4:19 AM
chgrp-stack1-wed18July.out
15 kB
18/Jul/18 4:19 AM
client-chgrp-dk.4aug.out
7.37 MB
03/Aug/18 7:20 AM
client-chgrp-dk-2Aug.out
15.78 MB
02/Aug/18 8:03 AM
client-chgrp-stack1.4aug.out
15 kB
03/Aug/18 7:19 AM
dmesg.MDS.4.47.6july.txt
1.10 MB
06/Jul/18 7:09 AM
dmesg.txt
6 kB
05/Jul/18 8:17 AM
l_getidentity
234 kB
16/Jul/18 7:17 AM
mdt-chgrp-dk.4Aug.out
22.50 MB
03/Aug/18 7:20 AM
mdt-chgrp-dk-2Aug.out
20.26 MB
02/Aug/18 8:04 AM
mdt-chgrp-stack1.4Aug.out
24 kB
03/Aug/18 7:19 AM
output.Tue.17.july.18.txt
24 kB
17/Jul/18 7:19 AM
stack1
1 kB
10/Jul/18 4:11 PM
strace.output.txt
14 kB
05/Jul/18 8:16 AM

Issue Links

is related to

LU-11236 client MDT OST ENOTCONN loops

Open

is related to

LU-10310 Change l_getidentity error level from NOTICE to WARNING

Resolved

LU-11227 client process hangs when lod_sync accesses deactivated OSTs

Resolved

Activity

People

Assignee:: John Hammond

Reporter:: Monash HPC

Votes:: 0 Vote for this issue

Watchers:: 9 Start watching this issue

Dates

Created:: 05/Jul/18 8:22 AM

Updated:: 16/Nov/20 2:26 PM

Resolved:: 16/Nov/20 2:26 PM