[LU-11119] A 'mv' of a file from a local file system to a lustre file system hangs Created: 05/Jul/18  Updated: 16/Nov/20  Resolved: 16/Nov/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Monash HPC Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: None
Environment:

uname -a
Linux monarch-login2 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)

lctl lustre_build_version
Lustre version: 2.10.3


Attachments: File chgrp-dk-wed18july.out     File chgrp-stack1-wed18July.out     File client-chgrp-dk-2Aug.out     File client-chgrp-dk.4aug.out     File client-chgrp-stack1.4aug.out     Text File dmesg.MDS.4.47.6july.txt     Text File dmesg.txt     HTML File l_getidentity     File mdt-chgrp-dk-2Aug.out     File mdt-chgrp-dk.4Aug.out     File mdt-chgrp-stack1.4Aug.out     Text File output.Tue.17.july.18.txt     HTML File stack1     Text File strace.output.txt    
Issue Links:
Related
is related to LU-10310 Change l_getidentity error level from... Resolved
is related to LU-11227 client process hangs when lod_sync ac... Resolved
is related to LU-11236 client MDT OST ENOTCONN loops Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I have found a weird problem on our Lustre system when we try to move a file from a different file system (here /tmp) onto the Lustre file system. This problem only affects a 'mv'; a 'cp' works OK. The problem is that the 'mv' hangs forever, and the process cannot be killed. When I ran strace on the mv, the program hung on fchown.

strace mv /tmp/simon.small.txt  /mnt/lustre/projects/pMOSP/simon
<stuff>
write(4, "1\n", 2)                      = 2
read(3, "", 4194304)                    = 0
utimensat(4, NULL, [{1530777797, 478293939}, {1530777797, 478293939}], 0) = 0
fchown(4, 10001, 10025 

If you look at dmesg, you see these errors start appearing repeatedly at the same time.
The errors don't stop because we can't kill the 'mv' process:

[Thu Jul  5 18:08:43 2018] Lustre: lustre-MDT0000-mdc-ffff88351771f000: Connection restored to 172.16.231.50@o2ib (at 172.16.231.50@o2ib)
[Thu Jul  5 18:08:43 2018] Lustre: Skipped 140105 previous similar messages
[Thu Jul  5 18:09:47 2018] Lustre: lustre-MDT0000-mdc-ffff88351771f000: Connection to lustre-MDT0000 (at 172.16.231.50@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[Thu Jul  5 18:09:47 2018] Lustre: Skipped 285517 previous similar messages
[Thu Jul  5 18:09:47 2018] Lustre: lustre-MDT0000-mdc-ffff88351771f000: Connection restored to 172.16.231.50@o2ib (at 172.16.231.50@o2ib)
[Thu Jul  5 18:09:47 2018] Lustre: Skipped 285516 previous similar messages

We have the following OFED drivers, which I believe have a known problem connecting to Lustre servers:

ofed_info | head -1
MLNX_OFED_LINUX-4.2-1.2.0.0 (OFED-4.2-1.2.0):


 Comments   
Comment by Nathaniel Clark [ 05/Jul/18 ]

Lustre said it lost its connection to the MDT and needs to wait for recovery to complete. Can you attach logs from the MDS?

Comment by John Hammond [ 05/Jul/18 ]

Hi,

Does 'mv' consistently hang in fchown()? Is the 'mv' command being run by root, by the owner of the file, or by some other user? From the strace output it appears to be by the file owner. Could you post the output of the 'id' command (in the setup used to reproduce)? I noticed that the destination directory is marked SGID. Does the hang still occur if the destination directory is not marked SGID?

Does the MDS use the same Lustre and kernel versions?

Comment by Monash HPC [ 06/Jul/18 ]

John/Nathaniel,
(1) 'mv' works when run as root
(2) 'mv' always seems to hang at

utimensat(4, NULL, [{1530778053, 0}, {1530777797, 0}], 0) = 0
fchown(4, 10001, 10060
(3) I went to the MDS and grabbed the logs below. I started the 'mv' after 4:41, so the errors you see should begin at "LustreError: 181962:0:(lod_dev.c:1414:lod_sync()) lustre-MDT0000-mdtlov: can't sync ost 12: -107
[Fri Jul 6 16:51:04 2018] LustreError: 181962:0:(lod_dev.c:1414:lod_sync()) Skipped 8274044 previous similar messages
[Fri Jul 6 16:51:04 2018] Lustre: lustre-MDT0000: Client 66471b3c-6a3e-724d-5030-ee8252fcfcd2 (at 172.16.230.87@o2ib) reconnecting
[Fri Jul 6 16:51:04 2018] Lustre: Skipped 5378430 previous similar messages
[Fri Jul 6 16:51:04 2018] Lustre: lustre-MDT0000: Connection restored to 8e150630-2b04-00b6-c100-b58229b19cac (at 172.16.230.87@o2ib)
[Fri Jul 6 16:51:04 2018] Lustre: Skipped 5377677 previous similar messages
"
(4) The id of the user is
id smichnow
uid=10001(smichnow) gid=10025(monashuniversity) groups=10025(monashuniversity),10150(systems),10052(monarch),10054(pMona0002),10057(pMona0005),10060(pMOSP),10077(pMona0006),10086(pMeRC),10130(gurobi),10151(gaussianmonash),10167(M3EarlyAdopters),10184(mc01),10206(wb82),10218(nq46),10290(avizo),10295(comsol-civil),10399(cplex),10567(gauss),10610(icm)

(5) The MDS Server is
[root@rclmddc1r14-02-e1 ~]# uname -a
Linux rclmddc1r14-02-e1.erc.monash.edu.au 3.10.0-693.17.1.el7_lustre.x86_64 #1 SMP Mon Feb 26 13:33:51 AEDT 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@rclmddc1r14-02-e1 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)

(6) The client is
uname -a
Linux monarch-login2 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

(7) I removed the setgid bit
ls -ld /mnt/lustre/projects/pMOSP/simon/
drwxrwsr-x 8 smichnow pMOSP 4096 Jul 6 16:47 /mnt/lustre/projects/pMOSP/simon/
[root@monarch-login2 etc]# chmod g-s /mnt/lustre/projects/pMOSP/simon/
[root@monarch-login2 etc]# ls -ld /mnt/lustre/projects/pMOSP/simon/
drwxrwxr-x 8 smichnow pMOSP 4096 Jul 6 16:47 /mnt/lustre/projects/pMOSP/simon/
but the 'mv' still broke when run as user smichnow
thanks
Simon

Comment by Monash HPC [ 06/Jul/18 ]

PS I added the MDS dmesg output as an attachment

Comment by Monash HPC [ 06/Jul/18 ]

Also, I noticed that a 'cp -p' fails like the mv:
cp simon.small.txt /mnt/lustre/projects/pMOSP/simon/simon.small.txt.1
works, but
cp -p simon.small.txt /mnt/lustre/projects/pMOSP/simon/simon.small.txt.2
fails, i.e. it hangs.
Drilling in further, a chown works but a chgrp fails (when run as the user), i.e.
chown smichnow /mnt/lustre/projects/pMOSP/simon/simon.small.txt.3
is OK, but
chgrp pMOSP /mnt/lustre/projects/pMOSP/simon/simon.small.txt.3
fails, i.e. it hangs.

regards
Simon

Comment by John Hammond [ 06/Jul/18 ]

Hi Simon,

I suspect your identity upcall is not finding the supplementary group info it needs for the chown(). Could you run id smichnow on the MDT and post the output? How do you manage the user and groups databases? Do you use /etc/passwd and /etc/group, LDAP, AD, or other? Are the user and group databases on the MDT kept in sync with the rest of the cluster?

The identity upcall mechanism and configuration are described at http://doc.lustre.org/lustre_manual.xhtml#lustreprogramminginterfaces
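(As a concrete sketch of how to exercise the upcall on the MDS, assuming the default l_getidentity setup and the lustre-MDT0000 target used here; the identity_flush step is an assumption taken from the manual, not something verified on this system:)

# Check which upcall binary the MDT is configured to use
lctl get_param mdt.lustre-MDT0000.identity_upcall

# Ask the upcall directly for a uid's supplementary groups
# (10001 is the uid of smichnow from this ticket)
/usr/sbin/l_getidentity -d 10001

# Drop cached identity entries so the next access re-runs the upcall
lctl set_param mdt.lustre-MDT0000.identity_flush=-1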

Comment by Monash HPC [ 09/Jul/18 ]

Dear John,
our identity on all nodes is managed by a LDAP server. On the client (that hangs) I see
[smichnow@monarch-login2 tmp]$ id smichnow
uid=10001(smichnow) gid=10025(monashuniversity) groups=10025(monashuniversity),10150(systems),10052(monarch),10054(pMona0002),10057(pMona0005),10060(pMOSP),10077(pMona0006),10086(pMeRC),10130(gurobi),10151(gaussianmonash),10167(M3EarlyAdopters),10184(mc01),10206(wb82),10218(nq46),10290(avizo),10295(comsol-civil),10399(cplex),10567(gauss),10610(icm)

and on the metadata server
[root@rclmddc1r14-02-e1 lustre-MDT0000]# id smichnow
uid=10001(smichnow) gid=10025(monashuniversity) groups=10025(monashuniversity),10610(icm),10167(M3EarlyAdopters),10151(gaussianmonash),10295(comsol-civil),10054(pMona0002),10057(pMona0005),10077(pMona0006),10052(monarch),10150(systems),10130(gurobi),10290(avizo),10399(cplex),10060(pMOSP),10086(pMeRC),10184(mc01),10206(wb82),10567(gauss),10218(nq46)

which shows the same number of groups but in a different order.

[root@rclmddc1r14-02-e1 lustre-MDT0000]# lctl get_param mdt.lustre-MDT0000.identity_upcall
mdt.lustre-MDT0000.identity_upcall=/usr/sbin/l_getidentity

and

[root@rclmddc1r14-02-e1 lustre-MDT0000]# l_getidentity -d 10001
uid=10001 gid=10025,10052,10054,10057,10060,10077,10086,10130,10150,10151,10167,10184,10206,10218,10290,10295,10399,10567,10610
permissions:
nid perm

regards
Simon

Comment by John Hammond [ 10/Jul/18 ]

Hi Simon,

l_getidentity does generate some logging on error, but it uses the authpriv facility with a log level of warning. If needed, could you reconfigure syslog on the MDT to temporarily capture any logging generated by l_getidentity when you run your reproducer, and attach it here?

Second, could you run your reproducer and wait until the client blocks in fchown(), then run the attached script stack1 on the MDT (to collect and consolidate the kernel stack traces) and attach the output here?
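(The stack1 script itself is attached rather than reproduced in this ticket; as a rough sketch of the same idea, and purely an assumption about its approach, dumping the kernel stack of every thread on the node looks something like:)

#!/bin/sh
# Dump the kernel stack of every thread; when a request is stuck, the blocked
# threads show up with distinctive Lustre frames in their traces.
for task in /proc/[0-9]*/task/[0-9]*; do
    tid=${task##*/}
    comm=$(cat "$task/comm" 2>/dev/null)
    echo "== ${comm} (tid ${tid}) =="
    cat "$task/stack" 2>/dev/null
done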

Comment by Monash HPC [ 11/Jul/18 ]

John
I'm away for a few days, so I will run the script next week. Could you also tell me how to modify syslog to collect l_getidentity errors?

Thanks
Simon

Comment by John Hammond [ 12/Jul/18 ]

Hi Simon,

> Could you also tell me how to modify syslog to collect l_getidentity errors

Sure. Do you use rsyslog? If so then you can add a line of the following form to /etc/rsyslog.conf on the MDS:

authpriv.warning /var/log/l_getidentity

Then restart rsyslog, run your reproducer, and attach the contents of /var/log/l_getidentity. You should check that no site-sensitive content ended up in that file before you attach it.
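(As a sketch of that whole sequence on a RHEL/CentOS 7 MDS, assuming rsyslog is managed by systemd:)

# Route authpriv messages of warning severity and above to a dedicated file
echo 'authpriv.warning /var/log/l_getidentity' >> /etc/rsyslog.conf
systemctl restart rsyslog

# ... run the chgrp/mv reproducer on the client ...

# Review the capture for sensitive content before attaching it
less /var/log/l_getidentity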

Comment by Monash HPC [ 16/Jul/18 ]

Dear John
I enabled the logging, and reran the chgrp command that hung.
I could not see anything useful in the log file, but I have attached it here. Logging started today, so that part worked, but there is nothing relating to the issue.
regards
Simon

Comment by Monash HPC [ 16/Jul/18 ]

John
the dmesg errors from the node that hangs upon a chgrp are
[Mon Jul 16 17:14:54 2018] Lustre: lustre-MDT0000-mdc-ffff88351a566800: Connection to lustre-MDT0000 (at 172.16.231.50@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[Mon Jul 16 17:14:55 2018] Lustre: Skipped 1175070 previous similar messages
[Mon Jul 16 17:14:55 2018] Lustre: lustre-MDT0000-mdc-ffff88351a566800: Connection restored to 172.16.231.50@o2ib (at 172.16.231.50@o2ib)
[Mon Jul 16 17:14:55 2018] Lustre: Skipped 1175070 previous similar messages

regards
Simon

Comment by John Hammond [ 16/Jul/18 ]

Could you run your reproducer and wait until the client blocks in fchown(), then run the attached script stack1 on the MDT (to collect and consolidate the kernel stack traces) and attach the output here?

Comment by Monash HPC [ 17/Jul/18 ]

Dear John,
I ran the command and attached the file.
The client executed the command
strace chgrp pMOSP /mnt/lustre/projects/pMOSP/simon/simon.small.txt.3

which goes up to..

read(4, "\1\0\0\0\0\0\0\0L'\0\0\20\0\0\0pMOSP\0*\0smichnow"..., 136) = 136
newfstatat(AT_FDCWD, "/mnt/lustre/projects/pMOSP/simon/simon.small.txt.3", {st_mode=S_IFREG|0600, st_size=2, ...}, AT_SYMLINK_NOFOLLOW) = 0
fchownat(AT_FDCWD, "/mnt/lustre/projects/pMOSP/simon/simon.small.txt.3", -1, 10060, 0

and hangs
regards
Simon

Comment by John Hammond [ 17/Jul/18 ]

Hi Simon,

OK, this is interesting. The file you attached shows that the MDT was completely idle when stack1 was run, so I'm sorry to say we may have been going in the wrong direction. Could you run it on the client after the chown/chgrp hangs? Could you also enable some more debugging and run the following on the client:

lctl set_param debug='+dlmtrace rpctrace vfstrace inode trace'
lctl set_param debug_mb=64
lctl clear
chgrp pMOSP /mnt/lustre/projects/pMOSP/simon/simon.small.txt.3 &
### Wait for chgrp to hang in fchown()
lctl dk > /tmp/chgrp-dk.out
strack1 > /tmp/chgrp-stack1.out

And then attach /tmp/chgrp-dk.out and /tmp/chgrp-stack1.out?

Comment by Monash HPC [ 18/Jul/18 ]

Dear John
I have uploaded the files to you. These were run on the client monarch-dtn
regards
Simon

Comment by Monash HPC [ 23/Jul/18 ]

Dear John,
has there been any update on the problem?
regards
Simon

Comment by John Hammond [ 01/Aug/18 ]

Hi Simon,

The attachments from the 17th show that the client is waiting for the MDT to respond. But the previously attached stack1 output from the MDT shows that the server is idle. Can you do the following:

# On the client and MDT:
lctl set_param debug='+dlmtrace rpctrace vfstrace inode trace'
lctl set_param debug_mb=64
lctl clear
# On the client:
chgrp pMOSP /mnt/lustre/projects/pMOSP/simon/simon.small.txt.3 &
# Wait for chgrp to hang in fchown()
lctl dk > /tmp/client-chgrp-dk.out
strack1 > /tmp/client-chgrp-stack1.out
# On the MDT:
lctl dk > /tmp/mdt-chgrp-dk.out
strack1 > /tmp/mdt-chgrp-stack1.out

Comment by Monash HPC [ 02/Aug/18 ]

Dear John
I uploaded two of the files. I am afraid I could not find the program strack1. Is this part of the Lustre distribution or a tool I need to install?
regards
Simon

Comment by John Hammond [ 02/Aug/18 ]

A copy of stack1 is attached to this ticket. You used it previously.

Comment by Monash HPC [ 03/Aug/18 ]

John,
Firstly, I'd like to note that when I tried running the chgrp command, the OS became unstable and crashed. This has happened twice:

Message from syslogd@monarch-login2 at Aug 3 17:06:27 ...
kernel:LustreError: 12556:0:(niobuf.c:330:ptlrpc_register_bulk()) ASSERTION( desc->bd_md_count == 0 ) failed:

Message from syslogd@monarch-login2 at Aug 3 17:06:27 ...
kernel:LustreError: 12556:0:(niobuf.c:330:ptlrpc_register_bulk()) LBUG

Would you be interested in any of the contents of the /var/crash directories, i.e. either the large vmcore or the vmcore-dmesg files? This may or may not be related to another bug we have in Lustre which crashes our machines regularly. I am led to believe that bug is caused by an issue with the current version of the MOFED drivers.

I ran the jobs and have uploaded the files for you, with the 4Aug suffix on them.
regards Simon

Comment by John Hammond [ 03/Aug/18 ]

OK this is better. The chgrp is failing because the MDT is not connected to OST000c. What is the status of that OST? It appears that the client and server are not handling this condition correctly.

The MDT logs you provided are not from Lustre 2.10.3. What version of Lustre is the MDT running?

The failed assertions are due to LU-8573 and possibly OFED issues.

Comment by Monash HPC [ 06/Aug/18 ]

Dear John,
we have been updating our MOFED drivers and Lustre client versions to try to resolve both bugs. So some clients are
lctl lustre_build_version
Lustre version: 2.10.3

and some
lctl lustre_build_version
Lustre version: 2.10.4

Both versions show this problem.

As for OST000c, this is an inactive OST:
lfs osts
OBDS:
0: lustre-OST0000_UUID ACTIVE
1: lustre-OST0001_UUID ACTIVE
2: lustre-OST0002_UUID ACTIVE
3: lustre-OST0003_UUID ACTIVE
4: lustre-OST0004_UUID ACTIVE
5: lustre-OST0005_UUID ACTIVE
6: lustre-OST0006_UUID ACTIVE
7: lustre-OST0007_UUID ACTIVE
8: lustre-OST0008_UUID ACTIVE
9: lustre-OST0009_UUID ACTIVE
10: lustre-OST000a_UUID ACTIVE
11: lustre-OST000b_UUID ACTIVE
12: lustre-OST000c_UUID INACTIVE
13: lustre-OST000d_UUID ACTIVE
14: lustre-OST000e_UUID ACTIVE

But the file resides on a different OST (I ran some code I found on the internet to determine this):
/mnt/lustre/projects/pMOSP/simon/simon.small.txt.3: ['lustre-OST0009']
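(For reference, lfs getstripe reports the same layout information directly; a sketch using this path:)

lfs getstripe /mnt/lustre/projects/pMOSP/simon/simon.small.txt.3
# The obdidx column lists the OST index holding the file's objects;
# here it should show 9 (lustre-OST0009) rather than 12 (lustre-OST000c).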

The MDT Lustre version is
[root@rclmddc1r14-02-e1 ~]# lctl lustre_build_version
Lustre version: 2.10.58_46_ge528677

regards
Simon

Comment by John Hammond [ 13/Aug/18 ]

Hi Simon,

Is OST000c offline or just deactivated to prevent new files from being created on it?

If it's only to prevent new files from being created then you should use the max_create_count parameter. See http://doc.lustre.org/lustre_manual.xhtml#section_remove_ost.
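(As a sketch of that approach, following the manual section above and assuming the lustre-MDT0000/OST000c names from this ticket:)

# On the MDS: stop new object creation on OST000c while leaving it active,
# so existing objects on it stay reachable and the MDT keeps its connection.
lctl set_param osp.lustre-OST000c-osc-MDT0000.max_create_count=0

# To allow creates on it again later (20000 is the usual default):
lctl set_param osp.lustre-OST000c-osc-MDT0000.max_create_count=20000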

Comment by Julien Wallior [ 14/Aug/18 ]

Simon, John,
We had a similar (same?) issue on 2.10.4 servers/2.10.1 clients. We applied the patch from LU-11227 and it seems good now.

Regarding the assertion (bd_md_count == 0), what does your trace look like? Do you have the {client,server}_bulk_callback() LustreErrors like in LU-8573? We were hitting this assertion too (before patching with LU-11227), but our trace looks a bit different from LU-8573 and has no bulk_callback() LustreError.

[ 142.947358] [<ffffffff8100471d>] dump_trace+0x7d/0x2d0
[ 142.947369] [<ffffffffa0b4176a>] libcfs_call_trace+0x4a/0x60 [libcfs]
[ 142.947388] [<ffffffffa0b417e8>] lbug_with_loc+0x48/0xa0 [libcfs]
[ 142.947427] [<ffffffffa0ece3c4>] ptlrpc_register_bulk+0x7c4/0x990 [ptlrpc]
[ 142.947462] [<ffffffffa0ecef26>] ptl_send_rpc+0x236/0xe20 [ptlrpc]
[ 142.947497] [<ffffffffa0ec984c>] ptlrpc_check_set.part.23+0x18bc/0x1dd0 [ptlrpc]
[ 142.947531] [<ffffffffa0eca2b1>] ptlrpc_set_wait+0x481/0x8a0 [ptlrpc]
[ 142.947563] [<ffffffffa0eca74c>] ptlrpc_queue_wait+0x7c/0x220 [ptlrpc]
[ 142.947586] [<ffffffffa0e630fb>] mdc_getpage+0x18b/0x620 [mdc]
[ 142.947592] [<ffffffffa0e636ab>] mdc_read_page_remote+0x11b/0x650 [mdc]
[ 142.947598] [<ffffffff8113ea0e>] do_read_cache_page+0x7e/0x1c0
[ 142.947602] [<ffffffffa0e5eb07>] mdc_read_page+0x167/0x960 [mdc]
[ 142.947610] [<ffffffffa10010ce>] lmv_read_page+0x1ae/0x520 [lmv]
[ 142.947628] [<ffffffffa10964d0>] ll_get_dir_page+0xb0/0x350 [lustre]
[ 142.947637] [<ffffffffa10968c9>] ll_dir_read+0x99/0x310 [lustre]
[ 142.947648] [<ffffffffa10c9cf0>] ll_get_name+0x110/0x2d0 [lustre]
[ 142.947660] [<ffffffff8121b3b2>] exportfs_get_name+0x32/0x50
[ 142.947663] [<ffffffff8121b526>] reconnect_path+0x156/0x2e0
[ 142.947666] [<ffffffff8121b9df>] exportfs_decode_fh+0xef/0x2c0
[ 142.947684] [<ffffffffa04cd425>] fh_verify+0x2f5/0x5e0 [nfsd]
[ 142.947694] [<ffffffffa04d142c>] nfsd_access+0x2c/0x140 [nfsd]
[ 142.947705] [<ffffffffa04d7378>] nfsd3_proc_access+0x68/0xb0 [nfsd]
[ 142.947719] [<ffffffffa04c9cc2>] nfsd_dispatch+0xb2/0x200 [nfsd]
[ 142.947734] [<ffffffffa0320cab>] svc_process_common+0x43b/0x680 [sunrpc]
[ 142.947754] [<ffffffffa0320ffc>] svc_process+0x10c/0x160 [sunrpc]
[ 142.947770] [<ffffffffa04c96cf>] nfsd+0xaf/0x120 [nfsd]
[ 142.947775] [<ffffffff8107ae74>] kthread+0xb4/0xc0
[ 142.947780] [<ffffffff8152ae18>] ret_from_fork+0x58/0x90

 

Comment by Peter Jones [ 01/Sep/18 ]

Simon

JFYI - the above-mentioned fix for LU-11227 was included in the recently released Lustre 2.10.5.

Peter

Comment by Malcolm Haak - NCI (Inactive) [ 06/Sep/18 ]

Hi Peter,

Which commit ID? I looked at the git log for the 2.10.5 tag and could not see a commit claiming to pull this fix.

Was it merged into a different commit?

Comment by Peter Jones [ 06/Sep/18 ]

Ah. I apologize - it is not actually in 2.10.5. We had discussed including it but decided against it in the end because at the time it had not had very long in testing. It should be in 2.10.6

Comment by Malcolm Haak - NCI (Inactive) [ 06/Sep/18 ]

Hi Peter,

All good. Just thought I might have been going crazy, was all.

Comment by Malcolm Haak - NCI (Inactive) [ 06/Sep/18 ]

Also Peter,

LU-11227 is on the list on the change log page for 2.10.5

http://wiki.lustre.org/Lustre_2.10.5_Changelog

Thanks

Comment by Peter Jones [ 07/Sep/18 ]

Not any more - thx

Comment by Ian Costello [ 16/Nov/20 ]

Hi Peter,

Problem resolved, we can close this one.

Thanks,
Ian
