[LU-8935] Lustre mount re-export via NFS long timeouts, not working with autofs Created: 13/Dec/16 Updated: 22/Aug/17 Resolved: 29/Jul/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.10.1, Lustre 2.11.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Steve Dainard (Inactive) | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.2 |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Epic/Theme: | Lustre-2.8.0 | ||||||||
| Severity: | 2 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
We have two Lustre clusters (all machines CentOS 7.2). Each cluster's Lustre filesystem is mounted on a CentOS 7.2 server, which exports the filesystem via NFS. The start of our issue may have coincided with a system downtime for hardware maintenance on the 'delta' cluster.

The 'delta' cluster Lustre filesystem mounts properly on the 'vanlustre4' NFS server, but clients intermittently cannot access the export via autofs mounts, and when they can it is only after long delays. Manually mounting the NFS export on a client machine works, but with an incredibly long delay (~1-5 minutes), which I assume is why autofs has issues mounting. We also attempted to mount the 'delta' Lustre filesystem on the 'vanlustre3' server, which exports a Lustre filesystem from the 'echo' cluster, but encountered the same long delays, so it seems reasonable that this is not an NFS server issue.

In the kernel logs on the MDS server for the 'delta' cluster, 'dmds1', I see reconnect messages relating to the NFS server vanlustre4 (10.23.22.114), as well as LustreError ldlm_lib.c:3169:target_bulk_io(). I've attached the 'Lustre' logs from syslog. I've also attached an strace of a client machine manually mounting the 'delta' Lustre filesystem export from 'vanlustre4'.

On 'vanlustre4', messages shows multiple auth attempts from my client during the above manual mount, which did eventually succeed:

Any suggestions on how to further troubleshoot this issue would be appreciated. |
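(A minimal sketch of the kind of manual mount being timed here; the /user_data export path is the one that appears in the rpc.mountd logs later in this ticket, while the mount point and options are illustrative rather than the exact command from the attached strace.)

# time the manual NFS mount of the re-exported Lustre filesystem;
# this is where the ~1-5 minute delay shows up
time mount -t nfs vanlustre4:/user_data /mnt/test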
| Comments |
| Comment by Oleg Drokin [ 14/Dec/16 ] |
|
What is in the Lustre client (=NFS server) logs? |
| Comment by Andreas Dilger [ 14/Dec/16 ] |
|
Given that this problem started with system maintenance, the first place to look would be at what changed there. One possible issue is a problem with the network on this client. It is hard to say what the problem is without also seeing the console messages from the client.

Minor note: running strace -ttt will include timestamps in the output, so it is possible to see which calls are taking a long time. I suspect it is mount() that is slow, but it is hard to know for sure. |
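(A sketch of a timestamped trace along the lines Andreas suggests; strace's -ttt flag prints absolute timestamps with microseconds and -f follows child processes. The output file and export path are illustrative.)

# trace the manual mount with absolute timestamps so the slow syscall stands out
strace -ttt -f -o /tmp/nfs-mount.strace mount -t nfs vanlustre4:/user_data /mnt/test

# then locate the mount(2) call and compare its timestamp with the next line
grep -n 'mount(' /tmp/nfs-mount.strace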
| Comment by Steve Dainard (Inactive) [ 14/Dec/16 ] |
|
I've added the NFS server kernel logs, and another client strace with timestamps. To clarify, the 'hardware maintenance' was a disk shelf being replaced in one of the OSS servers.

Although we cannot export the Lustre filesystem via NFS without long delays, the NFS server does run rsync to back up other hosts onto its mounted Lustre filesystem, and this is working properly, so the Lustre filesystem mount seems OK; that initially made us think it was an NFS server issue. But the NFS server being the culprit seems to be ruled out, because if we mount the Lustre filesystem and export it on a different NFS server we get the same NFS mount delay.

The network looks pretty solid. There are quite a few rx_discards, but these stats are not incrementing while I'm test-mounting the NFS export on my client system, or at any point today. |
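(A sketch of one way to confirm the rx_discards counters are not moving during a test mount; the interface name eth0 is a placeholder for whatever NIC carries the Lustre/NFS traffic.)

# snapshot NIC statistics before and after reproducing the slow NFS mount
ethtool -S eth0 | grep -i discard > /tmp/nic-stats.before
# ... reproduce the slow NFS mount from a client ...
ethtool -S eth0 | grep -i discard > /tmp/nic-stats.after
diff /tmp/nic-stats.before /tmp/nic-stats.after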
| Comment by Oleg Drokin [ 15/Dec/16 ] |
|
Dec 11 00:00:56 vanlustre4 kernel: Lustre: 46864:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1481443249/real 1481443249] req@ffff880946c29e00 x1553319753601680/t0(0) o36->delta-MDT0000-mdc-ffff880fd9647800@10.23.22.90@tcp:12/10 lens 608/1016 e 0 to 1 dl 1481443256 ref 2 fl Rpc:X/0/ffffffff rc 0/-1

So the sent time is 1481443249, which is 00:00:49 - the request timed out after just 7 seconds.

Dec 13 00:02:24 dmds1 kernel: LustreError: 49950:0:(ldlm_lib.c:3169:target_bulk_io()) @@@ Reconnect on bulk READ req@ffff884825f43000 x1553326849866236/t0(0) o37->cfbb5c72-aeba-7fa6-0c04-35195c6e8ec0@10.23.22.114@tcp:375/0 lens 568/440 e 0 to 0 dl 1481616150 ref 1 fl Interpret:/0/0 rc 0/0
Dec 13 00:02:46 dmds1 kernel: LustreError: 49821:0:(ldlm_lib.c:3119:target_bulk_io()) @@@ bulk READ failed: rc -107 req@ffff882f3d19ec50 x1553326852292996/t0(0) o37->cfbb5c72-aeba-7fa6-0c04-35195c6e8ec0@10.23.22.114@tcp:434/0 lens 568/440 e 0 to 0 dl 1481616209 ref 1 fl Interpret:/0/0 rc 0/0

So it takes the server another 22 seconds to prepare and send the readdir reply, and then it notices that nobody is listening anymore.

You might also want to check that the MTU settings match everywhere, since readdir is one of the bigger RPCs the MDS does, so it might be the one to trigger MTU mismatches.

Another thing - if you rebooted your MDS too, it would start with an empty cache, and it would take a while for that to be repopulated, though one would imagine that would only need to happen once and after that things would become fast, which is clearly not the case here.

Enabling some extended debug on the NFS server / Lustre client node (i.e. echo +vfstrace +rpctrace >/proc/sys/lnet/debug, or even -1 to get everything, with a corresponding debug_mb increase) might prove useful to see which directory is being accessed, why it is timing out so fast, and so on. |
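(A sketch of enabling and capturing the extended debug Oleg suggests, on the NFS server / Lustre client node; the debug flags and /proc path are the ones quoted above, while the buffer size and output filename are arbitrary choices.)

# enable VFS and RPC tracing (or echo -1 for everything)
echo +vfstrace +rpctrace > /proc/sys/lnet/debug

# enlarge the in-kernel debug buffer so a slow mount doesn't overwrite itself
lctl set_param debug_mb=1024

# reproduce the slow NFS mount, then dump the collected debug log
lctl dk > /tmp/lustre-debug.$(hostname).log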
| Comment by Steve Dainard (Inactive) [ 15/Dec/16 ] |
|
From the NFS server, listing the top level of the Lustre filesystem is very quick, and the top level only has one subdirectory:

[sdai@vanlustre4 user_data]$ pwd
real 0m0.005s

If I do a find a few levels deep, the response time is still pretty quick:

[sdai@vanlustre4 user_data]$ time find . -maxdepth 5 | grep ARCHIVE | wc -l
real 0m1.495s

I'm guessing at this point the cache is coming into play.

I've checked the MTU on all the hosts, and they are all set to the default 1500. This mirrors the setup of the 'echo' cluster hosts, which are on the same switches.

I've got about 6 hours of time troubleshooting Lustre, so I can't comment too much on configuration.

I should also mention that when restarting or re-exporting NFS exports on the NFS server, the service start also hangs for a very long time, ~1-5 minutes. I've attached an strace (see 'nfs-server-strace-exportfs-a') of a different NFS server which has mounted the 'delta' filesystem and run exportfs -r. I also noticed that when I unmounted the delta filesystem on the new NFS test server, the following is logged:

And in all cases, after echoing -1 to /proc/sys/lnet/debug I'm not seeing any 'Lustre' logging on the NFS server, although to be fair most of the logging we've seen to date has been in the early morning hours, not while troubleshooting NFS exports.

stat /tmp/lustre-log |
| Comment by Steve Dainard (Inactive) [ 15/Dec/16 ] |
|
In addition, the NFS server looks like it crashed early this morning; here are the last logs in syslog:

Dec 15 00:35:17 vanlustre4 rpc.mountd[2810]: authenticated mount request from 10.23.20.50:999 for /user_data (/use |
| Comment by Oleg Drokin [ 19/Dec/16 ] |
|
Sorry, I was not really verbose. |
| Comment by Steve Dainard (Inactive) [ 19/Dec/16 ] |
|
I've attached the log output from just before the client tries to mount the NFS export, until it mounts successfully. I have a TON of intermediate files as well if they end up being useful. |
| Comment by Steve Dainard (Inactive) [ 20/Dec/16 ] |
|
Kernel panic on the NFS server this morning:

Dec 20 01:58:54 vanlustre4 kernel: LustreError: 11-0: delta-MDT0000-mdc-ffff881fe3158000: operation ldlm_enqueue t

Full syslog attached. |
| Comment by Steve Dainard (Inactive) [ 22/Dec/16 ] |
|
NFS server crash:

Dec 22 02:03:05 vanlustre4 kernel: Lustre: 20662:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1482400977/real 1482400977] req@ffff880280affb00 x1554362320709380/t0(0) o36->delta-MDT0 |
| Comment by Oleg Drokin [ 22/Dec/16 ] |
|
Hm, did the LBUG happen while you had debug set to -1, by any chance? Did you happen to capture a crashdump? |
| Comment by Oleg Drokin [ 22/Dec/16 ] |
|
About the debug logs: the log buffer is cleared once you do lctl dk, so all the intermediate dumps you took potentially have all the info I am looking for, but the one from before the mount has nothing, and the one from once the mount completes potentially only has the last request that was taking a long time. By increasing the debug_mb setting you increase the size of the debug buffer so more fits in if the process takes a long time (say, 1024 there would be good enough for an otherwise totally idle client, I guess). |
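(A sketch of how the intermediate dumps could be taken so that the whole slow-mount window is covered; the 30-second interval and output path are arbitrary, not something specified in this ticket.)

# each 'lctl dk' empties the buffer, so a series of numbered dumps together covers the whole window
i=0
while sleep 30; do
    lctl dk > /tmp/lustre-debug.$((i++)).log
done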
| Comment by Steve Dainard (Inactive) [ 22/Dec/16 ] |
|
Crashdump as in kdump? If yes, I do. I've got crashes from Dec 10, 12, 15, 17, 18, 20, and 22. Do you have somewhere I can upload these to? They're 53-70G in size... At least one of them should have had debug set to -1, but I don't think the most recent one would have, although it is the largest on disk. |
| Comment by Oleg Drokin [ 22/Dec/16 ] |
|
You can upload to our FTP at ftp.hpdd.intel.com, to the /uploads/ directory; just create a directory like

I would need the crashdump, the vmlinux from the -debuginfo rpm, and the lustre-modules-debuginfo rpm (assuming you self-built Lustre + kernel; if you used one of the ones we built, just let me know which one it was). Thank you. |
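(A sketch of gathering what is being asked for, assuming the stock CentOS 7 kdump layout under /var/crash and matching debuginfo packages; the package versions must match the crashed kernel and Lustre build.)

# kdump vmcores live under /var/crash/<ip>-<timestamp>/ on CentOS 7
ls /var/crash/

# the matching debug symbols: kernel vmlinux plus the Lustre module debuginfo
rpm -q kernel-debuginfo lustre-modules-debuginfo

# quick sanity check that the dump opens before uploading it
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/*/vmcore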
| Comment by Steve Dainard (Inactive) [ 03/Jan/17 ] |
|
I think there is a

We're using the pre-built kernels from the lustre repos:

Thanks |
| Comment by Steve Dainard (Inactive) [ 05/Jan/17 ] |
|
Hello,

Has anyone had a chance to take a look through these kernel dumps?

Thanks |
| Comment by Oleg Drokin [ 06/Jan/17 ] |
|
I am taking a look at the files you uploaded (they made it in just fine). |
| Comment by Steve Dainard (Inactive) [ 09/Jan/17 ] |
|
Sometime soon after delta cluster was created, delta-OST0012 seems to have been deactivated: 2016-11-02-syslog.log:Nov 2 15:10:12 dmds1 kernel: Lustre: 4643:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1478124604/real 1478124604] req@ffff8852d8dd7200 x1549476144894000/t0(0) o13->delta-OST0012-osc-MDT0000@10.23.22.92@tcp:7/4 lens 224/368 e 0 to 1 dl 1478124611 ref 1 fl Rpc:X/0/ffffffff rc 0/-1 2016-11-02-syslog.log:Nov 2 15:10:12 dmds1 kernel: Lustre: delta-OST0012-osc-MDT0000: Connection to delta-OST0012 (at 10.23.22.92@tcp) was lost; in progress operations using this service will wait for recovery to complete 2016-11-02-syslog.log:Nov 2 15:10:18 dmds1 kernel: Lustre: 4588:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1478124612/real 1478124612] req@ffff88407c1e2100 x1549476144899076/t0(0) o8->delta-OST0012-osc-MDT0000@10.23.22.92@tcp:28/4 lens 520/544 e 0 to 1 dl 1478124618 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 2016-11-02-syslog.log:Nov 2 15:10:48 dmds1 kernel: Lustre: 4588:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1478124637/real 1478124637] req@ffff883d0c413900 x1549476144914028/t0(0) o8->delta-OST0012-osc-MDT0000@10.23.22.92@tcp:28/4 lens 520/544 e 0 to 1 dl 1478124648 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 2016-11-02-syslog.log:Nov 2 15:11:18 dmds1 kernel: Lustre: 4588:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1478124662/real 1478124662] req@ffff884070e6d100 x1549476144932136/t0(0) o8->delta-OST0012-osc-MDT0000@10.23.22.92@tcp:28/4 lens 520/544 e 0 to 1 dl 1478124678 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 2016-11-02-syslog.log:Nov 2 15:11:40 dmds1 kernel: LustreError: 140-5: Server delta-OST0012 requested index 18, but that index is already in use. 
Use --writeconf to force 2016-11-02-syslog.log:Nov 2 15:11:40 dmds1 kernel: LustreError: 108857:0:(mgs_handler.c:460:mgs_target_reg()) Failed to write delta-OST0012 log (-98) 2016-11-02-syslog.log:Nov 2 15:11:48 dmds1 kernel: Lustre: 4588:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1478124687/real 1478124687] req@ffff884070e6a100 x1549476144946028/t0(0) o8->delta-OST0012-osc-MDT0000@10.23.22.92@tcp:28/4 lens 520/544 e 0 to 1 dl 1478124708 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 2016-11-02-syslog.log:Nov 2 15:12:18 dmds1 kernel: Lustre: 4588:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1478124712/real 1478124712] req@ffff880622c00000 x1549476144962948/t0(0) o8->delta-OST0012-osc-MDT0000@10.23.22.92@tcp:28/4 lens 520/544 e 0 to 1 dl 1478124738 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 2016-11-02-syslog.log:Nov 2 15:13:13 dmds1 kernel: Lustre: 4588:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1478124762/real 1478124762] req@ffff8843c7f14b00 x1549476145004052/t0(0) o8->delta-OST0012-osc-MDT0000@10.23.22.92@tcp:28/4 lens 520/544 e 0 to 1 dl 1478124793 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 2016-11-02-syslog.log:Nov 2 15:14:08 dmds1 kernel: Lustre: 4588:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1478124812/real 1478124812] req@ffff8843c7f16300 x1549476145039664/t0(0) o8->delta-OST0012-osc-MDT0000@10.23.22.92@tcp:28/4 lens 520/544 e 0 to 1 dl 1478124848 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 2016-11-02-syslog.log:Nov 2 15:15:58 dmds1 kernel: Lustre: 4588:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1478124912/real 1478124912] req@ffff8843c7f13c00 x1549476145113496/t0(0) o8->delta-OST0012-osc-MDT0000@10.23.22.92@tcp:28/4 lens 520/544 e 0 to 1 dl 1478124958 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 2016-11-02-syslog.log:Nov 2 15:18:12 dmds1 kernel: Lustre: 4588:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1478125037/real 1478125038] req@ffff8844389de900 x1549476145200192/t0(0) o8->delta-OST0012-osc-MDT0000@10.23.22.92@tcp:28/4 lens 520/544 e 0 to 1 dl 1478125092 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 2016-11-02-syslog.log:Nov 2 15:23:12 dmds1 kernel: Lustre: 4588:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1478125337/real 1478125337] req@ffff88446375f200 x1549476145360432/t0(0) o8->delta-OST0012-osc-MDT0000@10.23.22.92@tcp:28/4 lens 520/544 e 0 to 1 dl 1478125392 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 2016-11-02-syslog.log:Nov 2 15:31:57 dmds1 kernel: Lustre: 4588:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1478125862/real 1478125862] req@ffff88446375a100 x1549476145540712/t0(0) o8->delta-OST0012-osc-MDT0000@10.23.22.92@tcp:28/4 lens 520/544 e 0 to 1 dl 1478125917 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 2016-11-02-syslog.log:Nov 2 15:43:12 dmds1 kernel: Lustre: 4588:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1478126537/real 1478126537] req@ffff88446375bc00 x1549476145818356/t0(0) o8->delta-OST0012-osc-MDT0000@10.23.22.92@tcp:28/4 lens 520/544 e 0 to 1 dl 1478126592 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 2016-11-02-syslog.log:Nov 2 15:54:27 dmds1 kernel: Lustre: 4588:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: 
[sent 1478127212/real 1478127212] req@ffff8843f8cde600 x1549476146019280/t0(0) o8->delta-OST0012-osc-MDT0000@10.23.22.92@tcp:28/4 lens 520/544 e 0 to 1 dl 1478127267 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 2016-11-02-syslog.log:Nov 2 16:05:42 dmds1 kernel: Lustre: 4588:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1478127887/real 1478127887] req@ffff8843f8cdf200 x1549476146209332/t0(0) o8->delta-OST0012-osc-MDT0000@10.23.22.92@tcp:28/4 lens 520/544 e 0 to 1 dl 1478127942 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 2016-11-02-syslog.log:Nov 2 16:13:00 dmds1 kernel: LustreError: 140-5: Server delta-OST0012 requested index 18, but that index is already in use. Use --writeconf to force 2016-11-02-syslog.log:Nov 2 16:13:00 dmds1 kernel: LustreError: 4713:0:(mgs_handler.c:460:mgs_target_reg()) Failed to write delta-OST0012 log (-98) 2016-11-02-syslog.log:Nov 2 16:50:53 dmds1 kernel: Lustre: setting import delta-OST0012_UUID INACTIVE by administrator request 2016-11-02-syslog.log:Nov 2 16:52:58 dmds1 kernel: Lustre: Permanently deactivating delta-OST0012 2016-11-02-syslog.log:Nov 2 16:52:58 dmds1 kernel: Lustre: Setting parameter delta-OST0012-osc.osc.active in log delta-client 2016-12-09-syslog.log:Dec 9 15:37:59 dmds1 kernel: Lustre: setting import delta-OST0012_UUID INACTIVE by administrator request 2016-12-09-syslog.log:Dec 9 15:52:05 dmds1 kernel: Lustre: setting import delta-OST0012_UUID INACTIVE by administrator request 2016-12-21-syslog.log:Dec 21 12:46:16 dmds1 kernel: Lustre: setting import delta-OST0012_UUID INACTIVE by administrator request 2017-01-03-syslog.log:Jan 3 11:41:22 dmds1 kernel: Lustre: setting import delta-OST0012_UUID INACTIVE by administrator request 2017-01-03-syslog.log:Jan 3 12:35:14 dmds1 kernel: Lustre: setting import delta-OST0012_UUID INACTIVE by administrator request 2017-01-03-syslog.log:Jan 3 15:39:52 dmds1 kernel: Lustre: setting import delta-OST0012_UUID INACTIVE by administrator request 2017-01-04-syslog.log:Jan 4 03:11:15 dmds1 kernel: LustreError: 140-5: Server delta-OST0012 requested index 18, but that index is already in use. Use --writeconf to force 2017-01-04-syslog.log:Jan 4 03:11:15 dmds1 kernel: LustreError: 4702:0:(mgs_handler.c:460:mgs_target_reg()) Failed to write delta-OST0012 log (-98) 2017-01-04-syslog.log:Jan 4 08:49:27 dmds1 kernel: Lustre: setting import delta-OST0012_UUID INACTIVE by administrator request 2017-01-04-syslog.log:Jan 4 08:49:27 dmds1 kernel: LustreError: 114443:0:(osp_precreate.c:899:osp_precreate_cleanup_orphans()) delta-OST0012-osc-MDT0000: cannot cleanup orphans: rc = -108 2017-01-04-syslog.log:Jan 4 09:36:59 dmds1 kernel: Lustre: delta-OST0012-osc-MDT0000: Connection to delta-OST0012 (at 10.23.22.92@tcp) was lost; in progress operations using this service will wait for recovery to complete 2017-01-04-syslog.log:Jan 4 09:36:59 dmds1 kernel: LustreError: 167-0: delta-OST0012-osc-MDT0000: This client was evicted by delta-OST0012; in progress operations using this service will fail. 
2017-01-04-syslog.log:Jan 4 09:36:59 dmds1 kernel: Lustre: delta-OST0012-osc-MDT0000: Connection restored to 10.23.22.92@tcp (at 10.23.22.92@tcp)
2017-01-04-syslog.log:Jan 4 10:54:57 dmds1 kernel: Lustre: delta-OST0012-osc-MDT0000: Connection to delta-OST0012 (at 10.23.22.92@tcp) was lost; in progress operations using this service will wait for recovery to complete
2017-01-04-syslog.log:Jan 4 10:54:57 dmds1 kernel: Lustre: delta-OST0012-osc-MDT0000: Connection restored to 10.23.22.92@tcp (at 10.23.22.92@tcp)
2017-01-05-syslog.log:Jan 5 03:08:44 dmds1 kernel: LustreError: 14d-8: No target specified: delta-OST0012_UUID=1

On test client:

# lctl device_list
0 UP mgc MGC10.23.22.90@tcp 1d99ec10-5360-1b32-d30d-18c34eadf020 5
1 UP lov delta-clilov-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 4
2 UP lmv delta-clilmv-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 4
3 UP mdc delta-MDT0000-mdc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
4 UP osc delta-OST0000-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
5 UP osc delta-OST0033-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
6 UP osc delta-OST0001-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
7 UP osc delta-OST0034-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
8 UP osc delta-OST0022-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
9 UP osc delta-OST0002-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
10 UP osc delta-OST0035-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
11 UP osc delta-OST0023-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
12 UP osc delta-OST0003-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
13 UP osc delta-OST0036-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
14 UP osc delta-OST0037-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
15 UP osc delta-OST0004-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
16 UP osc delta-OST0024-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
17 UP osc delta-OST0038-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
18 UP osc delta-OST0005-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
19 UP osc delta-OST0025-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
20 UP osc delta-OST0039-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
21 UP osc delta-OST0006-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
22 UP osc delta-OST003a-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
23 UP osc delta-OST0026-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
24 UP osc delta-OST0007-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
25 UP osc delta-OST003b-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
26 UP osc delta-OST0008-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
27 UP osc delta-OST0027-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
28 UP osc delta-OST003c-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
29 UP osc delta-OST0009-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
30 UP osc delta-OST003d-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
31 UP osc delta-OST0028-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
32 UP osc delta-OST000a-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
33 UP osc delta-OST003e-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
34 UP osc delta-OST0029-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
35 UP osc delta-OST000b-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
36 UP osc delta-OST003f-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
37 UP osc delta-OST000c-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
38 UP osc delta-OST0040-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
39 UP osc delta-OST002a-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
40 UP osc delta-OST000d-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
41 UP osc delta-OST0041-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
42 UP osc delta-OST0043-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
43 UP osc delta-OST002b-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
44 UP osc delta-OST000e-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
45 UP osc delta-OST000f-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
46 UP osc delta-OST002c-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
47 UP osc delta-OST0010-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
48 UP osc delta-OST002d-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
49 UP osc delta-OST002e-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
50 UP osc delta-OST002f-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
51 UP osc delta-OST0030-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
52 UP osc delta-OST0031-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
53 UP osc delta-OST0032-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
54 UP osc delta-OST0012-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 4
55 UP osc delta-OST0013-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
56 UP osc delta-OST0014-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
57 UP osc delta-OST0015-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
58 UP osc delta-OST0016-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
59 UP osc delta-OST0017-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
60 UP osc delta-OST0018-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
61 UP osc delta-OST0019-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
62 UP osc delta-OST001a-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
63 UP osc delta-OST001b-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
64 UP osc delta-OST001c-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
65 UP osc delta-OST001d-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
66 UP osc delta-OST001e-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
67 UP osc delta-OST001f-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
68 UP osc delta-OST0020-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
69 UP osc delta-OST0021-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5
70 UP osc delta-OST0011-osc-ffff880032d64800 32b62615-3145-295b-feb4-bbca447e962d 5

After mounting lustre on my test machine, and activating OST0012, df commands, nfs restarts, and client automounts are returning quickly. Can you advise if this is a bug, or whereabouts I could permanently enable OST0012 so that it is activated when clients mount lustre? |
| Comment by Steve Dainard (Inactive) [ 09/Jan/17 ] |
|
lctl --device 54 conf_param delta-OST0012.osc.active looks to have set the OST to activate on client mount.

Is this a known bug, that if an OST is inactive it will cause significant hangs for df, NFS mounts, etc.? |
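(For reference, a sketch of the usual way to make an OST permanently active again; the conf_param line is run on the MGS and is persistent, while the set_param line takes effect immediately on an already-mounted client such as the NFS server. Whether the conf_param command quoted above included the =1 value is not shown in this ticket.)

# on the MGS: record the OST as active in the client configuration log (persistent)
lctl conf_param delta-OST0012.osc.active=1

# on already-mounted clients / the NFS server: activate it without remounting
lctl set_param osc.delta-OST0012-osc-*.active=1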
| Comment by Oleg Drokin [ 10/Jan/17 ] |
|
Hm, the disabled OST delays are likely an occurrence of

Now, your crashes are a new bug that seems to be triggered by some network problem (not 100% sure of that yet, though). |
| Comment by Oleg Drokin [ 10/Jan/17 ] |
|
So the debug logs did not have what I hoped they would have, probably due to page dropping.

[48378.577309] Lustre: 20662:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1482400977/real 1482400977] req@ffff880280affb00 x1554362320709380/t0(0) o36->delta-MDT0000-mdc-ffff880fe594d800@10.23.22.90@tcp:12/10 lens 608/1016 e 0 to 1 dl 1482400984 ref 2 fl Rpc:X/0/ffffffff rc 0/-1

What was the actual time on the node when this was printed? The request was sent at "Thu Dec 22 05:02:57" EST (though I don't know what timezone the servers are in) and |
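(A small aside on reading these messages: the sent/real/dl values are Unix epoch seconds, so they can be converted on any node with GNU date; the output depends on the node's timezone.)

# convert the 'sent' epoch from the message above
date -d @1482400977
# or pin it to UTC for comparison across nodes: Thu Dec 22 10:02:57 UTC 2016
TZ=UTC date -d @1482400977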
| Comment by Steve Dainard (Inactive) [ 10/Jan/17 ] |
|
Not sure about a specific rsync job crashing; I believe there are a few that crash. There are quite a few (hundreds of) clients backing up over rsync to the NFS server.

On the NFS server (PST timestamps):

Dec 22 02:03:05 vanlustre4 kernel: Lustre: 20662:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1482400977/real 1482400977] req@ffff880280affb00 x1554362320709380/t0(0) o36->delta-MDT0000-mdc-ffff880fe594d800@10.23.22.90@tcp:12/10 lens 608/1016 e 0 to 1 dl 1482400984 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Dec 22 02:03:05 vanlustre4 kernel: Lustre: 20662:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 82 previous similar messages

On the MDS dmds1 (PST):

Dec 22 00:01:59 dmds1 kernel: Lustre: delta-MDT0000: Connection restored to 5d0ddfdb-65fb-bb1d-85f6-2509c2d8bc16 (at 10.23.22.114@tcp)
Dec 22 00:02:06 dmds1 kernel: Lustre: delta-MDT0000: Client 89fa1c40-896d-2b3e-6ccd-c237c9d4f145 (at 10.23.22.114@tcp) reconnecting
Dec 22 00:02:06 dmds1 kernel: LustreError: 22492:0:(ldlm_lib.c:3169:target_bulk_io()) @@@ Reconnect on bulk READ req@ffff881c9e367200 x1554361449174764/t0(0) o37->89fa1c40-896d-2b3e-6ccd-c237c9d4f145@10.23.22.114@tcp:307/0 lens 568/440 e 0 to 0 dl 1482393732 ref 1 fl Interpret:/0/0 rc 0/0
Dec 22 00:02:06 dmds1 kernel: Lustre: delta-MDT0000: Connection restored to 5d0ddfdb-65fb-bb1d-85f6-2509c2d8bc16 (at 10.23.22.114@tcp)
Dec 22 00:02:13 dmds1 kernel: Lustre: delta-MDT0000: Client 89fa1c40-896d-2b3e-6ccd-c237c9d4f145 (at 10.23.22.114@tcp) reconnecting
Dec 22 00:02:13 dmds1 kernel: Lustre: delta-MDT0000: Connection restored to 5d0ddfdb-65fb-bb1d-85f6-2509c2d8bc16 (at 10.23.22.114@tcp) |
| Comment by Oleg Drokin [ 10/Jan/17 ] |
|
So only 55 seconds since the request was sent. Also did the lazystatfs option help with the delay? |
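(For anyone following along, a sketch of how lazystatfs is typically enabled so that statfs/df does not block on inactive OSTs; the mount point is a placeholder, while the MGS NID and fsname are the ones from this ticket.)

# as a client mount option on the NFS server
mount -t lustre -o lazystatfs 10.23.22.90@tcp:/delta /mnt/delta

# or toggled at runtime on an already-mounted client
lctl set_param llite.delta-*.lazystatfs=1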
| Comment by Gerrit Updater [ 13/Apr/17 ] |
|
Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/26583 |
| Comment by Lai Siyao [ 13/Apr/17 ] |
|
It looks like ptlrpc_client_wake_req() misses a memory barrier, which may cause ptlrpc_resend_req() to wake up ptlrpc_send_rpc() -> ptlrpc_register_bulk() while the latter doesn't see rq_resend set. |
| Comment by Gerrit Updater [ 29/Jul/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26583/ |
| Comment by Peter Jones [ 29/Jul/17 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 02/Aug/17 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/28321 |
| Comment by Gerrit Updater [ 18/Aug/17 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28321/ |