[LU-11989] Global filesystem hangs in 2.12 Created: 21/Feb/19 Updated: 20/May/22 Resolved: 20/May/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Stephane Thiell | Assignee: | Peter Jones |
| Resolution: | Cannot Reproduce | Votes: | 1 |
| Labels: | None |
| Environment: | CentOS 7.6, Lustre 2.12.0 clients and servers, some clients with 2.12.0 + patch |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We are having more issues with a full 2.12 production setup on Sherlock and Fir: we sometimes see a global filesystem hang, on all nodes, for at least 30 seconds, often more. The filesystem can run fine for 2 hours and then hang for a few minutes. This is impacting production, especially interactive jobs. These filesystem hangs could be related to compute nodes rebooting, and match messages like the following on the MDTs:
[769459.092993] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784454/real 1550784454] req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784461 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
[769459.120452] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[769473.130314] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784468/real 1550784468] req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784475 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
[769473.157759] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[769494.167799] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784489/real 1550784489] req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784496 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
[769494.195248] Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
I'm not 100% sure, but it seems that when these messages stop on the MDTs, the filesystem comes back online. There are no logs on the clients though, as far as I know... Please note that we're also in the process of fixing the locking issue described in |
| Comments |
| Comment by Patrick Farrell (Inactive) [ 21/Feb/19 ] |
|
Stephane,
I won't be able to examine them immediately, but dmesg and the output of lctl dk [you don't have extra debug on, but it still includes additional info vs dmesg] from the MDS(es) would be useful.
Just to be clear: do these timed-out messages specifically correspond to clients you're rebooting? i.e., is the NID that of a client that was being rebooted at that time? |
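For reference, a minimal way to capture both on an MDS would be something like the following (file names here are only examples):
dmesg > /tmp/fir-md1-s1-dmesg.txt
lctl dk > /tmp/fir-md1-s1-lctl-dk.txt    # dumps and clears the Lustre kernel debug buffer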
| Comment by Stephane Thiell [ 21/Feb/19 ] |
|
Thanks Patrick, I'll try to gather more logs a bit later. But to answer your question: yes, this specific node was redeployed around 13:26:43, and the MDT logs with timestamps are:
Feb 21 13:27:41 fir-md1-s1 kernel: Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784454/real 1550784454] req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784461 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Feb 21 13:27:55 fir-md1-s1 kernel: Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784468/real 1550784468] req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784475 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Feb 21 13:28:16 fir-md1-s1 kernel: Lustre: 21751:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1550784489/real 1550784489] req@ffff9cc82f229800 x1625957396013728/t0(0) o104->fir-MDT0002@10.9.101.45@o2ib4:15/16 lens 296/224 e 0 to 1 dl 1550784496 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
I'm not sure at all that this is directly related to our hang issues, but those are the only messages I could see at that moment. |
| Comment by Patrick Farrell (Inactive) [ 22/Feb/19 ] |
|
Are you unmounting the fs gracefully on the nodes you're rebooting? If they're doing certain types of activity and go away suddenly, the MDS can end up needing to call back locks from them, which means it has to time out requests and evict them, and until that happens, the system will be hung. This sounds rather like that. |
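As a side note, a graceful unmount before reboot avoids leaving locks behind for the MDS to call back. A minimal sketch, assuming the node only needs its Lustre mounts cleaned up:
umount -a -t lustre    # unmount all Lustre filesystems on the client
lustre_rmmod           # optionally unload the Lustre modules as well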
| Comment by Stephane Thiell [ 23/Feb/19 ] |
|
Patrick, we use xCAT, and rinstall should unmount the fs gracefully during a normal redeployment, but we also have nodes that sometimes end up in a bad state and have to be reset; this cannot be avoided. At this time, it seems that the frequency of the hangs has diminished over the last 24h. Perhaps a good effect of the progressive deployment of the patch for |
| Comment by Stephane Thiell [ 25/Feb/19 ] |
|
Hi Patrick,
This morning at 8:09 AM:
[root@sh-hn01 ~]# time ls /fir
groups  PI  users

real    0m51.166s
user    0m0.000s
sys     0m0.005s
At the exact same time, 8:09, we have this on one MDS (fir-md1-s1):
Feb 25 08:09:04 fir-md1-s1 kernel: Lustre: 51443:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1551110933/real 1551110933] req@ffff9c9518ffa700 x162595931911
Feb 25 08:09:04 fir-md1-s1 kernel: Lustre: 51443:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Feb 25 08:09:15 fir-md1-s1 kernel: Lustre: 51443:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1551110944/real 1551110944] req@ffff9c9518ffa700 x162595931911
Feb 25 08:09:38 fir-md1-s1 kernel: Lustre: 51443:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1551110966/real 1551110966] req@ffff9c9518ffa700 x162595931911
Feb 25 08:09:38 fir-md1-s1 kernel: Lustre: 51443:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Feb 25 08:09:53 fir-md1-s1 kernel: LustreError: 51443:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.8.20.15@o2ib6) returned error from blocking AST (req@ffff9c9518ffa700 x1625959319111840 statu
Feb 25 08:09:53 fir-md1-s1 kernel: Lustre: MGS: Connection restored to ce807089-fd76-9353-e0a7-32c622e380eb (at 10.8.20.15@o2ib6)
Feb 25 08:09:53 fir-md1-s1 kernel: Lustre: Skipped 2 previous similar messages
Feb 25 08:09:53 fir-md1-s1 kernel: LustreError: 138-a: fir-MDT0000: A client on nid 10.8.20.15@o2ib6 was evicted due to a lock blocking callback time out: rc -107
Feb 25 08:09:53 fir-md1-s1 kernel: LustreError: 129189:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 60s: evicting client at 10.8.20.15@o2ib6 ns: mdt-fir-MDT0000_UUID lock: ffff
The corresponding node is sh-20-15 (10.8.20.15@o2ib6), and we confirmed that this node crashed, probably due to some hardware issue. We don't see this issue with 2.10 or 2.8 filesystems.
I was able to gather some Lustre logs from the MDS; not sure they will reveal anything, but it's worth a shot. I'm attaching the logs as fir-md1-s1_dk20190225.log.gz
grep ptlrpc_expire_one_request /tmp/log_20190225.log
00000100:00000400:0.0:1551110944.930881:0:51443:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1551110933/real 1551110933] req@ffff9c9518ffa700 x1625959319111840/t0(0) o104->fir-MDT0000@10.8.20.15@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1551110944 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Thanks! |
| Comment by Patrick Farrell (Inactive) [ 25/Feb/19 ] |
|
Can you do fid2path on [0x200000400:0x5:0x0]? It may not work. After fid2path, please also do lfs getstripe on that file, if you can find the file. |
| Comment by Stephane Thiell [ 25/Feb/19 ] |
|
It's one of our base directories:
$ lfs fid2path /fir [0x200000400:0x5:0x0]
/fir/users
As an FYI, it uses DNEv2 striping to autoselect one of the MDTs when a new home directory is created:
$ lfs getdirstripe /fir/users
lmv_stripe_count: 4 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx FID[seq:oid:ver]
0 [0x200000400:0x5:0x0]
1 [0x240000402:0x5:0x0]
2 [0x2c0000400:0x5:0x0]
3 [0x280000401:0x5:0x0]
|
| Comment by Stephane Thiell [ 25/Feb/19 ] |
|
That way, we get scratch home dirs balanced on our 4 MDTs:
$ lfs getdirstripe /fir/users/* | sort -n | uniq -c
920 lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
1027 lmv_stripe_count: 0 lmv_stripe_offset: 1 lmv_hash_type: none
927 lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none
947 lmv_stripe_count: 0 lmv_stripe_offset: 3 lmv_hash_type: none
Just as an FYI; I don't know if that is related. |
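For context, a parent directory striped like this would typically have been created with something along the following lines (a sketch only; the exact options used on Fir are not shown in this ticket, and lfs mkdir is an alias for lfs setdirstripe):
lfs mkdir -c 4 -i 0 -H fnv_1a_64 /fir/users     # 4 MDT stripes, starting at MDT0, fnv_1a_64 hash
lfs mkdir -c 4 -i 0 -H fnv_1a_64 /fir/groups
New subdirectories created under such a striped parent are then placed on a single MDT chosen by the hash, which matches the lmv_stripe_offset values in the counts above.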
| Comment by Patrick Farrell (Inactive) [ 25/Feb/19 ] |
|
Ah, OK. Interesting. OK, here's the general outline of the problem. The client is holding a pretty standard lock on that directory, the kind that suggests it did a lookup under it. (May be other operations as well, but lookup at least) Some change in 2.12 - possibly DNE2 related - means that some other operation needs to cancel that lock so it can proceed. So when a client goes away ungracefully, these other operations have to wait for this eviction. (BTW, this strongly suggests your intentional reboots to install the patch were indeed not unmounting the fs.) Hmm. I can probably reproduce this with a little effort. Did you use DNE2 in your pre-2.12 install? (And what version of Lustre was that install?) |
| Comment by Stephane Thiell [ 25/Feb/19 ] |
|
Patrick, OK! Just before putting /fir into production, we formatted it with 2.12 GA, so it's a new 2.12 filesystem. We only use DNEv2 in that specific case for /fir/users/ (all users) and /fir/groups/ (all labs/groups). |
| Comment by Patrick Farrell (Inactive) [ 25/Feb/19 ] |
|
So you have not used DNE2 in a similar way on a 2.10 system? (Trying to bound when the issue might've started.) |
| Comment by Stephane Thiell [ 25/Feb/19 ] |
|
That's correct, we started to use DNE2 with /fir in 2.12 only. |
| Comment by Patrick Farrell (Inactive) [ 25/Feb/19 ] |
|
If this happens again, can you dump the lock table on the MDS during the hang?
lctl set_param debug=+dlmtrace; lctl clear
lctl set_param ldlm.dump_namespaces=1
lctl set_param debug=-dlmtrace
lctl dk > [file] |
| Comment by Stephane Thiell [ 25/Feb/19 ] |
|
Sure, will do! Thanks! |
| Comment by Patrick Farrell (Inactive) [ 25/Feb/19 ] |
|
Stephane,
After some careful examination, I'm not sure this is a new problem. If you've got a client that has done a lookup on the directory and it is lost, then creating a new file in another directory can hang. I can reproduce this on 2.10 without any DNE being involved. I don't have a 2.7 system handy right now...
I can reproduce this with two clients and no DNE. One client does ls -l in a directory, and then we panic that client. The other client tries to create a file in a subdirectory of that first directory (remember that the first client did ls in the parent). This hangs until the eviction is complete. (It also works if the ls and the create are in the same directory.)
I believe I've seen this behavior in older systems as well, so I don't think it's new.
laisiyao, can you take a look at this and confirm what's supposed to happen here? |
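For the record, a rough sketch of that two-client reproducer (mount point and directory names here are made up, not taken from the ticket):
# client1: take a lookup lock on the parent directory, then crash without unmounting
ls -l /mnt/lustre/dir1
echo c > /proc/sysrq-trigger
# client2: this create hangs until the MDS finishes evicting client1
touch /mnt/lustre/dir1/subdir/newfile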
| Comment by Stephane Thiell [ 01/Mar/19 ] |
|
More issues tonight that started with directories not accessible from a specific client (10.9.0.1@o2ib4) and ended up with a full filesystem deadlock. Restarting all MDTs resolved (temporarily) the issue... This is the console of fir-md1-s1 (serves MDT0000 and MDT0002), when the directories started to become inaccessible:
[1378713.475739] LNet: Service thread pid 48757 was inactive for 212.00s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[1378713.488774] LNet: Skipped 3 previous similar messages
[1378713.494012] LustreError: dumping log to /tmp/lustre-log.1551393702.48757
[1378716.547830] LustreError: dumping log to /tmp/lustre-log.1551393705.22268
[1378721.106914] Lustre: fir-MDT0000-osp-MDT0002: Connection to fir-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[1378721.108926] LustreError: 51417:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393409, 300s ago); not entering recovery in server code, just going back to sleep ns:
[1378721.161814] LustreError: 51459:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393409, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
[1378721.202150] Lustre: fir-MDT0000: Received new LWP connection from 0@lo, removing former export from same NID
[1378721.943952] LustreError: 22249:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393410, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
[1378721.984222] LustreError: 22249:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
[1378724.392997] LustreError: 51445:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393413, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
[1378728.847103] LustreError: 22142:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393417, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
[1378729.910121] LustreError: 21796:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393418, 300s ago); not entering recovery in server code, just going back to sleep ns:
[1378729.949613] LustreError: 21796:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 4 previous similar messages
[1378736.197272] LustreError: 21546:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393424, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
[1378743.454431] LustreError: 21974:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393432, 300s ago); not entering recovery in server code, just going back to sleep ns:
[1378753.250665] LustreError: 22233:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393441, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
[1378753.290920] LustreError: 22233:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
[1378771.845093] LNet: Service thread pid 22241 was inactive for 236.44s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[1378771.858133] LNet: Skipped 1 previous similar message
[1378771.863282] LustreError: dumping log to /tmp/lustre-log.1551393760.22241
[1378774.917158] LustreError: dumping log to /tmp/lustre-log.1551393763.21826
[1378789.253489] LustreError: dumping log to /tmp/lustre-log.1551393777.51431
[1378796.422652] LustreError: dumping log to /tmp/lustre-log.1551393785.22159
[1378800.517748] LustreError: dumping log to /tmp/lustre-log.1551393789.47900
[1378801.468777] LustreError: 48757:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393490, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
[1378801.509027] LustreError: 48757:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 3 previous similar messages
[1378801.541773] LustreError: dumping log to /tmp/lustre-log.1551393790.21878
[1378833.603516] LustreError: 22159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1551393522, 300s ago), entering recovery for fir-MDT0000_UUID@10.0.10.51@o2ib7 ns: fir-MD
[1378833.643766] LustreError: 22159:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 2 previous similar messages
[1378878.343542] LNet: Service thread pid 22157 was inactive for 313.10s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[1378878.356579] LNet: Skipped 5 previous similar messages
[1378878.361831] LustreError: dumping log to /tmp/lustre-log.1551393867.22157
[1378938.760938] LNet: Service thread pid 21779 was inactive for 362.88s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[1378938.778069] LNet: Skipped 3 previous similar messages
[1378938.783305] Pid: 21779, comm: mdt01_015 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
[1378938.793247] Call Trace:
[1378938.795883] [<ffffffffc12340bd>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
[1378938.802998] [<ffffffffc123612f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
[1378938.810278] [<ffffffffc12389ee>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
[1378938.817141] [<ffffffffc16507a2>] osp_md_object_lock+0x162/0x2d0 [osp]
[1378938.823899] [<ffffffffc1a7be03>] lod_object_lock+0xf3/0x7b0 [lod]
[1378938.830299] [<ffffffffc1afcd3e>] mdd_object_lock+0x3e/0xe0 [mdd]
[1378938.836613] [<ffffffffc199b2e1>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
[1378938.844053] [<ffffffffc199b87a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
[1378938.850995] [<ffffffffc19bb022>] mdt_reint_rename_or_migrate.isra.51+0x362/0x860 [mdt]
[1378938.859211] [<ffffffffc19bb553>] mdt_reint_rename+0x13/0x20 [mdt]
[1378938.865606] [<ffffffffc19bb5e3>] mdt_reint_rec+0x83/0x210 [mdt]
[1378938.871828] [<ffffffffc1998133>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[1378938.878586] [<ffffffffc19a3497>] mdt_reint+0x67/0x140 [mdt]
[1378938.884456] [<ffffffffc12d135a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[1378938.891594] [<ffffffffc127592b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1378938.899485] [<ffffffffc127925c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[1378938.905998] [<ffffffff8e8c1c31>] kthread+0xd1/0xe0
[1378938.911090] [<ffffffff8ef74c24>] ret_from_fork_nospec_begin+0xe/0x21
[1378938.917767] [<ffffffffffffffff>] 0xffffffffffffffff
[1378938.922966] LustreError: dumping log to /tmp/lustre-log.1551393927.21779
The MDT restart occurred at ~15:47. I was able to gather some logs:
Thanks! |
| Comment by Patrick Farrell (Inactive) [ 01/Mar/19 ] |
|
Huh! Stephane, this is a different issue. Can you open a ticket for this with these logs? Looks like a DNE bug of some kind - the MDTs are having connection problems between one another. |
| Comment by Stephane Thiell [ 01/Mar/19 ] |
|
Doh. Sure, done: |
| Comment by Stephane Thiell [ 05/Mar/19 ] |
|
To come back to our original issue: it happened again today. A node crashed (or rebooted, I'm not sure); anyway, the whole filesystem was blocked for ~2 minutes, including the Robinhood server (scan completely stopped). This happened today between Mar 04 15:58 and Mar 04 16:00.
MDS fir-md1-s2 (MDT0001 / MDT0003) logs:
Mar 04 15:58:34 fir-md1-s2 kernel: Lustre: 55428:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1551743903/real 1551743903] req@ffff9e1744f16000 x1627086082608288/t0(0) o104->fir-MDT0003@10.8.15.4@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1551743914 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Mar 04 15:58:34 fir-md1-s2 kernel: Lustre: 55428:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 58 previous similar messages
Mar 04 15:58:45 fir-md1-s2 kernel: Lustre: 55428:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1551743914/real 1551743914] req@ffff9e1744f16000 x1627086082608288/t0(0) o104->fir-MDT0003@10.8.15.4@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1551743925 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Mar 04 15:59:07 fir-md1-s2 kernel: Lustre: 55428:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1551743936/real 1551743936] req@ffff9e1744f16000 x1627086082608288/t0(0) o104->fir-MDT0003@10.8.15.4@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1551743947 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Mar 04 15:59:07 fir-md1-s2 kernel: Lustre: 55428:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Mar 04 15:59:40 fir-md1-s2 kernel: Lustre: 55428:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1551743969/real 1551743969] req@ffff9e1744f16000 x1627086082608288/t0(0) o104->fir-MDT0003@10.8.15.4@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1551743980 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Mar 04 15:59:40 fir-md1-s2 kernel: Lustre: 55428:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Mar 04 16:00:40 fir-md1-s2 kernel: Lustre: fir-MDT0001: haven't heard from client 76418224-15e7-b0fc-3a21-b63821db4ba2 (at 10.8.15.4@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9e0ccbc29400, cur 1551744040 expire 1551743890 last 1551743813
Mar 04 16:00:40 fir-md1-s2 kernel: Lustre: Skipped 1 previous similar message
Mar 04 16:00:44 fir-md1-s2 kernel: Lustre: DEBUG MARKER: Mon Mar 4 16:00:44 2019
I'm not sure why the other MDS, fir-md1-s1 (MDT0000/MDT0002), didn't complain at all. I understand that locking could do such a thing, but we've never seen that with previous versions of Lustre, and this stops ALL metadata operations on the filesystem, globally. |
| Comment by Stephane Thiell [ 12/Mar/19 ] |
|
Again, the filesystem was blocked while a node crashed:
Mar 12 12:06:02 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552417555/real 1552417555] req@ffff8b6c796a4b00 x162781588295
Mar 12 12:06:16 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552417569/real 1552417569] req@ffff8b6c796a4b00 x162781588295
Mar 12 12:06:16 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Mar 12 12:06:37 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552417590/real 1552417590] req@ffff8b6c796a4b00 x162781588295
Mar 12 12:06:37 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Mar 12 12:07:12 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552417625/real 1552417625] req@ffff8b6c796a4b00 x162781588295
Mar 12 12:07:12 fir-md1-s1 kernel: Lustre: 31329:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Mar 12 12:08:01 fir-md1-s1 kernel: LustreError: 31329:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.8.10.29@o2ib6) failed to reply to blocking AST (req@ffff8b6c796a4b00 x1627815882950880 status
Mar 12 12:08:01 fir-md1-s1 kernel: LustreError: 138-a: fir-MDT0002: A client on nid 10.8.10.29@o2ib6 was evicted due to a lock blocking callback time out: rc -110
Mar 12 12:08:01 fir-md1-s1 kernel: LustreError: 30507:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 154s: evicting client at 10.8.10.29@o2ib6 ns: mdt-fir-MDT0002_UUID lock: ffff
During that time, we couldn't ls anywhere on /fir (for about 2 minutes, tested from several nodes and subdirectories). Is there a timeout we could lower to mitigate this? |
| Comment by Patrick Farrell (Inactive) [ 12/Mar/19 ] |
|
Stephane, Are you able to dump the lock tables? Remember that when you did that before, it was actually that DOM/rename related hang ( I will take a look at timeouts, this should be adjustable. |
| Comment by Stephane Thiell [ 12/Mar/19 ] |
|
Hi Patrick! We have Lai's patch for
I'll try to dump the lock tables next time this happens. Thanks! |
| Comment by Stephane Thiell [ 12/Mar/19 ] |
|
That wasn't very long... the client was 10.8.20.15@o2ib6 in that case.
fir-md1-s1 (MDT0000, MDT0002):
Mar 12 12:38:48 fir-md1-s1 kernel: Lustre: 31416:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552419520/real 1552419520] req@ffff8b8c3445ef00 x1627815889426544/t0(0) o106->fir-MDT0000@10.8.20.15@o2ib6:15/16 lens 296/280 e 0 to 1 dl 1552419527 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Mar 12 12:38:48 fir-md1-s1 kernel: Lustre: 31416:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 346 previous similar messages
Mar 12 12:39:30 fir-md1-s1 kernel: LustreError: 34992:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.8.20.15@o2ib6) failed to reply to blocking AST (req@ffff8b798eedd100 x1627815889391920 status 0 rc -110), evict it ns: mdt-fir-MDT0000_UUID lock: ffff8b677f2218c0/0x4cd4038d909bca08 lrc: 4/0,0 mode: PR/PR res: [0x20000560a:0x5ec5:0x0].0x0 bits 0x13/0x0 rrc: 56 type: IBT flags: 0x60200400000020 nid: 10.8.20.15@o2ib6 remote: 0x282f29ed39ed2a33 expref: 78 pid: 31590 timeout: 264509 lvb_type: 0
Mar 12 12:39:30 fir-md1-s1 kernel: LustreError: 138-a: fir-MDT0000: A client on nid 10.8.20.15@o2ib6 was evicted due to a lock blocking callback time out: rc -110
Mar 12 12:39:30 fir-md1-s1 kernel: LustreError: 30507:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 154s: evicting client at 10.8.20.15@o2ib6 ns: mdt-fir-MDT0000_UUID lock: ffff8b677f2218c0/0x4cd4038d909bca08 lrc: 3/0,0 mode: PR/PR res: [0x20000560a:0x5ec5:0x0].0x0 bits 0x13/0x0 rrc: 56 type: IBT flags: 0x60200400000020 nid: 10.8.20.15@o2ib6 remote: 0x282f29ed39ed2a33 expref: 79 pid: 31590 timeout: 0 lvb_type: 0
Mar 12 12:40:23 fir-md1-s1 kernel: Lustre: fir-MDT0002: haven't heard from client 9e9c0882-3f15-dd96-7d4f-45564e48c021 (at 10.8.20.15@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8b6b55711000, cur 1552419623 expire 1552419473 last 1552419396
Mar 12 12:40:23 fir-md1-s1 kernel: Lustre: Skipped 3 previous similar messages
fir-md1-s2 (MDT0001, MDT0003):
Mar 12 12:42:45 fir-md1-s2 kernel: Lustre: 31774:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1552419753/real 1552419753] req@ffff8eb94fb2f500 x1627816490690992/t0(0) o104->fir-MDT0003@10.8.20.15@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1552419764 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Mar 12 12:42:45 fir-md1-s2 kernel: Lustre: 31774:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Mar 12 12:43:29 fir-md1-s2 kernel: LustreError: 31774:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.8.20.15@o2ib6) returned error from blocking AST (req@ffff8eb94fb2f500 x1627816490690992 status -107 rc -107), evict it ns: mdt-fir-MDT0003_UUID lock: ffff8edbad0469c0/0x60006bf677e1ceee lrc: 4/0,0 mode: PR/PR res: [0x280000401:0x5:0x0].0x0 bits 0x13/0x0 rrc: 821 type: IBT flags: 0x60200400000020 nid: 10.8.20.15@o2ib6 remote: 0x282f29ed39ed259b expref: 7 pid: 31773 timeout: 264501 lvb_type: 0
Mar 12 12:43:29 fir-md1-s2 kernel: LustreError: 138-a: fir-MDT0003: A client on nid 10.8.20.15@o2ib6 was evicted due to a lock blocking callback time out: rc -107
Mar 12 12:43:29 fir-md1-s2 kernel: LustreError: 30984:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 56s: evicting client at 10.8.20.15@o2ib6 ns: mdt-fir-MDT0003_UUID lock: ffff8edbad0469c0/0x60006bf677e1ceee lrc: 3/0,0 mode: PR/PR res: [0x280000401:0x5:0x0].0x0 bits 0x13/0x0 rrc: 821 type: IBT flags: 0x60200400000020 nid: 10.8.20.15@o2ib6 remote: 0x282f29ed39ed259b expref: 8 pid: 31773 timeout: 0 lvb_type: 0
Mar 12 12:43:39 fir-md1-s2 kernel: Lustre: fir-MDT0001: haven't heard from client 9e9c0882-3f15-dd96-7d4f-45564e48c021 (at 10.8.20.15@o2ib6) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8ed84f2d8c00, cur 1552419819 expire 1552419669 last 1552419592
I think I got the MDT lock table dumps during that time! The filesystem just unblocked while I was finalizing the dump on fir-md1-s2. Attaching fir-md1-s1_dlmtrace_20190312.log.gz and fir-md1-s2_dlmtrace_20190312.log.gz. Thx. |
| Comment by Patrick Farrell (Inactive) [ 12/Mar/19 ] |
|
Stephane, Have you changed any settings related to adaptive timeouts (AT_MIN, AT_MAX) or obd_timeout? (Not suggesting you change these yet - Just asking.) |
| Comment by Patrick Farrell (Inactive) [ 12/Mar/19 ] |
|
And thank you for the logs, will investigate. |
| Comment by Patrick Farrell (Inactive) [ 12/Mar/19 ] |
|
Would you be able to tar up and attach dmesg from all your MDSes? |
| Comment by Patrick Farrell (Inactive) [ 12/Mar/19 ] |
|
Stephane,
Looking into the timeouts: you can tweak them, but it's a pretty big hammer. We don't want to set these too low or it may cause trouble. Let's aim for 30 seconds, and only on the MDSes. This should reduce your hang time from 150 seconds to 30, and 30 should still be plenty of time. |
| Comment by Stephane Thiell [ 12/Mar/19 ] |
|
Thanks! Added that to our /etc/modprobe.d/lustre.conf:
# LU-11989
options ptlrpc ldlm_enqueue_min=30
options ptlrpc at_max=20
And restarted the MGS + MDTs.
$ clush -w@mds -b cat /sys/module/ptlrpc/parameters/ldlm_enqueue_min
---------------
fir-md1-s[1-2] (2)
---------------
30
$ clush -w@mds -b cat /sys/module/ptlrpc/parameters/at_max
---------------
fir-md1-s[1-2] (2)
---------------
20 |
| Comment by Patrick Farrell (Inactive) [ 12/Mar/19 ] |
|
LU-12064 has been opened to look at possible adjustments to the adaptive timeout code. Note this is not a fix for this issue, just something to hopefully reduce delays. |
| Comment by Stephane Thiell [ 19/Apr/19 ] |
|
Just wanted to follow up on this one: today we noticed the same issue against regal, our 2.8-based old scratch filesystem, mounted under /regal, after rebooting a few compute nodes. We have never seen this behavior before upgrading our clients to 2.12, so we strongly suspect some regression there. It would be weird if we had missed that kind of behavior for several years (and it's not like we're not rebooting nodes).
Server (2.8) logs:
Apr 19 17:38:56 regal-mds1 kernel: Lustre: regal-MDT0000: haven't heard from client e07871c1-593b-18ef-bb42-04c1e04c4db9 (at 10.8.8.24@o2ib6) in 219 seconds. I think it's dead, and I am evicting it. exp ffff881bdd5ef000, cur 1555713536 expire 1555713386 last 1555713317
Apr 19 17:38:56 regal-mds1 kernel: Lustre: Skipped 10 previous similar messages
Apr 19 17:42:40 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713749/real 1555713749] req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713760 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Apr 19 17:42:51 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713760/real 1555713760] req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713771 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 19 17:43:02 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713771/real 1555713771] req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713782 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 19 17:43:13 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713782/real 1555713782] req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713793 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 19 17:43:24 regal-mds1 kernel: Lustre: 34779:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1555713793/real 1555713793] req@ffff88133e810080 x1627973588803528/t0(0) o104->regal-MDT0000@10.8.27.23@o2ib6:15/16 lens 296/224 e 0 to 1 dl 1555713804 ref 1 fl Rpc:X/2/ffffffff rc 0/-1
Apr 19 17:43:24 regal-mds1 kernel: LustreError: 34779:0:(ldlm_lockd.c:689:ldlm_handle_ast_error()) ### client (nid 10.8.27.23@o2ib6) returned error from blocking AST (req status -107 rc -107), evict it ns: mdt-regal-MDT0000_UUID lock: ffff880223094d00/0x6c9a0352c4202513 lrc: 4/0,0 mode: PR/PR res: [0x2000772e5:0x8a:0x0].0x0 bits 0x13 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.8.27.23@o2ib6 remote: 0xd278f5eabd47f9e6 expref: 14 pid: 35256 timeout: 7453544476 lvb_type: 0
Apr 19 17:43:24 regal-mds1 kernel: LustreError: 138-a: regal-MDT0000: A client on nid 10.8.27.23@o2ib6 was evicted due to a lock blocking callback time out: rc -107
My colleague had the following command blocked during that time:
[root@sh-hn01 regal]# mkdir /regal/.deleted/hasantos/
But there were no logs on the client (2.12). Access to /regal was restored after a few minutes. |
| Comment by Stephane Thiell [ 08/May/19 ] |
|
We still had the issue this morning; it happened while Kilian was creating a bunch of new accounts (mkdir) and nodes were also being rebooted. The mkdirs were done in /fir/users and /fir/groups, which are DNE2-enabled. Not everything was blocked, but find in /fir/users and /fir/groups was hanging on all compute nodes. However, I was able to browse /fir/users/sthiell/ without issue (DNE1 only).
Attaching full kernel logs from both MDS as fir-md1-s1-20190508.log
After a few minutes, things went back to normal. |
| Comment by Stephane Thiell [ 15/Jul/19 ] |
|
This issue is still ongoing, and it still seems to happen when we use mkdir with DNEv2 (each time we hit this issue, there is a mkdir using DNEv2 running on sh-hn01 (10.9.0.1@o2ib4))... The full filesystem was blocked again today for more than 10 minutes while a mkdir was running there. There are some backtraces like this when things started to recover:
Jul 15 15:34:26 fir-md1-s1 kernel: LNet: Service thread pid 23077 was inactive for 200.25s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jul 15 15:34:26 fir-md1-s1 kernel: LNet: Skipped 2 previous similar messages
Jul 15 15:34:26 fir-md1-s1 kernel: Pid: 23077, comm: mdt02_042 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Fri Dec 7 14:50:35 PST 2018
Jul 15 15:34:26 fir-md1-s1 kernel: Call Trace:
Jul 15 15:34:26 fir-md1-s1 kernel: [<ffffffffc0d92e55>] ldlm_completion_ast+0x4e5/0x890 [ptlrpc]
Jul 15 15:34:26 fir-md1-s1 kernel: [<ffffffffc0d93cbc>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
Jul 15 15:34:26 fir-md1-s1 kernel: [<ffffffffc14744bb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Jul 15 15:34:26 fir-md1-s1 kernel: [<ffffffffc1474b40>] mdt_object_lock_internal+0x70/0x3e0 [mdt]
Jul 15 15:34:26 fir-md1-s1 kernel: [<ffffffffc1474f0c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
Jul 15 15:34:26 fir-md1-s1 kernel: [<ffffffffc148c44c>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
Jul 15 15:34:26 fir-md1-s1 kernel: [<ffffffffc148fd88>] mdt_reint_setattr+0x6c8/0x1340 [mdt]
Jul 15 15:34:26 fir-md1-s1 kernel: [<ffffffffc1491ee3>] mdt_reint_rec+0x83/0x210 [mdt]
Jul 15 15:34:26 fir-md1-s1 kernel: [<ffffffffc1470143>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Jul 15 15:34:26 fir-md1-s1 kernel: [<ffffffffc147b4a7>] mdt_reint+0x67/0x140 [mdt]
Jul 15 15:34:26 fir-md1-s1 kernel: [<ffffffffc0e3073a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
Jul 15 15:34:26 fir-md1-s1 kernel: [<ffffffffc0dd4d0b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Jul 15 15:34:26 fir-md1-s1 kernel: [<ffffffffc0dd863c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
Jul 15 15:34:26 fir-md1-s1 kernel: [<ffffffff84ec1c31>] kthread+0xd1/0xe0
Jul 15 15:34:26 fir-md1-s1 kernel: [<ffffffff85574c24>] ret_from_fork_nospec_begin+0xe/0x21
Jul 15 15:34:26 fir-md1-s1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Jul 15 15:34:26 fir-md1-s1 kernel: LustreError: dumping log to /tmp/lustre-log.1563230066.23077
Jul 15 15:34:26 fir-md1-s1 kernel: LustreError: 20378:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 29s: evicting client at 10.8.9.9@o2ib6 ns: mdt-fir-MDT0000_UUID lock: ffff8f31c0f9e540/0x5d9ee640b61b6f03 lrc: 3/0,0 mode: PR/PR res: [0x2000297f6:0x882:0x0].0x0 bits 0x5b/0x0 rrc: 8 type: IBT flags: 0x60200400000020 nid: 10.8.9.9@o2ib6 remote: 0x243de7f26418e7fc expref: 16849 pid: 25680 timeout: 2345126 lvb_type: 0
Jul 15 15:34:26 fir-md1-s1 kernel: LustreError: 20378:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 10 previous similar messages
Jul 15 15:34:36 fir-md1-s1 kernel: LNet: Service thread pid 23077 completed after 209.64s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
I'm attaching the MDS logs from `fir-md1-s1` as fir-md1-s1_20190715.log. WARNING: this is the full kernel log since the last MDS restart, June 18, 2019!!! Not too bad for 2.12!! Our event started on July 15, 2019 at about 15:20 in the log. The first user ticket arrived at 15:24.
The mkdir that was blocked during this event was for the following directory, automatically placed on MDT0, and it eventually succeeded:
[root@sh-hn01 sthiell.root]# lfs getdirstripe /fir/users/kmoulin
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
As a reminder:
[root@sh-hn01 sthiell.root]# lfs getdirstripe /fir/users
lmv_stripe_count: 4 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx FID[seq:oid:ver]
0 [0x200000400:0x5:0x0]
1 [0x240000402:0x5:0x0]
2 [0x2c0000400:0x5:0x0]
3 [0x280000401:0x5:0x0]
The good thing is that the MDS recovered by itself, but it took some time... at least 10 minutes.
Thanks,
|
| Comment by Jesse Stroik [ 27/Nov/19 ] |
|
I've been observing something similar when we have clients that mistakenly associated with the MDS using an NID behind a NAT. In this example, nodes came up with LNet bound to a private cluster interface instead of an InfiniBand interface. With a Lustre 2.10 client and a 2.12 server, we can reproduce this behavior easily. The filesystem hangs for all clients. A relevant snippet from our logs is included - let me know if I can provide any additional information.
Nov 26 22:04:36 apollo-mds kernel: [4167831.563633] Lustre: 62902:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1574805865/real 1574805865] req@ffff8bc8d0153a80 x1646935336777456/t0(0) o104->apollo-MDT0000@10.23.255.254@tcp:15/16 lens 296/224 e 0 to 1 dl 1574805876 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Nov 26 22:04:36 apollo-mds kernel: [4167831.563643] Lustre: 62902:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 94 previous similar messages
Nov 26 22:07:12 apollo-mds kernel: [4167986.760003] Lustre: apollo-MDT0000: Connection restored to 2e47e1dd-58f1-ea9e-4af1-7b2ae2de0c79 (at 10.23.255.254@tcp)
Nov 26 22:07:12 apollo-mds kernel: [4167986.760011] Lustre: Skipped 22 previous similar messages
Nov 26 22:09:33 apollo-mds kernel: [4168128.561768] LustreError: 77047:0:(ldlm_lockd.c:1357:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8bc6fcff4800 ns: mdt-apollo-MDT0000_UUID lock: ffff8bce53dcc800/0xe6373f3e00450a97 lrc: 3/0,0 mode: PR/PR res: [0x2000013a1:0x1:0x0].0x0 bits 0x13/0x0 rrc: 19 type: IBT flags: 0x50200400000020 nid: 10.23.255.218@tcp remote: 0x195afc50b177c9d2 expref: 6 pid: 77047 timeout: 0 lvb_type: 0
Nov 26 22:14:24 apollo-mds kernel: [4168419.322349] LustreError: 59085:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.23.255.218@tcp) failed to reply to blocking AST (req@ffff8bb79b703a80 x1646935336873728 status 0 rc -110), evict it ns: mdt-apollo-MDT0000_UUID lock: ffff8bc53a36e9c0/0xe6373f3e00514e3a lrc: 4/0,0 mode: PR/PR res: [0x20000c990:0x2302:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 10.23.255.218@tcp remote: 0x195afc50b177caff expref: 531 pid: 67366 timeout: 4168744 lvb_type: 0 |
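For anyone hitting the same thing, a quick way to spot the misconfiguration would be something like the following (interface names are examples; the exact LNet config depends on the site):
# on the client: check which NIDs LNet is actually using
lctl list_nids                       # should show the expected o2ib/tcp NID, not the NAT'd private address
# on the MDS: see which NIDs the connected client exports are registered under
lctl get_param mdt.*.exports.*.uuid
# the usual fix is to pin LNet to the right interface in /etc/modprobe.d/lustre.conf, e.g.:
#   options lnet networks="o2ib(ib0)"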
| Comment by Olaf Faaland [ 25/Feb/20 ] |
|
Stephane, are you still seeing these long hangs? Asking because we aren't yet using DNE2 but would like to (we originally intended to with Lustre 2.8 but decided against it, and are now revisiting the idea). |
| Comment by Stephane Thiell [ 25/Feb/20 ] |
|
Hi Olaf – No, we haven't seen such long hangs for quite a while now, which is a good thing. They were very frequent when using mkdir with DNEv2, but the issue seems to have been resolved. We're still running 2.12.3 + a few patches (server-side), but they have all landed in 2.12.4. |