[LU-13500] Client gets evicted - nfsd non-standard errno -108 Created: 01/May/20 Updated: 03/Nov/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Campbell Mcleay (Inactive) | Assignee: | Oleg Drokin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The client is getting evicted by the MDT as soon as the nfsd service is started on the client.

client (golf1) kernel version: 3.10.0-1062.1.1.el7_lustre.x86_64
mds (gmds1) kernel version: 3.10.0-1062.1.1.el7_lustre.x86_64
oss (goss1-goss6) kernel version: 3.10.0-1062.1.1.el7_lustre.x86_64

/etc/exports on golf1:
/user_data 10.25.0.0/16(fsid=123456789,rw,anonuid=0,insecure,no_subtree_check,insecure_locks,async)

Apr 30 14:03:07 golf1 kernel: LustreError: 11-0: golf-MDT0000-mdc-ffff973bf6409800: operation ldlm_enqueue to node 10.25.22.90@tcp failed: rc = -107
Apr 30 14:03:07 golf1 kernel: Lustre: golf-MDT0000-mdc-ffff973bf6409800: Connection to golf-MDT0000 (at 10.25.22.90@tcp) was lost; in progress operations using this service will wait for recovery to complete
Apr 30 14:03:07 golf1 kernel: LustreError: Skipped 8 previous similar messages
Apr 30 14:03:07 golf1 kernel: LustreError: 167-0: golf-MDT0000-mdc-ffff973bf6409800: This client was evicted by golf-MDT0000; in progress operations using this service will fail.
Apr 30 14:03:07 golf1 kernel: LustreError: 25491:0:(file.c:4339:ll_inode_revalidate_fini()) golf: revalidate FID [0x20004884e:0x16:0x0] error: rc = -5
Apr 30 14:03:07 golf1 kernel: ------------[ cut here ]------------
Apr 30 14:03:07 golf1 kernel: WARNING: CPU: 26 PID: 25600 at fs/nfsd/nfsproc.c:805 nfserrno+0x58/0x70 [nfsd]
Apr 30 14:03:07 golf1 kernel: LustreError: 25579:0:(file.c:216:ll_close_inode_openhandle()) golf-clilmv-ffff973bf6409800: inode [0x20004884e:0x15:0x0] mdc close failed: rc = -108
Apr 30 14:03:07 golf1 kernel: ------------[ cut here ]------------
Apr 30 14:03:07 golf1 kernel: ------------[ cut here ]------------
Apr 30 14:03:07 golf1 kernel: nfsd: non-standard errno: -108
Apr 30 14:03:07 golf1 kernel: ------------[ cut here ]------------
Apr 30 14:03:07 golf1 kernel: WARNING: CPU: 54 PID: 25602 at fs/nfsd/nfsproc.c:805 nfserrno+0x58/0x70 [nfsd]
Apr 30 14:03:07 golf1 kernel: LustreError: 25579:0:(file.c:216:ll_close_inode_openhandle()) Skipped 2 previous similar messages
Apr 30 14:03:07 golf1 kernel: WARNING: CPU: 24 PID: 25601 at fs/nfsd/nfsproc.c:805 nfserrno+0x58/0x70 [nfsd]
Apr 30 14:03:07 golf1 kernel: WARNING: CPU: 9 PID: 25505 at fs/nfsd/nfsproc.c:805 nfserrno+0x58/0x70 [nfsd] |
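For reference, a minimal way to reproduce and watch the failure on the client (assuming nfs-server is the systemd unit providing nfsd on this EL7 node; errno 108 is ESHUTDOWN):
# start the NFS server that triggers the eviction
systemctl start nfs-server
# follow the kernel log for the eviction and the resulting -108 close failures
dmesg -wT | grep -E 'evicted|errno|close failed'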
| Comments |
| Comment by Oleg Drokin [ 01/May/20 ] |
|
This only tells us that the client reexporting Lustre as NFS was evicted, but not why. Can you please provide MDT side log that lists the reason for eviction? |
| Comment by Campbell Mcleay (Inactive) [ 02/May/20 ] |
|
Hi, Apr 30 14:03:05 gmds1 kernel: LustreError: 14058:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1588235285, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-golf-MDT0000_UUID lock: ffffa0b090a560c0/0xa9eccc8009893f78 lrc: 3/1,0 mode: --/PR res: [0x20004845a:0x1c19e:0x0].0x0 bits 0x13/0x40 rrc: 33 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 14058 timeout: 0 lvb_type: 0 Apr 30 14:03:05 gmds1 kernel: LustreError: 14058:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 5 previous similar messages Apr 30 14:03:07 gmds1 kernel: LustreError: 8523:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 10.25.22.97@tcp ns: mdt-golf-MDT0000_UUID lock: ffffa0a39e5d1680/0xa9eccc800d78ab44 lrc: 3/0,0 mode: CW/CW res: [0x20004884e:0x1c:0x0].0x0 bits 0x5/0x0 rrc: 20 type: IBT flags: 0x60200400000020 nid: 10.25.22.97@tcp remote: 0xda4d2a0ced4730c9 expref: 14511 pid: 8126 timeout: 261303 lvb_type: 0 Apr 30 14:03:07 gmds1 kernel: LustreError: 18657:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed export ffffa0bb7a50fc00 ns: mdt-golf-MDT0000_UUID lock: ffffa07d7513c380/0xa9eccc800d78c204 lrc: 3/0,0 mode: PR/PR res: [0x20004884e:0x16:0x0].0x0 bits 0x13/0x0 rrc: 10 type: IBT flags: 0x50200000000000 nid: 10.25.22.97@tcp remote: 0xda4d2a0ced4734d5 expref: 14509 pid: 18657 timeout: 0 lvb_type: 0 Apr 30 14:03:07 gmds1 kernel: Lustre: golf-MDT0000: Connection restored to (at 10.25.22.97@tcp) Apr 30 14:03:07 gmds1 kernel: Lustre: Skipped 1 previous similar message Apr 30 14:03:07 gmds1 kernel: LustreError: 18657:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) Skipped 31 previous similar messages Apr 30 14:03:08 gmds1 kernel: LustreError: 17659:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed export ffffa0a3eb454800 ns: mdt-golf-MDT0000_UUID lock: ffffa093c8c60240/0xa9eccc7eb0bd313c lrc: 3/0,0 mode: CW/CW res: [0x20004845a:0x1c19e:0x0].0x0 bits 0xd/0x0 rrc: 34 type: IBT flags: 0x50200400000020 nid: 10.25.22.97@tcp remote: 0x6cc2fba1e568f302 expref: 9 pid: 17659 timeout: 0 lvb_type: 0 Apr 30 14:03:08 gmds1 kernel: LustreError: 17659:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) Skipped 6 previous similar messages Apr 30 14:03:08 gmds1 kernel: Lustre: 17659:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (326:10967s); client may timeout. req@ffffa0a10ad09b00 x1665373519819424/t1069384853821(0) o101->62a78c6c-22f2-ded7-adb4-f014db8f51f3@10.25.22.97@tcp:581/0 lens 648/560 e 5 to 0 dl 1588224621 ref 1 fl Complete:/0/0 rc -107/-107 Apr 30 14:03:08 gmds1 kernel: Lustre: 17659:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 21 previous similar messages Apr 30 14:03:08 gmds1 kernel: LNet: Service thread pid 17659 completed after 11292.30s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
Apr 30 14:03:08 gmds1 kernel: LNet: Skipped 22 previous similar messages Apr 30 14:05:38 gmds1 kernel: LustreError: 18656:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1588235438, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-golf-MDT0000_UUID lock: ffffa08c5d0186c0/0xa9eccc800d787abb lrc: 3/0,1 mode: --/CW res: [0x20004845a:0x1c19e:0x0].0x0 bits 0x5/0x8 rrc: 33 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 18656 timeout: 0 lvb_type: 0 Apr 30 14:05:38 gmds1 kernel: LustreError: 18656:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 1 previous similar message Apr 30 14:05:38 gmds1 kernel: LustreError: 8523:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 150s: evicting client at 10.25.22.97@tcp ns: mdt-golf-MDT0000_UUID lock: ffffa07bf5846e40/0xa9eccc801136d991 lrc: 3/0,0 mode: CW/CW res: [0x200048755:0xb4ea:0x0].0x0 bits 0xd/0x0 rrc: 5 type: IBT flags: 0x60200400000020 nid: 10.25.22.97@tcp remote: 0xda4d2a0ced475b16 expref: 65 pid: 17361 timeout: 261454 lvb_type: 0 Apr 30 14:05:38 gmds1 kernel: LustreError: 17658:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed export ffffa0ad91ae8400 ns: mdt-golf-MDT0000_UUID lock: ffffa08429e50900/0xa9eccc801136dae8 lrc: 3/0,0 mode: PR/PR res: [0x20004884e:0x1b:0x0].0x0 bits 0x13/0x0 rrc: 11 type: IBT flags: 0x50200400000020 nid: 10.25.22.97@tcp remote: 0xda4d2a0ced475bbe expref: 59 pid: 17658 timeout: 0 lvb_type: 0 Apr 30 14:05:38 gmds1 kernel: Lustre: 15351:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (600:5657s); client may timeout. req@ffffa0a0d9c6ba80 x1665320672855920/t0(0) o101->2550d87b-664b-5ce0-ccf3-fa246e43162f@10.25.22.98@tcp:1/0 lens 584/592 e 24 to 0 dl 1588230081 ref 1 fl Complete:/0/0 rc 0/0 Apr 30 14:05:38 gmds1 kernel: Lustre: 15351:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 2 previous similar messages Apr 30 14:05:38 gmds1 kernel: LNet: Service thread pid 15351 completed after 6257.04s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
Apr 30 14:05:38 gmds1 kernel: LNet: Skipped 2 previous similar messages Apr 30 14:05:38 gmds1 kernel: LustreError: 17658:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) Skipped 18 previous similar messages Apr 30 14:07:18 gmds1 kernel: LustreError: 8523:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 10.25.22.98@tcp ns: mdt-golf-MDT0000_UUID lock: ffffa08f3ed36e40/0xa9eccc7f66124799 lrc: 3/0,0 mode: PR/PR res: [0x20004845a:0x1c19e:0x0].0x0 bits 0x13/0x0 rrc: 33 type: IBT flags: 0x60200400000020 nid: 10.25.22.98@tcp remote: 0xc9202f42b18ed159 expref: 726586 pid: 15351 timeout: 261554 lvb_type: 0 Apr 30 14:07:28 gmds1 kernel: LustreError: 23173:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed export ffffa0b019f7d800 ns: mdt-golf-MDT0000_UUID lock: ffffa08eefbb8900/0xa9eccc7ff25f4466 lrc: 3/0,0 mode: CW/CW res: [0x20004845a:0x1c19e:0x0].0x0 bits 0xd/0x0 rrc: 29 type: IBT flags: 0x50200400000020 nid: 10.25.22.97@tcp remote: 0x1b7af78dd9b57628 expref: 21 pid: 23173 timeout: 0 lvb_type: 0 Apr 30 14:07:28 gmds1 kernel: LustreError: 23173:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) Skipped 7 previous similar messages Apr 30 14:07:28 gmds1 kernel: Lustre: 10894:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (716:710s); client may timeout. req@ffffa0ac5d72f080 x1665374426896992/t1069387277830(0) o101->3f4f5fc2-0361-6277-7d3a-55b6e0508d3a@10.25.22.97@tcp:528/0 lens 648/560 e 0 to 0 dl 1588235138 ref 1 fl Complete:/0/0 rc -107/-107 Apr 30 14:07:28 gmds1 kernel: Lustre: 10894:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 4 previous similar messages Apr 30 14:07:28 gmds1 kernel: LNet: Service thread pid 9608 completed after 1426.69s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). 
Apr 30 14:07:28 gmds1 kernel: LNet: Skipped 3 previous similar messages Apr 30 14:08:07 gmds1 kernel: LustreError: 13771:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1588235587, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-golf-MDT0000_UUID lock: ffffa09aebe38b40/0xa9eccc801131c48d lrc: 3/0,1 mode: --/CW res: [0x20004845a:0x1c19e:0x0].0x0 bits 0x5/0x8 rrc: 18 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 13771 timeout: 0 lvb_type: 0 Apr 30 14:08:07 gmds1 kernel: LustreError: 13771:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 1 previous similar message Apr 30 14:08:08 gmds1 kernel: LustreError: 8523:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 10.25.22.97@tcp ns: mdt-golf-MDT0000_UUID lock: ffffa07bf5845a00/0xa9eccc801551d004 lrc: 3/0,0 mode: CW/CW res: [0x20004884e:0x18:0x0].0x0 bits 0x5/0x0 rrc: 6 type: IBT flags: 0x60200400000020 nid: 10.25.22.97@tcp remote: 0xda4d2a0ced477f58 expref: 59 pid: 8117 timeout: 261604 lvb_type: 0 Apr 30 14:08:08 gmds1 kernel: LustreError: 17661:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed export ffffa0b0ee0c0400 ns: mdt-golf-MDT0000_UUID lock: ffffa099568e2640/0xa9eccc801551cf24 lrc: 1/0,0 mode: --/PR res: [0x200048755:0xb4ea:0x0].0x0 bits 0x13/0x0 rrc: 6 type: IBT flags: 0x54a01400000020 nid: 10.25.22.97@tcp remote: 0xda4d2a0ced477f3c expref: 53 pid: 17661 timeout: 0 lvb_type: 0 Apr 30 14:08:08 gmds1 kernel: Lustre: golf-MDT0000: Connection restored to (at 10.25.22.97@tcp) Apr 30 14:08:08 gmds1 kernel: Lustre: Skipped 2 previous similar messages Apr 30 14:08:08 gmds1 kernel: LustreError: 17661:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) Skipped 9 previous similar messages Apr 30 14:09:06 gmds1 dhclient[7002]: DHCPREQUEST on bond0 to 10.25.20.10 port 67 (xid=0x14eb9def) Apr 30 14:09:06 gmds1 dhclient[7002]: DHCPACK from 10.25.20.10 (xid=0x14eb9def) Apr 30 14:09:08 gmds1 dhclient[7002]: bound to 10.25.22.90 -- renewal in 18565 seconds. Apr 30 14:10:01 gmds1 systemd: Created slice User Slice of root. Apr 30 14:10:01 gmds1 systemd: Started Session 974 of user root. Apr 30 14:10:01 gmds1 systemd: Removed slice User Slice of root. Apr 30 14:13:08 gmds1 kernel: LustreError: 18651:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1588235888, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-golf-MDT0000_UUID lock: ffffa0a3d52d8900/0xa9eccc80189ae7ef lrc: 3/1,0 mode: --/PR res: [0x20004884e:0x15:0x0].0x0 bits 0x13/0x0 rrc: 32 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 18651 timeout: 0 lvb_type: 0 Apr 30 14:13:08 gmds1 kernel: LustreError: 18651:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) Skipped 30 previous similar messages Apr 30 14:13:21 gmds1 kernel: LNet: Service thread pid 17661 was inactive for 312.41s. The thread might be hung, or it might only be slow and will resume later. 
Dumping the stack trace for debugging purposes: Apr 30 14:13:21 gmds1 kernel: LNet: Skipped 4 previous similar messages Apr 30 14:13:21 gmds1 kernel: Pid: 17661, comm: mdt01_070 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Mon Oct 21 20:19:09 UTC 2019 Apr 30 14:13:21 gmds1 kernel: Call Trace: Apr 30 14:13:21 gmds1 kernel: [<ffffffffc154ab96>] ldlm_completion_ast+0x4e6/0x860 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc154b601>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137c50b>] mdt_object_local_lock+0x50b/0xb20 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137cb90>] mdt_object_lock_internal+0x70/0x360 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137cf37>] mdt_object_lock_try+0x27/0xb0 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137e677>] mdt_getattr_name_lock+0x1287/0x1c30 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1385d25>] mdt_intent_getattr+0x2b5/0x480 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1382bb5>] mdt_intent_policy+0x435/0xd80 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1531d56>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc155a366>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc15e2b02>] tgt_enqueue+0x62/0x210 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc15e92ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc158e29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1591bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffb98c50d1>] kthread+0xd1/0xe0 Apr 30 14:13:21 gmds1 kernel: [<ffffffffb9f8cd37>] ret_from_fork_nospec_end+0x0/0x39 Apr 30 14:13:21 gmds1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff Apr 30 14:13:21 gmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1588236201.17661 Apr 30 14:13:21 gmds1 kernel: Pid: 9332, comm: mdt01_005 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Mon Oct 21 20:19:09 UTC 2019 Apr 30 14:13:21 gmds1 kernel: Call Trace: Apr 30 14:13:21 gmds1 kernel: [<ffffffffc154ab96>] ldlm_completion_ast+0x4e6/0x860 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc154b601>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137c50b>] mdt_object_local_lock+0x50b/0xb20 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137cb90>] mdt_object_lock_internal+0x70/0x360 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137cf37>] mdt_object_lock_try+0x27/0xb0 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137e677>] mdt_getattr_name_lock+0x1287/0x1c30 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1385d25>] mdt_intent_getattr+0x2b5/0x480 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1382bb5>] mdt_intent_policy+0x435/0xd80 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1531d56>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc155a366>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc15e2b02>] tgt_enqueue+0x62/0x210 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc15e92ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc158e29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1591bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffb98c50d1>] kthread+0xd1/0xe0 Apr 30 14:13:21 gmds1 kernel: [<ffffffffb9f8cd37>] ret_from_fork_nospec_end+0x0/0x39 Apr 30 14:13:21 gmds1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff Apr 30 
14:13:21 gmds1 kernel: Pid: 56713, comm: mdt01_034 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Mon Oct 21 20:19:09 UTC 2019 Apr 30 14:13:21 gmds1 kernel: Call Trace: Apr 30 14:13:21 gmds1 kernel: [<ffffffffc154ab96>] ldlm_completion_ast+0x4e6/0x860 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc154b601>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137c50b>] mdt_object_local_lock+0x50b/0xb20 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137cb90>] mdt_object_lock_internal+0x70/0x360 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137cf37>] mdt_object_lock_try+0x27/0xb0 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137e677>] mdt_getattr_name_lock+0x1287/0x1c30 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1385d25>] mdt_intent_getattr+0x2b5/0x480 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1382bb5>] mdt_intent_policy+0x435/0xd80 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1531d56>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc155a366>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc15e2b02>] tgt_enqueue+0x62/0x210 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc15e92ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc158e29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1591bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffb98c50d1>] kthread+0xd1/0xe0 Apr 30 14:13:21 gmds1 kernel: [<ffffffffb9f8cd37>] ret_from_fork_nospec_end+0x0/0x39 Apr 30 14:13:21 gmds1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff Apr 30 14:13:21 gmds1 kernel: Pid: 17361, comm: mdt01_055 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Mon Oct 21 20:19:09 UTC 2019 Apr 30 14:13:21 gmds1 kernel: Call Trace: Apr 30 14:13:21 gmds1 kernel: [<ffffffffc154ab96>] ldlm_completion_ast+0x4e6/0x860 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc154b601>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137c50b>] mdt_object_local_lock+0x50b/0xb20 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137cb90>] mdt_object_lock_internal+0x70/0x360 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137cf37>] mdt_object_lock_try+0x27/0xb0 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137e677>] mdt_getattr_name_lock+0x1287/0x1c30 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1385d25>] mdt_intent_getattr+0x2b5/0x480 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1382bb5>] mdt_intent_policy+0x435/0xd80 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1531d56>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc155a366>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc15e2b02>] tgt_enqueue+0x62/0x210 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc15e92ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc158e29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1591bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffb98c50d1>] kthread+0xd1/0xe0 Apr 30 14:13:21 gmds1 kernel: [<ffffffffb9f8cd37>] ret_from_fork_nospec_end+0x0/0x39 Apr 30 14:13:21 gmds1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff Apr 30 14:13:21 gmds1 kernel: Pid: 14058, comm: mdt01_048 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Mon Oct 21 20:19:09 UTC 2019 Apr 30 14:13:21 gmds1 kernel: Call Trace: Apr 30 14:13:21 gmds1 kernel: [<ffffffffc154ab96>] 
ldlm_completion_ast+0x4e6/0x860 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc154b601>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137c50b>] mdt_object_local_lock+0x50b/0xb20 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137cb90>] mdt_object_lock_internal+0x70/0x360 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137cf37>] mdt_object_lock_try+0x27/0xb0 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc137e677>] mdt_getattr_name_lock+0x1287/0x1c30 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1385d25>] mdt_intent_getattr+0x2b5/0x480 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1382bb5>] mdt_intent_policy+0x435/0xd80 [mdt] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1531d56>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc155a366>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc15e2b02>] tgt_enqueue+0x62/0x210 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc15e92ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc158e29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffc1591bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc] Apr 30 14:13:21 gmds1 kernel: [<ffffffffb98c50d1>] kthread+0xd1/0xe0 Apr 30 14:13:21 gmds1 kernel: [<ffffffffb9f8cd37>] ret_from_fork_nospec_end+0x0/0x39 Apr 30 14:13:21 gmds1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff Apr 30 14:13:21 gmds1 kernel: LNet: Service thread pid 18647 was inactive for 312.41s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one. Apr 30 14:13:21 gmds1 kernel: LNet: Skipped 1 previous similar message Apr 30 14:18:03 gmds1 kernel: Lustre: 17829:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply#012 req@ffffa0b122e29f80 x1665385136351328/t0(0) o101->872face3-fe73-b814-5f50-65108dca215d@10.25.22.97@tcp:368/0 lens 576/3264 e 16 to 0 dl 1588236488 ref 2 fl Interpret:/0/0 rc 0/0 Apr 30 14:18:03 gmds1 kernel: Lustre: 17829:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 6 previous similar messages Apr 30 14:18:09 gmds1 kernel: Lustre: golf-MDT0000: Client 872face3-fe73-b814-5f50-65108dca215d (at 10.25.22.97@tcp) reconnecting Apr 30 14:18:09 gmds1 kernel: Lustre: golf-MDT0000: Connection restored to (at 10.25.22.97@tcp) Apr 30 14:18:18 gmds1 kernel: LNet: Service thread pid 13997 completed after 609.69s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). Apr 30 14:18:18 gmds1 kernel: LNet: Skipped 7 previous similar messages |
| Comment by Oleg Drokin [ 04/May/20 ] |
|
Thank you, hm, this looks like a lock timeout. Can you please collect Lustre debug logs on the MDS and the NFS-reexporting client when this is happening? Stop all other activity, then run this on both nodes:
lctl set_param debug=-1
lctl set_param debug_mb=10240   # this requires 10G of RAM; if you don't have that much, reduce this number some
lctl clear
Then start the nfsd on the client and wait until the eviction, then run this right away on both nodes:
lctl dk >/tmp/lustre.log
Collect the lustre.log from both nodes and make them available for download somehow. |
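If it is easier to drive from a single admin host, a minimal sketch of the same capture sequence (assuming passwordless ssh to gmds1 and golf1):
# arm full debug logging on both nodes
for node in gmds1 golf1; do
    ssh "$node" 'lctl set_param debug=-1; lctl set_param debug_mb=10240; lctl clear'
done
# ... start nfsd on the client and wait for the eviction ...
# then dump the in-memory debug buffer on both nodes
for node in gmds1 golf1; do
    ssh "$node" 'lctl dk > /tmp/lustre.log'
done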
| Comment by Campbell Mcleay (Inactive) [ 04/May/20 ] |
|
Hi, I have collected the logs and uploaded them to our FTP server. Host = ftp.dneg.com. Once connected, go into the From_Dneg folder. Please let us know if you are able to download the logs from the ftp.dneg.com FTP server. |
| Comment by Oleg Drokin [ 05/May/20 ] |
|
I am (somewhat slowly) getting these files, so access is OK, thanks. |
| Comment by Campbell Mcleay (Inactive) [ 06/May/20 ] |
|
Thanks Oleg for looking into it. I checked the mds and client logs for 4th May, it didn't evict the client at that time even if mount was not accessible. Have reproduced this issue again today when client got evicted. Today's logs are uploaded again on ftp under /from_dneg/06May2020/ mds logs evicting client May 6 11:52:48 gmds1 kernel: LNet: Service thread pid 17655 was inactive for 200.31s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debug ging purposes: May 6 11:52:48 gmds1 kernel: LNet: Skipped 3 previous similar messages May 6 11:52:48 gmds1 kernel: Pid: 17655, comm: mdt01_064 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Mon Oct 21 20:19:09 UTC 2019 May 6 11:52:48 gmds1 kernel: Call Trace: May 6 11:52:48 gmds1 kernel: [<ffffffffc154aae0>] ldlm_completion_ast+0x430/0x860 [ptlrpc] May 6 11:52:48 gmds1 kernel: [<ffffffffc154b601>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc] May 6 11:52:48 gmds1 kernel: [<ffffffffc137c50b>] mdt_object_local_lock+0x50b/0xb20 [mdt] May 6 11:52:48 gmds1 kernel: [<ffffffffc137cb90>] mdt_object_lock_internal+0x70/0x360 [mdt] May 6 11:52:48 gmds1 kernel: [<ffffffffc137cf37>] mdt_object_lock_try+0x27/0xb0 [mdt] May 6 11:52:48 gmds1 kernel: [<ffffffffc13a3128>] mdt_object_open_lock+0x688/0xac0 [mdt] May 6 11:52:48 gmds1 kernel: [<ffffffffc13a7967>] mdt_reint_open+0x15b7/0x3240 [mdt] May 6 11:52:48 gmds1 kernel: [<ffffffffc139b693>] mdt_reint_rec+0x83/0x210 [mdt] May 6 11:52:48 gmds1 kernel: [<ffffffffc13781b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt] May 6 11:52:48 gmds1 kernel: [<ffffffffc1384a92>] mdt_intent_open+0x82/0x3a0 [mdt] May 6 11:52:48 gmds1 kernel: [<ffffffffc1382bb5>] mdt_intent_policy+0x435/0xd80 [mdt] May 6 11:52:48 gmds1 kernel: [<ffffffffc1531d56>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc] May 6 11:52:48 gmds1 kernel: [<ffffffffc155a366>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc] May 6 11:52:48 gmds1 kernel: [<ffffffffc15e2b02>] tgt_enqueue+0x62/0x210 [ptlrpc] May 6 11:52:48 gmds1 kernel: [<ffffffffc15e92ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc] May 6 11:52:48 gmds1 kernel: [<ffffffffc158e29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] May 6 11:52:48 gmds1 kernel: [<ffffffffc1591bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc] May 6 11:52:48 gmds1 kernel: [<ffffffffb98c50d1>] kthread+0xd1/0xe0 May 6 11:52:48 gmds1 kernel: [<ffffffffb9f8cd37>] ret_from_fork_nospec_end+0x0/0x39 May 6 11:52:48 gmds1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff May 6 11:52:48 gmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1588746168.17655 May 6 11:52:53 gmds1 su: (to root) vch on pts/3 May 6 11:53:07 gmds1 kernel: LustreError: 8523:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.25.22.97@tcp ns: mdt-golf-MDT0000_UUID lock: ffffa0afb7a4b600/0xa9ecccc11a19b5be lrc: 3/0,0 mode: CW/CW res: [0x20004884e:0x1a:0x0].0x0 bits 0xd/0x0 rrc: 10 type: IBT flags: 0x60200400000020 nid: 10.25.22.97@tcp remote: 0x608b1ed02e6a5ec5 expref: 6373 pid: 45464 timeout: 771903 lvb_type: 0 May 6 11:53:11 gmds1 kernel: LustreError: 9940:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed export ffffa09ee05db000 ns: mdt-golf-MDT0000_UUID lock: ffffa0aa636f1b00/0xa9ecccc11c5e9407 lrc: 3/0,0 mode: PR/PR res: [0x20004884e:0x15:0x0].0x0 bits 0x13/0x0 rrc: 4 type: IBT flags: 0x50200000000000 nid: 10.25.22.97@tcp remote: 0x608b1ed02e6a6149 expref: 6372 pid: 9940 timeout: 0 lvb_type: 0 May 6 11:53:11 gmds1 kernel: Lustre: golf-MDT0000: Connection restored to (at 
10.25.22.97@tcp) May 6 11:53:15 gmds1 kernel: Lustre: Skipped 1 previous similar message May 6 11:53:15 gmds1 kernel: LustreError: 9940:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) Skipped 5 previous similar messages May 6 11:53:15 gmds1 kernel: Lustre: 8191:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (2402:79721s); client may timeout. req@ffffa0baae312880 x1665743648877760/t0(0) o101->0da7ba62-284f-8d7c-7923-989bf54c33d1@10.25.22.97@tcp:459/0 lens 576/592 e 15 to 0 dl 1588666466 ref 1 fl Complete:/0/0 rc -107/-107 May 6 11:53:15 gmds1 kernel: Lustre: 8191:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 4 previous similar messages May 6 11:53:15 gmds1 kernel: LNet: Service thread pid 8191 completed after 82123.35s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). May 6 11:53:15 gmds1 kernel: LNet: Skipped 3 previous similar messages May 6 11:53:15 gmds1 kernel: LustreError: 9940:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) Skipped 5 previous similar messages May 6 11:53:15 gmds1 kernel: Lustre: 8191:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (2402:79721s); client may timeout. req@ffffa0baae312880 x1665743648877760/t0(0) o101->0da7ba62-284f-8d7c-7923-989bf54c33d1@10.25.22.97@tcp:459/0 lens 576/592 e 15 to 0 dl 1588666466 ref 1 fl Complete:/0/0 rc -107/-107 May 6 11:53:15 gmds1 kernel: Lustre: 8191:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 4 previous similar messages May 6 11:53:15 gmds1 kernel: LNet: Service thread pid 8191 completed after 82123.35s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources). May 6 11:53:15 gmds1 kernel: LNet: Skipped 3 previous similar messages May 6 11:54:49 gmds1 kernel: LustreError: 8523:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.25.22.97@tcp ns: mdt-golf-MDT0000_UUID lock: ffffa0aac1395c40/0xa9ecccc11e3833ee lrc: 3/0,0 mode: CW/CW res: [0x20004884e:0x17:0x0].0x0 bits 0x5/0x0 rrc: 68 type: IBT flags: 0x60200400000020 nid: 10.25.22.97@tcp remote: 0x608b1ed02e6b8ad7 expref: 101 pid: 45266 timeout: 772005 lvb_type: 0 May 6 11:54:49 gmds1 kernel: LustreError: 46556:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed export ffffa0abc5bbdc00 ns: mdt-golf-MDT0000_UUID lock: ffffa09e4ffa4a40/0xa9ecccc11e3834a4 lrc: 3/0,0 mode: PR/PR res: [0x20004884e:0x17:0x0].0x0 bits 0x1b/0x0 rrc: 64 type: IBT flags: 0x50200400000020 nid: 10.25.22.97@tcp remote: 0x608b1ed02e6b8af3 expref: 67 pid: 46556 timeout: 0 lvb_type: 0 May 6 11:54:49 gmds1 kernel: LustreError: 46556:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) Skipped 40 previous similar messages |
| Comment by Campbell Mcleay (Inactive) [ 11/May/20 ] |
|
Any update on this, please? |
| Comment by Oleg Drokin [ 11/May/20 ] |
|
I got the logs and they have some holes in them due to a lot of ongoing activity, so the understanding is still limited. Basically what I pieced together is:
00000080:00000001:42.0:1588746088.342151:0:2834:0:(file.c:245:ll_md_real_close()) Process entered
00000080:00000001:42.0:1588746187.766589:0:2834:0:(file.c:264:ll_md_real_close()) Process leaving (rc=0 : 0 : 0)
No interim state from this process got into the logs, unfortunately. But in the server messages log there are messages about long processing of requests, so I wonder if the message is delayed somewhere? Also, the messages about processing taking 80000+ seconds are highly unusual. Can you please describe what the workload on the fs is like? Any non-NFS access in a namespace shared with the NFS access? What do the NFS clients roughly do? I'd really like to get the full logs, but with the sheer amount of activity going on that appears to be not very practical; does the NFS re-export node have local (and fast) storage by any chance, so we can try running the debug daemon there? Absent that, can you rerun the reproducer but instead of setting debug to -1 set it to: "inode super info warning other ha dlmtrace error emerg rpctrace vfstrace console". Additionally, I tried to reproduce this on my test system applying what I was able to gather from the existing logs and it seems to be working fine (but I am on a current b2_12); it would not be a bad idea if you are able to update to 2.12.4 for this as well, as I see there are some potentially relevant changes (e.g. the |
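A minimal sketch of that reduced-debug capture (the mask is the one quoted above; the local log path and the 20480 MB size are assumptions):
lctl set_param debug="inode super info warning other ha dlmtrace error emerg rpctrace vfstrace console"
lctl set_param debug_mb=10240
lctl clear
# if fast local storage exists, stream the debug buffer to disk instead of keeping it only in memory
lctl debug_daemon start /local/lustre-debug.log 20480
# ... reproduce the eviction ...
lctl debug_daemon stop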
| Comment by Campbell Mcleay (Inactive) [ 14/May/20 ] |
|
lustre-log.1588746168.17655.xz is uploaded under the same folder; the original lustre-log.1588746168.17655 is uploaded too. The namespace is accessed for data backups using local rsyncs, plus Time Machine backups over the AFP protocol. The namespace also gets mounted on another server over NFS to copy data using rsync. In parallel, we are exploring whether we can get downtime to upgrade to 2.12.4. |
| Comment by Campbell Mcleay (Inactive) [ 06/Aug/20 ] |
|
We have upgraded golf1, golf2, gmds1, and goss1 through goss6 to 2.12.4. |
| Comment by Peter Jones [ 06/Aug/20 ] |
|
ok. Please let us know whether this issue is still impacting you. |
| Comment by Campbell Mcleay (Inactive) [ 31/Aug/20 ] |
|
We have upgraded the hotel servers to Lustre 2.12.4 as well, but we are facing the client eviction issues in the 2.12.4 version too. NFS re-export is enabled only on hotel1, which is where we are having the client evictions. We have uploaded the logs from the MDS and NFS client servers to our FTP server to check these client evictions and the associated NFS re-export issues. Log location: /from_dneg/31Aug2020
Adding the details of the hotel servers:
client (hotel1) kernel version: 3.10.0-1062.9.1.el7_lustre.x86_64
mds (hmds1) kernel version: 3.10.0-1062.9.1.el7_lustre.x86_64
oss (hoss1-hoss6) kernel version: 3.10.0-1062.9.1.el7_lustre.x86_64
/etc/exports on hotel1 : |
| Comment by Campbell Mcleay (Inactive) [ 02/Sep/20 ] |
|
Hello Team, can we have an update on this? Also, please let us know if any additional info is needed to check this further. |
| Comment by Campbell Mcleay (Inactive) [ 03/Sep/20 ] |
|
We would really appreciate an update on this at the earliest. |
| Comment by Oleg Drokin [ 03/Sep/20 ] |
|
It looks like your logs cover different regions. The dmesgs covering the eviction are at 17:00 but the messages are at 17:07 to 17:10, so we only know there is an eviction but have nothing to correlate back to the logs. |
| Comment by Campbell Mcleay (Inactive) [ 05/Sep/20 ] |
|
Yes, the eviction happened at Aug 29 17:00 IST. The dmesg and messages logs have the same time, but we are not really sure why the Lustre logs don't cover 17:00. The evictions keep happening. Eviction time: Aug 31 at 19:23 and 19:35 |
| Comment by Oleg Drokin [ 06/Sep/20 ] |
|
Well, this still seems to be very disconnected between the two sets of logs. The debug logs cover from 1598883056.697070 to 1598883239.198867 (these are seconds since the Unix epoch); I am not exactly sure what your timezone is, but that is XX:10:56 to XX:13:59, which seems to be far away from the eviction event. Did a long time pass between the eviction and stopping the log? The log data overflows super fast, so it's important to stop the Lustre logging very quickly after the eviction is recorded. |
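For reference, the epoch timestamps in the debug logs can be converted to local time with GNU date, e.g. on one of the IST nodes:
date -d @1598883056    # start of the captured window
date -d @1598883239    # end of the captured window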
| Comment by Campbell Mcleay (Inactive) [ 07/Sep/20 ] |
|
Both hmds1 and hotel1 are running with the same timezone (Asia/Calcutta (IST, +0530)).
hmds1 /tmp # timedatectl
Local time: Mon 2020-09-07 18:23:47 IST
Universal time: Mon 2020-09-07 12:53:47 UTC
RTC time: Mon 2020-09-07 12:53:48
Time zone: Asia/Calcutta (IST, +0530)
NTP enabled: yes
NTP synchronized: no
RTC in local TZ: no
DST active: n/a
hmds1 /tmp #
hotel1 /tmp # timedatectl
Local time: Mon 2020-09-07 18:24:16 IST
Universal time: Mon 2020-09-07 12:54:16 UTC
RTC time: Mon 2020-09-07 12:54:01
Time zone: Asia/Calcutta (IST, +0530)
NTP enabled: yes
NTP synchronized: no
RTC in local TZ: no
DST active: n/a
hotel1 /tmp #
hmds1 /tmp # ls -lrth lustre-hmds1-31082020.log
-rw-rw-rw- 1 root root 9.6G Aug 31 19:50 lustre-hmds1-31082020.log
hmds1 /tmp #
hotel1 /tmp # ls -lrth lustre-hotel1-31082020.log
-rw-rw-rw- 1 root root 9.6G Aug 31 19:51 lustre-hotel1-31082020.log
hotel1 /tmp #
Checking the Lustre debug logs, there is some difference in the hmds1 debug log write timings. hotel1 seems to have the logs of the eviction at 19:35 IST, but I am not sure why hmds1 did not capture the logs of the evictions. Why is this difference showing up on hmds1? Are there any other settings?
hmds1 Lustre debug log - started at Monday, 31 August 2020 19:40:56.697 GMT+05:30 (1598883056.697070) and finished at Monday, 31 August 2020 19:43:59.198 GMT+05:30 (1598883239.198867)
hotel1 Lustre debug log - started at Monday, 31 August 2020 19:26:45.925 GMT+05:30 (1598882205.925774) and finished at Monday, 31 August 2020 19:44:30.984 GMT+05:30 (1598883270.984189)
The connection restoration details are also available in the logs and the reconnects happen quickly, but the evictions are causing trouble for the backup jobs. |
| Comment by Campbell Mcleay (Inactive) [ 10/Sep/20 ] |
|
Any update on this, please? |
| Comment by Campbell Mcleay (Inactive) [ 17/Sep/20 ] |
|
Hello Team, any news here? |
| Comment by Oleg Drokin [ 23/Sep/20 ] |
|
OK, so I did not realize you had a fractional-hour timezone; I think everything checks out then. The debug logs at your debug_mb setting only allow ~3 minutes of detailed logs, and you generated them 5 and 15 minutes away from the time the eviction happened. It sounds like you might want to set dump_on_eviction to 1 on your clients to make the client dump the current copy of the log the moment it realizes it's evicted, and also dump_on_timeout on the server to do the same when it encounters a lock timeout and evicts a client. |
| Comment by Campbell Mcleay (Inactive) [ 24/Sep/20 ] |
|
OK, we have set these parameters on both the MDS and the client to get proper logs after evictions.
On both client and MDS:
lctl set_param debug=-1
lctl set_param debug_mb=10240
lctl clear
On the client:
lctl set_param dump_on_eviction=1
On the MDS:
lctl set_param dump_on_timeout=1 |
| Comment by Campbell Mcleay (Inactive) [ 29/Sep/20 ] |
|
We have uploaded the log dump files that were generated on the MDS and client right after the eviction to the FTP server. Hope that helps to investigate this further. Evictions happened at "Sun Sep 27 22:57:23 2020 IST" |
| Comment by Campbell Mcleay (Inactive) [ 29/Sep/20 ] |
|
We had evictions today as well, which happened at Tue Sep 29 09:20:16 2020 IST and Tue Sep 29 09:29:16 2020 IST; the logs for those have been uploaded as well to investigate further. Evictions at: Tue Sep 29 09:20:16 2020 IST and Tue Sep 29 09:29:16 2020 IST |
| Comment by Campbell Mcleay (Inactive) [ 06/Oct/20 ] |
|
Any updates here, please? |
| Comment by Oleg Drokin [ 07/Oct/20 ] |
|
It seems you have way too much stuff going on on your nodes, even for 10G of RAM worth of logs, so I guess we really need to reduce the debug level and hope it all fits in and that the reduced level still catches all the interesting bits.
lctl set_param debug="+rpctrace +dlmtrace +neterror +net"
please replace the debug=-1 with the above and otherwise repeat the exercise and upload the logs. Sorry it's taking this long with the iterative process. |
| Comment by Campbell Mcleay (Inactive) [ 07/Oct/20 ] |
|
OK. Do I need to stop all the I/O / NFS connections on it to apply this change, or can I apply it directly? Will this be the final config on both MDS and client, or are there any other changes? Please confirm:
lctl set_param debug="+rpctrace +dlmtrace +neterror +net"
lctl set_param debug_mb=10240 |
| Comment by Oleg Drokin [ 07/Oct/20 ] |
|
You can apply it directly. Basically you run it like before, just changing the lctl set_param debug=-1 line to this one; the rest of it stays the same. |
| Comment by Campbell Mcleay (Inactive) [ 08/Oct/20 ] |
|
We have uploaded eviction logs after the above changes; please have a look. Eviction at: Oct 7 19:22 IST |
| Comment by Oleg Drokin [ 08/Oct/20 ] |
|
I got the logs and they are still full debug, and then it dawned on me that since you are not restarting the nodes, the instruction I gave you would not disable the excess logging. Sorry about that. So to manually set the desired debug level and override all the bits you currently have, you need to do this instead:
lctl set_param debug="vfstrace info inode warning error dentry emerg console rpctrace dlmtrace neterror net"
|
| Comment by Campbell Mcleay (Inactive) [ 12/Oct/20 ] |
|
OK, we have set these parameters on both hmds1 and the client. However, "get_param" is still showing blank for the debug level; hope that won't be a problem.
lctl set_param debug="vfstrace info inode warning error dentry emerg console rpctrace dlmtrace neterror net"
lctl set_param debug_mb=10240 |
| Comment by Oleg Drokin [ 12/Oct/20 ] |
|
"still" showing blank? it was not supposed to show blank at any time past or present. are there any errors in dmesg? |
| Comment by Campbell Mcleay (Inactive) [ 12/Oct/20 ] |
|
Yes, it is still blank, and there are no errors in dmesg either. |
| Comment by Oleg Drokin [ 12/Oct/20 ] |
|
I mean it is supposed to be something like this:
lctl get_param debug
debug=
super ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck
Does "cat /sys/kernel/debug/lnet/debug" output anything? |
| Comment by Campbell Mcleay (Inactive) [ 13/Oct/20 ] |
|
OK, it's giving the output for debug on both now; hope all is good with that.
hmds1 /tmp # lctl get_param debug
debug=
inode info neterror net warning dentry dlmtrace error emerg rpctrace vfstrace console
hmds1 /tmp # cat /sys/kernel/debug/lnet/debug
inode info neterror net warning dentry dlmtrace error emerg rpctrace vfstrace console
hmds1 /tmp #
hotel1 /tmp # lctl get_param debug
hotel1 /tmp # lctl get_param debug
debug=
inode info neterror net warning dentry dlmtrace error emerg rpctrace vfstrace console
hotel1 /tmp # cat /sys/kernel/debug/lnet/debug
inode info neterror net warning dentry dlmtrace error emerg rpctrace vfstrace console
hotel1 /tmp # |
| Comment by Campbell Mcleay (Inactive) [ 14/Oct/20 ] |
|
There were a couple of evictions after the change in the debug level, and we have uploaded the latest logs; please have a look. Evictions at: Oct 14 04:11 and 04:28 IST |
| Comment by Campbell Mcleay (Inactive) [ 15/Oct/20 ] |
|
Do the latest logs contain sufficient data? |
| Comment by Campbell Mcleay (Inactive) [ 19/Oct/20 ] |
|
Any updates here? |
| Comment by Oleg Drokin [ 20/Oct/20 ] |
|
The 1602629930.5992/1602629931.16759 set of logs seems to be promising. I see there's an attempt to bunch some other locks that all happen to be open locks, which are sort of expensive in the sense that they cause an extra RPC to close the file descriptor first; this is not ideal but not fatal. The real trouble is that eventually we run out of "slots" for the close RPC and nothing wakes us up until the eviction 100 seconds later:
00000020:00100000:11.0:1602629830.723116:0:37307:0:(genops.c:2356:obd_get_mod_rpc_slot()) hotel-MDT0000-mdc-ffff89b0ec6ee800: sleeping for a modify RPC slot opc 35, max 7
00000100:00000040:11.0:1602629930.420531:0:37307:0:(lustre_net.h:2498:ptlrpc_rqphase_move()) @@@ move req "New" -> "Rpc" req@ffff89b2d64a0d80 x1685199064046272/t0(0) o35->hotel-MDT0000-mdc-ffff89b0ec6ee800@10.27.22.90@tcp:23/10 lens 392/624 e 0 to 0 dl 0 ref 2 fl New:/0/ffffffff rc 0/-1
The reason seems to be some RPCs from this client being stuck on the server, so we run out of slots. Checking the server side for one such request, it looks like |
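For context, the "max 7" in that trace appears to be the client's modify-RPC slot limit; a hedged way to inspect it on the re-export client (standard mdc tunables, wildcard names assumed):
lctl get_param mdc.*.max_mod_rpcs_in_flight
lctl get_param mdc.*.max_rpcs_in_flight
# raising max_mod_rpcs_in_flight also depends on the MDS-side max_mod_rpcs_per_client limit,
# and would only mask, not fix, requests getting stuck on the server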
| Comment by Campbell Mcleay (Inactive) [ 20/Oct/20 ] |
|
Thanks for the update; I have just gone through that case. Could you please explain which patch needs to be applied here on our system? What exactly does the patch do? How can we implement the patch? It would be good if you could share some docs for that too. |
| Comment by Peter Jones [ 20/Oct/20 ] |
|
Campbell, the fix Oleg mentioned is one that we are hoping to include in our upcoming 2.12.6 release. Would you prefer to wait until then so that you are running something well-tested, or are things dire enough that you need to attempt a hotfix ASAP? Peter |
| Comment by Campbell Mcleay (Inactive) [ 21/Oct/20 ] |
|
Hi Peter, when are you planning to release the 2.12.6 version with the fix? Will there be much delay in releasing this minor version? |
| Comment by Campbell Mcleay (Inactive) [ 21/Oct/20 ] |
|
In the meantime, could you please provide the patch along with documentation on how to apply it? We will also try it out on our servers to see if it fixes the eviction issue. |
| Comment by Campbell Mcleay (Inactive) [ 23/Oct/20 ] |
|
Hi Peter/Oleg, as requested earlier, we would like to apply the patch first on our servers. Could you please share the patches ASAP? |
| Comment by Peter Jones [ 23/Oct/20 ] |
|
The patch is at https://review.whamcloud.com/#/c/39598/ |
| Comment by Campbell Mcleay (Inactive) [ 26/Oct/20 ] |
|
Hi Peter, can you please tell me what needs to be done here to apply the patch? I downloaded ldlm_lock_new-83331795f2550e9e0bf15c3d3832f3d3.c, created a patch file from a diff between it and the version of ldlm_lock.c in the 2.12.4 src rpm, and it patched successfully but failed to build. I was able to build rpms from the 142fe73 tree, but they were version 2.13.56, and the rpms were kmod-lustre-client and lustre-client - so is this patch for the client packages? We use the server packages on the affected client, so we have the lustre kernel and lustre rpm on our client. Should I build the 2.13.56 rpms against a generic kernel and use that? Thanks, Campbell |
| Comment by Oleg Drokin [ 27/Oct/20 ] |
|
You can get already-built RPMs from this patch: https://review.whamcloud.com/40412 - there's a build link in the comments. The patch only affects the MDS, so technically that's the only node you need to update with it (this is the tip of the b2_12 tree, so it has all the other fixes too, though). |
| Comment by Campbell Mcleay (Inactive) [ 27/Oct/20 ] |
|
Hi Oleg, Apologies, I have not been able to find the link to the RPMs. I've found a kernel image (kernel-centos7-x86_64) and a source tree (ac40c31.tar.gz), but could not find a link to any rpms - can you please provide more detail? Sorry, I'm clearly missing something obvious. Kind regards, Campbell |
| Comment by Peter Jones [ 27/Oct/20 ] |
|
Campbell The build link from the comments Oleg was referring to is https://build.whamcloud.com/job/lustre-reviews/77274/ Hopefully it is intuitive enough how to drill in to get the distro/arch version relevant for you but let us know if not. Peter |
| Comment by Campbell Mcleay (Inactive) [ 27/Oct/20 ] |
|
Hi Peter, Thanks, I found them at https://build.whamcloud.com/job/lustre-reviews/arch=x86_64,build_type=server,distro=el7.7,ib_stack=inkernel/lastSuccessfulBuild/artifact/artifacts/RPMS/x86_64/ Regards, Campbell |
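Once the patched build is installed on the MDS, a quick sanity check of what is actually running (standard commands):
rpm -qa | grep -i lustre        # installed Lustre package versions
lctl lustre_build_version       # version of the running Lustre software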
| Comment by Campbell Mcleay (Inactive) [ 30/Oct/20 ] |
|
Since upgrading to 2.13.56 on the MDS, we've found a strange issue whereby we cannot create files over NFS to the lustre client on the lustre filesystem. We can remove files, and chown and chmod. Permissions make no difference. I can't see anything in the logs. Any ideas why this might happen? Do we need to update packages on anything else besides the MDS perhaps? |
| Comment by Peter Jones [ 30/Oct/20 ] |
|
Wait - 2.13.56?! I hope that this is a misunderstanding as that is a 2.14 development tag and not something suitable for running in production. |
| Comment by Peter Jones [ 30/Oct/20 ] |
|
Ah I see - you navigated to the most recent development build rather than the specific one highlighted above. The link I expected you to navigate to was https://build.whamcloud.com/job/lustre-reviews/77274/arch=x86_64,build_type=server,distro=el7.7,ib_stack=inkernel/ |
| Comment by Campbell Mcleay (Inactive) [ 30/Oct/20 ] |
|
ok, thanks Peter. Just to make sure: I should be installing version 2.12.5_62_ga40c31-1? |
| Comment by Peter Jones [ 30/Oct/20 ] |
|
Correct |
| Comment by Campbell Mcleay (Inactive) [ 01/Nov/20 ] |
|
Hi Peter/Oleg, we installed version "2.12.5_62_ga40c31-1", but unfortunately that didn't help fix the eviction issue. We are still having client evictions and have uploaded the latest logs to investigate further. Eviction time: Sun Nov 1 02:12:09 2020 IST |
| Comment by Campbell Mcleay (Inactive) [ 02/Nov/20 ] |
|
Any thoughts here? |
| Comment by Oleg Drokin [ 02/Nov/20 ] |
|
the traces are ramarkably similar, but this time as the client is forming early-cancel-requests we can trace them all on the server and the last one sent is handled on the server real fast, the reply is formed and sent, but somehow it is never received on the client (or if it is, there are no traces of it) client side: 00000100:00100000:31.0:1604176830.175778:0:31048:0:(client.c:1630:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ldlm_bl_16:fba2d39b-78a9-1d04-926c-0efdbbdf3a0e:31048:1682064348899136:10.27.22.90@tcp:35 00000100:00000200:31.0:1604176830.175799:0:31048:0:(niobuf.c:884:ptl_send_rpc()) Setup reply buffer: 1024 bytes, xid 1682064348899136, portal 10 00000100:00000040:31.0:1604176830.175802:0:31048:0:(niobuf.c:905:ptl_send_rpc()) @@@ send flg=0 req@ffff8cfc9e310480 x1682064348899136/t0(0) o35->hotel-MDT0000-mdc-ffff8ced28a0a800@10.27.22.90@tcp:23/10 lens 392/624 e 0 to 0 dl 1604176837 ref 3 fl Rpc:/0/ffffffff rc 0/-1 00000100:00000040:31.0:1604176830.175809:0:31048:0:(niobuf.c:57:ptl_send_buf()) peer_id 12345-10.27.22.90@tcp 00000100:00000200:31.0:1604176830.175819:0:31048:0:(niobuf.c:85:ptl_send_buf()) Sending 392 bytes to portal 23, xid 1682064348899136, offset 0 00000400:00000200:31.0:1604176830.175824:0:31048:0:(lib-move.c:4684:LNetPut()) LNetPut -> 12345-10.27.22.90@tcp 00000400:00000200:31.0:1604176830.175827:0:31048:0:(lib-move.c:2478:lnet_handle_send_case_locked()) Source ANY to MR: 10.27.22.90@tcp local destination 00000400:00000200:31.0:1604176830.175831:0:31048:0:(lib-move.c:1515:lnet_get_best_ni()) compare ni 10.27.22.97@tcp [c:246, d:21, s:1134232350] with best_ni not seleced [c:-2147483648, d:-1, s:0] 00000400:00000200:31.0:1604176830.175834:0:31048:0:(lib-move.c:1558:lnet_get_best_ni()) selected best_ni 10.27.22.97@tcp 00000400:00000200:31.0:1604176830.175836:0:31048:0:(lib-move.c:1790:lnet_select_peer_ni()) 10.27.22.97@tcp ni_is_pref = 0 00000400:00000200:31.0:1604176830.175839:0:31048:0:(lib-move.c:1846:lnet_select_peer_ni()) sd_best_lpni = 10.27.22.90@tcp 00000400:00000200:31.0:1604176830.175872:0:31048:0:(lib-move.c:1752:lnet_handle_send()) TRACE: 10.27.22.97@tcp(10.27.22.97@tcp:<?>) -> 10.27.22.90@tcp(10.27.22.90@tcp:10.27.22.90@tcp) : PUT try# 0 00000800:00000200:31.0:1604176830.175876:0:31048:0:(socklnd_cb.c:1003:ksocknal_send()) sending 392 bytes in 1 frags to 12345-10.27.22.90@tcp 00000800:00000200:31.0:1604176830.175880:0:31048:0:(socklnd.c:195:ksocknal_find_peer_locked()) got peer_ni [ffff8cfbf622dd00] -> 12345-10.27.22.90@tcp (5) 00000800:00000200:31.0:1604176830.175884:0:31048:0:(socklnd_cb.c:764:ksocknal_queue_tx_locked()) Sending to 12345-10.27.22.90@tcp ip 10.27.22.90:988 00000800:00000200:31.0:1604176830.175886:0:31048:0:(socklnd_cb.c:783:ksocknal_queue_tx_locked()) Packet ffff8cf4a6168200 type 1, nob 488 niov 2 nkiov 0 00000100:00100000:31.0:1604176830.175889:0:31048:0:(client.c:2354:ptlrpc_set_wait()) set ffff8cffcd521800 going to sleep for 6 seconds no activity from this thread from this moment on highlighting it's still waiting for reply server side: 00000100:00100000:19.0:1604176830.175965:0:37439:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg01_019:fba2d39b-78a9-1d04-926c-0efdbbdf3a0e+4826808:31048:x1682064348899136:12345-10.27.22.97@tcp:35 00000100:00000200:19.0:1604176830.175966:0:37439:0:(service.c:2145:ptlrpc_server_handle_request()) got req 1682064348899136 00010000:00000040:19.0:1604176830.176000:0:37439:0:(ldlm_lib.c:2944:target_committed_to_req()) 
last_committed 734953002189, transno 734953004718, xid 1682064348899136 00010000:00000200:19.0:1604176830.176002:0:37439:0:(ldlm_lib.c:2996:target_send_reply_msg()) @@@ sending reply req@ffff9f0783b79f80 x1682064348899136/t734953004718(0) o35->fba2d39b-78a9-1d04-926c-0efdbbdf3a0e@10.27.22.97@tcp:401/0 lens 392/456 e 0 to 0 dl 1604176836 ref 1 fl Interpret:/0/0 rc 0/0 00000100:00000200:19.0:1604176830.176010:0:37439:0:(niobuf.c:85:ptl_send_buf()) Sending 424 bytes to portal 10, xid 1682064348899136, offset 192 00000800:00000200:19.0:1604176830.176028:0:37439:0:(socklnd_cb.c:783:ksocknal_queue_tx_locked()) Packet ffff9ee5e4e3a000 type 1, nob 520 niov 2 nkiov 0 00000100:00000040:19.0:1604176830.176034:0:37439:0:(lustre_net.h:2498:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete" req@ffff9f0783b79f80 x1682064348899136/t734953004718(0) o35->fba2d39b-78a9-1d04-926c-0efdbbdf3a0e@10.27.22.97@tcp:401/0 lens 392/424 e 0 to 0 dl 1604176836 ref 1 fl Interpret:/0/0 rc 0/0 00000100:00100000:19.0:1604176830.176039:0:37439:0:(service.c:2190:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg01_019:fba2d39b-78a9-1d04-926c-0efdbbdf3a0e+4826812:31048:x1682064348899136:12345-10.27.22.97@tcp:35 Request processed in 74us (98us total) trans 734953004718 rc 0/0 00000100:00100000:19.0:1604176830.176042:0:37439:0:(nrs_fifo.c:241:nrs_fifo_req_stop()) NRS stop fifo request from 12345-10.27.22.97@tcp, seq: 56446120 00000100:00000040:19.0:1604176830.176043:0:37439:0:(service.c:1057:ptlrpc_server_finish_active_request()) RPC PUTting export ffff9ed637f2c000 : new rpc_count 5 00000020:00000040:19.0:1604176830.176044:0:37439:0:(genops.c:1018:class_export_put()) PUTting export ffff9ed637f2c000 : new refcount 4826811 Now if we switch back to the client we actually can see the reply come from the server - the next message received from MDS is for xid 1682064348976448 00000400:00000200:32.0:1604176830.231054:0:4117:0:(lib-move.c:4190:lnet_parse()) TRACE: 10.27.22.97@tcp(10.27.22.97@tcp) <- 10.27.22.90@tcp : PUT - for me 00000400:00000200:32.0:1604176830.231058:0:4117:0:(lib-ptl.c:571:lnet_ptl_match_md()) Request from 12345-10.27.22.90@tcp of length 592 into portal 10 MB=0x5f9d420029540 00000400:00000200:32.0:1604176830.231061:0:4117:0:(lib-ptl.c:200:lnet_try_match_md()) Incoming put index a from 12345-10.27.22.90@tcp of length 592/592 into md 0x32bcc8121 [1] + 192 00000400:00000200:32.0:1604176830.231064:0:4117:0:(lib-msg.c:828:lnet_is_health_check()) health check = 0, status = 0, hstatus = 0 00000100:00000200:32.0:1604176830.231066:0:4117:0:(events.c:93:reply_in_callback()) @@@ type 2, status 0 req@ffff8cf2d0afc800 x1682064348976448/t0(0) o101->hotel-MDT0000-mdc-ffff8ced28a0a800@10.27.22.90@tcp:12/10 lens 656/960 e 0 to 0 dl 1604176838 ref 2 fl Rpc:/0/ffffffff rc 0/-1 00000100:00000200:32.0:1604176830.231066:0:4117:0:(events.c:93:reply_in_callback()) @@@ type 2, status 0 req@ffff8cf2d0afc800 x1682064348976448/t0(0) o101->hotel-MDT0000-mdc-ffff8ced28a0a800@10.27.22.90@tcp:12/10 lens 656/960 e 0 to 0 dl 1604176838 ref 2 fl Rpc:/0/ffffffff rc 0/-1 But while it was sent a small fraction of a second before the other one - we don't see the message being sent in the client logs (we do see it in the server logs) - which I guess might explain why we never see the other one in logs either - it was just dropped as log bucket was full (we have log buckets divider per cpu and priority so some might be dropped before the others leaving such gaps seemingly at random when you have high level of 
activity). Do you have any storage on your NFS re-export node (like really fast local storage)? Basically you set the debug level as before, but you run this on the client:
lctl debug_daemon start /some/path/on/real/local/storage/lustre.log 40000
This will create up to a 40G file at that location. When you are done:
lctl debug_daemon stop
You don't use the dump_on_eviction setting in this mode of operation, so that is important. Do you think you would be able to do something like this? |
| Comment by Campbell Mcleay (Inactive) [ 03/Nov/20 ] |
|
OK, I have disabled dump_on_eviction on the client and tried to start the debug daemon, but it seems there is a size limit. I am not sure whether 20GB is enough to capture this; please suggest.
hotel1 /tmp # lctl set_param dump_on_eviction=0
dump_on_eviction=0
hotel1 /tmp # lctl debug_daemon start /tmp/lustre-dump40g.log 40000
size 40000 invalid, must be in the range 10-20480 MB
hotel1 /tmp # |
| Comment by Oleg Drokin [ 03/Nov/20 ] |
|
Well then, please use 20480 instead of the 40000. We won't know whether it's enough until we look inside; here's hoping it is, but if you can reproduce with less activity that certainly should help. |
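Putting the pieces together, the capture on the client would then look roughly like this (a sketch; the log path is an assumption):
lctl set_param debug="vfstrace info inode warning error dentry emerg console rpctrace dlmtrace neterror net"
lctl set_param debug_mb=10240
lctl set_param dump_on_eviction=0                    # not used in debug_daemon mode, per the comment above
lctl debug_daemon start /tmp/lustre-dump.log 20480   # 20480 MB is the maximum accepted size
# ... wait for the eviction, then stop logging as quickly as possible ...
lctl debug_daemon stop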
| Comment by Campbell Mcleay (Inactive) [ 04/Nov/20 ] |
|
I have uploaded the latest logs after starting the debug_daemon to investigate further on this. Log Location : /from_dneg/04Nov2020/ |
| Comment by Campbell Mcleay (Inactive) [ 05/Nov/20 ] |
|
Are those logs helping you to look further into this? |
| Comment by Oleg Drokin [ 06/Nov/20 ] |
|
Please include the server-side dmesg - it helps me see the eviction messages without poring through huge debug logs. Anyway, I see you included a number of server-side logs, but all of them were dumped an hour before the client log begins? |
| Comment by Oleg Drokin [ 06/Nov/20 ] |
|
I also noticed that the client log only covers 15 minutes of time - that is enough to go back over a 100-second eviction, but you really need to be able to stop client-side logging pretty quickly after the server-side eviction hits. |
| Comment by Campbell Mcleay (Inactive) [ 06/Nov/20 ] |
|
Unfortunately it took a while for me to stop the debug daemon after the eviction happened. I think I stopped the service roughly 1-2 hours after the eviction, since it is happening randomly. I could see the evictions happened at "Wed Nov 4 02:12:55 2020 IST", but other files were dumped as well; the creation times are below.

-rw------- 1 root root 7.8G Nov 4 02:13 lustre-log.1604436175.5291
-rw------- 1 root root 602M Nov 4 02:14 lustre-log.1604436274.18605
-rw------- 1 root root 4.5G Nov 4 02:20 lustre-log.1604436600.19412
-rw------- 1 root root 5.3G Nov 4 02:31 lustre-log.1604437281.19418
-rw------- 1 root root 5.0G Nov 4 02:41 lustre-log.1604437888.22490

The client debug log seems to have stopped writing once it reached the file size threshold? Its timestamp is below as well.

-rw------- 1 root root 21G Nov 4 03:38 lustre-dump-hotel1.log

I have uploaded the dmesg of the server and client as requested. If it is too much trouble to dig through this log file, I can upload a new set of eviction logs, and this time I will try to stop the service right after the eviction is reported. Let me know if that is really required here.

Log location: /from_dneg/04Nov2020/hmds1-logs |
| Comment by Oleg Drokin [ 06/Nov/20 ] |
|
Basically, for the client log to contain enough information, you must stop it within ~10 minutes of the eviction happening. The maximum size of 20GB is not used to stop writing; it basically makes the file act as a circular buffer, so once that size is reached, writing continues from offset zero, overwriting old records. So please try to catch the eviction and stop the client logging right after it is reported (since there is a very characteristic eviction record in the server-side logs, you might be able to script something up to catch the event and stop the writing client-side). |
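The kind of scripting being suggested here could look roughly like this, run on the MDS. This is purely an illustrative sketch, assuming passwordless ssh from the MDS to the client (here called hotel1, as in the logs above) and that the eviction always produces the "evicting client" record seen in the earlier server logs.

#!/bin/bash
# Watch the MDS kernel log for the characteristic eviction record and
# immediately stop the debug daemon on the NFS re-export client.
CLIENT=hotel1    # assumption: the client is reachable by this hostname over ssh
dmesg --follow | grep --line-buffered "evicting client" | while read -r line; do
    echo "$(date): eviction detected: $line"
    ssh "$CLIENT" "lctl debug_daemon stop"
    break
done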
| Comment by Campbell Mcleay (Inactive) [ 09/Nov/20 ] |
|
OK, understood. I have uploaded the latest set of logs for further checking. This time I stopped logging on the client side immediately after seeing the eviction.

Log location: /from_dneg/10Nov2020/

A couple of files were generated on the server end; the file creation details are below.

hmds1:
-rw------- 1 root root 7.8G Nov 10 02:27 lustre-log.1604955458.5291
-rw------- 1 root root 3.5G Nov 10 02:29 lustre-log.1604955558.5335

hotel1 - the eviction happened at Nov 10 02:27:42 2020 and client logging was stopped at Nov 10 02:29:
-rw------- 1 root root 21G Nov 10 02:29 lustre-dump-20G-10112020.log |
| Comment by Gerrit Updater [ 11/Nov/20 ] |
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40602 |
| Comment by Gerrit Updater [ 11/Nov/20 ] |
|
Oleg Drokin (green@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40603 |
| Comment by Oleg Drokin [ 11/Nov/20 ] |
|
Thank you. This is a good log, but it is very big, so sifting through all of it quickly is hard. Meanwhile, some suspicions I had from the previous logs seem to be confirming themselves - the "proactive LRU lock cancelling" we do when preparing to send a cancel anyway appears to be a bit overzealous and does some pretty heavy processing. I am going to post a patch that should help here while I keep digging into some other suspicious things I also see: https://review.whamcloud.com/#/c/40603/ |
| Comment by Campbell Mcleay (Inactive) [ 11/Nov/20 ] |
|
It looks like there is more here to rectify. Should we try this patch now, or should we wait while you dig further into the ongoing investigation? |
| Comment by Oleg Drokin [ 11/Nov/20 ] |
|
I think it makes sense to try the patch now - if it helps, great. Whether any additional patches will be needed is not yet clear, and if so, and if the situation requires reverting a previous patch to reproduce the problem, that can be decided later. |
| Comment by Campbell Mcleay (Inactive) [ 13/Nov/20 ] |
|
Oleg, we have applied the provided patch, but unfortunately it does not seem to be helping much to reduce the evictions. |
| Comment by Campbell Mcleay (Inactive) [ 14/Nov/20 ] |
|
The NFS exports stopped working. We couldn't see anything in the client log. |
| Comment by Campbell Mcleay (Inactive) [ 16/Nov/20 ] |
|
After downgrading back to 2.12.4, the Lustre filesystem was hanging on this client; it would work intermittently, but mostly not (the other client we have was fine during this time). Is this due to lock recovery? I tried mounting with 'abort_recov', but saw the same issue. It eventually resolved itself, however. There are still a large number of stack traces in the MDS log, e.g.:

Nov 16 02:58:40 hmds1 kernel: LNet: Service thread pid 16267 was inactive for 278.13s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Nov 16 02:58:40 hmds1 kernel: LNet: Skipped 4 previous similar messages
Nov 16 02:58:40 hmds1 kernel: Pid: 16267, comm: mdt01_051 3.10.0-1062.18.1.el7_lustre.x86_64 #1 SMP Mon Jun 8 13:47:48 UTC 2020
Nov 16 02:58:40 hmds1 kernel: Call Trace:
Nov 16 02:58:40 hmds1 kernel: [<ffffffffc12b8070>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
Nov 16 02:58:40 hmds1 kernel: [<ffffffffc12ba0a1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
Nov 16 02:58:40 hmds1 kernel: [<ffffffffc15a817b>] mdt_rename_lock+0x24b/0x4b0 [mdt]
Nov 16 02:58:40 hmds1 kernel: [<ffffffffc15aa350>] mdt_reint_rename+0x2c0/0x2900 [mdt]
Nov 16 02:58:40 hmds1 kernel: [<ffffffffc15b31b3>] mdt_reint_rec+0x83/0x210 [mdt]
Nov 16 02:58:40 hmds1 kernel: [<ffffffffc158f383>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Nov 16 02:58:40 hmds1 kernel: [<ffffffffc159b0f7>] mdt_reint+0x67/0x140 [mdt]
Nov 16 02:58:40 hmds1 kernel: [<ffffffffc1356e8a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Nov 16 02:58:40 hmds1 kernel: [<ffffffffc12fb83b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Nov 16 02:58:40 hmds1 kernel: [<ffffffffc12ff1a4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Nov 16 02:58:40 hmds1 kernel: [<ffffffff8dac6321>] kthread+0xd1/0xe0
Nov 16 02:58:40 hmds1 kernel: [<ffffffff8e18ed37>] ret_from_fork_nospec_end+0x0/0x39
Nov 16 02:58:40 hmds1 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Nov 16 02:58:40 hmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1605475720.16267

We've uploaded a log (5.1GB) from the MDS |
| Comment by Campbell Mcleay (Inactive) [ 18/Nov/20 ] |
|
Any findings here, please? |
| Comment by Campbell Mcleay (Inactive) [ 23/Nov/20 ] |
|
Hello team, it would be really good if you could provide a solution for this as soon as possible, as it is affecting our production backups and putting us in a difficult situation. Please could you look at this with priority. |
| Comment by Oleg Drokin [ 24/Nov/20 ] |
|
Sorry that I have no immediate answers to this, I am still thinking about the whole thing. |
| Comment by Oleg Drokin [ 24/Nov/20 ] |
|
So I wonder if reducing the number of MDT locks would at least help you temporarily, since the problem is clearly related to parallel cancels.

lctl set_param ldlm.namespaces.*MDT*mdc*.lru_size=100

Can you please run this on your NFS export node(s)? The setting is not permanent and will reset on node reboot / Lustre client remount. |
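A small sketch of applying and verifying this on the re-export client, assuming the usual lctl parameter names (lock_count is only used here to watch the effect; it is not part of the suggested change):

# how many MDC locks are currently cached, and the current LRU setting
lctl get_param ldlm.namespaces.*MDT*mdc*.lock_count
lctl get_param ldlm.namespaces.*MDT*mdc*.lru_size

# cap the MDC lock LRU at 100 locks (not persistent; lost on remount or reboot)
lctl set_param ldlm.namespaces.*MDT*mdc*.lru_size=100

# confirm the new value
lctl get_param ldlm.namespaces.*MDT*mdc*.lru_size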
| Comment by Campbell Mcleay (Inactive) [ 25/Nov/20 ] |
|
Hi Oleg, Just some feedback on this setting: since the change has been made, there have been no stack traces, and the backups that run on that node have not been finishing early (so far). We'll see what happens when they finish, but looking much better so far! Thanks, Campbell |
| Comment by Campbell Mcleay (Inactive) [ 08/Jan/21 ] |
|
Hi Oleg, we have modified the configuration to limit the number of nfsd threads a single client can use on the NFS gateways, which has helped to avoid further client evictions. We are not seeing any stack traces and the backups are running smoothly so far.

/sys/module/sunrpc/parameters/svc_rpc_per_connection_limit = 1

We also have the below LRU size settings in place on the clients:

lctl set_param -P ldlm.namespaces.*.lru_size=10000
lctl set_param -P ldlm.namespaces.*.lru_max_age=600000

Please check and advise. |
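For what it's worth, one way to make a sunrpc module parameter like this survive reboots is a modprobe options file. This is a sketch only, assuming sunrpc is loaded as a module (as it appears to be in the module lists later in this ticket); the runtime value can still be changed through /sys:

# runtime change, takes effect immediately
echo 1 > /sys/module/sunrpc/parameters/svc_rpc_per_connection_limit

# persistent: applied whenever the sunrpc module is loaded
cat > /etc/modprobe.d/sunrpc.conf <<'EOF'
options sunrpc svc_rpc_per_connection_limit=1
EOF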
| Comment by Campbell Mcleay (Inactive) [ 01/Feb/21 ] |
|
Any updates here? |
| Comment by Oleg Drokin [ 01/Feb/21 ] |
|
These ldlm settings look fine to me if they work for you. You can shorten the max age, depending on your workload, if you want. I can't comment on the "svc_rpc_per_connection_limit" parameter as I am not familiar with that part of the NFS stack. |
| Comment by Campbell Mcleay (Inactive) [ 17/Jun/21 ] |
|
Hi Oleg Drokin, while monitoring the system with the suggested changes below, we have noticed that the metadata server (hmds1) is getting rebooted randomly and both clients are being evicted from the cluster, which is again causing issues. Could you please look into this further and advise?

lctl set_param ldlm.namespaces.*MDT*mdc*.lru_size=100
lctl set_param -P ldlm.namespaces.*.lru_size=10000
lctl set_param -P ldlm.namespaces.*.lru_max_age=600000 |
| Comment by Oleg Drokin [ 22/Jun/21 ] |
|
Is there any more information about the MDS crashes/reboots? Is there an oops/LBUG/assert (what is the stack trace)? A HA-induced reboot due to high load or similar? Something else? |
| Comment by Campbell Mcleay (Inactive) [ 25/Jun/21 ] |
|
I have uploaded the crash logs for couple of reboot instances to check this further. Uploaded file name: vmcore-dmesg-2021-06-17.txt Last restart instance reboot system boot 3.10.0-1062.18.1 Thu Jun 17 16:34 - 14:58 (7+22:23) Load usage seems to be normal during that time 03:50:01 PM 2 1990 8.66 8.37 9.16 2 04:00:01 PM 11 1981 6.24 8.02 8.75 1 04:10:01 PM 15 1981 9.57 10.08 9.54 3 04:20:01 PM 2 1982 11.88 13.18 11.28 1 Average: 10 1983 10.40 10.50 10.42 2 04:35:02 PM LINUX RESTART 04:40:01 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 blocked 04:50:01 PM 2 1917 14.27 12.43 7.79 2 05:00:01 PM 5 1939 13.30 13.36 10.71 3 05:10:01 PM 11 1931 11.81 12.92 11.84 3 05:20:01 PM 20 1924 17.28 14.74 13.02 2 Most of the restart, could see the message "blk_update_request: critical medium error" always. [463108.636890] blk_update_request: critical medium error, dev nvme3n1, sector 5325696536 [622470.873927] md: md0: data-check done. [955488.123023] md: data-check of RAID array md0 [1028448.215074] ------------[ cut here ]------------ [1028448.215083] WARNING: CPU: 7 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x136/0x150 [1028448.215088] percpu ref (no_op) <= 0 (0) after switching to atomic [1028448.215089] Modules linked in: [1028448.215091] vfat fat osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) mbcache jbd2 obdclass(OE) lnet(OE) libcfs(OE) dell_rbu nfsv3 nfs fscache binfmt_misc bonding iTCO_wdt iTCO_vendor_support mxm_wmi dcdbas sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass raid10 sg lpc_ich mei_me mei ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace ip_tables xfs sd_mod crc_t10dif crct10dif_generic 8021q garp mrp stp llc mgag200 i2c_algo_bit drm_kms_helper bnx2x syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel [1028448.215136] ahci drm aesni_intel lrw gf128mul libahci glue_helper ablk_helper cryptd scsi_transport_iscsi libata megaraid_sas nvme ptp nvme_core pps_core drm_panel_orientation_quirks mdio libcrc32c dm_multipath sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: usb_storage] [1028448.215163] CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.18.1.el7_lustre.x86_64 #1 [1028448.215164] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.11.0 11/02/2019 [1028448.215165] Call Trace: [1028448.215167] <IRQ> [<ffffffff8cf7b416>] dump_stack+0x19/0x1b [1028448.215178] [<ffffffff8c89bab8>] __warn+0xd8/0x100 [1028448.215180] [<ffffffff8c89bb3f>] warn_slowpath_fmt+0x5f/0x80 [1028448.215185] [<ffffffff8c958b6d>] ? rcu_advance_cbs+0xcd/0xe0 [1028448.215187] [<ffffffff8cd9b740>] ? null_show+0x20/0x20 [1028448.215189] [<ffffffff8cb86cd6>] percpu_ref_switch_to_atomic_rcu+0x136/0x150 [1028448.215192] [<ffffffff8c95a4f8>] rcu_process_callbacks+0x1d8/0x570 [1028448.215195] [<ffffffff8c8a5435>] __do_softirq+0xf5/0x280 [1028448.215200] [<ffffffff8cf9242c>] call_softirq+0x1c/0x30 [1028448.215204] [<ffffffff8c82f715>] do_softirq+0x65/0xa0 [1028448.215206] [<ffffffff8c8a57b5>] irq_exit+0x105/0x110 [1028448.215209] [<ffffffff8cf939d8>] smp_apic_timer_interrupt+0x48/0x60 [1028448.215211] [<ffffffff8cf8fefa>] apic_timer_interrupt+0x16a/0x170 [1028448.215213] <EOI> [<ffffffff8cdc2027>] ? 
cpuidle_enter_state+0x57/0xd0 [1028448.215218] [<ffffffff8cdc217e>] cpuidle_idle_call+0xde/0x230 [1028448.215221] [<ffffffff8c837c6e>] arch_cpu_idle+0xe/0xc0 [1028448.215226] [<ffffffff8c9017da>] cpu_startup_entry+0x14a/0x1e0 [1028448.215231] [<ffffffff8c85a0c7>] start_secondary+0x1f7/0x270 [1028448.215234] [<ffffffff8c8000d5>] start_cpu+0x5/0x14 [1028448.215236] ---[ end trace 4441139b69972efc ]--- [1136636.662352] blk_update_request: critical medium error, dev nvme3n1, sector 10406963584 [1136636.710304] blk_update_request: critical medium error, dev nvme3n1, sector 10406967424 [1165058.648561] md: md0: data-check done. [1560227.473353] md: data-check of RAID array md0 [1745005.678257] blk_update_request: critical medium error, dev nvme3n1, sector 9927251728 [1779948.753709] md: md0: data-check done. [1852184.120501] blk_update_request: critical medium error, dev nvme3n1, sector 9334756896 [1852184.120532] md/raid10:md0: nvme3n1: rescheduling sector 18668989512 [1852184.120644] blk_update_request: critical medium error, dev nvme3n1, sector 9334757280 [1852184.120673] md/raid10:md0: nvme3n1: rescheduling sector 18668990280 [1852184.120952] blk_update_request: critical medium error, dev nvme3n1, sector 9334756896 [1852184.121138] md/raid10:md0: read error corrected (8 sectors at 262144 on nvme3n1) [1852184.121148] md/raid10:md0: nvme2n1: redirecting sector 18668989512 to another mirror [1852184.121526] blk_update_request: critical medium error, dev nvme3n1, sector 9334757280 [1852184.121709] md/raid10:md0: read error corrected (8 sectors at 262144 on nvme3n1) [1852184.121720] md/raid10:md0: nvme2n1: redirecting sector 18668990280 to another mirror [2164966.710539] md: data-check of RAID array md0 [2183165.845838] blk_update_request: critical medium error, dev nvme3n1, sector 917466240 [2183165.846403] blk_update_request: critical medium error, dev nvme3n1, sector 917466280 [2431150.697903] md: md0: data-check done. [2769705.550545] md: data-check of RAID array md0 [2969839.376579] sector 10812151448 [2993948.791186] md: md0: data-check done. [3170956.989559] ------------[ cut here ]------------ [3170956.989584] kernel BUG at /tmp/rpmbuild-lustre-jenkins-KWaNslf3/BUILD/lustre-2.12.5_62_gac40c31/ldiskfs/htree_lock.c:429! 
[3170956.989617] invalid opcode: 0000 [#1] SMP [3170956.989634] Modules linked in: vfat fat osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) mbcache jbd2 obdclass(OE) lnet(OE) libcfs(OE) dell_rbu nfsv3 nfs fscache binfmt_misc bonding iTCO_wdt iTCO_vendor_support mxm_wmi dcdbas sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass raid10 sg lpc_ich mei_me mei ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace ip_tables xfs sd_mod crc_t10dif crct10dif_generic 8021q garp mrp stp llc mgag200 i2c_algo_bit drm_kms_helper bnx2x syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel [3170956.989935] ghash_clmulni_intel ahci drm aesni_intel lrw gf128mul libahci glue_helper ablk_helper cryptd scsi_transport_iscsi libata megaraid_sas nvme ptp nvme_core pps_core drm_panel_orientation_quirks mdio libcrc32c dm_multipath sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: usb_storage] [3170956.990047] CPU: 28 PID: 17088 Comm: mdt00_017 Kdump: loaded Tainted: G W OE ------------ 3.10.0-1062.18.1.el7_lustre.x86_64 #1 [3170956.990085] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.11.0 11/02/2019 [3170956.990109] task: ffff9de935f041c0 ti: ffff9de74522c000 task.ti: ffff9de74522c000 [3170956.990132] RIP: 0010:[<ffffffffc0a86284>] [<ffffffffc0a86284>] htree_node_unlock+0x4b4/0x4c0 [ldiskfs] [3170956.990179] RSP: 0018:ffff9de74522f8b0 EFLAGS: 00010246 [3170956.990197] RAX: ffff9de933d98600 RBX: 0000000000000001 RCX: ffff9e0936a5b290 [3170956.990219] RDX: 00000000000000c8 RSI: 0000000000000001 RDI: 0000000000000000 [3170956.990241] RBP: ffff9de74522f928 R08: ffff9dcca9a01d00 R09: ffff9dda59559500 [3170956.990263] R10: 0000000000000000 R11: ffff9dcd260b9258 R12: ffff9e0936a5b2d8 [3170956.990285] R13: 0000000000000000 R14: ffff9de7ffea8a00 R15: ffff9dcd260b91e0 [3170956.990308] FS: 0000000000000000(0000) GS:ffff9de93f980000(0000) knlGS:0000000000000000 [3170956.990332] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [3170956.991222] CR2: 00007fdb46c25000 CR3: 0000003ff39b8000 CR4: 00000000003607e0 [3170956.992087] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [3170956.992943] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [3170956.993790] Call Trace: [3170956.994639] [<ffffffffc0a8646a>] htree_node_release_all+0x5a/0x80 [ldiskfs] [3170956.995505] [<ffffffffc0a864b2>] htree_unlock+0x22/0x70 [ldiskfs] [3170956.996363] [<ffffffffc0ad7b1e>] osd_index_ea_delete+0x30e/0xb10 [osd_ldiskfs] [3170956.997244] [<ffffffffc138efa8>] lod_sub_delete+0x1c8/0x460 [lod] [3170956.998285] [<ffffffffc1369c24>] lod_delete+0x24/0x30 [lod] [3170956.999350] [<ffffffffc13de7c4>] __mdd_index_delete_only+0x194/0x250 [mdd] [3170957.000391] [<ffffffffc13e1136>] __mdd_index_delete+0x46/0x290 [mdd] [3170957.001480] [<ffffffffc13eee78>] mdd_unlink+0x5f8/0xaa0 [mdd] [3170957.002467] [<ffffffffc12b0b69>] mdo_unlink+0x46/0x48 [mdt] [3170957.003319] [<ffffffffc1274b1d>] mdt_reint_unlink+0xbed/0x14b0 [mdt] [3170957.004156] [<ffffffffc12791b3>] mdt_reint_rec+0x83/0x210 [mdt] [3170957.004990] [<ffffffffc1255383>] mdt_reint_internal+0x6e3/0xaf0 [mdt] [3170957.005829] [<ffffffffc12610f7>] mdt_reint+0x67/0x140 [mdt] [3170957.006704] [<ffffffffc0f86e8a>] tgt_request_handle+0xada/0x1570 [ptlrpc] [3170957.007724] 
[<ffffffffc0f605d1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [3170957.008755] [<ffffffffc082cbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [3170957.009747] [<ffffffffc0f2b83b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [3170957.010752] [<ffffffffc0f28655>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [3170957.011741] [<ffffffff8c8d3a33>] ? __wake_up+0x13/0x20 [3170957.012740] [<ffffffffc0f2f1a4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] [3170957.013630] [<ffffffffc0f2e670>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [3170957.014404] [<ffffffff8c8c6321>] kthread+0xd1/0xe0 [3170957.015169] [<ffffffff8c8c6250>] ? insert_kthread_work+0x40/0x40 [3170957.015910] [<ffffffff8cf8ed37>] ret_from_fork_nospec_begin+0x21/0x21 [3170957.016633] [<ffffffff8c8c6250>] ? insert_kthread_work+0x40/0x40 [3170957.017336] Code: 0f 0b 48 8b 45 90 8b 55 8c f3 90 0f a3 10 19 c9 85 c9 75 f5 f0 0f ab 10 19 c9 85 c9 0f 84 a4 fb ff ff eb e5 0f 1f 00 0f 0b 0f 0b <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 89 f0 48 [3170957.018817] RIP [<ffffffffc0a86284>] htree_node_unlock+0x4b4/0x4c0 [ldiskfs] [3170957.019524] RSP <ffff9de74522f8b0> |
| Comment by Oleg Drokin [ 29/Jun/21 ] |
|
The last one looks like
|
| Comment by Dneg (Inactive) [ 29/Oct/21 ] |
|
Hi Oleg, is this patch (for the reboot issue) included in the 2.12.6 release, or do we need to apply it manually on 2.12.6? |
| Comment by Peter Jones [ 29/Oct/21 ] |
|
The fix is queued up for the upcoming 2.12.8 release but, in the meantime, you could apply the patch - see https://review.whamcloud.com/#/c/44121/ |
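For anyone applying the change by hand, the usual Gerrit workflow against a lustre-release checkout looks roughly like this. It is only a sketch: the branch name and the patchset number at the end of the change ref (here /1) are assumptions, and the exact fetch command is shown in the Download box on the review page.

# clone the Lustre source and switch to the branch you actually build from
git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout b2_12            # or the exact 2.12.x tag in use locally

# fetch change 44121 and apply it (Gerrit refs are refs/changes/<last two digits>/<change>/<patchset>)
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/21/44121/1
git cherry-pick FETCH_HEAD    # then rebuild and reinstall the server packages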
| Comment by Dneg (Inactive) [ 03/Nov/21 ] |
|
Hi Peter, thanks for the update. We have applied the above patch on hotel and are monitoring it further. When is version 2.12.8 planned for release? |
| Comment by Peter Jones [ 03/Nov/21 ] |
|
The timeline will be affected by third-party software releases, so it is a little unclear, but it seems very likely to be before the year is out. |