[LU-14802] MGS configuration problems - cannot add new OST, change parameters, hanging Created: 01/Jul/21  Updated: 27/Aug/22  Resolved: 31/Jan/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Critical
Reporter: Matt Rásó-Barnett (Inactive) Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None
Environment:
Server Environment
# Lustre 2.12.5
[root@rds-mds7 ~]# lfs --version
lfs 2.12.5

# Kernel
[root@rds-mds7 ~]# uname -r
3.10.0-1127.8.2.el7_lustre.x86_64

# MOFED 4.9
[root@rds-mds7 ~]# ofed_info -n
4.9-0.1.7.0.202008231437

# Filesystem
4 MDTs, rds-d6-MDT000[0-3]
64 OSTs initially - attempting to expand this with additional 16 OSTs
Client Environment
# Lustre 2.12.6 almost everywhere 
[root@gpu-e-80 ~]# lfs --version
lfs 2.12.6

# Mixture of RHEL 7.9 and RHEL 8.3 clients
[root@gpu-e-80 ~]# uname -r
3.10.0-1160.31.1.el7.csd3.x86_64

# Mixture of MOFED 4.9 and MOFED 5.1
[root@gpu-e-80 ~]# ofed_info -n
4.9-2.2.4.0

Attachments: File lustre-log.1624829616.12985.gz     File lustre-log.1625135231.204490.gz     File rds-md7.kernel.2021-06-26.log.gz     File rds-oss59.kernel.2021-06-27.log.gz    
Issue Links:
Duplicate
is duplicated by LU-14855 nodemap_add_range failed: MGS: OBD_IO... Resolved
Related
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

Hello,
Last Sunday evening I attempted to add 16 additional OSTs to this filesystem and immediately ran into difficulties, with the majority of these OSTs failing to be added. Since then, the MGS has been showing a number of errors in its logs and is failing to accept any new configuration (lctl set_param/conf_param); all clients and server OST/MDT targets take a very long time to connect to the MGS, or sometimes fail to connect at all, leaving them unable to mount.

We've had a number of lengthy periods of filesystem unavailability due to this since the weekend, mostly when I've tried to unmount/remount the MGS, which is colocated on an MDS server with some of the MDTs in the filesystem; the act of unmounting has caused the MDS load to spike and access to those MDTs to hang from all clients.

The filesystem is currently up for existing clients, but rebooted clients are failing to mount the filesystem again, hence the severity here.

Here are the MDS and OSS server kernel logs at the time the first OSTs were added on Sunday:

MDS Server logs Sun 27th June
Jun 27 22:28:18 rds-mds7 kernel: Lustre: ctl-rds-d6-MDT0000: super-sequence allocation rc = 0 [0x0000001300000400-0x0000001340000400]:40:ost
Jun 27 22:28:24 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.44.163.219@o2ib2: 0 seconds
Jun 27 22:28:24 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 177 previous similar messages
Jun 27 22:30:05 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.44.163.219@o2ib2: 0 seconds
Jun 27 22:30:05 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 2 previous similar messages
Jun 27 22:30:37 rds-mds7 kernel: Lustre: rds-d6-MDT0002: Connection restored to 0bb09763-547b-56c8-0a7a-c2613e26070a (at 10.47.2.157@o2ib1)
Jun 27 22:30:37 rds-mds7 kernel: Lustre: Skipped 216 previous similar messages
Jun 27 22:31:55 rds-mds7 kernel: Lustre: rds-d6-MDT0000: Client 472702b7-5951-4007-482a-a3ef0f9d65fb (at 10.44.161.6@o2ib2) reconnecting
Jun 27 22:31:55 rds-mds7 kernel: Lustre: Skipped 186 previous similar messages
Jun 27 22:32:35 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.44.163.219@o2ib2: 0 seconds
Jun 27 22:32:35 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 4 previous similar messages
Jun 27 22:32:52 rds-mds7 kernel: LustreError: 166-1: MGC10.44.241.1@o2ib2: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Jun 27 22:32:52 rds-mds7 kernel: LustreError: 8092:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829272, 300s ago), entering recovery for MGS@10.44.241.1@o2ib2 ns: MGC10.44.241.1@o2ib2 lock: ffff975ed48c3840/0x184fda0e60b72a0e lrc: 4/1,0 m
Jun 27 22:33:09 rds-mds7 kernel: LustreError: 8162:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829289, 300s ago); not entering recovery in server code, just going back to sleep ns: MGS lock: ffff975c83dbd100/0x184fda0e60b67c50 lrc: 3/0,1
Jun 27 22:33:09 rds-mds7 kernel: LustreError: 8162:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) Skipped 43 previous similar messages
Jun 27 22:33:37 rds-mds7 kernel: LNet: Service thread pid 12985 was inactive for 200.33s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 27 22:33:37 rds-mds7 kernel: LNet: Skipped 3 previous similar messages
Jun 27 22:33:37 rds-mds7 kernel: Pid: 12985, comm: ll_mgs_0005 3.10.0-1127.8.2.el7_lustre.x86_64 #1 SMP Sun Aug 23 13:52:28 UTC 2020
Jun 27 22:33:37 rds-mds7 kernel: Call Trace:
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc1313c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc16a076c>] mgs_completion_ast_generic+0x5c/0x200 [mgs]
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc16a0983>] mgs_completion_ast_config+0x13/0x20 [mgs]
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc13147b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc16a53b4>] mgs_revoke_lock+0x104/0x380 [mgs]
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc16a5ad2>] mgs_target_reg+0x4a2/0x1320 [mgs]
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc13b29da>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc135748b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc135adf4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffff964c6691>] kthread+0xd1/0xe0
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffff96b92d1d>] ret_from_fork_nospec_begin+0x7/0x21
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
Jun 27 22:33:37 rds-mds7 kernel: LustreError: dumping log to /tmp/lustre-log.1624829616.12985
Jun 27 22:34:53 rds-mds7 kernel: Lustre: MGS: Received new LWP connection from 10.47.1.153@o2ib1, removing former export from same NID
Jun 27 22:34:53 rds-mds7 kernel: Lustre: Skipped 1998 previous similar messages
Jun 27 22:35:24 rds-mds7 kernel: LustreError: 12985:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829424, 300s ago); not entering recovery in server code, just going back to sleep ns: MGS lock: ffff97351fb81200/0x184fda0e60c14203 lrc: 3/0,
Jun 27 22:35:25 rds-mds7 kernel: LustreError: 140-5: Server rds-d6-OST0041 requested index 65, but that index is already in use. Use --writeconf to force
Jun 27 22:35:25 rds-mds7 kernel: LustreError: 13001:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST0041 log (-98)
Jun 27 22:35:26 rds-mds7 kernel: LustreError: 140-5: Server rds-d6-OST0041 requested index 65, but that index is already in use. Use --writeconf to force
Jun 27 22:35:26 rds-mds7 kernel: LustreError: 13001:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST0041 log (-98)
Jun 27 22:37:52 rds-mds7 kernel: LustreError: 166-1: MGC10.44.241.1@o2ib2: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Jun 27 22:37:52 rds-mds7 kernel: LustreError: 8092:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829572, 300s ago), entering recovery for MGS@10.44.241.1@o2ib2 ns: MGC10.44.241.1@o2ib2 lock: ffff9745624b8240/0x184fda0e60cb24e5 lrc: 4/1,0 m
Jun 27 22:37:52 rds-mds7 kernel: LustreError: 205913:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.44.241.1@o2ib2: namespace resource [0x36642d736472:0x0:0x0].0x0 (ffff973ee9f4cc00) refcount nonzero (1) after lock cleanup; forcing cleanup.
Jun 27 22:37:52 rds-mds7 kernel: LustreError: 8092:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
Jun 27 22:39:41 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.44.163.219@o2ib2: 0 seconds
Jun 27 22:39:41 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 3 previous similar messages
Jun 27 22:40:11 rds-mds7 kernel: Lustre: 13006:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
 req@ffff974d89171850 x1703756897980928/t0(0) o253->dec8c782-f0e7-81a2-d397-7bf60e27e1ff@10.44.241.25@o2ib2:556/0 lens 4768/4768 e 20 to 0 dl 1624830016 ref 2 fl Interpret:/0/0 rc 0/0
Jun 27 22:40:11 rds-mds7 kernel: Lustre: 13006:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 131 previous similar messages
Jun 27 22:40:49 rds-mds7 kernel: Lustre: rds-d6-MDT0003: Connection restored to 2f43bc44-4d4e-e6e8-8ca0-dd0bfa20fa04 (at 10.47.2.114@o2ib1)
Jun 27 22:40:49 rds-mds7 kernel: Lustre: Skipped 4151 previous similar messages
Jun 27 22:41:59 rds-mds7 kernel: Lustre: rds-d6-MDT0002: Client 03c49b2b-134c-390d-9059-dc732d34be0b (at 10.47.1.253@o2ib1) reconnecting
Jun 27 22:41:59 rds-mds7 kernel: Lustre: Skipped 179 previous similar messages
Jun 27 22:42:52 rds-mds7 kernel: LustreError: 166-1: MGC10.44.241.1@o2ib2: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
Jun 27 22:42:52 rds-mds7 kernel: LustreError: 8092:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829872, 300s ago), entering recovery for MGS@10.44.241.1@o2ib2 ns: MGC10.44.241.1@o2ib2 lock: ffff9738f01418c0/0x184fda0e60faac19 lrc: 4/1,0 m
Jun 27 22:42:52 rds-mds7 kernel: LustreError: 212692:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.44.241.1@o2ib2: namespace resource [0x36642d736472:0x0:0x0].0x0 (ffff9724c6f7a6c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
Jun 27 22:42:52 rds-mds7 kernel: LustreError: 8092:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
OSS Server logs
Jun 27 22:26:34 rds-oss59 kernel: LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5
Jun 27 22:26:34 rds-oss59 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Jun 27 22:26:35 rds-oss59 kernel: Lustre: Lustre: Build Version: 2.12.5
Jun 27 22:26:35 rds-oss59 kernel: LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5
Jun 27 22:26:35 rds-oss59 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Jun 27 22:27:24 rds-oss59 kernel: LustreError: 137-5: rds-d6-OST0040_UUID: not available for connect from 10.47.0.2@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun 27 22:27:24 rds-oss59 kernel: LustreError: Skipped 135 previous similar messages
Jun 27 22:27:24 rds-oss59 kernel: LustreError: 137-5: rds-d6-OST0040_UUID: not available for connect from 10.44.163.107@o2ib2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun 27 22:27:24 rds-oss59 kernel: LustreError: Skipped 1088 previous similar messages
Jun 27 22:27:25 rds-oss59 kernel: LustreError: 137-5: rds-d6-OST0040_UUID: not available for connect from 10.44.161.43@o2ib2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun 27 22:27:25 rds-oss59 kernel: LustreError: Skipped 588 previous similar messages
Jun 27 22:27:27 rds-oss59 kernel: LustreError: 137-5: rds-d6-OST0040_UUID: not available for connect from 10.44.18.2@o2ib2 (no target). If you are running an HA pair check that the target is mounted on the other server.
Jun 27 22:27:27 rds-oss59 kernel: LustreError: Skipped 55 previous similar messages
Jun 27 22:27:42 rds-oss59 kernel: Lustre: rds-d6-OST0040: new disk, initializing
Jun 27 22:27:42 rds-oss59 kernel: Lustre: rds-d6-OST0040: Not available for connect from 10.44.161.6@o2ib2 (not set up)
Jun 27 22:27:43 rds-oss59 kernel: Lustre: srv-rds-d6-OST0040: No data found on store. Initialize space
Jun 27 22:27:43 rds-oss59 kernel: Lustre: rds-d6-OST0040: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
Jun 27 22:28:13 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.47.2.15@o2ib1)
Jun 27 22:28:13 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.47.1.172@o2ib1)
Jun 27 22:28:14 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.47.21.88@o2ib1)
Jun 27 22:28:14 rds-oss59 kernel: Lustre: Skipped 8 previous similar messages
Jun 27 22:28:15 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.44.74.52@o2ib2)
Jun 27 22:28:15 rds-oss59 kernel: Lustre: Skipped 1456 previous similar messages
Jun 27 22:28:17 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.43.12.31@tcp2)
Jun 27 22:28:17 rds-oss59 kernel: Lustre: Skipped 447 previous similar messages
Jun 27 22:28:18 rds-oss59 kernel: Lustre: cli-rds-d6-OST0040-super: Allocated super-sequence [0x0000001300000400-0x0000001340000400]:40:ost]
Jun 27 22:28:32 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.44.161.6@o2ib2)
Jun 27 22:28:32 rds-oss59 kernel: Lustre: Skipped 45 previous similar messages
Jun 27 22:29:34 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.43.240.201@tcp2)
Jun 27 22:29:34 rds-oss59 kernel: Lustre: Skipped 1 previous similar message
Jun 27 22:29:42 rds-oss59 kernel: Lustre: rds-d6-OST0040: Client 472702b7-5951-4007-482a-a3ef0f9d65fb (at 10.44.161.6@o2ib2) reconnecting
Jun 27 22:30:14 rds-oss59 kernel: LDISKFS-fs (dm-7): file extents enabled, maximum tree depth=5
Jun 27 22:30:14 rds-oss59 kernel: LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Jun 27 22:30:14 rds-oss59 kernel: LDISKFS-fs (dm-7): file extents enabled, maximum tree depth=5
Jun 27 22:30:15 rds-oss59 kernel: LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Jun 27 22:30:35 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.44.6.58@o2ib2)
Jun 27 22:30:35 rds-oss59 kernel: Lustre: Skipped 2 previous similar messages
Jun 27 22:32:15 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.47.7.13@o2ib1)
Jun 27 22:33:35 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.47.7.15@o2ib1)
Jun 27 22:33:35 rds-oss59 kernel: Lustre: Skipped 1 previous similar message
Jun 27 22:35:23 rds-oss59 kernel: LustreError: 166-1: MGC10.44.241.1@o2ib2: Connection to MGS (at 10.44.241.1@o2ib2) was lost; in progress operations using this service will fail
Jun 27 22:35:23 rds-oss59 kernel: LustreError: 258659:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829423, 300s ago), entering recovery for MGS@10.44.241.1@o2ib2 ns: MGC10.44.241.1@o2ib2 lock: ffff9bc4d83c8b40/0xc34bf6a4691f74a9 lrc: 4/1,
Jun 27 22:35:23 rds-oss59 kernel: LustreError: 263517:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.44.241.1@o2ib2: namespace resource [0x36642d736472:0x0:0x0].0x0 (ffff9bc4d59bf5c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
Jun 27 22:35:23 rds-oss59 kernel: LustreError: 258659:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
Jun 27 22:35:25 rds-oss59 kernel: LustreError: 15f-b: rds-d6-OST0041: cannot register this server with the MGS: rc = -98. Is the MGS running?
Jun 27 22:35:25 rds-oss59 kernel: LustreError: 262284:0:(obd_mount_server.c:1992:server_fill_super()) Unable to start targets: -98
Jun 27 22:35:25 rds-oss59 kernel: LustreError: 262284:0:(obd_mount_server.c:1600:server_put_super()) no obd rds-d6-OST0041
Jun 27 22:35:25 rds-oss59 kernel: LustreError: 262284:0:(obd_mount_server.c:134:server_deregister_mount()) rds-d6-OST0041 not registered
Jun 27 22:35:25 rds-oss59 kernel: Lustre: server umount rds-d6-OST0041 complete
Jun 27 22:35:25 rds-oss59 kernel: LustreError: 262284:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount  (-98)

The logs above show OST0040 being added successfully, but OST0041 failing to be added, and all subsequent OSTs (they were started together) failing in the same way.

The Lustre internal log dump file mentioned above, lustre-log.1624829616.12985, is attached.
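For reference, these lustre-log.* files are binary Lustre debug dumps; they can be converted to readable text with 'lctl debug_file', for example:

lctl debug_file lustre-log.1624829616.12985 lustre-log.1624829616.12985.txt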

Since Sunday, I have managed to mount some of the new OSTs successfully, through a mixture of restarting the MGS and reformatting the new OSTs with the '--replace' flag to get around the error mentioned in the log:

Jun 27 22:35:24 rds-mds7 kernel: LustreError: 12985:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829424, 300s ago); not entering recovery in server code, just going back to sleep ns: MGS lock: ffff97351fb81200/0x184fda0e60c14203 lrc: 3/0,
Jun 27 22:35:25 rds-mds7 kernel: LustreError: 140-5: Server rds-d6-OST0041 requested index 65, but that index is already in use. Use --writeconf to force
Jun 27 22:35:25 rds-mds7 kernel: LustreError: 13001:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST0041 log (-98)

This worked for a couple of OSTs at a time, before the MGS would hang again and refuse any further configuration. So far I have managed to mount about 8 of the 16 OSTs that I was intending to add on Sunday.

I've been looking around on Jira since Sunday for similar tickets and I came across:
https://jira.whamcloud.com/browse/LU-12735
https://jira.whamcloud.com/browse/LU-14695
which both look very much like what I'm experiencing here.

Following the discussion in LU-12735, I tried the patch mentioned in https://jira.whamcloud.com/browse/LU-13356 (https://review.whamcloud.com/41309), applied on top of 2.12.5 (which is the current server release we are running), and installed this patched version only on the server running the MGS.
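For reference, applying a Gerrit change such as 41309 on top of a release tag generally looks something like the following; this is only a sketch, and the tag name, patchset number and configure options here are assumptions rather than the exact commands used:

git clone git://git.whamcloud.com/fs/lustre-release.git && cd lustre-release
git checkout 2.12.5                                        # release tag (name assumed)
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/09/41309/<patchset>
git cherry-pick FETCH_HEAD                                 # backport the LU-13356 change
sh autogen.sh && ./configure --enable-server && make rpms  # build patched server RPMs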

Unfortunately this doesn't appear to have fixed the issue for us. This is me subsequently trying to add an OST after restarting the MGS onto this new version:

Jun 30 22:29:10 rds-mds8 kernel: LustreError: 142-7: The target rds-d6-OST004b has not registered yet. It must be started before failnids can be added.
Jun 30 22:29:10 rds-mds8 kernel: LustreError: 166987:0:(mgs_llog.c:4301:mgs_write_log_param()) err -2 on param 'failover.node=10.44.241.26@o2ib2:10.44.241.25@o2ib2'
Jun 30 22:29:10 rds-mds8 kernel: LustreError: 166987:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST004b log (-2)

I am also getting errors from rebooted clients that are failing to mount the filesystem:

Jul 01 10:28:14 lustre-transfer-02 kernel: LustreError: 22964:0:(mgc_request.c:249:do_config_log_add()) MGC10.44.241.1@o2ib2: failed processing log, type 1: rc = -5
Jul 01 10:28:14 lustre-transfer-02 kernel: LustreError: 22964:0:(mgc_request.c:249:do_config_log_add()) Skipped 2 previous similar messages
Jul 01 10:28:33 lustre-transfer-02 kernel: Lustre: rds-d6: root_squash is set to 99:99
Jul 01 10:28:33 lustre-transfer-02 kernel: LustreError: 22964:0:(llite_lib.c:516:client_common_fill_super()) cannot mds_connect: rc = -2
Jul 01 10:28:33 lustre-transfer-02 kernel: Lustre: rds-d6: nosquash_nids set to 10.44.241.[1-2]@o2ib2 10.44.161.[81-84]@o2ib2 10.44.71.[2-3]@o2ib2 10.43.240.[201-202]@tcp2 10.43.240.[198-199]@tcp2 10.144.9.[50-51]@o2ib 10.47.4.[210-213]@o2ib1
Jul 01 10:28:33 lustre-transfer-02 kernel: Lustre: Unmounted rds-d6-client

I'm not sure what to try next. Can I get some support in tracking down what is wrong with the MGS?

Thanks,
Matt

 Comments   
Comment by Matt Rásó-Barnett (Inactive) [ 01/Jul/21 ]

Another thread on this problem - we have a second, identical filesystem that shows the same behaviour, again triggered by attempting to add OSTs to it on Sunday evening.

I just tried unmounting the MGS, as clients were unable to mount the filesystem, and it generated the following stack trace and Lustre debug log while it hung for some time before unmounting. Just thought I would add it in case this particular trace helps:

rds-mds10 MGS log on unmount of the MGT:
Jul 01 11:27:11 rds-mds10 kernel: LNet: Service thread pid 204490 was inactive for 200.34s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jul 01 11:27:11 rds-mds10 kernel: Pid: 204490, comm: ll_mgs_0002 3.10.0-1127.8.2.el7_lustre.x86_64 #1 SMP Sun Aug 23 13:52:28 UTC 2020
Jul 01 11:27:11 rds-mds10 kernel: Call Trace:
Jul 01 11:27:11 rds-mds10 kernel:  [<ffffffffc108ac90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
Jul 01 11:27:11 rds-mds10 kernel:  [<ffffffffc141776c>] mgs_completion_ast_generic+0x5c/0x200 [mgs]
Jul 01 11:27:11 rds-mds10 kernel:  [<ffffffffc1417983>] mgs_completion_ast_config+0x13/0x20 [mgs]
Jul 01 11:27:11 rds-mds10 kernel:  [<ffffffffc108b7b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
Jul 01 11:27:11 rds-mds10 kernel:  [<ffffffffc141c3b4>] mgs_revoke_lock+0x104/0x380 [mgs]
Jul 01 11:27:11 rds-mds10 kernel:  [<ffffffffc141cad2>] mgs_target_reg+0x4a2/0x1320 [mgs]
Jul 01 11:27:11 rds-mds10 kernel:  [<ffffffffc11299da>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Jul 01 11:27:11 rds-mds10 kernel:  [<ffffffffc10ce48b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Jul 01 11:27:11 rds-mds10 kernel:  [<ffffffffc10d1df4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Jul 01 11:27:11 rds-mds10 kernel:  [<ffffffff82ac6691>] kthread+0xd1/0xe0
Jul 01 11:27:11 rds-mds10 kernel:  [<ffffffff83192d1d>] ret_from_fork_nospec_begin+0x7/0x21
Jul 01 11:27:11 rds-mds10 kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
Jul 01 11:27:11 rds-mds10 kernel: LustreError: dumping log to /tmp/lustre-log.1625135231.204490

lustre-log.1625135231.204490.gz

Comment by Peter Jones [ 01/Jul/21 ]

Mike

Could you please advise?

Thanks

Peter

Comment by Matt Rásó-Barnett (Inactive) [ 05/Jul/21 ]

Hi Mike,
My suspicion before opening this ticket was that I was experiencing the same as this ticket: https://jira.whamcloud.com/browse/LU-12735 which references a patch under https://jira.whamcloud.com/browse/LU-13356 that should fix the problem.

I tried to test this by updating only the MGS server to a 2.12.5 build with this patch backported, and nothing else. However, looking more closely at the commit, is this actually a client patch? If so, should I update all the clients in our environment with this patch to see the effect? And what about the servers - is there any reason for them to have this patch as well?

Thanks for your time,
Matt

Comment by Mikhail Pershin [ 05/Jul/21 ]

Matt, yes, the patch you are referring to is a client/server patch; both parts are needed. Meanwhile, I tend to think that we originally had an LNET problem here and that this caused the related problem with the MGS. Could you also collect the LNET configuration:
 

lnetctl global show
lnetctl net show -v 4

also the configuration defined in ko2iblnd.conf and the values in /sys/module/lnet/parameters/ and /sys/module/ko2iblnd/parameters/, if any.
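For example, something like this should capture all of that in one go (file paths assuming a standard installation):

lnetctl global show
lnetctl net show -v 4
cat /etc/modprobe.d/lnet.conf /etc/modprobe.d/ko2iblnd.conf /etc/lnet.conf
grep . /sys/module/lnet/parameters/* /sys/module/ko2iblnd/parameters/*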

Comment by Matt Rásó-Barnett (Inactive) [ 06/Jul/21 ]

Here is the LNET configuration for the servers:

Server Config
[root@rds-mds8 ~]# lnetctl global show
global:
    numa_range: 0
    max_intf: 200
    discovery: 1
    drop_asym_route: 0
    retry_count: 2
    transaction_timeout: 50
    health_sensitivity: 100
    recovery_interval: 1

[root@rds-mds8 ~]# lnetctl net show -v 4
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 38220
              recv_count: 38220
              drop_count: 0
          sent_stats:
              put: 38220
              get: 0
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 38220
              get: 0
              reply: 0
              ack: 0
              hello: 0
          dropped_stats:
              put: 0
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 0
              interrupts: 0
              dropped: 0
              aborted: 0
              no route: 0
              timeouts: 0
              error: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          dev cpt: 0
          tcp bonding: 0
          CPT: "[0,1]"
      local NI(s):
        - nid: 10.44.241.2@o2ib2
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 41253701
              recv_count: 40228442
              drop_count: 1038348
          sent_stats:
              put: 41184595
              get: 69106
              reply: 0
              ack: 0
              hello: 0
          received_stats:
              put: 40150520
              get: 3158
              reply: 65947
              ack: 8817
              hello: 0
          dropped_stats:
              put: 1038348
              get: 0
              reply: 0
              ack: 0
              hello: 0
          health stats:
              health value: 1000
              interrupts: 0
              dropped: 311
              aborted: 0
              no route: 0
              timeouts: 102
              error: 0
          tunables:
              peer_timeout: 0
              peer_credits: 16
              peer_buffer_credits: 0
              credits: 2048
              peercredits_hiw: 8
              map_on_demand: 256
              concurrent_sends: 16
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 1
          lnd tunables:
          dev cpt: 0
          tcp bonding: 0
          CPT: "[0,1]"

[root@rds-mds8 ~]# cat /etc/modprobe.d/ko2iblnd.conf
# Ansible managed

# Currently it isn't possible to auto-tune the o2iblnd parameters optimally
# inside the kernel since the OFED API hides the details from us.
# Unfortunately, there isn't a single set of parameters that provide optimal
# performance on different HCA/HFI types. This file provides optimized
# tunables for the OPA cards.
#
# ** Please note that the below settings are the recommended settings only for
#    OPA cards. If other IB cards are also present along with OPA cards then
#    these settings will be applied across all the configured IB interfaces.
#
# Card detection and tunable selection is handled via /usr/sbin/ko2iblnd-probe
# at runtime when the ko2iblnd module is installed, either at boot or when
# Lustre is first mounted.

alias ko2iblnd-opa ko2iblnd
options ko2iblnd-opa peer_credits=32 peer_credits_hiw=16 credits=1024 concurrent_sends=64 ntx=2048 map_on_demand=256 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
install ko2iblnd /usr/sbin/ko2iblnd-probe

[root@rds-mds8 ~]# cat /etc/modprobe.d/lnet.conf 
# Ansible managed
options lnet networks="o2ib2(ib0)" auto_down=1 avoid_asym_router_failure=1 check_routers_before_use=1 live_router_check_interval=60 dead_router_check_interval=60 router_ping_timeout=60

[root@rds-mds8 ~]# cat /etc/lnet.conf
# Ansible managed
---
# lnet.conf - configuration file for lnet routes to be imported by lnetctl
#
# This configuration file is formatted as YAML and can be imported
# by lnetctl.
net:
    - net type: o2ib2
      local NI(s):
        - nid: 10.44.241.2@o2ib2
          interfaces:
            0: ib0
          tunables:
              peer_credits: 16
              peer_credits_hiw: 8
              concurrent_sends: 16
              credits: 2048
          lnd tunables:
              map_on_demand: 256
              ntx: 2048
route:
    - net: o2ib0
      gateway: 10.44.240.165@o2ib2
    - net: o2ib0
      gateway: 10.44.240.166@o2ib2
    - net: o2ib0
      gateway: 10.44.240.167@o2ib2
    - net: o2ib0
      gateway: 10.44.240.168@o2ib2
    - net: o2ib1
      gateway: 10.44.240.161@o2ib2
    - net: o2ib1
      gateway: 10.44.240.162@o2ib2
    - net: o2ib1
      gateway: 10.44.240.163@o2ib2
    - net: o2ib1
      gateway: 10.44.240.164@o2ib2
    - net: o2ib1
      gateway: 10.44.240.165@o2ib2
    - net: o2ib1
      gateway: 10.44.240.166@o2ib2
    - net: o2ib1
      gateway: 10.44.240.167@o2ib2
    - net: o2ib1
      gateway: 10.44.240.168@o2ib2
    - net: tcp2
      gateway: 10.44.240.161@o2ib2
    - net: tcp2
      gateway: 10.44.240.162@o2ib2
    - net: tcp2
      gateway: 10.44.240.163@o2ib2
    - net: tcp2
      gateway: 10.44.240.164@o2ib2
    - net: tcp2
      gateway: 10.44.240.165@o2ib2
    - net: tcp2
      gateway: 10.44.240.166@o2ib2
    - net: tcp2
      gateway: 10.44.240.167@o2ib2
    - net: tcp2
      gateway: 10.44.240.168@o2ib2
    - net: tcp4
      gateway: 10.44.240.161@o2ib2
    - net: tcp4
      gateway: 10.44.240.162@o2ib2
    - net: tcp4
      gateway: 10.44.240.163@o2ib2
    - net: tcp4
      gateway: 10.44.240.164@o2ib2
    - net: tcp4
      gateway: 10.44.240.165@o2ib2
    - net: tcp4
      gateway: 10.44.240.166@o2ib2
    - net: tcp4
      gateway: 10.44.240.167@o2ib2
    - net: tcp4
      gateway: 10.44.240.168@o2ib2

[root@rds-mds8 ~]# grep . /sys/module/lnet/parameters/*                                                                                                                                                                                                                                    
/sys/module/lnet/parameters/accept:secure
/sys/module/lnet/parameters/accept_backlog:127
/sys/module/lnet/parameters/accept_port:988
/sys/module/lnet/parameters/accept_timeout:5
/sys/module/lnet/parameters/auto_down:1
/sys/module/lnet/parameters/avoid_asym_router_failure:1
/sys/module/lnet/parameters/check_routers_before_use:1
/sys/module/lnet/parameters/config_on_load:0
/sys/module/lnet/parameters/dead_router_check_interval:60
/sys/module/lnet/parameters/large_router_buffers:0
/sys/module/lnet/parameters/live_router_check_interval:60
/sys/module/lnet/parameters/lnet_drop_asym_route:0
/sys/module/lnet/parameters/lnet_health_sensitivity:100
/sys/module/lnet/parameters/lnet_interfaces_max:200
/sys/module/lnet/parameters/lnet_numa_range:0
/sys/module/lnet/parameters/lnet_peer_discovery_disabled:0
/sys/module/lnet/parameters/lnet_recovery_interval:1
/sys/module/lnet/parameters/lnet_retry_count:2
/sys/module/lnet/parameters/lnet_transaction_timeout:50
/sys/module/lnet/parameters/local_nid_dist_zero:1
/sys/module/lnet/parameters/networks:o2ib2(ib0)
/sys/module/lnet/parameters/peer_buffer_credits:0
/sys/module/lnet/parameters/portal_rotor:3
/sys/module/lnet/parameters/rnet_htable_size:128
/sys/module/lnet/parameters/router_ping_timeout:60
/sys/module/lnet/parameters/small_router_buffers:0
/sys/module/lnet/parameters/tiny_router_buffers:0
/sys/module/lnet/parameters/use_tcp_bonding:0

[root@rds-mds8 ~]# grep . /sys/module/ko2iblnd/parameters/*
/sys/module/ko2iblnd/parameters/cksum:0
/sys/module/ko2iblnd/parameters/concurrent_sends:0
/sys/module/ko2iblnd/parameters/conns_per_peer:1
/sys/module/ko2iblnd/parameters/credits:256
/sys/module/ko2iblnd/parameters/dev_failover:0
/sys/module/ko2iblnd/parameters/fmr_cache:1
/sys/module/ko2iblnd/parameters/fmr_flush_trigger:384
/sys/module/ko2iblnd/parameters/fmr_pool_size:512
/sys/module/ko2iblnd/parameters/ib_mtu:0
/sys/module/ko2iblnd/parameters/ipif_name:ib0
/sys/module/ko2iblnd/parameters/keepalive:100
/sys/module/ko2iblnd/parameters/map_on_demand:0
/sys/module/ko2iblnd/parameters/nscheds:0
/sys/module/ko2iblnd/parameters/ntx:512
/sys/module/ko2iblnd/parameters/peer_buffer_credits:0
/sys/module/ko2iblnd/parameters/peer_credits:8
/sys/module/ko2iblnd/parameters/peer_credits_hiw:0
/sys/module/ko2iblnd/parameters/peer_timeout:180
/sys/module/ko2iblnd/parameters/require_privileged_port:0
/sys/module/ko2iblnd/parameters/retry_count:5
/sys/module/ko2iblnd/parameters/rnr_retry_count:6
/sys/module/ko2iblnd/parameters/service:987
/sys/module/ko2iblnd/parameters/timeout:50
/sys/module/ko2iblnd/parameters/use_fastreg_gaps:0
/sys/module/ko2iblnd/parameters/use_privileged_port:1
/sys/module/ko2iblnd/parameters/wrq_sge:2

The servers here have Mellanox ConnectX-6 cards (HDR), and we are currently still using MOFED 4.9 on all the servers:

[root@rds-mds8 ~]# lspci -vvv | grep Mellanox
5e:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
        Subsystem: Mellanox Technologies Device 0006
[root@rds-mds8 ~]# ibstatus
Infiniband device 'mlx5_0' port 1 status:
        default gid:     fe80:0000:0000:0000:1c34:da03:0055:3a44
        base lid:        0x24e
        sm lid:          0x53
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            100 Gb/sec (2X HDR)
        link_layer:      InfiniBand

[root@rds-mds8 ~]# ofed_info -n
4.9-0.1.7.0.202008231437

 

The client environment is much more diverse, so if the logs indicate any particular clients/NIDs that are involved here I can provide more details on those. We also have LNET routers in the environment, and these route between IB, Intel OPA, and Ethernet.
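If it is useful, the routing table as seen from the servers can be dumped with, for example:

lnetctl route show -v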

The IB fabric is also relatively diverse, with generations of clients spanning FDR to EDR to HDR. Generally, though, these are all running up-to-date Lustre 2.12.6, on a mixture of RHEL 7.9 and RHEL 8.3, with MOFED 4.9 or 5.1.

 

Comment by Mikhail Pershin [ 08/Jul/21 ]

At the LNET level there are health stats for NID 10.44.241.2@o2ib2:

 health stats:
              health value: 1000
              interrupts: 0
              dropped: 311
              aborted: 0
              no route: 0
              timeouts: 102
              error: 0

It shows non-zero 'dropped' and 'timeouts' counters; though this is not the major problem you've seen, it is still worth tuning the LNET values to the following:

peer_credits = 32
peer_credits_hiw = 16
concurrent_sends = 64

lnet_transaction_timeout = 100
lnet_retry_count = 2 

 

As for the main MGS issue, the key change in the patch is in lustre_start_mgc(), which is used by both servers and clients to access the MGS, so ideally the patch should be applied to all nodes. Can you apply it on all servers at least?

Comment by Matt Rásó-Barnett (Inactive) [ 12/Jul/21 ]

Hi Mikhail, I can certainly adjust the LNET values to those recommended across our estate. Similarly with deploying the patch. We have a general maintenance coming up next week on the 21st, so we will aim to deploy these changes at this time.

Thanks,

Matt

Comment by Matt Rásó-Barnett (Inactive) [ 14/Jul/21 ]

Hi Mikhail, a slight side-question here: when setting the parameter lnet_transaction_timeout, I don't see a way to set it in the /etc/lnet.conf file.

It can apparently be configured dynamically via 'lnetctl set transaction_timeout', but I was hoping to find a configuration-file way to do it, something like e.g.:

# cat /etc/lnet.conf
...
global:
    transaction_timeout = 100
    retry_count = 2 

We can obviously configure the setting dynamically via a script, but we've moved all our tunables into the /etc/lnet.conf file now, so having a facility to set it via this file would be desirable simply for ease of management.
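As a stopgap we can of course script the dynamic form, which as far as I can tell would be something like:

lnetctl set transaction_timeout 100
lnetctl set retry_count 2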

Otherwise, we will roll out the patch and updated settings next week and will feed back after that on how it's looking.

Thanks,

Matt

Comment by Mikhail Pershin [ 16/Jul/21 ]

ssmirnov, could you please help with LNET settings here?

Comment by Serguei Smirnov [ 16/Jul/21 ]

Hi,

You should be able to set the transaction timeout and retry count via the lnet module options (e.g. in /etc/modprobe.d/lnet.conf) by adding the following line:

 options lnet lnet_retry_count=3 lnet_transaction_timeout=150

The o2ib LND parameters can be set like this:

options ko2iblnd peer_credits=32 peer_credits_hiw=16 concurrent_sends=64

Please make sure that the above is set uniformly across the cluster.
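After reloading the modules, you can sanity-check that the values took effect with, for example:

grep . /sys/module/lnet/parameters/lnet_transaction_timeout /sys/module/lnet/parameters/lnet_retry_count
grep . /sys/module/ko2iblnd/parameters/peer_credits /sys/module/ko2iblnd/parameters/concurrent_sends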

Thanks,

Serguei.

Comment by Matt Rásó-Barnett (Inactive) [ 16/Jul/21 ]

Thanks Serguei, I forgot about adding it to the module options - I've recently been using the dynamic YAML /etc/lnet.conf config so much that it was my instinctive first choice for adding the configuration.

Cheers,
Matt

Comment by Etienne Aujames [ 21/Jul/21 ]

Hello mrb,

Are you sure you are seeing the same logs before and after applying the LU-13356 patch?

The typical dmesg on MGC nodes:

LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567976846, 300s ago), entering recovery for MGS@MGC10.0.10.51@o2ib7_0 ns: MGC10.0.10.51@o2ib7 lock: ffff90b36ed157c0/0x98816ce9d089ad9b lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 2 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce9d089ada9 expref: -99 pid: 22397 timeout: 0 lvb_type: 0
LustreError: 22397:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) Skipped 1 previous similar message
LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff90b2677b6d80) refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 5858:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message

The MGCs (all the client nodes) keep trying to reconnect to the MGS indefinitely, because the MGS waits indefinitely for an MGC to release a lock: it cannot evict the client (the OBD_CONNECT_MDS_MDS flag is set by the client).

So the following call trace can be observed on the MGS:

Jun 27 22:33:37 rds-mds7 kernel: Call Trace:
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc1313c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc16a076c>] mgs_completion_ast_generic+0x5c/0x200 [mgs]
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc16a0983>] mgs_completion_ast_config+0x13/0x20 [mgs]
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc13147b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc16a53b4>] mgs_revoke_lock+0x104/0x380 [mgs]
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc16a5ad2>] mgs_target_reg+0x4a2/0x1320 [mgs]
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc13b29da>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc135748b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc135adf4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffff964c6691>] kthread+0xd1/0xe0
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffff96b92d1d>] ret_from_fork_nospec_begin+0x7/0x21
Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffffffffff>] 0xffffffffffffffff

 

After some hours/days (depending on the number of clients), we observed the load increase on the MGS node because res->lr_waiting became too large. The node became unresponsive and the HA killed it.

LU-13356 is a client and server patch, but after applying it the MGS should clear the connection flags OBD_CONNECT_MDS_MDS/OBD_CONNECT_MNE_SWAB by checking MGS_CONNECT_SUPPORTED.

The known workaround for this issue is to umount/mount the MGT, but it might crash the MGS.

The main difference between our issue and this one is the resource.
In our case it was an "Imperative Recovery"/CONFIG_T_RECOVER MGS resource, e.g. [0x6b616f:0x2:0x0] (note the 0x2 in the second field).
In this case it is a "Config"/CONFIG_T_CONFIG MGS resource (0x0 in the second field), because you are adding new OSTs: [0x36642d736472:0x0:0x0]

xxd -p -r <<< "36642d736472" && echo
6d-sdr (i.e. "rds-d6" reversed)

This issue could definitely be caused by network errors (an MGC becoming unresponsive while holding a read lock on the resource).

Comment by Matt Rásó-Barnett (Inactive) [ 30/Jul/21 ]

Hi all, I have an update following the upgrade of our servers and clients last week.

I have now updated all servers and the majority of our clients to 2.12.7 + https://jira.whamcloud.com/browse/LU-13356 patch.

Since that update we haven't had the same issues in most cases, which is good, so it appears the patch has worked for us. I have been able to update the MGS configuration and add new OSTs for one of the new filesystems, which was the original trigger for this ticket.

Unfortunately, however, I still have 6 OSTs on one of the other upgraded filesystems that refuse to mount and register. This is the case even after I reformat the OSTs:

MGS logs - post format of OST0049
Jul 30 15:22:34 rds-mds8 kernel: LustreError: 140-5: Server rds-d6-OST0049 requested index 73, but that index is already in use. Use --writeconf to force
Jul 30 15:22:34 rds-mds8 kernel: LustreError: 239688:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST0049 log (-98)
Jul 30 15:22:34 rds-mds8 kernel: LustreError: 239688:0:(mgs_handler.c:526:mgs_target_reg()) Skipped 2 previous similar messages
Jul 30 15:22:36 rds-mds8 kernel: LustreError: 140-5: Server rds-d6-OST0049 requested index 73, but that index is already in use. Use --writeconf to force
Jul 30 15:22:36 rds-mds8 kernel: LustreError: 239688:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST0049 log (-98)
OSS logs - post format of OST0049
Jul 30 15:22:28 rds-oss60 kernel: LDISKFS-fs (dm-11): file extents enabled, maximum tree depth=5
Jul 30 15:22:29 rds-oss60 kernel: LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Jul 30 15:22:29 rds-oss60 kernel: LDISKFS-fs (dm-11): file extents enabled, maximum tree depth=5
Jul 30 15:22:29 rds-oss60 kernel: LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Jul 30 15:22:35 rds-oss60 kernel: LustreError: 15f-b: rds-d6-OST0049: cannot register this server with the MGS: rc = -98. Is the MGS running?
Jul 30 15:22:35 rds-oss60 kernel: LustreError: Skipped 1 previous similar message
Jul 30 15:22:35 rds-oss60 kernel: LustreError: 120315:0:(obd_mount_server.c:1992:server_fill_super()) Unable to start targets: -98
Jul 30 15:22:35 rds-oss60 kernel: LustreError: 120315:0:(obd_mount_server.c:1600:server_put_super()) no obd rds-d6-OST0049
Jul 30 15:22:35 rds-oss60 kernel: LustreError: 120315:0:(obd_mount_server.c:134:server_deregister_mount()) rds-d6-OST0049 not registered
Jul 30 15:22:35 rds-oss60 kernel: LustreError: 120315:0:(obd_mount_server.c:134:server_deregister_mount()) Skipped 1 previous similar message
Jul 30 15:22:35 rds-oss60 kernel: Lustre: server umount rds-d6-OST0049 complete
Jul 30 15:22:35 rds-oss60 kernel: LustreError: 120315:0:(obd_mount.c:1604:lustre_fill_super()) Unable to mount  (-98)

OK, so it's complaining that the requested index is in use, so I reformat again, now with the '--replace' flag:

MGS logs - post format with --replace
Jul 30 15:27:26 rds-mds8 kernel: Lustre: Found index 73 for rds-d6-OST0049, updating log
Jul 30 15:27:26 rds-mds8 kernel: LustreError: 142-7: The target rds-d6-OST0049 has not registered yet. It must be started before failnids can be added.
Jul 30 15:27:26 rds-mds8 kernel: LustreError: Skipped 2 previous similar messages
Jul 30 15:27:26 rds-mds8 kernel: LustreError: 286637:0:(mgs_llog.c:4304:mgs_write_log_param()) err -2 on param 'failover.node=10.44.241.26@o2ib2:10.44.241.25@o2ib2'
Jul 30 15:27:26 rds-mds8 kernel: LustreError: 286637:0:(mgs_llog.c:4304:mgs_write_log_param()) Skipped 2 previous similar messages
Jul 30 15:27:26 rds-mds8 kernel: LustreError: 286637:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST0049 log (-2)
OSS logs - post format with --replace
Jul 30 15:27:20 rds-oss60 kernel: LDISKFS-fs (dm-11): file extents enabled, maximum tree depth=5
Jul 30 15:27:21 rds-oss60 kernel: LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro
Jul 30 15:27:21 rds-oss60 kernel: LDISKFS-fs (dm-11): file extents enabled, maximum tree depth=5
Jul 30 15:27:21 rds-oss60 kernel: LDISKFS-fs (dm-11): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Jul 30 15:27:32 rds-oss60 kernel: LustreError: 15f-b: rds-d6-OST0049: cannot register this server with the MGS: rc = -2. Is the MGS running?
Jul 30 15:27:33 rds-oss60 kernel: LustreError: 123310:0:(obd_mount_server.c:1992:server_fill_super()) Unable to start targets: -2
Jul 30 15:27:33 rds-oss60 kernel: LustreError: 123310:0:(obd_mount_server.c:1600:server_put_super()) no obd rds-d6-OST0049
Jul 30 15:27:33 rds-oss60 kernel: LustreError: 123310:0:(obd_mount_server.c:134:server_deregister_mount()) rds-d6-OST0049 not registered
Jul 30 15:27:33 rds-oss60 kernel: Lustre: server umount rds-d6-OST0049 complete
Jul 30 15:27:33 rds-oss60 kernel: LustreError: 123310:0:(obd_mount.c:1604:lustre_fill_super()) Unable to mount  (-2)

I don't see any log files on the MGT for the OSTs I'm trying to mount (for reference these are rds-d6-OST004[9abdef]):

[root@rds-mds8 ~]# lctl --device MGS llog_catlist | grep -E "rds-d6-OST004[9abdef]"
[root@rds-mds8 ~]#

I'm not really sure what else to try at this point. Is there any information in the above logs that would indicate the problem here?
Thanks,
Matt

Comment by Mikhail Pershin [ 30/Jul/21 ]

Matt, could you show the exact commands you were using to format the OST? mgs_write_log_param() failed on the failover.node parameter, so probably it is just wrong syntax. Also, to get more details you could collect Lustre debug logs; just do the following steps:

  1. On MGS node, set debug level by 'lctl set_param debug=0xffffffff' (maximum level)
  2. Do your command to add OST
  3. Again on MGS collect log by: 'lctl dk > your_logfilename'
  4. Analyse logfile or just attach it here.

Before that, note the old debug level on the MGS (get it via 'lctl get_param debug' and save it) and set it back afterwards; the maximum debug level will slow the server down, so it is better to keep the default setting and only raise it to the maximum while debugging.
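In other words, something like the following on the MGS (the saved mask and log filename here are just placeholders):

lctl get_param debug                    # note the current debug mask so it can be restored
lctl set_param debug=0xffffffff
# ... attempt the OST mount from the OSS ...
lctl dk > /tmp/mgs-ost-debug.log
lctl set_param debug="<previous mask>"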

Comment by Matt Rásó-Barnett (Inactive) [ 03/Aug/21 ]

Here are the format commands used:

/sbin/mkfs.lustre --ost --replace --fsname rds-d6 --index 73 --mgsnode='10.44.241.1@o2ib2' --mgsnode='10.44.241.2@o2ib2' --servicenode='10.44.241.26@o2ib2' --servicenode='10.44.241.25@o2ib2' --mkfsoptions '-i 2097152 -E stride=32,stripe-width=256' '/dev/mapper/v2-rds-ost-jb60'

The only difference between the tests above is whether I include the '--replace' flag or not.
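For reference, the parameters actually written to the newly formatted target can be checked non-destructively with, for example:

tunefs.lustre --dryrun /dev/mapper/v2-rds-ost-jb60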

I have taken logs from both the MGS and OSS during an attempted mount after running the above format command. I've set a debug marker just before this: 'DEBUG MARKER: START rds-OST0049 mount'. The OSS NID is 10.44.241.26@o2ib2 and the OST is rds-OST0049, index 73.

Unfortunately I think the MGS log is too big to upload to this ticket (even compressed it's 100M); do you have an SFTP upload facility where I can put the log files?

Here is (I think) the relevant section from the MGS log:

20000000:00000001:25.0:1627998228.698165:0:182803:0:(mgs_handler.c:364:mgs_target_reg()) Process entered
20000000:00000001:25.0:1627998228.698167:0:182803:0:(mgs_llog.c:573:mgs_find_or_make_fsdb()) Process entered
20000000:00000001:25.0:1627998228.698167:0:182803:0:(mgs_llog.c:552:mgs_find_or_make_fsdb_nolock()) Process entered
20000000:00000001:25.0:1627998228.698168:0:182803:0:(mgs_llog.c:566:mgs_find_or_make_fsdb_nolock()) Process leaving (rc=0 : 0 : 0)
20000000:00000001:25.0:1627998228.698169:0:182803:0:(mgs_llog.c:579:mgs_find_or_make_fsdb()) Process leaving (rc=0 : 0 : 0)
20000000:01000000:25.0:1627998228.698169:0:182803:0:(mgs_handler.c:414:mgs_target_reg()) fs: rds-d6 index: 73 is registered to MGS.
20000000:00000001:25.0:1627998228.698170:0:182803:0:(mgs_llog.c:573:mgs_find_or_make_fsdb()) Process entered
20000000:00000001:25.0:1627998228.698170:0:182803:0:(mgs_llog.c:552:mgs_find_or_make_fsdb_nolock()) Process entered
20000000:00000001:25.0:1627998228.698171:0:182803:0:(mgs_llog.c:566:mgs_find_or_make_fsdb_nolock()) Process leaving (rc=0 : 0 : 0)
20000000:00000001:25.0:1627998228.698171:0:182803:0:(mgs_llog.c:579:mgs_find_or_make_fsdb()) Process leaving (rc=0 : 0 : 0)
20000000:01000000:25.0:1627998228.698172:0:182803:0:(mgs_handler.c:519:mgs_target_reg()) updating rds-d6-OST0049, index=73
20000000:00000001:25.0:1627998228.698172:0:182803:0:(mgs_llog.c:4315:mgs_write_log_target()) Process entered
20000000:00000001:25.0:1627998228.698173:0:182803:0:(mgs_llog.c:666:mgs_set_index()) Process entered
20000000:00000001:25.0:1627998228.698173:0:182803:0:(mgs_llog.c:573:mgs_find_or_make_fsdb()) Process entered
20000000:00000001:25.0:1627998228.698173:0:182803:0:(mgs_llog.c:552:mgs_find_or_make_fsdb_nolock()) Process entered
20000000:00000001:25.0:1627998228.698174:0:182803:0:(mgs_llog.c:566:mgs_find_or_make_fsdb_nolock()) Process leaving (rc=0 : 0 : 0)
20000000:00000001:25.0:1627998228.698174:0:182803:0:(mgs_llog.c:579:mgs_find_or_make_fsdb()) Process leaving (rc=0 : 0 : 0)
20000000:01000000:25.0:1627998228.698175:0:182803:0:(mgs_llog.c:710:mgs_set_index()) Server rds-d6-OST0049 updating index 73
20000000:00000001:25.0:1627998228.698175:0:182803:0:(mgs_llog.c:711:mgs_set_index()) Process leaving via out_up (rc=114 : 114 : 0x72)
20000000:02000400:25.0:1627998228.698176:0:182803:0:(mgs_llog.c:4324:mgs_write_log_target()) Found index 73 for rds-d6-OST0049, updating log
20000000:01000000:25.0:1627998228.698178:0:182803:0:(mgs_llog.c:4356:mgs_write_log_target()) Update params for rds-d6-OST0049
20000000:00000010:25.0:1627998228.698179:0:182803:0:(mgs_llog.c:4362:mgs_write_log_target()) kmalloced '(buf)': 97 at ffff96c084bf1700.
20000000:01000000:25.0:1627998228.698180:0:182803:0:(mgs_llog.c:4376:mgs_write_log_target()) remaining string: ' failover.node=10.44.241.26@o2ib2:10.44.241.25@o2ib2 ', param: 'mgsnode=10.44.241.1@o2ib2:10.44.241.2@o2ib2'
20000000:00000001:25.0:1627998228.698181:0:182803:0:(mgs_llog.c:3886:mgs_write_log_param()) Process entered
20000000:01000000:25.0:1627998228.698181:0:182803:0:(mgs_llog.c:3890:mgs_write_log_param()) next param 'mgsnode=10.44.241.1@o2ib2:10.44.241.2@o2ib2'
20000000:00000001:25.0:1627998228.698182:0:182803:0:(mgs_llog.c:3897:mgs_write_log_param()) Process leaving via end (rc=0 : 0 : 0x0)
20000000:00000001:25.0:1627998228.698183:0:182803:0:(mgs_llog.c:4306:mgs_write_log_param()) Process leaving (rc=0 : 0 : 0)
20000000:01000000:25.0:1627998228.698183:0:182803:0:(mgs_llog.c:4376:mgs_write_log_target()) remaining string: ' ', param: 'failover.node=10.44.241.26@o2ib2:10.44.241.25@o2ib2'
20000000:00000001:25.0:1627998228.698184:0:182803:0:(mgs_llog.c:3886:mgs_write_log_param()) Process entered
20000000:01000000:25.0:1627998228.698184:0:182803:0:(mgs_llog.c:3890:mgs_write_log_param()) next param 'failover.node=10.44.241.26@o2ib2:10.44.241.25@o2ib2'
20000000:01000000:25.0:1627998228.698185:0:182803:0:(mgs_llog.c:3925:mgs_write_log_param()) Adding failnode
20000000:00000001:25.0:1627998228.698185:0:182803:0:(mgs_llog.c:3203:mgs_write_log_add_failnid()) Process entered
20000000:00000040:25.0:1627998228.698186:0:182803:0:(lustre_log.h:367:llog_ctxt_get()) GETting ctxt ffff96bf0f3a6600 : new refcount 2
00000040:00000001:25.0:1627998228.698187:0:182803:0:(llog.c:1280:llog_open()) Process entered
00000040:00000010:25.0:1627998228.698188:0:182803:0:(llog.c:61:llog_alloc_handle()) kmalloced '(loghandle)': 280 at ffff96c104bcce00.
00000040:00000001:25.0:1627998228.698190:0:182803:0:(llog_osd.c:1236:llog_osd_open()) Process entered
00000020:00000001:25.0:1627998228.698190:0:182803:0:(local_storage.c:146:ls_device_get()) Process entered
00000020:00000001:25.0:1627998228.698191:0:182803:0:(local_storage.c:151:ls_device_get()) Process leaving via out_ls (rc=18446628553290942976 : -115520418608640 : 0xffff96ef4e7c0a00)
00000020:00000001:25.0:1627998228.698192:0:182803:0:(local_storage.c:173:ls_device_get()) Process leaving (rc=18446628553290942976 : -115520418608640 : ffff96ef4e7c0a00)
00080000:00000001:25.0:1627998228.698194:0:182803:0:(osd_handler.c:7423:osd_index_ea_lookup()) Process entered
00080000:00000001:25.0:1627998228.698195:0:182803:0:(osd_handler.c:5972:osd_ea_lookup_rec()) Process entered
00080000:00000001:25.0:1627998228.698200:0:182803:0:(osd_handler.c:6047:osd_ea_lookup_rec()) Process leaving via out (rc=18446744073709551614 : -2 : 0xfffffffffffffffe)
00080000:00000001:25.0:1627998228.698202:0:182803:0:(osd_handler.c:7431:osd_index_ea_lookup()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00000040:00000001:25.0:1627998228.698203:0:182803:0:(llog_osd.c:1298:llog_osd_open()) Process leaving via out (rc=18446744073709551614 : -2 : 0xfffffffffffffffe)
00000040:00000001:25.0:1627998228.698203:0:182803:0:(llog_osd.c:1354:llog_osd_open()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00000040:00000010:25.0:1627998228.698204:0:182803:0:(llog.c:91:llog_free_handle()) kfreed 'loghandle': 280 at ffff96c104bcce00.
00000040:00000001:25.0:1627998228.698205:0:182803:0:(llog.c:1306:llog_open()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe)
00000040:00000001:25.0:1627998228.698205:0:182803:0:(llog.c:1337:llog_is_empty()) Process leaving via out (rc=0 : 0 : 0x0)
20000000:00000040:25.0:1627998228.698206:0:182803:0:(lustre_log.h:377:llog_ctxt_put()) PUTting ctxt ffff96bf0f3a6600 : new refcount 1
20000000:02020000:25.0:1627998228.698206:0:182803:0:(mgs_llog.c:3214:mgs_write_log_add_failnid()) 142-7: The target rds-d6-OST0049 has not registered yet. It must be started before failnids can be added.
00000800:00000200:36.2F:1627998228.706154:0:0:0:(o2iblnd_cb.c:3769:kiblnd_cq_completion()) conn[ffff96c19bbf3a00] (36)++
00000800:00000200:43.0:1627998228.706160:0:2905:0:(o2iblnd_cb.c:3891:kiblnd_scheduler()) conn[ffff96c19bbf3a00] (37)++
00000800:00000200:43.0:1627998228.706161:0:2905:0:(o2iblnd_cb.c:343:kiblnd_handle_rx()) Received d1[1] from 10.44.240.162@o2ib2
00000400:00000200:43.0:1627998228.706163:0:2905:0:(lib-move.c:4196:lnet_parse()) TRACE: 10.44.241.2@o2ib2(10.44.241.2@o2ib2) <- 10.43.10.24@tcp2 : PUT - for me
00000800:00000200:45.0:1627998228.706164:0:2903:0:(o2iblnd_cb.c:3907:kiblnd_scheduler()) conn[ffff96c19bbf3a00] (38)--
00000400:00000010:43.0:1627998228.706164:0:2905:0:(lib-lnet.h:490:lnet_msg_alloc()) alloc '(msg)': 440 at ffff96c10fb88e00 (tot 357391524).
00000400:00000200:43.0:1627998228.706165:0:2905:0:(lib-ptl.c:571:lnet_ptl_match_md()) Request from 12345-10.43.10.24@tcp2 of length 224 into portal 26 MB=0x60f8b5d916f80
00000400:00000200:43.0:1627998228.706167:0:2905:0:(lib-ptl.c:200:lnet_try_match_md()) Incoming put index 1a from 12345-10.43.10.24@tcp2 of length 224/224 into md 0x4a433cc9 [1] + 0
00000400:00000010:43.0:1627998228.706167:0:2905:0:(lib-lnet.h:295:lnet_me_free()) slab-freed 'me' at ffff96c0e179aa80.
00000400:00000200:43.0:1627998228.706168:0:2905:0:(lib-md.c:65:lnet_md_unlink()) Queueing unlink of md ffff96c0a439a990
00000400:00000200:43.0:1627998228.706169:0:2905:0:(lib-msg.c:918:lnet_is_health_check()) health check = 0, status = 0, hstatus = 0
00000100:00000001:43.0:1627998228.706169:0:2905:0:(events.c:300:request_in_callback()) Process entered
00000100:00000200:43.0:1627998228.706169:0:2905:0:(events.c:310:request_in_callback()) event type 2, status 0, service mgs
00000100:00000040:43.0:1627998228.706170:0:2905:0:(events.c:353:request_in_callback()) incoming req@ffff96bf0fa13850 x1705941104947072 msgsize 224
00000100:00100000:43.0:1627998228.706171:0:2905:0:(events.c:356:request_in_callback()) peer: 12345-10.43.10.24@tcp2 (source: 12345-10.43.10.24@tcp2)
00000100:00000040:43.0:1627998228.706172:0:2905:0:(events.c:365:request_in_callback()) Buffer complete: 62 buffers still posted
00000100:00000001:43.0:1627998228.706173:0:2905:0:(events.c:389:request_in_callback()) Process leaving
00000400:00000200:43.0:1627998228.706174:0:2905:0:(lib-md.c:69:lnet_md_unlink()) Unlinking md ffff96c0a439a990
00000400:00000010:43.0:1627998228.706174:0:2905:0:(lib-lnet.h:270:lnet_md_free()) slab-freed 'md' at ffff96c0a439a990.
00000400:00000010:43.0:1627998228.706175:0:2905:0:(lib-lnet.h:500:lnet_msg_free()) kfreed 'msg': 440 at ffff96c10fb88e00 (tot 357391084).

But I'll provide the full logs if I can upload them somewhere.

Comment by Matt Rásó-Barnett (Inactive) [ 19/Aug/21 ]

Hi Mikhail, just wondering if you have any time to take another look at this for me? I'm currently still unable to add these OSTs to the filesystem and am not sure what else to try.

I can provide the full logs from the previous comment if there is an FTP server I can upload them to.

Comment by DELBARY Gael [ 13/Sep/21 ]

Hi Matt,

I don't know if you have finally figured out your issue. On a custom Lustre version (2.12.6 + some patches) we hit a strange issue: bad Lustre configuration files, and we were unable to start the fs.

The root cause was a bad patch (outside LTS), but the behaviour was not clean (in ldiskfs the params files had a link count of 0, and Lustre complained about config corruption). Even after a clean writeconf there was no way to get a working fs. When I debugged a bit, I saw that some of the ~13000 clients generated a lot of noise. Normally in the writeconf procedure you have to unmount and remount all clients (+ targets), but I was surprised that the MDT/OST configs were in a "bad state".

Anyway, to fix it I had to stop all clients, and then everything became good: the writeconf procedure completed successfully and after that the clients mounted without any issue (and no further corruption in the config files). If you can, you could test this (just in case), but it is not a seamless solution...

Comment by Matt Rásó-Barnett (Inactive) [ 22/Sep/21 ]

Hi Gael, that's interesting - thanks for sharing.

Unfortunately I am still stuck on this issue somewhat, although the impact has lessened a little bit. The current status is that I have a single OST that fails to mount and join the filesystem, the one mentioned above that should be index 73.

The only new development since my last messages is that there were 5 other OSTs that weren't mounting previously, but have now mounted successfully and joined the filesystem - index 74 to 79.

In the last month we had a lot of unrelated LNET routing issues over in https://jira.whamcloud.com/browse/LU-14987, and as part of that we did in fact restart a large number of the clients and all of the servers in the filesystems.

It was after these restarts that I noticed that the above OSTs 74 to 79 were finally able to mount and join the filesystem.

Unfortunately that still leaves OST 73, which I see the exact same errors with as before.

I will generate new debug logs for this OST on both OSS and MDS and try to share as before.

Comment by Matt Rásó-Barnett (Inactive) [ 22/Sep/21 ]

Actually, I didn't notice this in my previous logs, but I now see a failing call to 'mgs_revoke_lock()' after attempting to start the missing OST 73:

Sep 22 15:46:53 rds-mds7 kernel: LustreError: 140-5: Server rds-d6-OST0049 requested index 73, but that index is already in use. Use --writeconf to force
Sep 22 15:46:53 rds-mds7 kernel: LustreError: 5636:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST0049 log (-98)
Sep 22 15:47:04 rds-mds7 kernel: Lustre: 5636:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1632322013/real 1632322013]  req@ffff962705810000 x1710915848718400/t0(0) o104->MGS@10.47.2.177@o2ib1:15/16 lens 296/224 e 0 to 1 dl 1632322024 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Sep 22 15:47:04 rds-mds7 kernel: Lustre: 5636:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Sep 22 15:47:07 rds-mds7 kernel: Lustre: rds-d6-MDT0002: haven't heard from client df0f1789-b047-52d8-233e-98e43aab8799 (at 10.47.2.177@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff95f07ed38c00, cur 1632322027 expire 1632321877 last 1632321800
Sep 22 15:47:07 rds-mds7 kernel: Lustre: Skipped 4 previous similar messages
Sep 22 15:47:07 rds-mds7 kernel: LustreError: 5636:0:(mgs_handler.c:282:mgs_revoke_lock()) MGS: can't take cfg lock for 0x36642d736472/0x0 : rc = -11

Not sure if this last message is of any significance here?

Comment by Etienne Aujames [ 02/Oct/21 ]
Server rds-d6-OST0049 requested index 73, but that index is already in use. Use --writeconf to force

This line is displayed when a newly formatted target is mounted for the first time but the requested target index is already allocated to another target in the MGS configuration.
In your case, perhaps the MGS registered OST0049 in its configuration but did not update OST0049's local configuration in return (because of a network issue, locking issue, etc.).

You can check the llog configuration on the MGS to verify whether your OST appears there, with (for example):

lctl --device MGS llog_print rds-d6-client
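For example, a rough check for a stale entry for index 73 / OST0049 would be:

lctl --device MGS llog_print rds-d6-client | grep OST0049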

Then you could try cleaning your MGS configuration with "lctl clear_conf":
doc: https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#lustremaint.clear_conf

Then you could try to replace the existing OST target in the configuration with your new one using "mkfs.lustre --replace":
doc: https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#lustremaint.restore_ost
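A rough sketch of that sequence for this filesystem (illustrative only - follow the manual sections above for the exact procedure and device names):

# with all clients and MDT/OST targets unmounted, and only the MGS mounted
# (use -o nosvc when mounting a combined MGS/MDT device)
lctl clear_conf rds-d6-OST0049     # clear just this target's configuration
# or: lctl clear_conf rds-d6       # clear the whole filesystem configuration (more disruptive)
# then reformat the new OST with mkfs.lustre --replace and mount it as before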

Comment by Gerrit Updater [ 19/Nov/21 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45626
Subject: LU-14802 nodemap: return proper error code
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 05a33d6c10921b630237c570a5b989463e90ca65

Comment by Gerrit Updater [ 31/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45626/
Subject: LU-14802 nodemap: return proper error code
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7d6cddff24de2f79e539eae462554dc21e674511

Comment by Peter Jones [ 31/Jan/22 ]

Landed for 2.15
