Lustre / LU-14802

MGS configuration problems - cannot add new OST, change parameters, hanging


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.15.0
    • Severity: 2

    Description

      Hello,
      Last Sunday evening I attempted to add 16 additional OSTs to this filesystem and immediately ran into difficulties: the majority of these OSTs failed to be added to the filesystem. Since then, the MGS has been logging a number of errors, is failing to take any new configuration (lctl set_param/conf_param), and all clients and server OST/MDT targets take a very long time to connect to the MGS, or sometimes fail to connect at all, leaving them unable to mount.
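
      For reference, each new OST was formatted and mounted in the usual way, roughly along these lines (the device path and mount point are illustrative, and the conf_param shown is just an example of the kind of change the MGS is now refusing):

      mkfs.lustre --ost --fsname=rds-d6 --index=64 --mgsnode=10.44.241.1@o2ib2 /dev/mapper/ost0040
      mount -t lustre /dev/mapper/ost0040 /mnt/lustre/rds-d6-OST0040

      # run on the MGS node; example parameter only
      lctl conf_param rds-d6.sys.timeout=100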

      We've had a number of lengthy periods of filesystem unavailability because of this since the weekend, mostly from when I've tried to unmount/remount the MGS, which is co-located on an MDS server with some of the MDTs in the filesystem; the act of unmounting caused the MDS load to spike and access to those MDTs to hang from all clients.

      The filesystem is currently up for existing clients, but rebooted clients are failing to mount the filesystem again, hence the severity here.

      Here are the MDS and OSS server kernel logs at the time the first OSTs were added on Sunday:

      MDS Server logs Sun 27th June
      Jun 27 22:28:18 rds-mds7 kernel: Lustre: ctl-rds-d6-MDT0000: super-sequence allocation rc = 0 [0x0000001300000400-0x0000001340000400]:40:ost
      Jun 27 22:28:24 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.44.163.219@o2ib2: 0 seconds
      Jun 27 22:28:24 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 177 previous similar messages
      Jun 27 22:30:05 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.44.163.219@o2ib2: 0 seconds
      Jun 27 22:30:05 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 2 previous similar messages
      Jun 27 22:30:37 rds-mds7 kernel: Lustre: rds-d6-MDT0002: Connection restored to 0bb09763-547b-56c8-0a7a-c2613e26070a (at 10.47.2.157@o2ib1)
      Jun 27 22:30:37 rds-mds7 kernel: Lustre: Skipped 216 previous similar messages
      Jun 27 22:31:55 rds-mds7 kernel: Lustre: rds-d6-MDT0000: Client 472702b7-5951-4007-482a-a3ef0f9d65fb (at 10.44.161.6@o2ib2) reconnecting
      Jun 27 22:31:55 rds-mds7 kernel: Lustre: Skipped 186 previous similar messages
      Jun 27 22:32:35 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.44.163.219@o2ib2: 0 seconds
      Jun 27 22:32:35 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 4 previous similar messages
      Jun 27 22:32:52 rds-mds7 kernel: LustreError: 166-1: MGC10.44.241.1@o2ib2: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      Jun 27 22:32:52 rds-mds7 kernel: LustreError: 8092:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829272, 300s ago), entering recovery for MGS@10.44.241.1@o2ib2 ns: MGC10.44.241.1@o2ib2 lock: ffff975ed48c3840/0x184fda0e60b72a0e lrc: 4/1,0 m
      Jun 27 22:33:09 rds-mds7 kernel: LustreError: 8162:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829289, 300s ago); not entering recovery in server code, just going back to sleep ns: MGS lock: ffff975c83dbd100/0x184fda0e60b67c50 lrc: 3/0,1
      Jun 27 22:33:09 rds-mds7 kernel: LustreError: 8162:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) Skipped 43 previous similar messages
      Jun 27 22:33:37 rds-mds7 kernel: LNet: Service thread pid 12985 was inactive for 200.33s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jun 27 22:33:37 rds-mds7 kernel: LNet: Skipped 3 previous similar messages
      Jun 27 22:33:37 rds-mds7 kernel: Pid: 12985, comm: ll_mgs_0005 3.10.0-1127.8.2.el7_lustre.x86_64 #1 SMP Sun Aug 23 13:52:28 UTC 2020
      Jun 27 22:33:37 rds-mds7 kernel: Call Trace:
      Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc1313c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc16a076c>] mgs_completion_ast_generic+0x5c/0x200 [mgs]
      Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc16a0983>] mgs_completion_ast_config+0x13/0x20 [mgs]
      Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc13147b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc16a53b4>] mgs_revoke_lock+0x104/0x380 [mgs]
      Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc16a5ad2>] mgs_target_reg+0x4a2/0x1320 [mgs]
      Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc13b29da>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc135748b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffc135adf4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      Jun 27 22:33:37 rds-mds7 kernel: [<ffffffff964c6691>] kthread+0xd1/0xe0
      Jun 27 22:33:37 rds-mds7 kernel: [<ffffffff96b92d1d>] ret_from_fork_nospec_begin+0x7/0x21
      Jun 27 22:33:37 rds-mds7 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
      Jun 27 22:33:37 rds-mds7 kernel: LustreError: dumping log to /tmp/lustre-log.1624829616.12985
      Jun 27 22:34:53 rds-mds7 kernel: Lustre: MGS: Received new LWP connection from 10.47.1.153@o2ib1, removing former export from same NID
      Jun 27 22:34:53 rds-mds7 kernel: Lustre: Skipped 1998 previous similar messages
      Jun 27 22:35:24 rds-mds7 kernel: LustreError: 12985:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829424, 300s ago); not entering recovery in server code, just going back to sleep ns: MGS lock: ffff97351fb81200/0x184fda0e60c14203 lrc: 3/0,
      Jun 27 22:35:25 rds-mds7 kernel: LustreError: 140-5: Server rds-d6-OST0041 requested index 65, but that index is already in use. Use --writeconf to force
      Jun 27 22:35:25 rds-mds7 kernel: LustreError: 13001:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST0041 log (-98)
      Jun 27 22:35:26 rds-mds7 kernel: LustreError: 140-5: Server rds-d6-OST0041 requested index 65, but that index is already in use. Use --writeconf to force
      Jun 27 22:35:26 rds-mds7 kernel: LustreError: 13001:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST0041 log (-98)
      Jun 27 22:37:52 rds-mds7 kernel: LustreError: 166-1: MGC10.44.241.1@o2ib2: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      Jun 27 22:37:52 rds-mds7 kernel: LustreError: 8092:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829572, 300s ago), entering recovery for MGS@10.44.241.1@o2ib2 ns: MGC10.44.241.1@o2ib2 lock: ffff9745624b8240/0x184fda0e60cb24e5 lrc: 4/1,0 m
      Jun 27 22:37:52 rds-mds7 kernel: LustreError: 205913:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.44.241.1@o2ib2: namespace resource [0x36642d736472:0x0:0x0].0x0 (ffff973ee9f4cc00) refcount nonzero (1) after lock cleanup; forcing cleanup.
      Jun 27 22:37:52 rds-mds7 kernel: LustreError: 8092:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
      Jun 27 22:39:41 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 10.44.163.219@o2ib2: 0 seconds
      Jun 27 22:39:41 rds-mds7 kernel: LNet: 2968:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 3 previous similar messages
      Jun 27 22:40:11 rds-mds7 kernel: Lustre: 13006:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
       req@ffff974d89171850 x1703756897980928/t0(0) o253->dec8c782-f0e7-81a2-d397-7bf60e27e1ff@10.44.241.25@o2ib2:556/0 lens 4768/4768 e 20 to 0 dl 1624830016 ref 2 fl Interpret:/0/0 rc 0/0
      Jun 27 22:40:11 rds-mds7 kernel: Lustre: 13006:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 131 previous similar messages
      Jun 27 22:40:49 rds-mds7 kernel: Lustre: rds-d6-MDT0003: Connection restored to 2f43bc44-4d4e-e6e8-8ca0-dd0bfa20fa04 (at 10.47.2.114@o2ib1)
      Jun 27 22:40:49 rds-mds7 kernel: Lustre: Skipped 4151 previous similar messages
      Jun 27 22:41:59 rds-mds7 kernel: Lustre: rds-d6-MDT0002: Client 03c49b2b-134c-390d-9059-dc732d34be0b (at 10.47.1.253@o2ib1) reconnecting
      Jun 27 22:41:59 rds-mds7 kernel: Lustre: Skipped 179 previous similar messages
      Jun 27 22:42:52 rds-mds7 kernel: LustreError: 166-1: MGC10.44.241.1@o2ib2: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      Jun 27 22:42:52 rds-mds7 kernel: LustreError: 8092:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829872, 300s ago), entering recovery for MGS@10.44.241.1@o2ib2 ns: MGC10.44.241.1@o2ib2 lock: ffff9738f01418c0/0x184fda0e60faac19 lrc: 4/1,0 m
      Jun 27 22:42:52 rds-mds7 kernel: LustreError: 212692:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.44.241.1@o2ib2: namespace resource [0x36642d736472:0x0:0x0].0x0 (ffff9724c6f7a6c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
      Jun 27 22:42:52 rds-mds7 kernel: LustreError: 8092:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
      
      OSS Server logs
      Jun 27 22:26:34 rds-oss59 kernel: LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5
      Jun 27 22:26:34 rds-oss59 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: errors=remount-ro
      Jun 27 22:26:35 rds-oss59 kernel: Lustre: Lustre: Build Version: 2.12.5
      Jun 27 22:26:35 rds-oss59 kernel: LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5
      Jun 27 22:26:35 rds-oss59 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
      Jun 27 22:27:24 rds-oss59 kernel: LustreError: 137-5: rds-d6-OST0040_UUID: not available for connect from 10.47.0.2@o2ib1 (no target). If you are running an HA pair check that the target is mounted on the other server.
      Jun 27 22:27:24 rds-oss59 kernel: LustreError: Skipped 135 previous similar messages
      Jun 27 22:27:24 rds-oss59 kernel: LustreError: 137-5: rds-d6-OST0040_UUID: not available for connect from 10.44.163.107@o2ib2 (no target). If you are running an HA pair check that the target is mounted on the other server.
      Jun 27 22:27:24 rds-oss59 kernel: LustreError: Skipped 1088 previous similar messages
      Jun 27 22:27:25 rds-oss59 kernel: LustreError: 137-5: rds-d6-OST0040_UUID: not available for connect from 10.44.161.43@o2ib2 (no target). If you are running an HA pair check that the target is mounted on the other server.
      Jun 27 22:27:25 rds-oss59 kernel: LustreError: Skipped 588 previous similar messages
      Jun 27 22:27:27 rds-oss59 kernel: LustreError: 137-5: rds-d6-OST0040_UUID: not available for connect from 10.44.18.2@o2ib2 (no target). If you are running an HA pair check that the target is mounted on the other server.
      Jun 27 22:27:27 rds-oss59 kernel: LustreError: Skipped 55 previous similar messages
      Jun 27 22:27:42 rds-oss59 kernel: Lustre: rds-d6-OST0040: new disk, initializing
      Jun 27 22:27:42 rds-oss59 kernel: Lustre: rds-d6-OST0040: Not available for connect from 10.44.161.6@o2ib2 (not set up)
      Jun 27 22:27:43 rds-oss59 kernel: Lustre: srv-rds-d6-OST0040: No data found on store. Initialize space
      Jun 27 22:27:43 rds-oss59 kernel: Lustre: rds-d6-OST0040: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      Jun 27 22:28:13 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.47.2.15@o2ib1)
      Jun 27 22:28:13 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.47.1.172@o2ib1)
      Jun 27 22:28:14 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.47.21.88@o2ib1)
      Jun 27 22:28:14 rds-oss59 kernel: Lustre: Skipped 8 previous similar messages
      Jun 27 22:28:15 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.44.74.52@o2ib2)
      Jun 27 22:28:15 rds-oss59 kernel: Lustre: Skipped 1456 previous similar messages
      Jun 27 22:28:17 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.43.12.31@tcp2)
      Jun 27 22:28:17 rds-oss59 kernel: Lustre: Skipped 447 previous similar messages
      Jun 27 22:28:18 rds-oss59 kernel: Lustre: cli-rds-d6-OST0040-super: Allocated super-sequence [0x0000001300000400-0x0000001340000400]:40:ost]
      Jun 27 22:28:32 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.44.161.6@o2ib2)
      Jun 27 22:28:32 rds-oss59 kernel: Lustre: Skipped 45 previous similar messages
      Jun 27 22:29:34 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.43.240.201@tcp2)
      Jun 27 22:29:34 rds-oss59 kernel: Lustre: Skipped 1 previous similar message
      Jun 27 22:29:42 rds-oss59 kernel: Lustre: rds-d6-OST0040: Client 472702b7-5951-4007-482a-a3ef0f9d65fb (at 10.44.161.6@o2ib2) reconnecting
      Jun 27 22:30:14 rds-oss59 kernel: LDISKFS-fs (dm-7): file extents enabled, maximum tree depth=5
      Jun 27 22:30:14 rds-oss59 kernel: LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. Opts: errors=remount-ro
      Jun 27 22:30:14 rds-oss59 kernel: LDISKFS-fs (dm-7): file extents enabled, maximum tree depth=5
      Jun 27 22:30:15 rds-oss59 kernel: LDISKFS-fs (dm-7): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
      Jun 27 22:30:35 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.44.6.58@o2ib2)
      Jun 27 22:30:35 rds-oss59 kernel: Lustre: Skipped 2 previous similar messages
      Jun 27 22:32:15 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.47.7.13@o2ib1)
      Jun 27 22:33:35 rds-oss59 kernel: Lustre: rds-d6-OST0040: Connection restored to  (at 10.47.7.15@o2ib1)
      Jun 27 22:33:35 rds-oss59 kernel: Lustre: Skipped 1 previous similar message
      Jun 27 22:35:23 rds-oss59 kernel: LustreError: 166-1: MGC10.44.241.1@o2ib2: Connection to MGS (at 10.44.241.1@o2ib2) was lost; in progress operations using this service will fail
      Jun 27 22:35:23 rds-oss59 kernel: LustreError: 258659:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829423, 300s ago), entering recovery for MGS@10.44.241.1@o2ib2 ns: MGC10.44.241.1@o2ib2 lock: ffff9bc4d83c8b40/0xc34bf6a4691f74a9 lrc: 4/1,
      Jun 27 22:35:23 rds-oss59 kernel: LustreError: 263517:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.44.241.1@o2ib2: namespace resource [0x36642d736472:0x0:0x0].0x0 (ffff9bc4d59bf5c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
      Jun 27 22:35:23 rds-oss59 kernel: LustreError: 258659:0:(mgc_request.c:599:do_requeue()) failed processing log: -5
      Jun 27 22:35:25 rds-oss59 kernel: LustreError: 15f-b: rds-d6-OST0041: cannot register this server with the MGS: rc = -98. Is the MGS running?
      Jun 27 22:35:25 rds-oss59 kernel: LustreError: 262284:0:(obd_mount_server.c:1992:server_fill_super()) Unable to start targets: -98
      Jun 27 22:35:25 rds-oss59 kernel: LustreError: 262284:0:(obd_mount_server.c:1600:server_put_super()) no obd rds-d6-OST0041
      Jun 27 22:35:25 rds-oss59 kernel: LustreError: 262284:0:(obd_mount_server.c:134:server_deregister_mount()) rds-d6-OST0041 not registered
      Jun 27 22:35:25 rds-oss59 kernel: Lustre: server umount rds-d6-OST0041 complete
      Jun 27 22:35:25 rds-oss59 kernel: LustreError: 262284:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount  (-98)
      

      So the logs above show OST0040 being added successfully, but OST0041 failing to be added, and all subsequent OSTs (they were started together) failing in the same way.

      The Lustre internal log dump file mentioned above, lustre-log.1624829616.12985, is attached.
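
      (For anyone looking at it, my understanding is that the binary dump can be converted to readable text with lctl debug_file, e.g.:)

      lctl debug_file /tmp/lustre-log.1624829616.12985 /tmp/lustre-log.1624829616.12985.txt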

      Since Sunday, I have managed to mount some of the new OSTs successfully, through a mixture of restarting the MGS and then reformatting the new OSTs with the '--replace' flag (sketched after the log excerpt below), to get around the error mentioned in the log:

      Jun 27 22:35:24 rds-mds7 kernel: LustreError: 12985:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1624829424, 300s ago); not entering recovery in server code, just going back to sleep ns: MGS lock: ffff97351fb81200/0x184fda0e60c14203 lrc: 3/0,
      Jun 27 22:35:25 rds-mds7 kernel: LustreError: 140-5: Server rds-d6-OST0041 requested index 65, but that index is already in use. Use --writeconf to force
      Jun 27 22:35:25 rds-mds7 kernel: LustreError: 13001:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST0041 log (-98)
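
      The reformat step for each affected OST was roughly the following (device path and mount point are illustrative; the index matches the one the target had originally been assigned):

      mkfs.lustre --ost --fsname=rds-d6 --index=65 --replace --mgsnode=10.44.241.1@o2ib2 /dev/mapper/ost0041
      mount -t lustre /dev/mapper/ost0041 /mnt/lustre/rds-d6-OST0041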
      

      This worked for a couple of OSTs at a time before the MGS would hang again, refusing any further configuration changes. So far I have managed to mount about 8 of the 16 OSTs that I was intending to add on Sunday.
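
      If it would help, I can also dump what the MGS currently has recorded for the filesystem; I assume something like the following, run on the MGS node, is the right way to inspect the configuration llogs:

      lctl get_param mgs.MGS.live.rds-d6
      lctl --device MGS llog_print rds-d6-client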

      I've been looking around on Jira since Sunday for similar tickets and I came across:
      https://jira.whamcloud.com/browse/LU-12735
      https://jira.whamcloud.com/browse/LU-14695
      which both look very much like what I'm experiencing here.

      Following the discussion in LU-12735, I tried the patch mentioned in https://jira.whamcloud.com/browse/LU-13356 (https://review.whamcloud.com/41309), applied on top of 2.12.5 (which is the current server release we are running), and installed this patched version only on the server running the MGS.
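
      For completeness, the patch was applied and built roughly as follows (the Gerrit patchset number and build/configure options here are illustrative rather than the exact ones I used):

      git clone git://git.whamcloud.com/fs/lustre-release.git && cd lustre-release
      git checkout v2_12_5
      # fetch and apply the change from Gerrit (patchset number illustrative)
      git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/09/41309/2
      git cherry-pick FETCH_HEAD
      sh autogen.sh && ./configure && make rpms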

      Unfortunately this doesn't appear to have fixed the issue for us. This is me subsequently trying to add an OST after restarting the MGS onto this new version:

      Jun 30 22:29:10 rds-mds8 kernel: LustreError: 142-7: The target rds-d6-OST004b has not registered yet. It must be started before failnids can be added.
      Jun 30 22:29:10 rds-mds8 kernel: LustreError: 166987:0:(mgs_llog.c:4301:mgs_write_log_param()) err -2 on param 'failover.node=10.44.241.26@o2ib2:10.44.241.25@o2ib2'
      Jun 30 22:29:10 rds-mds8 kernel: LustreError: 166987:0:(mgs_handler.c:526:mgs_target_reg()) Failed to write rds-d6-OST004b log (-2)
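
      For reference, the failover.node value in that error corresponds to the failover NIDs the new OSTs were formatted with, along these lines (device path illustrative):

      mkfs.lustre --ost --fsname=rds-d6 --index=75 --mgsnode=10.44.241.1@o2ib2 --servicenode=10.44.241.25@o2ib2 --servicenode=10.44.241.26@o2ib2 /dev/mapper/ost004b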
      

      I am also getting errors from rebooted clients that are failing to mount the filesystem:

      Jul 01 10:28:14 lustre-transfer-02 kernel: LustreError: 22964:0:(mgc_request.c:249:do_config_log_add()) MGC10.44.241.1@o2ib2: failed processing log, type 1: rc = -5
      Jul 01 10:28:14 lustre-transfer-02 kernel: LustreError: 22964:0:(mgc_request.c:249:do_config_log_add()) Skipped 2 previous similar messages
      Jul 01 10:28:33 lustre-transfer-02 kernel: Lustre: rds-d6: root_squash is set to 99:99
      Jul 01 10:28:33 lustre-transfer-02 kernel: LustreError: 22964:0:(llite_lib.c:516:client_common_fill_super()) cannot mds_connect: rc = -2
      Jul 01 10:28:33 lustre-transfer-02 kernel: Lustre: rds-d6: nosquash_nids set to 10.44.241.[1-2]@o2ib2 10.44.161.[81-84]@o2ib2 10.44.71.[2-3]@o2ib2 10.43.240.[201-202]@tcp2 10.43.240.[198-199]@tcp2 10.144.9.[50-51]@o2ib 10.47.4.[210-213]@o2ib1
      Jul 01 10:28:33 lustre-transfer-02 kernel: Lustre: Unmounted rds-d6-client
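
      (The clients are using the standard mount command, something like the following; the mount point is illustrative:)

      mount -t lustre 10.44.241.1@o2ib2:/rds-d6 /mnt/rds-d6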
      

      I'm not sure what to try next. Can I get some support in tracking down what is wrong with the MGS?

      Thanks,
      Matt 

       

       

            People

              Assignee: Mikhail Pershin (tappro)
              Reporter: Matt Rásó-Barnett (mrb) (Inactive)