Lustre / LU-11761

blocked MDT mount and high cpu usage from lodXXXX_recYYYY threads

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.3
    • Affects Version/s: Lustre 2.12.0
    • Component/s: None
    • Environment: CentOS 7.6 3.10.0-957.1.3.el7_lustre.x86_64 Lustre 2.12.0 RC2
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      Another issue hit while testing 2.12.0 RC2... the MDT mounts never seem to complete, and the following threads take 100% CPU:

      PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
       20953 root 20 0 0 0 0 R 100.0 0.0 27:00.33 lod0002_rec0001
       20954 root 20 0 0 0 0 R 100.0 0.0 27:00.34 lod0002_rec0003
      

      This is on fir-md1-s1 that handles MDT0 and MDT2 on this test system.

      sysrq-t shows:

      Dec 11 09:50:13 fir-md1-s1 kernel: lod0002_rec0001 R  running task        0 20953      2 0x00000080
      Dec 11 09:50:13 fir-md1-s1 kernel: Call Trace:
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3cc5d>] ? keys_fini+0x2d/0x1d0 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3ce2b>] lu_context_fini+0x2b/0xa0 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3d0da>] lu_env_init+0x1a/0x30 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0f19b68>] ptlrpc_set_wait+0x7d8/0x8d0 [ptlrpc]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d515e5>] ? lustre_get_jobid+0x185/0x2e0 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d09f3c>] ? obd_get_request_slot+0x3c/0x280 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0f19ce3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc088e334>] fld_client_rpc+0x104/0x540 [fld]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0892f5f>] fld_server_lookup+0x15f/0x320 [fld]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc1684587>] lod_fld_lookup+0x327/0x510 [lod]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc16997dd>] lod_object_init+0x7d/0x3c0 [lod]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3dfd5>] lu_object_alloc+0xe5/0x320 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3e2e6>] lu_object_find_at+0x76/0x280 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3f78d>] dt_locate_at+0x1d/0xb0 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d02b4c>] llog_osd_open+0xfc/0xf30 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0d3e789>] ? lu_object_put+0x279/0x3d0 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc0ceff20>] llog_open+0x140/0x3d0 [obdclass]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc16bdeed>] lod_sub_prep_llog+0x14d/0x783 [lod]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc16837ab>] lod_sub_recovery_thread+0x1cb/0xc80 [lod]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffc16835e0>] ? lod_obd_get_info+0x9d0/0x9d0 [lod]
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffa7ac1c31>] kthread+0xd1/0xe0
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffa8174c24>] ret_from_fork_nospec_begin+0xe/0x21
      Dec 11 09:50:13 fir-md1-s1 kernel:  [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40
      

      and

      Dec 11 09:44:24 fir-md1-s1 kernel: lod0002_rec0003 R  running task        0 20954      2 0x00000080
      Dec 11 09:44:24 fir-md1-s1 kernel: Call Trace:
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d3cfa3>] ? lu_context_init+0xd3/0x1f0 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d3ceba>] ? lu_env_fini+0x1a/0x30 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0f19b68>] ? ptlrpc_set_wait+0x7d8/0x8d0 [ptlrpc]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d515e5>] ? lustre_get_jobid+0x185/0x2e0 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d09f3c>] ? obd_get_request_slot+0x3c/0x280 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0f19ce3>] ? ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc088e421>] ? fld_client_rpc+0x1f1/0x540 [fld]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0892f5f>] ? fld_server_lookup+0x15f/0x320 [fld]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc1684587>] ? lod_fld_lookup+0x327/0x510 [lod]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc16997dd>] ? lod_object_init+0x7d/0x3c0 [lod]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d3dfd5>] ? lu_object_alloc+0xe5/0x320 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d3e2e6>] ? lu_object_find_at+0x76/0x280 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d3f78d>] ? dt_locate_at+0x1d/0xb0 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d02b4c>] ? llog_osd_open+0xfc/0xf30 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0d3e789>] ? lu_object_put+0x279/0x3d0 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc0ceff20>] ? llog_open+0x140/0x3d0 [obdclass]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc16bdeed>] ? lod_sub_prep_llog+0x14d/0x783 [lod]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc16837ab>] ? lod_sub_recovery_thread+0x1cb/0xc80 [lod]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffc16835e0>] ? lod_obd_get_info+0x9d0/0x9d0 [lod]
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffa7ac1c31>] ? kthread+0xd1/0xe0
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffa8174c24>] ? ret_from_fork_nospec_begin+0xe/0x21
      Dec 11 09:44:24 fir-md1-s1 kernel:  [<ffffffffa7ac1b60>] ? insert_kthread_work+0x40/0x40
      

      Mount commands are stuck, even when using -o abort_recov

      I took a crash dump just in case you're interested.

      I believe it's a regression from earlier 2.11.x versions...

      HTH,
      Stephane

      Attachments

        Issue Links

          Activity

            [LU-11761] blocked MDT mount and high cpu usage from lodXXXX_recYYYY threads

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35661/
            Subject: LU-11761 fld: let's caller to retry FLD_QUERY
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 80eb31f5367f5b879b544a61a938a8acd7de9cf5

            gerrit Gerrit Updater added a comment

            Great! With this patch on top of current b2_12 (so quite a lot of changes for us, not only this patch), the recovery worked on all 4 MDTs:

            fir-md1-s2: Sep 05 05:59:04 fir-md1-s2 kernel: Lustre: fir-MDT0001: Recovery over after 5:26, of 1378 clients 1378 recovered and 0 were evicted.
            fir-md1-s2: Sep 05 05:59:11 fir-md1-s2 kernel: Lustre: fir-MDT0003: Recovery over after 0:32, of 1378 clients 1378 recovered and 0 were evicted.
            fir-md1-s1: Sep 05 05:59:07 fir-md1-s1 kernel: Lustre: fir-MDT0002: Recovery over after 5:26, of 1379 clients 1379 recovered and 0 were evicted.
            fir-md1-s1: Sep 05 05:59:25 fir-md1-s1 kernel: Lustre: fir-MDT0000: Recovery over after 0:43, of 1379 clients 1379 recovered and 0 were evicted.
            

            One minor thing: two MDTs took longer to mount, but after a ~300s timeout they came up:

            [ 4563.995907] LustreError: 39383:0:(ldlm_request.c:147:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1567688019, 300s ago), entering recovery for MGS@10.0.10.51@o2ib7 ns: MGC10.0.10.51@o2ib7 lock: ffff8fb0c2240480/0x5731634ee5f506c6 lrc: 4/1,0 mode: --/CR res: [0x726966:0x2:0x0].0x0 rrc: 3 type: PLN flags: 0x1000000000000 nid: local remote: 0x98816ce1399443f3 expref: -99 pid: 39383 timeout: 0 lvb_type: 0
            [ 4564.033287] LustreError: 40296:0:(ldlm_resource.c:1147:ldlm_resource_complain()) MGC10.0.10.51@o2ib7: namespace resource [0x726966:0x2:0x0].0x0 (ffff8fb0c1334600) refcount nonzero (2) after lock cleanup; forcing cleanup.
            [ 4564.052847] Lustre: MGC10.0.10.51@o2ib7: Connection restored to 10.0.10.51@o2ib7 (at 10.0.10.51@o2ib7)
            [ 4564.062155] Lustre: Skipped 1 previous similar message
            [ 4564.070150] Lustre: fir-MDT0003: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
            [ 4564.323008] Lustre: fir-MDD0003: changelog on
            [ 4564.332139] Lustre: fir-MDT0003: in recovery but waiting for the first client to connect
            [ 4564.429981] Lustre: fir-MDT0003: Will be in recovery for at least 2:30, or until 1378 clients reconnect
            

            In case you want to investigate this timeout issue, the kernel logs of the successful start of 2.12.2_119 on our two MDSes are attached as:
            fir-md1-s1_2.12.2_119_SRCC.log and fir-md1-s2_2.12.2_119_SRCC.log

            But no 100% loaded lod* threads anymore. Not even a single eviction. This is awesome. Thanks all!

            sthiell Stephane Thiell added a comment
            pjones Peter Jones added a comment -

            Yes, it is quite likely this will end up in 2.12.3.

            sthiell Stephane Thiell added a comment (edited)

            Is there any chance to land this one in 2.12.3?


            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35661
            Subject: LU-11761 fld: let's caller to retry FLD_QUERY
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: fe3d5156ac95e58c5fc62ab503dcff3619ccfbe0

            gerrit Gerrit Updater added a comment
            pjones Peter Jones added a comment -

            Landed for 2.13


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34962/
            Subject: LU-11761 fld: let's caller to retry FLD_QUERY
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e3f6111dfd1c6f2266d0beef67e5a7514a6965d0

            gerrit Gerrit Updater added a comment

            Thanks Hongchao! FYI I'll use your current patch in case of emergency (e.g. an MDT restart) on our system, and rebuild when it's fully ready. We need to be able to abort the MDT recovery or, better, the MDT recovery needs to be able to time out, which is not always the case today as described in LU-12360.

            sthiell Stephane Thiell added a comment

            Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34962
            Subject: LU-11761 fld: limit FLD_QUERY retry between MDTs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f2c831def8ac010b7a913633141deb2b5384e06c

            gerrit Gerrit Updater added a comment

            In fld_client_rpc, the FLD_QUERY request is retried in a loop if the connection to MDT0000 has not yet been established.

            int fld_client_rpc(struct obd_export *exp,
                               struct lu_seq_range *range, u32 fld_op,
                               struct ptlrpc_request **reqp)
            {
                        ...
                        if (rc != 0) {
                            if (imp->imp_state != LUSTRE_IMP_CLOSED &&
                                !imp->imp_deactive &&
                                imp->imp_connect_flags_orig & OBD_CONNECT_MDS_MDS &&
                                OCD_HAS_FLAG(&imp->imp_connect_data, LIGHTWEIGHT) &&
                                rc != -ENOTSUPP) {
                                    /*
                                     * Since LWP is not replayable, so it will keep
                                     * trying unless umount happens or the remote
                                     * target does not support the operation, otherwise
                                     * it would cause unecessary failure of the
                                     * application.
                                     */
                                    ptlrpc_req_finished(req);
                                    rc = 0;
                                    goto again;
                            }
                            GOTO(out_req, rc);
                            ...
            

            In this case, the connection to MDT0000 is actually lost, so the lodXXXX_recYYYY kernel threads spin inside fld_client_rpc:

            Apr 28 10:58:55 fir-md1-s1 kernel: LustreError: 137-5: fir-MDT0000_UUID: not available for connect from 10.9.105.54@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other 
            Apr 28 10:58:55 fir-md1-s1 kernel: LustreError: Skipped 16004 previous similar messages
            
            hongchao.zhang Hongchao Zhang added a comment

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: