Lustre / LU-17809

MDT umount exceeding 5 minute HA timeout

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version: Lustre 2.16.0
    • Fix Version: Lustre 2.16.0

    Description

      crash> bt
      PID: 95475    TASK: ffff9d51fe3fc740  CPU: 9    COMMAND: "umount"
       #0 [ffffb6218eb478b0] __schedule at ffffffff8b74e1d4
       #1 [ffffb6218eb47910] schedule at ffffffff8b74e648
       #2 [ffffb6218eb47920] schedule_timeout at ffffffff8b751cd3
       #3 [ffffb6218eb479b8] ptlrpc_set_wait at ffffffffc1679185 [ptlrpc]
       #4 [ffffb6218eb47a30] ptlrpc_queue_wait at ffffffffc1679371 [ptlrpc]
       #5 [ffffb6218eb47a48] ptlrpc_disconnect_import at ffffffffc16a5165 [ptlrpc]
       #6 [ffffb6218eb47ac8] osp_disconnect at ffffffffc1d348d2 [osp]
       #7 [ffffb6218eb47ae8] osp_process_config at ffffffffc1d35a7f [osp]
       #8 [ffffb6218eb47b18] lod_sub_process_config at ffffffffc1abf901 [lod]
       #9 [ffffb6218eb47b58] lod_process_config at ffffffffc1ac7b2e [lod]
      #10 [ffffb6218eb47ba8] mdd_process_config at ffffffffc1b5ec8f [mdd]
      #11 [ffffb6218eb47bd8] mdt_stack_pre_fini at ffffffffc1bdfc19 [mdt]
      #12 [ffffb6218eb47c10] mdt_device_fini at ffffffffc1be5e97 [mdt]
      #13 [ffffb6218eb47c60] class_cleanup at ffffffffc12f3ed1 [obdclass]
      #14 [ffffb6218eb47ce0] class_process_config at ffffffffc12f4e35 [obdclass]
      #15 [ffffb6218eb47d50] class_manual_cleanup at ffffffffc12f6f15 [obdclass]
      #16 [ffffb6218eb47df0] server_put_super at ffffffffc1331143 [obdclass]
      #17 [ffffb6218eb47e98] generic_shutdown_super at ffffffff8b11bdcc
      

      The log includes timeouts for the disconnect requests, roughly 71 seconds each.

      00000100:00000400:0.0:1712841383.691084:0:95475:0:(client.c:2310:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1712841312/real 1712841312]  req@000000000cc2b990 x1796043989671424/t0(0) o39->work2-MDT0001-osp-MDT0003@23421@kfi:24/4 lens 224/224 e 0 to 1 dl 1712841383 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
      

      To reach the 5-minute limit it is enough to hit 4-5 disconnect timeouts. When the cluster has 200 OSP devices, this happens easily.
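The arithmetic above can be sketched as follows. This is a simplified model, not Lustre code: it assumes the OSP disconnects are serialized during umount and that each unreachable peer costs one full RPC timeout (~71 s, taken from the log above).

```python
# Sketch: how few serialized disconnect timeouts exceed the HA budget.
# RPC_TIMEOUT and HA_LIMIT come from the issue description; the serial
# accumulation model is an assumption for illustration.

RPC_TIMEOUT = 71        # seconds per timed-out o39 disconnect (from the log)
HA_LIMIT = 5 * 60       # 300-second HA failover timeout

def umount_lower_bound(timed_out_disconnects: int) -> int:
    """Lower bound on umount wall time if timed-out disconnects serialize."""
    return timed_out_disconnects * RPC_TIMEOUT

# Smallest number of serialized timeouts that blows the HA budget.
needed = HA_LIMIT // RPC_TIMEOUT + 1
print(needed)                       # 5
print(umount_lower_bound(needed))   # 355 seconds > 300-second HA limit
```

Five timeouts already cost 355 seconds, so with 200 OSP devices only a small fraction of peers need to be unresponsive for umount to overrun the failover window.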

      With HA disabled, umount takes about 7 minutes:

      [root@work2n005 ~]# dmesg -T | grep work2-MDT0003 | egrep 'Failing|complete'
      [Thu Apr 11 09:18:18 2024] Lustre: Failing over work2-MDT0003
      [Thu Apr 11 09:25:25 2024] Lustre: server umount work2-MDT0003 complete
      


People

    Assignee: Alexander Boyko (aboyko)
    Reporter: Alexander Boyko (aboyko)
    Votes: 0
    Watchers: 4

Dates

    Created:
    Updated:
    Resolved: