Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.15.0
    • Fix Version/s: Lustre 2.15.0
    • Environment: 2.15.0-rc3
    • Severity: 3

    Description

      Hit the same issue twice during soak testing on the same MDS node:

      trying to connect to soak-10 ...
      2022-04-15 00:35:34,865:fsmgmt.fsmgmt:INFO     soak-10 is up!!!
      2022-04-15 00:35:45,877:fsmgmt.fsmgmt:INFO     Failing over soaked-MDT0002 ...
      2022-04-15 00:35:45,877:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0002 on soak-11 ...
      2022-04-15 00:36:42,914:fsmgmt.fsmgmt:INFO     ... soaked-MDT0002 mounted successfully on soak-11
      2022-04-15 00:36:42,914:fsmgmt.fsmgmt:INFO     ... soaked-MDT0002 failed over
      2022-04-15 00:36:42,914:fsmgmt.fsmgmt:INFO     Wait for recovery to complete
      2022-04-15 00:36:43,502:fsmgmt.fsmgmt:DEBUG    Recovery Result Record: {'soak-11': {'soaked-MDT0003': 'COMPLETE', 'soaked-MDT0002': 'WAITING'}}
      2022-04-15 00:36:43,503:fsmgmt.fsmgmt:INFO     soaked-MDT0002 in status 'WAITING'.
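
      The recovery state polled by the soak harness above can also be checked directly on the MDS; a minimal example, with target names taken from this ticket:

        # on soak-11: show per-target recovery status (state, connected clients, time remaining)
        lctl get_param mdt.soaked-MDT0002.recovery_status
        lctl get_param mdt.soaked-MDT0003.recovery_status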
      

      On soak-11:

      [136090.466173] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [136538.516253] Lustre: soaked-MDT0003: Client ec0f61fb-4e34-4a0b-b633-b84f3e77a68b (at 192.168.1.138@o2ib) reconnecting
      [136538.528122] Lustre: Skipped 8 previous similar messages
      [136570.792064] LustreError: 43760:0:(llog.c:781:llog_process_thread()) soaked-MDT0001-osp-MDT0002 retry remote llog process
      [136570.804339] LustreError: 43760:0:(llog.c:781:llog_process_thread()) Skipped 36771 previous similar messages
      [136690.448382] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reached hard limit: 900, extend: 1
      [136690.462972] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [137139.504191] Lustre: soaked-MDT0003: Client 6734fd00-3e57-422d-a9ca-e8a57bcfd8f1 (at 192.168.1.135@o2ib) reconnecting
      [137139.516055] Lustre: Skipped 10 previous similar messages
      [137170.777941] LustreError: 43760:0:(llog.c:781:llog_process_thread()) soaked-MDT0001-osp-MDT0002 retry remote llog process
      [137170.790212] LustreError: 43760:0:(llog.c:781:llog_process_thread()) Skipped 36789 previous similar messages
      [137180.761466] Lustre: mdt00_016: service thread pid 12073 was inactive for 200.281 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [137180.782540] Pid: 12073, comm: mdt00_016 3.10.0-1160.49.1.el7_lustre.x86_64 #1 SMP Mon Dec 13 22:00:59 UTC 2021
      [137180.793815] Call Trace:
      [137180.796783] [<0>] top_trans_stop+0x882/0xfa0 [ptlrpc]
      [137180.802560] [<0>] lod_trans_stop+0x25c/0x340 [lod]
      [137180.808065] [<0>] mdd_trans_stop+0x2e/0x174 [mdd]
      [137180.813433] [<0>] mdd_create+0x154a/0x1cd0 [mdd]
      [137180.818740] [<0>] mdo_create+0x46/0x48 [mdt]
      [137180.823632] [<0>] mdt_create+0xab1/0xe40 [mdt]
      [137180.828720] [<0>] mdt_reint_create+0x3a0/0x460 [mdt]
      [137180.834388] [<0>] mdt_reint_rec+0x8a/0x240 [mdt]
      [137180.839662] [<0>] mdt_reint_internal+0x76c/0xb50 [mdt]
      [137180.845516] [<0>] mdt_reint+0x67/0x150 [mdt]
      [137180.850456] [<0>] tgt_request_handle+0x92f/0x19c0 [ptlrpc]
      [137180.856729] [<0>] ptlrpc_server_handle_request+0x253/0xc30 [ptlrpc]
      [137180.863875] [<0>] ptlrpc_main+0xbf4/0x15e0 [ptlrpc]
      [137180.869420] [<0>] kthread+0xd1/0xe0
      [137180.873423] [<0>] ret_from_fork_nospec_begin+0x21/0x21
      [137180.879286] [<0>] 0xfffffffffffffffe
      [137290.446156] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reached hard limit: 900, extend: 1
      [137290.460730] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [137575.430810] Lustre: 7111:0:(service.c:1436:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply  req@ffff8f5f6cb94380 x1729527267995904/t176094400030(0) o36->f5a20605-7e7a-45f1-bf01-daf0664944ec@192.168.1.118@o2ib:514/0 lens 552/448 e 24 to 0 dl 1649907299 ref 2 fl Interpret:/0/0 rc 0/0 job:''
      [137740.477658] Lustre: soaked-MDT0003: Client ec0f61fb-4e34-4a0b-b633-b84f3e77a68b (at 192.168.1.138@o2ib) reconnecting
      [137740.489550] Lustre: Skipped 9 previous similar messages
      [137770.755234] LustreError: 43760:0:(llog.c:781:llog_process_thread()) soaked-MDT0001-osp-MDT0002 retry remote llog process
      [137770.767489] LustreError: 43760:0:(llog.c:781:llog_process_thread()) Skipped 36738 previous similar messages
      [137890.442958] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reached hard limit: 900, extend: 1
      [137890.457533] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [138341.465763] Lustre: soaked-MDT0003: Client 6734fd00-3e57-422d-a9ca-e8a57bcfd8f1 (at 192.168.1.135@o2ib) reconnecting
      [138341.477619] Lustre: Skipped 11 previous similar messages
      

      Had to run abort_recovery (a typical invocation is sketched after the log below):

      [188290.202475] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [188327.367445] LustreError: 37182:0:(mdt_handler.c:7419:mdt_iocontrol()) soaked-MDT0002: Aborting recovery for device
      [188327.379129] LustreError: 37182:0:(ldlm_lib.c:2876:target_stop_recovery_thread()) soaked-MDT0002: Aborting recovery
      [188327.390814] Lustre: 43762:0:(ldlm_lib.c:2283:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      [188327.392289] LustreError: 43760:0:(lod_dev.c:424:lod_sub_recovery_thread()) soaked-MDT0001-osp-MDT0002 get update log failed: rc = -11
      [188327.418009] Lustre: 43762:0:(ldlm_lib.c:2273:target_recovery_overseer()) soaked-MDT0002 recovery is aborted by hard timeout
      [188327.486462] Lustre: soaked-MDT0002-osp-MDT0003: Connection restored to 192.168.1.111@o2ib (at 0@lo)
      [188327.487906] Lustre: soaked-MDT0002: Recovery over after 2739:44, of 22 clients 22 recovered and 0 were evicted.
      [188345.146528] Lustre: Failing over soaked-MDT0002
      [188345.224943] LustreError: 3633:0:(mdt_reint.c:2741:mdt_reint_rename()) soaked-MDT0002: cannot lock for rename: rc = -5
      [188345.236924] LustreError: 3633:0:(mdt_reint.c:2741:mdt_reint_rename()) Skipped 1 previous similar message
      [188345.238945] Lustre: soaked-MDT0002: Not available for connect from 192.168.1.126@o2ib (stopping)
      [188345.238946] Lustre: Skipped 2 previous similar messages
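
      The exact command used to abort recovery is not recorded here; a typical invocation on the MDS holding the stuck target (device name from this ticket) would be:

        # stop waiting for the remaining clients/update logs and finish recovery immediately
        lctl --device soaked-MDT0002 abort_recovery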
      

      Attachments

        1. lu-14495-soak-8debug
          15.14 MB
        2. lu-14495-soak-8dmesg
          381 kB
        3. lu-14495-soak-8trace
          994 kB
        4. lustre-log.1651263727.22864
          102.86 MB
        5. lustre-log.1651732904.9736
          146.86 MB
        6. soak-10-debuglog
          133.07 MB
        7. soak-10-umount-hang-trace
          1016 kB
        8. soak-8dmesg
          337 kB
        9. soak-8trace
          995 kB

        Issue Links

          Activity

            [LU-15761] cannot finish MDS recovery

            bzzz Alex Zhuravlev added a comment -

            Can you grab all the traces with echo t >/proc/sysrq-trigger, please?
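
            For reference, a minimal way to capture what is being asked for here (the output file name is just an example):

              # dump stack traces of all tasks into the kernel ring buffer
              echo t > /proc/sysrq-trigger
              # save the resulting traces from the kernel log
              dmesg > /tmp/soak-11-sysrq-traces.log
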
            sarah Sarah Liu added a comment - edited

            bzzz, the same umount hang issue happened again on a different node. If you want to check the status, please let me know. I will also upload the trace from soak-10, where the hang happened.

            bzzz Alex Zhuravlev added a comment -

            Still looking at the logs, but I haven't got a good idea yet. Is it reproducible?
            sarah Sarah Liu added a comment -

            bzzz, please check the attached.


            bzzz Alex Zhuravlev added a comment -

            sarah, would it be possible to get all traces for the case?
            sarah Sarah Liu added a comment -

            Hi Colin,

            Soak didn't hit the same issue; instead it hit another issue on the other pair of MDSs (soak-8 and soak-9) during failback.

            On soak-8, umount hung when it tried to fail back MDT0001:

            [13201.893886] Lustre: Skipped 3 previous similar messages
            [13226.408821] LDISKFS-fs warning (device dm-1): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
            
            [13241.225041] LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.128@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
            [13241.244882] LustreError: Skipped 929 previous similar messages
            [13272.213200] LDISKFS-fs (dm-1): recovery complete
            [13272.218869] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
            [13274.329038] Lustre: soaked-MDT0001: Not available for connect from 192.168.1.111@o2ib (not set up)
            [13274.339066] Lustre: Skipped 1 previous similar message
            [13274.402654] Lustre: soaked-MDT0000: Received MDS connection from 0@lo, removing former export from 192.168.1.109@o2ib
            [13274.615826] Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
            [13274.633101] Lustre: soaked-MDT0001: in recovery but waiting for the first client to connect
            [13274.686553] Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 22 clients reconnect
            [13283.173712] Lustre: soaked-MDT0001-osp-MDT0000: Connection restored to 192.168.1.108@o2ib (at 0@lo)
            [13283.174024] Lustre: soaked-MDT0001: Recovery over after 0:08, of 22 clients 22 recovered and 0 were evicted.
            [13283.187022] Lustre: 3818:0:(mdt_recovery.c:200:mdt_req_from_lrd()) @@@ restoring transno  req@ffff9849b4361680 x1730744898968000/t38655174014(0) o36->88b0180e-0530-4e8d-8af5-9a48dbccd57d@192.168.1.119@o2ib:271/0 lens 496/2888 e 0 to 0 dl 1650719436 ref 1 fl Interpret:/2/0 rc 0/0 job:''
            [13283.187026] Lustre: 3818:0:(mdt_recovery.c:200:mdt_req_from_lrd()) Skipped 3 previous similar messages
            [13291.941059] Lustre: Failing over soaked-MDT0001
            [13291.949360] LustreError: 5845:0:(mdt_reint.c:2741:mdt_reint_rename()) soaked-MDT0001: cannot lock for rename: rc = -5
            [13291.957834] LustreError: 3666:0:(ldlm_lockd.c:2500:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1650719399 with bad export cookie 17448559810846504576
            [13291.957883] Lustre: soaked-MDT0001-osp-MDT0000: Connection to soaked-MDT0001 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
            [13291.957978] LustreError: 11-0: soaked-MDT0001-osp-MDT0000: operation out_update to node 0@lo failed: rc = -107
            [13291.957984] Lustre: soaked-MDT0001: Not available for connect from 0@lo (stopping)
            [13291.996099] LustreError: 3684:0:(osp_object.c:629:osp_attr_get()) soaked-MDT0002-osp-MDT0001: osp_attr_get update error [0x280004a85:0x3:0x0]: rc = -108
            [13292.029469] LustreError: 5845:0:(mdt_reint.c:2741:mdt_reint_rename()) Skipped 2 previous similar messages
            [13293.222589] Lustre: soaked-MDT0001: Not available for connect from 192.168.1.111@o2ib (stopping)
            [13293.232426] Lustre: Skipped 7 previous similar messages
            [13294.448007] LustreError: 3645:0:(client.c:1256:ptlrpc_import_delay_req()) @@@ IMP_CLOSED  req@ffff9849b28df500 x1730891251304896/t0(0) o41->soaked-MDT0002-osp-MDT0001@192.168.1.110@o2ib:24/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:QU/0/ffffffff rc 0/-1 job:''
            [13294.473664] LustreError: 3645:0:(client.c:1256:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
            [13296.155550] Lustre: soaked-MDT0001: Not available for connect from 192.168.1.129@o2ib (stopping)
            [13296.165402] Lustre: Skipped 1 previous similar message
            [13300.253260] Lustre: soaked-MDT0001: Not available for connect from 192.168.1.121@o2ib (stopping)
            [13300.263095] Lustre: Skipped 13 previous similar messages
            [13308.262410] Lustre: soaked-MDT0001: Not available for connect from 192.168.1.111@o2ib (stopping)
            [13308.272238] Lustre: Skipped 36 previous similar messages
            [13324.545237] Lustre: soaked-MDT0001: Not available for connect from 192.168.1.127@o2ib (stopping)
            [13324.555060] Lustre: Skipped 122 previous similar messages
            [13357.071204] Lustre: soaked-MDT0001: Not available for connect from 0@lo (stopping)
            [13357.079677] Lustre: Skipped 226 previous similar messages
            [13421.084164] Lustre: soaked-MDT0001: Not available for connect from 192.168.1.123@o2ib (stopping)
            [13421.094016] Lustre: Skipped 458 previous similar messages
            [13441.068783] INFO: task umount:25201 blocked for more than 120 seconds.
            [13441.076096] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [13441.084855] umount          D ffff9845eaf8d280     0 25201  25197 0x00000080
            [13441.092782] Call Trace:
            [13441.095539]  [<ffffffff91b89179>] schedule+0x29/0x70
            [13441.101100]  [<ffffffff91b8a8b5>] rwsem_down_write_failed+0x215/0x3c0
            [13441.108344]  [<ffffffffc1ac50e0>] ? osp_trans_update_request_create+0x30/0x50 [osp]
            [13441.116905]  [<ffffffff91798327>] call_rwsem_down_write_failed+0x17/0x30
            [13441.124415]  [<ffffffff91b8868d>] down_write+0x2d/0x3d
            [13441.130225]  [<ffffffffc0df0bbe>] llog_cancel_arr_rec+0x14e/0xa40 [obdclass]
            [13441.138142]  [<ffffffffc0df14ce>] llog_cancel_rec+0x1e/0x20 [obdclass]
            [13441.145474]  [<ffffffffc0df813e>] llog_cat_cleanup+0xee/0x3f0 [obdclass]
            [13441.153015]  [<ffffffffc0df8556>] llog_cat_close+0x116/0x240 [obdclass]
            [13441.160452]  [<ffffffffc19979e6>] lod_sub_fini_llog+0xa6/0x2e0 [lod]
            [13441.167568]  [<ffffffffc1998ccc>] lod_process_config+0x76c/0x1340 [lod]
            [13441.174981]  [<ffffffffc1a292fe>] mdd_process_config+0x8e/0x640 [mdd]
            [13441.182214]  [<ffffffffc1870921>] mdt_stack_pre_fini+0x271/0x6f0 [mdt]
            [13441.189538]  [<ffffffffc1871b37>] mdt_device_fini+0x87/0x990 [mdt]
            [13441.196489]  [<ffffffffc0e2de91>] class_cleanup+0xa61/0xd20 [obdclass]
            [13441.203849]  [<ffffffffc0e2ee97>] class_process_config+0x6a7/0x2b20 [obdclass]
            [13441.211958]  [<ffffffffc0e314d6>] class_manual_cleanup+0x1c6/0x760 [obdclass]
            [13441.219967]  [<ffffffffc0e6b6f5>] server_put_super+0xa25/0xf80 [obdclass]
            [13441.227555]  [<ffffffff916977f6>] ? fsnotify_unmount_inodes+0x46/0x1d0
            [13441.234856]  [<ffffffff91650f5d>] generic_shutdown_super+0x6d/0x100
            [13441.241866]  [<ffffffff91651362>] kill_anon_super+0x12/0x20
            [13441.248134]  [<ffffffffc14d2d7b>] lustre_kill_super+0x2b/0x30 [lustre]
            [13441.255455]  [<ffffffff9165173e>] deactivate_locked_super+0x4e/0x70
            [13441.262458]  [<ffffffff91651ec6>] deactivate_super+0x46/0x60
            [13441.268792]  [<ffffffff9167130f>] cleanup_mnt+0x3f/0x80
            [13441.274634]  [<ffffffff916713a2>] __cleanup_mnt+0x12/0x20
            [13441.280692]  [<ffffffff914c299b>] task_work_run+0xbb/0xe0
            [13441.286737]  [<ffffffff9142cc65>] do_notify_resume+0xa5/0xc0
            [13441.293085]  [<ffffffff91b962ef>] int_signal+0x12/0x17
            [13549.118275] Lustre: soaked-MDT0001: Not available for connect from 192.168.1.102@o2ib (stopping)
            [13549.128106] Lustre: Skipped 917 previous similar messages
            [13561.296954] INFO: task umount:25201 blocked for more than 120 seconds.
            [13561.304262] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [13561.313021] umount          D ffff9845eaf8d280     0 25201  25197 0x00000080
            [13561.320948] Call Trace:
            [13561.323722]  [<ffffffff91b89179>] schedule+0x29/0x70
            [13561.329295]  [<ffffffff91b8a8b5>] rwsem_down_write_failed+0x215/0x3c0
            [13561.336541]  [<ffffffffc1ac50e0>] ? osp_trans_update_request_create+0x30/0x50 [osp]
            [13561.345118]  [<ffffffff91798327>] call_rwsem_down_write_failed+0x17/0x30
            [13561.352614]  [<ffffffff91b8868d>] down_write+0x2d/0x3d
            [13561.358425]  [<ffffffffc0df0bbe>] llog_cancel_arr_rec+0x14e/0xa40 [obdclass]
            [13561.366349]  [<ffffffffc0df14ce>] llog_cancel_rec+0x1e/0x20 [obdclass]
            [13561.373688]  [<ffffffffc0df813e>] llog_cat_cleanup+0xee/0x3f0 [obdclass]
            [13561.381198]  [<ffffffffc0df8556>] llog_cat_close+0x116/0x240 [obdclass]
            [13561.388628]  [<ffffffffc19979e6>] lod_sub_fini_llog+0xa6/0x2e0 [lod]
            [13561.395764]  [<ffffffffc1998ccc>] lod_process_config+0x76c/0x1340 [lod]
            [13561.403174]  [<ffffffffc1a292fe>] mdd_process_config+0x8e/0x640 [mdd]
            [13561.410429]  [<ffffffffc1870921>] mdt_stack_pre_fini+0x271/0x6f0 [mdt]
            [13561.417740]  [<ffffffffc1871b37>] mdt_device_fini+0x87/0x990 [mdt]
            [13561.424679]  [<ffffffffc0e2de91>] class_cleanup+0xa61/0xd20 [obdclass]
            [13561.432019]  [<ffffffffc0e2ee97>] class_process_config+0x6a7/0x2b20 [obdclass]
            [13561.440115]  [<ffffffffc0e314d6>] class_manual_cleanup+0x1c6/0x760 [obdclass]
            [13561.448130]  [<ffffffffc0e6b6f5>] server_put_super+0xa25/0xf80 [obdclass]
            [13561.455738]  [<ffffffff916977f6>] ? fsnotify_unmount_inodes+0x46/0x1d0
            [13561.463042]  [<ffffffff91650f5d>] generic_shutdown_super+0x6d/0x100
            [13561.470059]  [<ffffffff91651362>] kill_anon_super+0x12/0x20
            [13561.476333]  [<ffffffffc14d2d7b>] lustre_kill_super+0x2b/0x30 [lustre]
            [13561.483635]  [<ffffffff9165173e>] deactivate_locked_super+0x4e/0x70
            [13561.490656]  [<ffffffff91651ec6>] deactivate_super+0x46/0x60
            [13561.497002]  [<ffffffff9167130f>] cleanup_mnt+0x3f/0x80
            [13561.502856]  [<ffffffff916713a2>] __cleanup_mnt+0x12/0x20
            [13561.508890]  [<ffffffff914c299b>] task_work_run+0xbb/0xe0
            [13561.514946]  [<ffffffff9142cc65>] do_notify_resume+0xa5/0xc0
            [13561.521284]  [<ffffffff91b962ef>] int_signal+0x12/0x17
            [13563.147828] Lustre: mdt01_009: service thread pid 3882 was inactive for 236.694 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            [13563.168704] Pid: 3882, comm: mdt01_009 3.10.0-1160.49.1.el7_lustre.x86_64 #1 SMP Mon Dec 13 22:00:59 UTC 2021
            [13563.179794] Call Trace:
            [13563.182673] [<0>] top_trans_stop+0x882/0xfa0 [ptlrpc]
            [13563.188353] [<0>] lod_trans_stop+0x25c/0x340 [lod]
            [13563.193761] [<0>] mdd_trans_stop+0x2e/0x174 [mdd]
            [13563.199052] [<0>] mdd_create+0x154a/0x1cd0 [mdd]
            [13563.204259] [<0>] mdo_create+0x46/0x48 [mdt]
            [13563.209061] [<0>] mdt_create+0xab1/0xe40 [mdt]
            [13563.214062] [<0>] mdt_reint_create+0x3a0/0x460 [mdt]
            [13563.219660] [<0>] mdt_reint_rec+0x8a/0x240 [mdt]
            [13563.224847] [<0>] mdt_reint_internal+0x76c/0xb50 [mdt]
            [13563.230632] [<0>] mdt_reint+0x67/0x150 [mdt]
            [13563.235465] [<0>] tgt_request_handle+0x92f/0x19c0 [ptlrpc]
            [13563.241663] [<0>] ptlrpc_server_handle_request+0x253/0xc30 [ptlrpc]
            [13563.248722] [<0>] ptlrpc_main+0xbf4/0x15e0 [ptlrpc]
            [13563.254185] [<0>] kthread+0xd1/0xe0
            [13563.258089] [<0>] ret_from_fork_nospec_begin+0x21/0x21
            [13563.263866] [<0>] 0xfffffffffffffffe
            [13591.927826] Lustre: soaked-MDT0000: Received new MDS connection from 192.168.1.110@o2ib, keep former export from same NID
            [13591.940130] Lustre: Skipped 1 previous similar message
            [13681.525125] INFO: task umount:25201 blocked for more than 120 seconds.
            [13681.532427] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [13681.541183] umount          D ffff9845eaf8d280     0 25201  25197 0x00000080
            
            cfaber Colin Faber added a comment -

            Hi sarah 

            Any luck reproducing?

            -cf

            sarah Sarah Liu added a comment -

            Hi Andreas, I didn't notice this problem on the previous master tags, I will reload soak tomorrow to try the patch, thank you.


            adilger Andreas Dilger added a comment -

            Sarah, I didn't notice that you have already hit this problem twice. Could you please test with patch https://review.whamcloud.com/46837 "LU-15645 obdclass: llog to handle gaps" (already based on 2.15.0-RC3) to see if this solves the recovery problem?
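
            A sketch of pulling that change for a test build (the Gerrit project path and the trailing patchset number are assumptions; check the change page for the current patchset):

              # fetch patchset 1 of change 46837 from Whamcloud Gerrit and check it out
              git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/37/46837/1
              git checkout FETCH_HEAD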

            adilger Andreas Dilger added a comment -

            Sarah, it would be useful to know the last 2.14.5x version where this test was still passing. Have you restarted soak with the current build to see if this is repeatable, or was it a one-time failure?

            According to https://wiki.whamcloud.com/display/Releases/Soak+Testing+on+Soak it looks like 2.14.56 was run on 20211223, and 2.15.0-RC2-2-g94f4e1f was run on 20220302. Did those builds work without error?

            The number of patches that landed since 2.15.0-RC2-2-g94f4e1f is quite small (excluding test patches):

            03e4b451b0dc New RC 2.15.0-RC3
            649d638467c0 LU-15670 clio: Disable lockless for DIO with O_APPEND
            22de0bd145b6 LU-15616 lnet: ln_api_mutex deadlocks
            3e3f70eb1ec9 LU-13714 lnet: only update gateway NI status on discovery
            0090b6f6f6cf LU-15692 lmv: change default hash back to fnv_1a_64
            42a6d1fdb681 LU-15702 lov: remove lo_trunc_stripeno
            ef826db1f43b LU-15548 osd-ldiskfs: hide virtual projid xattr
            9884f37985c1 LU-15637 llite: Fix use of uninitialized fields
            d8012811cc6f LU-15435 ptlrpc: unregister reply buffer on rq_err
            4d93fd79e8b0 LU-15661 nodemap: fix map mode value for 'both'
            2f496148c31d LU-15551 ofd: Return EOPNOTSUPP instead of EPROTO
            9dcbf8b3d44f LU-15634 ptlrpc: Use after free of 'conn' in rhashtable retry
            f14090e56c9d LU-15546 mdt: mdt_reint_open lookup before locking
            966ca46e4aa2 LU-15608 sec: fix DIO for encrypted files
            1dbcbd70f87b LU-15021 quota: protect lqe_glbl_data in lqe
            5da859e262dd LU-15576 osp: Interop skip sanity test 823 for MDS < 2.14.56
            06134ff9adde LU-15601 osd-ldiskfs: handle read_inode_bitmap() error
            591a990c617f LU-15513 lod: skip uninit component in lod_fill_mirrors
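
            For reference, a list like the one above can be regenerated from a lustre-release checkout using the two commits named in this comment (94f4e1f from 2.15.0-RC2-2-g94f4e1f, and 03e4b451b0dc for RC3):

              # list non-merge commits between the two builds discussed above
              git log --oneline --no-merges 94f4e1f..03e4b451b0dc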
            
            pjones Peter Jones added a comment -

            Mike

            Could you please investigate?

            Peter


            People

              Assignee: tappro Mikhail Pershin
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 13
