Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-17040

LFSCK hang with resending request on EINPROGRESS

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0
    • Upstream
    • 2MDTs cluster FOFB/IO tests
    • 3
    • 9223372036854775807

    Description

      LFSCK hanged, no progress after 10+ hours. There was an attempt to stop it

      UID          PID    PPID  C STIME TTY          TIME CMD
      root     1241254 1241253  0 00:29 ?        00:00:00 lctl lfsck_stop --device snx11922-MDT0001 --all
      

      The stop did not help and node was crashed after stop+10h. Here is ps -m and stacks.

      crash> ps -m | grep "lfsck\|scrub\|lctl"
      [0 00:00:01.186] [RU]  PID: 164187   TASK: ffff90055862af80  CPU: 6    COMMAND: "OI_scrub"
      [0 10:36:24.817] [ID]  PID: 1241254  TASK: ffff9002ab60c740  CPU: 19   COMMAND: "lctl"
      [0 23:45:12.588] [ID]  PID: 164188   TASK: ffff90055862c740  CPU: 8    COMMAND: "lfsck_layout"
      [0 23:45:12.623] [UN]  PID: 164186   TASK: ffff90055862df00  CPU: 9    COMMAND: "lfsck"
      [0 23:45:16.187] [ID]  PID: 164189   TASK: ffff900558628000  CPU: 17   COMMAND: "lfsck_namespace"
      
      PID: 164186   TASK: ffff90055862df00  CPU: 9    COMMAND: "lfsck"
       #0 [ffffa414250f7d08] __schedule at ffffffffa374e1d4
       #1 [ffffa414250f7d68] schedule at ffffffffa374e648
       #2 [ffffa414250f7d78] osd_otable_it_next at ffffffffc1db3983 [osd_ldiskfs]
       #3 [ffffa414250f7e08] lfsck_master_oit_engine at ffffffffc187f17e [lfsck]
       #4 [ffffa414250f7e78] lfsck_master_engine at ffffffffc188073e [lfsck]
       #5 [ffffa414250f7f10] kthread at ffffffffa2f043a6
       #6 [ffffa414250f7f50] ret_from_fork at ffffffffa380023f
      
      PID: 164187   TASK: ffff90055862af80  CPU: 6    COMMAND: "OI_scrub"
       #0 [fffffe000013be48] crash_nmi_callback at ffffffffa2e54863
       #1 [fffffe000013be50] nmi_handle at ffffffffa2e24c83
       #2 [fffffe000013bea8] default_do_nmi at ffffffffa3741f89
       #3 [fffffe000013bec8] do_nmi at ffffffffa2e2518e
       #4 [fffffe000013bef0] end_repeat_nmi at ffffffffa38015c4
          [exception RIP: kthread_should_stop+33]
          RIP: ffffffffa2f04fc1  RSP: ffffa414250ffce0  RFLAGS: 00000246
          RAX: 0000000000000000  RBX: ffff900576b28000  RCX: ffffa414250ffdc8
          RDX: ffff900576b2d880  RSI: ffff900576b28000  RDI: ffff900513717000
          RBP: ffffa414250ffdc8   R8: 0000000000000001   R9: 0000000000000000
          R10: ffffa414250ffc68  R11: 0000000000000000  R12: 0000000000000001
          R13: ffff900576b2d880  R14: ffff900513717000  R15: ffff900576b2d820
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      --- <NMI exception stack> ---
       #5 [ffffa414250ffce0] kthread_should_stop at ffffffffa2f04fc1
       #6 [ffffa414250ffce0] osd_scrub_next at ffffffffc1dadeb4 [osd_ldiskfs]
       #7 [ffffa414250ffd78] osd_inode_iteration at ffffffffc1db27fb [osd_ldiskfs]
       #8 [ffffa414250ffe40] osd_scrub_main at ffffffffc1db3319 [osd_ldiskfs]
       #9 [ffffa414250fff10] kthread at ffffffffa2f043a6
      #10 [ffffa414250fff50] ret_from_fork at ffffffffa380023f
      
      PID: 164188   TASK: ffff90055862c740  CPU: 8    COMMAND: "lfsck_layout"
       #0 [ffffa41425127da8] __schedule at ffffffffa374e1d4
       #1 [ffffa41425127e08] schedule at ffffffffa374e648
       #2 [ffffa41425127e18] lfsck_assistant_engine at ffffffffc18816e2 [lfsck]
       #3 [ffffa41425127f10] kthread at ffffffffa2f043a6
       #4 [ffffa41425127f50] ret_from_fork at ffffffffa380023f
      
      PID: 164189   TASK: ffff900558628000  CPU: 17   COMMAND: "lfsck_namespace"
       #0 [ffffa41425137da8] __schedule at ffffffffa374e1d4
       #1 [ffffa41425137e08] schedule at ffffffffa374e648
       #2 [ffffa41425137e18] lfsck_assistant_engine at ffffffffc18816e2 [lfsck]
       #3 [ffffa41425137f10] kthread at ffffffffa2f043a6
       #4 [ffffa41425137f50] ret_from_fork at ffffffffa380023f
      
      
      PID: 1241254  TASK: ffff9002ab60c740  CPU: 19   COMMAND: "lctl"
       #0 [ffffa4142502fc28] __schedule at ffffffffa374e1d4
       #1 [ffffa4142502fc88] schedule at ffffffffa374e648
       #2 [ffffa4142502fc98] lfsck_stop at ffffffffc1876fb5 [lfsck]
       #3 [ffffa4142502fd08] mdd_iocontrol at ffffffffc1a2e265 [mdd]
       #4 [ffffa4142502fd58] mdt_iocontrol at ffffffffc1bc5c2f [mdt]
       #5 [ffffa4142502fdf0] class_handle_ioctl at ffffffffc12a86ff [obdclass]
       #6 [ffffa4142502fe68] obd_class_ioctl at ffffffffc12a9047 [obdclass]
       #7 [ffffa4142502fe80] do_vfs_ioctl at ffffffffa312e0c4
       #8 [ffffa4142502fef8] ksys_ioctl at ffffffffa312e700
       #9 [ffffa4142502ff30] __x64_sys_ioctl at ffffffffa312e746
      #10 [ffffa4142502ff38] do_syscall_64 at ffffffffa2e0420b
       

      I have found that scrub inconsistent list has entry with zero inode. And looks like scrub with LFSCK looped it. When LFSCK get EINPROGRESS it resend a request. After scrub process oii, it deletes oii from a list, and then LFSCK tries osp_attr_get again and adds it to scrub. So inconsistent list is not empty, scrub cannot finish inode iteration loop.
      There is also lfsck stop bug, I guess LFSCK stop code doesn't have logic to stop scrub, but it requires additional ticket.

      crash> osd_inconsistent_item ffff9001fb355340 -x
      struct osd_inconsistent_item {
        oii_list = {
          next = 0xffff9001fb355900, 
          prev = 0xffff900576b2d488
        }, 
        oii_cache = {
          oic_fid = {
            f_seq = 0x240019e29, 
            f_oid = 0x1bac, 
            f_ver = 0x0
          }, 
          oic_lid = {
            oii_ino = 0x0, 
            oii_gen = 0x0
          }, 
          oic_dev = 0xffff900576b28000, 
          oic_remote = 0x0
        }, 
        oii_insert = 0x1
      }
      

      Attachments

        Issue Links

          Activity

            People

              aboyko Alexander Boyko
              aboyko Alexander Boyko
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: