Lustre / LU-11201

NMI watchdog: BUG: soft lockup in lfsck_namespace

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.6
    • Affects Version/s: Lustre 2.10.4
    • Environment: lustre 2.10.4_1.chaos
      kernel 3.10.0-862.9.1.1chaos.ch6.x86_64
      RHEL 7.5 based
      MDT
    • Severity: 3

    Description

      About 4 minutes after recovery ended following MDS startup, the console began reporting soft lockups, as shown below. The node began refusing connections from pdsh, and ltop running on the mgmt node began reporting stale data (it was no longer getting updates from cerebro).

      The watchdog reports a stack about every 40 seconds. Usually it is the same stack, shown below, with the same PID and the same CPU (CPU#6). Running lfs check servers on a compute node with the file system mounted still shows lquake-MDT0008 as active, and the node still responds to lctl ping. An ls of a directory stored on MDT0008 hangs in ptlrpc_set_wait(), called from ldlm_cli_enqueue(), called from mdc_enqueue().

      NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [lfsck_namespace:26532]
      Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) libcfs(OE) nfsv3 nfs_acl ib_ucm sb_edac intel_powerclamp coretemp rpcrdma intel_rapl rdma_ucm iosf_mbi ib_umad ib_uverbs ib_ipoib ib_iser kvm rdma_cm iw_cm libiscsi irqbypass ib_cm iTCO_wdt iTCO_vendor_support mlx5_ib ib_core joydev mlx5_core pcspkr mlxfw devlink lpc_ich i2c_i801 ioatdma ses enclosure sch_fq_codel sg ipmi_si shpchp acpi_cpufreq acpi_power_meter zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) binfmt_misc msr_safe(OE) ip_tables rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache dm_round_robin sd_mod crc_t10dif crct10dif_generic 8021q garp mrp stp llc mgag200 crct10dif_pclmul crct10dif_common i2c_algo_bit crc32_pclmul drm_kms_helper scsi_transport_iscsi crc32c_intel syscopyarea sysfillrect ghash_clmulni_intel sysimgblt dm_multipath ixgbe fb_sys_fops aesni_intel ttm lrw mxm_wmi ahci dca gf128mul libahci glue_helper ablk_helper drm mpt3sas cryptd ptp raid_class libata i2c_core pps_core scsi_transport_sas mdio ipmi_devintf ipmi_msghandler wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
      CPU: 3 PID: 26532 Comm: lfsck_namespace Kdump: loaded Tainted: P           OE  ------------   3.10.0-862.9.1.1chaos.ch6.x86_64 #1
      Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
      task: ffffa0df6ab94f10 ti: ffffa0df32650000 task.ti: ffffa0df32650000
      RIP: 0010:[<ffffffffc1402f1e>]  [<ffffffffc1402f1e>] lfsck_namespace_filter_linkea_entry.isra.64+0x8e/0x180 [lfsck]
      RSP: 0018:ffffa0df32653af0  EFLAGS: 00000202
      RAX: 0000000000000000 RBX: ffffa0df32653ab8 RCX: ffffa0df2e1d4971
      RDX: 0000000000000000 RSI: ffffa0df30ec8010 RDI: ffffa0df32653bc8
      RBP: ffffa0df32653b38 R08: 0000000000000000 R09: 0000000000000025
      R10: ffffa0df30ec8010 R11: 0000000000000000 R12: ffffa0df32653a70
      R13: ffffa0df32653acc R14: ffffffffc144d09c R15: ffffa0df33d5e240
      FS:  0000000000000000(0000) GS:ffffa0df7e6c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffff7ff8000 CR3: 0000005cd280e000 CR4: 00000000001607e0
      Call Trace:
       [<ffffffffc140d63d>] ? __lfsck_links_read+0x13d/0x2d0 [lfsck]
       [<ffffffffc14159af>] lfsck_namespace_double_scan_one+0x49f/0x14b0 [lfsck]
       [<ffffffffc087d50e>] ? dmu_buf_rele+0xe/0x10 [zfs]
       [<ffffffffc090225f>] ? zap_unlockdir+0x3f/0x60 [zfs]
       [<ffffffffc1416d82>] lfsck_namespace_double_scan_one_trace_file+0x3c2/0x7e0 [lfsck]
       [<ffffffffc141a7bd>] lfsck_namespace_assistant_handler_p2+0x79d/0xa80 [lfsck]
       [<ffffffffb2e03426>] ? kfree+0x136/0x180
       [<ffffffffc11a9ad8>] ? ptlrpc_set_destroy+0x208/0x4f0 [ptlrpc]
       [<ffffffffc13fe6a4>] lfsck_assistant_engine+0x13e4/0x21a0 [lfsck]
       [<ffffffffb2cd5c20>] ? wake_up_state+0x20/0x20
       [<ffffffffc13fd2c0>] ? lfsck_master_engine+0x1450/0x1450 [lfsck]
       [<ffffffffb2cc0ad1>] kthread+0xd1/0xe0
       [<ffffffffb2cc0a00>] ? insert_kthread_work+0x40/0x40
       [<ffffffffb3344837>] ret_from_fork_nospec_begin+0x21/0x21
       [<ffffffffb2cc0a00>] ? insert_kthread_work+0x40/0x40
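
To check whether the repeated watchdog reports really come from the same PID and CPU, the console log can be tallied with a short script. This is a minimal sketch, not part of the original report: the function name tally_lockups and the sample lines are hypothetical, and the regular expression assumes the message format shown in the trace above.

```python
import re
from collections import Counter

# Matches the watchdog line format seen in this report, e.g.
# "NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [lfsck_namespace:26532]"
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def tally_lockups(lines):
    """Count soft-lockup reports per (cpu, comm, pid) tuple."""
    counts = Counter()
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            counts[(int(m["cpu"]), m["comm"], int(m["pid"]))] += 1
    return counts


# Hypothetical sample input; in practice, iterate over the console log file.
sample = [
    "NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [lfsck_namespace:26532]",
    "Call Trace:",
    "NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [lfsck_namespace:26532]",
]

for key, n in tally_lockups(sample).items():
    print(key, n)
```

A single dominant (cpu, comm, pid) key in the output would suggest that one thread is spinning without yielding, rather than several threads contending.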
      

            People

              Assignee: Lai Siyao (laisiyao)
              Reporter: Olaf Faaland (ofaaland)
