LU-11391: soft lockup in ldlm_prepare_lru_list()


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Environment: CentOS 7.5 patchfull and Lustre 2.11.55 on AMD EPYC servers

    Description

      While testing the master branch (tag 2.11.55), I hit soft lockups in ldlm_prepare_lru_list() (workqueue: ldlm_pools_recalc_task) on the client when running mdtest from the IO-500 benchmark with a single client.

      [212288.213417] NMI watchdog: BUG: soft lockup - CPU#35 stuck for 22s! [kworker/35:1:600]
      [212288.221336] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) osc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase rpcsec_gss_krb5 dell_rbu auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_ucm rpcrdma rdma_ucm ib_uverbs ib_iser ib_umad rdma_cm iw_cm libiscsi ib_ipoib scsi_transport_iscsi ib_cm mlx5_ib ib_core sunrpc vfat fat amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass crc32_pclmul ghash_clmulni_intel dcdbas aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg dm_multipath ccp dm_mod pcspkr shpchp i2c_piix4 ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
      [212288.294305]  fb_sys_fops ttm mlx5_core crct10dif_pclmul drm ahci mlxfw crct10dif_common tg3 libahci crc32c_intel devlink megaraid_sas ptp libata i2c_core pps_core
      [212288.307953] CPU: 35 PID: 600 Comm: kworker/35:1 Kdump: loaded Tainted: G           OEL ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1
      [212288.320378] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.3.6 04/20/2018
      [212288.328069] Workqueue: events ldlm_pools_recalc_task [ptlrpc]
      [212288.333925] task: ffff908bfc470000 ti: ffff908bfc464000 task.ti: ffff908bfc464000
      [212288.341491] RIP: 0010:[<ffffffff9fd08ff2>]  [<ffffffff9fd08ff2>] native_queued_spin_lock_slowpath+0x122/0x200
      [212288.351518] RSP: 0018:ffff908bfc467be8  EFLAGS: 00000246
      [212288.356918] RAX: 0000000000000000 RBX: 0000000000002000 RCX: 0000000001190000
      [212288.364139] RDX: ffff906bffb99740 RSI: 0000000001b10000 RDI: ffff90abfa32953c
      [212288.371358] RBP: ffff908bfc467be8 R08: ffff908bffb19740 R09: 0000000000000000
      [212288.378577] R10: 0000fbd0948bcb20 R11: 7fffffffffffffff R12: ffff907ca99fd018
      [212288.385796] R13: 0000000000000000 R14: 0000000000018b40 R15: 0000000000018b40
      [212288.393017] FS:  00007f96c2fbb740(0000) GS:ffff908bffb00000(0000) knlGS:0000000000000000
      [212288.401190] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [212288.407023] CR2: 00007f8de6b4da88 CR3: 0000000aaf40e000 CR4: 00000000003407e0
      [212288.414244] Call Trace:
      [212288.416793]  [<ffffffffa0309510>] queued_spin_lock_slowpath+0xb/0xf
      [212288.423151]  [<ffffffffa0316840>] _raw_spin_lock+0x20/0x30
      [212288.428755]  [<ffffffffc0fb9280>] ldlm_pool_set_clv+0x20/0x40 [ptlrpc]
      [212288.435391]  [<ffffffffc0f9c956>] ldlm_cancel_lrur_policy+0xd6/0x100 [ptlrpc]
      [212288.442639]  [<ffffffffc0f9e4ca>] ldlm_prepare_lru_list+0x1fa/0x4c0 [ptlrpc]
      [212288.449797]  [<ffffffffc0f9c880>] ? ldlm_iter_helper+0x20/0x20 [ptlrpc]
      [212288.456522]  [<ffffffffc0fa3e31>] ldlm_cancel_lru+0x61/0x170 [ptlrpc]
      [212288.463076]  [<ffffffffc0fb7741>] ldlm_cli_pool_recalc+0x231/0x240 [ptlrpc]
      [212288.470148]  [<ffffffffc0fb785c>] ldlm_pool_recalc+0x10c/0x1f0 [ptlrpc]
      [212288.476874]  [<ffffffffc0fb7abc>] ldlm_pools_recalc_delay+0x17c/0x1d0 [ptlrpc]
      [212288.484208]  [<ffffffffc0fb7cd3>] ldlm_pools_recalc_task+0x1c3/0x260 [ptlrpc]
      [212288.491431]  [<ffffffff9fcb35ef>] process_one_work+0x17f/0x440
      [212288.497356]  [<ffffffff9fcb4686>] worker_thread+0x126/0x3c0
      [212288.503016]  [<ffffffff9fcb4560>] ? manage_workers.isra.24+0x2a0/0x2a0
      [212288.509629]  [<ffffffff9fcbb621>] kthread+0xd1/0xe0
      [212288.514594]  [<ffffffff9fcbb550>] ? insert_kthread_work+0x40/0x40
      [212288.520776]  [<ffffffffa03205e4>] ret_from_fork_nospec_begin+0xe/0x21
      [212288.527300]  [<ffffffff9fcbb550>] ? insert_kthread_work+0x40/0x40
      [212288.533479] Code: 13 48 c1 ea 0d 48 98 83 e2 30 48 81 c2 40 97 01 00 48 03 14 c5 a0 53 93 a0 4c 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 <41> 8b 40 08 85 c0 74 f6 4d 8b 08 4d 85 c9 74 04 41 0f 18 09 8b 
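
      For context on why the watchdog fires here: the trace shows ldlm_prepare_lru_list() invoking the LRU cancel policy for every lock it walks on the namespace LRU, and the policy publishing its computed lock volume through ldlm_pool_set_clv(), which takes the pool spinlock. The sketch below is a simplified reading of that path, not verbatim Lustre source; field names such as pl_lock and pl_client_lock_volume, the ldlm_pool_get_slv()/ldlm_pool_get_lvf() accessors, and the seconds_since() shorthand are assumptions based on my reading of the 2.11-era ldlm code:

      /* Simplified sketch of the contended path in the trace above; not
       * verbatim Lustre source.  pl_lock, pl_client_lock_volume and
       * seconds_since() are shorthand/assumptions about the 2.11-era code. */

      /* Publishes the client lock volume under the pool spinlock -- this
       * appears to be the spin_lock the soft-lockup CPU is stuck on. */
      void ldlm_pool_set_clv(struct ldlm_pool *pl, __u64 clv)
      {
              spin_lock(&pl->pl_lock);
              pl->pl_client_lock_volume = clv;
              spin_unlock(&pl->pl_lock);
      }

      /* Called by ldlm_prepare_lru_list() for every lock it walks on the
       * namespace LRU.  Each call takes pl_lock via ldlm_pool_set_clv(),
       * so with a very large LRU (mdtest creating/statting hundreds of
       * thousands of files from one client) the pool lock is acquired
       * once per LRU lock while other threads contend on it from their
       * own cancel/recalc paths. */
      static enum ldlm_policy_res
      ldlm_cancel_lrur_policy(struct ldlm_namespace *ns, struct ldlm_lock *lock,
                              int unused, int added, int count)
      {
              struct ldlm_pool *pl = &ns->ns_pool;
              __u64 slv = ldlm_pool_get_slv(pl);
              __u64 lvf = ldlm_pool_get_lvf(pl);
              __u64 lv  = lvf * seconds_since(lock->l_last_used) * unused;

              ldlm_pool_set_clv(pl, lv);      /* spinlock taken per LRU lock */

              return (slv == 0 || lv > slv) ? LDLM_POLICY_KEEP_LOCK :
                                              LDLM_POLICY_CANCEL_LOCK;
      }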
      

      I triggered a crash dump that can be made available if anyone is interested; just let me know. Attaching vmcore-dmesg.txt and the output of foreach bt.

      The client was running the following part of the IO-500 benchmark:

      [Starting] mdtest_easy_stat
      [Exec] mpirun -np 24 /home/sthiell/io-500-dev/bin/mdtest -T -F -d /firbench/nodom/datafiles/io500.2018.09.17-19.30.06/mdt_easy -n 200000 -u -L -x /firbench/nodom/datafiles/io500.2018.09.17-19.30.06/mdt_easy-stonewall
      

       
      Best,
      Stephane

      Attachments

        1. foreach_bt.txt (745 kB, Stephane Thiell)
        2. vmcore-dmesg.txt (1020 kB, Stephane Thiell)

            People

              Assignee: Yang Sheng (ys)
              Reporter: Stephane Thiell (sthiell)
              Votes: 0
              Watchers: 7
