Lustre / LU-12832

soft lockup in ldlm_bl_xx threads during reads of a single shared strided file


Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0
    • Component/s: None
    • Environment: master (commit fcf219d) on both server and client
    • Severity: 3

    Description

      Many clients hit a soft lockup in the ldlm_bl_xx threads while reading a single shared strided file, the same workload as the io500 ior hard phase. In the end, ior failed because the clients were evicted.

      # mkdir /scratch0/io500.out/ior_hard -p
      # lfs setstripe -C 240 -S 16M /scratch0/io500.out/ior_hard
      
      # salloc --nodes=10 --ntasks-per-node=24 mpirun --allow-run-as-root /scratch0/ior -w -s 132000 -a POSIX -i 1 -C -Q 1 -g -G 27 -k -e -t 47008 -b 47008 -o /scratch0/io500.out/ior_hard/IOR_file -O stoneWallingStatusFile=/scratch0/io500.out/ior_hard/stonewall -O stoneWallingWearOut=1 -D 300 -W
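For context, a rough back-of-the-envelope sketch of why this workload produces so many extent locks per OST object (this assumes IOR's standard segmented layout, where the offset for a given rank and segment is `(segment * ntasks + rank) * blockSize`; the `offset` helper below is illustrative, not part of ior):

```python
# Numbers taken from the reproducer command line above.
# Assumption: IOR segmented layout, offset = (segment*ntasks + rank) * blockSize.
ntasks = 10 * 24            # --nodes=10 --ntasks-per-node=24
block = 47008               # -t 47008 -b 47008 (bytes)
stripe = 16 * 2 ** 20       # lfs setstripe -S 16M

def offset(rank, segment):
    """File offset touched by `rank` in `segment` (-C interleaves ranks)."""
    return (segment * ntasks + rank) * block

# 47008 bytes does not divide the 16 MiB stripe, so each rank's extents
# drift across stripe boundaries, and every OST object is covered by
# small interleaved extents from all 240 ranks -- one PW extent lock each.
print(stripe % block)               # nonzero remainder -> unaligned I/O
print(offset(1, 0) - offset(0, 0))  # adjacent ranks only one block apart
```

With 240 ranks and 132000 segments, neighboring ranks' extents land 47008 bytes apart on the same stripes, which matches the very large per-resource lock counts (rrc) in the server log below.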
      

      One of the server logs:

      Oct  6 21:27:46 es400nv-vm1 kernel: LustreError: 26369:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.0.13.151@o2ib10  ns: filter-ai400-OST0000_UUID lock: ffff995029f90480/0xe6d707f7408ef34e lrc: 3/0,0 mode: PW/PW res: [0x19:0x0:0x0].0x0 rrc: 30741 type: EXT [5842382848->5842853887] (req 5842382848->5842431999) flags: 0x60000400000020 nid: 10.0.13.151@o2ib10 remote: 0xeff507f1ffb50fc0 expref: 158923 pid: 27145 timeout: 11179 lvb_type: 0
      Oct  6 21:27:46 es400nv-vm1 kernel: Lustre: ai400-OST0000: Connection restored to 85106299-51f6-4 (at 10.0.13.154@o2ib10)
      Oct  6 21:27:46 es400nv-vm1 kernel: Lustre: Skipped 127 previous similar messages
      Oct  6 21:27:49 es400nv-vm1 kernel: LustreError: 26369:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.0.13.152@o2ib10  ns: filter-ai400-OST0001_UUID lock: ffff9950aee0d8c0/0xe6d707f740dbdd52 lrc: 3/0,0 mode: PW/PW res: [0x8:0x0:0x0].0x0 rrc: 26996 type: EXT [5140467712->5140746239] (req 5140467712->5140516863) flags: 0x60000400020020 nid: 10.0.13.152@o2ib10 remote: 0xc5564b8fed3e7bd7 expref: 166664 pid: 26904 timeout: 11182 lvb_type: 0
      Oct  6 21:27:49 es400nv-vm1 kernel: LustreError: 26369:0:(ldlm_lockd.c:261:expired_lock_main()) Skipped 2 previous similar messages
      Oct  6 21:27:53 es400nv-vm1 kernel: LustreError: 26986:0:(ldlm_lockd.c:2437:ldlm_cancel_handler()) ldlm_cancel from 10.0.13.152@o2ib10 arrived at 1570364871 with bad export cookie 16633772507216549573
      Oct  6 21:27:59 es400nv-vm1 kernel: LustreError: 26369:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.0.13.159@o2ib10  ns: filter-ai400-OST0000_UUID lock: ffff9950587b6400/0xe6d707f7402a10dc lrc: 3/0,0 mode: PW/PW res: [0x15:0x0:0x0].0x0 rrc: 18417 type: EXT [6164672512->6164860927] (req 6164672512->6164721663) flags: 0x60000400020020 nid: 10.0.13.159@o2ib10 remote: 0xd7d9cb14933b97f6 expref: 153927 pid: 26985 timeout: 11192 lvb_type: 0

      One of the client logs:

      Oct  6 21:27:54 c05 kernel: NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [ldlm_bl_01:151183]
      Oct  6 21:27:54 c05 kernel: Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) ip6table_nat ip6_tables xt_statistic xt_physdev xt_NFLOG nfnetlink_log veth dummy vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_defrag_ipv6 ip_set_hash_ip xt_set ip_set xt_nat ipt_REJECT nf_reject_ipv4 xt_comment xt_mark ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c br_netfilter bridge stp llc nfsv3 nfs_acl nfs lockd grace fscache overlay(T) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) ib_uverbs(OE)
      Oct  6 21:27:54 c05 kernel: mlx5_core(OE) mlxfw(OE) mlx4_en(OE) sunrpc iTCO_wdt iTCO_vendor_support mxm_wmi sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev sg i2c_i801 mei_me lpc_ich ioatdma mei ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_pad acpi_power_meter binfmt_misc knem(OE) ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_core(OE) sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci drm libahci igb crct10dif_pclmul crct10dif_common libata crc32c_intel mlx4_core(OE) ptp pps_core devlink dca i2c_algo_bit drm_panel_orientation_quirks mlx_compat(OE) [last unloaded: libcfs]
      Oct  6 21:27:54 c05 kernel: CPU: 9 PID: 151183 Comm: ldlm_bl_01 Kdump: loaded Tainted: G           OEL ------------ T 3.10.0-957.27.2.el7.x86_64 #1
      Oct  6 21:27:54 c05 kernel: Hardware name: Intel Corporation S2600KPR/S2600KPR, BIOS SE5C610.86B.01.01.0027.071020182329 07/10/2018
      Oct  6 21:27:54 c05 kernel: task: ffff8eb4c1eb4100 ti: ffff8eb64c178000 task.ti: ffff8eb64c178000
      Oct  6 21:27:54 c05 kernel: RIP: 0010:[<ffffffff89d13536>]  [<ffffffff89d13536>] native_queued_spin_lock_slowpath+0x126/0x200
      Oct  6 21:27:54 c05 kernel: RSP: 0018:ffff8eb64c17bc70  EFLAGS: 00000246
      Oct  6 21:27:54 c05 kernel: RAX: 0000000000000000 RBX: ffff8ec614df8280 RCX: 0000000000490000
      Oct  6 21:27:54 c05 kernel: RDX: ffff8eb67e79b780 RSI: 0000000000310001 RDI: ffff8ec614de0058
      Oct  6 21:27:54 c05 kernel: RBP: ffff8eb64c17bc70 R08: ffff8eb67e85b780 R09: 0000000000000000
      Oct  6 21:27:54 c05 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8eb64c17bc38
      Oct  6 21:27:54 c05 kernel: R13: ffff8eb1aa758000 R14: ffff8eb2af3e46c8 R15: ffff8eb1aa758098
      Oct  6 21:27:54 c05 kernel: FS:  0000000000000000(0000) GS:ffff8eb67e840000(0000) knlGS:0000000000000000
      Oct  6 21:27:54 c05 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Oct  6 21:27:54 c05 kernel: CR2: 00000000029cb000 CR3: 0000000f51610000 CR4: 00000000003607e0
      Oct  6 21:27:54 c05 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      Oct  6 21:27:54 c05 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Oct  6 21:27:54 c05 kernel: Call Trace:
      Oct  6 21:27:54 c05 kernel: [<ffffffff8a35e2cb>] queued_spin_lock_slowpath+0xb/0xf
      Oct  6 21:27:54 c05 kernel: [<ffffffff8a36c7a0>] _raw_spin_lock+0x20/0x30
      Oct  6 21:27:54 c05 kernel: [<ffffffffc0c3b32a>] cl_object_attr_lock+0x1a/0x20 [obdclass]
      Oct  6 21:27:54 c05 kernel: [<ffffffffc0fea746>] osc_ldlm_blocking_ast+0x2f6/0x3a0 [osc]
      Oct  6 21:27:54 c05 kernel: [<ffffffffc0e3e54a>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
      Oct  6 21:27:54 c05 kernel: [<ffffffffc0c14979>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      Oct  6 21:27:54 c05 kernel: [<ffffffffc0e49460>] ldlm_cli_cancel_local+0xa0/0x3f0 [ptlrpc]
      Oct  6 21:27:54 c05 kernel: [<ffffffffc0e4d97a>] ldlm_cli_cancel_list_local+0xea/0x260 [ptlrpc]
      Oct  6 21:27:54 c05 kernel: [<ffffffffc0e54370>] ldlm_bl_thread_main+0x5a0/0xa40 [ptlrpc]
      Oct  6 21:27:54 c05 kernel: [<ffffffff89cd7c40>] ? wake_up_state+0x20/0x20
      Oct  6 21:27:54 c05 kernel: [<ffffffffc0e53dd0>] ? ldlm_handle_bl_callback+0x4e0/0x4e0 [ptlrpc]
      Oct  6 21:27:54 c05 kernel: [<ffffffff89cc2e81>] kthread+0xd1/0xe0
      Oct  6 21:27:54 c05 kernel: [<ffffffff89cc2db0>] ? insert_kthread_work+0x40/0x40
      Oct  6 21:27:54 c05 kernel: [<ffffffff8a376c1d>] ret_from_fork_nospec_begin+0x7/0x21
      Oct  6 21:27:54 c05 kernel: [<ffffffff89cc2db0>] ? insert_kthread_work+0x40/0x40
      Oct  6 21:27:54 c05 kernel: Code: 0d 48 98 83 e2 30 48 81 c2 80 b7 01 00 48 03 14 c5 a0 bf 94 8a 4c 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 41 8b 40 08 <85> c0 74 f6 4d 8b 08 4d 85 c9 74 04 41 0f 18 09 8b 17 0f b7 c2 
      Oct  6 21:27:56 c05 kernel: LustreError: 152233:0:(ldlm_resource.c:1147:ldlm_resource_complain()) ai400-OST0006-osc-ffff8eafdda3d800: namespace resource [0x10:0x0:0x0].0x0 (ffff8ebf4ddcc840) refcount nonzero (1404) after lock cleanup; forcing cleanup.
      Oct  6 21:27:56 c05 kernel: Lustre: ai400-OST0006-osc-ffff8eafdda3d800: Connection restored to 10.0.10.178@o2ib10 (at 10.0.10.178@o2ib10)
      Oct  6 21:27:58 c05 kernel: LustreError: 152229:0:(ldlm_resource.c:1147:ldlm_resource_complain()) ai400-OST0002-osc-ffff8eafdda3d800: namespace resource [0x10:0x0:0x0].0x0 (ffff8ec4263355c0) refcount nonzero (1729) after lock cleanup; forcing cleanup.
      Oct  6 21:27:58 c05 kernel: LustreError: 152229:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 59 previous similar messages
      Oct  6 21:27:58 c05 kernel: Lustre: ai400-OST0002-osc-ffff8eafdda3d800: Connection restored to 10.0.10.176@o2ib10 (at 10.0.10.176@o2ib10)


            People

              Assignee: wshilong Wang Shilong (Inactive)
              Reporter: sihara Shuichi Ihara
