  Lustre / LU-10494

clients in locked state: spirit-29 kernel: LustreError: 11-0: zfstest-OST0000-osc-ffff881010be4000: operation ldlm_enqueue to node 192.168.2.1@o2ib failed: rc = -107

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.11.0, Lustre 2.10.3
    • Labels: None
    • Environment: Performance Spirit Cluster.
      Servers and Clients 2.10.3_RC1
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      Clients in locked state:
      OSS: spirit-aeon-1.log

      spirit-aeon-1.log:Jan 11 18:25:19 spirit-aeon-1 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 192.168.1.21@o2ib  ns: filter-zfstest-OST0002_UUID lock: ffff880085f45000/0x78dafdfb659c8831 lrc: 3/0,0 mode: PR/PR res: [0x2f948:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.21@o2ib remote: 0x444cce0aa8c1b51c expref: 93939 pid: 10432 timeout: 4295871382 lvb_type: 1
      spirit-aeon-1.log:Jan 11 18:25:19 spirit-aeon-1 kernel: LustreError: 10447:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88081d26bc00 x1589320376498320/t0(0) o104->zfstest-OST0002@192.168.1.18@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      spirit-aeon-1.log:Jan 11 18:25:19 spirit-aeon-1 kernel: LustreError: 10447:0:(client.c:1166:ptlrpc_import_delay_req()) Skipped 1 previous similar message
      spirit-aeon-1.log:Jan 11 18:25:21 spirit-aeon-1 kernel: LustreError: 10481:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880820cdbc00 x1589320376498464/t0(0) o104->zfstest-OST0002@192.168.1.21@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      spirit-aeon-1.log:Jan 11 18:25:21 spirit-aeon-1 kernel: LustreError: 10481:0:(client.c:1166:ptlrpc_import_delay_req()) Skipped 1 previous similar message
      spirit-aeon-1.log:Jan 11 18:25:21 spirit-aeon-1 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 192.168.1.20@o2ib  ns: filter-zfstest-OST0000_UUID lock: ffff8808130f5800/0x78dafdfb6598887f lrc: 3/0,0 mode: PR/PR res: [0x2b7d8:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.20@o2ib remote: 0x1f6e0be3f64f847c expref: 91031 pid: 10470 timeout: 4295873298 lvb_type: 1
      spirit-aeon-1.log:Jan 11 18:25:21 spirit-aeon-1 kernel: LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) Skipped 4 previous similar messages
      spirit-aeon-1.log:Jan 11 18:25:32 spirit-aeon-1 kernel: LustreError: 22938:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 192.168.1.16@o2ib  ns: filter-zfstest-OST0000_UUID lock: ffff8807bf296c00/0x78dafdfb6597d9a2 lrc: 3/0,0 mode: PR/PR res: [0x2b6b5:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.16@o2ib remote: 0x270a88b21111f96b expref: 85199 pid: 10554 timeout: 4295884213 lvb_type: 1
      spirit-aeon-1.log:Jan 11 18:25:32 spirit-aeon-1 kernel: LustreError: 22938:0:(ldlm_lockd.c:334:waiting_locks_callback()) Skipped 2 previous similar messages
      spirit-aeon-1.log:Jan 11 18:25:32 spirit-aeon-1 kernel: LustreError: 10550:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88081d268600 x1589320376498832/t0(0) o104->zfstest-OST0000@192.168.1.16@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      spirit-aeon-1.log:Jan 11 18:25:32 spirit-aeon-1 kernel: LustreError: 10550:0:(client.c:1166:ptlrpc_import_delay_req()) Skipped 4 previous similar messages
      spirit-aeon-1.log:Jan 11 18:25:34 spirit-aeon-1 kernel: LustreError: 10431:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88081d26b600 x1589320376499072/t0(0) o104->zfstest-OST0002@192.168.1.21@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      spirit-aeon-1.log:Jan 11 18:25:34 spirit-aeon-1 kernel: LustreError: 10431:0:(client.c:1166:ptlrpc_import_delay_req()) Skipped 6 previous similar messages
      spirit-aeon-1.log:Jan 11 18:25:38 spirit-aeon-1 kernel: LustreError: 10550:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88003516f000 x1589320376499360/t0(0) o104->zfstest-OST0000@192.168.1.16@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      spirit-aeon-1.log:Jan 11 18:25:38 spirit-aeon-1 kernel: LustreError: 10550:0:(client.c:1166:ptlrpc_import_delay_req()) Skipped 6 previous similar messages
      spirit-aeon-1.log:Jan 11 18:25:47 spirit-aeon-1 kernel: LustreError: 10299:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880820cda100 x1589320376499840/t0(0) o104->zfstest-OST0000@192.168.1.16@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      spirit-aeon-1.log:Jan 11 18:25:47 spirit-aeon-1 kernel: LustreError: 10299:0:(client.c:1166:ptlrpc_import_delay_req()) Skipped 6 previous similar messages
      spirit-aeon-1.log:Jan 11 18:25:59 spirit-aeon-1 kernel: LustreError: 16964:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 192.168.1.27@o2ib  ns: filter-zfstest-OST0002_UUID lock: ffff880793e21e00/0x78dafdfb659be2d8 lrc: 3/0,0 mode: PR/PR res: [0x2f1f8:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.27@o2ib remote: 0x4d4a0c820ccb8af9 expref: 54505 pid: 10542 timeout: 4295911095 lvb_type: 1
      spirit-aeon-1.log:Jan 11 18:26:03 spirit-aeon-1 kernel: LustreError: 10247:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880846aaf300 x1589320376504784/t0(0) o104->zfstest-OST0002@192.168.1.27@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      spirit-aeon-1.log:Jan 11 18:26:03 spirit-aeon-1 kernel: LustreError: 10247:0:(client.c:1166:ptlrpc_import_delay_req()) Skipped 27 previous similar messages
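
The OSS eviction messages above share a regular shape, so they can be tallied per client NID and OST namespace. The following is a minimal Python sketch, assuming the syslog lines are available as plain text. The function name `count_evictions` and the regular expression are illustrative and not part of Lustre. Because the kernel rate-limits these messages ("Skipped N previous similar messages"), the counts are lower bounds.

```python
import re
from collections import Counter

# Matches the message printed by waiting_locks_callback(), for example:
#   "... lock callback timer expired after 101s: evicting client at
#    192.168.1.21@o2ib  ns: filter-zfstest-OST0002_UUID lock: ..."
EVICT_RE = re.compile(
    r"lock callback timer expired after (?P<secs>\d+)s: "
    r"evicting client at (?P<nid>\S+)\s+ns: (?P<ns>\S+)"
)


def count_evictions(lines):
    """Return a Counter keyed by (client NID, namespace).

    Lines that do not contain an eviction message are ignored.
    """
    counts = Counter()
    for line in lines:
        match = EVICT_RE.search(line)
        if match:
            counts[(match.group("nid"), match.group("ns"))] += 1
    return counts


if __name__ == "__main__":
    # Shortened sample lines taken from the OSS log in this ticket.
    sample = [
        "LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) ### "
        "lock callback timer expired after 101s: evicting client at "
        "192.168.1.21@o2ib  ns: filter-zfstest-OST0002_UUID lock: ...",
        "LustreError: 0:0:(ldlm_lockd.c:334:waiting_locks_callback()) "
        "Skipped 4 previous similar messages",
    ]
    for (nid, ns), n in count_evictions(sample).items():
        print(nid, ns, n)
```

Running it over the full OSS log (for example `count_evictions(open("spirit-aeon-1.log"))`) would show which clients and OSTs are affected most often.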
      

      Clients:

      spirit-29.log:Jan 11 18:25:24 spirit-29 kernel: LustreError: 11-0: zfstest-OST0000-osc-ffff881010be4000: operation ldlm_enqueue to node 192.168.2.1@o2ib failed: rc = -107
      spirit-29.log:Jan 11 18:25:24 spirit-29 kernel: LustreError: 167-0: zfstest-OST0000-osc-ffff881010be4000: This client was evicted by zfstest-OST0000; in progress operations using this service will fail.
      spirit-29.log:Jan 11 18:25:37 spirit-29 kernel: LustreError: 3048:0:(ldlm_resource.c:1100:ldlm_resource_complain()) zfstest-OST0000-osc-ffff881010be4000: namespace resource [0x145010:0x0:0x0].0x0 (ffff881005cf2f00) refcount nonzero (1) after lock cleanup; forcing cleanup.
      spirit-29.log:Jan 11 18:25:37 spirit-29 kernel: LustreError: 3048:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x145010:0x0:0x0].0x0 (ffff881005cf2f00) refcount = 1
      spirit-29.log:Jan 11 18:25:37 spirit-29 kernel: LustreError: 3048:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x138010:0x0:0x0].0x0 (ffff881004979140) refcount = 1
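
The `rc = -107` in the client log is a negated Linux errno value. On Linux, errno 107 is ENOTCONN ("Transport endpoint is not connected"), which is consistent with the eviction message on the following log line. The snippet below is a small Python sketch for pulling the code out of such a line and looking it up; the helper names `parse_rc` and `describe_rc` are hypothetical, and the lookup reflects the errno table of the machine running the script, so it is only meaningful on Linux.

```python
import errno
import os
import re

# Matches the "rc = <number>" suffix used by LustreError 11-0 messages.
RC_RE = re.compile(r"rc = (-?\d+)")


def parse_rc(line):
    """Return the integer after 'rc = ' in a log line, or None if absent."""
    match = RC_RE.search(line)
    return int(match.group(1)) if match else None


def describe_rc(rc):
    """Map a negative kernel-style return code to (errno name, message).

    The name comes from the local platform's errno table; unknown codes
    are reported as "UNKNOWN".
    """
    code = abs(rc)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


if __name__ == "__main__":
    line = ("LustreError: 11-0: zfstest-OST0000-osc-ffff881010be4000: "
            "operation ldlm_enqueue to node 192.168.2.1@o2ib failed: rc = -107")
    rc = parse_rc(line)
    print(rc, describe_rc(rc))
```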
      

      Attachments

        1. lustre_log_dump.txt
          59 kB
        2. spirit-11.log
          2.34 MB
        3. vmcore-dmesg.txt
          115 kB


          Activity

            pjones Peter Jones added a comment -

            Closing as a duplicate; we can reopen if this issue is hit again with the mentioned fixes in place.

            utopiabound Nathaniel Clark added a comment -

            Spirit-14 also showed the soft lockup (as spirit-29 did).

            I still think this could be addressed by LU-9230 or possibly also LU-9313.
            standan Saurabh Tandan (Inactive) added a comment - - edited

            Hit this issue again during performance testing on master, tag 2.10.57, build 3703. As before, the ldiskfs tests ran fine and the issue was hit on ZFS only.
            OSS:

            spirit-aeon-2.log:Feb  7 01:35:49 spirit-aeon-2 kernel: LustreError: 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) ### lock callback timer expired after 99s: evicting client at 192.168.1.29@o2ib  ns: filter-zfstest-OST0002_UUID lock: ffff8807f6f34d80/0x5dc5605455b348f0 lrc: 3/0,0 mode: PR/PR res: [0x59b2b:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.29@o2ib remote: 0x44b5bbfe4e669eb5 expref: 55782 pid: 26481 timeout: 87043 lvb_type: 1
            spirit-aeon-2.log:Feb  7 01:35:50 spirit-aeon-2 kernel: LustreError: 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) ### lock callback timer expired after 99s: evicting client at 192.168.1.14@o2ib  ns: filter-zfstest-OST0001_UUID lock: ffff88069c672880/0x5dc5605455bc3910 lrc: 3/0,0 mode: PR/PR res: [0x5f453:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.14@o2ib remote: 0xfc6047aec383864d expref: 60700 pid: 9296 timeout: 87044 lvb_type: 1
            spirit-aeon-2.log:Feb  7 01:35:50 spirit-aeon-2 kernel: LustreError: 24899:0:(client.c:1147:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8807f1867600 x1591684776293664/t0(0) o104->zfstest-OST0002@192.168.1.21@o2ib:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
            spirit-aeon-2.log:Feb  7 01:35:52 spirit-aeon-2 kernel: LustreError: 28739:0:(ldlm_lockd.c:331:waiting_locks_callback()) ### lock callback timer expired after 99s: evicting client at 192.168.1.25@o2ib  ns: filter-zfstest-OST0001_UUID lock: ffff8807aea01680/0x5dc5605455bbc5c1 lrc: 3/0,0 mode: PR/PR res: [0x5e5f8:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 192.168.1.25@o2ib remote: 0x9cc6d935e8eb8efe expref: 52076 pid: 9489 timeout: 87046 lvb_type: 1
            

            One of the many clients:

            spirit-14.log:Feb  7 01:35:54 spirit-14 kernel: LustreError: 11-0: zfstest-OST0001-osc-ffff881011939000: operation ldlm_enqueue to node 192.168.2.2@o2ib failed: rc = -107
            spirit-14.log:Feb  7 01:35:54 spirit-14 kernel: LustreError: 167-0: zfstest-OST0001-osc-ffff881011939000: This client was evicted by zfstest-OST0001; in progress operations using this service will fail.
            spirit-14.log:Feb  7 01:36:10 spirit-14 kernel: LustreError: 19111:0:(ldlm_resource.c:1093:ldlm_resource_complain()) zfstest-OST0001-osc-ffff881011939000: namespace resource [0xc9010:0x0:0x0].0x0 (ffff880fce276e40) refcount nonzero (1) after lock cleanup; forcing cleanup.
            spirit-14.log:Feb  7 01:36:10 spirit-14 kernel: LustreError: 19111:0:(ldlm_resource.c:1669:ldlm_resource_dump()) --- Resource: [0xc9010:0x0:0x0].0x0 (ffff880fce276e40) refcount = 1
            spirit-14.log:Feb  7 01:36:10 spirit-14 kernel: LustreError: 19111:0:(ldlm_resource.c:1669:ldlm_resource_dump()) --- Resource: [0x9e010:0x0:0x0].0x0 (ffff880304b772c0) refcount = 1
            

            utopiabound Nathaniel Clark added a comment -

            For spirit-29 (client), the following precedes the above snippet (from the console log):

            spirit-29 login: [ 1111.158689] sched: RT throttling activated
            [ 1113.267810] NMI watchdog: BUG: soft lockup - CPU#16 stuck for 22s! [ldlm_poold:1893]
            [ 1113.360604] Modules linked in: osc(OE) mgc(OE) lustre(OE) lmv(OE) fld(OE) mdc(OE) fid(OE) lov(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ipmi_ssif iTCO_wdt iTCO_vendor_support ipmi_si joydev pcspkr ipmi_devintf sg ipmi_msghandler ioatdma wmi shpchp i2c_i801 mei_me mei lpc_ich nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif mlx4_en crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect isci sysimgblt fb_sys_fops igb ttm libsas crct10dif_pclmul ahci crct10dif_common drm mlx4_core ptp crc32c_intel scsi_transport_sas libahci pps_core dca i2c_algo_bit libata i2c_core devlink
            [ 1114.503227] CPU: 16 PID: 1893 Comm: ldlm_poold Tainted: G           OE  ------------   3.10.0-693.11.6.el7.x86_64 #1
            [ 1114.629344] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.01.0002.082220131453 08/22/2013
            [ 1114.752337] task: ffff88101d3d1fa0 ti: ffff881010a2c000 task.ti: ffff881010a2c000
            [ 1114.842003] RIP: 0010:[<ffffffff810fc032>]  [<ffffffff810fc032>] native_queued_spin_lock_slowpath+0x112/0x1e0
            [ 1114.960945] RSP: 0018:ffff881010a2fc90  EFLAGS: 00000246
            [ 1115.024575] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000810000
            [ 1115.110079] RDX: ffff88081e1195c0 RSI: 0000000000210000 RDI: ffff88080c2fb018
            [ 1115.195585] RBP: ffff881010a2fc90 R08: ffff88101d9195c0 R09: 0000000000000000
            [ 1115.281090] R10: 000000a51a3eadf2 R11: 0000000000000800 R12: 0000000000000101
            [ 1115.366595] R13: ffff880ffa920800 R14: 000000a51a3eadf2 R15: 0000000000000800
            [ 1115.452101] FS:  0000000000000000(0000) GS:ffff88101d900000(0000) knlGS:0000000000000000
            [ 1115.549061] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [ 1115.617903] CR2: 0000000000616dc0 CR3: 00000000019fa000 CR4: 00000000001607e0
            [ 1115.703406] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            [ 1115.788909] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            [ 1115.874409] Call Trace:
            [ 1115.903681]  [<ffffffff816a070f>] queued_spin_lock_slowpath+0xb/0xf
            [ 1115.984727]  [<ffffffff816ade20>] _raw_spin_lock+0x20/0x30
            [ 1116.056587]  [<ffffffffc0b9fdd5>] ldlm_prepare_lru_list+0x325/0x4e0 [ptlrpc]
            [ 1116.147140]  [<ffffffffc0b9dca0>] ? ldlm_iter_helper+0x20/0x20 [ptlrpc]
            [ 1116.232506]  [<ffffffffc0ba4f01>] ldlm_cancel_lru+0x61/0x170 [ptlrpc]
            [ 1116.315748]  [<ffffffffc0bb0191>] ldlm_cli_pool_recalc+0x231/0x240 [ptlrpc]
            [ 1116.405194]  [<ffffffffc0bb07e9>] ldlm_pool_recalc+0x109/0x1d0 [ptlrpc]
            [ 1116.489602]  [<ffffffffc0bb20a4>] ldlm_pools_recalc+0x224/0x3d0 [ptlrpc]
            [ 1116.574991]  [<ffffffffc0bb22e5>] ldlm_pools_thread_main+0x95/0x330 [ptlrpc]
            [ 1116.664388]  [<ffffffff810c6440>] ? wake_up_state+0x20/0x20
            [ 1116.736229]  [<ffffffffc0bb2250>] ? ldlm_pools_recalc+0x3d0/0x3d0 [ptlrpc]
            [ 1116.823523]  [<ffffffff810b252f>] kthread+0xcf/0xe0
            [ 1116.886873]  [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
            [ 1116.964785]  [<ffffffff816b8798>] ret_from_fork+0x58/0x90
            [ 1117.034359]  [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
            [ 1117.112174] Code: 13 48 c1 ea 0d 48 98 83 e2 30 48 81 c2 c0 95 01 00 48 03 14 c5 20 8e b1 81 4c 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 <41> 8b 40 08 85 c0 74 f6 4d 8b 08 4d 85 c9 74 04 41 0f 18 09 8b 

            Could this be related to LU-9230?
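
The soft-lockup trace for spirit-29 shows `ldlm_poold` spinning in `_raw_spin_lock`, reached from `ldlm_prepare_lru_list` via `ldlm_cancel_lru`. When comparing traces from several clients (spirit-14 reported the same lockup), it can help to reduce each trace to its function names. The following is a minimal Python sketch under the assumption that the console log uses the `[<address>] symbol+0xoff/0xsize [module]` frame format seen above; `parse_frames` is a hypothetical helper name. Frames prefixed with `?` are ones the kernel's stack unwinder could not confirm, so the sketch marks them as unreliable.

```python
import re

# One stack frame, e.g.
#   "[<ffffffffc0b9fdd5>] ldlm_prepare_lru_list+0x325/0x4e0 [ptlrpc]"
#   "[<ffffffffc0b9dca0>] ? ldlm_iter_helper+0x20/0x20 [ptlrpc]"
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\? )?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def parse_frames(text):
    """Return a list of (function, module or None, reliable) tuples.

    Frames are returned in the order they appear in the text, which is
    innermost first for a kernel call trace.
    """
    frames = []
    for match in FRAME_RE.finditer(text):
        frames.append((
            match.group("func"),
            match.group("module"),
            match.group("unreliable") is None,
        ))
    return frames


if __name__ == "__main__":
    trace = """\
[ 1115.903681]  [<ffffffff816a070f>] queued_spin_lock_slowpath+0xb/0xf
[ 1115.984727]  [<ffffffff816ade20>] _raw_spin_lock+0x20/0x30
[ 1116.056587]  [<ffffffffc0b9fdd5>] ldlm_prepare_lru_list+0x325/0x4e0 [ptlrpc]
[ 1116.147140]  [<ffffffffc0b9dca0>] ? ldlm_iter_helper+0x20/0x20 [ptlrpc]
[ 1116.232506]  [<ffffffffc0ba4f01>] ldlm_cancel_lru+0x61/0x170 [ptlrpc]
"""
    for func, module, reliable in parse_frames(trace):
        if reliable:
            print(func, module or "kernel")
```

Note that the `RIP:` line uses the same bracketed address format, so passing the whole console dump also yields the faulting function as a frame; restrict the input to the lines after `Call Trace:` if that is not wanted.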
            pjones Peter Jones added a comment -

            Nate

            Can you please advise on this one?

            Thanks

            Peter


            standan Saurabh Tandan (Inactive) added a comment -

            Log files attached above.

            People

              Assignee: utopiabound Nathaniel Clark
              Reporter: standan Saurabh Tandan (Inactive)
              Votes: 0
              Watchers: 5
