Lustre / LU-11092

NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [ptlrpcd_00_18:4222]

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • None
    • Affects Version/s: Lustre 2.10.4
    • None
    • Environment: SLES12SP3 -- 4.4.132-94.33-default client -- lustre 2.10.4
      OSS & MDS on CentOS 7.3 -- lustre 2.10.0 (with a couple of hand-picked patches)
    • Severity: 3

    Description

      I'm running a robinhood scan on a Lustre filesystem with 80M files. As the system comes under memory pressure (80M files at ~1kB per lustre_inode_cache entry is roughly 80GB) and tries to shrink the slabs, I get these errors in /var/log/messages:

      2018-06-20T10:49:34.165875-04:00 btsotbal3000 kernel: [ 4889.674359] Lustre: lustre-MDT0000-mdc-ffff883fd53bb000: Connection to lustre-MDT0000 (at 10.11.201.11@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      2018-06-20T10:51:38.493202-04:00 btsotbal3000 kernel: [ 5014.015342] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1529505581/real 0] req@ffff881f7b3513c0 x1603798909054112/t0(0) o103->lustre-OST008c-osc-ffff883fd53bb000@10.11.201.18@o2ib:17/18 lens 328/224 e 0 to 1 dl 1529505643 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      2018-06-20T10:51:38.493217-04:00 btsotbal3000 kernel: [ 5014.017476] Lustre: lustre-OST008c-osc-ffff883fd53bb000: Connection to lustre-OST008c (at 10.11.201.18@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      2018-06-20T10:51:39.012492-04:00 btsotbal3000 kernel: [ 5014.527203] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1529505581/real 0] req@ffff881e37d8f080 x1603798909214432/t0(0) o103->lustre-OST0084-osc-ffff883fd53bb000@10.11.201.17@o2ib:17/18 lens 328/224 e 0 to 1 dl 1529505643 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      2018-06-20T10:51:39.012505-04:00 btsotbal3000 kernel: [ 5014.527207] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 262 previous similar messages
      2018-06-20T10:51:40.008987-04:00 btsotbal3000 kernel: [ 5015.533078] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1529505582/real 0] req@ffff881f2e08acc0 x1603798909677472/t0(0) o103->lustre-OST0089-osc-ffff883fd53bb000@10.11.201.18@o2ib:17/18 lens 328/224 e 0 to 1 dl 1529505644 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      2018-06-20T10:51:40.008998-04:00 btsotbal3000 kernel: [ 5015.533083] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 553 previous similar messages
      2018-06-20T10:51:42.022152-04:00 btsotbal3000 kernel: [ 5017.542234] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1529505582/real 0] req@ffff8818937e8340 x1603798910485936/t0(0) o103->lustre-OST00a9-osc-ffff883fd53bb000@10.11.200.15@o2ib:17/18 lens 328/224 e 0 to 1 dl 1529505644 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      2018-06-20T10:51:42.022165-04:00 btsotbal3000 kernel: [ 5017.542239] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1102 previous similar messages
      2018-06-20T10:51:46.030748-04:00 btsotbal3000 kernel: [ 5021.548246] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1529505584/real 0] req@ffff881e0945b640 x1603798911754992/t0(0) o103->lustre-OST00ab-osc-ffff883fd53bb000@10.11.200.15@o2ib:17/18 lens 328/224 e 0 to 1 dl 1529505648 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      2018-06-20T10:51:46.030762-04:00 btsotbal3000 kernel: [ 5021.548249] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 2104 previous similar messages
      2018-06-20T10:51:54.040347-04:00 btsotbal3000 kernel: [ 5029.557301] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1529505589/real 0] req@ffff881e054d39c0 x1603798914360720/t0(0) o103->lustre-OST009f-osc-ffff883fd53bb000@10.11.200.14@o2ib:17/18 lens 328/224 e 0 to 1 dl 1529505658 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      2018-06-20T10:51:54.040360-04:00 btsotbal3000 kernel: [ 5029.557307] Lustre: 4222:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 4325 previous similar messages
      2018-06-20T10:52:01.579715-04:00 btsotbal3000 kernel: [ 5037.097043] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [ptlrpcd_00_18:4222]
      2018-06-20T10:52:01.579728-04:00 btsotbal3000 kernel: [ 5037.106479] Modules linked in: nfsv3 nfs fscache mpt3sas raid_class scsi_transport_sas mptctl mptbase osc(OEN) mgc(OEN) lustre(OEN) lmv(OEN) fld(OEN) mdc(OEN) fid(OEN) lov(OEN) ko2iblnd(OEN) ptlrpc(OEN) obdclass(OEN) lnet(OEN) libcfs(OEN) af_packet iscsi_ibft iscsi_boot_sysfs rdma_ucm(OEX) ib_ucm(OEX) rdma_cm(OEX) iw_cm(OEX) configfs ib_ipoib(OEX) ib_cm(OEX) ib_uverbs(OEX) ib_umad(OEX) mlx4_en(OEX) mlx4_ib(OEX) mlx4_core(OEX) msr xfs libcrc32c intel_rapl sb_edac edac_core ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp ixgbe kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel mdio drbg ansi_cprng aesni_intel igb mgag200 ttm drm_kms_helper ipmi_si drm ipmi_devintf mei_me ipmi_msghandler pcspkr i2c_algo_bit fjes aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd syscopyarea sysfillrect sysimgblt fb_sys_fops iTCO_wdt iTCO_vendor_support dcdbas(X) dca shpchp lpc_ich mfd_core processor mei wmi button nfsd nfs_acl lockd auth_rpcgss grace sunrpc mlx5_ib(OEX) ext4 crc16 jbd2 mbcache ib_core(OEX) sd_mod mlx5_core(OEX) vxlan ip6_udp_tunnel udp_tunnel ptp pps_core devlink inet_lro ehci_pci mlx_compat(OEX) ehci_hcd ahci libahci usbcore libata megaraid_sas usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4 [last unloaded: dell_rbu]
      2018-06-20T10:52:01.579731-04:00 btsotbal3000 kernel: [ 5037.106543] Supported: No, Unsupported modules are loaded
      2018-06-20T10:52:01.579732-04:00 btsotbal3000 kernel: [ 5037.106546] CPU: 12 PID: 4222 Comm: ptlrpcd_00_18 Tainted: G OE NX 4.4.132-94.33-default #1
      2018-06-20T10:52:01.579733-04:00 btsotbal3000 kernel: [ 5037.106548] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.0.2 03/15/2016
      2018-06-20T10:52:01.579734-04:00 btsotbal3000 kernel: [ 5037.106549] task: ffff883f99d95640 ti: ffff883f99d98000 task.ti: ffff883f99d98000
      2018-06-20T10:52:01.579735-04:00 btsotbal3000 kernel: [ 5037.106550] RIP: 0010:[<ffffffff810cb01f>] [<ffffffff810cb01f>] native_queued_spin_lock_slowpath+0x10f/0x1a0
      2018-06-20T10:52:01.579736-04:00 btsotbal3000 kernel: [ 5037.106557] RSP: 0018:ffff883f99d9bc00 EFLAGS: 00000246
      2018-06-20T10:52:01.579736-04:00 btsotbal3000 kernel: [ 5037.106557] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000340000
      2018-06-20T10:52:01.579737-04:00 btsotbal3000 kernel: [ 5037.106558] RDX: ffff883fff05af80 RSI: 0000000000100001 RDI: ffff881ffae04500
      2018-06-20T10:52:01.579737-04:00 btsotbal3000 kernel: [ 5037.106559] RBP: ffff883f99d9bce0 R08: ffff881fff99af80 R09: 0000000000000000
      2018-06-20T10:52:01.579738-04:00 btsotbal3000 kernel: [ 5037.106560] R10: 000000000000002d R11: ffff881efa09381b R12: 00000000147f4371
      2018-06-20T10:52:01.579739-04:00 btsotbal3000 kernel: [ 5037.106561] R13: 00000000051fd0dc R14: 0000000000000000 R15: ffff883f99d95640
      2018-06-20T10:52:01.579740-04:00 btsotbal3000 kernel: [ 5037.106562] FS: 0000000000000000(0000) GS:ffff881fff980000(0000) knlGS:0000000000000000
      2018-06-20T10:52:01.579740-04:00 btsotbal3000 kernel: [ 5037.106563] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      2018-06-20T10:52:01.579740-04:00 btsotbal3000 kernel: [ 5037.106564] CR2: 00007f0e6c018898 CR3: 0000000001e0c000 CR4: 00000000003606f0
      2018-06-20T10:52:01.579741-04:00 btsotbal3000 kernel: [ 5037.106565] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      2018-06-20T10:52:01.579742-04:00 btsotbal3000 kernel: [ 5037.106566] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      2018-06-20T10:52:01.579743-04:00 btsotbal3000 kernel: [ 5037.106567] Stack:
      2018-06-20T10:52:01.579744-04:00 btsotbal3000 kernel: [ 5037.106568] ffffffff811915d2 ffffffffa0b98f15 ffffffffa0eee8db ffffffffa0eee8db
      2018-06-20T10:52:01.579744-04:00 btsotbal3000 kernel: [ 5037.106569] ffffffffa0eee8db ffffffffa0eee8db 0000000000000000 00080000ffffffff
      2018-06-20T10:52:01.579744-04:00 btsotbal3000 kernel: [ 5037.106571] 5b2a6a1100000000 00000000ffffffff ffff883f92821048 ffffffffa0eee8db
      2018-06-20T10:52:01.579745-04:00 btsotbal3000 kernel: [ 5037.106573] Call Trace:
      2018-06-20T10:52:01.579745-04:00 btsotbal3000 kernel: [ 5037.106581] [<ffffffff811915d2>] queued_spin_lock_slowpath+0x7/0xa
      2018-06-20T10:52:01.579746-04:00 btsotbal3000 kernel: [ 5037.106605] [<ffffffffa0b98f15>] LNetMDUnlink+0x65/0x150 [lnet]
      2018-06-20T10:52:01.579747-04:00 btsotbal3000 kernel: [ 5037.106677] [<ffffffffa0e854be>] ptlrpc_unregister_reply+0xee/0x6b0 [ptlrpc]
      2018-06-20T10:52:01.579748-04:00 btsotbal3000 kernel: [ 5037.106703] [<ffffffffa0e89de3>] ptlrpc_expire_one_request+0xb3/0x420 [ptlrpc]
      2018-06-20T10:52:01.579748-04:00 btsotbal3000 kernel: [ 5037.106726] [<ffffffffa0e8a1f9>] ptlrpc_expired_set+0xa9/0x180 [ptlrpc]
      2018-06-20T10:52:01.579749-04:00 btsotbal3000 kernel: [ 5037.106752] [<ffffffffa0eb7679>] ptlrpcd+0x229/0x490 [ptlrpc]
      2018-06-20T10:52:01.579749-04:00 btsotbal3000 kernel: [ 5037.106760] [<ffffffff8109ebc9>] kthread+0xc9/0xe0
      2018-06-20T10:52:01.579750-04:00 btsotbal3000 kernel: [ 5037.106765] [<ffffffff816177ef>] ret_from_fork+0x3f/0x80
      2018-06-20T10:52:01.579751-04:00 btsotbal3000 kernel: [ 5037.107943] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x80
      2018-06-20T10:52:01.579752-04:00 btsotbal3000 kernel: [ 5037.107944]
      2018-06-20T10:52:01.579752-04:00 btsotbal3000 kernel: [ 5037.107945] Leftover inexact backtrace:
      2018-06-20T10:52:01.579753-04:00 btsotbal3000 kernel: [ 5037.107945]
      2018-06-20T10:52:01.579753-04:00 btsotbal3000 kernel: [ 5037.107947] [<ffffffff8109eb00>] ? kthread_park+0x50/0x50
      2018-06-20T10:52:01.579754-04:00 btsotbal3000 kernel: [ 5037.107948] Code: c1 e8 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 80 af 01 00 48 03 14 c5 20 0d f5 81 4c 89 02 41 8b 40 08 85 c0 75 0a f3 90 <41> 8b 40 08 85 c0 74 f6 4d 8b 08 4d 85 c9 74 08 41 0f 0d 09 eb

      At that point, all the CPUs go to 100% and it seems like we are not making progress anymore.

      I can repro by running robinhood for a few minutes and dropping the cache (echo 2 > /proc/sys/vm/drop_caches).
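
      For reference, here is a minimal sketch of that reproduction sequence (the robinhood flags and config path are illustrative, not the exact invocation):

        # Hedged sketch of the repro; flags and config path are placeholders.
        robinhood --scan -f /etc/robinhood.d/lustre.conf &   # start populating the client inode/lock caches
        sleep 300                                            # let it run for a few minutes
        echo 2 > /proc/sys/vm/drop_caches                    # force slab reclaim; the soft lockup follows shortly after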

      I managed to fix it by limiting lru_max_age to 10s (ldlm.namespaces.*.lru_max_age).
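
      For illustration, the workaround as a sketch; this assumes lru_max_age takes a value in milliseconds on this 2.10 client, consistent with the 3900000 default mentioned in the comments below:

        # Cap cached DLM lock age at ~10s (10000, assuming milliseconds)
        lctl set_param ldlm.namespaces.*.lru_max_age=10000
        # Verify the new value
        lctl get_param ldlm.namespaces.*.lru_max_age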

       

      Overall, it feels like when the slab shrinker runs, if there are too many locks, reclaiming those locks takes a while, prevents the RPC stack from making progress, and then we hit a race condition in LNetMDUnlink.
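
      One way to sanity-check that theory (a sketch using the standard client-side lock counters) is to watch the per-namespace lock counts around a cache drop:

        # Number of DLM locks currently cached per namespace
        lctl get_param ldlm.namespaces.*.lock_count
        # Force reclaim, then see whether the counts shrink or the ptlrpcd threads just spin
        echo 2 > /proc/sys/vm/drop_caches
        lctl get_param ldlm.namespaces.*.lock_count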

       

       

      Attachments

        1. client.txt
          16 kB
          Julien Wallior
        2. mds831.txt
          13 kB
          Julien Wallior
        3. mds832.txt
          13 kB
          Julien Wallior

        Issue Links

          Activity

            [LU-11092] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [ptlrpcd_00_18:4222]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-17428 [ LU-17428 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-13970 [ LU-13970 ]
            utopiabound Nathaniel Clark made changes -
            Link New: This issue is duplicated by DELL-540 [ DELL-540 ]
            ashehata Amir Shehata (Inactive) made changes -
            Remote Link New: This issue links to "Page (Whamcloud Community Wiki)" [ 23309 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-11075 [ LU-11075 ]

            jwallior Julien Wallior added a comment -

            Yes, we use stripe count > 1 in different directories.

            I tried lru_size = 100k and lru_max_age = 600k and that works fine too.


            adilger Andreas Dilger added a comment -

            It might be that, with so much RAM on the client, it isn't trying very hard to cancel locks. Setting the lock max age to 10s is probably too short to be useful for other locks on the system (e.g. directories that are being accessed repeatedly). Another option would be to set a limit on the number of locks on the client and increase the max age to something more useful like 10 minutes, e.g. lctl set_param ldlm.namespaces.*.lru_size=100000 ldlm.namespaces.*.lru_max_age=600000 or similar. If we assume all of the RAM in use on the client is used by Lustre, the 1M inodes cached would be using about 13KB/inode (including 5M locks), which means it would consume at most 13MB per OST for the inode/lock cache. It may be that you are seeing more locks than inodes if e.g. you have a default stripe count = 4 (one MDT lock and 4 OST locks per file).

            It probably makes sense to put some reasonable upper limit on the LRU size, since I've seen similar problems of too many cached locks on other systems with a lot of RAM. Even if there is enough free RAM it just slows things down when there are many millions of files cached for no real reason. We might also need to take a look at the kernel slab cache shrinker to see if it is working effectively to shrink the DLM cache size under memory pressure.
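
            For reference, a sketch of one way to check that; the slab names below are the ones a Lustre client typically registers and are an assumption here:

              # Snapshot the Lustre/LDLM slab caches, force reclaim, then compare
              grep -E 'lustre_inode_cache|ldlm_locks|ldlm_resources' /proc/slabinfo
              echo 2 > /proc/sys/vm/drop_caches
              grep -E 'lustre_inode_cache|ldlm_locks|ldlm_resources' /proc/slabinfo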

            adilger Andreas Dilger made changes -
            Link New: This issue is related to DDN-550 [ DDN-550 ]

            jwallior Julien Wallior added a comment -

            When I revert lru_max_age to the default (3900000), I only do it on the client, and I can repro. (This is a live production filesystem, so I can only break it so much.)

            Regarding who's running out of memory, I think it's the client (I can either wait for it to run out of memory, or trigger the issue earlier by dropping the cache on the client). Now, the client has 256GB, so even for an 80M-file filesystem, I would hope that is enough to run rbh.

            Finally, regarding the timeouts, they say "@@@ Request sent has timed out for sent delay". I was thinking that meant the client was stuck on something (clearing the memory) and didn't get to send the request on time. But maybe it means the client can't put the request on the wire because the server told it to wait (I'm not sure if that's possible).

            The client was freshly rebooted before starting the run. After running rbh --scan for ~20min, I got the following results.
            mds831.txt: data on the mds hosting mdt0
            mds832.txt: data on the mds hosting mdt1
            client.txt: data on the client

            I can get the data from the OSS too if you need it.

            jwallior Julien Wallior made changes -
            Attachment New: mds832.txt [ 30765 ]
            Attachment New: mds831.txt [ 30766 ]
            Attachment New: client.txt [ 30767 ]

            People

              Assignee: wc-triage WC Triage
              Reporter: jwallior Julien Wallior
              Votes: 0
              Watchers: 10
