Lustre / LU-6470

SWL tests appear to wedge on mutex, clients are evicted

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.8.0
    • None
    • Hyperion, 2.7.52 tag - ldiskfs format 200 clients
    • 3

    Description

      While running the SWL test on Hyperion, multiple clients time out and are eventually evicted due to lock timeouts.
      Typical client stack:

      INFO: task ior:76875 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      ior           D 0000000000000000     0 76875  76869 0x00000000
       ffff88083f5ddd18 0000000000000082 0000000000000000 ffff8808341ab1d8
       ffff88083f5ddc88 ffffffff81227e9f ffff88083f5ddd68 ffffffff81199045
       ffff880871a9b058 ffff88083f5ddfd8 000000000000fbc8 ffff880871a9b058
      Call Trace:
       [<ffffffff81227e9f>] ? security_inode_permission+0x1f/0x30
       [<ffffffff81199045>] ? __link_path_walk+0x145/0x1000
       [<ffffffff8152a5be>] __mutex_lock_slowpath+0x13e/0x180
       [<ffffffff8152a45b>] mutex_lock+0x2b/0x50
       [<ffffffff8119ba76>] do_filp_open+0x2d6/0xd20
       [<ffffffff811bd6b8>] ? do_statfs_native+0x98/0xb0
       [<ffffffff8128f83a>] ? strncpy_from_user+0x4a/0x90
       [<ffffffff811a8b82>] ? alloc_fd+0x92/0x160
       [<ffffffff81185be9>] do_sys_open+0x69/0x140
       [<ffffffff81185d00>] sys_open+0x20/0x30
       [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
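
      To compare stacks across many hung clients, a report like the one above can be reduced to its reliable frames (the ones the kernel does not mark with "?"). This is only a minimal sketch: the regular expressions are derived from this single trace, the names parse_hung_task and SAMPLE are hypothetical, and other kernel versions may format the report differently.

```python
import re

# Patterns inferred from the one trace in this report (an assumption,
# not a general parser for every kernel's hung-task output).
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_hung_task(text):
    """Return task name, pid, timeout, and the frames not marked with '?'."""
    header = HUNG_RE.search(text)
    if header is None:
        return None
    frames = [
        m.group("sym")
        for m in FRAME_RE.finditer(text)
        if m.group("unreliable") is None
    ]
    return {
        "comm": header.group("comm"),
        "pid": int(header.group("pid")),
        "secs": int(header.group("secs")),
        "frames": frames,
    }


# Excerpt of the client trace quoted in this report.
SAMPLE = """\
INFO: task ior:76875 blocked for more than 120 seconds.
Call Trace:
 [<ffffffff81227e9f>] ? security_inode_permission+0x1f/0x30
 [<ffffffff81199045>] ? __link_path_walk+0x145/0x1000
 [<ffffffff8152a5be>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8152a45b>] mutex_lock+0x2b/0x50
 [<ffffffff8119ba76>] do_filp_open+0x2d6/0xd20
 [<ffffffff81185be9>] do_sys_open+0x69/0x140
 [<ffffffff81185d00>] sys_open+0x20/0x30
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
"""

if __name__ == "__main__":
    info = parse_hung_task(SAMPLE)
    print(info["comm"], info["pid"], " <- ".join(info["frames"]))
```

      For this trace the innermost reliable frames are __mutex_lock_slowpath and mutex_lock, called from do_filp_open, so the task is waiting on a mutex in the open path.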
      

      Server side:

      Apr 16 12:45:55 iws5 kernel: LustreError: 0:0:(ldlm_lockd.c:341:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 192.168.124.165@o2ib  ns: filter-lustre-OST0024_UUID lock: ffff8801eac5e740/0x9c91b8d7046afd8 lrc: 3/0,0 mode: PR/PR res: [0x1d8c6:0x0:0x0].0 rrc: 13 type: EXT [0->18446744073709551615] (req 29796335616->29930553343) flags: 0x60000000010020 nid: 192.168.124.165@o2ib remote: 0x42e28ecfaef2b33a expref: 6 pid: 109819 timeout: 4391206162 lvb_type: 0
      Apr 16 13:45:23 iws3 kernel: LustreError: 0:0:(ldlm_lockd.c:341:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.124.165@o2ib  ns: filter-lustre-OST002d_UUID lock: ffff8805ccaf1180/0xc2e6f2e60c6a3a1f lrc: 3/0,0 mode: PR/PR res: [0x28183:0x0:0x0].0 rrc: 10 type: EXT [0->18446744073709551615] (req 29527900160->29662117887) flags: 0x60000000010020 nid: 192.168.124.165@o2ib remote: 0x42e28ecfaef2c102 expref: 5 pid: 23687 timeout: 4394807701 lvb_type: 0
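
      When many OSS nodes log evictions, it can help to pull out the evicted NID, the OST namespace, and the requested extent from each line. The following is a rough sketch, assuming the message layout matches the two lines above; the pattern and the names parse_evictions and SAMPLE are hypothetical, and other Lustre versions may word the message differently.

```python
import re
from collections import Counter

# Pattern inferred from the two server-side lines in this report (an
# assumption; it is not taken from the Lustre source).
EVICT_RE = re.compile(
    r"lock callback timer expired after (?P<secs>\d+)s: "
    r"evicting client at (?P<nid>\S+)\s+"
    r"ns: (?P<ns>\S+) .*?mode: (?P<mode>\S+) .*?"
    r"\(req (?P<req_start>\d+)->(?P<req_end>\d+)\)"
)


def parse_evictions(lines):
    """Extract one dict per eviction message; non-matching lines are skipped."""
    found = []
    for line in lines:
        m = EVICT_RE.search(line)
        if m is None:
            continue
        start, end = int(m.group("req_start")), int(m.group("req_end"))
        found.append({
            "nid": m.group("nid"),
            "ns": m.group("ns"),
            "mode": m.group("mode"),
            "secs": int(m.group("secs")),
            # Extent bounds are inclusive, hence the +1.
            "req_bytes": end - start + 1,
        })
    return found


# The two server-side messages quoted in this report, verbatim.
SAMPLE = """\
Apr 16 12:45:55 iws5 kernel: LustreError: 0:0:(ldlm_lockd.c:341:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 192.168.124.165@o2ib  ns: filter-lustre-OST0024_UUID lock: ffff8801eac5e740/0x9c91b8d7046afd8 lrc: 3/0,0 mode: PR/PR res: [0x1d8c6:0x0:0x0].0 rrc: 13 type: EXT [0->18446744073709551615] (req 29796335616->29930553343) flags: 0x60000000010020 nid: 192.168.124.165@o2ib remote: 0x42e28ecfaef2b33a expref: 6 pid: 109819 timeout: 4391206162 lvb_type: 0
Apr 16 13:45:23 iws3 kernel: LustreError: 0:0:(ldlm_lockd.c:341:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.124.165@o2ib  ns: filter-lustre-OST002d_UUID lock: ffff8805ccaf1180/0xc2e6f2e60c6a3a1f lrc: 3/0,0 mode: PR/PR res: [0x28183:0x0:0x0].0 rrc: 10 type: EXT [0->18446744073709551615] (req 29527900160->29662117887) flags: 0x60000000010020 nid: 192.168.124.165@o2ib remote: 0x42e28ecfaef2c102 expref: 5 pid: 23687 timeout: 4394807701 lvb_type: 0
"""

if __name__ == "__main__":
    evictions = parse_evictions(SAMPLE.splitlines())
    for e in evictions:
        print(e["nid"], e["ns"], e["mode"], e["secs"], e["req_bytes"])
    print(Counter(e["nid"] for e in evictions))
```

      On the two lines above, both evictions are for the same client NID on different OSTs, and each requested extent spans 134217728 bytes (128 MiB).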
      

      This may be related to DDN-56. It is easy to reproduce if more data is required.
      I dumped the Lustre log from a client immediately after an eviction; the file is attached.


          People

            wc-triage WC Triage
            cliffw Cliff White (Inactive)