Lustre / LU-552

1.8<->2.1 interop: parallel-scale connectathon test hung

    Description

      The parallel-scale connectathon test hung after printing the following:

      Test #7 - Test parent/child mutual exclusion.
      	Parent: 7.0  - F_TLOCK [             ffc,               9] PASSED.
      	Parent: Wrote 'aaaa eh' to testfile [ 4092, 7 ].
      	Parent: Now free child to run, should block on lock.
      	Parent: Check data in file to insure child blocked.
      	Parent: Read 'aaaa eh' from testfile [ 4092, 7 ].
      	Parent: 7.1  - COMPARE [             ffc,               7] PASSED.
      	Parent: Now unlock region so child will unblock.
      	Parent: 7.2  - F_ULOCK [             ffc,               9] PASSED.
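
For readers unfamiliar with this test, the pattern in the output above can be sketched roughly as follows. This is not the connectathon source; it is a minimal Python approximation that assumes POSIX byte-range locks via `fcntl.lockf` and a pair of pipes for parent/child signalling. The function name `run_parent_child_exclusion`, the pipe protocol, and the timeouts are hypothetical; only the offset (0xffc), length (9), and the written string come from the log.

```python
import fcntl
import os
import select
import signal

OFFSET = 0xFFC  # 4092, region start shown in the test output
LENGTH = 9      # region length shown in the test output
DATA = b"aaaa eh"


def run_parent_child_exclusion(path, timeout=5.0):
    """Lock a region, let a child contend for it, then unlock.

    Returns a dict of three booleans describing what was observed.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    go_r, go_w = os.pipe()      # parent -> child: "start contending"
    done_r, done_w = os.pipe()  # child -> parent: "lock granted"

    # Step 7.0: non-blocking exclusive lock, then write inside the region.
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, LENGTH, OFFSET, os.SEEK_SET)
    os.pwrite(fd, DATA, OFFSET)

    pid = os.fork()
    if pid == 0:
        try:
            os.read(go_r, 1)
            # Blocking request: should sleep until the parent unlocks.
            fcntl.lockf(fd, fcntl.LOCK_EX, LENGTH, OFFSET, os.SEEK_SET)
            os.write(done_w, b"k")
        finally:
            os._exit(0)

    os.close(done_w)  # so a dead child shows up as EOF on done_r
    os.write(go_w, b"g")

    # Child must still be blocked while the parent holds the lock.
    readable, _, _ = select.select([done_r], [], [], 0.3)
    blocked_while_locked = not readable

    # Step 7.1: the data written under the lock must be intact.
    data_ok = os.pread(fd, len(DATA), OFFSET) == DATA

    # Step 7.2: unlock; the child should now be granted the lock.
    fcntl.lockf(fd, fcntl.LOCK_UN, LENGTH, OFFSET, os.SEEK_SET)
    readable, _, _ = select.select([done_r], [], [], timeout)
    granted_after_unlock = bool(readable) and os.read(done_r, 1) == b"k"

    if not granted_after_unlock:
        os.kill(pid, signal.SIGKILL)  # child never woke up: the hang case
    os.waitpid(pid, 0)
    for f in (fd, go_r, go_w, done_r):
        os.close(f)
    return {
        "blocked_while_locked": blocked_while_locked,
        "data_ok": data_ok,
        "granted_after_unlock": granted_after_unlock,
    }
```

Run against a file on the Lustre mount, a result with `granted_after_unlock` false would correspond to the hang reported here: the parent has unlocked, but the child's blocking lock request is never completed. The timeout is only there so the sketch reports the condition instead of hanging itself.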
      

      On client node fat-amd-3-ib:

      [root@fat-amd-3-ib tests]# ps auxww
      <~snip~>
      root     16272  0.0  0.0 107268  2160 pts/0    S+   08:19   0:00 bash /usr/lib64/lustre/tests/parallel-scale.sh
      root     16274  0.0  0.0 106020  1304 pts/0    S+   08:19   0:00 sh runtests -f
      root     16281  0.0  0.0   6600   556 pts/0    S+   08:19   0:00 tlocklfs /mnt/lustre/d0.connectathon
      root     16282  0.0  0.0   6436   332 pts/0    S+   08:19   0:00 tlocklfs /mnt/lustre/d0.connectathon
      
      [root@fat-amd-3-ib tests]# echo t > /proc/sysrq-trigger
      <~snip~>
      tlocklfs      S 0000000000000004     0 16281  16274 0x00000080
       ffff880234c9dca8 0000000000000082 0000000000000000 0000000000000082
       ffff880234c9dc28 ffff8803d5c97cb8 0000000000000000 0000000101fbf9c8
       ffff8803190f7ab8 ffff880234c9dfd8 000000000000f598 ffff8803190f7ab8
      Call Trace:
       [<ffffffff8117bf7b>] pipe_wait+0x5b/0x80
       [<ffffffff8108e100>] ? autoremove_wake_function+0x0/0x40
       [<ffffffff814dbc1e>] ? mutex_lock+0x1e/0x50
       [<ffffffff8117c9d6>] pipe_read+0x3e6/0x4e0
       [<ffffffff811723ea>] do_sync_read+0xfa/0x140
       [<ffffffff8108e100>] ? autoremove_wake_function+0x0/0x40
       [<ffffffff811bc395>] ? fcntl_setlk+0x75/0x320
       [<ffffffff81204ef6>] ? security_file_permission+0x16/0x20
       [<ffffffff81172e15>] vfs_read+0xb5/0x1a0
       [<ffffffff810d1ac2>] ? audit_syscall_entry+0x272/0x2a0
       [<ffffffff81172f51>] sys_read+0x51/0x90
       [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
      
      tlocklfs      S 0000000000000004     0 16282  16281 0x00000080
       ffff880105f5ba98 0000000000000086 00003f9affffff9d 0000020300003f9a
       0000000000000000 0000000000000001 ffff880105f5ba88 ffffffffa0789fb0
       ffff880104877ab8 ffff880105f5bfd8 000000000000f598 ffff880104877ab8
      Call Trace:
       [<ffffffffa0789fb0>] ? ldlm_lock_dump+0x560/0x640 [ptlrpc]
       [<ffffffffa07b900d>] ldlm_flock_completion_ast+0x61d/0x9f0 [ptlrpc]
       [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
       [<ffffffffa07a7565>] ldlm_cli_enqueue_fini+0x6c5/0xba0 [ptlrpc]
       [<ffffffff8105dc20>] ? default_wake_function+0x0/0x20
       [<ffffffffa07ab074>] ldlm_cli_enqueue+0x344/0x7a0 [ptlrpc]
       [<ffffffffa09a7edd>] ll_file_flock+0x47d/0x6b0 [lustre]
       [<ffffffff81190f40>] ? mntput_no_expire+0x30/0x110
       [<ffffffffa07b89f0>] ? ldlm_flock_completion_ast+0x0/0x9f0 [ptlrpc]
       [<ffffffff8117f451>] ? path_put+0x31/0x40
       [<ffffffff811bc243>] vfs_lock_file+0x23/0x40
       [<ffffffff811bc497>] fcntl_setlk+0x177/0x320
       [<ffffffff811845f7>] sys_fcntl+0x197/0x530
       [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
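
When a sysrq-t dump is long, it can help to reduce each task to the first frame that is not marked with `?` (frames marked `?` are addresses found on the stack that the unwinder could not confirm). The following is a small sketch for doing that; the regular expressions are assumptions fitted to the dump format shown above and may need adjusting for other kernels. The function name `first_reliable_frames` is hypothetical.

```python
import re

# Task header, e.g. "tlocklfs  S 0000000000000004  0 16281  16274 0x00000080"
TASK_RE = re.compile(
    r"^\s*(\S+)\s+([A-Z])\s+[0-9a-f]{16}\s+\d+\s+(\d+)\s+(\d+)\s+0x[0-9a-f]+\s*$"
)
# Frame, e.g. "[<ffffffffa07b900d>] ? name+0x61d/0x9f0 [module]"
FRAME_RE = re.compile(
    r"^\s*\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)


def first_reliable_frames(dump):
    """Map each pid to (function, module) of its first frame without '?'.

    module is None for frames in the core kernel.
    """
    result = {}
    current = None
    for line in dump.splitlines():
        task = TASK_RE.match(line)
        if task:
            current = int(task.group(3))
            result[current] = None
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None and result[current] is None:
            if not frame.group(1):  # skip frames marked unreliable
                result[current] = (frame.group(2), frame.group(3))
    return result
```

Applied to the two traces above, this reports `pipe_wait` for pid 16281 (the parent, waiting on a pipe) and `ldlm_flock_completion_ast` in `ptlrpc` for pid 16282 (the child, waiting for its flock enqueue to complete).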
      

      dmesg on the MDS node fat-amd-1-ib showed:

      Lustre: DEBUG MARKER: == test connectathon: connectathon == 08:17:31
      Lustre: DEBUG MARKER: ./runtests -N 10 -b -f /mnt/lustre/d0.connectathon
      Lustre: DEBUG MARKER: ./runtests -N 10 -g -f /mnt/lustre/d0.connectathon
      Lustre: DEBUG MARKER: ./runtests -N 10 -s -f /mnt/lustre/d0.connectathon
      Lustre: DEBUG MARKER: ./runtests -N 10 -l -f /mnt/lustre/d0.connectathon
      Lustre: Service thread pid 27106 was inactive for 0.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Lustre: Service thread pid 27106 completed after 0.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      Pid: 27106, comm: mdt_06
      
      Call Trace: 
       [<ffffffffa09c16fe>] cfs_waitq_wait+0xe/0x10 [libcfs]
       [<ffffffffa0c17b89>] ptlrpc_wait_event+0x2b9/0x2c0 [ptlrpc]
       [<ffffffff8105dc60>] ? default_wake_function+0x0/0x20
       [<ffffffffa0c1f6a5>] ptlrpc_main+0x4f5/0x1900 [ptlrpc]
       [<ffffffff8100c1ca>] child_rip+0xa/0x20
       [<ffffffffa0c1f1b0>] ? ptlrpc_main+0x0/0x1900 [ptlrpc]
       [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
      

      Maloo report: https://maloo.whamcloud.com/test_sets/52ca975a-b9a5-11e0-8bdf-52540025f9af

            People

              wc-triage WC Triage
              yujian Jian Yu