Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.14.0, Lustre 2.12.5
    • Lustre 2.13.0, Lustre 2.12.4
    • RHEL 8.1 client + RHEL 7.7 server
    • 3

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/f4707d44-3ba7-11ea-bb75-52540065bddc

      test_70 failed with the following error:

      == sanity-hsm test 70: Copytool logs JSON register/unregister events to FIFO ========================= 06:37:48 (1579415868)
      CMD: trevis-12vm7 mktemp --tmpdir=/tmp -d sanity-hsm.test_70.XXXX
      CMD: trevis-12vm7 mkfifo -m 0644 /tmp/sanity-hsm.test_70.r3C7/fifo
      CMD: trevis-12vm7 cat /tmp/sanity-hsm.test_70.r3C7/fifo > /tmp/sanity-hsm.test_70.r3C7/events & echo \$! > /tmp/sanity-hsm.test_70.r3C7/monitor_pid
      

      Timeout occurred after 238 mins, last suite running was sanity-hsm, restarting cluster to continue tests
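      To try this outside the test framework, the three monitor-setup commands from the log above can be replayed on a single node. This is a minimal sketch under stated assumptions: it runs the commands directly rather than through the framework's remote-execution helper (hence `$!` without the backslash escape), no Lustre or copytool is involved, and the liveness check and clean-up at the end are added for the sketch, not taken from sanity-hsm.

```shell
#!/bin/bash
# Sketch only: replays the monitor-setup commands from the test log on one
# local node. No Lustre, no copytool, no test-framework helpers.
set -e

dir=$(mktemp --tmpdir=/tmp -d sanity-hsm.test_70.XXXX)
mkfifo -m 0644 "$dir/fifo"

# Start the event reader in the background and record its PID, as the test does.
cat "$dir/fifo" > "$dir/events" & echo $! > "$dir/monitor_pid"

# With no writer yet, cat stays parked in open() on the FIFO but is alive.
sleep 1
pid=$(cat "$dir/monitor_pid")
if kill -0 "$pid" 2>/dev/null; then
    echo "reader $pid is waiting for a writer on $dir/fifo"
fi

# Clean up this local sketch (added here, not part of the test):
# stop the reader and remove the temporary directory.
kill "$pid"
wait "$pid" 2>/dev/null || true
rm -rf "$dir"
```

      On a healthy node the backgrounded cat does not hold up the shell, so the script reaches the clean-up immediately after the one-second pause.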

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-hsm test_70 - Timeout occurred after 238 mins, last suite running was sanity-hsm, restarting cluster to continue tests


        Activity

          [LU-13160] sanity-hsm test 70 timeout
          dongyang Dongyang Li added a comment -

          Yes, I was using 2 clients.

          Are you reproducing it with the lab VMs? If so, can you create a setup and make sure the test case fails? Then I can get into the VM and have a look.

          Thanks a lot

          yujian Jian Yu added a comment -

          Hi Dongyang,

          Regarding your comment "Also tried to reproduce this with the same setup, no luck yet": I can reproduce this with two clients, and I found that the test passed with one client. Did you try to use two clients to reproduce the issue?

          dongyang Dongyang Li added a comment -

          From both test sessions, it seems the process got stuck after the last cat command was started on the second client. The stack log from the session shows that the cat process was indeed started:

          [22875.131497] cat             S    0  1279      1 0x00000080
          [22875.132478] Call Trace:
          [22875.132975]  ? __schedule+0x253/0x830
          [22875.133660]  schedule+0x28/0x70
          [22875.134253]  pipe_wait+0x6c/0xb0
          [22875.134880]  ? finish_wait+0x80/0x80
          [22875.135566]  wait_for_partner+0x19/0x50
          [22875.136269]  fifo_open+0x27b/0x2b0
          [22875.136911]  ? pipe_release+0xa0/0xa0
          [22875.137597]  do_dentry_open+0x132/0x330
          [22875.138310]  path_openat+0x573/0x14d0
          [22875.139008]  do_filp_open+0x93/0x100
          [22875.139681]  ? __check_object_size+0xa3/0x181
          [22875.140466]  do_sys_open+0x184/0x220
          [22875.141149]  do_syscall_64+0x5b/0x1b0
          [22875.141843]  entry_SYSCALL_64_after_hwframe+0x65/0xca
          

          This cat is part of copytool_monitor_setup. After it, we should proceed to

          copytool setup --event-fifo "$HSMTOOL_MONITOR_DIR/fifo"

          and should see

          CMD: trevis-16vm6 mkdir -p /tmp/arc1/sanity-hsm.test_70/

          as the next command on the second client.

          However, I couldn't see anything stopping us from doing that.

          I also tried to reproduce this with the same setup, but had no luck yet.

          Need some help here.
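          The stack trace above is what an ordinary FIFO reader looks like before any writer appears: opening a FIFO for reading without O_NONBLOCK waits in the kernel (fifo_open, wait_for_partner) until another process opens it for writing, so a cat sleeping there is expected on its own. A minimal standalone sketch of that behaviour, assuming a Linux node with bash and coreutils; the writer only stands in for the copytool started with `--event-fifo`, and the line it writes is a placeholder, not the copytool's real event format.

```shell
#!/bin/bash
# Sketch: a FIFO reader blocks in open() until a writer opens the other end.
set -e

dir=$(mktemp -d)
mkfifo "$dir/fifo"

# Reader side, as in the monitor setup: nothing has opened the FIFO for
# writing yet, so cat sleeps in open() instead of exiting.
cat "$dir/fifo" > "$dir/events" &
reader=$!
sleep 1
kill -0 "$reader"   # still alive, i.e. still waiting for a writer

# Writer side: a stand-in for the copytool started with --event-fifo.
# The text is a placeholder, not the copytool's actual event format.
echo "placeholder event" > "$dir/fifo"

# Once the writer closes its end, cat reads the data, hits EOF and exits.
wait "$reader"
events=$(cat "$dir/events")
echo "$events"
rm -rf "$dir"
```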

          pjones Peter Jones added a comment -

          Dongyang

          Could you please look into why this regression appears with RHEL 8.1 clients but not with RHEL 8.0 clients?

          Peter

          yujian Jian Yu added a comment -

          The failure also occurred on Lustre b2_12 branch with RHEL 8.1 client:
          https://testing.whamcloud.com/test_sets/59bb2c92-3ea7-11ea-9543-52540065bddc

          pjones Peter Jones added a comment -

          Yingjin

          Could you please advise?

          Thanks

          Peter


          People

            dongyang Dongyang Li
            maloo Maloo
            Votes: 0
            Watchers: 6
