Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.5
    • Affects Version/s: Lustre 2.13.0, Lustre 2.12.4
    • Environment: RHEL 8.1 client + RHEL 7.7 server
    • Severity: 3
    • 9223372036854775807

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/f4707d44-3ba7-11ea-bb75-52540065bddc

      test_70 failed with the following error:

      == sanity-hsm test 70: Copytool logs JSON register/unregister events to FIFO ========================= 06:37:48 (1579415868)
      CMD: trevis-12vm7 mktemp --tmpdir=/tmp -d sanity-hsm.test_70.XXXX
      CMD: trevis-12vm7 mkfifo -m 0644 /tmp/sanity-hsm.test_70.r3C7/fifo
      CMD: trevis-12vm7 cat /tmp/sanity-hsm.test_70.r3C7/fifo > /tmp/sanity-hsm.test_70.r3C7/events & echo \$! > /tmp/sanity-hsm.test_70.r3C7/monitor_pid
      

      Timeout occurred after 238 mins, last suite running was sanity-hsm, restarting cluster to continue tests
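
      For reference, the monitor setup the test performs on the second client boils down to the FIFO reader pattern sketched below (illustrative shell, not the framework's exact code); the reader side of a FIFO blocks in open() until something opens it for writing, which is what the copytool is expected to do once it is started with --event-fifo.

      # Sketch of the monitor setup shown in the log above (illustrative paths,
      # not the framework's exact code).
      HSMTOOL_MONITOR_DIR=$(mktemp --tmpdir=/tmp -d sanity-hsm.test_70.XXXX)
      mkfifo -m 0644 "$HSMTOOL_MONITOR_DIR/fifo"
      # The reader sleeps in open() until the copytool opens the FIFO for
      # writing; its PID is saved so the monitor can be cleaned up later.
      cat "$HSMTOOL_MONITOR_DIR/fifo" > "$HSMTOOL_MONITOR_DIR/events" &
      echo $! > "$HSMTOOL_MONITOR_DIR/monitor_pid"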

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-hsm test_70 - Timeout occurred after 238 mins, last suite running was sanity-hsm, restarting cluster to continue tests

      Attachments

        Activity

          [LU-13160] sanity-hsm test 70 timeout

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37595/
          Subject: LU-13160 tests: fix sanity-hsm monitor setup
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 6724d8ca58e9b8474a180b013a4723cbdd8900d9

          gerrit Gerrit Updater added a comment -

          Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/37595
          Subject: LU-13160 tests: fix sanity-hsm monitor setup
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 093d9f2fe4d80060db46475bf32d4b985edebf97

          yujian Jian Yu added a comment -

          Sure, Dongyang.
          I usually directly specify variable values to run auster. Here is the command I ran on onyx-22vm3 under /usr/lib64/lustre/tests:

          [root@onyx-22vm3 tests]# PDSH="pdsh -t 120 -S -R ssh -w" NAME=ncli RCLIENTS=onyx-22vm5 mds_HOST=onyx-22vm1 MDSDEV=/dev/vda5 MDSSIZE=2097152 ost_HOST=onyx-22vm4 OSTCOUNT=2 OSTSIZE=2097152 OSTDEV1=/dev/vda5 OSTDEV2=/dev/vda6 SHARED_DIRECTORY="/home/jianyu/test_logs" VERBOSE=true bash auster -d /home/jianyu/test_logs -r -s -v -k sanity-hsm --only 70
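
          The same variables can also go into a config file instead of the command line; here is a minimal cfg/local.sh sketch with the values copied from the command above, assuming the ncli config still pulls in cfg/local.sh:

          # Minimal cfg/local.sh sketch (values copied from the auster command
          # above; adjust for your own nodes).
          PDSH="pdsh -t 120 -S -R ssh -w"
          RCLIENTS=onyx-22vm5
          mds_HOST=onyx-22vm1
          MDSDEV=/dev/vda5
          MDSSIZE=2097152
          ost_HOST=onyx-22vm4
          OSTCOUNT=2
          OSTSIZE=2097152
          OSTDEV1=/dev/vda5
          OSTDEV2=/dev/vda6
          SHARED_DIRECTORY="/home/jianyu/test_logs"
          VERBOSE=true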
          
          dongyang Dongyang Li added a comment -

          Many thanks, Jian.

          It looks like the session on onyx-22vm3 has already finished. I was trying to start a new session with something like --only 70, but I noticed that cfg/local.sh was not set up.

          Can I just use auster under /usr/lib64/lustre/tests on onyx-22vm3? If not, can you provide a command for it?

          yujian Jian Yu added a comment -

          Sure, Dongyang.

          Here are the test nodes:

          2 Clients: onyx-22vm3 (local), onyx-22vm5 (remote)
          1 MGS/MDS: onyx-22vm1 (1 MDT)
          1 OSS: onyx-22vm4 (2 OSTs)
          

          Now sanity-hsm test 70 is hanging on onyx-22vm3:

          == sanity-hsm test 70: Copytool logs JSON register/unregister events to FIFO ========================= 06:18:09 (1581574689)
          CMD: onyx-22vm5 mktemp --tmpdir=/tmp -d sanity-hsm.test_70.XXXX
          CMD: onyx-22vm5 mkfifo -m 0644 /tmp/sanity-hsm.test_70.Whg8/fifo
          CMD: onyx-22vm5 cat /tmp/sanity-hsm.test_70.Whg8/fifo > /tmp/sanity-hsm.test_70.Whg8/events & echo \$! > /tmp/sanity-hsm.test_70.Whg8/monitor_pid
          
          dongyang Dongyang Li added a comment -

          Yes, I was using 2 clients.

          Are you reproducing it with the lab VMs? If so, can you create a setup and make sure the test case fails, so that I can get into the VM and have a look?

          Thanks a lot

          yujian Jian Yu added a comment -

          Hi Dongyang,

          > Also tried to reproduce this with the same setup, no luck yet.

          I can reproduce this with two clients, and I found the test passed with one client. Did you try to use two clients to reproduce the issue?

          dongyang Dongyang Li added a comment -

          From both test sessions, it seems the process was stuck after the last cat command started on the second client. The stack log from the session suggests the cat process was indeed started:

          [22875.131497] cat             S    0  1279      1 0x00000080
          [22875.132478] Call Trace:
          [22875.132975]  ? __schedule+0x253/0x830
          [22875.133660]  schedule+0x28/0x70
          [22875.134253]  pipe_wait+0x6c/0xb0
          [22875.134880]  ? finish_wait+0x80/0x80
          [22875.135566]  wait_for_partner+0x19/0x50
          [22875.136269]  fifo_open+0x27b/0x2b0
          [22875.136911]  ? pipe_release+0xa0/0xa0
          [22875.137597]  do_dentry_open+0x132/0x330
          [22875.138310]  path_openat+0x573/0x14d0
          [22875.139008]  do_filp_open+0x93/0x100
          [22875.139681]  ? __check_object_size+0xa3/0x181
          [22875.140466]  do_sys_open+0x184/0x220
          [22875.141149]  do_syscall_64+0x5b/0x1b0
          [22875.141843]  entry_SYSCALL_64_after_hwframe+0x65/0xca
          

          which is part of copytool_monitor_setup.

          We should then proceed to

          copytool setup --event-fifo "$HSMTOOL_MONITOR_DIR/fifo"

          and should be able to see

          CMD: trevis-16vm6 mkdir -p /tmp/arc1/sanity-hsm.test_70/

          as the next command on the second client, but I couldn't see anything stopping us from doing that.

          I also tried to reproduce this with the same setup, no luck yet.

          Need some help here.
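
          For what it's worth, the wait_for_partner frame above is the FIFO open path: a reader's open() sleeps until some process opens the FIFO for writing. A minimal local illustration of that behaviour (hypothetical paths, not the test framework itself):

          # A FIFO reader blocks in open() until a writer opens the other end,
          # matching the fifo_open()/wait_for_partner() frames above.
          # Paths here are hypothetical.
          d=$(mktemp -d)
          mkfifo "$d/fifo"
          cat "$d/fifo" > "$d/events" &              # sleeps in open() until a writer appears
          monitor_pid=$!
          sleep 1
          echo '{"event": "REGISTER"}' > "$d/fifo"   # the first writer unblocks the reader
          wait "$monitor_pid"
          cat "$d/events"
          rm -rf "$d"

          In the test, that writer should be the copytool started with --event-fifo, so the reader sleeping here is expected until the copytool setup step actually runs on the second client.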

          pjones Peter Jones added a comment -

          Dongyang

          Could you please look into why this regression has appeared using RHEL 8.1 rather than RHEL 8.0 clients?

          Peter

          yujian Jian Yu added a comment -

          The failure also occurred on Lustre b2_12 branch with RHEL 8.1 client:
          https://testing.whamcloud.com/test_sets/59bb2c92-3ea7-11ea-9543-52540065bddc

          pjones Peter Jones added a comment -

          Yingjin

          Could you please advise?

          Thanks

          Peter


          People

            Assignee: Dongyang Li
            Reporter: Maloo
            Votes: 0
            Watchers: 6
