Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6039

sanity-hsm test 228: INFO: task lhsmtool_posix:26565 blocked for more than 120 seconds.

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.5.0
    • single node setup on local vm
    • 3
    • HSM
    • 16831

    Description

      cmd :

       # PDSH="pdsh -R ssh -S -w" ONLY=228 MDS_MOUNT_OPTS="-o rw,user_xattr" OST_MOUNT_OPTS="-o user_xattr" AGTCOUNT=1 agt1_HOST=vinayak lustre/tests/sanity-hsm.sh
      
      Dec 17 11:37:58 vinayak_centos kernel: INFO: task lhsmtool_posix:26565 blocked for more than 120 seconds.
      Dec 17 11:37:58 vinayak_centos kernel:      Not tainted 2.6.32.431.20.3.el6_lustre #1
      Dec 17 11:37:58 vinayak_centos kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Dec 17 11:37:58 vinayak_centos kernel: lhsmtool_posi D 0000000000000001     0 26565      1 0x00000080
      Dec 17 11:37:58 vinayak_centos kernel: ffff880000f4ba98 0000000000000082 0000000000000000 ffff88000df84000
      Dec 17 11:37:58 vinayak_centos kernel: ffff880000f4ba98 ffffffff8128d1f6 0000000200000010 0000000affffffff
      Dec 17 11:37:58 vinayak_centos kernel: ffff88002d703098 ffff880000f4bfd8 000000000000fbc8 ffff88002d703098
      Dec 17 11:37:58 vinayak_centos kernel: Call Trace:
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8128d1f6>] ? vsnprintf+0x336/0x5e0
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8152a40e>] __mutex_lock_slowpath+0x13e/0x180
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8152a2ab>] mutex_lock+0x2b/0x50
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f3219a>] ll_layout_refresh+0x1da/0xcd0 [lustre]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa07c4ab3>] ? ldlm_lock_add_to_lru+0x43/0x120 [ptlrpc]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f5cd80>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa07e87a0>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0606011>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f82d70>] vvp_io_init+0x340/0x490 [lustre]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0605088>] cl_io_init0+0x88/0x150 [obdclass]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa05fe505>] ? cl_env_get+0x195/0x350 [obdclass]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0608004>] cl_io_init+0x64/0xe0 [obdclass]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f78ab1>] cl_glimpse_size0+0x91/0x1d0 [lustre]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f27966>] ll_getattr+0x2e6/0x800 [lustre]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8119b442>] ? user_path_at+0x62/0xa0
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8118e851>] vfs_getattr+0x51/0x80
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8118e8e4>] vfs_fstatat+0x64/0xa0
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8118e944>] sys_newfstatat+0x24/0x50
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff810e1e87>] ? audit_syscall_entry+0x1d7/0x200
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      

      Attachments

        Activity

          [LU-6039] sanity-hsm test 228: INFO: task lhsmtool_posix:26565 blocked for more than 120 seconds.

          duplication of LU-4727.

          jay Jinshan Xiong (Inactive) added a comment - duplication of LU-4727 .

          It looks like the same issue of LU-4727.

          jay Jinshan Xiong (Inactive) added a comment - It looks like the same issue of LU-4727 .
          vinayak_clogeny Vinayak Hariharmath (Inactive) added a comment - - edited is it related to https://jira.hpdd.intel.com/browse/LU-4727 ?

          Got answer for
          1. why file is released twice ?
          To perform step 5 ( cp --sparse=auto test/tfile test/tfile.2), file is restored first then copying is done.
          After this step hsm_state will be reverted to "exists" from "released". Thats why again the file is released for 2nd time.

          Below record proves that

          # lfs hsm_state test/tfile 
          test/tfile: (0x00000009) exists archived, archive_id:1
          
          # lfs hsm_release test/tfile 
          
          # lfs hsm_state test/tfile 
          test/tfile: (0x0000000d) released exists archived, archive_id:1
          
          # cp test/tfile test/tfile.2
          lhsmtool_posix[13264]: copytool fs=lustre archive#=1 item_count=1
          lhsmtool_posix[13264]: waiting for message from kernel
          lhsmtool_posix[13296]: '[0x200000400:0x2:0x0]' action RESTORE reclen 72, cookie=0x549282f9
          lhsmtool_posix[13296]: processing file 'test/tfile'
          lhsmtool_posix[13296]: reading stripe rules from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0.lov' for '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0'
          lhsmtool_posix[13296]: restoring data from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' to '{VOLATILE}=[0x200000400:0x5:0x0]'
          lhsmtool_posix[13296]: going to copy data from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' to '{VOLATILE}=[0x200000400:0x5:0x0]'
          lhsmtool_posix[13296]: data restore from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' to '{VOLATILE}=[0x200000400:0x5:0x0]' done
          lhsmtool_posix[13296]: Action completed, notifying coordinator cookie=0x549282f9, FID=[0x200000400:0x2:0x0], hp_flags=0 err=0
          lhsmtool_posix[13296]: llapi_hsm_action_end() on '/mnt/lustre/.lustre/fid/0x200000400:0x2:0x0' ok (rc=0)
          
          # lfs hsm_state test/tfile 
          test/tfile: (0x00000009) exists archived, archive_id:1
          
          # lfs hsm_release test/tfile
          
          # lfs hsm_state test/tfile 
          test/tfile: (0x0000000d) released exists archived, archive_id:1
          
          vinayak_clogeny Vinayak Hariharmath (Inactive) added a comment - - edited Got answer for 1. why file is released twice ? To perform step 5 ( cp --sparse=auto test/tfile test/tfile.2), file is restored first then copying is done. After this step hsm_state will be reverted to "exists" from "released". Thats why again the file is released for 2nd time. Below record proves that # lfs hsm_state test/tfile test/tfile: (0x00000009) exists archived, archive_id:1 # lfs hsm_release test/tfile # lfs hsm_state test/tfile test/tfile: (0x0000000d) released exists archived, archive_id:1 # cp test/tfile test/tfile.2 lhsmtool_posix[13264]: copytool fs=lustre archive#=1 item_count=1 lhsmtool_posix[13264]: waiting for message from kernel lhsmtool_posix[13296]: '[0x200000400:0x2:0x0]' action RESTORE reclen 72, cookie=0x549282f9 lhsmtool_posix[13296]: processing file 'test/tfile' lhsmtool_posix[13296]: reading stripe rules from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0.lov' for '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' lhsmtool_posix[13296]: restoring data from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' to '{VOLATILE}=[0x200000400:0x5:0x0]' lhsmtool_posix[13296]: going to copy data from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' to '{VOLATILE}=[0x200000400:0x5:0x0]' lhsmtool_posix[13296]: data restore from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' to '{VOLATILE}=[0x200000400:0x5:0x0]' done lhsmtool_posix[13296]: Action completed, notifying coordinator cookie=0x549282f9, FID=[0x200000400:0x2:0x0], hp_flags=0 err=0 lhsmtool_posix[13296]: llapi_hsm_action_end() on '/mnt/lustre/.lustre/fid/0x200000400:0x2:0x0' ok (rc=0) # lfs hsm_state test/tfile test/tfile: (0x00000009) exists archived, archive_id:1 # lfs hsm_release test/tfile # lfs hsm_state test/tfile test/tfile: (0x0000000d) released exists archived, archive_id:1

          logs file attached

          vinayak_clogeny Vinayak Hariharmath (Inactive) added a comment - logs file attached

          I ran it on master. Wrongly updated affect/versions.

          The test_228 is doing hsm_release 2 times. I commented out 2nd release statement and test case got passed.

          I mimicked the steps manually
          1. cd /mnt/lustre
          2. created file dd if=/dev/zero of=test/tfile bs=1k count=1k
          3. lfs hsm_archive --archive=1 test/tfile
          4. lfs hsm_release test/tfile
          5. cp --sparse=auto test/tfile test/tfile.2
          6. lfs hsm_release test/tfile
          7. mkdir test/tdir
          8. tar cf - --sparse test/tfile | tar xvf - -C test/tdir ==> process hung on layout lock

          if we skip step 6, then everything goes fine.

          Even I checked with other operation (copy) instead of step 8
          cp --sparse=auto test/tfile test/tdir/tfile.2 ==> this also got hung

          After this I got few of the questions in my mind
          1. why file is released twice ?
          2. Even if the file is released twice, why the operation getting hung ?
          releasing the archived file for "n" number of times should not affect.

          Jinshan Xiong : Yes, you are right. layout lock has been held by some one and getting stuck there. I will update the logs.

          vinayak_clogeny Vinayak Hariharmath (Inactive) added a comment - I ran it on master. Wrongly updated affect/versions. The test_228 is doing hsm_release 2 times. I commented out 2nd release statement and test case got passed. I mimicked the steps manually 1. cd /mnt/lustre 2. created file dd if=/dev/zero of=test/tfile bs=1k count=1k 3. lfs hsm_archive --archive=1 test/tfile 4. lfs hsm_release test/tfile 5. cp --sparse=auto test/tfile test/tfile.2 6. lfs hsm_release test/tfile 7. mkdir test/tdir 8. tar cf - --sparse test/tfile | tar xvf - -C test/tdir ==> process hung on layout lock if we skip step 6, then everything goes fine. Even I checked with other operation (copy) instead of step 8 cp --sparse=auto test/tfile test/tdir/tfile.2 ==> this also got hung After this I got few of the questions in my mind 1. why file is released twice ? 2. Even if the file is released twice, why the operation getting hung ? releasing the archived file for "n" number of times should not affect. Jinshan Xiong : Yes, you are right. layout lock has been held by some one and getting stuck there. I will update the logs.

          This bug is filed against 2.5.0. Are you really running that version, or is it 2.5.3 (which you probably should use, or even the tip of b2_5)?

          adilger Andreas Dilger added a comment - This bug is filed against 2.5.0. Are you really running that version, or is it 2.5.3 (which you probably should use, or even the tip of b2_5)?

          If you can reproduce this problem, please use the following command:

          echo t > /proc/sysrq-trigger
          dmesg > log.txt
          

          and then attach the log.txt file here. I believe there is another process holding the mutex and waiting for response from the MDT.

          jay Jinshan Xiong (Inactive) added a comment - If you can reproduce this problem, please use the following command: echo t > /proc/sysrq-trigger dmesg > log.txt and then attach the log.txt file here. I believe there is another process holding the mutex and waiting for response from the MDT.

          People

            wc-triage WC Triage
            vinayak_clogeny Vinayak Hariharmath (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: