Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6039

sanity-hsm test 228: INFO: task lhsmtool_posix:26565 blocked for more than 120 seconds.

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.5.0
    • single node setup on local vm
    • 3
    • HSM
    • 16831

    Description

      cmd :

       # PDSH="pdsh -R ssh -S -w" ONLY=228 MDS_MOUNT_OPTS="-o rw,user_xattr" OST_MOUNT_OPTS="-o user_xattr" AGTCOUNT=1 agt1_HOST=vinayak lustre/tests/sanity-hsm.sh
      
      Dec 17 11:37:58 vinayak_centos kernel: INFO: task lhsmtool_posix:26565 blocked for more than 120 seconds.
      Dec 17 11:37:58 vinayak_centos kernel:      Not tainted 2.6.32.431.20.3.el6_lustre #1
      Dec 17 11:37:58 vinayak_centos kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Dec 17 11:37:58 vinayak_centos kernel: lhsmtool_posi D 0000000000000001     0 26565      1 0x00000080
      Dec 17 11:37:58 vinayak_centos kernel: ffff880000f4ba98 0000000000000082 0000000000000000 ffff88000df84000
      Dec 17 11:37:58 vinayak_centos kernel: ffff880000f4ba98 ffffffff8128d1f6 0000000200000010 0000000affffffff
      Dec 17 11:37:58 vinayak_centos kernel: ffff88002d703098 ffff880000f4bfd8 000000000000fbc8 ffff88002d703098
      Dec 17 11:37:58 vinayak_centos kernel: Call Trace:
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8128d1f6>] ? vsnprintf+0x336/0x5e0
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8152a40e>] __mutex_lock_slowpath+0x13e/0x180
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8152a2ab>] mutex_lock+0x2b/0x50
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f3219a>] ll_layout_refresh+0x1da/0xcd0 [lustre]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa07c4ab3>] ? ldlm_lock_add_to_lru+0x43/0x120 [ptlrpc]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f5cd80>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa07e87a0>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0606011>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f82d70>] vvp_io_init+0x340/0x490 [lustre]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0605088>] cl_io_init0+0x88/0x150 [obdclass]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa05fe505>] ? cl_env_get+0x195/0x350 [obdclass]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0608004>] cl_io_init+0x64/0xe0 [obdclass]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f78ab1>] cl_glimpse_size0+0x91/0x1d0 [lustre]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffffa0f27966>] ll_getattr+0x2e6/0x800 [lustre]
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8119b442>] ? user_path_at+0x62/0xa0
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8118e851>] vfs_getattr+0x51/0x80
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8118e8e4>] vfs_fstatat+0x64/0xa0
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8118e944>] sys_newfstatat+0x24/0x50
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff810e1e87>] ? audit_syscall_entry+0x1d7/0x200
      Dec 17 11:37:58 vinayak_centos kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
      

      Attachments

        Activity

          [LU-6039] sanity-hsm test 228: INFO: task lhsmtool_posix:26565 blocked for more than 120 seconds.

          duplication of LU-4727.

          jay Jinshan Xiong (Inactive) added a comment - duplication of LU-4727 .

          It looks like the same issue of LU-4727.

          jay Jinshan Xiong (Inactive) added a comment - It looks like the same issue of LU-4727 .
          vinayak_clogeny Vinayak Hariharmath (Inactive) added a comment - - edited is it related to https://jira.hpdd.intel.com/browse/LU-4727 ?

          Got answer for
          1. why file is released twice ?
          To perform step 5 ( cp --sparse=auto test/tfile test/tfile.2), file is restored first then copying is done.
          After this step hsm_state will be reverted to "exists" from "released". Thats why again the file is released for 2nd time.

          Below record proves that

          # lfs hsm_state test/tfile 
          test/tfile: (0x00000009) exists archived, archive_id:1
          
          # lfs hsm_release test/tfile 
          
          # lfs hsm_state test/tfile 
          test/tfile: (0x0000000d) released exists archived, archive_id:1
          
          # cp test/tfile test/tfile.2
          lhsmtool_posix[13264]: copytool fs=lustre archive#=1 item_count=1
          lhsmtool_posix[13264]: waiting for message from kernel
          lhsmtool_posix[13296]: '[0x200000400:0x2:0x0]' action RESTORE reclen 72, cookie=0x549282f9
          lhsmtool_posix[13296]: processing file 'test/tfile'
          lhsmtool_posix[13296]: reading stripe rules from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0.lov' for '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0'
          lhsmtool_posix[13296]: restoring data from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' to '{VOLATILE}=[0x200000400:0x5:0x0]'
          lhsmtool_posix[13296]: going to copy data from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' to '{VOLATILE}=[0x200000400:0x5:0x0]'
          lhsmtool_posix[13296]: data restore from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' to '{VOLATILE}=[0x200000400:0x5:0x0]' done
          lhsmtool_posix[13296]: Action completed, notifying coordinator cookie=0x549282f9, FID=[0x200000400:0x2:0x0], hp_flags=0 err=0
          lhsmtool_posix[13296]: llapi_hsm_action_end() on '/mnt/lustre/.lustre/fid/0x200000400:0x2:0x0' ok (rc=0)
          
          # lfs hsm_state test/tfile 
          test/tfile: (0x00000009) exists archived, archive_id:1
          
          # lfs hsm_release test/tfile
          
          # lfs hsm_state test/tfile 
          test/tfile: (0x0000000d) released exists archived, archive_id:1
          
          vinayak_clogeny Vinayak Hariharmath (Inactive) added a comment - - edited Got answer for 1. why file is released twice ? To perform step 5 ( cp --sparse=auto test/tfile test/tfile.2), file is restored first then copying is done. After this step hsm_state will be reverted to "exists" from "released". Thats why again the file is released for 2nd time. Below record proves that # lfs hsm_state test/tfile test/tfile: (0x00000009) exists archived, archive_id:1 # lfs hsm_release test/tfile # lfs hsm_state test/tfile test/tfile: (0x0000000d) released exists archived, archive_id:1 # cp test/tfile test/tfile.2 lhsmtool_posix[13264]: copytool fs=lustre archive#=1 item_count=1 lhsmtool_posix[13264]: waiting for message from kernel lhsmtool_posix[13296]: '[0x200000400:0x2:0x0]' action RESTORE reclen 72, cookie=0x549282f9 lhsmtool_posix[13296]: processing file 'test/tfile' lhsmtool_posix[13296]: reading stripe rules from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0.lov' for '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' lhsmtool_posix[13296]: restoring data from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' to '{VOLATILE}=[0x200000400:0x5:0x0]' lhsmtool_posix[13296]: going to copy data from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' to '{VOLATILE}=[0x200000400:0x5:0x0]' lhsmtool_posix[13296]: data restore from '/tmp/hsm//0002/0000/0400/0000/0002/0000/0x200000400:0x2:0x0' to '{VOLATILE}=[0x200000400:0x5:0x0]' done lhsmtool_posix[13296]: Action completed, notifying coordinator cookie=0x549282f9, FID=[0x200000400:0x2:0x0], hp_flags=0 err=0 lhsmtool_posix[13296]: llapi_hsm_action_end() on '/mnt/lustre/.lustre/fid/0x200000400:0x2:0x0' ok (rc=0) # lfs hsm_state test/tfile test/tfile: (0x00000009) exists archived, archive_id:1 # lfs hsm_release test/tfile # lfs hsm_state test/tfile test/tfile: (0x0000000d) released exists archived, archive_id:1

          logs file attached

          vinayak_clogeny Vinayak Hariharmath (Inactive) added a comment - logs file attached

          I ran it on master. Wrongly updated affect/versions.

          The test_228 is doing hsm_release 2 times. I commented out 2nd release statement and test case got passed.

          I mimicked the steps manually
          1. cd /mnt/lustre
          2. created file dd if=/dev/zero of=test/tfile bs=1k count=1k
          3. lfs hsm_archive --archive=1 test/tfile
          4. lfs hsm_release test/tfile
          5. cp --sparse=auto test/tfile test/tfile.2
          6. lfs hsm_release test/tfile
          7. mkdir test/tdir
          8. tar cf - --sparse test/tfile | tar xvf - -C test/tdir ==> process hung on layout lock

          if we skip step 6, then everything goes fine.

          Even I checked with other operation (copy) instead of step 8
          cp --sparse=auto test/tfile test/tdir/tfile.2 ==> this also got hung

          After this I got few of the questions in my mind
          1. why file is released twice ?
          2. Even if the file is released twice, why the operation getting hung ?
          releasing the archived file for "n" number of times should not affect.

          Jinshan Xiong : Yes, you are right. layout lock has been held by some one and getting stuck there. I will update the logs.

          vinayak_clogeny Vinayak Hariharmath (Inactive) added a comment - I ran it on master. Wrongly updated affect/versions. The test_228 is doing hsm_release 2 times. I commented out 2nd release statement and test case got passed. I mimicked the steps manually 1. cd /mnt/lustre 2. created file dd if=/dev/zero of=test/tfile bs=1k count=1k 3. lfs hsm_archive --archive=1 test/tfile 4. lfs hsm_release test/tfile 5. cp --sparse=auto test/tfile test/tfile.2 6. lfs hsm_release test/tfile 7. mkdir test/tdir 8. tar cf - --sparse test/tfile | tar xvf - -C test/tdir ==> process hung on layout lock if we skip step 6, then everything goes fine. Even I checked with other operation (copy) instead of step 8 cp --sparse=auto test/tfile test/tdir/tfile.2 ==> this also got hung After this I got few of the questions in my mind 1. why file is released twice ? 2. Even if the file is released twice, why the operation getting hung ? releasing the archived file for "n" number of times should not affect. Jinshan Xiong : Yes, you are right. layout lock has been held by some one and getting stuck there. I will update the logs.

          People

            wc-triage WC Triage
            vinayak_clogeny Vinayak Hariharmath (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: