Lustre / LU-10302

hsm: obscure bug with multi-mountpoints and ldlm

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.11.0
    • Affects Version/s: None
    • Labels: None
    • Severity: 3

    Description

      I do not have much to share except the attached reproducer.

      The key elements of the reproducer seem to be (a minimal shell sketch follows the list):

      1. setup lustre with two mountpoints;
      2. create a file;
      3. launch a copytool on mountpoint A;
      4. suspend the copytool;
      5. archive the file created at step 2 from mountpoint A*;
      6. delete the file on mountpoint B;
      7. sync;
      8. un-suspend the copytool (the output of the copytool should indicate that llapi_hsm_action_begin() failed with EIO, not ENOENT);
      9. umount => the process hangs in an unkillable state.

      *You can use mountpoint B at step 5, but only if you created the file from mountpoint A.
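
      For reference, here is a minimal shell sketch of those steps. It is not the attached reproducer: the MGS NID, mount points, archive root, and file name are hypothetical, it assumes a single-node test setup where lctl set_param reaches the MDT, and it suspends/resumes the copytool with SIGSTOP/SIGCONT.

      {noformat}
      #!/bin/sh
      # Step 1: one filesystem, two client mountpoints (hypothetical names).
      mount -t lustre mgs@tcp:/lustre /mnt/lustreA
      mount -t lustre mgs@tcp:/lustre /mnt/lustreB
      lctl set_param mdt.*.hsm_control=enabled

      # Step 2: create a file from mountpoint A.
      dd if=/dev/zero of=/mnt/lustreA/victim bs=1M count=1

      # Step 3: launch a copytool on mountpoint A, keeping its pid.
      lhsmtool_posix --hsm-root /tmp/archive --archive 1 /mnt/lustreA &
      ct_pid=$!

      # Step 4: suspend the copytool.
      kill -STOP "$ct_pid"

      # Steps 5-7: archive from mountpoint A, unlink on mountpoint B, sync.
      lfs hsm_archive /mnt/lustreA/victim
      rm /mnt/lustreB/victim
      sync

      # Step 8: resume the copytool; llapi_hsm_action_begin() now fails
      # with EIO instead of the expected ENOENT.
      kill -CONT "$ct_pid"
      sleep 5

      # Step 9: this umount hangs in an unkillable state.
      umount /mnt/lustreB
      umount /mnt/lustreA
      {noformat}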

      I added some debug output to the reproducer; it should be logged in /tmp.

      I suspect these two lines in dmesg are related to this issue (they are logged at umount time):

      [  143.575078] LustreError: 3703:0:(ldlm_resource.c:1094:ldlm_resource_complain()) filter-lustre-OST0000_UUID: namespace resource [0x2:0x0:0x0].0x0 (ffff8806ab7b6900) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [  143.578233] LustreError: 3703:0:(ldlm_resource.c:1676:ldlm_resource_dump()) --- Resource: [0x2:0x0:0x0].0x0 (ffff8806ab7b6900) refcount = 2
      

      Note: the title should probably be updated once we figure out what exactly the issue is.
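
      Regarding the EIO at step 8, here is a stripped-down sketch of the receive loop a copytool runs around llapi_hsm_action_begin(), to show where the error surfaces. This is an illustration only, not the attached reproducer's copytool; error handling is trimmed.

      {noformat}
      /* Sketch of a copytool receive loop (illustration only). */
      #include <stdio.h>
      #include <stdbool.h>
      #include <lustre/lustreapi.h>

      static void handle_actions(struct hsm_copytool_private *ct)
      {
              struct hsm_action_list *hal;
              struct hsm_action_item *hai;
              unsigned int i;
              int msgsize;
              int rc;

              /* Block until the coordinator sends a batch of actions. */
              rc = llapi_hsm_copytool_recv(ct, &hal, &msgsize);
              if (rc < 0)
                      return;

              hai = hai_first(hal);
              for (i = 0; i < hal->hal_count; i++, hai = hai_next(hai)) {
                      struct hsm_copyaction_private *hcp = NULL;

                      /* Once the file has been unlinked on mountpoint B,
                       * this returns -EIO rather than the expected -ENOENT. */
                      rc = llapi_hsm_action_begin(&hcp, ct, hai, -1, 0, false);
                      if (rc < 0) {
                              fprintf(stderr, "action_begin: rc = %d\n", rc);
                              continue;
                      }
                      /* ... copy the data, then llapi_hsm_action_end() ... */
              }
      }
      {noformat}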

      Attachments

      Issue Links

      Activity

            [LU-10302] hsm: obscure bug with multi-mountpoints and ldlm
            pjones Peter Jones made changes -
            Reporter Original: Quentin Bouget [ bougetq ] New: CEA [ cealustre ]
            jamesanunez James Nunez (Inactive) made changes -
            Link New: This issue is related to LU-10723 [ LU-10723 ]
            pjones Peter Jones made changes -
            Link Original: This issue is related to JFC-19 [ JFC-19 ]
            pjones Peter Jones made changes -
            Link Original: This issue is related to JFC-10 [ JFC-10 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to JFC-20 [ JFC-20 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.11.0 [ 13091 ]
            Assignee Original: Bruno Faccini [ bfaccini ] New: John Hammond [ jhammond ]
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            pjones Peter Jones made changes -
            Link New: This issue is related to JFC-19 [ JFC-19 ]
            mdiep Minh Diep made changes -
            Link New: This issue is related to JFC-10 [ JFC-10 ]
            jhammond John Hammond made changes -
            Link New: This issue is related to LU-10357 [ LU-10357 ]
            bougetq Quentin Bouget (Inactive) made changes -
            Description Original → New: reworded "the first/second mountpoint" to "mountpoint A/B" and added the step-5 footnote; otherwise identical to the Description above.

            People

              Assignee: John Hammond (jhammond)
              Reporter: CEA (cealustre)
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: