LU-4705: LustreError: 89827:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -2

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.2
    • Affects Version/s: Lustre 2.5.1
    • Labels: None
    • Environment: Running tip of Lustre b2_5, 1 MGS, 1 MDS, 2 OSS, 12 clients.
    • Severity: 3
    • 12942

    Description

      Unexpected MDC LustreErrors on most clients.

      Client 10:
      Mar 4 03:27:11 lustre10 kernel: LustreError: 183913:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -2

      Client 11:
      Mar 4 00:37:25 lustre11 kernel: LustreError: 89827:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -2

      Client 12:
      Mar 4 00:39:36 lustre12 kernel: LustreError: 11-0: cal-MDT0000-mdc-ffff8807b75c4000: Communicating with 192.168.20.1@tcp1, operation ldlm_enqueue failed with -116.
      Mar 4 00:39:36 lustre12 kernel: LustreError: 70225:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -116
      Mar 4 00:39:36 lustre12 kernel: LustreError: 70225:0:(vvp_io.c:1227:vvp_io_init()) cal: refresh file layout [0x200001c0b:0x176e:0x0] error -116.
      Mar 4 03:09:33 lustre12 kernel: LustreError: 70225:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -2

      Client 13:
      Mar 4 00:29:54 lustre13 kernel: LustreError: 167294:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -2

      Client 14:
      Mar 4 01:18:04 lustre14 kernel: LustreError: 11-0: cal-MDT0000-mdc-ffff880787af8400: Communicating with 192.168.20.1@tcp1, operation ldlm_enqueue failed with -116.
      Mar 4 01:18:04 lustre14 kernel: LustreError: 11503:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -116
      Mar 4 01:18:04 lustre14 kernel: LustreError: 11503:0:(vvp_io.c:1227:vvp_io_init()) cal: refresh file layout [0x200001c12:0xbbe2:0x0] error -116.

      Client 16:
      Mar 4 01:00:46 lustre16 kernel: LustreError: 141605:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -2

      Client 17:
      Mar 4 00:13:39 lustre17 kernel: LustreError: 11-0: cal-MDT0000-mdc-ffff8808038aa000: Communicating with 192.168.20.1@tcp1, operation ldlm_enqueue failed with -116.
      Mar 4 00:13:39 lustre17 kernel: LustreError: 126770:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -116
      Mar 4 00:13:39 lustre17 kernel: LustreError: 126770:0:(vvp_io.c:1227:vvp_io_init()) cal: refresh file layout [0x200001beb:0x1aedf:0x0] error -116.
      Mar 4 02:02:43 lustre17 kernel: LustreError: 126770:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -2

      Client 18:
      Mar 1 05:34:03 lustre18 kernel: LustreError: 146331:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -2

    Activity

      gerrit Gerrit Updater added a comment:

      Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/28978
      Subject: LU-4705 mdc: improve mdc_enqueue() error message
      Project: fs/lustre-release
      Branch: master
      Current Patch Set: 1
      Commit: 9d8f53da6ac5482262c188ba1e0ca3fb395aedfd
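      For context, the commit subject suggests the goal is a more informative, less alarming message for the expected lookup races. A minimal userspace sketch of that kind of logging policy follows; it is purely hypothetical and is not the actual change to mdc_locks.c:

      #include <errno.h>
      #include <stdio.h>

      /* Hypothetical illustration of the idea, not the Lustre patch itself:
       * -ENOENT/-ESTALE mean the file vanished between lookup and enqueue,
       * so they are reported quietly; anything else stays a hard error. */
      static void report_enqueue_error(int rc)
      {
              if (rc == -ENOENT || rc == -ESTALE)
                      printf("debug: ldlm_cli_enqueue: %d (file removed during lookup)\n", rc);
              else
                      fprintf(stderr, "error: ldlm_cli_enqueue: %d\n", rc);
      }

      int main(void)
      {
              report_enqueue_error(-ENOENT);  /* -2   */
              report_enqueue_error(-ESTALE);  /* -116 */
              report_enqueue_error(-EIO);     /* still logged as an error */
              return 0;
      }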

      kjstrosahl Kurt J. Strosahl (Inactive) added a comment (edited):

      I just saw an instance of this error in the Lustre file system at TJNAF. It is the only instance I can recall of it being seen here; we are running pristine Lustre 2.5.3.

      To expand a bit more... I have a test environment that I'm using to benchmark OSS systems. Presently I have three OSTs on a single server running Lustre 2.5.3. I've mounted it on a single client and am running IOR tests with the following parameters:

      mpirun -np 12 -bynode -machinefile ./nodelist ./ior -F -e -m -g -i 10 -t 1024k -b 42G -o /testL/benchmark/test

      where nodelist contains a single node.

      mjo Mike O'Connor added a comment:

      This is being seen at Gulfstream. In their environment there doesn't appear to be any operational consequence to it, but it scared them. It would be nice if we could mute these errors, as discussed in https://jira.hpdd.intel.com/browse/LU-4705?focusedCommentId=79255&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-79255

      brett Brett Lee (Inactive) added a comment:

      Andreas, the workload was a mix of real jobs with varying IO patterns, the most prominent of which was many small reads from large files. There was no artificial creating/deleting of files. As for the application, I am now noticing that a setting disabled the printing of "some" error and warning messages during this run; however, each job completed successfully. No unexpected application-visible errors were seen.

      adilger Andreas Dilger added a comment:

      Brett, what was the workload being run here? Something that is creating and deleting files concurrently (e.g. racer), or possibly multiple threads doing "rm -r" on the same tree? Either this is "normal" and maybe we should quiet the error messages, or it might imply some sort of bug on the MDS with inode lookup or files unexpectedly being deleted. Are there application-visible errors that are unexpected ("No such file or directory")?
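      For illustration, the kind of concurrent access/unlink race being asked about can be reproduced on any local filesystem with a small sketch like the one below (hypothetical path and loop counts; it only demonstrates the ENOENT race in userspace, not the Lustre ldlm path):

      #include <errno.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/types.h>
      #include <sys/wait.h>
      #include <unistd.h>

      /* One process repeatedly creates and unlinks a file while another
       * repeatedly opens it; the opener occasionally loses the race and
       * gets ENOENT, the userspace analogue of the -2 seen in the logs. */
      int main(void)
      {
              const char *path = "/tmp/lu4705-race-demo";     /* hypothetical path */
              pid_t pid = fork();

              if (pid < 0)
                      return 1;
              if (pid == 0) {                 /* child: create + unlink in a loop */
                      for (int i = 0; i < 100000; i++) {
                              int fd = open(path, O_CREAT | O_WRONLY, 0644);
                              if (fd >= 0)
                                      close(fd);
                              unlink(path);
                      }
                      _exit(0);
              }

              int misses = 0;
              for (int i = 0; i < 100000; i++) {      /* parent: open in a loop */
                      int fd = open(path, O_RDONLY);
                      if (fd >= 0)
                              close(fd);
                      else if (errno == ENOENT)
                              misses++;
              }
              waitpid(pid, NULL, 0);
              printf("open() lost the race %d times (%s)\n", misses, strerror(ENOENT));
              return 0;
      }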

      brett Brett Lee (Inactive) added a comment:

      No, there was no re-exporting, but each Lustre client did have four (4) mounts of the file system - each mount appearing active via the stats files in /proc.

      keith Keith Mannthey (Inactive) added a comment:

      I have seen this error with IOR, with no NFS involved. I am not sure whether the errors were generated during a single-shared-file run or a file-per-process run.

      adilger Andreas Dilger added a comment:

      Is the filesystem re-exported via NFS, or are there possibly concurrent threads that are accessing and unlinking files?

      These messages mean that the client was looking up some file, but it was deleted by the time it tried to access it.

      -116 = -ESTALE, -2 = -ENOENT.

      The errors are not really fatal, and could probably be quieted from the console.
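      The errno mapping quoted above is easy to confirm from userspace; nothing Lustre-specific is assumed here:

      #include <errno.h>
      #include <stdio.h>
      #include <string.h>

      /* On Linux, ENOENT is 2 ("No such file or directory") and ESTALE is
       * 116 ("Stale file handle"), matching the -2 and -116 in the logs. */
      int main(void)
      {
              printf("ENOENT = %d (%s)\n", ENOENT, strerror(ENOENT));
              printf("ESTALE = %d (%s)\n", ESTALE, strerror(ESTALE));
              return 0;
      }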

      keith Keith Mannthey (Inactive) added a comment:

      I see these same errors with a Lustre 2.5.0 client. They do not seem to impact the usability of the filesystem, but this is logged as an error, so there could be something happening.

    People

      Assignee: wc-triage WC Triage
      Reporter: brett Brett Lee (Inactive)
      Votes: 0
      Watchers: 9
