LU-4705: LustreError: 89827:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -2

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.2
    • Affects Version/s: Lustre 2.5.1
    • Labels: None
    • Environment: Running tip of Lustre b2_5, 1 MGS, 1 MDS, 2 OSS, 12 clients.
    • Severity: 3
    • 12942

    Description

      Unexpected MDC LustreErrors on most clients.

      Client 10:
      Mar 4 03:27:11 lustre10 kernel: LustreError: 183913:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -2

      Client 11:
      Mar 4 00:37:25 lustre11 kernel: LustreError: 89827:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -2

      Client 12:
      Mar 4 00:39:36 lustre12 kernel: LustreError: 11-0: cal-MDT0000-mdc-ffff8807b75c4000: Communicating with 192.168.20.1@tcp1, operation ldlm_enqueue failed with -116.
      Mar 4 00:39:36 lustre12 kernel: LustreError: 70225:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -116
      Mar 4 00:39:36 lustre12 kernel: LustreError: 70225:0:(vvp_io.c:1227:vvp_io_init()) cal: refresh file layout [0x200001c0b:0x176e:0x0] error -116.
      Mar 4 03:09:33 lustre12 kernel: LustreError: 70225:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -2

      Client 13:
      Mar 4 00:29:54 lustre13 kernel: LustreError: 167294:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -2

      Client 14:
      Mar 4 01:18:04 lustre14 kernel: LustreError: 11-0: cal-MDT0000-mdc-ffff880787af8400: Communicating with 192.168.20.1@tcp1, operation ldlm_enqueue failed with -116.
      Mar 4 01:18:04 lustre14 kernel: LustreError: 11503:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -116
      Mar 4 01:18:04 lustre14 kernel: LustreError: 11503:0:(vvp_io.c:1227:vvp_io_init()) cal: refresh file layout [0x200001c12:0xbbe2:0x0] error -116.

      Client 16:
      Mar 4 01:00:46 lustre16 kernel: LustreError: 141605:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -2

      Client 17:
      Mar 4 00:13:39 lustre17 kernel: LustreError: 11-0: cal-MDT0000-mdc-ffff8808038aa000: Communicating with 192.168.20.1@tcp1, operation ldlm_enqueue failed with -116.
      Mar 4 00:13:39 lustre17 kernel: LustreError: 126770:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -116
      Mar 4 00:13:39 lustre17 kernel: LustreError: 126770:0:(vvp_io.c:1227:vvp_io_init()) cal: refresh file layout [0x200001beb:0x1aedf:0x0] error -116.
      Mar 4 02:02:43 lustre17 kernel: LustreError: 126770:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -2

      Client 18:
      Mar 1 05:34:03 lustre18 kernel: LustreError: 146331:0:(mdc_locks.c:916:mdc_enqueue()) ldlm_cli_enqueue: -2

    Activity

      gerrit Gerrit Updater added a comment:

      Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/28978
      Subject: LU-4705 mdc: improve mdc_enqueue() error message
      Project: fs/lustre-release
      Branch: master
      Current Patch Set: 1
      Commit: 9d8f53da6ac5482262c188ba1e0ca3fb395aedfd
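      For context, the commit subject suggests the goal is a more informative, less alarming message for the expected lookup races. A minimal userspace sketch of that kind of logging policy follows; it is purely hypothetical and is not the actual change to mdc_locks.c:

      #include <errno.h>
      #include <stdio.h>

      /* Hypothetical illustration of the idea, not the Lustre patch itself:
       * -ENOENT/-ESTALE mean the file vanished between lookup and enqueue,
       * so they are reported quietly; anything else stays a hard error. */
      static void report_enqueue_error(int rc)
      {
              if (rc == -ENOENT || rc == -ESTALE)
                      printf("debug: ldlm_cli_enqueue: %d (file removed during lookup)\n", rc);
              else
                      fprintf(stderr, "error: ldlm_cli_enqueue: %d\n", rc);
      }

      int main(void)
      {
              report_enqueue_error(-ENOENT);  /* -2   */
              report_enqueue_error(-ESTALE);  /* -116 */
              report_enqueue_error(-EIO);     /* still logged as an error */
              return 0;
      }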

      kjstrosahl Kurt J. Strosahl (Inactive) added a comment (edited):

      I just saw an instance of this error in the Lustre file system at TJNAF. It is the only instance I can recall of it being seen here; we are running pristine Lustre 2.5.3.

      To expand a bit more... I have a test environment that I'm using to benchmark OSS systems. Presently I have three OSTs on a single server running Lustre 2.5.3. I've mounted it on a single client and am running IOR tests with the following parameters:

      mpirun -np 12 -bynode -machinefile ./nodelist ./ior -F -e -m -g -i 10 -t 1024k -b 42G -o /testL/benchmark/test

      where nodelist contains a single node.

      mjo Mike O'Connor added a comment:

      This is being seen at Gulfstream. In their environment there doesn't appear to be any operational consequence to it, but it scared them. It would be nice if we could mute these errors, as discussed in https://jira.hpdd.intel.com/browse/LU-4705?focusedCommentId=79255&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-79255

      brett Brett Lee (Inactive) added a comment:

      Andreas, the workload was a mix of real jobs with varying IO patterns, the most prominent of which was many small reads from large files. There was no artificial creating/deleting of files. As for the application, I am now noticing that a setting disabled the printing of "some" error and warning messages during this run; however, each job completed successfully. No unexpected application-visible errors were seen.

      adilger Andreas Dilger added a comment:

      Brett, what was the workload being run here? Something that is creating and deleting files concurrently (e.g. racer), or possibly multiple threads doing "rm -r" on the same tree? Either this is "normal" and maybe we should quiet the error messages, or it might imply some sort of bug on the MDS with inode lookup or files unexpectedly being deleted. Are there application-visible errors that are unexpected ("No such file or directory")?
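      For illustration, the kind of concurrent access/unlink race being asked about can be reproduced on any local filesystem with a small sketch like the one below (hypothetical path and loop counts; it only demonstrates the ENOENT race in userspace, not the Lustre ldlm path):

      #include <errno.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/types.h>
      #include <sys/wait.h>
      #include <unistd.h>

      /* One process repeatedly creates and unlinks a file while another
       * repeatedly opens it; the opener occasionally loses the race and
       * gets ENOENT, the userspace analogue of the -2 seen in the logs. */
      int main(void)
      {
              const char *path = "/tmp/lu4705-race-demo";     /* hypothetical path */
              pid_t pid = fork();

              if (pid < 0)
                      return 1;
              if (pid == 0) {                 /* child: create + unlink in a loop */
                      for (int i = 0; i < 100000; i++) {
                              int fd = open(path, O_CREAT | O_WRONLY, 0644);
                              if (fd >= 0)
                                      close(fd);
                              unlink(path);
                      }
                      _exit(0);
              }

              int misses = 0;
              for (int i = 0; i < 100000; i++) {      /* parent: open in a loop */
                      int fd = open(path, O_RDONLY);
                      if (fd >= 0)
                              close(fd);
                      else if (errno == ENOENT)
                              misses++;
              }
              waitpid(pid, NULL, 0);
              printf("open() lost the race %d times (%s)\n", misses, strerror(ENOENT));
              return 0;
      }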

      brett Brett Lee (Inactive) added a comment:

      No, there was no re-exporting, but each Lustre client did have four (4) mounts of the file system - each mount appearing active via the stats files in /proc.

      keith Keith Mannthey (Inactive) added a comment:

      I have seen this error with IOR, with no NFS involved. I am not sure whether the errors were generated during a single-shared-file run or a file-per-process run.

      adilger Andreas Dilger added a comment:

      Is the filesystem re-exported via NFS, or are there possibly concurrent threads that are accessing and unlinking files?

      These messages mean that the client was looking up some file, but it was deleted by the time it tried to access it.

      -116 = -ESTALE, -2 = -ENOENT.

      The errors are not really fatal, and could probably be quieted from the console.
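      The errno mapping quoted above is easy to confirm from userspace; nothing Lustre-specific is assumed here:

      #include <errno.h>
      #include <stdio.h>
      #include <string.h>

      /* On Linux, ENOENT is 2 ("No such file or directory") and ESTALE is
       * 116 ("Stale file handle"), matching the -2 and -116 in the logs. */
      int main(void)
      {
              printf("ENOENT = %d (%s)\n", ENOENT, strerror(ENOENT));
              printf("ESTALE = %d (%s)\n", ESTALE, strerror(ESTALE));
              return 0;
      }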

      keith Keith Mannthey (Inactive) added a comment:

      I see these same errors with a Lustre 2.5.0 client. They do not seem to impact the usability of the filesystem, but this is logged as an error, so there could be something happening.

    People

      Assignee: wc-triage WC Triage
      Reporter: brett Brett Lee (Inactive)
      Votes: 0
      Watchers: 9
