[LU-4607] OSS servers crashing with error: (ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired Created: 10/Feb/14  Updated: 13/Feb/14  Resolved: 13/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oz Rentas Assignee: Cliff White (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

mds1 (MGS + MDS for home)
mds2 (MDS for scratch)
6 OSS servers serving 6 volumes for home and 24 volumes for scratch

Filesystem                              Size Used Avail Use% Mounted on
mds1.ibb@o2ib0:mds2.ibb@o2ib0:/home      86T  42T   43T  50% /global/home
mds1.ibb@o2ib0:mds2.ibb@o2ib0:/scratch  342T  89T  250T  27% /global/scratch

All client mounts (Lustre home and scratch FS from ahab1:ahab2):

mds1.ibb@o2ib0:mds2.ibb@o2ib0:/home /global/home lustre rw,user_xattr,localflock,_netdev 0 0
mds1.ibb@o2ib0:mds2.ibb@o2ib0:/scratch /global/scratch lustre rw,user_xattr,localflock,_netdev 0 0


Attachments: do_IRQ-errors.txt, kern.log.1, lustre-log.partaa, lustre-log.partab
Severity: 3
Rank (Obsolete): 12604

 Description   

These messages appear every few hours on the OSS nodes:

oss6 kernel: : LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 117s: evicting client at 192.168.224.14@o2ib ns: filter-scratch-OST000b_UUID lock: ffff8804a321f000/0xaa2e9b983dbd2233 lrc: 3/0,0 mode: PW/PW res: [0x4a3a76:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20 nid: 192.168.224.14@o2ib remote: 0x55c758a593d4fc6 expref: 24 pid: 12551 timeout: 5161946610 lvb_type: 0

On the client:

pod24b14 kernel: : LustreError: 11-0: scratch-OST000b-osc-ffff880312dff800: Communicating with 192.168.254.36@o2ib, operation obd_ping failed with -107.
pod24b14 kernel: : Lustre: scratch-OST000b-osc-ffff880312dff800: Connection to scratch-OST000b (at 192.168.254.36@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Feb 3 12:21:45 pod24b14 kernel: : Lustre: Skipped 1 previous
pod24b14 kernel: : LustreError: 167-0: scratch-OST000b-osc-ffff880312dff800: This client was evicted by scratch-OST000b; in progress operations using this service will fail.
pod24b14 kernel: : Lustre: 6039:0:(llite_lib.c:2506:ll_dirty_page_discard_warn()) scratch: dirty page discard: 192.168.254.41@o2ib:192.168.254.42@o2ib:/scratch/fid: [0x2000020a6:0x130dc:0x0]/ may get corrupted (rc -108)
pod24b14 kernel: : LustreError: 16480:0:(vvp_io.c:1088:vvp_io_commit_write()) Write page 0 of inode ffff880476e1a638 failed -108
pod24b14 kernel: : LustreError: 16516:0:(osc_lock.c:817:osc_ldlm_completion_ast()) lock@ffff8806063297b8[2 3 0 1 1 00000000] W(2):[0, 18446744073709551615]@[0x1000b0000:0x4a3a76:0x0] {
pod24b14 kernel: : LustreError: 16516:0:(osc_lock.c:817:osc_ldlm_completion_ast())     lovsub@ffff8804c673e620: [0 ffff880471ca03a0 W(2):[0, 18446744073709551615]@[0x2000020a6:0x14f4a:0x0]]
pod24b14 kernel: : LustreError: 16516:0:(osc_lock.c:817:osc_ldlm_completion_ast())     osc@ffff8804901eaf00: ffff8805596246c0 0x20040000001 0x55c758a593d4fc6 3 ffff88044b08cc70 size: 0 mtime: 0 atime: 0 ctime: 0 blocks: 0
pod24b14 kernel: : LustreError: 16516:0:(osc_lock.c:817:osc_ldlm_completion_ast()) } lock@ffff8806063297b8
pod24b14 kernel: : LustreError: 16516:0:(osc_lock.c:817:osc_ldlm_completion_ast()) dlmlock returned -5
pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1420:cl_unuse_try()) result = -5, this is unlikely!
pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked()) lock@ffff880606329978[1 0 0 1 0 00000000] W(2):[0, 18446744073709551615]@[0x2000020a6:0x14f4a:0x0] {
pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked())     vvp@ffff8805f0daa678:
pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked())     lov@ffff880471ca03a0: 1
pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked())     0 0: ---
pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked())
pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked()) } lock@ffff880606329978
pod24b14 kernel: : LustreError: 16480:0:(cl_lock.c:1435:cl_unuse_locked()) unuse return -5



 Comments   
Comment by Cliff White (Inactive) [ 10/Feb/14 ]

These evictions are usually a sign of either a busy server, or a busy/bad network. The clients should recover, perhaps with an error. What does the load look like on the server side? Is there any indication of network error/issue?

Comment by Cliff White (Inactive) [ 10/Feb/14 ]

If this is only happening on a few clients every few hours, you may be able to isolate it to a specific task, again best to monitor server load over time (vmstat, top, etc).
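
A minimal sketch of that kind of monitoring, assuming plain vmstat/top sampling into a local file (the output path and 60-second interval are arbitrary choices, not anything prescribed in this ticket):

#!/bin/bash
# Hedged sketch: sample OSS load periodically so spikes can later be matched
# against eviction timestamps. Output path and interval are assumptions.
LOG=/var/tmp/oss-load-$(hostname -s).log
while true; do
    echo "==== $(date '+%F %T') ====" >> "$LOG"
    vmstat 1 5               >> "$LOG"   # CPU, run queue, swap, I/O wait
    top -b -n 1 | head -n 20 >> "$LOG"   # snapshot of the busiest processes
    sleep 60
done

Left running during the problem window (e.g. under nohup), this gives load history that can be lined up against the eviction timestamps on the OSS.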

Comment by Andreas Dilger [ 11/Feb/14 ]

For future reference, the binary debug logs that are dumped by the kernel need to be post-processed before we can use them. You need to run:

lctl debug_file /tmp/lustre-log > /tmp/lustre-log.txt

for them to be very useful. It would also be useful to include a bit more of the console logs from the OSS, since this error message is itself not a sign of a crash, but a normal message indicating that the client was not responsive to the server's request to cancel the lock.
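
For example, a small sketch that converts every binary dump before it is collected, assuming the dumps were written under /tmp with the default lustre-log prefix (adjust the glob if debug_path is set differently on these servers):

#!/bin/bash
# Hedged sketch: post-process any binary Lustre debug dumps found under /tmp.
for dump in /tmp/lustre-log*; do
    [ -f "$dump" ] || continue                 # skip if nothing matched
    case "$dump" in *.txt) continue ;; esac    # don't re-process converted files
    lctl debug_file "$dump" "${dump}.txt"      # decode binary dump to text
done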

Comment by Oz Rentas [ 12/Feb/14 ]

Noted, thank you Andreas.

The processed logs are too large to attach to this ticket but can be downloaded from: http://ddntsr.com/ftp/2014-02-11-R30000_ddn_lustre_processed_logs.tgz

Also, what can this mean?
Feb 12 02:25:28 oss5 kernel: : LustreError: 0:0:(ldlm_lockd.c:391:waiting_locks_callback()) ### lock callback timer expired after 1620s: evicting client at 192.168.224.16@o2ib ns: filter-scratch-OST0004_UUID lock: ffff8804f4fbc240/0xde6e74a6631bb013 lrc: 3/0,0 mode: PW/PW res: [0x5b9afe:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x10020 nid: 192.168.224.16@o2ib remote: 0x3ef50ad1d4f06

Now on the client 192.168.224.16, i.e. pod24b16:

[root@pod24b16 ~]# date ; lfs fid2path /global/scratch [0x5b9afe:0x0:0x0] ; date
Wed Feb 12 03:09:16 PST 2014
ioctl err -22: Invalid argument (22)
fid2path error: Invalid argument
Wed Feb 12 03:09:16 PST 2014

The message on mds2 (MDS node for scratch):

Feb 12 03:09:16 mds2 kernel: : Lustre: 24536:0:(mdt_handler.c:5739:mdt_fid2path()) scratch-MDT0000: [0x5b9afe:0x0:0x0] is invalid, sequence should be >= 0x200000400
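
For reference, the identifier in that console line comes from the OST lock resource ([0x5b9afe:0x0:0x0] is an OST object ID), while lfs fid2path only accepts normal MDT FIDs; that is what the mds2 message about the sequence needing to be >= 0x200000400 is saying. A hedged example of a lookup the MDS would accept, reusing the FID from the dirty-page warning earlier in this ticket (mount point assumed to be /global/scratch):

# Hedged example: query fid2path with a normal MDT FID (sequence >= 0x200000400),
# e.g. the one from the ll_dirty_page_discard_warn() message, rather than the
# object ID taken from the OST lock resource.
lfs fid2path /global/scratch '[0x2000020a6:0x130dc:0x0]'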

Comment by John Fuchs-Chesney (Inactive) [ 12/Feb/14 ]

Thanks Cliff.

Comment by Cliff White (Inactive) [ 12/Feb/14 ]

What that means is quite simple: there is a client holding a lock on a resource, and that client has stopped talking to the server. The server waits, in this case 1620 seconds, and then times out; the timeout allows the server to reclaim the lock. This is done so that a dead client does not halt other work on the cluster. This normally has several causes:

  • A client may be dead, or rebooted. (client state is gone after a reboot)
  • A client may be non-responsive (very busy)
  • The network may be dead, or degraded.
  • The server itself may be busy/wedged in some fashion

First, you should determine if there are corresponding dead clients, or clients having application errors matching the server error timestamps.
Then, examine load on network switches/routers.
Then, examine load on servers. A server wedge is typically indicated by a very high load factor, dead clients/bad networks usually result in idle servers.
A callback error only affects a single client - the rest of the cluster should be happy. If multiple clients are experiencing these errors at the same time, that is a good indication of the issue being network or server related.
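
A minimal sketch of that first check, assuming the OSS kernel messages land in /var/log/messages (adjust the path for the local syslog setup); it tallies evictions per client NID so it is obvious whether one client or many are being hit:

#!/bin/bash
# Hedged sketch: count "lock callback timer expired" evictions per client NID
# in the OSS syslog, to distinguish a single misbehaving client from a
# network- or server-wide problem. Log path is an assumption.
grep -h 'lock callback timer expired' /var/log/messages* \
  | grep -o 'evicting client at [^ ]*' \
  | sort | uniq -c | sort -rn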

Comment by Cliff White (Inactive) [ 12/Feb/14 ]

To further clarify, this bit:
evicting client at 192.168.224.16@o2ib
means that the client (at 192.168.224.16) will get an error when it next tries to talk to the server. The client will then re-connect and continue. The thread holding this request will report an error to the user application that made the I/O request. Depending on how the application handles this error, the eviction/reconnection may go unnoticed by the user, but there will be errors in the client logs.

Comment by Oz Rentas [ 12/Feb/14 ]

Thanks Cliff. Totally understand. In this case, there has been some push back from the customer on collecting network/client/OSS stats. This feedback will hopefully help justify the need to collect the data that has already been requested (multiple times). Thanks again.

Comment by John Fuchs-Chesney (Inactive) [ 13/Feb/14 ]

Oz – do you want us to keep this ticket open/unresolved, while you try to get your customer data? Or shall we mark this ticket as resolved and wait for you to open a new ticket if the problem reoccurs? Please advise.

Thanks.

Comment by Oz Rentas [ 13/Feb/14 ]

There are 2 different issues:

1) Mellanox: OSS HCAs losing access to UFM, and thus losing access to the storage entirely. The sequence of errors is attached as do_IRQ-errors.txt.

- irqbalance has been disabled on all servers, and the system is now being monitored.

2) Quota issue on a single OST (ost_scratch_11)

The first issue is not a Lustre problem, so we can go ahead and close LU-4607. A separate ticket will be opened for the quota problem.

Thanks all.

Comment by Peter Jones [ 13/Feb/14 ]

ok thanks Oz
