Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: Lustre 2.10.7
Labels: None
Severity: 3
Rank (Obsolete): 9223372036854775807

The symptom is that clients cannot access data on the Lustre filesystem. The client logs show request timeouts, e.g.:

Jun 27 06:58:10 vanlustre3 kernel: Lustre: 86713:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1561643289/real 1561643289]  req@ffff9bffcbd3b900 x1637338661347344/t0(0) o36->echo-MDT0000-mdc-ffff9c2b2b775000@10.23.22.104@tcp:12/10 lens 880/856 e 24 to 1 dl 1561643890 ref 2 fl Rpc:X/2/ffffffff rc -11/-1

Jun 27 06:58:10 vanlustre3 kernel: Lustre: 86713:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 4 previous similar messages

Jun 27 06:58:10 vanlustre3 kernel: Lustre: echo-MDT0000-mdc-ffff9c2b2b775000: Connection to echo-MDT0000 (at 10.23.22.104@tcp) was lost; in progress operations using this service will wait for recovery to complete

Jun 27 06:58:10 vanlustre3 kernel: Lustre: echo-MDT0000-mdc-ffff9c2b2b775000: Connection restored to 10.23.22.104@tcp (at 10.23.22.104@tcp)
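As a reading aid for the expiry message above, the time between sending the request and its deadline can be computed from the `sent` and `dl` fields. This is only a minimal sketch: it assumes both fields are Unix epoch seconds, as they appear to be in the quoted line, and the helper name `request_wait_seconds` is hypothetical, not part of any Lustre tooling.

```python
import re

# Patterns for the "[sent <epoch>/real <epoch>]" and "dl <epoch>" fields of a
# ptlrpc_expire_one_request() message. The field meanings are assumed from the
# quoted log line, not taken from Lustre documentation.
_SENT_RE = re.compile(r"\[sent (\d+)/real (\d+)\]")
_DEADLINE_RE = re.compile(r"\bdl (\d+)\b")


def request_wait_seconds(line):
    """Return deadline minus send time in seconds, or None if a field is missing."""
    sent = _SENT_RE.search(line)
    deadline = _DEADLINE_RE.search(line)
    if sent is None or deadline is None:
        return None
    return int(deadline.group(1)) - int(sent.group(1))


# Excerpt of the client log line quoted in this report.
sample = (
    "Request sent has timed out for slow reply: "
    "[sent 1561643289/real 1561643289]  req@ffff9bffcbd3b900 "
    "x1637338661347344/t0(0) o36->echo-MDT0000-mdc-ffff9c2b2b775000"
    "@10.23.22.104@tcp:12/10 lens 880/856 e 24 to 1 dl 1561643890 ref 2"
)
print(request_wait_seconds(sample))  # 601
```

For the quoted request this gives 601 seconds, i.e. the client waited roughly ten minutes before the request was expired.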

On the MDS we see:

Jun 27 06:55:08 emds1 kernel: LustreError: 27539:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1561643408, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-echo-MDT0000_UUID lock: ffff88522ed35800/0x7d046634332b9f1e lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2 rrc: 8 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 27539 timeout: 0 lvb_type: 0

Jun 27 07:00:04 emds1 kernel: Lustre: 27723:0:(service.c:1346:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (4/4), not sending early reply#012  req@ffff88231ac4f500 x1637338661348352/t0(0) o36->e46f0dd3-8775-ce8c-a09f-d393cecffa21@10.23.22.113@tcp:498/0 lens 928/3128 e 1 to 0 dl 1561644008 ref 2 fl Interpret:/0/0 rc 0/0

Jun 27 07:00:37 emds1 kernel: Lustre: 49916:0:(service.c:2114:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (118:3300s); client may timeout.  req@ffff8823586ce900 x1637332991328208/t910888684941(0) o36->e46f0dd3-8775-ce8c-a09f-d393cecffa21@10.23.22.113@tcp:247/0 lens 680/424 e 3 to 0 dl 1561640737 ref 1 fl Complete:/0/0 rc 0/0

Jun 27 07:00:37 emds1 kernel: LNet: Service thread pid 49916 completed after 3417.78s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
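To see how often service threads stall this long, the "completed after" messages can be pulled out of the MDS log. The following is a rough sketch rather than an established tool: the function name `slow_service_threads` and the 600-second default threshold are arbitrary assumptions for illustration, and the pattern is based only on the message format quoted above.

```python
import re

# Pattern for the LNet "Service thread pid N completed after X.XXs" message,
# based on the single example quoted in this report.
_COMPLETED_RE = re.compile(r"Service thread pid (\d+) completed after ([\d.]+)s")


def slow_service_threads(lines, threshold_s=600.0):
    """Return (pid, seconds) pairs for threads that took longer than threshold_s.

    threshold_s is an arbitrary illustrative cutoff, not a Lustre tunable.
    """
    slow = []
    for line in lines:
        match = _COMPLETED_RE.search(line)
        if match is None:
            continue
        pid = int(match.group(1))
        seconds = float(match.group(2))
        if seconds > threshold_s:
            slow.append((pid, seconds))
    return slow


# The MDS log line quoted in this report.
sample_lines = [
    "Jun 27 07:00:37 emds1 kernel: LNet: Service thread pid 49916 completed "
    "after 3417.78s. This indicates the system was overloaded (too many "
    "service threads, or there were not enough hardware resources).",
]
print(slow_service_threads(sample_lines))  # [(49916, 3417.78)]
```

Run over the full attached log, a list like this would show whether the long completions cluster around the periods of high CPU use.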

We see an ldlm_bl_01 (or ldlm_bl_02) thread at 100% of a CPU core for extended periods (over an hour). It recovers for a few minutes, then maxes out a core again.

What might be causing this?

Attachments:
emds1-log.gz (4.39 MB, 03/Jul/19 4:44 PM)
messages-vanlustre3.gz (616 kB, 27/Jun/19 5:02 PM)

Assignee: Peter Jones
Reporter: Campbell Mcleay (Inactive)
Votes: 0
Watchers: 4
Created: 27/Jun/19 4:34 PM
Updated: 02/Dec/20 11:53 PM
