[LU-12476] ldlm_bl_ processes running at 100% causing client issues Created: 27/Jun/19  Updated: 02/Dec/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.7
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Campbell Mcleay (Inactive) Assignee: Peter Jones
Resolution: Unresolved Votes: 0
Labels: None

Attachments: File emds1-log.gz     File messages-vanlustre3.gz    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The symptom is that clients cannot access Lustre filesystem data. We are seeing timeouts in the logs, e.g.:

Jun 27 06:58:10 vanlustre3 kernel: Lustre: 86713:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1561643289/real 1561643289]  req@ffff9bffcbd3b900 x1637338661347344/t0(0) o36->echo-MDT0000-mdc-ffff9c2b2b775000@10.23.22.104@tcp:12/10 lens 880/856 e 24 to 1 dl 1561643890 ref 2 fl Rpc:X/2/ffffffff rc -11/-1

Jun 27 06:58:10 vanlustre3 kernel: Lustre: 86713:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 4 previous similar messages

Jun 27 06:58:10 vanlustre3 kernel: Lustre: echo-MDT0000-mdc-ffff9c2b2b775000: Connection to echo-MDT0000 (at 10.23.22.104@tcp) was lost; in progress operations using this service will wait for recovery to complete

Jun 27 06:58:10 vanlustre3 kernel: Lustre: echo-MDT0000-mdc-ffff9c2b2b775000: Connection restored to 10.23.22.104@tcp (at 10.23.22.104@tcp)

On the MDS we see:

Jun 27 06:55:08 emds1 kernel: LustreError: 27539:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1561643408, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-echo-MDT0000_UUID lock: ffff88522ed35800/0x7d046634332b9f1e lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2 rrc: 8 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 27539 timeout: 0 lvb_type: 0

Jun 27 07:00:04 emds1 kernel: Lustre: 27723:0:(service.c:1346:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (4/4), not sending early reply#012  req@ffff88231ac4f500 x1637338661348352/t0(0) o36->e46f0dd3-8775-ce8c-a09f-d393cecffa21@10.23.22.113@tcp:498/0 lens 928/3128 e 1 to 0 dl 1561644008 ref 2 fl Interpret:/0/0 rc 0/0

Jun 27 07:00:37 emds1 kernel: Lustre: 49916:0:(service.c:2114:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (118:3300s); client may timeout.  req@ffff8823586ce900 x1637332991328208/t910888684941(0) o36->e46f0dd3-8775-ce8c-a09f-d393cecffa21@10.23.22.113@tcp:247/0 lens 680/424 e 3 to 0 dl 1561640737 ref 1 fl Complete:/0/0 rc 0/0

Jun 27 07:00:37 emds1 kernel: LNet: Service thread pid 49916 completed after 3417.78s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

We see an ldlm_bl_01 (or ldlm_bl_02) thread at 100% on a CPU core for extended periods (over an hour). It recovers for a few minutes, then maxes out a core again.

What might be causing this?



 Comments   
Comment by Campbell Mcleay (Inactive) [ 27/Jun/19 ]

I noticed the 'system was overloaded (too many service threads, or there were not enough hardware resources)' message in the error, though this system has been running with a very consistent load on the same hardware. We are running an lfsck on the MDS (it has been running for about a week: lctl lfsck_query -M echo-MDT0000), but we only noticed the above issues recently.

Comment by Patrick Farrell (Inactive) [ 28/Jun/19 ]

Campbell,

This doesn't obviously match up with any known issue from what we've got here, but it should be easy to get a bunch more info.

Can you do this on the system with the ldlm_bl thread maxing out (that's the MDS, right?) while that's happening, and then provide the resulting log to us:

DEBUGMB=`lctl get_param -n debug_mb`      # remember the current debug buffer size
lctl set_param *debug=-1 debug_mb=10000   # enable full debugging and enlarge the buffer
lctl clear                                # empty the debug buffer
lctl mark "start"
sleep 1
lctl dk > /tmp/log                        # dump the debug log to a file
# Set debug back to defaults
lctl set_param debug="super ioctl neterror warning dlmtrace error emerg ha rpctrace vfstrace config console lfsck"
lctl set_param debug_mb=$DEBUGMB

Then, also on the MDS, output of:

lctl get_param ldlm.namespaces.*.lock_count 

(It would be interesting to get that lock_count info both while the bl thread is acting up, and when it is not.)
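
A minimal sketch for collecting that comparison, assuming a timestamped sample once a minute is good enough (the interval and output path are arbitrary):

# Sample the lock counts once a minute; compare samples taken while an
# ldlm_bl thread is pinned at 100% against samples taken while it is idle.
while true; do
    date >> /tmp/lock_count.log
    lctl get_param ldlm.namespaces.*.lock_count >> /tmp/lock_count.log
    sleep 60
done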

It's also odd that lfsck has been running for so long - Can you tell us more about that?  Why are you running it - Is there a problem that came up?
And let's get the status of lfsck on the MDS:

lctl get_param mdd.*.lfsck_*
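
If it helps to watch lfsck progress over time rather than in a one-off snapshot, a rough sketch (the field names come from the lfsck output; the interval and grep pattern are just illustrative):

# Refresh the key namespace-scan counters every 60 seconds
watch -n 60 "lctl get_param -n mdd.*.lfsck_namespace | grep -E 'status|checked_phase1|current_position'"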
Comment by Patrick Farrell (Inactive) [ 28/Jun/19 ]

FWIW, the overloaded message you noted simply means something took too long.  Sometimes it means there is too much demand on the system resources, but it can also reflect a delay in completing a specific operation for other reasons.  So if you have some thread which is stuck for whatever reason, it will eventually print this message.  (We could perhaps improve it to be clearer)

Comment by Campbell Mcleay (Inactive) [ 02/Jul/19 ]

Thanks, Patrick. I was away on Friday and most of Monday, so apologies for the delay in replying. The problem seemed to sort itself out without intervention before I could run the commands you suggested. We haven't seen this issue on our other MDSs, which are comparatively busy, but there might have been an unusual operation or set of operations (millions of tiny files, a massive file, or something else). If it happens again, I will run the commands you suggested and see what they reveal.

Comment by Campbell Mcleay (Inactive) [ 03/Jul/19 ]

Hi Patrick,

This is happening now; I've attached the log (emds1-log.gz) along with the other info requested.

Thanks,

Campbell

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                          
 21718 root      20   0       0      0      0 R 100.0  0.0 208:06.75 ldlm_bl_03  

[root@emds1 cmcl]# lctl get_param ldlm.namespaces.*.lock_count 
ldlm.namespaces.MGC10.23.22.104@tcp.lock_count=6
ldlm.namespaces.MGS.lock_count=328
ldlm.namespaces.echo-MDT0000-lwp-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0000-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0001-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0002-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0003-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0004-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0005-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0006-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0007-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0008-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0009-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST000a-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST000b-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST000c-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST000d-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST000e-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST000f-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0010-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0011-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0012-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0013-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0014-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0015-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0016-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0017-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0018-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0019-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST001a-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST001b-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST001c-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST001d-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST001e-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST001f-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0020-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0021-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0022-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0023-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0024-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0025-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0026-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0027-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0028-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0029-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST002a-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST002b-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST002c-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST002d-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST002e-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST002f-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0030-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0031-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0032-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0033-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0034-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0035-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0036-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0037-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0038-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0039-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST003a-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST003b-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST003c-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST003d-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST003e-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST003f-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0040-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0041-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0042-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0043-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0044-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0045-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0046-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0047-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0048-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0049-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST004a-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST004b-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST004c-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST004d-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST004e-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST004f-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0050-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0051-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0052-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0053-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0054-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0055-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0056-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0057-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0058-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0059-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST005a-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST005b-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST005c-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST005d-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST005e-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST005f-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0060-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0061-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0062-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0063-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0064-osc-MDT0000.lock_count=0
ldlm.namespaces.echo-OST0065-osc-MDT0000.lock_count=0
ldlm.namespaces.mdt-echo-MDT0000_UUID.lock_count=15540771
[root@emds1 cmcl]# lctl get_param mdd.*.lfsck_*
mdd.echo-MDT0000.lfsck_async_windows=1024
mdd.echo-MDT0000.lfsck_layout=
name: lfsck_layout
magic: 0xb17371b9
version: 2
status: scanning-phase1
flags:
param:
last_completed_time: N/A
time_since_last_completed: N/A
latest_start_time: 1561056883
time_since_latest_start: 1113713 seconds
last_checkpoint_time: 1562170542
time_since_last_checkpoint: 54 seconds
latest_start_position: 77
last_checkpoint_position: 1666220057
first_failure_position: 262195188
success_count: 0
repaired_dangling: 0
repaired_unmatched_pair: 0
repaired_multiple_referenced: 0
repaired_orphan: 0
repaired_inconsistent_owner: 17
repaired_others: 0
skipped: 0
failed_phase1: 1
failed_phase2: 0
checked_phase1: 319917891
checked_phase2: 0
run_time_phase1: 1105579 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 289 items/sec
average_speed_phase2: N/A
real-time_speed_phase1: 522 items/sec
real-time_speed_phase2: N/A
current_position: 1666711854
mdd.echo-MDT0000.lfsck_namespace=
name: lfsck_namespace
magic: 0xa0621a0b
version: 2
status: scanning-phase1
flags: inconsistent,upgrade
param:
last_completed_time: N/A
time_since_last_completed: N/A
latest_start_time: 1561056883
time_since_latest_start: 1113713 seconds
last_checkpoint_time: 1562170542
time_since_last_checkpoint: 54 seconds
latest_start_position: 77, N/A, N/A
last_checkpoint_position: 1666220057, [0x20003e671:0x1e10:0x0], 0x511cd47b1a8309c2
first_failure_position: 73962505, [0x20003cb09:0x449c:0x0], 0x712fba5e57a3ada8
checked_phase1: 1924518312
checked_phase2: 0
updated_phase1: 710433
updated_phase2: 0
failed_phase1: 5
failed_phase2: 0
directories: 97008120
dirent_repaired: 0
linkea_repaired: 710433
nlinks_repaired: 0
multiple_linked_checked: 1674560030
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 0
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 0
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
linkea_overflow_cleared: 0
success_count: 0
run_time_phase1: 1112435 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 1730 items/sec
average_speed_phase2: N/A
average_speed_total: 1730 items/sec
real_time_speed_phase1: 3808 items/sec
real_time_speed_phase2: N/A
current_position: 1666711854, [0x20003e1f9:0x1e423:0x0], 0x54bdf08ba56f6996
mdd.echo-MDT0000.lfsck_speed_limit=0
[root@emds1 cmcl]#
Comment by Campbell Mcleay (Inactive) [ 03/Jul/19 ]

Could the lfsck be blocking a resource?

Comment by Patrick Farrell (Inactive) [ 03/Jul/19 ]

cmcl,

I'll take a look.  I think it's more likely the problem is related to the number of active locks - 15 million is quite a few.  It's not impossible the lfsck scan is causing that, but...

I am wondering again by the way:
It's also odd that lfsck has been running for so long - Can you tell us more about that?  Why are you running it - Is there a problem that came up, or did it start automatically?

Comment by Campbell Mcleay (Inactive) [ 03/Jul/19 ]

Hi Patrick,

Errors were reported by e2fsck when the MDS was rebooted (it had been crashing). It was running 2.10.6, so we upgraded it and ran an lfsck, as it seemed sensible to run a check.

Thanks,

Campbell

Comment by Campbell Mcleay (Inactive) [ 08/Jul/19 ]

Hi Patrick,

Could the errors on the MDS filesystem be causing problems on the client? The client was unresponsive and the last thing in the log on the client was:

Jul  8 09:14:50 vanlustre3 kernel: LustreError: 146130:0:(file.c:3644:ll_inode_revalidate_fini()) echo: revalidate FID [0x20001dcd0:0x1382e:0x0] error: rc = -4
Jul  8 09:14:50 vanlustre3 kernel: LustreError: 146130:0:(file.c:3644:ll_inode_revalidate_fini()) Skipped 841563 previous similar messages

On the MDS:

Jul  8 09:13:49 emds1 kernel: LustreError: 32492:0:(osp_object.c:582:osp_attr_get()) echo-OST0006-osc-MDT0000:osp_attr_get update error [0x100060000:0x2c5f05e:0x0]: rc = -4

We stopped the lfsck to see whether it might be the cause of the problems, but I think that was not a good idea if the filesystem has errors, and we're still not sure what is causing them (this seems to be a different problem). It also seems like a bad idea to run a fsck on an active filesystem (we're only doing this because it is an important system). Should we perhaps stop all filesystem activity and just run the fsck? The problem is we don't know how long it would be offline for.

Thanks for any help,

Campbell

Comment by Patrick Farrell (Inactive) [ 08/Jul/19 ]

Ah, sorry - I managed to lose track of this one.  I'll look at your logs.

Certainly the errors could be causing client-side problems. -4 (EINTR) is an odd error to be getting from osp_attr_get; I'll have to poke around.

Comment by Campbell Mcleay (Inactive) [ 09/Jul/19 ]

Hi Patrick,

We had another one of these incidents after we stopped the lfsck, so it doesn't look like the lfsck is what's causing the issues. I've gathered some more data; if you want it, let me know and I'll upload it. Would you recommend restarting the lfsck, or leaving it for now?

Thanks,

Campbell

Comment by Campbell Mcleay (Inactive) [ 18/Jul/19 ]

Just an amendment: the fsck had actually finished. Any updates from your side?

Comment by Campbell Mcleay (Inactive) [ 24/Jul/19 ]

Hi Patrick,

Any recommendations? The issue is becoming so bad we are getting a bit desperate. Should we try 2.12?

Thanks for any help,

Campbell

Comment by Stephane Thiell [ 02/Dec/20 ]

Even with the latest Lustre 2.12.5, ldlm_bl doesn't seem to scale very well when the lock count per MDT is above 10M, as in this ticket. This happens, for example, when users trigger large parallel copies of hundreds of millions of files (especially when using the --delete flag of rsync, not sure why). We saw it again today, with ldlm_bl at 100% and a lock_count above 15M, generating a lot of evictions and miscellaneous errors. The workaround is to reduce lru_max_age on the clients, as mentioned in LU-12832. It would be nice if the MDT could adjust this dynamically when ldlm_bl becomes CPU bound.
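
For reference, a minimal sketch of that client-side workaround (the values are only examples; lru_max_age unit handling differs between Lustre releases, so check the manual for your version, and see LU-12832 for discussion of suitable settings):

# On the Lustre clients: shorten how long unused DLM locks stay cached so the
# LRU drains sooner (example value; units/semantics depend on the release).
lctl set_param ldlm.namespaces.*.lru_max_age=600s
# Optionally drop currently cached unused locks immediately:
lctl set_param ldlm.namespaces.*.lru_size=clear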
