[LU-10268] rcu_sched self-detected stall in lfsck Created: 22/Nov/17  Updated: 23/Jan/18  Resolved: 17/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0, Lustre 2.10.3

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

toss 3.2-0rc8
kernel-3.10.0-693.5.2.1chaos.ch6.x86_64
lustre-2.8.0_13.chaos-1.ch6.x86_64

See lustre-release-fe-llnl project in Gerrit


Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

lquake-MDT0001 ran out of space while multiple invocations of "lfs migrate --mdt-index XX" were running in parallel. Space was freed up by deleting snapshots, and then an "lctl lfsck_start --all" was invoked on the node hosting the MGS and MDT0000.

After the layout portion of the lfsck completed and the namespace portion started, we began seeing console messages like this on the node hosting MDT0008:

INFO: rcu_sched self-detected stall on CPU[ 1678.988863] INFO: rcu_sched detected stalls on CPUs/tasks: { 12} (detected by 2, t=600017 jiffies, g=17401, c=17400, q=850241)
Task dump for CPU 12:
lfsck_namespace R  running task        0 36441      2 0x00000088
 0000000000000000 ffff88807ffd8000 0000000000000000 0000000000000002
 ffff88807ffd8008 ffff883f00000141 ffff8840a7003f40 ffff88807ffd7000
 0000000000000010 0000000000000000 fffffffffffffff8 0000000000000001
Call Trace:
 [<ffffffff8119649f>] ? __alloc_pages_nodemask+0x17f/0x470
 [<ffffffffc030e35d>] ? spl_kmem_alloc_impl+0xcd/0x180 [spl]
 [<ffffffffc030e35d>] ? spl_kmem_alloc_impl+0xcd/0x180 [spl]
 [<ffffffffc0315cb4>] ? xdrmem_dec_bytes+0x64/0xa0 [spl]
 [<ffffffff8119355e>] ? __rmqueue+0xee/0x4a0
 [<ffffffff811ad598>] ? zone_statistics+0x88/0xa0
 [<ffffffff81195e22>] ? get_page_from_freelist+0x502/0xa00
 [<ffffffffc0328a50>] ? nvs_operation+0xf0/0x2e0 [znvpair]
 [<ffffffff816c88d5>] ? mutex_lock+0x25/0x42
 [<ffffffff8119649f>] ? __alloc_pages_nodemask+0x17f/0x470
 [<ffffffff811dd008>] ? alloc_pages_current+0x98/0x110
 [<ffffffffc032afc2>] ? nvlist_lookup_common.part.71+0xa2/0xb0 [znvpair]
 [<ffffffffc032b4b6>] ? nvlist_lookup_byte_array+0x26/0x30 [znvpair]
 [<ffffffffc123d2f3>] ? lfsck_namespace_filter_linkea_entry.isra.64+0x83/0x180 [lfsck]
 [<ffffffffc124f4da>] ? lfsck_namespace_double_scan_one+0x3aa/0x19d0 [lfsck]
 [<ffffffffc08356d6>] ? dbuf_rele+0x36/0x40 [zfs]
 [<ffffffffc11f9c17>] ? osd_index_it_rec+0x1a7/0x240 [osd_zfs]
 [<ffffffffc1250ead>] ? lfsck_namespace_double_scan_one_trace_file+0x3ad/0x830 [lfsck]
 [<ffffffffc1254af5>] ? lfsck_namespace_assistant_handler_p2+0x795/0xa70 [lfsck]
 [<ffffffff811ec173>] ? kfree+0x133/0x170
 [<ffffffffc10283e8>] ? ptlrpc_set_destroy+0x208/0x4f0 [ptlrpc]
 [<ffffffffc1238afe>] ? lfsck_assistant_engine+0x13de/0x21d0 [lfsck]
 [<ffffffff816ca33b>] ? __schedule+0x38b/0x780
 [<ffffffff810c9de0>] ? wake_up_state+0x20/0x20
 [<ffffffffc1237720>] ? lfsck_master_engine+0x1370/0x1370 [lfsck]
 [<ffffffff810b4eef>] ? kthread+0xcf/0xe0
 [<ffffffff810b4e20>] ? insert_kthread_work+0x40/0x40
 [<ffffffff816d6818>] ? ret_from_fork+0x58/0x90
 [<ffffffff810b4e20>] ? insert_kthread_work+0x40/0x40

Also, the lfsck_namespace process was reported as stuck by the NMI watchdog. The stacks all look like this:

NMI watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [lfsck_namespace:36441]
Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) sha512_ssse3 sha512_generic crypto_null libcfs(OE) nfsv3 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib iTCO_wdt iTCO_vendor_support ib_core sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass mlx5_core pcspkr devlink joydev i2c_i801 ioatdma lpc_ich zfs(POE) zunicode(POE) zavl(POE) ses icp(POE) enclosure sg shpchp ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_cpufreq binfmt_misc zcommon(POE) znvpair(POE) spl(OE) msr_safe(OE) nfsd nfs_acl ip_tables rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache dm_round_robin sd_mod crc_t10dif crct10dif_generic scsi_transport_iscsi dm_multipath mgag200 8021q i2c_algo_bit garp drm_kms_helper stp syscopyarea crct10dif_pclmul llc sysfillrect crct10dif_common mrp crc32_pclmul sysimgblt fb_sys_fops crc32c_intel ttm ghash_clmulni_intel ixgbe(OE) drm ahci mpt3sas aesni_intel mxm_wmi libahci dca lrw gf128mul glue_helper ablk_helper cryptd ptp raid_class libata i2c_core scsi_transport_sas pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
CPU: 12 PID: 36441 Comm: lfsck_namespace Tainted: P           OEL ------------   3.10.0-693.5.2.1chaos.ch6.x86_64 #1
Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0016.033120161139 03/31/2016
task: ffff883f1eed3f40 ti: ffff883f12b04000 task.ti: ffff883f12b04000
RIP: 0010:[<ffffffffc123d2f5>]  [<ffffffffc123d2f5>] lfsck_namespace_filter_linkea_entry.isra.64+0x85/0x180 [lfsck]
RSP: 0018:ffff883f12b07ad0  EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffffffffc032b4b6 RCX: ffff887f19214971
RDX: 0000000000000000 RSI: ffff883ef42f1010 RDI: ffff883f12b07ba8
RBP: ffff883f12b07b18 R08: 0000000000000000 R09: 0000000000000025
R10: ffff883ef42f1010 R11: 0000000000000000 R12: ffff883f12b07ab4
R13: ffff883ef42f1040 R14: ffff887f1c31a7e0 R15: ffffffffc1282fa3
FS:  0000000000000000(0000) GS:ffff887f7df00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffff7ad74f0 CR3: 0000000001a16000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffff883f12b07ba8 ffff883ef42f1040 0000000000000001 ffff883f12b07b18
 ffff887f18d35ce8 ffff887f1c31a7e0 ffff883ef42f1000 ffff887f2c933c00
 ffff883ef42f1010 ffff883f12b07c18 ffffffffc124f4da ffffffffc08356d6
Call Trace:
 [<ffffffffc124f4da>] lfsck_namespace_double_scan_one+0x3aa/0x19d0 [lfsck]
 [<ffffffffc08356d6>] ? dbuf_rele+0x36/0x40 [zfs]
 [<ffffffffc11f9c17>] ? osd_index_it_rec+0x1a7/0x240 [osd_zfs]
 [<ffffffffc1250ead>] lfsck_namespace_double_scan_one_trace_file+0x3ad/0x830 [lfsck]
 [<ffffffffc1254af5>] lfsck_namespace_assistant_handler_p2+0x795/0xa70 [lfsck]
 [<ffffffff811ec173>] ? kfree+0x133/0x170
 [<ffffffffc10283e8>] ? ptlrpc_set_destroy+0x208/0x4f0 [ptlrpc]
 [<ffffffffc1238afe>] lfsck_assistant_engine+0x13de/0x21d0 [lfsck]
 [<ffffffff816ca33b>] ? __schedule+0x38b/0x780
 [<ffffffff810c9de0>] ? wake_up_state+0x20/0x20
 [<ffffffffc1237720>] ? lfsck_master_engine+0x1370/0x1370 [lfsck]
 [<ffffffff810b4eef>] kthread+0xcf/0xe0
 [<ffffffff810b4e20>] ? insert_kthread_work+0x40/0x40
 [<ffffffff816d6818>] ret_from_fork+0x58/0x90
 [<ffffffff810b4e20>] ? insert_kthread_work+0x40/0x40
Code: c7 47 10 00 00 00 00 45 31 e4 45 31 c0 4d 63 ce 66 0f 1f 44 00 00 4d 85 e4 74 41 41 0f b6 1c 24 41 0f b6 44 24 01 c1 e3 08 09 c3 <41> 39 de 41 89 5d 18 74 47 49 8b 4d 08 48 85 c9 0f 84 ad 00 00



 Comments   
Comment by Olaf Faaland [ 22/Nov/17 ]

An attempt was made to stop lfsck, since it was hurting performance and MDS nodes were occasionally crashing. As far as I can tell, lfsck was not stopped on MDT0008, and it continues after a failover.

[root@jet9:~]# lctl lfsck_query
layout_mdts_init: 0            
layout_mdts_scanning-phase1: 0 
layout_mdts_scanning-phase2: 0 
layout_mdts_completed: 14      
layout_mdts_failed: 0          
layout_mdts_stopped: 1         
layout_mdts_paused: 0          
layout_mdts_crashed: 0         
layout_mdts_partial: 1         
layout_mdts_co-failed: 0       
layout_mdts_co-stopped: 0      
layout_mdts_co-paused: 0       
layout_mdts_unknown: 0         
layout_osts_init: 0            
layout_osts_scanning-phase1: 0 
layout_osts_scanning-phase2: 0 
layout_osts_completed: 4       
layout_osts_failed: 0          
layout_osts_stopped: 0         
layout_osts_paused: 0          
layout_osts_crashed: 0         
layout_osts_partial: 0         
layout_osts_co-failed: 0       
layout_osts_co-stopped: 0      
layout_osts_co-paused: 0       
layout_osts_unknown: 0         
layout_repaired: 204717011     
namespace_mdts_init: 0         
namespace_mdts_scanning-phase1: 0
namespace_mdts_scanning-phase2: 1
namespace_mdts_completed: 0      
namespace_mdts_failed: 0         
namespace_mdts_stopped: 11       
namespace_mdts_paused: 0         
namespace_mdts_crashed: 0        
namespace_mdts_partial: 4        
namespace_mdts_co-failed: 0      
namespace_mdts_co-stopped: 0     
namespace_mdts_co-paused: 0      
namespace_mdts_unknown: 0
namespace_osts_init: 0
namespace_osts_scanning-phase1: 0
namespace_osts_scanning-phase2: 0
namespace_osts_completed: 0
namespace_osts_failed: 0
namespace_osts_stopped: 0
namespace_osts_paused: 0
namespace_osts_crashed: 0
namespace_osts_partial: 0
namespace_osts_co-failed: 0
namespace_osts_co-stopped: 0
namespace_osts_co-paused: 0
namespace_osts_unknown: 0
namespace_repaired: 889657

Comment by Peter Jones [ 22/Nov/17 ]

Fan Yong

Can you please advise?

Thanks

Peter

Comment by Olaf Faaland [ 22/Nov/17 ]

And indeed the node hosting MDT0008 is the only one with processes whose names include lfsck:

UID PID PPID C STIME TTY STAT TIME CMD
root 36433 2 0 11:01 ? S 0:00 [lfsck]
root 36441 2 0 11:01 ? R 0:00 [lfsck_namespace]
Comment by Olaf Faaland [ 24/Nov/17 ]

I now see that mount.lustre takes a mount option "skip_lfsck" to prevent the MDT from automatically resuming the lfsck; I'll try that to address the immediate problem.
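
For reference, the remount I have in mind is roughly the following (just a sketch; the pool/dataset and mount point are placeholders for the real ones):

umount <mntpoint>
mount -t lustre -o skip_lfsck <pool>/<dataset> <mntpoint>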

I don't recall seeing that mentioned in the Lustre Operations Manual section regarding lfsck.

Let me know what information to gather to help track down the RCU usage problem, though.

Comment by nasf (Inactive) [ 24/Nov/17 ]

Yes, you can use the mount option "-o skip_lfsck" to prevent LFSCK from auto-resuming when starting the Lustre service.

How many MDTs are in your system? Which command did you use to initially trigger the namespace LFSCK? And would you please show me the output of the following on MDT0008:

lctl get_param -n mdd.${fsname}-MDT0008.lfsck_namespace

Do you have any Lustre debug logs from MDT0008 from when or before the hang happened? Thanks!
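
If not, then the next time the hang happens it would help to dump the kernel debug buffer on that node, roughly like this (a sketch; the output file path is just an example):

lctl set_param debug=-1                      # enable full debug tracing
lctl dk /tmp/lustre-debug.$(hostname).log    # dump the kernel debug buffer to a file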

Comment by Olaf Faaland [ 24/Nov/17 ]

I added the "skip_lfsck" mount option, and find that does not prevent the lfsck_namespace thread from being started, which then encounters a NULL pointer dereference.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [<ffffffffc0ad3352>] fld_local_lookup+0x52/0x270 [fld]
...
CPU: 9 PID: 22725 Comm: lfsck_namespace Tainted: P OE ------------ 3.10.0-693.5.2.1chaos.ch6.x86_64 #1
H
...
Call Trace:
 [<ffffffffc07244bd>] ? zap_cursor_retrieve+0x11d/0x2f0 [zfs]
 [<ffffffffc0ad42b5>] fld_server_lookup+0x55/0x320 [fld]
 [<ffffffffc11fbea0>] lfsck_find_mdt_idx_by_fid+0x50/0x70 [lfsck]
 [<ffffffffc1226ed2>] lfsck_namespace_double_scan_one_trace_file+0x3d2/0x830 [lfsck]
 [<ffffffffc122aa73>] lfsck_namespace_assistant_handler_p2+0x713/0xa70 [lfsck]
 [<ffffffffc0fcd3e8>] ? ptlrpc_set_destroy+0x208/0x4f0 [ptlrpc]
 [<ffffffffc120eafe>] lfsck_assistant_engine+0x13de/0x21d0 [lfsck]
 [<ffffffff816ca33b>] ? __schedule+0x38b/0x780
 [<ffffffff810c9de0>] ? wake_up_state+0x20/0x20
 [<ffffffffc120d720>] ? lfsck_master_engine+0x1370/0x1370 [lfsck]
 [<ffffffff810b4eef>] kthread+0xcf/0xe0
 [<ffffffff810b4e20>] ? insert_kthread_work+0x40/0x40
 [<ffffffff816d6818>] ret_from_fork+0x58/0x90
 [<ffffffff810b4e20>] ? insert_kthread_work+0x40/0x40
Code: 74 0e 8b 35 11 7f 18 00 85 f6 0f 88 b9 00 00 00 48 89 df 48 c7 c6 c0 be ad c0 e8 fa ac 2b 00 48 85 c0 48 89 c3 0f 84 1b 01 00 00 <49> 8b 7d 18 48 8d 50 18 4c 89 f6 e8 3e e9 ff ff 85 c0 75 5a 8b
RIP [<ffffffffc0ad3352>] fld_local_lookup+0x52/0x270 [fld]

This system has 16 MDTs. The lfsck was initially started with "lctl lfsck_start --all", I believe (but am not sure) on the node hosting MDT0000 and the MGS.

I will look for debug logs from when the MDT0008 namespace scan first started reporting it was stuck; I believe I got them.

I'll remove the skip_lfsck mount option and restart to fetch your get_param output.

Comment by Olaf Faaland [ 24/Nov/17 ]

I'm unable to find the debug logs, unfortunately.

Comment by Olaf Faaland [ 24/Nov/17 ]

So far I haven't been able to log into the node to get your mdt0008.lfsck_namespace output. Once the MDT completes recovery, the node stops responding to ssh/rsh/etc. I'll try to work around that.

Comment by Olaf Faaland [ 24/Nov/17 ]
[root@jet10:lu-10268]# lctl get_param -n mdd.lquake-MDT0008.lfsck_namespace
name: lfsck_namespace
magic: 0xa0621a0b
version: 2
status: crashed
flags: scanned-once
param: all_targets
last_completed_time: N/A
time_since_last_completed: N/A
latest_start_time: 1510950812
time_since_latest_start: 604044 seconds
last_checkpoint_time: 1510954081
time_since_last_checkpoint: 600775 seconds
latest_start_position: 270, N/A, N/A
last_checkpoint_position: 35184372088832, N/A, N/A
first_failure_position: 28586726, N/A, N/A
checked_phase1: 38434361
checked_phase2: 0
updated_phase1: 0
updated_phase2: 0
failed_phase1: 4
failed_phase2: 0
directories: 12821352
dirent_repaired: 0
linkea_repaired: 0
nlinks_repaired: 0
multiple_linked_checked: 585
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 0
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 13194
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
success_count: 0
run_time_phase1: 3265 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 11771 items/sec
average_speed_phase2: 0 objs/sec
average_speed_total: 11771 items/sec
real_time_speed_phase1: N/A
real_time_speed_phase2: N/A
current_position: N/A
Comment by nasf (Inactive) [ 27/Nov/17 ]

I added the "skip_lfsck" mount option, and find that does not prevent the lfsck_namespace thread from being started, which then encounters a NULL pointer dereference.

The namespace LFSCK should not have been triggered by the local auto-start mechanism. Instead, the LFSCK start request came from another MDT via "lfsck_start --all"; at that time, on MDT0008, the local seq_server_site::ss_server_fld was not yet initialized, which caused the NULL pointer dereference. I will make a patch to avoid the NULL pointer.

Comment by Gerrit Updater [ 27/Nov/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30259
Subject: LU-10268 lfsck: postpone lfsck start until initialized
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 13b99ea9c265ed159db6be4fa2ae13fcb725f944

Comment by Gerrit Updater [ 27/Nov/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30263
Subject: LU-10268 lfsck: postpone lfsck start until initialized
Project: fs/lustre-release
Branch: b2_8_fe
Current Patch Set: 1
Commit: 7bfcef0c80df334ba8c316e44e77bd30dca064f9

Comment by Olaf Faaland [ 29/Nov/17 ]

The namespace LFSCK should not have been triggered by the local auto-start mechanism. Instead, the LFSCK start request came from another MDT via "lfsck_start --all"; at that time, on MDT0008, the local seq_server_site::ss_server_fld was not yet initialized, which caused the NULL pointer dereference.

I'm confused by this. MDT0008 can run on either jet9 or jet10, the two nodes connected to the disks that contain that target.

I added the "-o skip_lfsck" mount option to the script that we use to start the servers, and then rebooted only jet9 and jet10. When jet9 and jet10 came back up, the lfsck resumed (based on the presence of the lfsck_namespace thread).

The rest of the nodes were neither rebooted nor unmounted and remounted. I did not run "lctl lfsck_start" again anywhere.

So are you saying that one of the other MDTs sent an RPC to MDT0008 commanding it to resume the lfsck, when it detected that MDT0008 had restarted?

Comment by Olaf Faaland [ 29/Nov/17 ]

I un-mounted and re-mounted all the targets using "-o skip_lfsck". I no longer see any lfsck threads running, which seems to support my theory that the lfsck was being re-started/resumed on MDT0008 due to a message from some other target. If I'm correct, I suggest we discuss this user interface design, as I find it not at all intuitive.

I now see MDT0001 working really hard in __mdd_orphan_cleanup(). This doesn't appear to be hurting anything, but is it a side effect of the lfsck?

thanks!

Comment by Olaf Faaland [ 29/Nov/17 ]

Also, the output of lfsck_query now reports some namespace_mdts paused and others crashed, which it did not before:

[root@jeti:~]# pdsh -w e1 lctl lfsck_query | dshbak -c
----------------                                      
e1                                                    
----------------                                      
layout_mdts_init: 0                                   
layout_mdts_scanning-phase1: 0                        
layout_mdts_scanning-phase2: 0                        
layout_mdts_completed: 14                             
layout_mdts_failed: 0                                 
layout_mdts_stopped: 1                                
layout_mdts_paused: 0                                 
layout_mdts_crashed: 0                                
layout_mdts_partial: 1                                
layout_mdts_co-failed: 0                              
layout_mdts_co-stopped: 0                             
layout_mdts_co-paused: 0                              
layout_mdts_unknown: 0                                
layout_osts_init: 0                                   
layout_osts_scanning-phase1: 0                        
layout_osts_scanning-phase2: 0
layout_osts_completed: 4
layout_osts_failed: 0
layout_osts_stopped: 0
layout_osts_paused: 0
layout_osts_crashed: 0
layout_osts_partial: 0
layout_osts_co-failed: 0
layout_osts_co-stopped: 0
layout_osts_co-paused: 0
layout_osts_unknown: 0
layout_repaired: 204717011
namespace_mdts_init: 0
namespace_mdts_scanning-phase1: 0
namespace_mdts_scanning-phase2: 0
namespace_mdts_completed: 0
namespace_mdts_failed: 0
namespace_mdts_stopped: 11
namespace_mdts_paused: 3
namespace_mdts_crashed: 2
namespace_mdts_partial: 0
namespace_mdts_co-failed: 0
namespace_mdts_co-stopped: 0
namespace_mdts_co-paused: 0
namespace_mdts_unknown: 0
namespace_osts_init: 0
namespace_osts_scanning-phase1: 0
namespace_osts_scanning-phase2: 0
namespace_osts_completed: 0
namespace_osts_failed: 0
namespace_osts_stopped: 0
namespace_osts_paused: 0
namespace_osts_crashed: 0
namespace_osts_partial: 0
namespace_osts_co-failed: 0
namespace_osts_co-stopped: 0
namespace_osts_co-paused: 0
namespace_osts_unknown: 0
namespace_repaired: 889657

Comment by nasf (Inactive) [ 04/Dec/17 ]

Strictly speaking, there are three ways to trigger namespace LFSCK:
1) "lctl lfsck_start -t namespace" on the MDT, it will trigger namespace LFSCK on current MDT locally.
2) "lctl lfsck_start -t namespace -A", it will trigger namespace LFSCK on all MDTs (via RPC).
3) If a former namespace LFSCK crashed, then when the MDT recovers and is mounted without the "-o skip_lfsck" option, it will automatically resume the crashed namespace LFSCK.

The mount option "-o skip_lfsck" only controls the 3rd case. The 1st and 2nd case are LFSCK manual interfaces. They are prior to the "skip_lfsck" option. For your case, it is quite possible the 2nd case. To avoid the 2nd case caused trouble, you need to stop LFSCK by force on all MDTs (lctl lfsck_stop -A).

As for the namespace LFSCK status: "stopped" means the LFSCK was stopped explicitly via "lctl lfsck_stop"; "paused" means the LFSCK was stopped automatically when the server was unmounted normally; "crashed" means the LFSCK was running but the server shut down abnormally. In your case, MDT0008 crashed abnormally, so its status is "crashed"; for the others, you unmounted the servers, so the related LFSCK status is "paused".

Comment by nasf (Inactive) [ 04/Dec/17 ]

I now see MDT0001 working really hard in __mdd_orphan_cleanup(). This doesn't appear to be hurting anything, but is it a side effect of the lfsck?

For an ldiskfs backend, during the 2nd phase scanning the namespace LFSCK scans the backend "/lost+found" directory and tries to recover ldiskfs orphans back into the normal namespace. The __mdd_orphan_cleanup() logic, on the other hand, handles the Lustre orphans under the "/PENDING" directory. I do not think the two parts affect each other, and since your system uses a ZFS backend, I do not think LFSCK will affect the Lustre orphan handling at all.

If there are too many Lustre orphans under the "/PENDING" directory, then __mdd_orphan_cleanup() may be very slow; you can check that by mounting the backend as "zfs" directly, roughly as sketched below.
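
For example (a sketch only, assuming the MDT0001 dataset can be mounted read-only through ZPL and that the "/PENDING" directory is visible at the dataset root; the pool/dataset name is a placeholder):

mount -t zfs -o ro <pool>/<mdt0001-dataset> /mnt/mdt1
ls /mnt/mdt1/PENDING | wc -l    # rough count of Lustre orphans awaiting cleanup
umount /mnt/mdt1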

Comment by Olaf Faaland [ 04/Dec/17 ]

Thanks for the updates!

I don't believe my observations fit with your description of namespace scan trigger method #3, but I will re-test in case I made an error. I'll update the ticket about that tomorrow.

Have you been able to learn anything about what could have led to the RCU symptoms?

Comment by nasf (Inactive) [ 05/Dec/17 ]

I said your case may be #2, not #3. That is why your earlier "-o skip_lfsck" did not work.

As for the original RCU symptoms, it is not easy to locate the root cause from the given stack information alone. It seems that the namespace LFSCK dropped into a loop while scanning some file's linkEA. The most likely case is that the linkEA was corrupted. Do you have the LU-8084 patch (https://review.whamcloud.com/#/c/19877/) applied on your system?

Comment by Olaf Faaland [ 05/Dec/17 ]

No, we do not have LU-8084 https://review.whamcloud.com/#/c/19877/ in our patch stack.

I still don't understand something about the lfsck_namespace thread.

Yes, I used the "-A" option to "lfsck_start" about 3 weeks ago. That was the last time I ran "lctl lfsck_start".

Today I:
1. Powered off all the servers, then powered them all on.
2. Mounted lustre on all of them with "mount -t lustre <pool>/<dataset> <mntpoint>", that is, with no mount options
3. On MDT0008 the lfsck_namespace thread was started and got stuck in the loop scanning a linkEA
4. I powered off the node running MDT0008 and powered it on again
5. I mounted MDT0008, this time using the "-o skip_fsck" option
6. The lfsck_namespace thread was started and got stuck in the loop

Why didn't the skip_lfsck option prevent lfsck_namespace from continuing today (steps #5 and #6)?

thanks!

Comment by Olaf Faaland [ 05/Dec/17 ]

I made a mistake and need to do the experiment again. Ignore the below comment for now.

I still don't understand something about the lfsck_namespace thread.

Yes, I used the "-A" option to "lfsck_start" about 3 weeks ago. That was the last time I ran "lctl lfsck_start".

Today I:
1. Powered off all the servers, then powered them all on.
2. Mounted lustre on all of them with "mount -t lustre <pool>/<dataset> <mntpoint>", that is, with no mount options
3. On MDT0008 the lfsck_namespace thread was started and got stuck in the loop scanning a linkEA
4. I powered off the node running MDT0008 and powered it on again
5. I mounted MDT0008, this time using the "-o skip_fsck" option
6. The lfsck_namespace thread was started and got stuck in the loop

Why didn't the skip_lfsck option prevent lfsck_namespace from continuing today (steps #5 and #6)?

Comment by Gerrit Updater [ 07/Dec/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30421
Subject: LU-10268 lfsck: postpone lfsck start until initialized
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: bdc80f63b0e278040de8e2ba82b3f82e640733bf

Comment by Olaf Faaland [ 08/Dec/17 ]

I'll build with the LU-8084 patch and see if that allows the lfsck to finish, and report back.

Comment by Olaf Faaland [ 08/Dec/17 ]

I built Lustre with patch from LU-8084, upgraded the rpms on the servers, and power cycled them.

All the servers were started without any mount options. When MDT0008 started, the lfsck and lfsck_namespace threads started. The lfsck_namespace thread was again reported as stuck by the NMI watchdog, with the same stack as originally reported and the RIP at lfsck_namespace_filter_linkea_entry.isra.64+0x8e as before.

So, the patch appeared not to make a difference. What next?

Comment by nasf (Inactive) [ 12/Dec/17 ]

Regarding the LU-8084 patch, did you back-port the patch yourself or use the following one directly?
https://review.whamcloud.com/#/c/30370/

Comment by Olaf Faaland [ 12/Dec/17 ]

I used the original commit from LU-8084 applied to master, https://review.whamcloud.com/#/c/19877/. I didn't see that you had started a backport. Did you find changes were required?

Comment by Olaf Faaland [ 13/Dec/17 ]

I see that you did need to make changes. I'll try with your backport.

Comment by Olaf Faaland [ 15/Dec/17 ]

Fan,
I see that your backport wasn't tested because all the tests failed in provisioning.

Comment by Peter Jones [ 15/Dec/17 ]

Olaf

Node-provisioning failures indicate an issue in the autotest system rather than a problem with the patch itself. Changes were being made yesterday to split the tests into different test groups, so perhaps that was the issue. Fan Yong has re-triggered the tests and they seem to be running OK now.

Peter

Comment by Gerrit Updater [ 17/Dec/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30259/
Subject: LU-10268 lfsck: postpone lfsck start until initialized
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f95ee72ab6ecffdaf6dd4f0202d954dfc45d0ba1

Comment by Peter Jones [ 17/Dec/17 ]

Landed for 2.11

Comment by Olaf Faaland [ 18/Dec/17 ]

Peter,
I believe you closed this too early. Fan Yong said:

The most likely case is that the linkEA was corrupted

and so lfsck is not safe to run without the patch being backported. That backport isn't yet reviewed and merged, so we're not done yet, right?

Comment by Peter Jones [ 18/Dec/17 ]

Olaf

I closed the ticket because the ticket itself is tracking the status for master - outstanding equivalent work against an older maintenance branch would still be tracked by the presence of the topllnl label. Is more work still needed for master?

Peter

Comment by Olaf Faaland [ 18/Dec/17 ]

Peter,

I see. It seems to me that use of "topllnl" might lead to mistakes, but I agree it can work if everyone knows the convention.

Yes, I think the work for master is done.

thanks,
Olaf

Comment by Gerrit Updater [ 04/Jan/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/30421/
Subject: LU-10268 lfsck: postpone lfsck start until initialized
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: b1e6cdef3f28034f6d1c49e491fbb7837d388c22
