[LU-15380] Hang in osp_precreate_reserve() while setstriping (-C 200) on single OST Created: 17/Dec/21  Updated: 18/Dec/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Arshad Hussain Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14918 too many ldiskfs transaction credits ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hang in osp_precreate_reserve() while creating 2000+ overstripe. There is only one OST.

Steps to recreate

 

1. Prepare enviroment with just single ost. (single node will also do)
2. lfs setstripe -C 2000 /mnt/lustre/file0
<Hang>

 

 

Dmesg

06:23:00 centos79 kernel: Lustre: mdt00_001: service thread pid 2247 was inactive for 40.028 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
06:23:00 centos79 kernel: Pid: 2247, comm: mdt00_001 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021
06:23:00 centos79 kernel: Call Trace:
06:23:00 centos79 kernel: [<0>] osp_precreate_reserve+0x490/0x9b0 [osp]
06:23:00 centos79 kernel: [<0>] osp_declare_create+0x1ac/0x660 [osp]
06:23:00 centos79 kernel: [<0>] lod_sub_declare_create+0xdf/0x210 [lod]
06:23:00 centos79 kernel: [<0>] lod_qos_declare_object_on+0xf3/0x420 [lod]
06:23:00 centos79 kernel: [<0>] lod_ost_alloc_rr.constprop.23+0xaad/0x1150 [lod]
06:23:00 centos79 kernel: [<0>] lod_qos_prep_create+0x1245/0x1a20 [lod]
06:23:00 centos79 kernel: [<0>] lod_prepare_create+0x230/0x320 [lod]
06:23:00 centos79 kernel: [<0>] lod_declare_striped_create+0x1fe/0x9e0 [lod]
06:23:00 centos79 kernel: [<0>] lod_declare_xattr_set+0x139/0x2020 [lod]
06:23:00 centos79 kernel: [<0>] mdd_create_data+0x532/0x760 [mdd]
06:23:00 centos79 kernel: [<0>] mdt_mfd_open+0xcb7/0xed0 [mdt]
06:23:00 centos79 kernel: [<0>] mdt_finish_open+0x79b/0x930 [mdt]
06:23:00 centos79 kernel: [<0>] mdt_open_by_fid_lock+0x704/0xb10 [mdt]
06:23:00 centos79 kernel: [<0>] mdt_reint_open+0x783/0x2a80 [mdt]
06:23:00 centos79 kernel: [<0>] mdt_reint_rec+0x8a/0x240 [mdt]
06:23:00 centos79 kernel: [<0>] mdt_reint_internal+0x76c/0xb50 [mdt]
06:23:00 centos79 kernel: [<0>] mdt_intent_open+0x93/0x480 [mdt]
06:23:00 centos79 kernel: [<0>] mdt_intent_opc+0x1e0/0xc10 [mdt]
06:23:00 centos79 kernel: [<0>] mdt_intent_policy+0x1a1/0x360 [mdt]
06:23:00 centos79 kernel: [<0>] ldlm_lock_enqueue+0x3c5/0xb50 [ptlrpc]
06:23:00 centos79 kernel: [<0>] ldlm_handle_enqueue0+0xab6/0x16d0 [ptlrpc]
06:23:00 centos79 kernel: [<0>] tgt_enqueue+0x64/0x240 [ptlrpc]
06:23:00 centos79 kernel: [<0>] tgt_request_handle+0x823/0x1850 [ptlrpc]
06:23:00 centos79 kernel: [<0>] ptlrpc_server_handle_request+0x253/0xb10 [ptlrpc]
06:23:00 centos79 kernel: [<0>] ptlrpc_main+0xbf4/0x15e0 [ptlrpc]
06:23:00 centos79 kernel: [<0>] kthread+0xd1/0xe0
06:23:00 centos79 kernel: [<0>] ret_from_fork_nospec_begin+0xe/0x21
06:23:00 centos79 kernel: [<0>] 0xfffffffffffffffe


 Comments   
Comment by Arshad Hussain [ 17/Dec/21 ]
(gdb) list *(osp_precreate_reserve+0x490)
0x13fc0 is in osp_precreate_reserve (/root/lustre-dev/lustre-release/lustre/osp/osp_precreate.c:1509).
1504                LASSERT(d->opd_pre);
1505                rc = -ENOBUFS;
1506                break;
1507            }
1508    
1509            if (wait_event_idle_timeout(
1510                    d->opd_pre_user_waitq,
1511                    osp_precreate_ready_condition(env, d),
1512                    cfs_time_seconds(obd_timeout)) == 0) {
1513                CDEBUG(D_HA,
(gdb) 
Comment by Patrick Farrell [ 17/Dec/21 ]

Arshad,

Did the hang resolve or did it stay stuck indefinitely?  Also, 2000 stripes on one OST is not going to work well at all, in a practical sense

Comment by Arshad Hussain [ 17/Dec/21 ]

Patrick,

I hard reset the VM after ~10 mins. I can confirm that until ~10 mins the hang was not resolved and was stuck. Let me rerun the recreation and confirm to you if it is stuck indefinitely. Probably, I will wait for ~30 mins instead of 10min, that I understand should be good enough.

>Also, 2000 stripes on one OST is not going to work well at all, ...
Agree!. This was surface because of Andreas's patch exercising "llverdev" in the test environment. Mostly this is environment related for the test-case and will not be seen on real world. I did kept the bug marked "minor" for this reason.

Thanks

Comment by Arshad Hussain [ 17/Dec/21 ]

Patrick,

Now that I am looking at the osp_precreate code. As I understand the 100seconds "obd_timeout" should already have triggered? It is a complete hang IMO.

if (wait_event_idle_timeout(                                    
        d->opd_pre_user_waitq,                              
        osp_precreate_ready_condition(env, d),              
        cfs_time_seconds(obd_timeout)) == 0) {  <<<<<<<<< 100 seconds (However, after 10mins it was still stuck)
Comment by Andreas Dilger [ 17/Dec/21 ]

You should try Alex's patch for reducing the credits for overstriped files:

https://review.whamcloud.com/45765 "LU-14918 osd: don't declare similar writes twice"

Comment by Andreas Dilger [ 17/Dec/21 ]

Oh, never mind, this is stuck on the MDS before it even gets to the OST. Probably on a newly-booted MDS there are no procreated objects, and it is unable to get the objects it needs. It would probably work with "-C 32". Of course it shouldn't hang either. We have some tests doing "-C 2000" with only 7 OSTs, so that is 280 objects per OST, so not sure why this wouldn't work.

Comment by Patrick Farrell [ 17/Dec/21 ]

Or even 2 OSTs - I've certainly tried it locally.  But it's not a new file system in those cases, is it?

Comment by Arshad Hussain [ 18/Dec/21 ]

 

>Or even 2 OSTs - I've certainly tried it locally.  But it's not a new file system in those cases, is it?

Yes, it hangs with 2 OSTs also. Mine is relatively new FS (Mostly bought up with llmount.sh).

This is my local setup (edited to remove FS options making it more readable)

# mount | grep lus
/dev/sda2 on /usr/sbin/mount.lustre type ext4 (rw,relatime,data=ordered)
1. MDS/MGS
/dev/mapper/mds1_flakey on /mnt/lustre-mds1 type lustre 
3 OST
/dev/mapper/ost1_flakey on /mnt/lustre-ost1 type lustre 
/dev/mapper/ost2_flakey on /mnt/lustre-ost2 type lustre 
/dev/mapper/ost3_flakey on /mnt/lustre-ost3 type lustre 

192.168.50.95@tcp:/lustre on /mnt/lustre type lustre 
Comment by Arshad Hussain [ 18/Dec/21 ]

Andreas,

>...It would probably work with "-C 32". Of course it shouldn't hang either. ...

With 1 OST it does work with overstripe(-C) 32. However, I am observing that the 'rm' after setstripe hangs with no backtrace on dmesg etc... 

Given that llmount.sh bring up the environment like below

 

# mount | grep lus 
/dev/sda2 on /usr/sbin/mount.lustre type ext4 (rw,relatime,data=ordered) 
/dev/mapper/mds1_flakey on /mnt/lustre-mds1 type lustre 
/dev/mapper/ost1_flakey on /mnt/lustre-ost1 type lustre 
/dev/mapper/ost2_flakey on /mnt/lustre-ost2 type lustre 
/dev/mapper/ost3_flakey on /mnt/lustre-ost3 type lustre 
192.168.50.95@tcp:/lustre on /mnt/lustre type lustre 

umount 2 OST (simulate a single OST)

 

# umount /mnt/lustre-ost3
# umount /mnt/lustre-ost2

 

# mount | grep lus
/dev/sda2 on /usr/sbin/mount.lustre type ext4 (rw,relatime,data=ordered)
/dev/mapper/mds1_flakey on /mnt/lustre-mds1 type lustre (ro,svname=lustre-MDT0000,mgs,osd=osd-ldiskfs,user_xattr,errors=remount-ro)
/dev/mapper/ost1_flakey on /mnt/lustre-ost1 type lustre (ro,svname=lustre-OST0000,mgsnode=192.168.50.95@tcp,osd=osd-ldiskfs,errors=remount-ro)
192.168.50.95@tcp:/lustre on /mnt/lustre type lustre (rw,checksum,flock,user_xattr,lruresize,lazystatfs,nouser_fid2path,verbose,noencrypt)

Run "setstrip" - This is success

# lfs setstripe -C 32 /mnt/lustre/file0

Try to remove newly created file.. "rm" hangs forever

 

# rm /mnt/lustre/file0

 

meanwhile dmesg has

 

Dec 17 23:54:30 centos79 kernel: LustreError: 11-0: lustre-OST0001-osc-ffff917e5a017800: operation ost_connect to node 0@lo failed: rc = -19
Dec 17 23:55:00 centos79 kernel: LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.
Dec 17 23:55:00 centos79 kernel: LustreError: Skipped 51 previous similar messages
Dec 17 23:56:46 centos79 kernel: LustreError: 11-0: lustre-OST0002-osc-ffff917e5a017800: operation ost_connect to node 0@lo failed: rc = -19
Dec 17 23:57:01 centos79 kernel: LustreError: 11-0: lustre-OST0001-osc-ffff917e5a017800: operation ost_connect to node 0@lo failed: rc = -19
Dec 17 23:57:31 centos79 kernel: LustreError: 137-5: lustre-OST0001_UUID: not available for connect from 0@lo (no target). If you are running an HA pair check that the target is mounted on the other server.

 

 

Now bring up the privous unmonted OST

 

# mount.lustre /dev/mapper/ost2_flakey /mnt/lustre-ost2/
# mount.lustre /dev/mapper/ost3_flakey /mnt/lustre-ost3/

"rm" comes alive...

# rm /mnt/lustre/file0
rm: remove regular empty file ‘/mnt/lustre/file0’? y
# ls -ali /mnt/lustre/
total 8
144115188193296385 drwxr-xr-x  3 root root 4096 Dec 18 00:02 .
           1700609 drwxr-xr-x. 9 root root 4096 Sep 13 06:17 ..

 

 

 

 

 

Comment by Arshad Hussain [ 18/Dec/21 ]

...continue

The same is applied on "setstrip -C 2000" on single OST.

# umount /mnt/lustre-ost3/
# umount /mnt/lustre-ost2/
# lfs setstripe -C 2000 /mnt/lustre/file0
<hangs>

Wait for ~15 mins... Then mount back the unmounted OST

# mount.lustre /dev/mapper/ost2_flakey /mnt/lustre-ost2/
# mount.lustre /dev/mapper/ost3_flakey /mnt/lustre-ost3/

Wait some more... setstripe thread comes alive. Ofcourse object has no stripe info though

 

# lfs setstripe -C 2000 /mnt/lustre/file0
...lots of waiting...
# lfs getstripe /mnt/lustre/file0
/mnt/lustre/file0 has no stripe info

dmesg

Dec 18 00:46:01 centos79 kernel: Lustre: lustre-OST0002-osc-MDT0000: Connection restored to 192.168.50.95@tcp (at 0@lo)
Dec 18 00:54:16 centos79 systemd-logind: New session 4 of user root.
Dec 18 00:54:16 centos79 systemd: Started Session 4 of user root.
Dec 18 00:54:50 centos79 kernel: Lustre: 6473:0:(client.c:2290:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1639806289/real 1639806289]  req@ffff929b61754900 x1719460205452416/t0(0) o101->lustre-MDT0000-mdc-ffff929b58e76000@0@lo:12/10 lens 648/66264 e 24 to 1 dl 1639806890 ref 2 fl Rpc:IXPQr/2/ffffffff rc 0/-1 job:'lfs.0'
Dec 18 00:54:50 centos79 kernel: Lustre: lustre-MDT0000-mdc-ffff929b58e76000: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Dec 18 00:54:50 centos79 kernel: Lustre: lustre-MDT0000: Client a48ea68c-ea49-4863-8b15-e54b432375a0 (at 0@lo) reconnecting
Dec 18 00:54:50 centos79 kernel: Lustre: lustre-MDT0000-mdc-ffff929b58e76000: Connection restored to 192.168.50.95@tcp (at 0@lo)

Summary:

IMO, the way OST's are initially initilize the precreated objects are set is causing the setstripe hang.  If lustre is setup with 3 OST then -C 2000 works flawlessly on this. As soon as we unmount a OST it starts to fail. Either in setstrip or any operation like rm (shown above). Even in maloo I think we can simulate it by letting 7 OST (presently setup) to come up and then unmounting a sinlge OST. Setstrip will also hang there.

 

 

Generated at Sat Feb 10 03:17:49 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.