[LU-2789] lod_load_striping()) ASSERTION( lo->ldo_stripenr == 0 ) failed Created: 08/Feb/13  Updated: 12/Oct/16  Resolved: 12/Oct/16

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Critical
Reporter: John Hammond Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: LB, lod, mdd

Issue Links:
Related
is related to LU-2523 ll_update_inode()) ASSERTION( lu_fid_... Resolved
is related to LU-4083 lod_lov.c:824:lod_load_striping()) AS... Resolved
is related to LU-4260 ASSERTION( lc->ldo_stripenr == 0 ) fa... Resolved
is related to LU-3059 shrink lod_object to 128 bytes Resolved
is related to LU-3072 add more operations to racer Closed
Severity: 3
Rank (Obsolete): 6755

 Description   

This could probably be reproduced by racer given enough runs but I can reproduce it as follows:

# llmount.sh
# mount n@tcp:/lustre /mnt/lustre2 -t lustre
# (cd /mnt/lustre; while true; do lfs setstripe -c 1 f0; done) &
# (cd /mnt/lustre2; while true; do mv f0 f1; done) &

Message from syslogd@n at Feb  8 15:36:51 ...
 kernel:LustreError: 3186:0:(lod_lov.c:782:lod_load_striping()) ASSERTION( lo->ldo_stripe[i] ) failed: stripe 0 is NULL

Message from syslogd@n at Feb  8 15:36:51 ...
 kernel:LustreError: 3186:0:(lod_lov.c:782:lod_load_striping()) LBUG

Message from syslogd@n at Feb  8 15:36:51 ...
 kernel:Kernel panic - not syncing: LBUG

Here is the crash dump for the rename handler:

crash> bt -l
PID: 13628  TASK: ffff8800a98a1540  CPU: 1   COMMAND: "mdt00_001"
 #0 [ffff8800a98a3828] machine_kexec at ffffffff81031f7b
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6.x86_64/arch/x86/kernel/machine_kexec_64.c: 336
 #1 [ffff8800a98a3888] crash_kexec at ffffffff810b8c22
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6.x86_64/kernel/kexec.c: 1106
 #2 [ffff8800a98a3958] panic at ffffffff814e9818
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6.x86_64/kernel/panic.c: 103
 #3 [ffff8800a98a39d8] lbug_with_loc at ffffffffa0595eeb [libcfs]
    /root/lustre-release/libcfs/libcfs/linux/linux-debug.c: 188
 #4 [ffff8800a98a39f8] lod_load_striping at ffffffffa0e199f3 [lod]
    /root/lustre-release/lustre/lod/lod_internal.h: 255
 #5 [ffff8800a98a3a38] lod_declare_attr_set at ffffffffa0e25fbb [lod]
    /root/lustre-release/lustre/lod/lod_object.c: 300
 #6 [ffff8800a98a3a88] mdd_rename at ffffffffa0beb6d8 [mdd]
    /root/lustre-release/lustre/mdd/mdd_dir.c: 2087
 #7 [ffff8800a98a3ba8] mdt_reint_rename at ffffffffa0d54617 [mdt]
    /root/lustre-release/lustre/mdt/mdt_reint.c: 1270
 #8 [ffff8800a98a3cc8] mdt_reint_rec at ffffffffa0d506b1 [mdt]
    /root/lustre-release/libcfs/include/libcfs/libcfs_debug.h: 211
 #9 [ffff8800a98a3ce8] mdt_reint_internal at ffffffffa0d49d13 [mdt]
    /root/lustre-release/libcfs/include/libcfs/libcfs_debug.h: 211
#10 [ffff8800a98a3d28] mdt_reint at ffffffffa0d4a044 [mdt]
    /root/lustre-release/lustre/mdt/mdt_handler.c: 1818
#11 [ffff8800a98a3d48] mdt_handle_common at ffffffffa0d3afb8 [mdt]
    /root/lustre-release/lustre/mdt/mdt_handler.c: 2981
#12 [ffff8800a98a3d98] mds_regular_handle at ffffffffa0d725f5 [mdt]
    /root/lustre-release/lustre/mdt/mdt_mds.c: 354
#13 [ffff8800a98a3da8] ptlrpc_server_handle_request at ffffffffa08e9c7c [ptlrpc]
    /root/lustre-release/lustre/include/lustre_net.h: 2771
#14 [ffff8800a98a3ea8] ptlrpc_main at ffffffffa08eb1c6 [ptlrpc]
    /root/lustre-release/lustre/ptlrpc/service.c: 2487
#15 [ffff8800a98a3f48] kernel_thread at ffffffff8100c0ca
    /usr/src/debug///////kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6.x86_64/arch/x86/kernel/entry_64.S: 1213

lfs setstripe is in ioctl() with its mdt_reint_open() handler in:

mdt_reint_open()
 ...
  mdt_create_data()
   ...
    lod_declare_xattr_set()
     ...
      osp_precreate_reserve()


 Comments   
Comment by Jodi Levi (Inactive) [ 11/Feb/13 ]

Assigning to John as he already has a patch.

Comment by John Hammond [ 12/Feb/13 ]

Thanks for the vote of confidence but I don't have a patch for this one. I've added Alex and Mike to the watcher list as they seem to be the lod experts.

Comment by John Hammond [ 13/Feb/13 ]

Please see http://review.whamcloud.com/5422 (or suggest something better).

Comment by Andreas Dilger [ 13/Feb/13 ]

I've submitted http://review.whamcloud.com/5424 to disable the setstripe in racer/file_create.sh, unless "RACER_SETSTRIPE=true" is set in the environment. That will allow adding racer to the review test list, but also allow testing these patches as needed via "Test-Parameters: envdefinitions=RACER_SETSTRIPE=true".

Please delete the exception for RACER_SETSTRIPE from file_create.sh when this patch and LU-2523 are landed.

Comment by John Hammond [ 13/Feb/13 ]

Forgot to mention, if I understand correctly there is still no guarantee that the objects present at time of commit were all present at declaration.

Comment by Alex Zhuravlev [ 14/Feb/13 ]

> Forgot to mention, if I understand correctly there is still no guarantee that the objects present at time of commit were all present at declaration.

we just need to ensure ldo_stripe is valid. this can be done with proper ordering: allocate it, initialize with all the objects, then set ldo_stripe to that.
this is still not enough in some cases, but not worse than your patch.

Comment by John Hammond [ 20/Feb/13 ]

Patch landed.

Comment by Oleg Drokin [ 20/Mar/13 ]

I just triggered this during a racer run again

[12352.649735] LustreError: 3673:0:(osp_precreate.c:484:osp_precreate_send()) lustr
e-OST0001-osc-MDT0000: can't precreate: rc = -28
[12352.650805] Lustre: wanted 1, found 0
[12352.664985] Lustre: wanted 1, found 0
[12352.665240] Lustre: Skipped 1 previous similar message
[12352.731858] LustreError: 32093:0:(lod_lov.c:819:lod_load_striping()) ASSERTION( 
lo->ldo_stripenr == 0 ) failed: 
[12352.732406] LustreError: 32093:0:(lod_lov.c:819:lod_load_striping()) LBUG
[12352.732586] Pid: 32093, comm: mdt01_008
[12352.732733] 
[12352.732733] Call Trace:
[12352.732989]  [<ffffffffa045b915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[12352.733180]  [<ffffffffa045bf17>] lbug_with_loc+0x47/0xb0 [libcfs]
[12352.733403]  [<ffffffffa07008d3>] lod_load_striping+0x383/0x4b0 [lod]
[12352.733591]  [<ffffffffa070cecb>] lod_declare_attr_set+0x17b/0x450 [lod]
[12352.735056]  [<ffffffffa0c5a888>] mdd_rename+0x328/0x22f0 [mdd]
[12352.735253]  [<ffffffffa07090f5>] ? lod_index_lookup+0x25/0x30 [lod]
[12352.735439]  [<ffffffffa0c55ebe>] ? __mdd_lookup_locked+0x1ee/0x490 [mdd]
[12352.735695]  [<ffffffffa0471692>] ? cfs_hash_bd_from_key+0x42/0xd0 [libcfs]
[12352.735915]  [<ffffffffa0d444c5>] mdt_reint_rename+0x13d5/0x1b10 [mdt]
[12352.736145]  [<ffffffffa061dec0>] ? lu_ucred+0x20/0x30 [obdclass]
[12352.736379]  [<ffffffffa0d3f251>] mdt_reint_rec+0x41/0xe0 [mdt]
[12352.736599]  [<ffffffffa0d388b3>] mdt_reint_internal+0x4e3/0x7d0 [mdt]
[12352.736839]  [<ffffffffa0d38be4>] mdt_reint+0x44/0xe0 [mdt]
[12352.737033]  [<ffffffffa0d2a078>] mdt_handle_common+0x628/0x1630 [mdt]
[12352.737244]  [<ffffffffa0d62935>] mds_regular_handle+0x15/0x20 [mdt]
[12352.737495]  [<ffffffffa07ac003>] ptlrpc_server_handle_request+0x443/0xe40 [ptlrpc]
[12352.738270]  [<ffffffffa045c65e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[12352.738462]  [<ffffffffa07a4fc1>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[12352.738654]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[12352.738853]  [<ffffffffa07aeacd>] ptlrpc_main+0xafd/0x17f0 [ptlrpc]
[12352.739162]  [<ffffffffa07adfd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[12352.739391]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[12352.739661]  [<ffffffffa07adfd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[12352.740114]  [<ffffffffa07adfd0>] ? ptlrpc_main+0x0/0x17f0 [ptlrpc]
[12352.740387]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[12352.740647] 
[12352.741400] Kernel panic - not syncing: LBUG

I have a crashdump and modules in /exports/crashdumps/192.168.10.220-2013-03-20-12\:47\:02/
code tag in my repo: master-20130320

Comment by John Hammond [ 21/Mar/13 ]

New patch at http://review.whamcloud.com/#change,5802.

(There remains a setstripe vs chown race.)

Comment by John Hammond [ 25/Mar/13 ]

A more robust patch is at http://review.whamcloud.com/5839.

I hope they can both go through because the first offers an optimization to rename()/chmod()/utime() in that it does not require that the MDT load striping for the object.

Comment by John Hammond [ 29/Mar/13 ]

Based on the discussion at http://review.whamcloud.com/#change,5839, I have added a semaphore to lod_object (as of patch set 3). With the changes attached to LU-3059 this puts the size of lod_object at a nice round 128 bytes. This does not solve all problems but it does fix several and reduce the odds of the remaining ones I know of---as measured by the time until LBUG() under some manual testing. For an example try the following before and after #5839:

# MOUNT_2=y llmount.sh
# (cd /mnt/lustre; while true; do lfs setstripe f0; done) &
# (cd /mnt/lustre2; while true; do chown 42:42 f0; rm f0; done) &

Before this would quickly trigger an LASSERT() in LOD. After it less quickly triggers one in OSP.

LustreError: 6283:0:(osp_sync.c:248:osp_sync_add_rec()) ASSERTION( txn ) failed:
LustreError: 6283:0:(osp_sync.c:248:osp_sync_add_rec()) LBUG

crash> bt -l
PID: 6283   TASK: ffff88013b9b6080  CPU: 0   COMMAND: "mdt00_002"
 #0 [ffff88013ba23848] machine_kexec at ffffffff81031f7b
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre_gcov.x86_6\
4/arch/x86/kernel/machine_kexec_64.c: 336
 #1 [ffff88013ba238a8] crash_kexec at ffffffff810b8c22
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre_gcov.x86_6\
4/kernel/kexec.c: 1106
 #2 [ffff88013ba23978] panic at ffffffff814eae18
    /usr/src/debug/kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre_gcov.x86_6\
4/kernel/panic.c: 103
 #3 [ffff88013ba239f8] lbug_with_loc at ffffffffa051deeb [libcfs]
    /root/lustre-release/libcfs/libcfs/linux/linux-debug.c: 176
 #4 [ffff88013ba23a18] osp_sync_add_rec at ffffffffa0dc33dd [osp]
    /root/lustre-release/lustre/osp/osp_sync.c: 1196
 #5 [ffff88013ba23a88] osp_sync_add at ffffffffa0dc35bb [osp]
    /root/lustre-release/lustre/osp/osp_sync.c: 286
 #6 [ffff88013ba23aa8] osp_attr_set at ffffffffa0db76c3 [osp]
    /root/lustre-release/libcfs/include/libcfs/libcfs_debug.h: 211
 #7 [ffff88013ba23ad8] lod_attr_set at ffffffffa0d85e6f [lod]
    /root/lustre-release/lustre/lod/lod_object.c: 369
 #8 [ffff88013ba23b38] mdd_attr_set_internal at ffffffffa07ba561 [mdd]
    /root/lustre-release/lustre/mdd/mdd_object.c: 417
 #9 [ffff88013ba23b88] mdd_attr_set at ffffffffa07bc8aa [mdd]
    /root/lustre-release/lustre/mdd/mdd_object.c: 882
#10 [ffff88013ba23c08] mdt_attr_set at ffffffffa0cb23c8 [mdt]
    /root/lustre-release/lustre/mdt/mdt_reint.c: 440
#11 [ffff88013ba23c58] mdt_reint_setattr at ffffffffa0cb2cb4 [mdt]
    /root/lustre-release/lustre/mdt/mdt_reint.c: 576
#12 [ffff88013ba23cc8] mdt_reint_rec at ffffffffa0cac7b1 [mdt]
    /root/lustre-release/libcfs/include/libcfs/libcfs_debug.h: 211
#13 [ffff88013ba23ce8] mdt_reint_internal at ffffffffa0ca5df3 [mdt]
    /root/lustre-release/libcfs/include/libcfs/libcfs_debug.h: 211
#14 [ffff88013ba23d28] mdt_reint at ffffffffa0ca6124 [mdt]
    /root/lustre-release/lustre/mdt/mdt_handler.c: 1821
#15 [ffff88013ba23d48] mdt_handle_common at ffffffffa0c940e8 [mdt]
    /root/lustre-release/lustre/mdt/mdt_handler.c: 2984
#16 [ffff88013ba23d98] mds_regular_handle at ffffffffa0cd0335 [mdt]
    /root/lustre-release/lustre/mdt/mdt_mds.c: 354
#17 [ffff88013ba23da8] ptlrpc_server_handle_request at ffffffffa086f29c [ptlrpc]
    /root/lustre-release/lustre/include/lustre_net.h: 2894
#18 [ffff88013ba23ea8] ptlrpc_main at ffffffffa08707e5 [ptlrpc]
    /root/lustre-release/lustre/ptlrpc/service.c: 2486
#19 [ffff88013ba23f48] kernel_thread at ffffffff8100c0ca
    /usr/src/debug///////kernel-2.6.32-279.19.1.el6/linux-2.6.32-279.19.1.el6_lustre_gcov\
.x86_64/arch/x86/kernel/entry_64.S: 1213
Comment by Andreas Dilger [ 30/Mar/13 ]

John, can you please make the test you described in the last comment into a sanity.sh test, or add it to racer in some manner. The OSP failure should probably be fixed in a separate ticket along with landing the test that triggers it.

Comment by Alex Zhuravlev [ 30/Mar/13 ]

John, this mean the transaction did not declare llog records. in the context of the patch, I'd guess one transaction made declaration, another created the striping (but didn't do declaration).

Comment by John Hammond [ 30/Mar/13 ]

Andreas: I have created LU-3072 for the racer improvements. AFAIK, we have not used this sort of approach in sanity. Is that right? I don't see any obvious issues but it makes me somewhat uneasy.

Alex: This is what I was worried about yesterday. It seems that when lod_declare_attr_set() was called the object had no striping, but by the time lod_attr_set() was called it did. I will try to verify this.

Running:

# MOUNT_2=y llmount.sh
# (cd /mnt/lustre; while true; do sys_setstripe -c1 f0; sys_unlink f0; done) &
# (cd /mnt/lustre2; while true; do sys_truncate f0 117; done) & ### does truncate("f0", 117) 

I see:

LustreError: 3215:0:(osp_object.c:56:osp_object_assign_fid()) ASSERTION( fid_is_zero(lu_object_fid(&o->opo_obj.do_lu)) ) failed: 
LustreError: 3215:0:(osp_object.c:56:osp_object_assign_fid()) LBUG
Pid: 3215, comm: mdt00_002

Call Trace:
 [<ffffffffa051a895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa051ae97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0db4912>] osp_object_assign_fid+0xd2/0x160 [osp]
 [<ffffffffa0db4b04>] osp_declare_attr_set+0x164/0x330 [osp]
 [<ffffffffa0d83425>] lod_declare_attr_set+0x275/0x4f0 [lod]
 [<ffffffffa07b8c49>] mdd_attr_set+0x3f9/0x1390 [mdd]
 [<ffffffffa0caf3c8>] mdt_attr_set+0x268/0x560 [mdt]
 [<ffffffffa0cafcb4>] mdt_reint_setattr+0x5f4/0xd10 [mdt]
 [<ffffffffa085bfde>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
 [<ffffffffa0ca97b1>] mdt_reint_rec+0x41/0xe0 [mdt]
 [<ffffffffa0ca2df3>] mdt_reint_internal+0x4e3/0x7d0 [mdt]
 [<ffffffffa0ca3124>] mdt_reint+0x44/0xe0 [mdt]
 [<ffffffffa0c910e8>] mdt_handle_common+0x648/0x1660 [mdt]
 [<ffffffffa0ccd335>] mds_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa086c29c>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
 [<ffffffffa051b5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa08638d9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffff81052223>] ? __wake_up+0x53/0x70
 [<ffffffffa086d7e5>] ptlrpc_main+0xb75/0x1870 [ptlrpc]
 [<ffffffffa086cc70>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa086cc70>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffffa086cc70>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

As well as:

LustreError: 13907:0:(osp_object.c:245:osp_object_create()) ASSERTION( fid_is_sane(fid) ) failed: fid for osp_obj ffffffffa0db3f00 is insane[0x0:0x0:0x0]!
LustreError: 13907:0:(osp_object.c:245:osp_object_create()) LBUG
Pid: 13907, comm: mdt01_003

Call Trace:
 [<ffffffffa051a895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa051ae97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0db50c2>] osp_object_create+0x3f2/0x890 [osp]
 [<ffffffffa0d81ae6>] lod_striping_create+0x196/0x2f0 [lod]
 [<ffffffffa0d8393a>] lod_xattr_set+0x29a/0x7d0 [lod]
 [<ffffffffa0d6d2f9>] ? lod_trans_start+0x1b9/0x250 [lod]
 [<ffffffffa0d836a0>] ? lod_xattr_set+0x0/0x7d0 [lod]
 [<ffffffffa07c3195>] mdd_create_data+0x5d5/0x760 [mdd]
 [<ffffffffa0cbca6c>] mdt_finish_open+0x123c/0x1880 [mdt]
 [<ffffffffa0cb8258>] ? mdt_object_open_lock+0x1c8/0x510 [mdt]
 [<ffffffffa0cbd556>] mdt_open_by_fid_lock+0x4a6/0x7c0 [mdt]
 [<ffffffffa0cbe0cb>] mdt_reint_open+0x56b/0x20c0 [mdt]
 [<ffffffffa05377ae>] ? upcall_cache_get_entry+0x28e/0x860 [libcfs]
 [<ffffffffa085bc5c>] ? lustre_msg_add_version+0x6c/0xc0 [ptlrpc]
 [<ffffffffa06ff010>] ? lu_ucred+0x20/0x30 [obdclass]
 [<ffffffffa0c89385>] ? mdt_ucred+0x15/0x20 [mdt]
 [<ffffffffa0ca4fbc>] ? mdt_root_squash+0x2c/0x410 [mdt]
 [<ffffffffa0883b66>] ? __req_capsule_get+0x166/0x700 [ptlrpc]
 [<ffffffffa06ff010>] ? lu_ucred+0x20/0x30 [obdclass]
 [<ffffffffa0ca97b1>] mdt_reint_rec+0x41/0xe0 [mdt]
 [<ffffffffa0ca2df3>] mdt_reint_internal+0x4e3/0x7d0 [mdt]
 [<ffffffffa0ca33ad>] mdt_intent_reint+0x1ed/0x520 [mdt]
 [<ffffffffa0c9ef8e>] mdt_intent_policy+0x3ae/0x750 [mdt]
 [<ffffffffa0814351>] ldlm_lock_enqueue+0x361/0x8d0 [ptlrpc]
 [<ffffffffa083a347>] ldlm_handle_enqueue0+0x4f7/0x10b0 [ptlrpc]
 [<ffffffffa0c9f466>] mdt_enqueue+0x46/0x110 [mdt]
 [<ffffffffa0c910e8>] mdt_handle_common+0x648/0x1660 [mdt]
 [<ffffffffa0ccd335>] mds_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa086c29c>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
 [<ffffffffa051b5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa08638d9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffff81052223>] ? __wake_up+0x53/0x70
 [<ffffffffa086d7e5>] ptlrpc_main+0xb75/0x1870 [ptlrpc]
 [<ffffffffa086cc70>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa086cc70>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffffa086cc70>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Comment by Alex Zhuravlev [ 01/Apr/13 ]

hmm, what attributes first lod_declare_attr_set() was trying to set ?

Comment by John Hammond [ 01/Apr/13 ]

Working to reproduce that in gdb.

I got a similar one (triggered the LASSERT() on the next line) where the attrs were mtime, ctime, and size.

MOUNT_2=y llmount.sh
(cd /mnt/lustre; while true; do sys_setstripe -c1 f0; sys_unlink f0; done) &
(cd /mnt/lustre2; while true; do sys_truncate f0 117; done) &

(gdb) bt
#0  lbug_with_loc (msgdata=0xffffffffa0dd5080)
    at /root/lustre-release/libcfs/libcfs/linux/linux-debug.c:167
#1  0xffffffffa0db799a in osp_object_assign_fid (env=0xffff88016ba1f400,
    d=0xffff88017a7af000, o=0xffff880170909858)
    at /root/lustre-release/lustre/osp/osp_object.c:57
#2  0xffffffffa0db7b04 in osp_declare_attr_set (env=0xffff88016ba1f400,
    dt=0xffff8801709098b0, attr=0xffff88016c095160, th=<value optimized out>)
    at /root/lustre-release/lustre/osp/osp_object.c:100
#3  0xffffffffa0d86435 in dt_declare_attr_set (env=0xffff88016ba1f400,
    dt=0xffff88016aa5b130, attr=0xffff88016c095160, handle=0xffff88016a0b14c0)
    at /root/lustre-release/lustre/include/dt_object.h:1097
#4  lod_declare_attr_set (env=0xffff88016ba1f400, dt=0xffff88016aa5b130,
    attr=0xffff88016c095160, handle=0xffff88016a0b14c0)
    at /root/lustre-release/lustre/lod/lod_object.c:335
#5  0xffffffffa07bbc49 in dt_declare_attr_set (env=0xffff88016ba1f400,
    obj=0xffff88016a075d90, ma=0xffff88016c095148)
    at /root/lustre-release/lustre/include/dt_object.h:1097
#6  mdo_declare_attr_set (env=0xffff88016ba1f400, obj=0xffff88016a075d90,
    ma=0xffff88016c095148) at /root/lustre-release/lustre/mdd/mdd_internal.h:660
#7  mdd_declare_attr_set (env=0xffff88016ba1f400, obj=0xffff88016a075d90,
    ma=0xffff88016c095148) at /root/lustre-release/lustre/mdd/mdd_object.c:798
#8  mdd_attr_set (env=0xffff88016ba1f400, obj=0xffff88016a075d90, ma=0xffff88016c095148)
    at /root/lustre-release/lustre/mdd/mdd_object.c:858
#9  0xffffffffa0cb23c8 in mo_attr_set (info=0xffff88016c095000, mo=0xffff88016ab74168,
    ma=0xffff88016c095148, flags=<value optimized out>)
    at /root/lustre-release/lustre/include/md_object.h:574
#10 mdt_attr_set (info=0xffff88016c095000, mo=0xffff88016ab74168, ma=0xffff88016c095148,
    flags=<value optimized out>) at /root/lustre-release/lustre/mdt/mdt_reint.c:439
#11 0xffffffffa0cb2cb4 in mdt_reint_setattr (info=0xffff88016c095000,
    lhc=<value optimized out>) at /root/lustre-release/lustre/mdt/mdt_reint.c:575
#12 0xffffffffa0cac7b1 in mdt_reint_rec (info=0xffff88016c095000, lhc=0x0)
    at /root/lustre-release/lustre/mdt/mdt_reint.c:1352
#13 0xffffffffa0ca5df3 in mdt_reint_internal (info=0xffff88016c095000, lhc=0x0, op=1)
    at /root/lustre-release/lustre/mdt/mdt_handler.c:1755
#14 0xffffffffa0ca6124 in mdt_reint (info=0xffff88016c095000)
    at /root/lustre-release/lustre/mdt/mdt_handler.c:1816
#15 0xffffffffa0c940e8 in mdt_req_handle (req=0xffff88016a9ba400,
    supported=0xffffffffa0cf7860) at /root/lustre-release/lustre/mdt/mdt_handler.c:2983
#16 mdt_handle0 (req=0xffff88016a9ba400, supported=0xffffffffa0cf7860)
    at /root/lustre-release/lustre/mdt/mdt_handler.c:3337
#17 mdt_handle_common (req=0xffff88016a9ba400, supported=0xffffffffa0cf7860)
    at /root/lustre-release/lustre/mdt/mdt_handler.c:3379
#18 0xffffffffa0cd0335 in mds_regular_handle (req=<value optimized out>)
    at /root/lustre-release/lustre/mdt/mdt_mds.c:353
#19 0xffffffffa086f38c in ptlrpc_server_handle_request (svcpt=0xffff880181c00c00,
    thread=<value optimized out>) at /root/lustre-release/lustre/ptlrpc/service.c:1980
#20 0xffffffffa08708d5 in ptlrpc_main (arg=0xffff88016b9ce2c0)
    at /root/lustre-release/lustre/ptlrpc/service.c:2485
#21 0xffffffff8100c0ca in ?? () at arch/x86/kernel/entry_64.S:1211
#22 0x0000000000000000 in ?? ()


(gdb) f 4
#4  lod_declare_attr_set (env=0xffff88016ba1f400, dt=0xffff88016aa5b130,
    attr=0xffff88016c095160, handle=0xffff88016a0b14c0)
    at /root/lustre-release/lustre/lod/lod_object.c:335
335     RETURN(rc);

(gdb) p/x *attr
$5 = {
  la_size = 0x75,
  la_mtime = 0x5159a7e2,
  la_atime = 0x5159a7e2,
  la_ctime = 0x5159a7e2,
  la_blocks = 0x0,
  la_mode = 0x7f6a,
  la_uid = 0x96017240,
  la_gid = 0xffff8801,
  la_flags = 0x0,
  la_nlink = 0x0,
  la_blkbits = 0x0,
  la_blksize = 0x0,
  la_rdev = 0x0,
  la_valid = 0xe /* LA_MTIME | LA_CTIME | LA_SIZE */
}


(gdb) f 4
#4  lod_declare_attr_set (env=0xffff88016ba1f400, dt=0xffff88016aa5b130,
    attr=0xffff88016c095160, handle=0xffff88016a0b14c0)
    at /root/lustre-release/lustre/lod/lod_object.c:335
335     RETURN(rc);

(gdb) p/x *(struct lod_object *) 0xffff88016aa5b130
$15 = {
  ldo_obj = {
    do_lu = {
      lo_header = 0xffff88016ab74168,
      lo_dev = 0xffff8801802d0000,
      lo_ops = 0xffffffffa0d97c60,
      lo_linkage = {
        next = 0xffff88016a06d098,
        prev = 0xffff88016a075da8
      },
      lo_depth = 0x0,
      lo_flags = 0x1,
      lo_dev_ref = 0x0
    },
    do_ops = 0xffffffffa0d97b80,
    do_body_ops = 0x0,
    do_index_ops = 0x0,
    do_lock_ops = 0xffffffffa0d97c58
  },
  ldo_stripe_sem = {
    lock = {
      raw_lock = {
        slock = 0x0
      }
    },
    count = 0x1,
    wait_list = {
      next = 0xffff88016aa5b190,
      prev = 0xffff88016aa5b190
    }
  },
  ldo_stripenr = 0x1,
  ldo_layout_gen = 0x0,
  ldo_stripe_size = 0x100000,
  ldo_pool = 0xffff88016a0a3e80,
  ldo_stripe = 0xffff88016a0a3e60,
  ldo_stripes_allocated = 0x1,
  ldo_striping_cached = 0x0,
  ldo_def_striping_set = 0x0,
  ldo_def_stripe_size = 0x0,
  ldo_def_stripenr = 0x1,
  ldo_def_stripe_offset = 0xffff,
  ldo_mds_num = 0x0
}


(gdb) f 1
#1  0xffffffffa0db799a in osp_object_assign_fid (env=0xffff88016ba1f400,
    d=0xffff88017a7af000, o=0xffff880170909858)
    at /root/lustre-release/lustre/osp/osp_object.c:57
57     LASSERT(o->opo_reserved);

$19 = {
  opo_header = {
    loh_flags = 0x0,
    loh_ref = {
      counter = 0x1
    },
    loh_fid = {
      f_seq = 0x100000000,
      f_oid = 0x15f,
      f_ver = 0x0
    },
    loh_attr = 0x0,
    loh_hash = {
      next = 0x0,
      pprev = 0xffffc900076637d8
    },
    loh_lru = {
      next = 0xffff880170909890,
      prev = 0xffff880170909890
    },
    loh_layers = {
      next = 0xffff8801709098c8,
      prev = 0xffff8801709098c8
    },
    loh_reference = {<No data fields>}
  },
  opo_obj = {
    do_lu = {
      lo_header = 0xffff880170909858,
      lo_dev = 0xffff88017a7af000,
      lo_ops = 0xffffffffa0dd4d00,
      lo_linkage = {
        next = 0xffff8801709098a0,
        prev = 0xffff8801709098a0
      },
      lo_depth = 0x0,
      lo_flags = 0x1,
      lo_dev_ref = 0x0
    },
    do_ops = 0xffffffffa0dd4c20,
    do_body_ops = 0x0,
    do_index_ops = 0x0,
    do_lock_ops = 0x0
  },
  opo_reserved = 0x0,
  opo_new = 0x1,
  opo_empty = 0x0,
  opo_sem = {
    count = 0x0,
    wait_lock = {
      raw_lock = {
        slock = 0x0
      }
    },
    wait_list = {
      next = 0x0,
      prev = 0x0
    }
  },
  opo_owner = 0x0
}
Comment by John Hammond [ 01/Apr/13 ]

BTW, with both patches (http://review.whamcloud.com/#change,5802 and http://review.whamcloud.com/5839) I do not see any LBUGs from setstripe vs truncate.

It looks like the remaining case (chown()) could be handled by (conditionally) acquiring mot_lov_mutex in mdt_attr_set().

Comment by Cory Spitz [ 03/Jul/13 ]

John, do you plan on refreshing #5839?

Comment by John Hammond [ 04/Jul/13 ]

Cory, yes, eventually. Are you seeing this assertion in production?

Comment by Cory Spitz [ 08/Jul/13 ]

John, not exactly production, but a small test system. Upon initial look, it seemed that #5839 could solve/fix our problem.

Comment by Patrick Farrell (Inactive) [ 12/Jul/13 ]

John,

We've recently run in to the osp_sync_add_rec assertion on a system with 5802 and 5839. Sounds like you expected that, but I just wanted to let you know.

Comment by Patrick Farrell (Inactive) [ 02/Aug/13 ]

I tested and confirmed John's assertion that taking lov_mutex resolves the race condition.

Patch for the last race condition is here. Verified to resolve the issue, and I don't think the condition for taking the lock can be any more specific to chown:
http://review.whamcloud.com/7223

Comment by Alexey Lyashkov [ 06/Sep/13 ]

any of these patches don't solve whole issue.
we still have a race when create object insert between declare and execute operation.

Comment by Alex Zhuravlev [ 06/Sep/13 ]

mot_lov_mutex should solve the race given it protects a chain of declare/execution.

Comment by Oleg Drokin [ 10/Oct/13 ]

there's LU-4083 with a patch that is probably related.

Comment by Ryan Haasken [ 30/Oct/13 ]

We've run into a failed assertion added in patch 5839 on a system running Lustre 2.5 with patches 5802, 5839, and 7223 applied. This is the failed assertion:

LustreError: 3407:0:(lod_object.c:993:lod_striping_create()) ASSERTION( lo->ldo_stripe != ((void *)0) && lo->ldo_stripenr > 0 ) failed:
Comment by Ryan Haasken [ 16/Dec/13 ]

The patch for LU-4083 (7919) looks related to this issue, but LU-4083 is in the rename and unlink paths. This bug is in the setattr path. Since http://review.whamcloud.com/#/c/5839/ was abandoned and obsoleted by 7919, does that imply that 7919 fixes this bug? It doesn't look like it covers the setattr path.

I think Patrick's patch at http://review.whamcloud.com/#/c/7223/ attempts to fix the setattr path. The complaint on that patch is that it looks into LOD internals. However, http://review.whamcloud.com/#/c/7919/4 does the same thing, and that has landed. What is preventing Patrick's change from landing?

In Patrick's patch (7223), would it be preferable to lock mot_lov_mutex in mdt_reint_setattr around the call to mdt_attr_set? That would match the way patch 7919 locks the mutex in mdt_reint_unlink around the call to mdo_unlink. I'm not sure if that would be correct though.

Please share your thoughts on the relationship between patches 5839, 7919, and 7223. Thanks!

Comment by Cory Spitz [ 18/Dec/13 ]

The fix version for this bug should probably be 2.6 (and 2.5.1 at least).

Comment by Ryan Haasken [ 20/Jun/14 ]

I think that http://review.whamcloud.com/#/c/7919 resolves the race condition in the unlink and rename paths, but isn't http://review.whamcloud.com/#/c/7223/3 still necessary for the setattr case? After 7223 is landed, will this bug be ready to close?

7223 was rejected on the basis that the mdt layer should not look into lod internals, but doesn't 7919 do the same thing? Can we land 7223 as a short-term solution for the setattr case?

Generated at Sat Feb 10 01:28:12 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.