[LU-3917] lod_initialize_objects()) ASSERTION( cfs_bitmap_check(md->lod_ost_descs.ltd_tgt_bitmap, idx) ) failed Created: 10/Sep/13  Updated: 16/Oct/13  Resolved: 23/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
Fix Version/s: None

Type: Bug
Priority: Blocker
Reporter: Ned Bass
Assignee: Zhenyu Xu
Resolution: Duplicate
Votes: 0
Labels: None
Environment:

lustre-2.4.0-15chaos_2.6.32_358.14.1.2chaos.ch5.1.1.ch5.1.1.x86_64


Attachments: lustre.log.LU-3917.gz
Issue Links:
Duplicate
is duplicated by LU-4005 Test failure on test suite recovery-d... Closed
Related
is related to LU-3161 LASSERT() in lod_initialize_objects() Resolved
is related to LU-3918 kernel NULL pointer dereference at lo... Resolved
Severity: 3
Rank (Obsolete): 10350

 Description   

The MDS crashes with the assertion in the summary after an upgrade from 2.1 to 2.4. We added the -o abort_recov mount option, but the MDS still crashes each time it is started.

PID: 4684   TASK: ffff880628b33540  CPU: 12  COMMAND: "tgt_recov"
 #0 [ffff880628b35a48] machine_kexec at ffffffff81035fcb
 #1 [ffff880628b35aa8] crash_kexec at ffffffff810c10b2
 #2 [ffff880628b35b78] panic at ffffffff81510333
 #3 [ffff880628b35bf8] lbug_with_loc at ffffffffa0507f4b [libcfs]
 #4 [ffff880628b35c18] lod_initialize_objects at ffffffffa1141d3b [lod]
 #5 [ffff880628b35ca8] lod_parse_striping at ffffffffa11421e1 [lod]
 #6 [ffff880628b35cd8] lod_load_striping at ffffffffa1143c44 [lod]
 #7 [ffff880628b35d18] lod_declare_object_destroy at ffffffffa114f6db [lod]
 #8 [ffff880628b35d48] __mdd_orphan_cleanup at ffffffffa0e190a9 [mdd]
 #9 [ffff880628b35de8] mdd_recovery_complete at ffffffffa0e2833d [mdd]
#10 [ffff880628b35e18] mdt_postrecov at ffffffffa1079cb5 [mdt]
#11 [ffff880628b35e38] mdt_obd_postrecov at ffffffffa107b178 [mdt]
#12 [ffff880628b35ea8] target_recovery_thread at ffffffffa09a6ca4 [ptlrpc]
#13 [ffff880628b35f48] kernel_thread at ffffffff8100c10a
ZFS: Loaded module v0.6.2-1.2, ZFS pool version 5000, ZFS filesystem version 5
Lustre: Lustre: Build Version: 2.4.0-15chaos-15chaos--PRISTINE-2.6.32-358.14.1.2chaos.ch5.1.1.x86_64
LDISKFS-fs (sdb): mounted filesystem with ordered data mode. quota=off. Opts: 
Lustre: lsc-MDT0000: Not available for connect from 192.168.117.178@o2ib10 (not set up)
Lustre: 4673:0:(mdt_handler.c:4947:mdt_process_config()) For interoperability, skip this mdd.quota_type. It is obsolete.
LustreError: 11-0: lsc-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
LustreError: 4469:0:(mdt_handler.c:5930:mdt_iocontrol()) lsc-MDT0000: Aborting recovery for device
Lustre: lsc-MDT0000: Aborting recovery
LustreError: 4684:0:(lod_lov.c:706:lod_initialize_objects()) ASSERTION( cfs_bitmap_check(md->lod_ost_descs.ltd_tgt_bitmap, idx) ) failed: 
LustreError: 4684:0:(lod_lov.c:706:lod_initialize_objects()) LBUG
Pid: 4684, comm: tgt_recov

Call Trace:
 [<ffffffffa05078f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0507ef7>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa1141d3b>] lod_initialize_objects+0x98b/0xc30 [lod]
 [<ffffffffa11421e1>] lod_parse_striping+0x201/0x300 [lod]
 [<ffffffffa1143c44>] lod_load_striping+0x2a4/0x4b0 [lod]
 [<ffffffffa114f6db>] lod_declare_object_destroy+0x16b/0x390 [lod]
 [<ffffffffa0e190a9>] __mdd_orphan_cleanup+0x7d9/0xca0 [mdd]
 [<ffffffffa0e2833d>] mdd_recovery_complete+0xed/0x170 [mdd]
 [<ffffffffa1079cb5>] mdt_postrecov+0x35/0xd0 [mdt]
 [<ffffffffa107b178>] mdt_obd_postrecov+0x78/0x90 [mdt]
 [<ffffffffa09964e0>] ? ldlm_reprocess_res+0x0/0x20 [ptlrpc]
 [<ffffffffa099189e>] ? ldlm_reprocess_all_ns+0x3e/0x110 [ptlrpc]
 [<ffffffffa09a6ca4>] target_recovery_thread+0xc64/0x1980 [ptlrpc]
 [<ffffffffa09a6040>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
 [<ffffffff8100c10a>] child_rip+0xa/0x20
 [<ffffffffa09a6040>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
 [<ffffffffa09a6040>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
 [<ffffffff8100c100>] ? child_rip+0x0/0x20

Kernel panic - not syncing: LBUG
Pid: 4684, comm: tgt_recov Tainted: P           ---------------    2.6.32-358.14.1.2chaos.ch5.1.1.x86_64 #1
Call Trace:
 [<ffffffff8151032c>] ? panic+0xa7/0x16f
 [<ffffffffa0507f4b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa1141d3b>] ? lod_initialize_objects+0x98b/0xc30 [lod]
 [<ffffffffa11421e1>] ? lod_parse_striping+0x201/0x300 [lod]
 [<ffffffffa1143c44>] ? lod_load_striping+0x2a4/0x4b0 [lod]
 [<ffffffffa114f6db>] ? lod_declare_object_destroy+0x16b/0x390 [lod]
 [<ffffffffa0e190a9>] ? __mdd_orphan_cleanup+0x7d9/0xca0 [mdd]
 [<ffffffffa0e2833d>] ? mdd_recovery_complete+0xed/0x170 [mdd]
 [<ffffffffa1079cb5>] ? mdt_postrecov+0x35/0xd0 [mdt]
 [<ffffffffa107b178>] ? mdt_obd_postrecov+0x78/0x90 [mdt]
 [<ffffffffa09964e0>] ? ldlm_reprocess_res+0x0/0x20 [ptlrpc]
 [<ffffffffa099189e>] ? ldlm_reprocess_all_ns+0x3e/0x110 [ptlrpc]
 [<ffffffffa09a6ca4>] ? target_recovery_thread+0xc64/0x1980 [ptlrpc]
 [<ffffffffa09a6040>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
 [<ffffffff8100c10a>] ? child_rip+0xa/0x20
 [<ffffffffa09a6040>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
 [<ffffffffa09a6040>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
 [<ffffffff8100c100>] ? child_rip+0x0/0x20
REWRITING MCP55 CFG REG
CFG = c1


 Comments   
Comment by Ned Bass [ 10/Sep/13 ]

Possibly related to LU-3161

Comment by Christopher Morrone [ 10/Sep/13 ]

Note that this issue is Severity 1. This filesystem is completely out of action until the problem is resolved.

Comment by Peter Jones [ 10/Sep/13 ]

Oleg

Can you please look into this one?

Thanks

Peter

Comment by Oleg Drokin [ 10/Sep/13 ]

This is indeed the same assert as referenced by LU-3161, and the patch referenced there, http://review.whamcloud.com/#/c/7234/6, should stop this assertion. As an added benefit, it would print the values the assertion is unhappy about.

Also, I assume you have a crashdump. Can we get a debug log out of it (with some higher debug setting enabled)?

Comment by Christopher Morrone [ 10/Sep/13 ]

Another point that may be significant is that this is an upgrade of an older ldiskfs filesystem that was probably formatted under 1.8 or earlier. As such, in order for the upgraded 2.4 filesystem to work with 2.4 clients, we needed to writeconf the filesystem.

Perhaps the lack of any registered OSTs at this stage explains the assertion.

Comment by Christopher Morrone [ 10/Sep/13 ]

I'm in the process of building a test tag with the LU-3161 patch.

Comment by Christopher Morrone [ 10/Sep/13 ]

FYI, Ned found that moving the PENDING directory out of the way allowed us to boot without a crash. We are going to put the patch in, restore the PENDING directory, and see if the patch allows a clean boot with the PENDING contents in place.

Comment by Ned Bass [ 10/Sep/13 ]

Attached full debug log lustre.log.LU-3917.gz

Comment by Ned Bass [ 10/Sep/13 ]

The LU-3161 patch appears to address this assertion. However, we still don't make it past lod_initialize_objects() due to a NULL pointer dereference reported in LU-3918.

Comment by Oleg Drokin [ 10/Sep/13 ]

So from the logs it looks like no attempts to bring up any OSPs were made, possibly due to old config logs that don't list any targets?

Can you please show what's in the config logs? Use llog_read to read them directly off the mounted MGS filesystem.

Comment by Ned Bass [ 10/Sep/13 ]

Hi Oleg, as Chris mentioned above we had performed a writeconf to clear the config logs.

Comment by Oleg Drokin [ 10/Sep/13 ]

Ah, well, then that's why it crashed.

Comment by Oleg Drokin [ 10/Sep/13 ]

Anyway, crashing is not normal behavior, of course, so I'll update the fixing patch.
The other problem is that object leakage is bound to occur here, because the orphan cleanup would not happen (no OSTs are connected to send destroy requests to).
I wonder how to better address this.
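
For readers following the thread, the failure mode can be illustrated with a toy Python model. This is a sketch only, not Lustre code: the class and function names are hypothetical, the set stands in for ltd_tgt_bitmap, and the choice of -EINVAL as the error code is an assumption rather than what the LU-3161 patch actually returns. It shows why a file left in PENDING trips the check when no OSTs are registered after a writeconf, and how an error return differs from an assertion.

```python
# Toy model only: names echo the log message, not the real Lustre C code.
import errno


class TargetTable:
    """Hypothetical stand-in for the table of OST targets known to LOD."""

    def __init__(self):
        self.bitmap = set()  # indices of OSTs currently registered

    def register(self, idx):
        self.bitmap.add(idx)

    def check(self, idx):
        # Plays the role of cfs_bitmap_check(ltd_tgt_bitmap, idx).
        return idx in self.bitmap


def initialize_objects_strict(table, stripe_ost_indices):
    """Asserting variant: an unknown OST index is fatal (like an LBUG)."""
    for idx in stripe_ost_indices:
        assert table.check(idx), f"OST index {idx} not in target bitmap"
    return 0


def initialize_objects_tolerant(table, stripe_ost_indices):
    """Tolerant variant: report the offending index and return an error."""
    for idx in stripe_ost_indices:
        if not table.check(idx):
            print(f"stripe references unregistered OST index {idx}")
            return -errno.EINVAL  # assumed error code, for illustration
    return 0


if __name__ == "__main__":
    table = TargetTable()    # empty: no OSTs registered after writeconf
    orphan_stripes = [0, 1]  # hypothetical layout of a file in PENDING
    rc = initialize_objects_tolerant(table, orphan_stripes)
    print("tolerant variant returned", rc)
```

In this model the tolerant variant avoids the crash, but the orphan's objects are still never destroyed, which is the leakage concern raised above.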

Comment by Oleg Drokin [ 10/Sep/13 ]

BTW, MGS is still supposed to regenerate the config logs, but I imagine you did not start OSTs before starting MDT?

Comment by Ned Bass [ 10/Sep/13 ]

Oleg, that's right, we started the MDT/MGS before the OSTs.

Comment by Ned Bass [ 10/Sep/13 ]

Our plan for future updates is to restart the MDS before writeconf to force orphan cleanup.

Comment by Oleg Drokin [ 11/Sep/13 ]

Just to be sure, I reproduced this on master with these steps:
1. mount lustre
2. cat /dev/zero >/mnt/lustre/file
3. ^Z
4. rm /mnt/lustre/file
5. umount /mnt/mds1
6. tunefs.lustre --writeconf /path/to/mds/device
7. mount /path/to/mds/device /mnt -t lustre

Comment by Jodi Levi (Inactive) [ 13/Sep/13 ]

Duplicate of LU-3161

Comment by Christopher Morrone [ 13/Sep/13 ]

This is related to LU-3161, but not a duplicate. The LU-3161 work does not go far enough to deal with the problem in this ticket.

In this ticket, we need to deal with files in the PENDING directory after a writeconf of the filesystem (a writeconf is currently required when upgrading to 2.4, otherwise 2.4 clients won't work).

FYI, we are not doing any further 2.4 server upgrades at LLNL until a couple of things are fixed, and this is one of the things that needs to be fixed.

Comment by Peter Jones [ 13/Sep/13 ]

Bobijam

Could you please continue with this effort? Oleg will be at LAD next week.

Thanks

Peter

Comment by Zhenyu Xu [ 18/Sep/13 ]

I tried patch http://review.whamcloud.com/7234. It addresses this issue, and I can't reproduce the issue stated in LU-3918.

Comment by Andreas Dilger [ 23/Sep/13 ]

Closing this as a duplicate of LU-3161, which has a patch already.

Generated at Sat Feb 10 01:38:02 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.