[LU-3917] lod_initialize_objects()) ASSERTION( cfs_bitmap_check(md->lod_ost_descs.ltd_tgt_bitmap, idx) ) failed Created: 10/Sep/13 Updated: 16/Oct/13 Resolved: 23/Sep/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.5.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Ned Bass | Assignee: | Zhenyu Xu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Environment: | lustre-2.4.0-15chaos_2.6.32_358.14.1.2chaos.ch5.1.1.ch5.1.1.x86_64 |
| Attachments: |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 10350 |
| Description |
|
MDS crashes with summary assertion after upgrade from 2.1 to 2.4. Added -o abort_recov mount option but MDS still crashes each time it is started.

PID: 4684 TASK: ffff880628b33540 CPU: 12 COMMAND: "tgt_recov"
#0 [ffff880628b35a48] machine_kexec at ffffffff81035fcb
#1 [ffff880628b35aa8] crash_kexec at ffffffff810c10b2
#2 [ffff880628b35b78] panic at ffffffff81510333
#3 [ffff880628b35bf8] lbug_with_loc at ffffffffa0507f4b [libcfs]
#4 [ffff880628b35c18] lod_initialize_objects at ffffffffa1141d3b [lod]
#5 [ffff880628b35ca8] lod_parse_striping at ffffffffa11421e1 [lod]
#6 [ffff880628b35cd8] lod_load_striping at ffffffffa1143c44 [lod]
#7 [ffff880628b35d18] lod_declare_object_destroy at ffffffffa114f6db [lod]
#8 [ffff880628b35d48] __mdd_orphan_cleanup at ffffffffa0e190a9 [mdd]
#9 [ffff880628b35de8] mdd_recovery_complete at ffffffffa0e2833d [mdd]
#10 [ffff880628b35e18] mdt_postrecov at ffffffffa1079cb5 [mdt]
#11 [ffff880628b35e38] mdt_obd_postrecov at ffffffffa107b178 [mdt]
#12 [ffff880628b35ea8] target_recovery_thread at ffffffffa09a6ca4 [ptlrpc]
#13 [ffff880628b35f48] kernel_thread at ffffffff8100c10a

ZFS: Loaded module v0.6.2-1.2, ZFS pool version 5000, ZFS filesystem version 5
Lustre: Lustre: Build Version: 2.4.0-15chaos-15chaos--PRISTINE-2.6.32-358.14.1.2chaos.ch5.1.1.x86_64
LDISKFS-fs (sdb): mounted filesystem with ordered data mode. quota=off. Opts:
Lustre: lsc-MDT0000: Not available for connect from 192.168.117.178@o2ib10 (not set up)
Lustre: 4673:0:(mdt_handler.c:4947:mdt_process_config()) For interoperability, skip this mdd.quota_type. It is obsolete.
LustreError: 11-0: lsc-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
LustreError: 4469:0:(mdt_handler.c:5930:mdt_iocontrol()) lsc-MDT0000: Aborting recovery for device
Lustre: lsc-MDT0000: Aborting recovery
LustreError: 4684:0:(lod_lov.c:706:lod_initialize_objects()) ASSERTION( cfs_bitmap_check(md->lod_ost_descs.ltd_tgt_bitmap, idx) ) failed:
LustreError: 4684:0:(lod_lov.c:706:lod_initialize_objects()) LBUG
Pid: 4684, comm: tgt_recov

Call Trace:
[<ffffffffa05078f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0507ef7>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa1141d3b>] lod_initialize_objects+0x98b/0xc30 [lod]
[<ffffffffa11421e1>] lod_parse_striping+0x201/0x300 [lod]
[<ffffffffa1143c44>] lod_load_striping+0x2a4/0x4b0 [lod]
[<ffffffffa114f6db>] lod_declare_object_destroy+0x16b/0x390 [lod]
[<ffffffffa0e190a9>] __mdd_orphan_cleanup+0x7d9/0xca0 [mdd]
[<ffffffffa0e2833d>] mdd_recovery_complete+0xed/0x170 [mdd]
[<ffffffffa1079cb5>] mdt_postrecov+0x35/0xd0 [mdt]
[<ffffffffa107b178>] mdt_obd_postrecov+0x78/0x90 [mdt]
[<ffffffffa09964e0>] ? ldlm_reprocess_res+0x0/0x20 [ptlrpc]
[<ffffffffa099189e>] ? ldlm_reprocess_all_ns+0x3e/0x110 [ptlrpc]
[<ffffffffa09a6ca4>] target_recovery_thread+0xc64/0x1980 [ptlrpc]
[<ffffffffa09a6040>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
[<ffffffff8100c10a>] child_rip+0xa/0x20
[<ffffffffa09a6040>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
[<ffffffffa09a6040>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
[<ffffffff8100c100>] ? child_rip+0x0/0x20

Kernel panic - not syncing: LBUG
Pid: 4684, comm: tgt_recov Tainted: P --------------- 2.6.32-358.14.1.2chaos.ch5.1.1.x86_64 #1

Call Trace:
[<ffffffff8151032c>] ? panic+0xa7/0x16f
[<ffffffffa0507f4b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[<ffffffffa1141d3b>] ? lod_initialize_objects+0x98b/0xc30 [lod]
[<ffffffffa11421e1>] ? lod_parse_striping+0x201/0x300 [lod]
[<ffffffffa1143c44>] ? lod_load_striping+0x2a4/0x4b0 [lod]
[<ffffffffa114f6db>] ? lod_declare_object_destroy+0x16b/0x390 [lod]
[<ffffffffa0e190a9>] ? __mdd_orphan_cleanup+0x7d9/0xca0 [mdd]
[<ffffffffa0e2833d>] ? mdd_recovery_complete+0xed/0x170 [mdd]
[<ffffffffa1079cb5>] ? mdt_postrecov+0x35/0xd0 [mdt]
[<ffffffffa107b178>] ? mdt_obd_postrecov+0x78/0x90 [mdt]
[<ffffffffa09964e0>] ? ldlm_reprocess_res+0x0/0x20 [ptlrpc]
[<ffffffffa099189e>] ? ldlm_reprocess_all_ns+0x3e/0x110 [ptlrpc]
[<ffffffffa09a6ca4>] ? target_recovery_thread+0xc64/0x1980 [ptlrpc]
[<ffffffffa09a6040>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
[<ffffffff8100c10a>] ? child_rip+0xa/0x20
[<ffffffffa09a6040>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
[<ffffffffa09a6040>] ? target_recovery_thread+0x0/0x1980 [ptlrpc]
[<ffffffff8100c100>] ? child_rip+0x0/0x20

REWRITING MCP55 CFG REG
CFG = c1 |
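For triage, the numbered frames of the backtrace in the description can be split into frame number, function, and module mechanically. The following is a minimal Python sketch, not part of the original report. The regular expression is an assumption based only on the frame layout shown here (`#N [stack] function at address [module]`), and `parse_backtrace` is a hypothetical helper name.

```python
import re

# One frame of the backtrace, e.g.
#   #4 [ffff880628b35c18] lod_initialize_objects at ffffffffa1141d3b [lod]
# The trailing "[module]" is optional: core kernel frames do not carry one.
FRAME_RE = re.compile(
    r"#(?P<num>\d+) \[(?P<sp>[0-9a-f]+)\] (?P<func>\S+) at (?P<addr>[0-9a-f]+)"
    r"(?: \[(?P<module>\w+)\])?"
)


def parse_backtrace(text):
    """Return a list of (frame_number, function, module_or_None) tuples."""
    return [
        (int(m.group("num")), m.group("func"), m.group("module"))
        for m in FRAME_RE.finditer(text)
    ]


# A few frames copied from the description, still flattened onto one line.
SAMPLE = (
    "#2 [ffff880628b35b78] panic at ffffffff81510333 "
    "#3 [ffff880628b35bf8] lbug_with_loc at ffffffffa0507f4b [libcfs] "
    "#4 [ffff880628b35c18] lod_initialize_objects at ffffffffa1141d3b [lod] "
    "#5 [ffff880628b35ca8] lod_parse_striping at ffffffffa11421e1 [lod]"
)

if __name__ == "__main__":
    for num, func, module in parse_backtrace(SAMPLE):
        print(num, func, module or "(kernel)")
```

Because the pattern uses `finditer`, it works on the flattened single-line form of the trace as well as on one frame per line.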
| Comments |
| Comment by Ned Bass [ 10/Sep/13 ] |
|
Possibly related to |
| Comment by Christopher Morrone [ 10/Sep/13 ] |
|
Note that this issue is Severity 1. This filesystem is completely out of action until the problem is resolved. |
| Comment by Peter Jones [ 10/Sep/13 ] |
|
Oleg, can you please look into this one? Thanks, Peter |
| Comment by Oleg Drokin [ 10/Sep/13 ] |
|
This is indeed the same assert as referenced by
Also, I assume you have a crashdump; can we get a debug log out of it (with some higher debug setting enabled)? |
| Comment by Christopher Morrone [ 10/Sep/13 ] |
|
Another point that may be significant is that this is an upgrade of an older ldiskfs filesystem that was probably formatted under 1.8 or earlier. As such, in order for the upgraded 2.4 filesystem to work with 2.4 clients, we needed to writeconf the filesystem. Perhaps the lack of any registered OSTs at this stage explains the assertion. |
| Comment by Christopher Morrone [ 10/Sep/13 ] |
|
I'm in the process of building a test tag with the |
| Comment by Christopher Morrone [ 10/Sep/13 ] |
|
FYI, Ned found that moving the PENDING directory out of the way allowed us to boot without a crash. We are going to put the patch in, restore the PENDING directory, and see if the patch allows a clean boot with the PENDING contents in place. |
| Comment by Ned Bass [ 10/Sep/13 ] |
|
Attached full debug log lustre.log.LU-3917.gz |
| Comment by Ned Bass [ 10/Sep/13 ] |
|
The |
| Comment by Oleg Drokin [ 10/Sep/13 ] |
|
So from the logs it looks like no attempts to bring up any OSPs were made, possibly due to old config logs that don't list any targets? Can you please show what's in the config logs? Use llog_read to read them off the directly mounted MGS filesystem. |
| Comment by Ned Bass [ 10/Sep/13 ] |
|
Hi Oleg, as Chris mentioned above we had performed a writeconf to clear the config logs. |
| Comment by Oleg Drokin [ 10/Sep/13 ] |
|
Ah, well, then that's why it crashed. |
| Comment by Oleg Drokin [ 10/Sep/13 ] |
|
Anyway, crashing is not normal behavior, of course, so I'll update the fixing patch. |
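The failure mode discussed in this thread can be modelled outside the kernel. The assertion requires that every OST index referenced by an object's striping is set in the target bitmap. After a writeconf, with no OSTs registered yet, that bitmap is empty, so orphan cleanup of a striped file in PENDING trips the check. The sketch below is a simplified Python model under those assumptions. It is not Lustre code and not the content of the fixing patch. `TargetTable`, `init_objects_strict`, and `init_objects_tolerant` are hypothetical names, and the choice of `ENODEV` as the error code is an assumption.

```python
import errno


class TargetTable:
    """Toy stand-in for the set of OST indices the MDT currently knows about."""

    def __init__(self):
        self.bitmap = 0  # bit i set => OST index i has been registered

    def register(self, idx):
        self.bitmap |= 1 << idx

    def check(self, idx):
        return bool((self.bitmap >> idx) & 1)


def init_objects_strict(targets, stripe_indices):
    """Model of the crashing behavior: an unknown index is treated as a bug."""
    for idx in stripe_indices:
        assert targets.check(idx), f"OST index {idx} not in target bitmap"
    return 0


def init_objects_tolerant(targets, stripe_indices):
    """Hypothetical alternative: report an unknown index as an error code."""
    for idx in stripe_indices:
        if not targets.check(idx):
            return -errno.ENODEV  # assumed error code, for illustration only
    return 0


if __name__ == "__main__":
    targets = TargetTable()  # state after writeconf: no OSTs registered
    print(init_objects_tolerant(targets, [0, 1]))  # negative errno, no crash
    targets.register(0)
    targets.register(1)
    print(init_objects_tolerant(targets, [0, 1]))  # 0 once OSTs are known
```

In this model the strict variant raises `AssertionError` on the empty table, which corresponds to the LBUG, while the tolerant variant lets the caller skip or defer the object.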
| Comment by Oleg Drokin [ 10/Sep/13 ] |
|
BTW, MGS is still supposed to regenerate the config logs, but I imagine you did not start OSTs before starting MDT? |
| Comment by Ned Bass [ 10/Sep/13 ] |
|
Oleg, that's right, we started the MDT/MGS before the OSTs. |
| Comment by Ned Bass [ 10/Sep/13 ] |
|
Our plan for future updates is to restart the MDS before writeconf to force orphan cleanup. |
| Comment by Oleg Drokin [ 11/Sep/13 ] |
|
Just to be sure I just reproduced this on master with these steps: |
| Comment by Jodi Levi (Inactive) [ 13/Sep/13 ] |
|
Duplicate of |
| Comment by Christopher Morrone [ 13/Sep/13 ] |
|
This is related to
In this ticket, we need to deal with files in the PENDING directory after a writeconf of the filesystem (a writeconf is currently required when upgrading to 2.4, otherwise 2.4 clients won't work).
FYI, we are not doing any further 2.4 server upgrades at LLNL until a couple of things are fixed, and this is one of the things that needs to be fixed. |
| Comment by Peter Jones [ 13/Sep/13 ] |
|
Bobijam, could you please continue with this effort? Oleg will be at LAD next week. Thanks, Peter |
| Comment by Zhenyu Xu [ 18/Sep/13 ] |
|
I tried patch http://review.whamcloud.com/7234, it addresses this issue, and I can't reproduce the issue stated in |
| Comment by Andreas Dilger [ 23/Sep/13 ] |
|
Closing this as a duplicate of |