[LU-7419] llog corruption after hitting ASSERTION( handle->lgh_hdr == ((void *)0) ) in llog_init_handle - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Blocker
Fix Version/s: Lustre 2.8.0
Affects Version/s: Lustre 2.8.0
Labels:
- llnl
Environment:
2.6.32-504.16.2.1chaos.ch5.3.x86_64

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

A kernel panic occurred while initializing a new llog catalog record (details below) and an assertion was hit that was attempting to ensure that the llog_handle had a NULL pointer to a llog header struct. The panic occurred while I was running an mdtest job which was writing to a striped directory from 32 client nodes running 4 threads each. The panic caused an llog file to become corrupt. I manually repaired the llog file and restarted my MDSs and recovery now completes.

To summarize my setup, I am running a test cluster with 4 MDSs, 2 OSTs, and 32 client nodes with the filesystem mounted. No failover. The lustre version running is 2.7.62. The error messages and call stack from MDS1 are below:

2015-11-04 13:36:26 LustreError: 38007:0:(llog.c:342:llog_init_handle()) ASSERTION( handle->lgh_hdr == ((void *)0) ) failed:
2015-11-04 13:36:26 LustreError: 38007:0:(llog.c:342:llog_init_handle()) LBUG
2015-11-04 13:36:26 Pid: 38007, comm: mdt01_005
2015-11-04 13:36:26 Nov  4 13:36:26
...
2015-11-04 13:36:26 Kernel panic - not syncing: LBUG
2015-11-04 13:36:26 Pid: 38007, comm: mdt01_005 Tainted: P           ---------------    2.6.32-504.16.2.1chaos.ch5.3.x86_64 #1
2015-11-04 13:36:26 Call Trace:
2015-11-04 13:36:26  [<ffffffff8152d471>] ? panic+0xa7/0x16f
2015-11-04 13:36:26  [<ffffffffa0847f2b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
2015-11-04 13:36:26  [<ffffffffa09a62cf>] ? llog_init_handle+0x86f/0xb10 [obdclass]
2015-11-04 13:36:26  [<ffffffffa09ac809>] ? llog_cat_new_log+0x3d9/0xdc0 [obdclass]
2015-11-04 13:36:26  [<ffffffffa09a4663>] ? llog_declare_write_rec+0x93/0x210 [obdclass]
2015-11-04 13:36:26  [<ffffffffa09ad616>] ? llog_cat_declare_add_rec+0x426/0x430 [obdclass]
2015-11-04 13:36:26  [<ffffffffa09a406f>] ? llog_declare_add+0x7f/0x1b0 [obdclass]
2015-11-04 13:36:26  [<ffffffffa0c9c19c>] ? top_trans_start+0x17c/0x960 [ptlrpc]
2015-11-04 13:36:26  [<ffffffffa127cc11>] ? lod_trans_start+0x61/0x70 [lod]
2015-11-04 13:36:26  [<ffffffffa13248b4>] ? mdd_trans_start+0x14/0x20 [mdd]
2015-11-04 13:36:26  [<ffffffffa1313333>] ? mdd_create+0xe53/0x1aa0 [mdd]
2015-11-04 13:36:26  [<ffffffffa11c6784>] ? mdt_version_save+0x84/0x1a0 [mdt]
2015-11-04 13:36:26  [<ffffffffa11c8f46>] ? mdt_reint_create+0xbb6/0xcc0 [mdt]
2015-11-04 13:36:26  [<ffffffffa0a13230>] ? lu_ucred+0x20/0x30 [obdclass]
2015-11-04 13:36:26  [<ffffffffa11a8675>] ? mdt_ucred+0x15/0x20 [mdt]
2015-11-04 13:36:26  [<ffffffffa11c183c>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
2015-11-04 13:36:26  [<ffffffffa0c43d32>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
2015-11-04 13:36:26  [<ffffffffa11c597d>] ? mdt_reint_rec+0x5d/0x200 [mdt]
2015-11-04 13:36:26  [<ffffffffa11b177b>] ? mdt_reint_internal+0x62b/0xb80 [mdt]
2015-11-04 13:36:26  [<ffffffffa11b216b>] ? mdt_reint+0x6b/0x120 [mdt]
2015-11-04 13:36:26  [<ffffffffa0c8621c>] ? tgt_request_handle+0x8bc/0x12e0 [ptlrpc]
2015-11-04 13:36:26  [<ffffffffa0c2da21>] ? ptlrpc_main+0xe41/0x1910 [ptlrpc]
2015-11-04 13:36:26  [<ffffffff8106d740>] ? pick_next_task_fair+0xd0/0x130
2015-11-04 13:36:26  [<ffffffff8152d8f6>] ? schedule+0x176/0x3a0
2015-11-04 13:36:26  [<ffffffffa0c2cbe0>] ? ptlrpc_main+0x0/0x1910 [ptlrpc]
2015-11-04 13:36:26  [<ffffffff8109fffe>] ? kthread+0x9e/0xc0
2015-11-04 13:36:27  [<ffffffff8100c24a>] ? child_rip+0xa/0x20
2015-11-04 13:36:27  [<ffffffff8109ff60>] ? kthread+0x0/0xc0
2015-11-04 13:36:27  [<ffffffff8100c240>] ? child_rip+0x0/0x20

After rebooting MDS1, I started to see llog corruption messages for an llog file that was on MDS4 (remember the panic was on MDS1) shown below:

2015-11-04 14:15:59 LustreError: 11466:0:(llog_osd.c:833:llog_osd_next_block()) ldne-MDT0003-osp-MDT0000: can't read llog block from log [0x300000401:0x1:0x0] offset 32768: rc = -5
2015-11-04 14:15:59 LustreError: 11466:0:(llog.c:578:llog_process_thread()) Local llog found corrupted

Eventually, the recovery timer goes negative and recovery never ends (see LU-6994). I manually fixed the llog file on MDS4 and recovery now completes. Attached are the original corrupted llog (0x300000401.0x1.0x0) and the version after the fix (0x300000401.0x1.0x0.replace).

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

0x300000401.0x1.0x0
34 kB
11/Nov/15 11:56 PM
0x300000401.0x1.0x0.replace
34 kB
11/Nov/15 11:56 PM

Issue Links

is related to

LU-6994 MDT recovery timer goes negative, recovery never ends

Resolved

Activity

[LU-7419] llog corruption after hitting ASSERTION( handle->lgh_hdr == ((void *)0) ) in llog_init_handle

Peter Jones added a comment - 14/Dec/15 5:29 AM

Landed for 2.8

Peter Jones added a comment - 14/Dec/15 5:29 AM Landed for 2.8

Gerrit Updater added a comment - 13/Dec/15 9:57 PM

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17132/
Subject: ~~LU-7419~~ llog: lock new llog object creation
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 63a3e412bddcc94b7497aecee91864813a614f83

Gerrit Updater added a comment - 13/Dec/15 9:57 PM Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17132/ Subject: LU-7419 llog: lock new llog object creation Project: fs/lustre-release Branch: master Current Patch Set: Commit: 63a3e412bddcc94b7497aecee91864813a614f83

Giuseppe Di Natale (Inactive) added a comment - 10/Dec/15 11:49 PM

I applied this patch locally to lustre running on one of our test clusters. I am still using the setup mentioned in the comments of LU-6994. I no longer see MDSs crashing due to this call stack.

Giuseppe Di Natale (Inactive) added a comment - 10/Dec/15 11:49 PM I applied this patch locally to lustre running on one of our test clusters. I am still using the setup mentioned in the comments of LU-6994 . I no longer see MDSs crashing due to this call stack.

Andreas Dilger added a comment - 23/Nov/15 6:57 PM

The patch http://review.whamcloud.com/17132 is intended for landing, the 17196 patch is just for testing.

Andreas Dilger added a comment - 23/Nov/15 6:57 PM The patch http://review.whamcloud.com/17132 is intended for landing, the 17196 patch is just for testing.

Gerrit Updater added a comment - 13/Nov/15 10:54 PM

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/17196
Subject: ~~LU-7419~~ llog: lock new llog object creation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ef35db927b07c08ca32f5bfd0b43e1e07430439b

Gerrit Updater added a comment - 13/Nov/15 10:54 PM Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/17196 Subject: LU-7419 llog: lock new llog object creation Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: ef35db927b07c08ca32f5bfd0b43e1e07430439b

Gerrit Updater added a comment - 12/Nov/15 5:17 AM

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/17132
Subject: ~~LU-7419~~ llog: lock new llog object creation
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1917a3075506fd7879888bf1698022ecbc7cdc5d

Gerrit Updater added a comment - 12/Nov/15 5:17 AM wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/17132 Subject: LU-7419 llog: lock new llog object creation Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 1917a3075506fd7879888bf1698022ecbc7cdc5d

Di Wang (Inactive) added a comment - 12/Nov/15 1:20 AM

Hmm, we need lock log handle (lgh_lock) during log_cat_new_log, otherwise it might cause race like this.
I will cook a patch. Btw: I am trying to resolve all of the update llog corruption issue in ~~LU-7039~~.

Btw: recovery never ends mostly likely due to the failure of update recovery.

Di Wang (Inactive) added a comment - 12/Nov/15 1:20 AM Hmm, we need lock log handle (lgh_lock) during log_cat_new_log, otherwise it might cause race like this. I will cook a patch. Btw: I am trying to resolve all of the update llog corruption issue in LU-7039 . Btw: recovery never ends mostly likely due to the failure of update recovery.

People

Assignee:: Di Wang (Inactive)

Reporter:: Giuseppe Di Natale (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 9 Start watching this issue

Dates

Created:: 11/Nov/15 11:56 PM

Updated:: 09/Jan/16 3:20 PM

Resolved:: 14/Dec/15 5:29 AM