[LU-8463] OSSes drop into KDB during recovery Created: 02/Aug/16  Updated: 08/Sep/16  Resolved: 02/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Herbert Yeung Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-8462 OSS keeps dropping into KDB Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

During the crashes reported in LU-8462, several times upon recovery, the OSS will drop into KDB. Here's an expanded timeline:

Tues Jul 26 - s394 - 2028 KDB (I/O)
Tues Jul 26 - s395 - 2000 KDB (I/O), (disk) crash during recovery, until e2fsck
Fri Jul 29 - s393 - 0910 KDB (aborting journal)
Fri Jul 29 - s394 - 1922 KDB (disk), (disk) crash during recovery, e2fsck
Fri Jul 29 - s393 - 2123 KDB (disk), (disk) crash during recovery, no e2fsck
Sun Jul 31 - s393 - 1519 KDB (disk)

After the initial crash during recovery on Tuesday, I ran e2fsck on the node and was able to successfully mount the OSTs. However, running e2fsck may not be necessary, as on one occasion the OSS was able to recover after a second attempt.



 Comments   
Comment by Herbert Yeung [ 02/Aug/16 ]

Accidentally hit Ctrl+Enter, submitting the bug before entering the data, so here it is:

During the crashes reported in LU-8462, several times upon recovery, the OSS will drop into KDB. Here's an expanded timeline:

  • Tues Jul 26 - s394 - 2028 KDB (I/O)
  • Tues Jul 26 - s395 - 2000 KDB (I/O), (disk) crash during recovery, until e2fsck
  • Fri Jul 29 - s393 - 0910 KDB (aborting journal)
  • Fri Jul 29 - s394 - 1922 KDB (disk), (disk) crash during recovery, e2fsck
  • Fri Jul 29 - s393 - 2123 KDB (disk), (disk) crash during recovery, no e2fsck
  • Sun Jul 31 - s393 - 1519 KDB (disk)

After the initial crash during recovery on Tuesday, I ran e2fsck on the node and was able to successfully mount the OSTs. However, running e2fsck may not be necessary, as on one occasion the OSS was able to recover after a second attempt.

Also, there appears to be a discrepancy between the device that the journal aborts on and the device that the LDISKFS-fs error is reported on. Is this normal?
s395 Tues Jul 26 - 2000:

LDISKFS-fs error (device dm-15): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 3037corrupted: 31949 blocks free in bitmap, 14347 - in gd

Aborting journal on device dm-6.
Kernel panic - not syncing: LDISKFS-fs (device dm-15): panic forced after error

Below is the output from the crashes during recovery/e2fsck output:

s395 Tues Jul 26 - 2000 crash during recovery:

Lustre: nbp6-OST00b7: deleting orphan objects from 0x0:549495 to 0x0:549665
Lustre: nbp6-OST00bb: deleting orphan objects from 0x0:549224 to 0x0:549281
Lustre: nbp6-OST00cf: Recovery over after 6:43, of 12593 clients 10903 recovered and 1690 were evicted.
Lustre: nbp6-OST00cf: deleting orphan objects from 0x0:550044 to 0x0:550113
LDISKFS-fs error (device dm-15): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 3037corrupted: 31949 blocks free in bitmap, 14347 - in gd

Aborting journal on device dm-6.
Kernel panic - not syncing: LDISKFS-fs (device dm-15): panic forced after error

Pid: 17694, comm: ll_ost_io52_006 Not tainted 2.6.32-504.30.3.el6.20151008.x86_64.lustre253 #1
10 out of 16 cpus in kdb, waiting for the rest, timeout in 10 second(s)
Call Trace:
 [<ffffffff81564fb9>] ? panic+0xa7/0x190
 [<ffffffffa0eb9118>] ? ldiskfs_commit_super+0x188/0x210 [ldiskfs]
 [<ffffffffa0eb9724>] ? ldiskfs_handle_error+0xc4/0xd0 [ldiskfs]
 [<ffffffffa0eb9ac2>] ? __ldiskfs_error+0x82/0x90 [ldiskfs]
 [<ffffffff810f2059>] ? delayacct_end+0x89/0xa0
 [<ffffffffa0e9ba29>] ? ldiskfs_mb_check_ondisk_bitmap+0x149/0x150 [ldiskfs]
 [<ffffffffa0e9baad>] ? ldiskfs_mb_generate_from_pa+0x7d/0x180 [ldiskfs]
 [<ffffffff8109e1a0>] ? wake_bit_function+0x0/0x50
 [<ffffffffa0e9e28b>] ? ldiskfs_mb_init_cache+0x55b/0xa30 [ldiskfs]
 [<ffffffffa0e9e87e>] ? ldiskfs_mb_init_group+0x11e/0x210 [ldiskfs]
 [<ffffffffa0e9edd5>] ? ldiskfs_mb_load_buddy+0x355/0x390 [ldiskfs]
 [<ffffffffa0e8500b>] ? __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
 [<ffffffffa0e9fb7d>] ? ldiskfs_mb_find_by_goal+0x6d/0x2e0 [ldiskfs]
 [<ffffffffa0ea0019>] ? ldiskfs_mb_regular_allocator+0x59/0x410 [ldiskfs]
 [<ffffffffa0eba2e8>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
 [<ffffffffa0e9a872>] ? ldiskfs_mb_normalize_request+0x2c2/0x3d0 [ldiskfs]
 [<ffffffffa0ea223d>] ? ldiskfs_mb_new_blocks+0x47d/0x630 [ldiskfs]
 [<ffffffffa0eba390>] ? ldiskfs_journal_start_sb+0x70/0x170 [ldiskfs]
 [<ffffffffa01f19ff>] ? ldiskfs_ext_new_extent_cb+0x57f/0x6cc [fsfilt_ldiskfs]
 [<ffffffffa0e87cd2>] ? ldiskfs_ext_walk_space+0x142/0x310 [ldiskfs]
 [<ffffffffa01f1480>] ? ldiskfs_ext_new_extent_cb+0x0/0x6cc [fsfilt_ldiskfs]
 [<ffffffffa01f11cc>] ? fsfilt_map_nblocks+0xcc/0xf0 [fsfilt_ldiskfs]
 [<ffffffffa01f12f0>] ? fsfilt_ldiskfs_map_ext_inode_pages+0x100/0x200 [fsfilt_ldiskfs]
 [<ffffffffa01f1475>] ? fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs]
 [<ffffffffa0eba884>] ? ldiskfs_dquot_initialize+0x94/0xd0 [ldiskfs]
 [<ffffffffa1422342>] ? osd_write_commit+0x302/0x620 [osd_ldiskfs]
 [<ffffffffa17cb094>] ? ofd_commitrw_write+0x684/0x11b0 [ofd]
 [<ffffffffa17cddfd>] ? ofd_commitrw+0x5cd/0xbb0 [ofd]
 [<ffffffffa0374861>] ? lprocfs_counter_add+0x151/0x1d6 [lvfs]
 [<ffffffffa176219d>] ? obd_commitrw+0x11d/0x390 [ost]
 [<ffffffffa176bf71>] ? ost_brw_write+0xea1/0x15d0 [ost]
 [<ffffffff812b9076>] ? vsnprintf+0x336/0x5e0
 [<ffffffffa068f500>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa17725cc>] ? ost_handle+0x439c/0x44d0 [ost]
 [<ffffffffa0401a44>] ? libcfs_id2str+0x74/0xb0 [libcfs]
 [<ffffffffa06df0c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
 [<ffffffffa04078d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
 [<ffffffffa06d7a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
 [<ffffffffa06e189d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc]
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffffa06e0da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
.All cpus are now in kdb

Entering kdb (current=0xffff881d9d9faab0, pid 17694) on processor 8 Oops: (null)
due to oops @ 0x0
kdba_dumpregs: pt_regs not available, use bt* or pid to select a different task
[8]kdb> [-- Tue Jul 26 18:40:11 2016]

s395 Tues Jul 26 - 2000 2nd crash during recovery:

Lustre: nbp6-OST00cf: Recovery over after 12:41, of 10903 clients 4619 recovered and 6284 were evicted.
Lustre: Skipped 1 previous similar message
LDISKFS-fs error (device dm-27): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 3037corrupted: 31949 blocks free in bitmap, 14347 - in gd

Aborting journal on device dm-6.
Kernel panic - not syncing: LDISKFS-fs (device dm-27): panic forced after error

Pid: 16034, comm: ll_ost_io02_002 Not tainted 2.6.32-504.30.3.el6.20151008.x86_64.lustre253 #1
Call Trace:
 [<ffffffff81564fb9>] ? panic+0xa7/0x190
 [<ffffffffa0b74118>] ? ldiskfs_commit_super+0x188/0x210 [ldiskfs]
 [<ffffffffa0b74724>] ? ldiskfs_handle_error+0xc4/0xd0 [ldiskfs]

Entering kdb (current=0xffff881e5493cab0, pid 16034) on processor 8 Oops: (null)
due to oops @ 0x0
kdba_dumpregs: pt_regs not available, use bt* or pid to select a different task
[8]kdb> [-- Tue Jul 26 19:16:06 2016]

s395 Tues Jul 26 - 2000 e2fsck:

nbp6-oss16 ~ # /usr/local/bin/fscklustre.sh
nbp6-OST00CF: nbp6-OST00cf contains a file system with errors, check forced.
nbp6-OST00EF: nbp6-OST00ef: clean, 83709/7397376 files, 1005802096/1893728256 blocks
nbp6-OST00DF: nbp6-OST00df: clean, 83601/7397376 files, 1000039730/1893728256 blocks
nbp6-OST00C3: nbp6-OST00c3: clean, 83857/7397376 files, 987218431/1893728256 blocks
nbp6-OST00DB: nbp6-OST00db: clean, 83949/7397376 files, 992486492/1893728256 blocks
nbp6-OST00E3: nbp6-OST00e3: clean, 83847/7397376 files, 993140941/1893728256 blocks
nbp6-OST00BF: nbp6-OST00bf: clean, 84168/7397376 files, 993181370/1893728256 blocks
nbp6-OST00E7: nbp6-OST00e7: clean, 84352/7397376 files, 997493607/1893728256 blocks
nbp6-OST00D3: nbp6-OST00d3: clean, 83798/7397376 files, 1000758166/1893728256 blocks
nbp6-OST00D7: nbp6-OST00d7: clean, 84268/7397376 files, 994045438/1893728256 blocks
nbp6-OST00EB: nbp6-OST00eb: clean, 84081/7397376 files, 1006478293/1893728256 blocks
nbp6-OST00CB: nbp6-OST00cb: clean, 83367/7397376 files, 992639326/1893728256 blocks
nbp6-OST00C7: nbp6-OST00c7: clean, 84000/7397376 files, 998232233/1893728256 blocks
nbp6-OST00BB: nbp6-OST00bb: clean, 82971/7397376 files, 1018277281/1893728256 blocks
nbp6-OST00B7: nbp6-OST00b7: clean, 83727/7397376 files, 1006767654/1893728256 blocks
nbp6-OST00CF: nbp6-OST00cf: Inode 172464, i_blocks is 53305288, should be 53117200.  FIXED.
nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 0:actual (15872000, 301) != expected (36864, 4)
nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 9225:actual (10485760, 1) != expected (0, 0)
nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 11913:actual (40223547392, 8) != expected (13027540992, 7)
nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 11632:actual (140062720, 17) != expected (140058624, 16)
nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 310286060:actual (7704576, 15) != expected (7696384, 15)
nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 11968:actual (185972432896, 5936) != expected (146873962496, 4667)
nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 12031:actual (2330624, 35) != expected (552960, 7)
nbp6-OST00CF: nbp6-OST00cf: Update quota info for quota type 0.
nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 0:actual (15872000, 301) != expected (36864, 4)
nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 41007:actual (3909255766016, 78036) != expected (1892489019392, 38304)
nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 1125:actual (10485760, 1) != expected (0, 0)
nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 41557:actual (179625074688, 5824) != expected (156474228736, 5039)
nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 41020:actual (24576, 2) != expected (16384, 2)
nbp6-OST00CF: nbp6-OST00cf: Update quota info for quota type 1.
nbp6-OST00CF: nbp6-OST00cf: 84189/7397376 files (40.0% non-contiguous), 998870723/1893728256 blocks
ldev: Fatal: parallel command execution failed

s394 Fri Jul 29 - 1922 crash during recovery:

LDISKFS-fs error (device dm-27): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 4891corrupted: 21438 blocks free in bitmap, 25535 - in gd

Aborting journal on device dm-2.
LDISKFS-fs error (device dm-27): ldiskfs_journal_start_sb:
Kernel panic - not syncing: LDISKFS-fs (device dm-27): panic forced after error

Pid: 95204, comm: ll_ost_io01_021 Tainted: G        W  ---------------    2.6.32-504.30.3.el6.20151008.x86_64.lustre253 #1
Call Trace:
 [<ffffffff81564fb9>] ? panic+0xa7/0x190
 [<ffffffffa0eb0118>] ? ldiskfs_commit_super+0x188/0x210 [ldiskfs]
14 out of 16 cpus in kdb, waiting for the rest, timeout in 10 second(s)
 [<ffffffffa0eb0724>] ? ldiskfs_handle_error+0xc4/0xd0 [ldiskfs]
 [<ffffffffa0eb0ac2>] ? __ldiskfs_error+0x82/0x90 [ldiskfs]
 [<ffffffff810f2059>] ? delayacct_end+0x89/0xa0
 [<ffffffffa0e92a29>] ? ldiskfs_mb_check_ondisk_bitmap+0x149/0x150 [ldiskfs]
 [<ffffffffa0e92aad>] ? ldiskfs_mb_generate_from_pa+0x7d/0x180 [ldiskfs]
 [<ffffffff8109e1a0>] ? wake_bit_function+0x0/0x50
 [<ffffffffa0e9528b>] ? ldiskfs_mb_init_cache+0x55b/0xa30 [ldiskfs]
 [<ffffffffa0e9587e>] ? ldiskfs_mb_init_group+0x11e/0x210 [ldiskfs]
 [<ffffffffa0e95dd5>] ? ldiskfs_mb_load_buddy+0x355/0x390 [ldiskfs]
 [<ffffffffa0e7c00b>] ? __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
 [<ffffffffa0e96b7d>] ? ldiskfs_mb_find_by_goal+0x6d/0x2e0 [ldiskfs]
 [<ffffffffa0e97019>] ? ldiskfs_mb_regular_allocator+0x59/0x410 [ldiskfs]
 [<ffffffffa0eb12e8>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
 [<ffffffffa0e91872>] ? ldiskfs_mb_normalize_request+0x2c2/0x3d0 [ldiskfs]
 [<ffffffffa0e9923d>] ? ldiskfs_mb_new_blocks+0x47d/0x630 [ldiskfs]
 [<ffffffffa0eb1390>] ? ldiskfs_journal_start_sb+0x70/0x170 [ldiskfs]
 [<ffffffffa02369ff>] ? ldiskfs_ext_new_extent_cb+0x57f/0x6cc [fsfilt_ldiskfs]
 [<ffffffffa0e7ecd2>] ? ldiskfs_ext_walk_space+0x142/0x310 [ldiskfs]
 [<ffffffffa0236480>] ? ldiskfs_ext_new_extent_cb+0x0/0x6cc [fsfilt_ldiskfs]
 [<ffffffffa02361cc>] ? fsfilt_map_nblocks+0xcc/0xf0 [fsfilt_ldiskfs]
 [<ffffffffa02362f0>] ? fsfilt_ldiskfs_map_ext_inode_pages+0x100/0x200 [fsfilt_ldiskfs]
 [<ffffffffa0236475>] ? fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs]
 [<ffffffffa0eb1884>] ? ldiskfs_dquot_initialize+0x94/0xd0 [ldiskfs]
 [<ffffffffa0f89342>] ? osd_write_commit+0x302/0x620 [osd_ldiskfs]
 [<ffffffffa137c094>] ? ofd_commitrw_write+0x684/0x11b0 [ofd]
 [<ffffffffa137edfd>] ? ofd_commitrw+0x5cd/0xbb0 [ofd]
 [<ffffffffa0372861>] ? lprocfs_counter_add+0x151/0x1d6 [lvfs]
 [<ffffffffa131319d>] ? obd_commitrw+0x11d/0x390 [ost]
 [<ffffffffa131cf71>] ? ost_brw_write+0xea1/0x15d0 [ost]
 [<ffffffff812b9076>] ? vsnprintf+0x336/0x5e0
 [<ffffffffa0686500>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa13235cc>] ? ost_handle+0x439c/0x44d0 [ost]
 [<ffffffffa03f8a44>] ? libcfs_id2str+0x74/0xb0 [libcfs]
 [<ffffffffa06d60c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
 [<ffffffffa03fe8d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
 [<ffffffffa06cea69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
 [<ffffffffa06d889d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc]
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffffa06d7da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
LDISKFS-fs error (device dm-27): ldiskfs_journal_start_sb: Detected aborted journal

.  15 out of 16 cpus in kdb, waiting for the rest, timeout in 9 second(s)
..1 cpu is not in kdb, its state is unknown

Entering kdb (current=0xffff880e989cc040, pid 95204) on processor 7 Oops: (null)
due to oops @ 0x0
kdba_dumpregs: pt_regs not available, use bt* or pid to select a different task
[7]kdb> [-- MARK -- Fri Jul 29 19:00:00 2016]

s394 Fri Jul 29 - 1922 e2fsck:

nbp6-oss15 ~ # /usr/local/bin/ldev -c /usr/local/etc/ldev-nbp6.conf /sbin/e2fsck -p %d |tee /root/e2fsck.160729
nbp6-OST00C2: nbp6-OST00c2: clean, 64886/7397376 files, 823185966/1893728256 blocks 
nbp6-OST00E2: nbp6-OST00e2: clean, 64936/7397376 files, 763956787/1893728256 blocks
nbp6-OST00DE: nbp6-OST00de: clean, 64459/7397376 files, 766022848/1893728256 blocks
nbp6-OST00E6: nbp6-OST00e6: clean, 64496/7397376 files, 766103433/1893728256 blocks
nbp6-OST00CE: nbp6-OST00ce: clean, 65124/7397376 files, 756735149/1893728256 blocks
nbp6-OST00DA: nbp6-OST00da: clean, 65162/7397376 files, 758145783/1893728256 blocks
nbp6-OST00EE: nbp6-OST00ee: clean, 65088/7397376 files, 759562710/1893728256 blocks
nbp6-OST00BE: nbp6-OST00be: clean, 64565/7397376 files, 764337531/1893728256 blocks
nbp6-OST00B6: nbp6-OST00b6: clean, 64453/7397376 files, 771893757/1893728256 blocks
nbp6-OST00EA: nbp6-OST00ea: clean, 64333/7397376 files, 772395615/1893728256 blocks
nbp6-OST00C6: nbp6-OST00c6: clean, 64955/7397376 files, 759923311/1893728256 blocks
nbp6-OST00BA: nbp6-OST00ba: clean, 64945/7397376 files, 757944974/1893728256 blocks
nbp6-OST00CA: nbp6-OST00ca: clean, 64103/7397376 files, 777560549/1893728256 blocks
nbp6-OST00D2: nbp6-OST00d2 contains a file system with errors, check forced.
nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 0:actual (15863808, 319) != expected (36864, 4)
nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 30009:actual (2871015219200, 58476) != expected (2656438280192, 54656)
nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 9225:actual (10485760, 1) != expected (0, 0)
nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 11913:actual (50065829888, 11) != expected (22638972928, 9)
nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 11968:actual (166439620608, 5272) != expected (114509180928, 3829)
nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 310286060:actual (7212900352, 10) != expected (21749760, 11)
nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 12031:actual (2412544, 30) != expected (638976, 8)
nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 11810:actual (10202492928, 6) != expected (10199027712, 5)
nbp6-OST00D2: nbp6-OST00d2: Update quota info for quota type 0. 
nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 0:actual (15863808, 319) != expected (36864, 4)
nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 41007:actual (3001073184768, 59307) != expected (1945474367488, 38766)
nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 1125:actual (10485760, 1) != expected (0, 0)
nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 41557:actual (156396912640, 5146) != expected (104593395712, 3727)
nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 41020:actual (7191187456, 1) != expected (0, 0)
nbp6-OST00D2: nbp6-OST00d2: Update quota info for quota type 1.
nbp6-OST00D2: nbp6-OST00d2: 64800/7397376 files (38.0% non-contiguous), 773234595/1893728256 blocks
nbp6-OST00D6: nbp6-OST00d6 contains a file system with errors, check forced.
nbp6-OST00D6: [QUOTA WARNING] Usage inconsistent for ID 0:actual (15863808, 312) != expected (36864, 4)
nbp6-OST00D6: [QUOTA WARNING] Usage inconsistent for ID 30009:actual (2823348002816, 58633) != expected (3762248306688, 77514)
nbp6-OST00D6: [QUOTA WARNING] Usage inconsistent for ID 11632:actual (222023680, 21) != expected (222011392, 18)
nbp6-OST00D6: [QUOTA WARNING] Usage inconsistent for ID 11968:actual (169801756672, 5385) != expected (115151048704, 3902)
nbp6-OST00D6: [QUOTA WARNING] Usage inconsistent for ID 12031:actual (3214045184, 46) != expected (503808, 5)
nbp6-OST00D6: [QUOTA WARNING] Usage inconsistent for ID 11810:actual (3606089728, 6) != expected (3605233664, 5)
nbp6-OST00D6: nbp6-OST00d6: Update quota info for quota type 0.
nbp6-OST00D6: [QUOTA WARNING] Usage inconsistent for ID 0:actual (15863808, 312) != expected (36864, 4)
nbp6-OST00D6: [QUOTA WARNING] Usage inconsistent for ID 41007:actual (2898631974912, 59507) != expected (3833795325952, 78302)
nbp6-OST00D6: [QUOTA WARNING] Usage inconsistent for ID 41557:actual (156144955392, 5256) != expected (101637935104, 3800)
nbp6-OST00D6: nbp6-OST00d6: Update quota info for quota type 1.
nbp6-OST00D6: nbp6-OST00d6: 65098/7397376 files (36.9% non-contiguous), 746402105/1893728256 blocks

s393 Fri Jul 29 - 2123 crash during recovery:

LustreError: dumping log to /tmp/lustre-log.1469847950.16311
LDISKFS-fs error (device dm-17): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 4891corrupted: 21438 blocks free in bitmap, 25535 - in gd

Aborting journal on device dm-2. 
LDISKFS-fs error (device dm-17): ldiskfs_journal_start_sb: Detected aborted journal
Kernel panic - not syncing: LDISKFS-fs panic from previous error

Pid: 23565, comm: ll_ost_io00_052 Tainted: G        W  ---------------    2.6.32-504.30.3.el6.20151008.x86_64.lustre253 #1
15 out of 16 cpus in kdb, waiting for the rest, timeout in 10 second(s)
Call Trace:
 [<ffffffff81564fb9>] ? panic+0xa7/0x190
 [<ffffffffa10c9be0>] ? ldiskfs_remount+0x0/0x590 [ldiskfs]
 [<ffffffff811c2130>] ? sync_buffer+0x0/0x50
 [<ffffffffa10ca3c8>] ? ldiskfs_journal_start_sb+0xa8/0x170 [ldiskfs]
 [<ffffffff81566a58>] ? out_of_line_wait_on_bit+0x78/0x90
 [<ffffffffa10962e0>] ? __ldiskfs_ext_check+0x1c0/0x240 [ldiskfs]
 [<ffffffff8109e1a0>] ? wake_bit_function+0x0/0x50
 [<ffffffffa01e362f>] ? ldiskfs_ext_new_extent_cb+0x1af/0x6cc [fsfilt_ldiskfs]
 [<ffffffffa1097570>] ? ldiskfs_ext_find_extent+0x2a0/0x360 [ldiskfs]
 [<ffffffffa1097cd2>] ? ldiskfs_ext_walk_space+0x142/0x310 [ldiskfs]
 [<ffffffffa01e3480>] ? ldiskfs_ext_new_extent_cb+0x0/0x6cc [fsfilt_ldiskfs]
 [<ffffffffa01e31cc>] ? fsfilt_map_nblocks+0xcc/0xf0 [fsfilt_ldiskfs]
 [<ffffffffa01e32f0>] ? fsfilt_ldiskfs_map_ext_inode_pages+0x100/0x200 [fsfilt_ldiskfs]
 [<ffffffffa01e3475>] ? fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs]
 [<ffffffffa10ca884>] ? ldiskfs_dquot_initialize+0x94/0xd0 [ldiskfs]
 [<ffffffffa15ae342>] ? osd_write_commit+0x302/0x620 [osd_ldiskfs]
 [<ffffffffa176f094>] ? ofd_commitrw_write+0x684/0x11b0 [ofd]
 [<ffffffffa1771dfd>] ? ofd_commitrw+0x5cd/0xbb0 [ofd]
 [<ffffffffa0374861>] ? lprocfs_counter_add+0x151/0x1d6 [lvfs]
 [<ffffffffa170619d>] ? obd_commitrw+0x11d/0x390 [ost]
 [<ffffffffa170ff71>] ? ost_brw_write+0xea1/0x15d0 [ost]
 [<ffffffff812b9076>] ? vsnprintf+0x336/0x5e0
 [<ffffffffa068f500>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa17165cc>] ? ost_handle+0x439c/0x44d0 [ost]
 [<ffffffffa0401a44>] ? libcfs_id2str+0x74/0xb0 [libcfs]
 [<ffffffffa06df0c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
 [<ffffffffa04078d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
 [<ffffffffa06d7a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
 [<ffffffffa06e189d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc]
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffffa06e0da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
.All cpus are now in kdb
 
Entering kdb (current=0xffff880b0c8d0ab0, pid 23565) on processor 1 Oops: (null)
due to oops @ 0x0
kdba_dumpregs: pt_regs not available, use bt* or pid to select a different task
[1]kdb> [-- Fri Jul 29 20:07:23 2016]
Comment by Andreas Dilger [ 02/Aug/16 ]

Until the IO errors in LU-8462 are fixed, there appears to be ongoing corruption being introduced into this filesystem. Running e2fsck in such an environment will only cause additional problems, because it will make repair decisions based on a filesystem that is still being corrupted. The source of the corruption needs to be fixed before e2fsck is run on the filesystem again; then it will be possible to address whatever problems remain.
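
If an interim damage assessment is wanted, a read-only pass reports problems without modifying anything (a sketch, assuming standard e2fsprogs; the device name is illustrative):

# Force a full check but answer "no" to every repair prompt,
# so the filesystem is not modified:
e2fsck -fn /dev/dm-15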

Comment by Herbert Yeung [ 02/Aug/16 ]

As mentioned in the update to LU-8462, I don't believe there are ongoing I/O errors.

Also, in case it was missed, there appears to be a discrepancy between the device that the journal aborts on and the device that the LDISKFS-fs error is reported on. In the example below, the error occurs on device dm-15, yet the journal is aborted on dm-6. Is this expected behavior, incorrect printing of the aborted journal device, or incorrect behavior?

s395 Tues Jul 26 - 2000:

LDISKFS-fs error (device dm-15): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 3037corrupted: 31949 blocks free in bitmap, 14347 - in gd

Aborting journal on device dm-6.
Kernel panic - not syncing: LDISKFS-fs (device dm-15): panic forced after error

Comment by Peter Jones [ 04/Aug/16 ]

Fan Yong

What do you advise here?

Peter

Comment by nasf (Inactive) [ 05/Aug/16 ]

Also, in case it was missed, there appears to be a discrepancy between the device that the journal aborts on and the device that the LDISKFS-fs error is reported on. In the example below, the error occurs on device dm-15, yet the journal is aborted on dm-6. Is this expected behavior, incorrect printing of the aborted journal device, or incorrect behavior?
s395 Tues Jul 26 - 2000:
LDISKFS-fs error (device dm-15): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 3037corrupted: 31949 blocks free in bitmap, 14347 - in gd
Aborting journal on device dm-6.

This should be related to the use of an external journal, meaning the Lustre device was on dm-15 while its journal was on dm-6. Please check to confirm.
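
One way to confirm (a sketch, assuming standard e2fsprogs; device names are illustrative):

# An external journal shows "Journal UUID" / "Journal device" fields
# in the superblock instead of a "Journal inode" field:
dumpe2fs -h /dev/dm-15 | grep -i journal

# External journal devices report TYPE="jbd", which helps map the
# journal back to a dm device such as dm-6:
blkid | grep jbd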

The panic is because the bitmap corruption caused the journal to be aborted, and you have set the behaviour on journal abort to "errors=panic", so ldiskfs panics automatically when the journal aborts. If you do not want the panic, it can be changed to "errors=remount-ro" or "errors=continue". But in your case, since some bitmaps are corrupted, letting the system continue running may be dangerous.
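
For reference, the error behaviour can be inspected and changed with tune2fs (a sketch; the device name is illustrative, and this is not something to change while the corruption is unresolved):

# The current setting is shown in the "Errors behavior" field:
tune2fs -l /dev/dm-15 | grep -i "errors behavior"

# Switch the on-error behaviour; valid values are continue,
# remount-ro, and panic:
tune2fs -e remount-ro /dev/dm-15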

Comment by nasf (Inactive) [ 24/Aug/16 ]

Any feedback? Thanks!

Comment by Nathan Dauchy (Inactive) [ 02/Sep/16 ]

We did run e2fsck on all the servers that use the same backend disk array, and it corrected some errors.

This case can be closed.

Comment by nasf (Inactive) [ 02/Sep/16 ]

The issue has been resolved via e2fsck.

Comment by Mahmoud Hanafi [ 08/Sep/16 ]

Please close.
