[LU-8463] OSSes drop into KDB during recovery Created: 02/Aug/16 Updated: 08/Sep/16 Resolved: 02/Sep/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Herbert Yeung | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 2 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
During the crashes reported in Tues Jul 26 - 394 - 2028 KDB (I/O) 395 - 2000 KDB (I/O), (disk) crash during recovery, until e2fsck After the initial crash during recovery on Tuesday, I ran e2fsck on the node and was able to successfully mount the OSTs. However, running e2fsck may not be necessary as on one occassion, it was able to recovery after a second attempt. |
| Comments |
| Comment by Herbert Yeung [ 02/Aug/16 ] |
|
Accidentally hit CTRL + enter, submitting the bug before inputting data, so here it is: During the crashes reported in Tues Jul 26 - 394 - 2028 KDB (I/O)
After the initial crash during recovery on Tuesday, I ran e2fsck on the node and was able to successfully mount the OSTs. However, running e2fsck may not be necessary as on one occassion, it was able to recovery after a second attempt. Also, there appears to be a discrepancy in the device that the journal aborts on compared to where the LIDSKFS-fs error is reported on. Is this normal? LDISKFS-fs error (device dm-15): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 3037corrupted: 31949 blocks free in bitmap, 14347 - in gd Aborting journal on device dm-6. Kernel panic - not syncing: LDISKFS-fs (device dm-15): panic forced after error Below is the output from the crashes during recovery/e2fsck output: s395 Tues Jul 26 - 2000 crash during recovery: Lustre: nbp6-OST00b7: deleting orphan objects from 0x0:549495 to 0x0:549665 Lustre: nbp6-OST00bb: deleting orphan objects from 0x0:549224 to 0x0:549281 Lustre: nbp6-OST00cf: Recovery over after 6:43, of 12593 clients 10903 recovered and 1690 were evicted. Lustre: nbp6-OST00cf: deleting orphan objects from 0x0:550044 to 0x0:550113 LDISKFS-fs error (device dm-15): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 3037corrupted: 31949 blocks free in bitmap, 14347 - in gd Aborting journal on device dm-6. Kernel panic - not syncing: LDISKFS-fs (device dm-15): panic forced after error Pid: 17694, comm: ll_ost_ i o1052 o_u0t0 6o fN o1t6 tcapiunst eidn 2k.6d.b3,2 -w5a0i4t.i3n0g. 3f.oerl t6h.e2 0r1e5s1t0,0 8t.ixm8e6o_u6t4 .iluns t1r0e 2s5e3c o#n1d( s) Call Trace: [<ffffffff81564fb9>] ? panic+0xa7/0x190 [<ffffffffa0eb9118>] ? ldiskfs_commit_super+0x188/0x210 [ldiskfs] [<ffffffffa0eb9724>] ? ldiskfs_handle_error+0xc4/0xd0 [ldiskfs] [<ffffffffa0eb9ac2>] ? __ldiskfs_error+0x82/0x90 [ldiskfs] [<ffffffff810f2059>] ? delayacct_end+0x89/0xa0 [<ffffffffa0e9ba29>] ? ldiskfs_mb_check_ondisk_bitmap+0x149/0x150 [ldiskfs] [<ffffffffa0e9baad>] ? ldiskfs_mb_generate_from_pa+0x7d/0x180 [ldiskfs] [<ffffffff8109e1a0>] ? wake_bit_function+0x0/0x50 [<ffffffffa0e9e28b>] ? ldiskfs_mb_init_cache+0x55b/0xa30 [ldiskfs] [<ffffffffa0e9e87e>] ? ldiskfs_mb_init_group+0x11e/0x210 [ldiskfs] [<ffffffffa0e9edd5>] ? ldiskfs_mb_load_buddy+0x355/0x390 [ldiskfs] [<ffffffffa0e8500b>] ? __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs] [<ffffffffa0e9fb7d>] ? ldiskfs_mb_find_by_goal+0x6d/0x2e0 [ldiskfs] [<ffffffffa0ea0019>] ? ldiskfs_mb_regular_allocator+0x59/0x410 [ldiskfs] [<ffffffffa0eba2e8>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs] [<ffffffffa0e9a872>] ? ldiskfs_mb_normalize_request+0x2c2/0x3d0 [ldiskfs] [<ffffffffa0ea223d>] ? ldiskfs_mb_new_blocks+0x47d/0x630 [ldiskfs] [<ffffffffa0eba390>] ? ldiskfs_journal_start_sb+0x70/0x170 [ldiskfs] [<ffffffffa01f19ff>] ? ldiskfs_ext_new_extent_cb+0x57f/0x6cc [fsfilt_ldiskfs] [<ffffffffa0e87cd2>] ? ldiskfs_ext_walk_space+0x142/0x310 [ldiskfs] [<ffffffffa01f1480>] ? ldiskfs_ext_new_extent_cb+0x0/0x6cc [fsfilt_ldiskfs] [<ffffffffa01f11cc>] ? fsfilt_map_nblocks+0xcc/0xf0 [fsfilt_ldiskfs] [<ffffffffa01f12f0>] ? fsfilt_ldiskfs_map_ext_inode_pages+0x100/0x200 [fsfilt_ldiskfs] [<ffffffffa01f1475>] ? fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs] [<ffffffffa0eba884>] ? ldiskfs_dquot_initialize+0x94/0xd0 [ldiskfs] [<ffffffffa1422342>] ? osd_write_commit+0x302/0x620 [osd_ldiskfs] [<ffffffffa17cb094>] ? ofd_commitrw_write+0x684/0x11b0 [ofd] [<ffffffffa17cddfd>] ? ofd_commitrw+0x5cd/0xbb0 [ofd] [<ffffffffa0374861>] ? lprocfs_counter_add+0x151/0x1d6 [lvfs] [<ffffffffa176219d>] ? obd_commitrw+0x11d/0x390 [ost] [<ffffffffa176bf71>] ? ost_brw_write+0xea1/0x15d0 [ost] [<ffffffff812b9076>] ? vsnprintf+0x336/0x5e0 [<ffffffffa068f500>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] [<ffffffffa17725cc>] ? ost_handle+0x439c/0x44d0 [ost] [<ffffffffa0401a44>] ? libcfs_id2str+0x74/0xb0 [libcfs] [<ffffffffa06df0c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc] [<ffffffffa04078d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs] [<ffffffffa06d7a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc] [<ffffffffa06e189d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc] [<ffffffff8100c28a>] ? child_rip+0xa/0x20 [<ffffffffa06e0da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc] [<ffffffff8100c280>] ? child_rip+0x0/0x20 .All cpus are now in kdb Entering kdb (current=0xffff881d9d9faab0, pid 17694) on processor 8 Oops: (null) due to oops @ 0x0 kdba_dumpregs: pt_regs not available, use bt* or pid to select a different task [8]kdb> [-- Tue Jul 26 18:40:11 2016] s395 Tues Jul 26 - 2000 2nd crash during recovery: Lustre: nbp6-OST00cf: Recovery over after 12:41, of 10903 clients 4619 recovered and 6284 were evicted. Lustre: Skipped 1 previous similar message LDISKFS-fs error (device dm-27): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 3037corrupted: 31949 blocks free in bitmap, 14347 - in gd Aborting journal on device dm-6. Kernel panic - not syncing: LDISKFS-fs (device dm-27): panic forced after error Pid: 16034, comm: ll_ost_io02_002 Not tainted 2.6.32-504.30.3.el6.20151008.x86_64.lustre253 #1 Call Trace: [<ffffffff81564fb9>] ? panic+0xa7/0x190 [<ffffffffa0b74118>] ? ldiskfs_commit_super+0x188/0x210 [ldiskfs] [<ffffffffa0b74724>] ? ldiskfs_handle_error+0xc4/0xd0 [ldiskfs] Entering kdb (current=0xffff881e5493cab0, pid 16034) on processor 8 Oops: (null) due to oops @ 0x0 kdba_dumpregs: pt_regs not available, use bt* or pid to select a different task [8]kdb> [-- Tue Jul 26 19:16:06 2016] s395 Tues Jul 26 - 2000 e2fsck: nbp6-oss16 ~ # /usr/local/bin/fscklustre.sh nbp6-OST00CF: nbp6-OST00cf contains a file system with errors, check forced. nbp6-OST00EF: nbp6-OST00ef: clean, 83709/7397376 files, 1005802096/1893728256 blocks nbp6-OST00DF: nbp6-OST00df: clean, 83601/7397376 files, 1000039730/1893728256 blocks nbp6-OST00C3: nbp6-OST00c3: clean, 83857/7397376 files, 987218431/1893728256 blocks nbp6-OST00DB: nbp6-OST00db: clean, 83949/7397376 files, 992486492/1893728256 blocks nbp6-OST00E3: nbp6-OST00e3: clean, 83847/7397376 files, 993140941/1893728256 blocks nbp6-OST00BF: nbp6-OST00bf: clean, 84168/7397376 files, 993181370/1893728256 blocks nbp6-OST00E7: nbp6-OST00e7: clean, 84352/7397376 files, 997493607/1893728256 blocks nbp6-OST00D3: nbp6-OST00d3: clean, 83798/7397376 files, 1000758166/1893728256 blocks nbp6-OST00D7: nbp6-OST00d7: clean, 84268/7397376 files, 994045438/1893728256 blocks nbp6-OST00EB: nbp6-OST00eb: clean, 84081/7397376 files, 1006478293/1893728256 blocks nbp6-OST00CB: nbp6-OST00cb: clean, 83367/7397376 files, 992639326/1893728256 blocks nbp6-OST00C7: nbp6-OST00c7: clean, 84000/7397376 files, 998232233/1893728256 blocks nbp6-OST00BB: nbp6-OST00bb: clean, 82971/7397376 files, 1018277281/1893728256 blocks nbp6-OST00B7: nbp6-OST00b7: clean, 83727/7397376 files, 1006767654/1893728256 blocks nbp6-OST00CF: nbp6-OST00cf: Inode 172464, i_blocks is 53305288, should be 53117200. FIXED. nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 0:actual (15872000, 301) != expected (36864, 4) nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 9225:actual (10485760, 1) != expected (0, 0) nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 11913:actual (40223547392, 8) != expected (13027540992, 7) nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 11632:actual (140062720, 17) != expected (140058624, 16) nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 310286060:actual (7704576, 15) != expected (7696384, 15) nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 11968:actual (185972432896, 5936) != expected (146873962496, 4667) nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 12031:actual (2330624, 35) != expected (552960, 7) nbp6-OST00CF: nbp6-OST00cf: Update quota info for quota type 0. nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 0:actual (15872000, 301) != expected (36864, 4) nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 41007:actual (3909255766016, 78036) != expected (1892489019392, 38304) nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 1125:actual (10485760, 1) != expected (0, 0) nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 41557:actual (179625074688, 5824) != expected (156474228736, 5039) nbp6-OST00CF: [QUOTA WARNING] Usage inconsistent for ID 41020:actual (24576, 2) != expected (16384, 2) nbp6-OST00CF: nbp6-OST00cf: Update quota info for quota type 1. nbp6-OST00CF: nbp6-OST00cf: 84189/7397376 files (40.0% non-contiguous), 998870723/1893728256 blocks ldev: Fatal: parallel command execution failed s394 Fri Jul 29 - 1922 crash during recovery: LDISKFS-fs error (device dm-27): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 4891corrupted: 21438 blocks free in bitmap, 25535 - in gd Aborting journal on device dm-2. LDISKFS-fs error (device dm-27): ldiskfs_journal_start_sb: Kernel panic - not syncing: LDISKFS-fs (device dm-27): panic forced after error Pid: 95204, comm: ll_ost_io01_021 Tainted: G W --------------- 2.6.32-504.30.3.el6.20151008.x86_64.lustre253 #1 Call Trace: [<ffffffff81564fb9>] ? panic+0xa7/0x190 [<ffffffffa0eb0118>] ? ldiskfs_commit_super+0x188/0x210 [ldiskf s ]14 out of 16 cpus in kdb, waiting for the rest, timeout in 10 second(s) [<ffffffffa0eb0724>] ? ldiskfs_handle_error+0xc4/0xd0 [ldiskfs] [<ffffffffa0eb0ac2>] ? __ldiskfs_error+0x82/0x90 [ldiskfs] [<ffffffff810f2059>] ? delayacct_end+0x89/0xa0 [<ffffffffa0e92a29>] ? ldiskfs_mb_check_ondisk_bitmap+0x149/0x150 [ldiskfs] [<ffffffffa0e92aad>] ? ldiskfs_mb_generate_from_pa+0x7d/0x180 [ldiskfs] [<ffffffff8109e1a0>] ? wake_bit_function+0x0/0x50 [<ffffffffa0e9528b>] ? ldiskfs_mb_init_cache+0x55b/0xa30 [ldiskfs] [<ffffffffa0e9587e>] ? ldiskfs_mb_init_group+0x11e/0x210 [ldiskfs] [<ffffffffa0e95dd5>] ? ldiskfs_mb_load_buddy+0x355/0x390 [ldiskfs] [<ffffffffa0e7c00b>] ? __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs] [<ffffffffa0e96b7d>] ? ldiskfs_mb_find_by_goal+0x6d/0x2e0 [ldiskfs] [<ffffffffa0e97019>] ? ldiskfs_mb_regular_allocator+0x59/0x410 [ldiskfs] [<ffffffffa0eb12e8>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs] [<ffffffffa0e91872>] ? ldiskfs_mb_normalize_request+0x2c2/0x3d0 [ldiskfs] [<ffffffffa0e9923d>] ? ldiskfs_mb_new_blocks+0x47d/0x630 [ldiskfs] [<ffffffffa0eb1390>] ? ldiskfs_journal_start_sb+0x70/0x170 [ldiskfs] [<ffffffffa02369ff>] ? ldiskfs_ext_new_extent_cb+0x57f/0x6cc [fsfilt_ldiskfs] [<ffffffffa0e7ecd2>] ? ldiskfs_ext_walk_space+0x142/0x310 [ldiskfs] [<ffffffffa0236480>] ? ldiskfs_ext_new_extent_cb+0x0/0x6cc [fsfilt_ldiskfs] [<ffffffffa02361cc>] ? fsfilt_map_nblocks+0xcc/0xf0 [fsfilt_ldiskfs] [<ffffffffa02362f0>] ? fsfilt_ldiskfs_map_ext_inode_pages+0x100/0x200 [fsfilt_ldiskfs] [<ffffffffa0236475>] ? fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs] [<ffffffffa0eb1884>] ? ldiskfs_dquot_initialize+0x94/0xd0 [ldiskfs] [<ffffffffa0f89342>] ? osd_write_commit+0x302/0x620 [osd_ldiskfs] [<ffffffffa137c094>] ? ofd_commitrw_write+0x684/0x11b0 [ofd] [<ffffffffa137edfd>] ? ofd_commitrw+0x5cd/0xbb0 [ofd] [<ffffffffa0372861>] ? lprocfs_counter_add+0x151/0x1d6 [lvfs] [<ffffffffa131319d>] ? obd_commitrw+0x11d/0x390 [ost] [<ffffffffa131cf71>] ? ost_brw_write+0xea1/0x15d0 [ost] [<ffffffff812b9076>] ? vsnprintf+0x336/0x5e0 [<ffffffffa0686500>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] [<ffffffffa13235cc>] ? ost_handle+0x439c/0x44d0 [ost] [<ffffffffa03f8a44>] ? libcfs_id2str+0x74/0xb0 [libcfs] [<ffffffffa06d60c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc] [<ffffffffa03fe8d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs] [<ffffffffa06cea69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc] [<ffffffffa06d889d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc] [<ffffffff8100c28a>] ? child_rip+0xa/0x20 [<ffffffffa06d7da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc] LDISKFS-fs error (device dm-27): ldiskfs_journal_start_sb: Detected aborted journal [<ffffffff8100c280>] ? child_rip+0x0/0x20 . 15 out of 16 cpus in kdb, waiting for the rest, timeout in 9 second(s) ..1 cpu is not in kdb, its state is unknown Entering kdb (current=0xffff880e989cc040, pid 95204) on processor 7 Oops: (null) due to oops @ 0x0 kdba_dumpregs: pt_regs not available, use bt* or pid to select a different task [7]kdb> [-- MARK -- Fri Jul 29 19:00:00 2016] s394 Fri Jul 29 - 1922 e2fsck: nbp6-oss15 ~ # /usr/local/bin/ldev -c /usr/local/etc/ldev-nbp6.conf /sbin/e2fsck -p %d |tee /root/e2fsck.160729 nbp6-OST00C2: nbp6-OST00c2: clean, 64886/7397376 files, 823185966/1893728256 blocks nbp6-OST00E2: nbp6-OST00e2: clean, 64936/7397376 files, 763956787/1893728256 blocks nbp6-OST00DE: nbp6-OST00de: clean, 64459/7397376 files, 766022848/1893728256 blocks nbp6-OST00E6: nbp6-OST00e6: clean, 64496/7397376 files, 766103433/1893728256 blocks nbp6-OST00CE: nbp6-OST00ce: clean, 65124/7397376 files, 756735149/1893728256 blocks nbp6-OST00DA: nbp6-OST00da: clean, 65162/7397376 files, 758145783/1893728256 blocks nbp6-OST00EE: nbp6-OST00ee: clean, 65088/7397376 files, 759562710/1893728256 blocks nbp6-OST00BE: nbp6-OST00be: clean, 64565/7397376 files, 764337531/1893728256 blocks nbp6-OST00B6: nbp6-OST00b6: clean, 64453/7397376 files, 771893757/1893728256 blocks nbp6-OST00EA: nbp6-OST00ea: clean, 64333/7397376 files, 772395615/1893728256 blocks nbp6-OST00C6: nbp6-OST00c6: clean, 64955/7397376 files, 759923311/1893728256 blocks nbp6-OST00BA: nbp6-OST00ba: clean, 64945/7397376 files, 757944974/1893728256 blocks nbp6-OST00CA: nbp6-OST00ca: clean, 64103/7397376 files, 777560549/1893728256 blocks nbp6-OST00D2: nbp6-OST00d2 contains a file system with errors, check forced. nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 0:actual (15863808, 319) != expected (36864, 4) nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 30009:actual (2871015219200, 58476) != expected (2656438280192, 54656) nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 9225:actual (10485760, 1) != expected (0, 0) nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 11913:actual (50065829888, 11) != expected (22638972928, 9) nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 11968:actual (166439620608, 5272) != expected (114509180928, 3829) nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 310286060:actual (7212900352, 10) != expected (21749760, 11) nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 12031:actual (2412544, 30) != expected (638976, 8) nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 11810:actual (10202492928, 6) != expected (10199027712, 5) nbp6-OST00D2: nbp6-OST00d2: Update quota info for quota type 0. nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 0:actual (15863808, 319) != expected (36864, 4) nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 41007:actual (3001073184768, 59307) != expected (1945474367488, 38766) nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 1125:actual (10485760, 1) != expected (0, 0) nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 41557:actual (156396912640, 5146) != expected (104593395712, 3727) nbp6-OST00D2: [QUOTA WARNING] Usage inconsistent for ID 41020:actual (7191187456, 1) != expected (0, 0) nbp6-OST00D2: nbp6-OST00d2: Update quota info for quota type 1. nbp6-OST00D2: nbp6-OST00d2: 64800/7397376 files (38.0% non-contiguous), 773234595/1893728256 blocks nbp6-OST00D6: nbp6-OST00d6 contains a file system with errors, check forced. nbp6-OST00D6: [QUOTA WARNING] Usage inconsistent for ID 0:actual (15863808, 312) != expected (36864, 4) nbp6-OST00D6: [QUOTA WARNING] Usage inconsistent for ID 30009:actual (2823348002816, 58633) != expected (3762248306688, 77514) nbp6-OST00D6: [QUOTA WARNING] Usage inconsistent for ID 11632:actual (222023680, 21) != expected (222011392, 18) nbp6-OST00D6: [QUOTA WARNING] Usage inconsistent for ID 11968:actual (169801756672, 5385) != expected (115151048704, 3902) nbp6-OST00D6: [QUOTA WARNING] Usage inconsistent for ID 12031:actual (3214045184, 46) != expected (503808, 5) nbp6-OST00D6: [QUOTA WARNING] Usage inconsistent for ID 11810:actual (3606089728, 6) != expected (3605233664, 5) nbp6-OST00D6: nbp6-OST00d6: Update quota info for quota type 0. nbp6-OST00D6: [QUOTA WARNING] Usage inconsistent for ID 0:actual (15863808, 312) != expected (36864, 4) nbp6-OST00D6: [QUOTA WARNING] Usage inconsistent for ID 41007:actual (2898631974912, 59507) != expected (3833795325952, 78302) nbp6-OST00D6: [QUOTA WARNING] Usage inconsistent for ID 41557:actual (156144955392, 5256) != expected (101637935104, 3800) nbp6-OST00D6: nbp6-OST00d6: Update quota info for quota type 1. nbp6-OST00D6: nbp6-OST00d6: 65098/7397376 files (36.9% non-contiguous), 746402105/1893728256 blocks s393 Fri Jul 29 - 2123 crash during recovery: LustreError: dumping log to /tmp/lustre-log.1469847950.16311 LDISKFS-fs error (device dm-17): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 4891corrupted: 21438 blocks free in bitmap, 25535 - in gd Aborting journal on device dm-2. LDISKFS-fs error (device dm-17): ldiskfs_journal_start_sb: Detected aborted journal Kernel panic - not syncing: LDISKFS-fs panic from previous error Pid: 23565, comm: ll_ost_io00_ 0 5125 oTauit ntofe d:1 6G c p us in Wkd b -, -w--a-it-i--n-g- f--o-r- t-h e re2s.6t., 3t2-i5me0o4.u3t 0.i3n. 1e0l6 s.e20c1o5nd10(0s)8.x 86_64.lustre253 #1 Call Trace: [<ffffffff81564fb9>] ? panic+0xa7/0x190 [<ffffffffa10c9be0>] ? ldiskfs_remount+0x0/0x590 [ldiskfs] [<ffffffff811c2130>] ? sync_buffer+0x0/0x50 [<ffffffffa10ca3c8>] ? ldiskfs_journal_start_sb+0xa8/0x170 [ldiskfs] [<ffffffff81566a58>] ? out_of_line_wait_on_bit+0x78/0x90 [<ffffffffa10962e0>] ? __ldiskfs_ext_check+0x1c0/0x240 [ldiskfs] [<ffffffff8109e1a0>] ? wake_bit_function+0x0/0x50 [<ffffffffa01e362f>] ? ldiskfs_ext_new_extent_cb+0x1af/0x6cc [fsfilt_ldiskfs] [<ffffffffa1097570>] ? ldiskfs_ext_find_extent+0x2a0/0x360 [ldiskfs] [<ffffffffa1097cd2>] ? ldiskfs_ext_walk_space+0x142/0x310 [ldiskfs] [<ffffffffa01e3480>] ? ldiskfs_ext_new_extent_cb+0x0/0x6cc [fsfilt_ldiskfs] [<ffffffffa01e31cc>] ? fsfilt_map_nblocks+0xcc/0xf0 [fsfilt_ldiskfs] [<ffffffffa01e32f0>] ? fsfilt_ldiskfs_map_ext_inode_pages+0x100/0x200 [fsfilt_ldiskfs] [<ffffffffa01e3475>] ? fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs] [<ffffffffa10ca884>] ? ldiskfs_dquot_initialize+0x94/0xd0 [ldiskfs] [<ffffffffa15ae342>] ? osd_write_commit+0x302/0x620 [osd_ldiskfs] [<ffffffffa176f094>] ? ofd_commitrw_write+0x684/0x11b0 [ofd] [<ffffffffa1771dfd>] ? ofd_commitrw+0x5cd/0xbb0 [ofd] [<ffffffffa0374861>] ? lprocfs_counter_add+0x151/0x1d6 [lvfs] [<ffffffffa170619d>] ? obd_commitrw+0x11d/0x390 [ost] [<ffffffffa170ff71>] ? ost_brw_write+0xea1/0x15d0 [ost] [<ffffffff812b9076>] ? vsnprintf+0x336/0x5e0 [<ffffffffa068f500>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc] [<ffffffffa17165cc>] ? ost_handle+0x439c/0x44d0 [ost] [<ffffffffa0401a44>] ? libcfs_id2str+0x74/0xb0 [libcfs] [<ffffffffa06df0c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc] [<ffffffffa04078d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs] [<ffffffffa06d7a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc] [<ffffffffa06e189d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc] [<ffffffff8100c28a>] ? child_rip+0xa/0x20 [<ffffffffa06e0da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc] [<ffffffff8100c280>] ? child_rip+0x0/0x20 .All cpus are now in kdb Entering kdb (current=0xffff880b0c8d0ab0, pid 23565) on processor 1 Oops: (null) due to oops @ 0x0 kdba_dumpregs: pt_regs not available, use bt* or pid to select a different task [1]kdb> [-- Fri Jul 29 20:07:23 2016] |
| Comment by Andreas Dilger [ 02/Aug/16 ] |
|
Until the IO errors in |
| Comment by Herbert Yeung [ 02/Aug/16 ] |
|
As mentioned in the update to Also, in case it was missed, there appears to be a discrepancy in the device that the journal aborts on compared to where the LIDSKFS-fs error is reported on. In the below example, it indicates that error occurs on device dm-15, yet it aborts the journal on dm-6. Is this expected behavior, incorrect printing of the aborted journal device, or incorrect behavior? s395 Tues Jul 26 - 2000: LDISKFS-fs error (device dm-15): ldiskfs_mb_check_ondisk_bitmap: on-disk bitmap for group 3037corrupted: 31949 blocks free in bitmap, 14347 - in gd Aborting journal on device dm-6. |
| Comment by Peter Jones [ 04/Aug/16 ] |
|
Fan Yong What do you advise here? Peter |
| Comment by nasf (Inactive) [ 05/Aug/16 ] |
This should be related with the using of external journal. Means the Lustre device was on the dm-15, its journal was on the dm-6. Please check for conformation. The panic is because there was bitmap corruption as to the journal was aborted, and you have set the system behaviour on journal abort as "errors=panic", then the ldiskfs will auto panic on the event of journal abort. It you do not want panic, it can be changed as "errors=remount-ro" or "errors=continue". But in your case, since some bitmap corrupted, making the system to run continuously may be dangerous. |
| Comment by nasf (Inactive) [ 24/Aug/16 ] |
|
Any feedback? Thanks! |
| Comment by Nathan Dauchy (Inactive) [ 02/Sep/16 ] |
|
We did run e2fsck on all the servers that use the same backend disk array it corrected some errors. This case can be closed. |
| Comment by nasf (Inactive) [ 02/Sep/16 ] |
|
The issue has been resolved via e2fsck. |
| Comment by Mahmoud Hanafi [ 08/Sep/16 ] |
|
Please close. |