[LU-1294] Segmentation fault running lfsck
Created: 09/Apr/12
Updated: 10/Apr/12
Resolved: 09/Apr/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug
Priority: Blocker
Reporter: Maloo
Assignee: Sarah Liu
Resolution: Duplicate
Votes: 0
Labels: None
Environment:
  • jenkins-arch=x86_64,build_type=server,distro=el6,ib_stack=inkern (x86_64)
  • jenkins-arch=x86_64,build_type=client,distro=el5,ib_stack=inkern (x86_64)

Issue Links:
Related
Severity: 3
Rank (Obsolete): 6420

 Description   

This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/59343d1e-8207-11e1-9c84-525400d2bfa6.

lfsck hit a segmentation fault after printing an invalid argument message. It appears that the "ostdb-7" file is cut off, but that may just be due to the way the output is logged.

lfsck -c -l --mdsdb /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/mdsdb --ostdb /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-0 /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-1 /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-2 /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-3 /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-4 /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-5 /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-6 /mnt/lustre
06:03:09:lfsck 1.41.90.wc4 (01-Sep-2011)
06:03:10:/home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/mdsdb:mdshdr
06:03:10:: Invalid argument
06:03:10:/usr/lib64/lustre/tests/test-framework.sh: line 2772:  8975 Segmentation fault      lfsck -c -l --mdsdb /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/mdsdb --ostdb /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-0 /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-1 /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-2 /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-3 /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-4 /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-5 /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-6 /mnt/lustre
06:03:10: lfsck : @@@@@@ FAIL: lfsck -c -l --mdsdb /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/mdsdb --ostdb  /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-0 /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-1 /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-2 /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-3 /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-4 /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-5 /home/autotest/shared_dir/2012-04-07/230543-7f46bd3a7040/ostdb-6 /mnt/lustre returned 139, should be <= 1 
06:03:11:Dumping lctl log to /logdir/test_logs/2012-04-07/lustre-master-el6-x86_64-el5-x86_64__479__-7f46bd3a7040/lfsck..*.1333879390.log
06:03:15:lfsck returned 0
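
For reference, the "returned 139" in the log above matches the common shell convention that a process killed by signal N is reported with status 128 + N, so 139 corresponds to signal 11 (SIGSEGV on Linux). The following is a minimal sketch of that arithmetic; the helper names are hypothetical and are not part of test-framework.sh or lfsck.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Shell convention (not lfsck-specific): a child killed by signal N is
 * reported with exit status 128 + N. Returns N, or 0 if the status does
 * not look like a death by signal. */
int shell_status_to_signal(int status)
{
    if (status > 128 && status < 256)
        return status - 128;
    return 0;
}

/* Hypothetical helper: print a one-line description of a status. */
void describe_status(int status)
{
    int sig = shell_status_to_signal(status);

    if (sig != 0)
        printf("status %d: killed by signal %d (%s)\n",
               status, sig, strsignal(sig));
    else
        printf("status %d: not a signal death\n", status);
}
```

Calling describe_status(139) reports signal 11, which agrees with the "Segmentation fault" line printed by the shell.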


 Comments   
Comment by Andreas Dilger [ 09/Apr/12 ]

TT-487 tracks automated collection of core dumps. Until that is finished, this test needs to be run by hand, and either a core file attached to this bug or lfsck run under gdb to collect the stack trace (or both).

Comment by Andreas Dilger [ 09/Apr/12 ]

The client and server are running different OS versions: RHEL5 on the client and RHEL6 on the server. This may be the root cause of the problem, since db4 is not a very portable database format. Typically this is not a problem for lfsck, since the databases are only useful for a very short time.

Comment by Peter Jones [ 09/Apr/12 ]

Sarah

Could you please try to reproduce this failure and gather more complete data?

Thanks

Peter

Comment by Peter Jones [ 09/Apr/12 ]

Duplicate of LU-1096.

Comment by Sarah Liu [ 10/Apr/12 ]

(gdb) run -c -l --mdsdb /scratch/mdsdb --ostdb /scratch/ostdb-0 /mnt/lustre
Starting program: /usr/sbin/lfsck -c -l --mdsdb /scratch/mdsdb --ostdb /scratch/ostdb-0 /mnt/lustre
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000
[Thread debugging using libthread_db enabled]
lfsck 1.41.90.wc4 (01-Sep-2011)
/scratch/mdsdb:mdshdr
: Invalid argument

Program received signal SIGSEGV, Segmentation fault.
0x0000003d9e478480 in strlen () from /lib64/libc.so.6
(gdb) bt
#0 0x0000003d9e478480 in strlen () from /lib64/libc.so.6
#1 0x0000003d9e446aae in vfprintf () from /lib64/libc.so.6
#2 0x0000003d9e4e6ae7 in __vfprintf_chk () from /lib64/libc.so.6
#3 0x0000000000403dd9 in log_write (fmt=0x40f950 "%s: error opening mds_hdr in %s: rc %d\n") at lfsck.c:231
#4 0x0000000000406038 in lfsck_run_checks () at lfsck.c:1849
#5 0x00000000004061d7 in main (argc=8, argv=0x7fffffffe918) at lfsck.c:2077
(gdb)
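
The trace above faults in strlen() called from vfprintf(), which is what one would expect when a "%s" conversion receives something that is not a valid string pointer, for example when the arguments passed do not match the format. The trace alone does not show which argument is at fault. As a rough sketch only, a printf-style wrapper can be annotated so that GCC or Clang checks the arguments against the format at compile time. The name log_write_sketch and its signature are assumptions for illustration; this is not the log_write() from lfsck.c.

```c
#include <stdarg.h>
#include <stdio.h>

/* GCC/Clang extension: have the compiler check the variadic arguments
 * against the printf-style format string (reported under -Wformat,
 * which -Wall enables). Expands to nothing on other compilers. */
#if defined(__GNUC__)
#define PRINTF_LIKE(fmt_idx, arg_idx) \
    __attribute__((format(printf, fmt_idx, arg_idx)))
#else
#define PRINTF_LIKE(fmt_idx, arg_idx)
#endif

/* Hypothetical stand-in for a log_write()-style wrapper. */
int log_write_sketch(FILE *out, const char *fmt, ...) PRINTF_LIKE(2, 3);

int log_write_sketch(FILE *out, const char *fmt, ...)
{
    va_list args;
    int rc;

    va_start(args, fmt);
    rc = vfprintf(out, fmt, args);  /* returns characters written, or < 0 */
    va_end(args);
    return rc;
}
```

With the attribute in place, a call such as log_write_sketch(stderr, "%s: rc %d\n", rc) with a missing string argument would be flagged by the compiler instead of failing at run time.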
