I have had a heck of a time with a memory problem. Apparently, the Citizen class is supposed to make memory debugging simple. Well, I have no idea how to use it. It's taken about 2 weeks straight of my time to figure this out. Let's hope it doesn't take you as long.
I had been doing my debugging in C; perhaps it's easier to use Citizen from Python. If so I will let you know.
Problem
I am having a memory overrun in my code somewhere. I think it is in an inner loop, where I allocate an lsst::fw::Image<KernelT> kImage(kernelCols, kernelRows);. It's being double-freed somehow, and I can't figure out how. This happens on something like 10% of the SuperMACHO images I have been working with. I can't predict under what conditions it crops up, but it does so repeatedly.
*** glibc detected *** free(): invalid pointer: 0x000000000384ede0 ***
Solution 1a : gdb and C
I don't know how to use gdb worth anything. I am a bad, bad programmer. It's killing me right now...
Result : I need to learn how to use gdb better
Solution 1b : gdb and Python
gdb python
run examples/runImageSubtract.py $FWDATA_DIR/SM/sm85.021115_0707.112_8.im $FWDATA_DIR/SM/sm85.021115_0707.112_8.tmpl tests/ImageSubtract_policy.paf
Solution 2 : Valgrind
Valgrind is apparently a good way to check on memory problems, but as far as I can tell it's not interactive. It is, however, verbose. I need to compile both fw and imageproc with scons debug=1. Then I run my code with
valgrind --error-limit=no ./tests/testImageSubtract $FWDATA_DIR/SM/local/8/sm85.021115_0707.112_8.tmpl $FWDATA_DIR/SM/local/8/sm85.021115_0707.112_8.im >& VALGRIND.OUT &
I had to set --error-limit=no because there is a problem with VW, and every time I try to do a pseudoinverse Valgrind says
==20071== Invalid read of size 8
==20071==    at 0x430EBD: vw::math::Matrix<double, 0, 0>::operator()(unsigned, unsigned) (Matrix.h:460)
==20071==  Address 0x7FECF2688 is on thread 1's stack
==20071==
==20071== Invalid write of size 8
==20071==    at 0x44BBBF: void vw::math::svd<vw::math::Matrix<double, 0, 0>, vw::math::Matrix<double, 0, 0>, vw::math::Vector<double, 0>, vw::math::Matrix<double, 0, 0> >(vw::math::Matrix<double, 0, 0> const&, vw::math::Matrix<double, 0, 0>&, vw::math::Vector<double, 0>&, vw::math::Matrix<double, 0, 0>&) (LinearAlgebra.h:196)
==20071==  Address 0x7FECF26C8 is on thread 1's stack
So there is an outstanding memory problem with VW+LAPACK; let's ignore it for now. (Maybe this is actually the problem.) But the program takes a factor of several hundred longer to execute when running under Valgrind, which unfortunately means my code takes 1 day to finally get to the error. Also, since I have so many VW problems, I need to redirect the output to a file; otherwise it scrolls off my screen.
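Because the VW messages swamp everything else in VALGRIND.OUT, it can help to thin the file out before reading it. The following is only a minimal sketch in Python, assuming Valgrind's usual layout (every line prefixed with ==PID==, and reports separated by a blank ==PID== line). It drops any report whose top "at" frame mentions vw::math::. The names split_records, top_frame and drop_noise are made up for this illustration and are not part of any LSST or Valgrind tool.

```python
import re

# Matches the "==PID==" prefix that Valgrind puts on each line of its report.
PREFIX = re.compile(r"^==\d+==\s?")


def split_records(log_text):
    """Split a Valgrind log into reports, using blank '==PID==' lines as separators."""
    records, current = [], []
    for line in log_text.splitlines():
        match = PREFIX.match(line)
        if match is None:
            continue  # not Valgrind output (e.g. the program's own stdout)
        if line[match.end():].strip():
            current.append(line)
        elif current:
            records.append(current)
            current = []
    if current:
        records.append(current)
    return records


def top_frame(record):
    """Return the text of the first 'at 0x...' line of a report, or '' if there is none."""
    for line in record:
        body = PREFIX.sub("", line).strip()
        if body.startswith("at 0x"):
            return body
    return ""


def drop_noise(log_text, noise="vw::math::"):
    """Keep only the reports whose top frame does not mention `noise` (a heuristic)."""
    kept = [r for r in split_records(log_text) if noise not in top_frame(r)]
    return "\n\n".join("\n".join(r) for r in kept)
```

Calling drop_noise on the contents of VALGRIND.OUT would return the remaining reports as one string. Filtering like this can hide a real bug that happens to surface inside VW, so treat it as a reading aid only. Valgrind also has a built-in route: --gen-suppressions=all prints suppression entries, --suppressions=<file> applies them, and --log-file=<file> writes the report to a file without shell redirection.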
Well, this is getting closer. I get errors like
==20071== Invalid write of size 8
==20071==    at 0x432360: std::vector<boost::shared_ptr<lsst::fw::Kernel<double> >, std::allocator<boost::shared_ptr<lsst::fw::Kernel<double> > > > lsst::imageproc::computePcaKernelBasis<double>(std::vector<lsst::imageproc::DiffImContainer<double>, std::allocator<lsst::imageproc::DiffImContainer<double> > >&, lsst::mwi::policy::Policy&) (ImageSubtract.cc:876)
==20071==    by 0x429C60: boost::shared_ptr<lsst::fw::LinearCombinationKernel<double> > lsst::imageproc::computePsfMatchingKernelForMaskedImage<float, unsigned short, double>(boost::shared_ptr<lsst::fw::function::Function2<double> >&, boost::shared_ptr<lsst::fw::function::Function2<double> >&, lsst::fw::MaskedImage<float, unsigned short> const&, lsst::fw::MaskedImage<float, unsigned short> const, std::vector<boost::shared_ptr<lsst::fw::Kernel<double> >, std::allocator<lsst::fw::Kernel> > const&, lsst::fw::MaskedImage<float, unsigned short> const&<boost::shared_ptr<lsst::detection::Footprint>, std::allocator<std::vector<boost::shared_ptr<lsst::fw::Kernel<double> >, std::allocator<lsst::fw::Kernel> > const&> > const&, lsst::mwi::policy::Policy&) (ImageSubtract.cc:293)
==20071==  Address 0xE831590 is 0 bytes after a block of size 77,976 alloc'd
I had edited my code after starting this run, so I expected the line numbers not to match up. (Better to run this on a non-development computer so that the code remains static.) Well, the lines do match up. On line 876 of ImageSubtract.cc, I have
M(mIdx, ki) = *imageAccessorRow;
If I comment this out, I'll be damned, the code runs! Seems like this is the problem. The number 77976 helped. I divided this by 26, which is the number of good kernels I was using, and got a non-integer. I divided by 27, which is the *total* number of kernels I was using, and got an integer (2888). I divided this by 19 x 19, which is the number of pixels in a kernel, and got 8 (for 8 bytes). This suggested that the block was an array of doubles holding all 27 kernels, and that something was writing just past its end.
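The arithmetic above can be written down as a small check. This is just a sketch in Python using the numbers from this run (19 x 19 kernels, 8-byte doubles, 26 good and 27 total kernels); the names block_bytes and matching_kernel_count are invented for the example.

```python
KERNEL_COLS = 19
KERNEL_ROWS = 19
BYTES_PER_DOUBLE = 8


def block_bytes(n_kernels):
    """Size in bytes of a block holding n_kernels kernels of doubles."""
    return n_kernels * KERNEL_COLS * KERNEL_ROWS * BYTES_PER_DOUBLE


def matching_kernel_count(observed_bytes, candidates):
    """Return the candidate kernel counts whose block size equals observed_bytes."""
    return [n for n in candidates if block_bytes(n) == observed_bytes]


print(block_bytes(26))                          # 75088
print(block_bytes(27))                          # 77976
print(matching_kernel_count(77976, [26, 27]))   # [27]
```

Only the total kernel count (27) reproduces the 77,976 bytes Valgrind reported, which is what points at an array sized by all kernels rather than by the good ones.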
Result : Seems to help localize errors, but is VERY slow. WINNER OF THE DEBUGGING CONTEST!!!
Solution 3 : Electric Fence
If you link your code with -lefence, it is supposed to do strict error checking. It doesn't appear to work here though...
g++ -o tests/testImageSubtract ... -lefence
gdb ./tests/testImageSubtract
run $FWDATA_DIR/SM/local/8/sm85.021115_0707.112_8.tmpl $FWDATA_DIR/SM/local/8/sm85.021115_0707.112_8.im
[Thread debugging using libthread_db enabled]
[New Thread 182933496448 (LWP 22468)]
Electric Fence 2.2.0 Copyright (C) 1987-1999 Bruce Perens <bruce@perens.com>
ElectricFence Exiting: mprotect() failed: Cannot allocate memory
Program exited with code 0377.
We need to figure out a way to enable this in SCons.
Result : Does not work currently
Solution 4 : MALLOC_CHECK_
Set MALLOC_CHECK_ as an environment variable.
becker49: setenv MALLOC_CHECK_ 3
becker50: ./tests/testImageSubtract $FWDATA_DIR/SM/local/8/sm85.021115_0707.112_8.tmpl $FWDATA_DIR/SM/local/8/sm85.021115_0707.112_8.im ...
Nothing exciting happened, just
*** glibc detected *** free(): invalid pointer: 0x000000000384ede0 ***
I tried it in gdb again so I could interact. This finds the error at a slightly different place, in ~Matrix this time.
#10 0x000000000043afc3 in lsst::imageproc::computePcaKernelBasis<double> (diffImContainerList=@0x7fbfffc390,
policy=@0x7fbfffca50) at include/lsst/imageproc/ImageSubtract.cc:1058
1058 return kernelPcaBasisList;
#9 0x000000000043811b in ~Matrix (this=0x7fbfffbd50)
at /lsst/lsst_root_old/Linux/external/visionWorkbench/1.0.1/include/vw/Math/Matrix.h:1886
1886 Matrix<double> id(size,size);
I went through PCA.cc very carefully to make sure that my Matrix sizes were correct. I did find a bug; somehow std::fill(eVec.begin(), eVec.end(), 0.0) got in there, where eVec is a VW vector, not a std::vector. I wondered whether this was causing the problems. Instead say vw::math::fill(eVec, 0.0). This fix didn't solve the problem...
Result : Does not seem to give any more info than just running it, unless you use gdb. It does, however, crash with different errors.
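If the test is driven from Python rather than from csh, the variable can be set for the child process only, instead of with setenv for the whole shell. This is a minimal sketch, assuming Python 3.7 or later; the helper names malloc_check_env and run_with_malloc_check are hypothetical. As far as I understand the glibc mallopt documentation, the value 3 asks for a diagnostic message and then an abort.

```python
import os
import subprocess


def malloc_check_env(level=3):
    """Return a copy of the current environment with MALLOC_CHECK_ set to `level`."""
    return dict(os.environ, MALLOC_CHECK_=str(level))


def run_with_malloc_check(command, level=3):
    """Run `command` (a list of arguments) with MALLOC_CHECK_ set for that process only."""
    return subprocess.run(
        command,
        env=malloc_check_env(level),
        capture_output=True,  # collect stdout and stderr for later inspection
        text=True,
    )
```

The command list would hold the test binary followed by its two image arguments, and any glibc message would then be in the returned object's stderr attribute.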
