I have had a heck of a time with a memory problem. Apparently, the Citizen class is supposed to make memory debugging simple. Well, I have no idea how to use it. It's taken about 2 weeks straight of my time to figure this out. Let's hope it doesn't take you as long.

I had been doing my debugging in C++; perhaps it's easier to use Citizen from Python. If so, I will let you know.

Problem

I am having a memory overrun in my code somewhere. I think it is in an inner loop, where I allocate an lsst::fw::Image<KernelT> kImage(kernelCols, kernelRows). It's being double-freed somehow, and I can't figure out how. This happens on something like 10% of the SuperMACHO images I have been working with. I can't predict under what conditions it crops up, but it does so repeatedly.

*** glibc detected *** free(): invalid pointer: 0x000000000384ede0 ***

Solution 1a : gdb and C

I don't know how to use gdb worth anything. I am a bad, bad programmer. It's killing me right now...

Result : I need to learn how to use gdb better

Solution 1b : gdb and Python

gdb python
run examples/runImageSubtract.py $FWDATA_DIR/SM/sm85.021115_0707.112_8.im $FWDATA_DIR/SM/sm85.021115_0707.112_8.tmpl tests/ImageSubtract_policy.paf

Solution 2 : Valgrind

Valgrind is apparently a good way to check on memory problems, but as far as I can tell it's not interactive. It is, however, verbose. I need to compile both fw and imageproc with scons debug=1. Then I run my code with

valgrind --error-limit=no ./tests/testImageSubtract $FWDATA_DIR/SM/local/8/sm85.021115_0707.112_8.tmpl $FWDATA_DIR/SM/local/8/sm85.021115_0707.112_8.im >& VALGRIND.OUT &

I had to set --error-limit=no because there is a problem with VW, and every time I try to do a pseudoinverse Valgrind says

==20071== Invalid read of size 8
==20071==    at 0x430EBD: vw::math::Matrix<double, 0, 0>::operator()(unsigned, unsigned) (Matrix.h:460)
==20071==  Address 0x7FECF2688 is on thread 1's stack
==20071==
==20071== Invalid write of size 8
==20071==    at 0x44BBBF: void vw::math::svd<vw::math::Matrix<double, 0, 0>, vw::math::Matrix<double, 0, 0>, vw::math::Vector<double, 0>, vw::math::Matrix<double, 0, 0> >(vw::math::Matrix<double, 0, 0> const&, vw::math::Matrix<double, 0, 0>&, vw::math::Vector<double, 0>&, vw::math::Matrix<double, 0, 0>&) (LinearAlgebra.h:196)
==20071==  Address 0x7FECF26C8 is on thread 1's stack

So there is an outstanding memory problem with VW+LAPACK; let's ignore it for now. Maybe this is actually the problem. But the program takes several hundred times longer to execute when running under Valgrind, which unfortunately means my code takes 1 day to finally get to the error. Also, since I have so many VW problems, I need to redirect the output to a file; otherwise it scrolls off my screen.

Well, this is getting closer. I get errors like

==20071== Invalid write of size 8

==20071==    at 0x432360: std::vector<boost::shared_ptr<lsst::fw::Kernel<double> >, std::allocator<boost::shared_ptr<lsst::fw::Kernel<double> > > > lsst::imageproc::computePcaKernelBasis<double>(std::vector<lsst::imageproc::DiffImContainer<double>, std::allocator<lsst::imageproc::DiffImContainer<double> > >&, lsst::mwi::policy::Policy&) (ImageSubtract.cc:876)

==20071==    by 0x429C60: boost::shared_ptr<lsst::fw::LinearCombinationKernel<double> > lsst::imageproc::computePsfMatchingKernelForMaskedImage<float, unsigned short, double>(boost::shared_ptr<lsst::fw::function::Function2<double> >&, boost::shared_ptr<lsst::fw::function::Function2<double> >&, lsst::fw::MaskedImage<float, unsigned short> const&, lsst::fw::MaskedImage<float, unsigned short> const, std::vector<boost::shared_ptr<lsst::fw::Kernel<double> >, std::allocator<lsst::fw::Kernel> > const&, lsst::fw::MaskedImage<float, unsigned short> const&<boost::shared_ptr<lsst::detection::Footprint>, std::allocator<std::vector<boost::shared_ptr<lsst::fw::Kernel<double> >, std::allocator<lsst::fw::Kernel> > const&> > const&, lsst::mwi::policy::Policy&) (ImageSubtract.cc:293)

==20071==  Address 0xE831590 is 0 bytes after a block of size 77,976 alloc'd

At first I thought I had edited my code since starting this run, so that the line numbers wouldn't match up. (Lesson: run this on a non-development computer so that the code remains static.) Well, the lines do match up. On line 876 of ImageSubtract.cc, I have

M(mIdx, ki) = *imageAccessorRow;

If I comment this out, I'll be damned, the code runs! So this seems to be the problem. The number 77,976 helped. I divided it by 26, which is the number of good kernels I was using, and got a non-integer. I divided it by 27, which is the *total* number of kernels I was using, and got an integer (2888). I divided that by 19 x 19 = 361, which is the number of pixels in a kernel, and got 8 (for 8 bytes). This gave me the idea that it was an array of doubles holding the kernels that was going bad somewhere.
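The arithmetic above can be written down as a small check. This is only a sketch of the reasoning, not code from ImageSubtract.cc; the function name bytesPerPixel is made up for illustration, and the numbers are the ones from the Valgrind message and the paragraph above.

```cpp
#include <cstddef>

// Hypothetical helper (not part of fw or imageproc): if a block of
// `blockBytes` bytes splits evenly into `nKernels` kernels of
// `pixelsPerKernel` pixels each, return the implied bytes per pixel.
// Returns 0 when the block is not a whole number of such pixels.
std::size_t bytesPerPixel(std::size_t blockBytes,
                          std::size_t nKernels,
                          std::size_t pixelsPerKernel) {
    const std::size_t nElements = nKernels * pixelsPerKernel;
    if (nElements == 0 || blockBytes % nElements != 0) {
        return 0;  // does not divide evenly, so this guess at the layout is wrong
    }
    return blockBytes / nElements;
}
```

With the block size from Valgrind, bytesPerPixel(77976, 27, 19 * 19) gives 8, the size of a double, while bytesPerPixel(77976, 26, 19 * 19) gives 0 because 26 kernels do not divide the block evenly.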

Result : Seems to help localize errors, but is VERY slow. WINNER OF THE DEBUGGING CONTEST!!!


Solution 3 : Electric Fence

If you link your code with -lefence, it is supposed to do strict error checking. It doesn't appear to work here though...

g++ -o tests/testImageSubtract ... -lefence
gdb ./tests/testImageSubtract
run $FWDATA_DIR/SM/local/8/sm85.021115_0707.112_8.tmpl $FWDATA_DIR/SM/local/8/sm85.021115_0707.112_8.im
   [Thread debugging using libthread_db enabled]
   [New Thread 182933496448 (LWP 22468)]
     Electric Fence 2.2.0 Copyright (C) 1987-1999 Bruce Perens <bruce@perens.com>
   ElectricFence Exiting: mprotect() failed: Cannot allocate memory
   Program exited with code 0377.

We need to figure out a way to enable this in SCons.

Result : Does not work currently


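Solution 4 : MALLOC_CHECK_

Set MALLOC_CHECK_ as an environment variable. With a value of 3, glibc prints a diagnostic and aborts when it detects heap corruption.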

becker49: setenv MALLOC_CHECK_ 3
becker50: ./tests/testImageSubtract $FWDATA_DIR/SM/local/8/sm85.021115_0707.112_8.tmpl $FWDATA_DIR/SM/local/8/sm85.021115_0707.112_8.im
 ...

Nothing exciting happened, just

*** glibc detected *** free(): invalid pointer: 0x000000000384ede0 ***

Try in GDB again so I can interact... This finds the error at a slightly different place, in ~Matrix this time.

#10 0x000000000043afc3 in lsst::imageproc::computePcaKernelBasis<double> (diffImContainerList=@0x7fbfffc390,
    policy=@0x7fbfffca50) at include/lsst/imageproc/ImageSubtract.cc:1058
1058        return kernelPcaBasisList;


#9  0x000000000043811b in ~Matrix (this=0x7fbfffbd50)
    at /lsst/lsst_root_old/Linux/external/visionWorkbench/1.0.1/include/vw/Math/Matrix.h:1886
1886        Matrix<double> id(size,size);

I went through PCA.cc carefully to make sure that my Matrix sizes were correct. I did find a bug; somehow std::fill(eVec.begin(), eVec.end(), 0.0) got in there, where eVec is a VW vector, not a std::vector. I wonder if this was causing the problems? The call should instead be vw::math::fill(eVec, 0.0). This fix didn't solve the problem...
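One way to check matrix sizes without reading the code line by line is to route the fill through a bounds-checked accessor, so that an out-of-range write like M(mIdx, ki) fails at the offending line instead of a day into a Valgrind run. The sketch below is an assumption about how that could look; CheckedMatrix and writeIsOutOfRange are hypothetical names and are not the vw::math::Matrix API.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a dense matrix of doubles; NOT vw::math::Matrix.
class CheckedMatrix {
public:
    CheckedMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    // Element access that throws std::out_of_range instead of
    // silently writing past the end of the allocated block.
    double& operator()(std::size_t row, std::size_t col) {
        if (row >= rows_ || col >= cols_) {
            throw std::out_of_range("CheckedMatrix index out of range");
        }
        return data_[row * cols_ + col];
    }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<double> data_;
};

// Attempts a write at (row, col); returns true if the accessor
// rejected it with std::out_of_range, false if the write succeeded.
bool writeIsOutOfRange(CheckedMatrix& m, std::size_t row, std::size_t col) {
    try {
        m(row, col) = 1.0;
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}
```

For a matrix sized like the block Valgrind reported, CheckedMatrix m(19 * 19, 27), a write at (360, 26) succeeds, while a write at (360, 27) or (361, 0) is rejected immediately.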

Result : Does not seem to give any more info than just running it, unless you use gdb. It does, however, crash with different errors.