Thanks to Serge for this, by the way, who showed me how to do all this.
Problem
When I Scons and then run my code with opt=2, I come across a bug that is not seen with opt=1 or no opt.
cd $IP_DIFFIM_DIR/ scons opt=2 python examples/testKernelModel.py Segmentation fault
Long and drawn out Solution Step 1 : Find out where its going bad
gdb python
run examples/testKernelModel.py -v 5
[Switching to Thread 47505733450224 (LWP 12490)]
0x00002b34db254fa6 in lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short> (
imageToConvolve=@0xde24ed0, imageToNotConvolve=@0xdee25c0, policy=@0xffffffff)
at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h:270
270 *__result = *__first;
Current language: auto; currently c++
Another way to investigate this, using python debugger. Add to the python script
import pdb # go to line before this is called and type pdb.set_trace()
Run the script
python examples/testKernelModel.py -v 5
In another window, find the job ID that this is running (12532). Go into gdb
gdb attach 12532 continue
Go back into (Pdb) window and also type
continue
And you get in your (gdb) window
[Switching to Thread 47347301497328 (LWP 12532)]
0x00002b0ff7dd8fa6 in lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short> (
imageToConvolve=@0x1b1e0540, imageToNotConvolve=@0x1b1cb0c0, policy=@0xffffffff)
at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h:270
270 *__result = *__first;
Clearly a problem with getCollectionOfFootprintsForPsfMatching.
Step 2 : Set breakpoints in Swig wrapper
The wrapper function will contain "getCollectionOfFootprintsForPsfMatching" in its name, so look inside diffimLib_wrap.cc to find
python/lsst/ip/diffim/diffimLib_wrap.cc:41506
SWIGINTERN PyObject *_wrap_getCollectionOfFootprintsForPsfMatching(PyObject *self, PyObject *args) {
Restart pdb
python examples/testKernelModel.py -v 5
and gdb
gdb attach 12562
Now set a breakpoint in the Swig wrapper. Do this like
(gdb) break python/lsst/ip/diffim/diffimLib_wrap.cc:41506 Breakpoint 1 at 0x2b98433e7c69: file python/lsst/ip/diffim/diffimLib_wrap.cc, line 41506.
Or, since gdb was started after all the python modules were loaded
(gdb) break _wrap_getCollectionOfFootprintsForPsfMatching Breakpoint 1 at 0x2b98433e7c50: file python/lsst/ip/diffim/diffimLib_wrap.cc, line 41505.
I was off by a line; you need to set the breakpoint at the first command within the subroutine. Continue in gdb
(gdb) continue
and continue in pdb
(Pdb) continue
In the gdb window next through each subroutine until you hit the problem
(gdb) next
41527 return _wrap_getCollectionOfFootprintsForPsfMatching__SWIG_1(self, args);
(gdb) next
Program received signal SIGSEGV, Segmentation fault.
0x00002b98436d2fa6 in lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short> (
imageToConvolve=@0x11479540, imageToNotConvolve=@0x114640c0, policy=@0xffffffff)
at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h:270
270 *__result = *__first;
OK, check out _wrap_getCollectionOfFootprintsForPsfMatching__SWIG_1. Restart the whole process, when you get to _SWIG_1, step not next.
(gdb) next
41527 return _wrap_getCollectionOfFootprintsForPsfMatching__SWIG_1(self, args);
(gdb) step
_wrap_getCollectionOfFootprintsForPsfMatching__SWIG_1 (args=0xb698e60) at python/lsst/ip/diffim/diffimLib_wrap.cc:41381
41381 SWIGINTERN PyObject *_wrap_getCollectionOfFootprintsForPsfMatching__SWIG_1(PyObject *SWIGUNUSEDPARM(self), PyObject *args) {
Start nexting through this
(gdb) next
41424 result = lsst::ip::diffim::SWIGTEMPLATEDISAMBIGUATOR getCollectionOfFootprintsForPsfMatching< float,lsst::afw::image::maskPixelType >((lsst::afw::image::MaskedImage< float,boost::uint16_t > const &)*arg1,(lsst::afw::image::MaskedImage< float,boost::uint16_t > const &)*arg2,*arg3);
(gdb) next
Program received signal SIGSEGV, Segmentation fault.
0x00002b41d382afa6 in lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short> (
imageToConvolve=@0xb67f540, imageToNotConvolve=@0xb66a0c0, policy=@0xffffffff)
at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h:270
270 *__result = *__first;
Looks like the problem is in the C-code.
Step 3 : Set breakpoints in the C-code
Try
(gdb) break getCollectionOfFootprintsForPsfMatching Function "getCollectionOfFootprintsForPsfMatching" not defined. Make breakpoint pending on future shared library load? (y or [n])
Hmm... OK, go to code and find out where this is being defined. Line 248. Set break at 247.
(gdb) break src/ImageSubtract.cc:247 Breakpoint 1 at 0x2b5bf5b89670: file src/ImageSubtract.cc, line 247.
and then
(gdb) continue (Pdb) continue # did not work, hit bug right away!
I can't seem to get the code to break in the middle of the source code. Mail Serge.
Interesting option : rbreak regexp will set breakpoints at anything matching the regexp. E.g.
(gdb) rbreak getCollectionOfFootprintsForPsfMatching Breakpoint 1 at 0x2aab8a6ddc50: file python/lsst/ip/diffim/diffimLib_wrap.cc, line 41505. static PyObject *_wrap_getCollectionOfFootprintsForPsfMatching(PyObject *, PyObject *); Breakpoint 2 at 0x2aab8a6dc200: file python/lsst/ip/diffim/diffimLib_wrap.cc, line 41381. static PyObject *_wrap_getCollectionOfFootprintsForPsfMatching__SWIG_1(PyObject *, PyObject *); Breakpoint 3 at 0x2aab8a6dd3b0: file python/lsst/ip/diffim/diffimLib_wrap.cc, line 41443. static PyObject *_wrap_getCollectionOfFootprintsForPsfMatching__SWIG_2(PyObject *, PyObject *); Breakpoint 4 at 0x2aab8a9c5670: file src/ImageSubtract.cc, line 253. std::vector<boost::shared_ptr<lsst::detection::Footprint>,std::allocator<boost::shared_ptr<lsst::detection::Footprint> > > std::vector<boost::shared_ptr<lsst::detection::Footprint>, std::allocator<boost::shared_ptr<lsst::detection::Footprint> > > lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<double, unsigned short>(lsst::afw::image::MaskedImage<double, unsigned short> const&, lsst::afw::image::MaskedImage<double, unsigned short> const&, lsst::pex::policy::Policy&); Breakpoint 5 at 0x2aab8a9c8a00: file src/ImageSubtract.cc, line 253. std::vector<boost::shared_ptr<lsst::detection::Footprint>,std::allocator<boost::shared_ptr<lsst::detection::Footprint> > > std::vector<boost::shared_ptr<lsst::detection::Footprint>, std::allocator<boost::shared_ptr<lsst::detection::Footprint> > > lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short>(lsst::afw::image::MaskedImage<float, unsigned short> const&, lsst::afw::image::MaskedImage<float, unsigned short> const&, lsst::pex::policy::Policy&);
So try and set my ImageSubtract?.cc break at line 253. Note this is where the close parenthesis is to start the subroutine.
(gdb) break src/ImageSubtract.cc:253
Breakpoint 1 at 0x2b6f520b7670: file src/ImageSubtract.cc, line 253.
(gdb) continue
Continuing.
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 47756836362736 (LWP 16568)]
0x00002b6f520bafa6 in lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short> (
imageToConvolve=@0x1e3fe540, imageToNotConvolve=@0x1e3e90c0, policy=@0xffffffff)
at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h:270
270 *__result = *__first;
Still did not work... Interestingly, if I...
(gdb) rbreak getCollectionOfFootprintsForPsfMatching (gdb) continue Breakpoint 1, _wrap_getCollectionOfFootprintsForPsfMatching (self=0x0, args=0x127c1e60) (gdb) continue Breakpoint 2, _wrap_getCollectionOfFootprintsForPsfMatching__SWIG_1 (args=0x127c1e60) (gdb) continue Breakpoint 5, lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short> (
I can get there! Nice...
By the way, the problem probably has to do with optimization mangling things. One way to get the *exact* function name is
nm lib/libip_diffim.so | grep getCollectionOfFootprintsForPsfMatching 0000000000033670 W _ZN4lsst2ip6diffim39getCollectionOfFootprintsForPsfMatchingIdtEESt6vectorIN5boost10shared_ptrINS_9detection9FootprintEEESaIS8_EERKNS_3afw5image11MaskedImageIT_T0_EESI_RNS_3pex6policy6PolicyE # and then (gdb) break _ZN4lsst2ip6diffim39getCollectionOfFootprintsForPsfMatchingIdtEESt6vectorIN5boost10shared_ptrINS_9detection9FootprintEEESaIS8_EERKNS_3afw5image11MaskedImageIT_T0_EESI_RNS_3pex6policy6PolicyE Breakpoint 1 at 0x2aac61613670: file src/ImageSubtract.cc, line 253.
DANG this did not work either... It looks like due to optimization a bunch of things got inlined into the function().
Note in passing : apparently I should not have grabbed the ...PsfMatchingIdtEESt... version but instead the ...PsfMatchingIftEESt... version and this would have worked. Or Serge will eat his shorts. But I have now gotten around the bug and don't want to go through this again. Note for next time.
Anyways, once I get in there with rbreak...
(gdb) next
# at this stage, just keep hitting <ENTER> and it will keep going next. do this until its clear you're in a loop...
888 void increment() { ++i1; ++i2; }
270 *__result = *__first;
269 for (; __first != __last; ++__result, ++__first)
886 bool equal( iterator const& iter ) const { return (i1==iter.i1) && (i2==iter.i2); }
270 *__result = *__first;
888 void increment() { ++i1; ++i2; }
# above was the loop (look familiar!?; *__result = *__first;)
(gdb) print /x $pc
$1 = 0x2ac62e82cfa8
(gdb) break *0x2ac62e82cfa8
Breakpoint 6 at 0x2ac62e82cfa8: file /lsst/becker/lsst_devel/DMS/vw/vw-1.0.1+3a/include/vw/Math/Vector.h, line 888.
(gdb) continue
# YES, this gets me out of the loop! or at least that iteration. continue N tries to get out of N-1 iterations.
(gdb) continue 10000
Program received signal SIGSEGV, Segmentation fault.
0x00002ac62e82cfa6 in lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short> (
imageToConvolve=@0x5178540, imageToNotConvolve=@0x51630c0, policy=@0xffffffff)
at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h:270
270 *__result = *__first;
Note that the instruction where it dies (0x00002ac62e82cfa6) is right around the loop I was stuck in (0x2ac62e82cfa8). Time to get down and dirty...
Step 4 : Look at Assembly Code
To get assembly for the function you're currently in inside gdb, type
disassemble
Now, the segfault output is
0x00002ac62e82cfa6 in lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short> (
imageToConvolve=@0x5178540, imageToNotConvolve=@0x51630c0, policy=@0xffffffff)
at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h:270
270 *__result = *__first;
The first number, 0x00002ac62e82cfa6, is the address of the instruction that caused the seg fault. You can access this using gdb's architecture independent instruction pointer register, pc (for program counter). Also this is $rip, the instruction pointer register in x86-64 architectures.
(gdb) print /x $pc $2 = 0x2ac62e82cfa6
Additional ways to get info
(gdb) info frame
Stack level 0, frame at 0x7fff8d4eff90:
rip = 0x2ac62e82cfa6
in std::vector<boost::shared_ptr<lsst::detection::Footprint>, std::allocator<boost::shared_ptr<lsst::detection::Footprint> > > lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short>(lsst::afw::image::MaskedImage<float, unsigned short> const&, lsst::afw::image::MaskedImage<float, unsigned short> const&, lsst::pex::policy::Policy&)
(/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h:270);
saved rip 0x201d2cfb2f399b3e
called by frame at 0x7fff8d4eff98
source language c++.
Arglist at 0x7fff8d4efb48, args: imageToConvolve=@0x5178540, imageToNotConvolve=@0x51630c0, policy=@0xffffffff
Locals at 0x7fff8d4efb48, Previous frame's sp is 0x7fff8d4eff90
Saved registers:
rbx at 0x7fff8d4eff58, rbp at 0x7fff8d4eff60, r12 at 0x7fff8d4eff68, r13 at 0x7fff8d4eff70, r14 at 0x7fff8d4eff78,
r15 at 0x7fff8d4eff80, rip at 0x7fff8d4eff88
And
(gdb) info registers rax 0x1d6 470 rbx 0x7fff8d4eff00 140735564152576 rcx 0x7fff8d4efc80 140735564151936 rdx 0x7fff8d4f5000 140735564173312 rsi 0x7fff8d4f4ff0 140735564173296 rdi 0x7fff8d4efe38 140735564152376 rbp 0xffffffff 0xffffffff rsp 0x7fff8d4efb50 0x7fff8d4efb50 r8 0x7fff8d4efe40 140735564152384 r9 0x7fff8d4efca0 140735564151968 r10 0x3 3 r11 0x36cfd4ea10 235415071248 r12 0x0 0 r13 0x0 0 r14 0x5352590 87369104 r15 0x5191d20 85531936 rip 0x2ac62e82cfa6 0x2ac62e82cfa6 <std::vector<boost::shared_ptr<lsst::detection::Footprint>, std::allocator<boost::shared_ptr<lsst::detection::Footprint> > > lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short>(lsst::afw::image::MaskedImage<float, unsigned short> const&, lsst::afw::image::MaskedImage<float, unsigned short> const&, lsst::pex::policy::Policy&)+1446> eflags 0x10283 [ CF SF IF RF ] cs 0x33 51 ss 0x2b 43 ds 0x0 0 es 0x0 0 fs 0x0 0 gs 0x0 0
Notice $rip is 0x2ac62e82cfa6 as we expect. Anyways, we have found the instruction where it all goes bad! It has something to do with vw/Math/Vector.h. I want to be really good; get back in the loop and wait until you get to line 270 in the loop and do the print /x $pc to see where in the code I am (that is, what file I am on line 270 of).
270 *__result = *__first; (gdb) print /x $pc $1 = 0x2b5e2a72bfa4 (gdb) break *0x2b5e2a72bfa4 Breakpoint 6 at 0x2b5e2a72bfa4: file /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h, line 270.
Doesn't totally help. In fact, the other context does help much more...
My disassemble says...
0x00002ac62e82cf9a <+1434>: jmp 0x2ac62e82cfb7 <+1463> 0x00002ac62e82cf9c <+1436>: nopl 0x0(%rax) 0x00002ac62e82cfa0 <+1440>: mov 0x8(%rcx),%rdx 0x00002ac62e82cfa4 <+1444>: mov (%rdi),%eax 0x00002ac62e82cfa6 <+1446>: add (%rdx),%eax 0x00002ac62e82cfa8 <+1448>: addq $0x4,(%rcx) 0x00002ac62e82cfac <+1452>: addq $0x4,0x8(%rcx) 0x00002ac62e82cfb1 <+1457>: mov %eax,(%rsi) 0x00002ac62e82cfb3 <+1459>: add $0x4,%rsi 0x00002ac62e82cfb7 <+1463>: cmp %r8,%rdi 0x00002ac62e82cfba <+1466>: jne 0x2ac62e82cfa0 <+1440> 0x00002ac62e82cfbc <+1468>: mov 0x8(%rcx),%rdx 0x00002ac62e82cfc0 <+1472>: cmp 0x8(%r9),%rdx 0x00002ac62e82cfc4 <+1476>: jne 0x2ac62e82cfa4 <+1444> 0x00002ac62e82cfc6 <+1478>: mov 0x3c0(%rsp),%eax 0x00002ac62e82cfcd <+1485>: cmp 0x2e8(%rsp),%eax 0x00002ac62e82cfd4 <+1492>: jle 0x2ac62e82cfdd <+1501> 0x00002ac62e82cfd6 <+1494>: mov %eax,0x2e8(%rsp) 0x00002ac62e82cfdd <+1501>: cmp %eax,0x2e0(%rsp)
Says Serge :
Execution flows from the top, hits the unconditional jump at +1434, and ends up at +1463, at which point the contents of %r8 and %rdi are compared. If they are not equal, a jump backwards to +1440 is performed. Notice that from +1440 to +1466 there are no other jumps and that %rdi/%r8 are never written to. What this means is that if %r8/%rdi ever compare as not equal, they will always compare as not equal. In your example, it turns out that %r8 and %rdi are in fact not equal, so we enter a loop with no way out other than a memory access violation. Taking a look at the inner loop details: <+1440>: mov 0x8(%rcx),%rdx <----------o <+1444>: mov (%rdi),%eax | <+1446>: add (%rdx),%eax <-------------|-- SEGFAULT <+1448>: addq $0x4,(%rcx) | <+1452>: addq $0x4,0x8(%rcx) | <+1457>: mov %eax,(%rsi) | <+1459>: add $0x4,%rsi | <+1463>: cmp %r8,%rdi | <+1466>: jne 0x2ac62e82cfa0 <+1440> ---o The loop repeatedly computes a value for %eax (instructions at +1444,+1446), stores it at the address contained in %rsi (instruction +1457), then increments %rsi (instruction +1459). Instructions +1448 and +1452 look like pointer increments, though only the pointer stored at 0x8(%rcx) is used: it's loaded into %rdx and its content is added to %eax. Eventually, a pointer with an invalid address is generated and the seg-fault is triggered. Depending on which instruction a breakpoint is set on, the debugger claims that - we are inside std::copy (invoked by vw::math::Vector<int, 2> operator=). - we are inside various parts of the vw Vector classes. So it looks like the optimizer tried to merge the implementation details of the following lines from getCollectionOfFootprintsForPsfMatching() in ImageSubtract.cc: 306 vw::BBox2i footprintBBox = (*i)->getBBox(); 307 footprintBBox.grow(footprintBBox.max() + vw::Vector2i(footprintDiffimGrow,footprintDiffimGrow)); and generated what is clearly garbage.
Another way to get the assembly level stuff
/usr/bin/objdump -d src/ImageSubtract.os > ImageSubtract.S
And if you can read this you can fix any bugs. THE END.
