Thanks to Serge for this, by the way, who showed me how to do all this.

Problem

When I Scons and then run my code with opt=2, I come across a bug that is not seen with opt=1 or no opt.

cd $IP_DIFFIM_DIR/
scons opt=2
python examples/testKernelModel.py
  Segmentation fault

Long and drawn out Solution Step 1 : Find out where its going bad

gdb python
run examples/testKernelModel.py -v 5
[Switching to Thread 47505733450224 (LWP 12490)]
0x00002b34db254fa6 in lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short> (
    imageToConvolve=@0xde24ed0, imageToNotConvolve=@0xdee25c0, policy=@0xffffffff)
    at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h:270
270                 *__result = *__first;
Current language:  auto; currently c++

Another way to investigate this, using python debugger. Add to the python script

import pdb
# go to line before this is called and type
pdb.set_trace()

Run the script

python examples/testKernelModel.py -v 5

In another window, find the job ID that this is running (12532). Go into gdb

gdb
attach 12532
continue

Go back into (Pdb) window and also type

continue

And you get in your (gdb) window

[Switching to Thread 47347301497328 (LWP 12532)]
0x00002b0ff7dd8fa6 in lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short> (
    imageToConvolve=@0x1b1e0540, imageToNotConvolve=@0x1b1cb0c0, policy=@0xffffffff)
    at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h:270
270                 *__result = *__first;

Clearly a problem with getCollectionOfFootprintsForPsfMatching.


Step 2 : Set breakpoints in Swig wrapper

 Ways to set breakpoints.

The wrapper function will contain "getCollectionOfFootprintsForPsfMatching" in its name, so look inside diffimLib_wrap.cc to find

python/lsst/ip/diffim/diffimLib_wrap.cc:41506
SWIGINTERN PyObject *_wrap_getCollectionOfFootprintsForPsfMatching(PyObject *self, PyObject *args) {

Restart pdb

python examples/testKernelModel.py -v 5

and gdb

gdb
attach 12562

Now set a breakpoint in the Swig wrapper. Do this like

(gdb) break python/lsst/ip/diffim/diffimLib_wrap.cc:41506
Breakpoint 1 at 0x2b98433e7c69: file python/lsst/ip/diffim/diffimLib_wrap.cc, line 41506.

Or, since gdb was started after all the python modules were loaded

(gdb) break _wrap_getCollectionOfFootprintsForPsfMatching
Breakpoint 1 at 0x2b98433e7c50: file python/lsst/ip/diffim/diffimLib_wrap.cc, line 41505.

I was off by a line; you need to set the breakpoint at the first command within the subroutine. Continue in gdb

(gdb) continue

and continue in pdb

(Pdb) continue

In the gdb window next through each subroutine until you hit the problem

(gdb) next
41527             return _wrap_getCollectionOfFootprintsForPsfMatching__SWIG_1(self, args);
(gdb) next

Program received signal SIGSEGV, Segmentation fault.
0x00002b98436d2fa6 in lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short> (
    imageToConvolve=@0x11479540, imageToNotConvolve=@0x114640c0, policy=@0xffffffff)
    at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h:270
270                 *__result = *__first;

OK, check out _wrap_getCollectionOfFootprintsForPsfMatching__SWIG_1. Restart the whole process, when you get to _SWIG_1, step not next.

(gdb) next
41527             return _wrap_getCollectionOfFootprintsForPsfMatching__SWIG_1(self, args);
(gdb) step
_wrap_getCollectionOfFootprintsForPsfMatching__SWIG_1 (args=0xb698e60) at python/lsst/ip/diffim/diffimLib_wrap.cc:41381
41381   SWIGINTERN PyObject *_wrap_getCollectionOfFootprintsForPsfMatching__SWIG_1(PyObject *SWIGUNUSEDPARM(self), PyObject *args) {

Start nexting through this

(gdb) next
41424         result = lsst::ip::diffim::SWIGTEMPLATEDISAMBIGUATOR getCollectionOfFootprintsForPsfMatching< float,lsst::afw::image::maskPixelType >((lsst::afw::image::MaskedImage< float,boost::uint16_t > const &)*arg1,(lsst::afw::image::MaskedImage< float,boost::uint16_t > const &)*arg2,*arg3);
(gdb) next

Program received signal SIGSEGV, Segmentation fault.
0x00002b41d382afa6 in lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short> (
    imageToConvolve=@0xb67f540, imageToNotConvolve=@0xb66a0c0, policy=@0xffffffff)
    at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h:270
270                 *__result = *__first;

Looks like the problem is in the C-code.


Step 3 : Set breakpoints in the C-code

Try

(gdb) break getCollectionOfFootprintsForPsfMatching
Function "getCollectionOfFootprintsForPsfMatching" not defined.
Make breakpoint pending on future shared library load? (y or [n])

Hmm... OK, go to code and find out where this is being defined. Line 248. Set break at 247.

(gdb) break src/ImageSubtract.cc:247
Breakpoint 1 at 0x2b5bf5b89670: file src/ImageSubtract.cc, line 247.

and then

(gdb) continue
(Pdb) continue
# did not work, hit bug right away!  

I can't seem to get the code to break in the middle of the source code. Mail Serge.

Interesting option : rbreak regexp will set breakpoints at anything matching the regexp. E.g.

(gdb) rbreak getCollectionOfFootprintsForPsfMatching
Breakpoint 1 at 0x2aab8a6ddc50: file python/lsst/ip/diffim/diffimLib_wrap.cc, line 41505.
static PyObject *_wrap_getCollectionOfFootprintsForPsfMatching(PyObject *, PyObject *);
Breakpoint 2 at 0x2aab8a6dc200: file python/lsst/ip/diffim/diffimLib_wrap.cc, line 41381.
static PyObject *_wrap_getCollectionOfFootprintsForPsfMatching__SWIG_1(PyObject *, PyObject *);
Breakpoint 3 at 0x2aab8a6dd3b0: file python/lsst/ip/diffim/diffimLib_wrap.cc, line 41443.
static PyObject *_wrap_getCollectionOfFootprintsForPsfMatching__SWIG_2(PyObject *, PyObject *);
Breakpoint 4 at 0x2aab8a9c5670: file src/ImageSubtract.cc, line 253.
std::vector<boost::shared_ptr<lsst::detection::Footprint>,std::allocator<boost::shared_ptr<lsst::detection::Footprint> > > std::vector<boost::shared_ptr<lsst::detection::Footprint>, std::allocator<boost::shared_ptr<lsst::detection::Footprint> > > lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<double, unsigned short>(lsst::afw::image::MaskedImage<double, unsigned short> const&, lsst::afw::image::MaskedImage<double, unsigned short> const&, lsst::pex::policy::Policy&);
Breakpoint 5 at 0x2aab8a9c8a00: file src/ImageSubtract.cc, line 253.
std::vector<boost::shared_ptr<lsst::detection::Footprint>,std::allocator<boost::shared_ptr<lsst::detection::Footprint> > > std::vector<boost::shared_ptr<lsst::detection::Footprint>, std::allocator<boost::shared_ptr<lsst::detection::Footprint> > > lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short>(lsst::afw::image::MaskedImage<float, unsigned short> const&, lsst::afw::image::MaskedImage<float, unsigned short> const&, lsst::pex::policy::Policy&);

So try and set my ImageSubtract?.cc break at line 253. Note this is where the close parenthesis is to start the subroutine.

(gdb) break src/ImageSubtract.cc:253
Breakpoint 1 at 0x2b6f520b7670: file src/ImageSubtract.cc, line 253.
(gdb) continue
Continuing.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 47756836362736 (LWP 16568)]
0x00002b6f520bafa6 in lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short> (
    imageToConvolve=@0x1e3fe540, imageToNotConvolve=@0x1e3e90c0, policy=@0xffffffff)
    at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h:270
270                 *__result = *__first;

Still did not work... Interestingly, if I...

(gdb) rbreak getCollectionOfFootprintsForPsfMatching
(gdb) continue
Breakpoint 1, _wrap_getCollectionOfFootprintsForPsfMatching (self=0x0, args=0x127c1e60)
(gdb) continue
Breakpoint 2, _wrap_getCollectionOfFootprintsForPsfMatching__SWIG_1 (args=0x127c1e60)
(gdb) continue
Breakpoint 5, lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short> (

I can get there! Nice...

By the way, the problem probably has to do with optimization mangling things. One way to get the *exact* function name is

nm lib/libip_diffim.so  | grep getCollectionOfFootprintsForPsfMatching
0000000000033670 W _ZN4lsst2ip6diffim39getCollectionOfFootprintsForPsfMatchingIdtEESt6vectorIN5boost10shared_ptrINS_9detection9FootprintEEESaIS8_EERKNS_3afw5image11MaskedImageIT_T0_EESI_RNS_3pex6policy6PolicyE
# and then
(gdb) break _ZN4lsst2ip6diffim39getCollectionOfFootprintsForPsfMatchingIdtEESt6vectorIN5boost10shared_ptrINS_9detection9FootprintEEESaIS8_EERKNS_3afw5image11MaskedImageIT_T0_EESI_RNS_3pex6policy6PolicyE 
Breakpoint 1 at 0x2aac61613670: file src/ImageSubtract.cc, line 253.

DANG this did not work either... It looks like due to optimization a bunch of things got inlined into the function().

Note in passing : apparently I should not have grabbed the ...PsfMatchingIdtEESt... version but instead the ...PsfMatchingIftEESt... version and this would have worked. Or Serge will eat his shorts. But I have now gotten around the bug and don't want to go through this again. Note for next time.

Anyways, once I get in there with rbreak...

(gdb) next

# at this stage, just keep hitting <ENTER> and it will keep going next.  do this until its clear you're in a loop...

888           void increment() { ++i1; ++i2; }
270                 *__result = *__first;
269               for (; __first != __last; ++__result, ++__first)
886           bool equal( iterator const& iter ) const { return (i1==iter.i1) && (i2==iter.i2); }
270                 *__result = *__first;
888           void increment() { ++i1; ++i2; }

# above was the loop (look familiar!?; *__result = *__first;)

(gdb) print /x $pc 
$1 = 0x2ac62e82cfa8
(gdb) break *0x2ac62e82cfa8
Breakpoint 6 at 0x2ac62e82cfa8: file /lsst/becker/lsst_devel/DMS/vw/vw-1.0.1+3a/include/vw/Math/Vector.h, line 888.
(gdb) continue

# YES, this gets me out of the loop!  or at least that iteration.  continue N tries to get out of N-1 iterations.

(gdb) continue 10000
Program received signal SIGSEGV, Segmentation fault.
0x00002ac62e82cfa6 in lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short> (
    imageToConvolve=@0x5178540, imageToNotConvolve=@0x51630c0, policy=@0xffffffff)
    at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h:270
270                 *__result = *__first;

Note that the instruction where it dies (0x00002ac62e82cfa6) is right around the loop I was stuck in (0x2ac62e82cfa8). Time to get down and dirty...


Step 4 : Look at Assembly Code

To get assembly for the function you're currently in inside gdb, type

disassemble

Now, the segfault output is

0x00002ac62e82cfa6 in lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short> (
    imageToConvolve=@0x5178540, imageToNotConvolve=@0x51630c0, policy=@0xffffffff)
    at /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h:270
270                 *__result = *__first;

The first number, 0x00002ac62e82cfa6, is the address of the instruction that caused the seg fault. You can access this using gdb's architecture independent instruction pointer register, pc (for program counter). Also this is $rip, the instruction pointer register in x86-64 architectures.

(gdb) print /x $pc
$2 = 0x2ac62e82cfa6

Additional ways to get info

(gdb) info frame
Stack level 0, frame at 0x7fff8d4eff90:
 rip = 0x2ac62e82cfa6
    in std::vector<boost::shared_ptr<lsst::detection::Footprint>, std::allocator<boost::shared_ptr<lsst::detection::Footprint> > > lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short>(lsst::afw::image::MaskedImage<float, unsigned short> const&, lsst::afw::image::MaskedImage<float, unsigned short> const&, lsst::pex::policy::Policy&)
    (/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h:270); 
    saved rip 0x201d2cfb2f399b3e
 called by frame at 0x7fff8d4eff98
 source language c++.
 Arglist at 0x7fff8d4efb48, args: imageToConvolve=@0x5178540, imageToNotConvolve=@0x51630c0, policy=@0xffffffff
 Locals at 0x7fff8d4efb48, Previous frame's sp is 0x7fff8d4eff90
 Saved registers:
  rbx at 0x7fff8d4eff58, rbp at 0x7fff8d4eff60, r12 at 0x7fff8d4eff68, r13 at 0x7fff8d4eff70, r14 at 0x7fff8d4eff78,
  r15 at 0x7fff8d4eff80, rip at 0x7fff8d4eff88

And

(gdb) info registers 
rax            0x1d6    470
rbx            0x7fff8d4eff00   140735564152576
rcx            0x7fff8d4efc80   140735564151936
rdx            0x7fff8d4f5000   140735564173312
rsi            0x7fff8d4f4ff0   140735564173296
rdi            0x7fff8d4efe38   140735564152376
rbp            0xffffffff       0xffffffff
rsp            0x7fff8d4efb50   0x7fff8d4efb50
r8             0x7fff8d4efe40   140735564152384
r9             0x7fff8d4efca0   140735564151968
r10            0x3      3
r11            0x36cfd4ea10     235415071248
r12            0x0      0
r13            0x0      0
r14            0x5352590        87369104
r15            0x5191d20        85531936
rip            0x2ac62e82cfa6   0x2ac62e82cfa6 <std::vector<boost::shared_ptr<lsst::detection::Footprint>, std::allocator<boost::shared_ptr<lsst::detection::Footprint> > > lsst::ip::diffim::getCollectionOfFootprintsForPsfMatching<float, unsigned short>(lsst::afw::image::MaskedImage<float, unsigned short> const&, lsst::afw::image::MaskedImage<float, unsigned short> const&, lsst::pex::policy::Policy&)+1446>
eflags         0x10283  [ CF SF IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0

Notice $rip is 0x2ac62e82cfa6 as we expect. Anyways, we have found the instruction where it all goes bad! It has something to do with vw/Math/Vector.h. I want to be really good; get back in the loop and wait until you get to line 270 in the loop and do the print /x $pc to see where in the code I am (that is, what file I am on line 270 of).

270                 *__result = *__first;
(gdb) print /x $pc 
$1 = 0x2b5e2a72bfa4
(gdb) break *0x2b5e2a72bfa4
Breakpoint 6 at 0x2b5e2a72bfa4: file /usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../include/c++/4.1.2/bits/stl_algobase.h, line 270.

Doesn't totally help. In fact, the other context does help much more...

My disassemble says...

0x00002ac62e82cf9a <+1434>:  jmp    0x2ac62e82cfb7 <+1463>
0x00002ac62e82cf9c <+1436>:  nopl   0x0(%rax)
0x00002ac62e82cfa0 <+1440>:  mov    0x8(%rcx),%rdx
0x00002ac62e82cfa4 <+1444>:  mov    (%rdi),%eax
0x00002ac62e82cfa6 <+1446>:  add    (%rdx),%eax
0x00002ac62e82cfa8 <+1448>:  addq   $0x4,(%rcx)
0x00002ac62e82cfac <+1452>:  addq   $0x4,0x8(%rcx)
0x00002ac62e82cfb1 <+1457>:  mov    %eax,(%rsi)
0x00002ac62e82cfb3 <+1459>:  add    $0x4,%rsi
0x00002ac62e82cfb7 <+1463>:  cmp    %r8,%rdi
0x00002ac62e82cfba <+1466>:  jne    0x2ac62e82cfa0 <+1440>
0x00002ac62e82cfbc <+1468>:  mov    0x8(%rcx),%rdx
0x00002ac62e82cfc0 <+1472>:  cmp    0x8(%r9),%rdx
0x00002ac62e82cfc4 <+1476>:  jne    0x2ac62e82cfa4 <+1444>
0x00002ac62e82cfc6 <+1478>:  mov    0x3c0(%rsp),%eax
0x00002ac62e82cfcd <+1485>:  cmp    0x2e8(%rsp),%eax
0x00002ac62e82cfd4 <+1492>:  jle    0x2ac62e82cfdd <+1501>
0x00002ac62e82cfd6 <+1494>:  mov    %eax,0x2e8(%rsp)
0x00002ac62e82cfdd <+1501>:  cmp    %eax,0x2e0(%rsp)

Says Serge :

Execution flows from the top, hits the unconditional jump at +1434, and ends up at +1463, at which point the contents of %r8 and %rdi are compared. 
If they are not equal, a jump backwards to +1440 is performed. Notice that from +1440 to +1466 there are no other jumps and that %rdi/%r8 are never 
written to. What this means is that if %r8/%rdi ever compare as not equal, they will always compare as not equal. In your example, it turns out that 
%r8 and %rdi are in fact not equal, so we enter a loop with no way out other than a memory access violation. Taking a look at the inner loop details:

<+1440>:   mov    0x8(%rcx),%rdx <----------o
<+1444>:   mov    (%rdi),%eax               |
<+1446>:   add    (%rdx),%eax <-------------|-- SEGFAULT
<+1448>:   addq   $0x4,(%rcx)               |
<+1452>:   addq   $0x4,0x8(%rcx)            |
<+1457>:   mov    %eax,(%rsi)               |
<+1459>:   add    $0x4,%rsi                 |
<+1463>:   cmp    %r8,%rdi                  |
<+1466>:   jne    0x2ac62e82cfa0 <+1440> ---o

The loop repeatedly computes a value for %eax (instructions at +1444,+1446), stores it at the address contained in %rsi (instruction +1457), then 
increments %rsi (instruction +1459). Instructions +1448 and +1452 look like pointer increments, though only the pointer stored at 0x8(%rcx) is used: 
it's loaded into %rdx and its content is added to %eax. Eventually, a pointer with an invalid address is generated and the seg-fault is triggered.

Depending on which instruction a breakpoint is set on, the debugger claims that
 - we are inside std::copy (invoked by vw::math::Vector<int, 2> operator=).
 - we are inside various parts of the vw Vector classes.

So it looks like the optimizer tried to merge the implementation details of the following lines from getCollectionOfFootprintsForPsfMatching() in 
ImageSubtract.cc:

306             vw::BBox2i footprintBBox = (*i)->getBBox();
307             footprintBBox.grow(footprintBBox.max() + vw::Vector2i(footprintDiffimGrow,footprintDiffimGrow));

and generated what is clearly garbage. 

Another way to get the assembly level stuff

/usr/bin/objdump -d src/ImageSubtract.os  > ImageSubtract.S

And if you can read this you can fix any bugs. THE END.