
What version of MPI and gcc should I use to compile? #6

Open · GodloveD opened this issue May 10, 2017 · 42 comments

@GodloveD

Can you please provide some guidance on what version of MPI / gcc you've used to compile and run bison?

@dpryan79
Owner

It shouldn't matter much. I've generally used OpenMPI with gcc 5.3 or 5.4. The Makefile will change a bit depending on the MPI flavor (OpenMPI, MPICH2, etc.), but that should be it.
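
For what it's worth, overriding CC on the make command line is usually the only build-level change needed when switching MPI stacks, since both OpenMPI and MPICH ship an mpicc wrapper. A minimal sketch, assuming the wrappers are on your PATH:

# Build bison (and optionally bison_herd and the auxiliary tools) against
# whichever MPI stack provides the mpicc wrapper on this machine.
make CC=mpicc
make herd CC=mpicc
make auxiliary CC=mpicc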

@GodloveD
Author

Great. Can you give me the version number of openMPI too please? I'm seeing some unexpected behavior and I want to rule out differences in the build environment.

@dpryan79
Owner

It looks like I most recently built it against OpenMPI 1.10.2.
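
If you want to compare environments, the versions in play can be checked with something like:

# Show the MPI runtime and compiler versions in the current environment
mpiexec --version
mpicc --version     # reports the underlying compiler used by the MPI wrappers
gcc --version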

@GodloveD
Author

Thanks! I'll give it a go.

@dpryan79
Owner

Feel free to post error messages if you run into any problems!

@GodloveD
Author

That didn't seem to help. Next question. Is there a specific version of bowtie2 I should be using?

@dpryan79
Owner

I would try the most recent version, though really anything that's come out in the past few years should work. That's not needed for the installation, so if you're getting some sort of error at that point then that's not the cause. Just make sure some version of bowtie2 is in the PATH.
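
A quick sanity check that the aligner is visible to the job environment:

# Confirm bowtie2 is on the PATH and note which version will be used
which bowtie2
bowtie2 --version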

@GodloveD
Author

OK. I'm trying to run this on a Slurm cluster and I'm basically getting a different error every time I submit. I'm using data from your 2014 tutorial. Here is my script:

#!/bin/bash
echo ======
echo $HOSTNAME
echo ======
echo
module load bowtie/2-2.2.9 openmpi/1.10.3/gcc-5.3.0
cd /data/godlovedc/bison_test/bison_tutorial
mpiexec -n 3 /data/godlovedc/bison/62bf61f7/bison --directional -g genomes/E.Coli/ -1 reads/100_1.fq.gz -2 reads/100_2.fq.gz

Here is my submission command:

$ sbatch --ntasks=3 --ntasks-per-core=1 bisonjob

And here are the output files from 10 different runs (with ~10 different errors). Can you see something obvious I'm doing wrong?

======
cn3627
======

[-] Unloading openmpi 1.10.3 for GCC 5.3.0
[+] Loading openmpi 1.10.3 for GCC 5.3.0
cn3627 has rank 0
cn3627 has rank 1
cn3627 has rank 2
Allocating space for 3000000000 characters
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
fatal flex scanner internal error--end of buffer missed
[cn3627:27630] *** Process received signal ***
[cn3627:27630] Signal: Segmentation fault (11)
[cn3627:27630] Signal code: Invalid permissions (2)
[cn3627:27630] Failing at address: 0x2aaaac000000
[cn3627:27630] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x2aaaaaf617e0]
[cn3627:27630] [ 1] /lib64/libc.so.6(+0x79279)[0x2aaaab954279]
[cn3627:27630] [ 2] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-pal.so.13(opal_libevent2021_event_base_loop+0x7d9)[0x2aaaac1e18b9]
[cn3627:27630] [ 3] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-rte.so.12(+0x585fe)[0x2aaaabed95fe]
[cn3627:27630] [ 4] /lib64/libpthread.so.0(+0x7aa1)[0x2aaaaaf59aa1]
[cn3627:27630] [ 5] /lib64/libc.so.6(clone+0x6d)[0x2aaaab9c3aad]
[cn3627:27630] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 27630 on node cn3627 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
======
cn1816
======

[-] Unloading openmpi 1.10.3 for GCC 5.3.0
[+] Loading openmpi 1.10.3 for GCC 5.3.0
cn1816 has rank 0
cn1816 has rank 1
Allocating space for 3000000000 characters
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
[cn1816:17079] *** Process received signal ***
[cn1816:17079] Signal: Segmentation fault (11)
[cn1816:17079] Signal code: Address not mapped (1)
[cn1816:17079] Failing at address: 0x2aab92ede952
cn1816 has rank 2
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 17079 on node cn1816 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
======
cn1980
======

[-] Unloading openmpi 1.10.3 for GCC 5.3.0
[+] Loading openmpi 1.10.3 for GCC 5.3.0
cn1980 has rank 0
cn1980 has rank 1
Allocating space for 3000000000 characters
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
cn1980 has rank 2
[cn1980:17108] *** Process received signal ***
[cn1980:17108] Signal: Segmentation fault (11)
[cn1980:17108] Signal code: Address not mapped (1)
[cn1980:17108] Failing at address: 0x3c
[cn1980:17108] [ 0] --------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 17108 on node cn1980 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
======
cn1804
======

[-] Unloading openmpi 1.10.3 for GCC 5.3.0
[+] Loading openmpi 1.10.3 for GCC 5.3.0
Allocating space for 3000000000 characters
cn1804 has rank 2
cn1804 has rank 0
cn1804 has rank 1
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
[cn1804:49992] *** Process received signal ***
[cn1804:49992] Signal: Segmentation fault (11)
[cn1804:49992] Signal code: Address not mapped (1)
[cn1804:49992] Failing at address: 0x3c
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          cn1804 (PID 49992)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
[cn1804:49992] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x2aaaaaf617e0]
[cn1804:49992] [ 1] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-pal.so.13(opal_show_help_yylex+0x38a)[0x2aaaac1d38ca]
[cn1804:49992] [ 2] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-pal.so.13(opal_show_help_vstring+0x198)[0x2aaaac1d2c68]
[cn1804:49992] [ 3] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-rte.so.12(orte_show_help+0xb4)[0x2aaaabeb3494]
[cn1804:49992] [ 4] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libmpi.so.12(+0x7a0b2)[0x2aaaab1e90b2]
[cn1804:49992] [ 5] /lib64/libc.so.6(__libc_fork+0x55)[0x2aaaab987c65]
[cn1804:49992] [ 6] /lib64/libc.so.6(_IO_proc_open+0x137)[0x2aaaab9435b7]
[cn1804:49992] [ 7] /lib64/libc.so.6(popen+0x69)[0x2aaaab9438a9]
[cn1804:49992] [ 8] /data/godlovedc/bison/62bf61f7/bison[0x407080]
[cn1804:49992] [ 9] /lib64/libpthread.so.0(+0x7aa1)[0x2aaaaaf59aa1]
[cn1804:49992] [10] /lib64/libc.so.6(clone+0x6d)[0x2aaaab9c3aad]
[cn1804:49992] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 49992 on node cn1804 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
======
cn1805
======

[-] Unloading openmpi 1.10.3 for GCC 5.3.0
[+] Loading openmpi 1.10.3 for GCC 5.3.0
cn1805 has rank 0
cn1805 has rank 1
cn1805 has rank 2
Allocating space for 3000000000 characters
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
[cn1805:43889] *** Process received signal ***
[cn1805:43889] Signal: Segmentation fault (11)
[cn1805:43889] Signal code: Address not mapped (1)
[cn1805:43889] Failing at address: 0x2aab92ede952
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 43889 on node cn1805 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
======
cn1850
======

[-] Unloading openmpi 1.10.3 for GCC 5.3.0
[+] Loading openmpi 1.10.3 for GCC 5.3.0
cn1850 has rank 0
cn1850 has rank 1
cn1850 has rank 2
Allocating space for 3000000000 characters
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
fatal flex scanner internal error--end of buffer missed
[cn1850:31304] *** Process received signal ***
[cn1850:31304] Signal: Segmentation fault (11)
[cn1850:31304] Signal code: Address not mapped (1)
[cn1850:31304] Failing at address: 0x815000
[cn1850:31304] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x2aaaaaf617e0]
[cn1850:31304] [ 1] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-pal.so.13(opal_show_help_yylex+0x953)[0x2aaaac1d3e93]
[cn1850:31304] [ 2] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-pal.so.13(opal_show_help_vstring+0x198)[0x2aaaac1d2c68]
[cn1850:31304] [ 3] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-rte.so.12(orte_show_help+0xb4)[0x2aaaabeb3494]
[cn1850:31304] [ 4] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libmpi.so.12(+0x7a0b2)[0x2aaaab1e90b2]
[cn1850:31304] [ 5] /lib64/libc.so.6(__libc_fork+0x55)[0x2aaaab987c65]
[cn1850:31304] [ 6] /lib64/libc.so.6(_IO_proc_open+0x137)[0x2aaaab9435b7]
[cn1850:31304] [ 7] /lib64/libc.so.6(popen+0x69)[0x2aaaab9438a9]
[cn1850:31304] [ 8] /data/godlovedc/bison/62bf61f7/bison[0x407574]
[cn1850:31304] [ 9] /lib64/libpthread.so.0(+0x7aa1)[0x2aaaaaf59aa1]
[cn1850:31304] [10] /lib64/libc.so.6(clone+0x6d)[0x2aaaab9c3aad]
[cn1850:31304] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 31304 on node cn1850 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
======
cn2131
======

[-] Unloading openmpi 1.10.3 for GCC 5.3.0
[+] Loading openmpi 1.10.3 for GCC 5.3.0
Allocating space for 3000000000 characters
cn2131 has rank 1
cn2131 has rank 0
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
[cn2131:28366] *** Process received signal ***
[cn2131:28366] Signal: Segmentation fault (11)
[cn2131:28366] Signal code: Address not mapped (1)
[cn2131:28366] Failing at address: 0x3c
cn2131 has rank 2
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 28366 on node cn2131 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
======
cn2148
======

[-] Unloading openmpi 1.10.3 for GCC 5.3.0
[+] Loading openmpi 1.10.3 for GCC 5.3.0
cn2148 has rank 0
cn2148 has rank 1
cn2148 has rank 2
Allocating space for 3000000000 characters
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
[cn2148:16707] *** Process received signal ***
[cn2148:16707] Signal: Segmentation fault (11)
[cn2148:16707] Signal code: Address not mapped (1)
[cn2148:16707] Failing at address: 0x2aaa06e36b20
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 16707 on node cn2148 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
======
cn2150
======

[-] Unloading openmpi 1.10.3 for GCC 5.3.0
[+] Loading openmpi 1.10.3 for GCC 5.3.0
cn2150 has rank 0
cn2150 has rank 2
Allocating space for 3000000000 characters
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
[cn2150:18509] *** Process received signal ***
[cn2150:18509] Signal: Segmentation fault (11)
[cn2150:18509] Signal code: Address not mapped (1)
[cn2150:18509] Failing at address: 0x2aab92ede952
cn2150 has rank 1
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 18509 on node cn2150 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
======
cn2184
======

[-] Unloading openmpi 1.10.3 for GCC 5.3.0
[+] Loading openmpi 1.10.3 for GCC 5.3.0
cn2184 has rank 0
cn2184 has rank 1
cn2184 has rank 2
Allocating space for 3000000000 characters
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
[cn2184:28029] *** Process received signal ***
[cn2184:28029] Signal: Segmentation fault (11)
[cn2184:28029] Signal code: Invalid permissions (2)
[cn2184:28029] Failing at address: 0x2aaaad621008
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    mpi_init:warn-fork
from the file:
    help-mpi-runtime.txt
But I couldn't find that topic in the file.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    mpi_init:warn-fork
from the file:
    help-mpi-runtime.txt
But I couldn't find that topic in the file.  Sorry!
--------------------------------------------------------------------------

@dpryan79
Owner

That's a new one. What happens if you set -p 1? If that doesn't help then I'll have to have a look when I get into the office tomorrow.
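
That is, the same mpiexec line as above with bison's -p option added:

mpiexec -n 3 /data/godlovedc/bison/62bf61f7/bison --directional -p 1 -g genomes/E.Coli/ -1 reads/100_1.fq.gz -2 reads/100_2.fq.gz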

@GodloveD
Author

Thanks for the suggestion. I ran 10 more jobs with the -p 1 option and here is the output. Still lots of errors and segfaults.

======
cn3119
======

[+] Loading openmpi 1.10.3 for GCC 5.3.0
cn3119 has rank 0
cn3119 has rank 1
cn3119 has rank 2
Allocating space for 3000000000 characters
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
An MPI process has executed an operation involving a call to the
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
data corruption.  The use of fork() (or system() or other calls that
data corruption.  The use of fork() (or system() or other calls that
The process that invoked fork was:
reate child processes) is strongly discouraged.
  Local host:          cn3119 (PID 19407)
and correctly survive a call to fork(), you may disable this warning

of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

  Reason:     (null)
  Local host: ▒▒M▒▒*
  PID:        8102928
--------------------------------------------------------------------------
[cn3119:19407] *** Process received signal ***
[cn3119:19407] Signal: Segmentation fault (11)
[cn3119:19407] Signal code: Invalid permissions (2)
[cn3119:19407] Failing at address: 0x2aaaac000000
[cn3119:19407] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x2aaaaaf617e0]
[cn3119:19407] [ 1] /lib64/libc.so.6(+0x79279)[0x2aaaab954279]
[cn3119:19407] [ 2] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-pal.so.13(opal_libevent2021_event_base_loop+0x7d9)[0x2aaaac1e18b9]
[cn3119:19407] [ 3] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-rte.so.12(+0x585fe)[0x2aaaabed95fe]
[cn3119:19407] [ 4] /lib64/libpthread.so.0(+0x7aa1)[0x2aaaaaf59aa1]
[cn3119:19407] [ 5] /lib64/libc.so.6(clone+0x6d)[0x2aaaab9c3aad]
[cn3119:19407] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 19407 on node cn3119 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
======
cn3295
======

[+] Loading openmpi 1.10.3 for GCC 5.3.0
cn3295 has rank 0
cn3295 has rank 1
cn3295 has rank 2
Allocating space for 3000000000 characters
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
[cn3295:28695] *** Process received signal ***
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[cn3295:28695] Signal: Segmentation fault (11)
[cn3295:28695] Signal code: Address not mapped (1)
[cn3295:28695] Failing at address: 0x2c
[cn3295:28695] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x2aaaaaf617e0]
[cn3295:28695] [ 1] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-pal.so.13(opal_show_help_yylex+0x55b)[0x2aaaac1d3a9b]
[cn3295:28695] [ 2] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-pal.so.13(opal_show_help_vstring+0x221)[0x2aaaac1d2cf1]
[cn3295:28695] [ 3] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-rte.so.12(orte_show_help+0xb4)[0x2aaaabeb3494]
[cn3295:28695] [ 4] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libmpi.so.12(+0x7a0b2)[0x2aaaab1e90b2]
[cn3295:28695] [ 5] /lib64/libc.so.6(__libc_fork+0x55)[0x2aaaab987c65]
[cn3295:28695] [ 6] /lib64/libc.so.6(_IO_proc_open+0x137)[0x2aaaab9435b7]
[cn3295:28695] [ 7] /lib64/libc.so.6(popen+0x69)[0x2aaaab9438a9]
[cn3295:28695] [ 8] /data/godlovedc/bison/62bf61f7/bison[0x407080]
[cn3295:28695] [ 9] /lib64/libpthread.so.0(+0x7aa1)[0x2aaaaaf59aa1]
[cn3295:28695] [10] /lib64/libc.so.6(clone+0x6d)[0x2aaaab9c3aad]
[cn3295:28695] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 28695 on node cn3295 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
======
cn3365
======

[+] Loading openmpi 1.10.3 for GCC 5.3.0
cn3365 has rank 0
Allocating space for 3000000000 characters
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
cn3365 has rank 1
cn3365 has rank 2
[cn3365:22321] *** Process received signal ***
[cn3365:22321] Signal: Segmentation fault (11)
[cn3365:22321] Signal code: Address not mapped (1)
[cn3365:22321] Failing at address: 0x3c
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 22321 on node cn3365 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
======
cn3468
======

[+] Loading openmpi 1.10.3 for GCC 5.3.0
cn3468 has rank 0
cn3468 has rank 2
Allocating space for 3000000000 characters
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
cn3468 has rank 1
[cn3468:35400] *** Process received signal ***
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          cn3468 (PID 35400)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
[cn3468:35400] Signal: Segmentation fault (11)
[cn3468:35400] Signal code: Address not mapped (1)
[cn3468:35400] Failing at address: 0x2aab74ede55e
[cn3468:35400] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x2aaaaaf617e0]
[cn3468:35400] [ 1] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-pal.so.13(opal_show_help_yylex+0x1c1)[0x2aaaac1d3701]
[cn3468:35400] [ 2] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-pal.so.13(opal_show_help_vstring+0x198)[0x2aaaac1d2c68]
[cn3468:35400] [ 3] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-rte.so.12(orte_show_help+0xb4)[0x2aaaabeb3494]
[cn3468:35400] [ 4] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libmpi.so.12(+0x7a0b2)[0x2aaaab1e90b2]
[cn3468:35400] [ 5] /lib64/libc.so.6(__libc_fork+0x55)[0x2aaaab987c65]
[cn3468:35400] [ 6] /lib64/libc.so.6(_IO_proc_open+0x137)[0x2aaaab9435b7]
[cn3468:35400] [ 7] /lib64/libc.so.6(popen+0x69)[0x2aaaab9438a9]
[cn3468:35400] [ 8] /data/godlovedc/bison/62bf61f7/bison[0x407080]
[cn3468:35400] [ 9] /lib64/libpthread.so.0(+0x7aa1)[0x2aaaaaf59aa1]
[cn3468:35400] [10] /lib64/libc.so.6(clone+0x6d)[0x2aaaab9c3aad]
[cn3468:35400] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 35400 on node cn3468 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
======
cn3514
======

[+] Loading openmpi 1.10.3 for GCC 5.3.0
cn3514 has rank 0
cn3514 has rank 2
cn3514 has rank 1
Allocating space for 3000000000 characters
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
fatal flex scanner internal error--end of buffer missed
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    mpi_init:warn-fork
from the file:
    help-mpi-runtime.txt
But I couldn't find that topic in the file.  Sorry!
--------------------------------------------------------------------------
reads/100_1.fq.gz contained 100000 reads
======
cn3528
======

[+] Loading openmpi 1.10.3 for GCC 5.3.0
cn3528 has rank 0
Allocating space for 3000000000 characters
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
cn3528 has rank 1
cn3528 has rank 2
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    mpi_init:warn-fork
from the file:
    help-mpi-runtime.txt
But I couldn't find that topic in the file.  Sorry!
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    mpi_init:warn-fork
from the file:
    help-mpi-runtime.txt
But I couldn't find that topic in the file.  Sorry!
--------------------------------------------------------------------------
reads/100_2.fq.gz contained 100000 reads
reads/100_1.fq.gz contained 100000 reads
Reading in genomes/E.Coli/Escherichia_coli.GCA_000597845.1.23.dna.genome.fa
Finished genomes/E.Coli/Escherichia_coli.GCA_000597845.1.23.dna.genome.fa
Alignment metrics will be printed to reads/100_1.txt
Sending start to node 1
Sending start to node 2
Node 1 executing: bowtie2 -q --reorder  -p 1 --score-min 'L,-0.6,-0.6' --norc -x genomes/E.Coli/bisulfite_genome/CT_conversion/BS_CT -1 reads/100_1.CT.fq.gz -2 reads/100_2.GA.fq.gz
Node 2 executing: bowtie2 -q --reorder  -p 1 --score-min 'L,-0.6,-0.6' --nofw -x genomes/E.Coli/bisulfite_genome/GA_conversion/BS_GA -1 reads/100_1.CT.fq.gz -2 reads/100_2.GA.fq.gz
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          cn3528 (PID 6724)
  MPI_COMM_WORLD rank: 2

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
Node 2 began sending reads @Thu May 11 08:54:55 2017
Node 1 began sending reads @Thu May 11 08:54:55 2017
Started slurping @Thu May 11 08:54:55 2017
[cn3528:06720] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:warn-fork
[cn3528:06720] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
100000 reads; of these:
  100000 (100.00%) were paired; of these:
    53464 (53.46%) aligned concordantly 0 times
    44030 (44.03%) aligned concordantly exactly 1 time
    2506 (2.51%) aligned concordantly >1 times
    ----
    53464 pairs aligned concordantly 0 times; of these:
      4550 (8.51%) aligned discordantly 1 time
    ----
    48914 pairs aligned 0 times concordantly or discordantly; of these:
      97828 mates make up the pairs; of these:
        96925 (99.08%) aligned 0 times
        169 (0.17%) aligned exactly 1 time
        734 (0.75%) aligned >1 times
51.54% overall alignment rate
Node 1 finished sending reads @Thu May 11 08:55:09 2017
        (14.000000 sec elapsed)
Exiting worker node 1
Returning from worker node 1
100000 reads; of these:
  100000 (100.00%) were paired; of these:
    53748 (53.75%) aligned concordantly 0 times
    43760 (43.76%) aligned concordantly exactly 1 time
    2492 (2.49%) aligned concordantly >1 times
    ----
    53748 pairs aligned concordantly 0 times; of these:
      4509 (8.39%) aligned discordantly 1 time
    ----
    49239 pairs aligned 0 times concordantly or discordantly; of these:
      98478 mates make up the pairs; of these:
        97582 (99.09%) aligned 0 times
        171 (0.17%) aligned exactly 1 time
        725 (0.74%) aligned >1 times
51.21% overall alignment rate
Node 2 finished sending reads @Thu May 11 08:55:09 2017
        (14.000000 sec elapsed)
Finished slurping @Thu May 11 08:55:09 2017
        (14.000000 seconds elapsed)
Exiting worker node 2
Returning from worker node 2
======
cn3542
======

[+] Loading openmpi 1.10.3 for GCC 5.3.0
Allocating space for 3000000000 characters
cn3542 has rank 0
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
cn3542 has rank 2
cn3542 has rank 1
fatal flex scanner internal error--end of buffer missed
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    mpi_init:warn-fork
from the file:
    help-mpi-runtime.txt
But I couldn't find that topic in the file.  Sorry!
--------------------------------------------------------------------------
reads/100_2.fq.gz contained 100000 reads
======
cn3129
======

[+] Loading openmpi 1.10.3 for GCC 5.3.0
cn3129 has rank 0
cn3129 has rank 1
cn3129 has rank 2
Allocating space for 3000000000 characters
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
[cn3129:19241] *** Process received signal ***
[cn3129:19241] Signal: Segmentation fault (11)
[cn3129:19241] Signal code: Address not mapped (1)
[cn3129:19241] Failing at address: 0x20
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
perating in a condition that could result in memory corruption or
data corruption.  The use of fork() (or system() or other calls that


8F▒▒*
Local host:          cn3129 (PID 19241)
If you are *absolutely sure* that your application will successfully
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
[cn3129:19241] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x2aaaaaf617e0]
[cn3129:19241] [ 1] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-pal.so.13(opal_show_help_yylex+0xc73)[0x2aaaac1d41b3]
[cn3129:19241] [ 2] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-pal.so.13(opal_show_help_vstring+0x198)[0x2aaaac1d2c68]
[cn3129:19241] [ 3] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libopen-rte.so.12(orte_show_help+0xb4)[0x2aaaabeb3494]
[cn3129:19241] [ 4] /usr/local/OpenMPI/1.10.3/gcc-5.3.0/lib/libmpi.so.12(+0x7a0b2)[0x2aaaab1e90b2]
[cn3129:19241] [ 5] /lib64/libc.so.6(__libc_fork+0x55)[0x2aaaab987c65]
[cn3129:19241] [ 6] /lib64/libc.so.6(_IO_proc_open+0x137)[0x2aaaab9435b7]
[cn3129:19241] [ 7] /lib64/libc.so.6(popen+0x69)[0x2aaaab9438a9]
[cn3129:19241] [ 8] /data/godlovedc/bison/62bf61f7/bison[0x407574]
[cn3129:19241] [ 9] /lib64/libpthread.so.0(+0x7aa1)[0x2aaaaaf59aa1]
[cn3129:19241] [10] /lib64/libc.so.6(clone+0x6d)[0x2aaaab9c3aad]
[cn3129:19241] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 19241 on node cn3129 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
======
cn3190
======

[+] Loading openmpi 1.10.3 for GCC 5.3.0
cn3190 has rank 0
cn3190 has rank 1
cn3190 has rank 2
Allocating space for 3000000000 characters
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    mpi_init:warn-fork
from the file:
    help-mpi-runtime.txt
But I couldn't find that topic in the file.  Sorry!
--------------------------------------------------------------------------
fatal flex scanner internal error--end of buffer missed
reads/100_2.fq.gz contained 100000 reads
======
cn3298
======

[+] Loading openmpi 1.10.3 for GCC 5.3.0
cn3298 has rank 0
cn3298 has rank 1
cn3298 has rank 2
Allocating space for 3000000000 characters
Will C->T convert reads/100_1.fq.gz and store the results in reads/100_1.CT.fq.gz.
Will G->A convert reads/100_2.fq.gz and store the results in reads/100_2.GA.fq.gz.
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    mpi_init:warn-fork
from the file:
    help-mpi-runtime.txt
But I couldn't find that topic in the file.  Sorry!
--------------------------------------------------------------------------
fatal flex scanner internal error--end of buffer missed
reads/100_2.fq.gz contained 100000 reads

@dpryan79
Owner

I'll have to see if I can reproduce this locally then. Perhaps I tagged a bad release.

@dpryan79
Owner

On the cluster in my office we're using Slurm with OpenMPI 1.10.2 and gcc 4.8.5. There, the following works:

#!/bin/bash
module load bowtie2 mpi slurm
mpiexec ~/bin/bison --directional -p 20 -g genomes/E.Coli/ -1 reads/100_1.fq.gz -2 reads/100_2.fq.gz

With the command sbatch -N 3 -c 20 -p bioinfo foo.sh I get a proper result. I'll try to get your version of GCC and openMPI installed and see if I can reproduce the issue with that.

@GodloveD
Author

What version of Linux are you using, and what glibc are you up to?

@dpryan79
Owner

Our nodes are running CentOS 7, I think; I don't recall what version of glibc that uses. If push comes to shove, I can try sending you some binaries to see if they work.

@GodloveD
Author

We're using CentOS 6 with glibc version 2.12. I'm guessing this may be the problem. I'm going to try building it in a Singularity container using CentOS 7 and see if that fixes the bug. But it might take a little time because I don't have any experience with MPI in a container yet.

@dpryan79
Owner

Good luck with that. I just noticed that openmpi is in conda-forge, so perhaps I can put this in bioconda...

@GodloveD
Author

I think I have something working but I'd like to do a larger scale test before I declare victory. I'm not an omics person, and your tutorial example runs on a single node. Are you able to provide some test data and sample commands to run a larger scale job on 3 nodes? Any help would be much appreciated.

@patidarr

Hi Dave,

You may use the example command I gave, which includes the location of the test dataset.

@GodloveD
Author

Perfect. Thanks @patidarr!

@GodloveD
Author

I think I'm making progress, but now I have a new error. I copied patidarr's data and then wrote a script based on his like so:

#!/bin/sh
module load bison
cd /scratch/godlovedc/patidarr_bison

mpiexec -n 5 bison \
    -g bison-index \
    -1 Sample_CL0080_T1D_M_HYCKNBGXY_R1_val_1.fq.gz \
    -2 Sample_CL0080_T1D_M_HYCKNBGXY_R2_val_2.fq.gz

Then I submitted the job like so:

sbatch --ntasks=6 --exclusive --time=24:00:00 --ntasks-per-core=1 --mem-per-cpu=20g bison.sh

It hummed along merrily for a while and then produced this error:

[...snip...]
14900000 reads Tue May 16 17:33:28 2017
15000000 reads Tue May 16 17:37:16 2017
15100000 reads Tue May 16 17:41:01 2017
15200000 reads Tue May 16 17:44:59 2017
bison: genome.c:19: get_seq: Assertion `fgets(line, 1024, fp) != ((void *)0)' failed.
[cn2934:26883] *** Process received signal ***
[cn2934:26883] Signal: Aborted (6)
[cn2934:26883] Signal code:  (-6)
[cn2934:26883] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2aaaaafdf370]
[cn2934:26883] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2aaaab7111d7]
[cn2934:26883] [ 2] /lib64/libc.so.6(abort+0x148)[0x2aaaab7128c8]
[cn2934:26883] [ 3] /lib64/libc.so.6(+0x2e146)[0x2aaaab70a146]
[cn2934:26883] [ 4] /lib64/libc.so.6(+0x2e1f2)[0x2aaaab70a1f2]
[cn2934:26883] [ 5] bison[0x407db7]
[cn2934:26883] [ 6] bison[0x40d1b8]
[cn2934:26883] [ 7] /lib64/libpthread.so.0(+0x7dc5)[0x2aaaaafd7dc5]
[cn2934:26883] [ 8] /lib64/libc.so.6(clone+0x6d)[0x2aaaab7d373d]
[cn2934:26883] *** End of error message ***
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 26808 on
node cn2934 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).

You can avoid this message by specifying -quiet on the mpiexec command line.

--------------------------------------------------------------------------

@patidarr

@GodloveD, is your bison index perhaps not initialized? /data/Clinomics/Ref/khanlab/Index/Bison/

@dpryan79
Owner

How many reads are in the fastq files (zcat Sample_CL0080_T1D_M_HYCKNBGXY_R1_val_1.fq.gz | wc -l, divided by 4)? This should only happen if bowtie2 starts sending back more alignments than reads, which I can't say I've ever seen happen. While I presume the BAM file that was output is truncated, one could run samtools view -h foo.bam | samtools sort -n -o foo.resorted.bam - and then see if there are any read pairs with more than the expected number of entries: samtools view foo.resorted.bam | cut -f 1 | uniq -c | awk '{if($1>2) print}'. That shouldn't produce any output...
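
Spelled out as a small script (a sketch based on the commands above; foo.bam stands in for whatever BAM bison wrote):

# Count the reads in the first mate file (FASTQ has 4 lines per read)
zcat Sample_CL0080_T1D_M_HYCKNBGXY_R1_val_1.fq.gz | wc -l | awk '{print $1/4, "reads"}'

# Name-sort the (possibly truncated) output BAM
samtools view -h foo.bam | samtools sort -n -o foo.resorted.bam -

# Print any read name that has more than the expected two entries
samtools view foo.resorted.bam | cut -f 1 | uniq -c | awk '{if($1>2) print}'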

@patidarr

Here is FastQC info on this library.
##FastQC 0.11.2

Basic Statistics pass
#Measure Value
Filename CL0080_T1D_M_HYCKNBGXY_R1.fastq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 72496934
Sequences flagged as poor quality 0
Sequence length 81
%GC 37
END_MODULE


@GodloveD
Author

So it sounds to me like there is some uncertainty about whether this is a bug or whether there is something wrong with the data. Is that correct? @dpryan79 are you able to provide a known-good data set that produces a known output when processed with known commands, so that we can rule out a bug?

@dpryan79
Owner

I'll have to put something larger together. Note that there's generally a bit of randomness in some of the results from aligners. I'll post a link when I've put together something a bit larger.

@dpryan79
Owner

I've uploaded a tarball that contains a reference (GRCh38) and its index, as well as fastq files and the BAM file produced locally. I have additionally included the script I used to submit that with sbatch ("foo.sh"), which you'll need to modify to suit your cluster. In that script I also have a commented-out example of using bison_herd, which is intended for use on many more nodes but isn't always supported by people's local MPI stacks.

I wouldn't do an md5sum to compare results since there's always a bit of randomness in a few of the alignments, but the output text files should be quite similar (and the program shouldn't segfault for any reason).
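
If it helps, a rough comparison can be as simple as diffing the metrics files (file names below are illustrative; use whatever text file bison writes for your run):

# Compare your run's alignment metrics against the copy shipped in the tarball
diff my_run/subset_1.txt tarball_copy/subset_1.txt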

@GodloveD
Author

@dpryan79 Thanks very much for going to this trouble. It is greatly appreciated.

I downloaded and untarred that data and ran it on our cluster. I was able to replicate your slurm-*.out file and subset_1.txt files exactly. So it seems like this installation is working properly. The problem appears to have been the mismatching glibc.

It may be useful to post a link to these data and the commands and output in the documentation so that other users can verify their installation as well. Thanks again for putting this together.

For the record, if any other users run into trouble using bison on an older version of Linux and want to run it from within a Singularity container, here is the definition file that I used to build it.

BootStrap: docker
From: centos:latest

%post
    BUILD_DIR=/tmp

    umask 0002
    yum clean all
    yum -y update
    yum -y install wget git libz.so.1 tbb-devel zlib-devel bzip2-devel
    yum -y groupinstall "Development Tools" # gcc 4.8.5-11
    cd $BUILD_DIR

    # build openmpi
    VER=1.10.3
    wget https://www.open-mpi.org/software/ompi/v1.10/downloads/openmpi-$VER.tar.gz
    tar xvf openmpi-$VER.tar.gz
    cd openmpi-$VER
    ./configure
    make install
    echo /usr/local/lib >> /etc/ld.so.conf
    ldconfig
    cd $BUILD_DIR
    rm openmpi-$VER.tar.gz

    # build bowtie
    VER=2.3.2
    wget https://downloads.sourceforge.net/project/bowtie-bio/bowtie2/$VER/bowtie2-$VER-source.zip
    unzip bowtie2-$VER-source.zip
    cd bowtie2-$VER
    make install
    cd $BUILD_DIR
    rm bowtie2-$VER-source.zip

    # build bison
    DIR=/usr/local/apps
    APP=bison
    VER=0.4.0
    wget https://github.com/dpryan79/bison/archive/$VER.tar.gz
    tar xvf $VER.tar.gz
    cd bison-$VER
    git clone https://github.com/samtools/htslib.git
    cd htslib
    git checkout 9b1cb94
    cd ..
    mkdir -p $DIR/$APP/$VER
    make CC=mpicc
    make herd CC=mpicc
    make auxiliary CC=mpicc
    make install PREFIX=$DIR/$APP/$VER

    # add to PATH
    echo "" >> /environment
    echo "export PATH=$DIR/$APP/$VER:"'$PATH' >> /environment

    # create bind points for NIH HPC environment
    mkdir /gpfs /spin1 /gs2 /gs3 /gs4 /gs5 /gs6 /data /scratch /fdb /lscratch

And here is the wrapper script I use to drive it.

#!/bin/bash
# Bind the cluster's shared filesystems into the container
SINGULARITY_BINDPATH="/gpfs,/gs2,/gs3,/gs4,/gs5,/gs6,/spin1,/data,/scratch,/fdb,/lscratch"
export SINGULARITY_BINDPATH
# The wrapper is invoked via symlinks, so $0 tells us which tool to run
dir=$(dirname  "$0")
cmd=$(basename "$0")
# Quote the command and arguments so paths with spaces survive the hand-off
singularity exec "$dir/bison.img" "$cmd" "$@"

This lets you create links to the wrapper script, which essentially become executables that run within the container, like so (a one-pass way to create them is sketched after the list):

bedGraph2BSseq.py -> bison.sh
bedGraph2MOABS -> bison.sh
bedGraph2MethylSeekR.py -> bison.sh
bedGraph2methylKit -> bison.sh
bison -> bison.sh
bison_CpG_coverage -> bison.sh
bison_herd -> bison.sh
bison_index -> bison.sh
bison_markduplicates -> bison.sh
bison_mbias -> bison.sh
bison_mbias2pdf -> bison.sh
bison_merge_CpGs -> bison.sh
bison_methylation_extractor -> bison.sh
make_reduced_genome -> bison.sh
merge_bedGraphs.py -> bison.sh
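
The links themselves can be created in one pass, e.g. (assuming the wrapper is installed as bison.sh in a directory on your PATH):

# Create one symlink per tool, each pointing at the wrapper script;
# the wrapper runs whatever name it was invoked as inside the container.
cd /usr/local/bin   # illustrative: wherever bison.sh lives
for tool in bedGraph2BSseq.py bedGraph2MOABS bedGraph2MethylSeekR.py bedGraph2methylKit \
            bison bison_CpG_coverage bison_herd bison_index bison_markduplicates \
            bison_mbias bison_mbias2pdf bison_merge_CpGs bison_methylation_extractor \
            make_reduced_genome merge_bedGraphs.py; do
    ln -s bison.sh "$tool"
done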

Thanks again! Feel free to close!

@tantrev

tantrev commented May 26, 2017

@dpryan79 - Is there any chance your bioconda version of bison might be pushed soon? I've been trying to compile bison locally and have been running into a nightmare of errors trying to compile htslib with all of its dependencies (especially on libcurl, libbz2, and liblzma).

@dpryan79
Owner

@tantrev: That's been on the back burner. I'll see if I have time next week. I don't think the version of htslib that comes with bison needs libcurl or libbz2, though I'd be surprised if those weren't available on your system (they're pretty common).

@tantrev

tantrev commented May 31, 2017

@dpryan79 - thank you for the tip on just using the built-in htslib! I was stupidly just following the tutorial blindly without thinking about it (in trying to use the latest htslib).

I've gotten everything to compile just fine, but I keep getting errors when trying to run bison. Here's the stdout and stderr for this slurm submission script. Also, here's the top of the Makefile I used.

Is there anything obviously wrong that I'm doing? Thanks again for your help.

@GodloveD
Author

@tantrev what is your OS?

@tantrev

tantrev commented May 31, 2017

When cat /etc/*-release is executed on the login node and the worker nodes that execute the script, it returns:

CentOS Linux release 7.2.1511 (Core)
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

CentOS Linux release 7.2.1511 (Core)
CentOS Linux release 7.2.1511 (Core)

@dpryan79
Owner

dpryan79 commented Jun 1, 2017

@tantrev For some reason your system is segfaulting at a popen() call, which makes absolutely no sense. Can you run this in gdb on the head node? It'd be nice to see what the output of the gdb bt command is after the crash.
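
One way to do that without fighting the MPI launcher is to run every rank under gdb in batch mode, so each prints a backtrace when it dies. A sketch (the rank count and bison arguments are whatever you normally use):

# Each rank runs inside its own gdb; --batch makes gdb run the program and
# dump a backtrace ("bt") automatically after the crash.
mpiexec -n 3 gdb --batch -ex run -ex bt --args \
    bison --directional -g <index_dir> -1 <reads_1.fq.gz> -2 <reads_2.fq.gz>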

BTW, I have a bioconda recipe that at least partially works. I just have to get the testing to work on the TravisCI cluster on both OSes.

@tantrev

tantrev commented Jun 4, 2017

@dpryan79 So I had to recompile bison against Intel MPI in order to get gdb to work (I couldn't figure out how to use it with OpenMPI), but here's the bt command output.

The stdout and stderr of the regular non-gdb command also slightly changed.

@GodloveD I tried building your Singularity definition file on my Mac using Vagrant, then uploading it to CHPC, but when I try executing bison with this command

singularity exec bison.img mpiexec -n 5 bison -upto 10000 -g data/Mus_musculus/C57BL_6J/bowtie2/bisulfite_genome -1 oldbox/GSE47819_wgbs/fasta/y12_1.test.fq.gz -2 oldbox/GSE47819_wgbs/fasta/y12_2.test.fq.gz

on the head node, it complains:

bison: symbol lookup error: bison: undefined symbol: ompi_mpi_int
bison: symbol lookup error: bison: undefined symbol: ompi_mpi_int
bison: symbol lookup error: bison: undefined symbol: ompi_mpi_int
bison: symbol lookup error: bison: undefined symbol: ompi_mpi_int
bison: symbol lookup error: bison: undefined symbol: ompi_mpi_int

@GodloveD
Author

GodloveD commented Jun 4, 2017

@tantrev It's kind of hard to find documentation on this, but you are actually supposed to call mpirun outside of your container. For instance, I believe the proper way to execute the command you are trying to run would be something like this:

mpiexec -n 5 singularity exec bison.img bison -upto 10000 -g data/Mus_musculus/C57BL_6J/bowtie2/bisulfite_genome -1 oldbox/GSE47819_wgbs/fasta/y12_1.test.fq.gz -2 oldbox/GSE47819_wgbs/fasta/y12_2.test.fq.gz

If you want, have a look at the wrapper script and links I described above. If you set those things up properly you can dispense with all of the singularity exec business and just pretend that the bison command is installed directly on your system like this:

mpiexec -n 5 bison -upto 10000 -g data/Mus_musculus/C57BL_6J/bowtie2/bisulfite_genome -1 oldbox/GSE47819_wgbs/fasta/y12_1.test.fq.gz -2 oldbox/GSE47819_wgbs/fasta/y12_2.test.fq.gz

@dpryan79
Owner

dpryan79 commented Jun 4, 2017

@tantrev: The stdout and stderr suggest you have a couple things going on. Firstly, you're running bison on more than 5 nodes (try bison_herd instead, presuming you ran make herd). Secondly, the bowtie2 index isn't where you specified. It looks like you have an extra bisulfite_genome in your path, since I presume the actual base path to where you ran bison_index is /scratch/general/lustre/u0597274/data/Mus_musculus/C57BL_6J/bowtie2.

@tantrev

tantrev commented Jun 4, 2017

@GodloveD - thank you! I should've just followed your wrapper script from the beginning; I was able to get regular bison running that way. The only problem I ran into was when trying to run bison_herd, where I received the following error:

You're MPI implementation doesn't support MPI_THREAD_MULTIPLE, which is required for bison_herd to work.

Also, I was reading through the Singularity documentation and it probably doesn't matter, but it might be worth considering using Open MPI 2.1, just in case someone doesn't have Open MPI 1.10.x natively available on their cluster.

@dpryan79 - thank you as well! Sorry, I had a couple of idiot mistakes there. As you noticed, I forgot to fix my garbled genome path (a relic from earlier experimentation), and it turns out something had gone wrong with my initial index generation as well. bison_herd seems to be running smoothly right now, though, with the Intel MPI implementation.

Also, I was reading through bison's paper and saw the note that bison_herd's scaling can become "limited by the underlying MPI implementation and network architecture of the cluster". Since the paper's scaling seems to top out at around 9 nodes, but bison's documentation mentions you frequently using more than 9 nodes, I'm just curious whether you have any general feeling for how to gauge the optimal node count for a given cluster (e.g. OpenMPI version, NIC speed, etc.)? For example, the cluster I'm running on right now has a QDR InfiniBand interconnect, but I'd ideally like to use as many simultaneous nodes as possible. :P

Thanks again for both of your guys' help.

@dpryan79
Owner

dpryan79 commented Jun 4, 2017

Glad things are working now. I used to use up to 21 nodes on my old cluster, but that was hardware that's now >5 years old. I unfortunately don't have any great advice on gauging the optimal numbers, but if you find what's optimal for you then please relay it :)

@tantrev

tantrev commented Jun 4, 2017

@dpryan79 - sounds good, I'll have to do some tinkering. :) On that note, I unfortunately seem to have jinxed myself with the previous hopeful comment. bison_herd was running just fine for about 20 minutes, then exited with an error. Here's the stdout and stderr, in case you have any ideas about what's going on.

Sorry to keep bothering you!

@dpryan79
Owner

dpryan79 commented Jun 4, 2017

Bowtie2 died with a signal 9 (aka "kill"), which will most likely happen if the scheduler you're using doesn't like something (e.g., the default memory allocation is too low). Many schedulers have a switch to report back how much memory was actually used, so if yours has such an option then give that a try.
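
With Slurm, for example, the peak memory of a finished job can be pulled from the accounting records, and the limit raised on resubmission (the job ID and new limit below are illustrative):

# Peak resident memory (MaxRSS) actually used by the job and its steps
sacct -j <jobid> --format=JobID,JobName,MaxRSS,State,ExitCode

# Ask for more memory up front next time
sbatch --mem=64g my_bison_herd_job.sh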

@tantrev

tantrev commented Jun 5, 2017

Thank you! You were dead on; the memory limit was indeed exceeded. Everything is now running on 256GB nodes. :P

@tantrev

tantrev commented Jun 5, 2017

So I've been doing some tinkering for a larger-scale deployment, and it seems I'm only getting marginal speed-ups when using 13 nodes (20 Intel cores, 64GB RAM) compared to 5 nodes.

For example, 13 nodes gives about ~76,000 reads/sec, while 5 nodes gives about ~66,000 reads/sec. In both configurations I allocate 7 threads to both the -mp and -@ arguments.

Is it possible that other bison parameters, like queue size and throttling, are the culprit for this marginal performance improvement? If it makes any difference, each mate of the raw .fastq file pairs is about ~100GB.
