Segfault happening #2265

Open
joto opened this issue Oct 13, 2024 · 6 comments

Comments

@joto
Collaborator

joto commented Oct 13, 2024

See https://gist.github.com/tomhughes/0ebdc537b6a9a390b904d394d796b5e0

@pnorman Please add all the details about what you were doing here.

What version of osm2pgsql are you using?

2.0.0+ds-1~bpo12+1

What operating system and PostgreSQL/PostGIS version are you using?

Debian 12

Tell us something about your system

OpenStack Virtual Machine with 64 x 1 core AMD EPYC Processor, 444GB

What did you do exactly?

export LUA_PATH='/srv/vector.openstreetmap.org/osm2pgsql-themepark/lua/?.lua;/srv/vector.openstreetmap.org/spirit/?.lua;;'

# Import the osm2pgsql file specified as an argument, using the locations for spirit
osm2pgsql \
  --output flex \
  --style '/srv/vector.openstreetmap.org/spirit/shortbread.lua' \
  --slim \
  --flat-nodes '/srv/vector.openstreetmap.org/data/nodes.bin' \
  -d spirit \
  --cache 75000 data.pbf

What did you expect to happen?

The planet to import.

What did happen instead?

A core dump occurred in the post-processing stage

What did you do to try analyzing the problem?

@tomhughes provided the backtrace at https://gist.github.com/tomhughes/0ebdc537b6a9a390b904d394d796b5e0

The machine had an unusual postgresql setup at the time. When starting the import it had sufficient connection slots for osm2pgsql to start, but by the time post-processing had started it would not be possible to acquire new connections. When I fixed this error in the machine configuration it worked fine.

@joto
Collaborator Author

joto commented Oct 19, 2024

Strange. The backtrace does not fit with the idea that the database connection failed. I am not sure that's what the problem was. What was this "unusual postgresql setup"? I tried limiting the number of connections and osm2pgsql properly reported an error and exited.

@pnorman
Collaborator

pnorman commented Oct 21, 2024

What was this "unusual postgresql setup"?

Too few max_connections. I fixed it by properly stopping chef and the replication process.

@joto
Collaborator Author

joto commented Oct 21, 2024

I am pretty sure it is not max_connections that caused this. At least when I try that, it seems to work correctly. I'd rather suspect some race condition that went one way on the first try and the other way on the second. Is there any chance you can re-create the situation with the too-small max_connections and try again? At the moment I don't know how to debug this until we have a reproducible way to trigger the bug.

@pnorman
Collaborator

pnorman commented Oct 23, 2024

I can see if I have a suitable system

@joto
Collaborator Author

joto commented Dec 12, 2024

I believe I have figured out what is happening here: There are several threads in the thread pool. Thread A is started and opens a database connection, and all is fine so far. Then thread B is started and can't get a database connection any more, so it throws an exception. The exception is propagated, and as part of that propagation lots of things are destructed. At some point the data structure containing the information about the tables is destructed. Now thread A gets a chance to run again and wants to build the CREATE INDEX command. It needs the information about the tables, but that was destructed already. And then it segfaults.
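To make the lifetime problem concrete, here is a minimal sketch of one way a worker can avoid touching destructed table information: it holds its own `std::shared_ptr` to the data, so stack unwinding in the failing code path cannot free it underneath the worker. This is not osm2pgsql code. `table_info`, `build_index_sql` and `run_postprocessing` are hypothetical names, and shared ownership is only an assumed approach for illustration, not something proposed in this issue.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <memory>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for the table metadata that thread A still needs.
struct table_info {
    std::string name;
};

// Hypothetical "thread A": it resumes late and only then builds the SQL.
std::string build_index_sql(std::shared_ptr<table_info const> table) {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    return "CREATE INDEX ON " + table->name + " USING GIST (geom)";
}

std::string run_postprocessing() {
    std::future<std::string> worker_a;
    try {
        auto tables = std::make_shared<table_info>(table_info{"roads"});
        // The worker receives its own copy of the shared_ptr.
        worker_a = std::async(std::launch::async, build_index_sql, tables);
        // Stand-in for "thread B" failing to get a connection.
        throw std::runtime_error{"no database connection available"};
    } catch (std::runtime_error const &) {
        // The local `tables` pointer was released during unwinding, but
        // the worker's copy keeps the table_info object alive.
    }
    return worker_a.get(); // waits for the worker to finish
}
```

With a raw pointer or reference in place of the `shared_ptr`, the worker would read a destructed object, which is the situation described above. Shared ownership only addresses the dangling access; it does nothing to stop the worker early.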

To solve this we would need to make sure all threads are destructed before anything else. But there is no way to kill a running std::thread. We would have to wait until it is done with its work, which doesn't make much sense, because that situation isn't recoverable anyway. In C++20 there is a new std::jthread which has a mechanism for stopping it from the outside, but I believe that also only works if the thread cooperates. And if that thread has just called CREATE INDEX, it might be a long time before it even gets to run and can notice that it should shut down.
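The cooperative nature of such a stop request can be sketched as follows. To keep the example buildable without C++20, it uses a plain `std::atomic<bool>` in the role that `std::stop_token` plays for `std::jthread`; the pattern is the same. All names are hypothetical, and the "query" is simulated with a future that the main thread controls, standing in for a long-running CREATE INDEX.

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <thread>

// Returns how many work steps the worker still completed after it had
// already been asked to stop.
int steps_finished_after_stop_request() {
    std::atomic<bool> stop_requested{false}; // stand-in for std::stop_token
    std::atomic<int> steps{0};
    std::promise<void> query_started; // worker has entered the "query"
    std::promise<void> query_result;  // fulfilled when the "query" returns
    std::future<void> query_returns = query_result.get_future();

    std::thread worker{[&] {
        // The stop flag is only looked at between work steps.
        while (!stop_requested.load()) {
            query_started.set_value();
            query_returns.wait(); // blocked, like waiting on CREATE INDEX
            ++steps;
        }
    }};

    query_started.get_future().wait(); // worker is now inside the "query"
    stop_requested.store(true);        // ask it to stop: no effect yet
    int const before = steps.load();   // still 0, the worker is blocked
    query_result.set_value();          // the "query" finally returns
    worker.join();                     // only now does the worker exit
    return steps.load() - before;
}
```

The function returns 1: the stop request arrived while the worker was blocked, yet the step in flight ran to completion before the flag was checked. The thread cannot react to the request any sooner than its blocking call returns.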

At the moment, I don't have a good idea how to solve this. :-(

@tomhughes
Contributor

Yes, jthread stop tokens are essentially cooperative, so you'd probably have to combine that with PQcancelCreate to hold a cancellation object for each connection, on which you could call PQcancelBlocking at the same time as you request the thread to stop.
