Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-deterministic row counts for spatial joins in DuckDB 1.2.0 #508

Open
2 tasks done
grobins opened this issue Feb 27, 2025 · 9 comments
Open
2 tasks done

Non-deterministic row counts for spatial joins in DuckDB 1.2.0 #508

grobins opened this issue Feb 27, 2025 · 9 comments

Comments

@grobins
Copy link

grobins commented Feb 27, 2025

What happens?

Description
I'm experiencing inconsistent query results when running the same spatial query with DuckDB on EC2. When executing the identical query multiple times in succession, I get different row counts:

First run: 35,817 records
Second run: 291,614 records
Third run: 9,555 records

The correct number of records - which I get from running locally with 1.1.3 is 1,632,012

This appears to be a non-deterministic behavior introduced in DuckDB 1.2.0.
Environment

Local: DuckDB 1.1.3 (gives consistent results)
EC2: DuckDB 1.2.0
R package: duckdb
Extensions: spatial, aws, httpfs, icu

I'm not able to installed 1.1.3 into the EC2, closest I get is v1.1.3-dev165 but then the spatial extensions aren't working

Error: IO Error: Failed to install 'spatial'
The file was built for DuckDB version 'v1.1.3', but we can only load extensions built for DuckDB version '19864453f7'.

To Reproduce

SQL I'm running

WITH events AS (SELECT *
                     , ST_Point(longitude, latitude) geom
                     , timezone('Pacific/Auckland', event_timestamp::TIMESTAMP WITH TIME ZONE) AS event_timestamp_nz
                     , CAST(event_timestamp AS DATE) dom
                     , strftime(date_trunc('hour', event_timestamp_nz), '%H:%M') timegroup
                     , extract(dow FROM event_timestamp_nz) dow
                    FROM read_parquet('MYLOCATION')
)
   , filtered_route_ways AS (SELECT *
                                  , st_buffer(ST_Envelope(geom), 0.001) bbox_geom
                                 FROM route_ways)
   , ranked_matches AS (SELECT
                            e.*
                          , r.way_id
                          , r.highway
                          , r.maxspeed
                          , r.osm_tuple_length_m
                          , r.osm_way_length_m
                          , r.osm_tuple_direction
                          , ROW_NUMBER() OVER (
            PARTITION BY e.machine_gid, e.event_timestamp
            ORDER BY ST_Distance(e.geom, r.geom)
        ) AS rank
                            FROM events e
                                     JOIN filtered_route_ways r
                                          ON ST_Intersects(e.geom, r.bbox_geom))
 
select * from ranked_matches;

SETUP before running

con <- dbConnect(duckdb::duckdb())
dbExecute(con, "INSTALL spatial; LOAD spatial;")
dbExecute(con, "INSTALL aws; LOAD aws;")
dbExecute(con, "INSTALL httpfs; LOAD httpfs;")
dbExecute(con, "INSTALL icu; LOAD icu;")
dbExecute(con, "SET TimeZone = 'Pacific/Auckland';")
dbWriteTable(con, "route_ways", route_ways)
dbExecute(con, "ALTER TABLE route_ways ADD COLUMN geom GEOMETRY;")
dbExecute(con, "UPDATE route_ways SET geom = ST_GeomFromText(linestring);")
dbExecute(con, "CREATE INDEX idx_geom ON route_ways USING RTREE (geom);")
dbExecute(con, "CALL load_aws_credentials('robinsight-data-hub');")

The 'events' source is exactly the same location in S3 that the local and EC2 versions are querying. the 'route_ways' dataframe is exactly the same, and I've confirmed both the EC2 and local queries return the correct results for the count from the 'events' CTE and 'filtered_route_ways' CTE.

UPDATE:
I replaced the ST_Intersects function with the following SQL and I'm getting consistent results between local and EC2 versions

WITH events AS (
    SELECT *
         , longitude
         , latitude
         , ST_Point(longitude, latitude) AS geom
         , timezone('Pacific/Auckland', event_timestamp::TIMESTAMP WITH TIME ZONE) AS event_timestamp_nz
         , CAST(event_timestamp AS DATE) AS dom
         , strftime(date_trunc('hour', event_timestamp_nz), '%H:%M') AS timegroup
         , extract(dow FROM event_timestamp_nz) AS dow
    FROM read_parquet('MYLOCATION')
),

route_ways_bounds AS (
    SELECT *
         , geom
         , ST_Buffer(ST_Envelope(geom), 0.001) AS bbox_geom
         , ST_XMin(ST_Buffer(ST_Envelope(geom), 0.001)) AS min_lon
         , ST_XMax(ST_Buffer(ST_Envelope(geom), 0.001)) AS max_lon
         , ST_YMin(ST_Buffer(ST_Envelope(geom), 0.001)) AS min_lat
         , ST_YMax(ST_Buffer(ST_Envelope(geom), 0.001)) AS max_lat
    FROM route_ways
),

filtered_events AS (
    SELECT e.*, r.way_id, r.highway, r.maxspeed, r.osm_tuple_length_m, r.osm_way_length_m, r.osm_tuple_direction, r.geom AS route_geom
    FROM events e
    JOIN route_ways_bounds r
        ON e.longitude BETWEEN r.min_lon AND r.max_lon
        AND e.latitude BETWEEN r.min_lat AND r.max_lat
),

ranked_matches AS (
    SELECT
        f.*,
        ROW_NUMBER() OVER (
            PARTITION BY f.machine_gid, f.event_timestamp
            ORDER BY ST_Distance(f.geom, f.route_geom)
        ) AS rank
    FROM filtered_events f
)

OS:

Mac M2

DuckDB Version:

1.2.0

DuckDB Client:

R

Hardware:

No response

Full Name:

Gareth Robins

Affiliation:

Robinsight

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

No - Other reason (please specify in the issue body)

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@szarnyasg szarnyasg transferred this issue from duckdb/duckdb Feb 27, 2025
@Maxxen
Copy link
Member

Maxxen commented Feb 27, 2025

This should be fixed since yesterday, could you try reinstalling the v1.2.0 spatial extension? I.e. execute FORCE INSTALL spatial;

In short, we accidentally introduced a bug in the version of the extension we distribute on v1.2.0 when trying to backport another fix. But we've now replaced the binary again, with yet another patch applied that should fix this. Some more context

@bmcandr
Copy link

bmcandr commented Mar 9, 2025

Hi @Maxxen, I'm also still seeing this issue. I'm on duckdb v1.2.1 and FORCE INSTALL spatial; is installing:

┌──────────────────┬─────────┬───────────┬───┬───────────────────┬───────────────────┬────────────────┐
│  extension_name  │ loaded  │ installed │ … │ extension_version │   install_mode    │ installed_from │
│     varchar      │ boolean │  boolean  │   │      varchar      │      varchar      │    varchar     │
├──────────────────┼─────────┼───────────┼───┼───────────────────┼───────────────────┼────────────────┤
│ spatial          │ false   │ true      │ … │ 6b3d93c           │ REPOSITORY        │ core           │

which is about 2 weeks old.

FORCE INSTALL spatial from core_nightly; fails with:

Failed to download extension "spatial" at URL "http://nightly-extensions.duckdb.org/v1.2.1/osx_amd64/spatial.duckdb_extension.gz" (HTTP 403)
Extension "spatial" is an existing extension.

For more info, visit https://duckdb.org/docs/extensions/troubleshooting/?version=v1.2.1&platform=osx_amd64&extension=spatial

because a nightly v1.2.1 doesn't seem to be available.

I've replicated the above on both x86 Mac OSX and x86 Ubuntu EC2 instance with fresh installs of duckdb.

@Maxxen
Copy link
Member

Maxxen commented Mar 9, 2025

@bmcandr Hello! Are you able to provide a reproduction?

@Maxxen
Copy link
Member

Maxxen commented Mar 9, 2025

Extension version 6b3d93c is not the version that should be distributed for v1.2.1

@bmcandr
Copy link

bmcandr commented Mar 9, 2025

Sure, I'm playing around with a Parquet file I created from a STAC Collection describing 7M+ satellite images. I'm just trying to select images intersecting a particular geometry, like the state of California:

WITH cali as (
    SELECT ST_GeomFromGeoJSON(geometry) geometry
    FROM read_json('https://spelunker.whosonfirst.org/id/85688637/geojson')
) SELECT COUNT(s.*)
FROM "s3://satellogic-earthview-stac-geoparquet/satellogic-earthview-stac-items.parquet" s 
JOIN cali ON ST_Intersects(cali.geometry, s.geometry);

There are many images over California contained in the dataset, but the query frequently returns 0 rows but has also returned 77,066, 2662, 4700 rows, etc.

Image

@Maxxen
Copy link
Member

Maxxen commented Mar 9, 2025

Alright, seems like we distribute the wrong extension version. This was fixed in a commit after 6b3d93c. Strangely main DuckDB CI is pinned at 919c69f and 2905968 on both main and v1.2-branches, so im not sure how we ended up with 6b3d93c. Ill see check in with the others once I get into the office tomorrow.

@bmcandr
Copy link

bmcandr commented Mar 9, 2025

Thanks!

@Maxxen
Copy link
Member

Maxxen commented Mar 10, 2025

Ok, like always caching is the culprit. We've made adjustments on our end so if you already have spatial installed a simple UPDATE EXTENSIONS or FORCE INSTALL spatial should give you spatial version 2905968, which has this fixed. However, there might be more caches downstream so it might take some minutes/hours for everything to propagate.

@bmcandr
Copy link

bmcandr commented Mar 10, 2025

Can confirm this version works. Thanks again @Maxxen!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants