Duplicate Persons in cdp-seattle instance #222
Comments
I think there is a possibility that this ends up being an issue for cdp-scrapers. There is a mechanism in place there to handle situations like this, i.e. erratic or duplicate information entered by the municipalities/clerks. So it looks like different IDs were entered for the same person. I haven't investigated this at all, but this is my guess for the time being.
Wait, I'm confused by this issue. In the CDP Seattle instance DB, I see only 1 Tammy Morales. e.g. If I follow the quickstart example, and query the person collection, there is just 1 Tammy Morales.

from cdp_backend.database import models as db_models
from cdp_backend.pipeline.transcript_model import Transcript
import fireo
from gcsfs import GCSFileSystem
from google.auth.credentials import AnonymousCredentials
from google.cloud.firestore import Client

fireo.connection(client=Client(
    project="cdp-seattle-21723dcf",
    credentials=AnonymousCredentials()
))

ppl = list(db_models.Person.collection.fetch())
for p in ppl:
    if 'tammy' in p.name.lower():
        print(p.name, p.external_source_id, p.id, p.key)
# Tammy J. Morales 662 d1dbed7401e6 person/d1dbed7401e6

If this issue is saying that the Legistar end point for Seattle is returning multiple records for Tammy Morales (and others), that is known, unfortunately. And we have a system in place on the scrapers side to at least help us deal with those situations. Definitely possible it's not working 100%, but if so, shouldn't I be able to see multiple Tammy Morales when I execute the code blob above? I think I'm probably not looking at the same "database" that Brian used to get those IDs...
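To check the whole collection rather than a single name, something like the following might help. It is only a sketch: `PersonRecord` and `find_duplicate_persons` are hypothetical names, not part of cdp-backend, and it assumes a "duplicate" means two documents sharing both a (case-insensitive) name and an external source ID, as the bug report describes. It works on plain tuples so it can be run without a database connection.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class PersonRecord(NamedTuple):
    # Hypothetical stand-in for the fields printed in the snippet above.
    name: str
    external_source_id: str
    doc_id: str


def find_duplicate_persons(
    records: Iterable[PersonRecord],
) -> Dict[Tuple[str, str], List[str]]:
    """Group document IDs by (normalized name, external source ID).

    Returns only the groups that contain more than one document.
    """
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for record in records:
        key = (record.name.strip().lower(), str(record.external_source_id))
        groups[key].append(record.doc_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# The first record is the one printed above; the other two are made-up
# placeholders used only to show what a duplicate would look like.
sample = [
    PersonRecord("Tammy J. Morales", "662", "d1dbed7401e6"),
    PersonRecord("Example Person", "1", "aaaaaaaaaaaa"),
    PersonRecord("example person", "1", "bbbbbbbbbbbb"),
]
duplicates = find_duplicate_persons(sample)
```

Against the real instance, the records could presumably be built from the same attributes used above, e.g. `PersonRecord(p.name, p.external_source_id, p.id) for p in ppl`. An empty result would agree with the observation that the production database holds only one document per person.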
Can also confirm from the database directly that there are not two people of the same name.
Where did you get those IDs btw? The IDs in the Firestore database are much, much shorter.
I’d link but I’m on my phone on a ten lane freeway in Texas. The firestore
document PKs can be rather long? I grabbed these directly from the Seattle
firestore console view of the DB. Maybe I was looking at dev?
On Thu, Nov 17, 2022 at 5:00 PM Eva Maxfield Brown wrote:
> Where did you get those IDs btw? the IDs in the firestore database are much much shorter
I think I'm gonna pull some events on the scraper and check out the ingestion model Persons. Will report back.
I don't think it is the scraper, and I think you were checking staging (we should probably refresh the data on staging, since it's a bit behind). I think it is just a minutes item / event minutes item ref that is broken somewhere. I will look into it this weekend -- no worries.
Describe the Bug
There are duplicate documents in the Person collection in the cdp-seattle instance. They have the same name and the same Legistar person ID. I think this may result in unexpected behavior in the front end.
Expected Behavior
There should be one and only one Person record per Legistar person.
Reproduction
Check out records for Tammy Morales:
ID 87638bc6-fd68-4f1f-8449-6137ac242a8
ID 4bf88f27-9933-4c57-8819-342111a6a68c
Dan Strauss:
0996b93d-fdbb-41d4-b488-1dc85ac37366
2ff3c312-04cd-4e84-b46e-5989f239259
Mosqueda and Lewis also have dupes.