Match opinions based on pincite #4323
This is a bit of a complicated bug, so here's the TLDR:
I think the scenarios proposed look like this:
For the scenarios where we can't resolve the pincite, I think we should match the "combined" opinion if it exists, instead of the first in order. Looking at a particular cluster with 4 opinions, the first 3 have ordering keys and types lead, concurrence, dissent; the last and oldest is a "combined" opinion with a null ordering key. So, if the pincite is not resolvable, or in the case of a general citation, shouldn't we point to the combined opinion, which is the "whole" decision, instead of a fragment?
Even if we decide against this, there would be corrections to make. In that same cluster, 3297 opinions cite the combined opinion, and only 1 cites the "lead" opinion.
Jumping into the task itself:
Case when we can identify page numbers in the opinion
This is possible for opinions that come from an HTML / XML source.
Once we have the pincite page number, we test for its presence in any of the HTML fields, and it's a match if it exists. We have 363 895 clusters where a pincite citing into them would be resolvable, around 3.7% of the clusters in the DB.
It seems eyecite already identifies pincites...
Matching on ordering key
This can be done easily, but I am unsure if it's the correct choice. If done, we should back-correct the
As of the time of writing, 430 596 clusters have at least 1 opinion with a non-null ordering key.
Queries for the stats:
-- clusters with more than 1 opinion, and 1 rich structured field per opinion
courtlistener=> select count(*) from (select cluster_id from search_opinion group by cluster_id having count(*) > 1 and bool_and(xml_harvard <> '' or html_lawbox <> '' or html_columbia <> '')) a;
count
--------
363895
-- clusters with at least 1 ordering key
courtlistener=> select count(distinct(cluster_id)) from search_opinion where ordering_key is not null;
count
--------
430596
(1 row)
courtlistener=> select count(distinct(cluster_id)) from search_opinion;
count
---------
9823322
(1 row)
-- clusters that do not have a combined opinion
courtlistener=> select count(*) from (select cluster_id from search_opinion group by cluster_id having bool_and(ordering_key is not null)) a;
count
--------
219875
(1 row)
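To make the "test for the page number's presence" idea concrete, here is a minimal sketch in Python. It is not CourtListener code: `OpinionStub` is a hypothetical stand-in for an Opinion row that carries only the three rich-text fields used in the query above, and the assumption that page breaks appear as star pagination (`*123`) in those fields is mine and would need checking against real data. The helper returns a match only when exactly one sub-opinion contains the page marker, since a page can be shared by two sub-opinions.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class OpinionStub:
    """Hypothetical stand-in for an Opinion row (not the real model)."""

    id: int
    xml_harvard: str = ""
    html_lawbox: str = ""
    html_columbia: str = ""


def has_page(opinion: OpinionStub, page: int) -> bool:
    # Assumption: page breaks are rendered as "*<page>" (star pagination).
    # The negative lookahead stops "*12" from matching inside "*123".
    pattern = re.compile(rf"\*{page}(?!\d)")
    fields = (opinion.xml_harvard, opinion.html_lawbox, opinion.html_columbia)
    return any(pattern.search(field) for field in fields)


def resolve_by_pincite(
    opinions: Sequence[OpinionStub], page: int
) -> Optional[OpinionStub]:
    """Return the single sub-opinion containing the page marker, else None."""
    matches = [o for o in opinions if has_page(o, page)]
    # Zero or several matches means the pincite is not resolvable this way.
    return matches[0] if len(matches) == 1 else None


lead = OpinionStub(1, html_lawbox='<p>a <span class="star-pagination">*12</span> b</p>')
dissent = OpinionStub(2, html_lawbox='<p>c <span class="star-pagination">*15</span> d</p>')
resolved = resolve_by_pincite([lead, dissent], 15)
```

When this returns `None`, the caller still needs a fallback policy (first ordered opinion or the combined opinion), which is the open question in this thread.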
Some extra thoughts
Assuming we can resolve the pincites, we should update the
Also, resolving pincites suggests some model changes. We could add a field to OpinionsCited with the actual page number.
class OpinionsCited(models.Model):
    citing_opinion = models.ForeignKey(
        Opinion, related_name="cited_opinions", on_delete=models.CASCADE
    )
    cited_opinion = models.ForeignKey(
        Opinion, related_name="citing_opinions", on_delete=models.CASCADE
    )
    pincite = models.IntegerField(
        help_text="The page cited"
    )
--- computing depth
SELECT citing_opinion_id, cited_opinion_id, count(*) as depth
FROM search_opinions_cited
GROUP BY citing_opinion_id, cited_opinion_id;
What's more, we could even leave the "depth" on the model, as a "depth" of pincites, which I imagine happens if the same opinion part is cited multiple times.
--- computing depth
SELECT citing_opinion_id, cited_opinion_id, sum(depth) as depth
FROM search_opinions_cited
GROUP BY citing_opinion_id, cited_opinion_id;
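As a rough illustration of the first "computing depth" query, the sketch below derives depth from per-pincite rows in plain Python. The row tuples and the `depth_by_pair` helper are hypothetical and only mirror the proposed columns (citing opinion, cited opinion, pincite); they are not part of any existing schema.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Hypothetical rows: (citing_opinion_id, cited_opinion_id, pincite).
# A pincite of None stands for a general citation with no page.
Row = Tuple[int, int, Optional[int]]


def depth_by_pair(rows: Iterable[Row]) -> Counter:
    """Count rows per (citing, cited) pair, like count(*) with GROUP BY."""
    return Counter((citing, cited) for citing, cited, _pincite in rows)


rows = [
    (1, 10, 123),
    (1, 10, 125),
    (1, 11, None),
    (2, 10, 123),
]
depths = depth_by_pair(rows)
```

With one row per pincite, the existing depth value becomes derivable rather than stored, which is the trade-off raised above.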
Hm, @flooie might have an opinion here, but I think if we have sub-opinions (plural) as well as a combined opinion, we should just match to the first one when we can't resolve the pincite. I think it's generally the most important decision in the cluster and the one that's assumed.
When we re-run the citation finder, it'll nuke existing citations and replace them with better ones. It's designed that way.
We wouldn't want to be looking in the HTML to do matches, BUT if we're going to do pincites, we should fix #4843 first. I think it'd give us an efficient way to do this.
Yes!
Hm, that doesn't seem worth it, but could we just not store the pincite in the DB? Our destination is:
I noted on the pincite sub-issue that it would be hard to do. Up to Bill if it's worth it now or something we should do later. It's pretty tough.
Perhaps this is obvious, but I would just point out that pin-cites alone are not sufficient to identify which sub-opinion is being cited. If someone searched
I agree
Fixing #4843 only allows us to pincite to the cluster in a safer way; it doesn't help us pincite to sub-opinions, as highlighted above.
We should already be generating
I'm not sure anyone mentioned the fact that parallel citations are also going to make things trickier.
Some thoughts after talking with Bill and looking for examples
I think we should table this and just link to the first ordered opinion. I think we need to improve eyecite more first, as well as think about changes to the citation and/or other models.
Sounds good, thanks Bill and everybody else for the analysis! We'll get to this at some later point.
This is a follow-up to #4211, where we discussed potential improvements for matching the correct opinion when resolving citations to create OpinionCited instances.
Currently within es_reverse_match, if more than one Opinion found belongs to the same cluster, the first one matched is the one retrieved to create the OpinionCited instance.
This can be improved in one of two ways:
Currently, ordering keys in Opinions are empty, so this improvement might have to wait until the ordering is populated.
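A minimal sketch of the ordering-key option, assuming the hits from a single cluster are already in hand. `Hit` and `pick_opinion` are hypothetical names for illustration and do not reflect how es_reverse_match is actually written. The selection prefers the lowest non-null ordering key and keeps today's behavior (first matched) when no key is populated.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Hit:
    """Hypothetical search hit for one Opinion in a cluster."""

    opinion_id: int
    cluster_id: int
    ordering_key: Optional[int]


def pick_opinion(hits: Sequence[Hit]) -> Hit:
    """Pick one opinion among several hits from the same cluster."""
    if not hits:
        raise ValueError("no hits to choose from")
    ordered = [h for h in hits if h.ordering_key is not None]
    if ordered:
        # First ordered opinion, typically the lead opinion.
        return min(ordered, key=lambda h: h.ordering_key)
    # No ordering keys populated: fall back to the first match.
    return hits[0]


hits = [Hit(40, 7, None), Hit(42, 7, 2), Hit(41, 7, 1)]
chosen = pick_opinion(hits)
```

Note that this skips a combined opinion (null ordering key) whenever ordered sub-opinions exist; preferring the combined opinion instead would be a one-line change to the fallback order.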