Improve rule-based "Plain citations parser" #12893

@bwakkie

Description

JabRef version

Other (please describe below)

Operating system

GNU / Linux

Details on version and operating system

JabRef 5.16--2024-07-25--771c4cd Linux 6.12.20-2-manjaro amd64 Java 21.0.2 JavaFX 22.0.2+4

Checked with the latest development build (copy version output from About dialog)

  • I made a backup of my libraries before testing the latest development version.
  • I have tested the latest development version and the problem persists

Steps to reproduce the behaviour

JabRef 5.16

There is a problem with the text parser: it turns citations into completely unrelated ones.
I created a test case; see the two attached files.

testcase.txt

testcase.bib.txt

How come these totally different citations are matched? Is there a way to parse the strings without using Grobid? I think blindly trusting Grobid is wrong. At the very least, verifying the whole title string against the input would already help to detect that something is not right.
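A minimal sketch of the title check suggested above, assuming the parser result exposes the title as a plain string. The class `TitleSanityCheck` and its methods are hypothetical names for illustration, not part of JabRef or Grobid. The idea is to normalize both strings and accept the result only if the parsed title actually occurs in the citation text that was pasted in.

```java
import java.util.Locale;

class TitleSanityCheck {

    // Lowercase the text and collapse every run of characters that are
    // neither letters nor digits into a single space.
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT)
                   .replaceAll("[^\\p{L}\\p{N}]+", " ")
                   .trim();
    }

    // True if the parsed title occurs, word for word, in the plain citation.
    // An empty title is treated as a failed check.
    static boolean titleAppearsIn(String plainCitation, String parsedTitle) {
        String title = normalize(parsedTitle);
        if (title.isEmpty()) {
            return false;
        }
        // Pad with spaces so the match starts and ends on word boundaries.
        String haystack = " " + normalize(plainCitation) + " ";
        return haystack.contains(" " + title + " ");
    }

    public static void main(String[] args) {
        String citation = "W. H. Enright. Improving the efficiency of matrix operations "
                + "in the numerical solution of stiff ordinary differential equations. "
                + "ACM Trans. Math. Softw., 4(2), 127-136, June 1978.";
        // Title taken from the same citation.
        System.out.println(titleAppearsIn(citation,
                "Improving the efficiency of matrix operations in the numerical "
                + "solution of stiff ordinary differential equations"));
        // Title of an unrelated citation.
        System.out.println(titleAppearsIn(citation,
                "Fast and accurate flow-insensitive points-to analysis"));
    }
}
```

An exact containment test like this is deliberately strict. A real check would probably need some tolerance, for example for hyphenation or OCR noise, but even the strict version would flag a result whose title has nothing in common with the input.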

More test cases

At https://github.com/inukshuk/anystyle/blob/main/spec/benchmark.rb, anystyle has the following tests for benchmarking:

data = <<-END_REFERENCES
<author> A. Cau, R. Kuiper, and W.-P. de Roever. </author> <title> Formalising Dijkstra's development strategy within Stark's formalism. </title> <editor> In C. B. Jones, R. C. Shaw, and T. Denvir, editors, </editor> <container-title> Proc. 5th. BCS-FACS Refinement Workshop, </container-title> <date> 1992. </date>
<author> M. Kitsuregawa, H. Tanaka, and T. Moto-oka. </author> <title> Application of hash to data base machine and its architecture. </title> <journal> New Generation Computing, </journal> <volume> 1(1), </volume> <date> 1983. </date>
<author> Alexander Vrchoticky. </author> <title> Modula/R language definition. </title> <tech> Technical Report TU Wien rr-02-92, version 2.0, </tech> <institution> Dept. for Real-Time Systems, Technical University of Vienna, </institution> <date> May 1993. </date>
<author> Marc Shapiro and Susan Horwitz. </author> <title> Fast and accurate flow-insensitive points-to analysis. </title> <container-title> In Proceedings of the 24th Annual ACM Symposium on Principles of Programming Languages, </container-title> <date> January 1997. </date>
<author> W. Landi and B. G. Ryder. </author> <title> Aliasing with and without pointers: A problem taxonomy. </title> <institution> Center for Computer Aids for Industrial Productivity </institution> <tech> Technical Report CAIP-TR-125, </tech> <institution> Rutgers University, </institution> <date> September 1990. </date>
<author> W. H. Enright. </author> <title> Improving the efficiency of matrix operations in the numerical solution of stiff ordinary differential equations. </title> <journal> ACM Trans. Math. Softw., </journal> <volume> 4(2), </volume> <pages> 127-136, </pages> <date> June 1978. </date>
<author> Gmytrasiewicz, P. J., Durfee, E. H., & Wehe, D. K. </author> <date> (1991a). </date> <title> A decision theoretic approach to coordinating multiagent interaction. </title> <container-title> In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, </container-title> <pages> pp. 62-68 </pages> <location> Sydney, Australia. </location>
<author> A. Bookstein and S. T. Klein, </author> <title> Detecting content-bearing words by serial clustering, </title> <container-title> Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, </container-title> <pages> pp. 319327, </pages> <date> 1995. </date>
<author> U. Dayal, H. Garcia-Molina, M. Hsu, B. Kao, and M.- C. Shan. </author> <title> Third generation TP monitors: A database challenge. </title> <container-title> In ACM SIGMOD Conference on Management of Data, </container-title> <pages> pages 393-397, </pages> <location> Washington, D. C., </location> <date> May 1993. </date>
<author> C. Qiao and R. Melhem, </author> <title> "Reducing Communication Latency with Path Multiplexing in Optically Interconnected Multiprocessor Systems", </title> <container-title> Proc. of HPCA-1, </container-title> <date> 1995. </date>
END_REFERENCES

We could re-use those for our RegEx tests.
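One way to reuse these lines, sketched under the assumption that each reference sits on a single line and uses only flat, non-nested tags as in the sample above: strip the tags to get the plain citation that is fed to the parser, and collect the tagged spans as the expected fields. `TaggedReference` is a hypothetical helper, not existing JabRef code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TaggedReference {

    // Matches <tag> content </tag>; the back-reference \1 requires the
    // closing tag to have the same name as the opening one.
    private static final Pattern FIELD =
            Pattern.compile("<([a-z-]+)>\\s*(.*?)\\s*</\\1>");

    // Plain citation string: tags removed, whitespace collapsed.
    static String plainText(String taggedLine) {
        return taggedLine.replaceAll("</?[a-z-]+>", "")
                         .replaceAll("\\s+", " ")
                         .trim();
    }

    // Expected fields in order of appearance. A tag can occur more than once
    // in a line (e.g. <institution>), so each tag maps to a list of values.
    static Map<String, List<String>> fields(String taggedLine) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        Matcher matcher = FIELD.matcher(taggedLine);
        while (matcher.find()) {
            result.computeIfAbsent(matcher.group(1), key -> new ArrayList<>())
                  .add(matcher.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "<author> M. Kitsuregawa, H. Tanaka, and T. Moto-oka. </author> "
                + "<title> Application of hash to data base machine and its architecture. </title> "
                + "<journal> New Generation Computing, </journal> <volume> 1(1), </volume> "
                + "<date> 1983. </date>";
        System.out.println(plainText(line));
        System.out.println(fields(line));
    }
}
```

The extracted values keep their trailing punctuation (for example `1(1),`), so a test that compares them with parser output would need to trim it first.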

Metadata

Assignees: No one assigned
Status: Free to take
Milestone: No milestone