BERT score: maximum at self-comparison, symmetry, invariance to additional items #2728

GPPassos · 2024-09-09T20:12:13Z

🐛 Bug

I would expect BERTScore to have the following properties:

Given a single list of sentences, with all pairs compared as preds and targets, the score should be maximal when the same sentence is given as preds[i] and targets[i].
The F1 score should be unchanged when preds and targets are swapped.
With idf=False, extending the preds and targets lists should not change the scores of the earlier inputs.

There are counterexamples for all of the properties above.
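For reference, the first two properties follow from the greedy-matching definition in the BERTScore paper when idf weighting is off. Precision averages, over prediction tokens, the best cosine similarity to any target token; recall does the same with the roles swapped; F1 is their harmonic mean. Below is a minimal sketch on hand-made toy vectors. The vectors and the helper name `greedy_f1` are illustrative assumptions, not torchmetrics code, and no BERT model is involved.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_f1(pred_vecs, target_vecs):
    """Unweighted (idf=False) greedy-matching F1 over token vectors (hypothetical helper)."""
    # Precision: each prediction token is matched to its most similar target token.
    precision = sum(max(cosine(p, t) for t in target_vecs) for p in pred_vecs) / len(pred_vecs)
    # Recall: each target token is matched to its most similar prediction token.
    recall = sum(max(cosine(t, p) for p in pred_vecs) for t in target_vecs) / len(target_vecs)
    return 2 * precision * recall / (precision + recall)

# Toy "token embeddings" for two sentences (made up for illustration).
a = [[1.0, 0.0], [0.6, 0.8]]
b = [[0.0, 1.0], [0.8, 0.6], [1.0, 1.0]]

f1_ab = greedy_f1(a, b)  # a as prediction, b as target
f1_ba = greedy_f1(b, a)  # roles swapped: precision and recall trade places
f1_aa = greedy_f1(a, a)  # self-comparison: every token matches itself
```

In this sketch `f1_ab` and `f1_ba` agree up to floating-point rounding, and `f1_aa` is 1 up to rounding, which is the upper bound because cosine similarity never exceeds 1. The tests below assert the same behavior for the real metric.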

To Reproduce

To reproduce the behavior, run the test suite with the following tests added to test_bertscore.py.

Proposed test suite

from itertools import product  # needed by the tests below, if not already imported


@skip_on_connection_issues()
@pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
@pytest.mark.parametrize(
    ["idf", "batch_size"],
    [(False, 1),
     (False, 9),
     (True, 1),
     (True, 9)],
)
def test_bertscore_most_similar(idf: bool, batch_size: int):
    """Tests that BERTScore actually gives the highest score to self-similarity."""
    short = "hello there"
    long = "master kenobi"
    longer = "general kenobi"
    
    sentences = [short, long, longer]
    preds, targets = list(zip(*list(product(sentences,
                                            sentences))))
    score = bert_score(preds, targets, idf=idf, lang="en",
                       rescale_with_baseline=False, batch_size=batch_size)
    for i in range(len(preds)):
        max_pred = i%(len(sentences))*(1 + len(sentences))
        max_target = int(i/(len(sentences)))*(1 + len(sentences))
        assert score["f1"][i] <= score["f1"][max_pred], \
            f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_pred], targets[max_pred]}\n{i=}{max_pred=}"
        assert score["f1"][i] <= score["f1"][max_target], \
            f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_target], targets[max_target]}\n{i=}{max_target=}"



@skip_on_connection_issues()
@pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
@pytest.mark.parametrize(
    ["idf", "batch_size"],
    [(False, 1),
     (False, 9),
     (True, 1),
     (True, 9)],
)
def test_bertscore_symmetry(idf: bool, batch_size: int):
    """Tests that BERTscore F1 score is symmetric between reference and prediction.
    As F1 is symmetric, it should also be symmetric."""

    short = "hello there"
    long = "master kenobi"
    longer = "general kenobi"

    sentences = [short, long, longer]
    preds, targets = list(zip(*list(product(sentences,
                                            sentences))))
    score = bert_score(preds, targets, idf=idf, lang="en",
                       rescale_with_baseline=False, batch_size=batch_size)
    for i in range(len(preds)):
        for j in range(len(targets)):
            if preds[i] == targets[j] and preds[j] == targets[i]:
                assert score['f1'][i] == pytest.approx(score['f1'][j]), \
                    f"f1 score for {(preds[i], targets[i])} is not the same as {(preds[j], targets[j])}."

        
@skip_on_connection_issues()
@pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
@pytest.mark.parametrize(
    ["idf", "batch_size"],
    [(False, 1),
     (False, 3)]
)
def test_bertscore_additional_sentence(idf: bool, batch_size: int):
    """Tests that BERTscore keeps the same scores for previous inputs
    by adding additional elements to the input lists. This should be the case for idf=False."""

    short = "hello there"
    long = "master kenobi"
    longer = "general kenobi"

    preds = [long,long]
    targets = [long,short]

    score = bert_score(preds, targets, idf=idf, lang="en",
                       rescale_with_baseline=False, batch_size=batch_size)

    longlong = score["f1"][0]
    longshort = score["f1"][1]
    # First index should be the self-comparison - sorting by length should not shuffle this
    assert longlong > longshort
    
    preds = preds + [short, longer]
    targets = targets + [longer, long]

    score = bert_score(preds, targets, idf=idf, lang="en",
                       rescale_with_baseline=False, batch_size=batch_size)

    # First two indices should be exactly as in the previous call to metric
    assert score["f1"][0] == pytest.approx(longlong)
    assert score["f1"][1] == pytest.approx(longshort)
    # Indices 1 and 2 should also be smaller than self-comparison.
    assert score["f1"][0] > score["f1"][1]
    assert score["f1"][0] > score["f1"][2]
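
The index arithmetic in test_bertscore_most_similar relies on the ordering of itertools.product. With n sentences, pair i has preds[i] = sentences[i // n] and targets[i] = sentences[i % n], so the self-comparison of sentence k sits at index k * (n + 1). A small standalone check of that layout (no model is needed; the name `self_pair_indices` is only illustrative):

```python
from itertools import product

sentences = ["hello there", "master kenobi", "general kenobi"]
n = len(sentences)
preds, targets = zip(*product(sentences, sentences))

for i in range(len(preds)):
    # product varies its last argument fastest: "row" is the pred, "column" the target
    assert preds[i] == sentences[i // n]
    assert targets[i] == sentences[i % n]

# Indices of the self-comparison pairs (s, s)
self_pair_indices = [k * (n + 1) for k in range(n)]
assert all(preds[j] == targets[j] for j in self_pair_indices)
```

Index 5 is ('master kenobi', 'general kenobi'), the pair reported in the failure output, and index 4 is the self-comparison of its prediction.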

Test results

unittests/text/test_bertscore.py::test_bertscore_most_similar[False-1] FAILED                                                        [ 10%]
unittests/text/test_bertscore.py::test_bertscore_most_similar[False-9] FAILED                                                        [ 20%]
unittests/text/test_bertscore.py::test_bertscore_most_similar[True-1] FAILED                                                         [ 30%]
unittests/text/test_bertscore.py::test_bertscore_most_similar[True-9] FAILED                                                         [ 40%]
unittests/text/test_bertscore.py::test_bertscore_symmetry[False-1] FAILED                                                            [ 50%]
unittests/text/test_bertscore.py::test_bertscore_symmetry[False-9] FAILED                                                            [ 60%]
unittests/text/test_bertscore.py::test_bertscore_symmetry[True-1] FAILED                                                             [ 70%]
unittests/text/test_bertscore.py::test_bertscore_symmetry[True-9] FAILED                                                             [ 80%]
unittests/text/test_bertscore.py::test_bertscore_additional_sentence[False-1] FAILED                                                 [ 90%]
unittests/text/test_bertscore.py::test_bertscore_additional_sentence[False-3] FAILED                                                 [100%]

================================================================= FAILURES =================================================================
___________________________________________________ test_bertscore_most_similar[False-1] ___________________________________________________

idf = False, batch_size = 1

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 9),
         (True, 1),
         (True, 9)],
    )
    def test_bertscore_most_similar(idf: bool, batch_size: int):
        """Tests that BERTScore actually gives the highest score to self-similarity."""
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        sentences = [short, long, longer]
        preds, targets = list(zip(*list(product(sentences,
                                                sentences))))
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
        for i in range(len(preds)):
            max_pred = i%(len(sentences))*(1 + len(sentences))
            max_target = int(i/(len(sentences)))*(1 + len(sentences))
            assert score["f1"][i] <= score["f1"][max_pred], \
                f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_pred], targets[max_pred]}\n{i=}{max_pred=}"
>           assert score["f1"][i] <= score["f1"][max_target], \
                f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_target], targets[max_target]}\n{i=}{max_target=}"
E           AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
E             i=5max_target=4
E           assert tensor(0.9961) <= tensor(0.9664)

unittests/text/test_bertscore.py:220: AssertionError
___________________________________________________ test_bertscore_most_similar[False-9] ___________________________________________________

idf = False, batch_size = 9

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 9),
         (True, 1),
         (True, 9)],
    )
    def test_bertscore_most_similar(idf: bool, batch_size: int):
        """Tests that BERTScore actually gives the highest score to self-similarity."""
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        sentences = [short, long, longer]
        preds, targets = list(zip(*list(product(sentences,
                                                sentences))))
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
        for i in range(len(preds)):
            max_pred = i%(len(sentences))*(1 + len(sentences))
            max_target = int(i/(len(sentences)))*(1 + len(sentences))
            assert score["f1"][i] <= score["f1"][max_pred], \
                f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_pred], targets[max_pred]}\n{i=}{max_pred=}"
>           assert score["f1"][i] <= score["f1"][max_target], \
                f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_target], targets[max_target]}\n{i=}{max_target=}"
E           AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
E             i=5max_target=4
E           assert tensor(0.9961) <= tensor(0.9664)

unittests/text/test_bertscore.py:220: AssertionError
___________________________________________________ test_bertscore_most_similar[True-1] ____________________________________________________

idf = True, batch_size = 1

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 9),
         (True, 1),
         (True, 9)],
    )
    def test_bertscore_most_similar(idf: bool, batch_size: int):
        """Tests that BERTScore actually gives the highest score to self-similarity."""
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        sentences = [short, long, longer]
        preds, targets = list(zip(*list(product(sentences,
                                                sentences))))
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
        for i in range(len(preds)):
            max_pred = i%(len(sentences))*(1 + len(sentences))
            max_target = int(i/(len(sentences)))*(1 + len(sentences))
            assert score["f1"][i] <= score["f1"][max_pred], \
                f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_pred], targets[max_pred]}\n{i=}{max_pred=}"
>           assert score["f1"][i] <= score["f1"][max_target], \
                f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_target], targets[max_target]}\n{i=}{max_target=}"
E           AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
E             i=5max_target=4
E           assert tensor(0.9942) <= tensor(0.9674)

unittests/text/test_bertscore.py:220: AssertionError
___________________________________________________ test_bertscore_most_similar[True-9] ____________________________________________________

idf = True, batch_size = 9

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 9),
         (True, 1),
         (True, 9)],
    )
    def test_bertscore_most_similar(idf: bool, batch_size: int):
        """Tests that BERTScore actually gives the highest score to self-similarity."""
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        sentences = [short, long, longer]
        preds, targets = list(zip(*list(product(sentences,
                                                sentences))))
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
        for i in range(len(preds)):
            max_pred = i%(len(sentences))*(1 + len(sentences))
            max_target = int(i/(len(sentences)))*(1 + len(sentences))
            assert score["f1"][i] <= score["f1"][max_pred], \
                f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_pred], targets[max_pred]}\n{i=}{max_pred=}"
>           assert score["f1"][i] <= score["f1"][max_target], \
                f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_target], targets[max_target]}\n{i=}{max_target=}"
E           AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
E             i=5max_target=4
E           assert tensor(0.9942) <= tensor(0.9674)

unittests/text/test_bertscore.py:220: AssertionError
_____________________________________________________ test_bertscore_symmetry[False-1] _____________________________________________________

idf = False, batch_size = 1

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 9),
         (True, 1),
         (True, 9)],
    )
    def test_bertscore_symmetry(idf: bool, batch_size: int):
        """Tests that BERTscore F1 score is symmetric between reference and prediction.
        As F1 is symmetric, it should also be symmetric."""
    
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        sentences = [short, long, longer]
        preds, targets = list(zip(*list(product(sentences,
                                                sentences))))
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
        for i in range(len(preds)):
            for j in range(len(targets)):
                if preds[i] == targets[j] and preds[j] == targets[i]:
>                   assert score['f1'][i] == pytest.approx(score['f1'][j]), \
                        f"f1 score for {(preds[i], targets[i])} is not the same as {(preds[j], targets[j])}."
E                   AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
E                   assert tensor(0.9663) == approx(0.9960...482 ± 1.0e-06)
E                     
E                     comparison failed. Mismatched elements: 1 / 1:
E                     Max absolute difference: 0.02979564666748047
E                     Max relative difference: 0.03083609460462107
E                     Index | Obtained   | Expected                    
E                     ()    | 0.96625876 | 0.9960544109344482 ± 1.0e-06

unittests/text/test_bertscore.py:250: AssertionError
_____________________________________________________ test_bertscore_symmetry[False-9] _____________________________________________________

idf = False, batch_size = 9

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 9),
         (True, 1),
         (True, 9)],
    )
    def test_bertscore_symmetry(idf: bool, batch_size: int):
        """Tests that BERTscore F1 score is symmetric between reference and prediction.
        As F1 is symmetric, it should also be symmetric."""
    
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        sentences = [short, long, longer]
        preds, targets = list(zip(*list(product(sentences,
                                                sentences))))
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
        for i in range(len(preds)):
            for j in range(len(targets)):
                if preds[i] == targets[j] and preds[j] == targets[i]:
>                   assert score['f1'][i] == pytest.approx(score['f1'][j]), \
                        f"f1 score for {(preds[i], targets[i])} is not the same as {(preds[j], targets[j])}."
E                   AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
E                   assert tensor(0.9663) == approx(0.9960...482 ± 1.0e-06)
E                     
E                     comparison failed. Mismatched elements: 1 / 1:
E                     Max absolute difference: 0.02979564666748047
E                     Max relative difference: 0.03083609460462107
E                     Index | Obtained   | Expected                    
E                     ()    | 0.96625876 | 0.9960544109344482 ± 1.0e-06

unittests/text/test_bertscore.py:250: AssertionError
_____________________________________________________ test_bertscore_symmetry[True-1] ______________________________________________________

idf = True, batch_size = 1

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 9),
         (True, 1),
         (True, 9)],
    )
    def test_bertscore_symmetry(idf: bool, batch_size: int):
        """Tests that BERTscore F1 score is symmetric between reference and prediction.
        As F1 is symmetric, it should also be symmetric."""
    
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        sentences = [short, long, longer]
        preds, targets = list(zip(*list(product(sentences,
                                                sentences))))
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
        for i in range(len(preds)):
            for j in range(len(targets)):
                if preds[i] == targets[j] and preds[j] == targets[i]:
>                   assert score['f1'][i] == pytest.approx(score['f1'][j]), \
                        f"f1 score for {(preds[i], targets[i])} is not the same as {(preds[j], targets[j])}."
E                   AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
E                   assert tensor(0.9672) == approx(0.9942...835 ± 9.9e-07)
E                     
E                     comparison failed. Mismatched elements: 1 / 1:
E                     Max absolute difference: 0.027047932147979736
E                     Max relative difference: 0.02796625947389248
E                     Index | Obtained | Expected                   
E                     ()    | 0.967163 | 0.994210958480835 ± 9.9e-07

unittests/text/test_bertscore.py:250: AssertionError
_____________________________________________________ test_bertscore_symmetry[True-9] ______________________________________________________

idf = True, batch_size = 9

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 9),
         (True, 1),
         (True, 9)],
    )
    def test_bertscore_symmetry(idf: bool, batch_size: int):
        """Tests that BERTscore F1 score is symmetric between reference and prediction.
        As F1 is symmetric, it should also be symmetric."""
    
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        sentences = [short, long, longer]
        preds, targets = list(zip(*list(product(sentences,
                                                sentences))))
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
        for i in range(len(preds)):
            for j in range(len(targets)):
                if preds[i] == targets[j] and preds[j] == targets[i]:
>                   assert score['f1'][i] == pytest.approx(score['f1'][j]), \
                        f"f1 score for {(preds[i], targets[i])} is not the same as {(preds[j], targets[j])}."
E                   AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
E                   assert tensor(0.9672) == approx(0.9942...835 ± 9.9e-07)
E                     
E                     comparison failed. Mismatched elements: 1 / 1:
E                     Max absolute difference: 0.027047932147979736
E                     Max relative difference: 0.02796625947389248
E                     Index | Obtained | Expected                   
E                     ()    | 0.967163 | 0.994210958480835 ± 9.9e-07

unittests/text/test_bertscore.py:250: AssertionError
_______________________________________________ test_bertscore_additional_sentence[False-1] ________________________________________________

idf = False, batch_size = 1

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 3)]
    )
    def test_bertscore_additional_sentence(idf: bool, batch_size: int):
        """Tests that BERTscore keeps the same scores for previous inputs
        by adding additional elements to the input lists. This should be the case for idf=False."""
    
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        preds = [long,long]
        targets = [long,short]
    
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
    
        longlong = score["f1"][0]
        longshort = score["f1"][1]
        # First index should be the self-comparison - sorting by length should not shuffle this
        assert longlong > longshort
    
        preds = preds + [short, longer]
        targets = targets + [longer, long]
    
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
    
        # First two indices should be exactly as in the previous call to metric
        assert score["f1"][0] == pytest.approx(longlong)
>       assert score["f1"][1] == pytest.approx(longshort)
E       assert tensor(1.0000) == approx(0.9663...901 ± 9.7e-07)
E         
E         comparison failed. Mismatched elements: 1 / 1:
E         Max absolute difference: 0.03361696004867554
E         Max relative difference: 0.0336169660598575
E         Index | Obtained  | Expected                    
E         ()    | 0.9999998 | 0.9663828611373901 ± 9.7e-07

unittests/text/test_bertscore.py:289: AssertionError
_______________________________________________ test_bertscore_additional_sentence[False-3] ________________________________________________

idf = False, batch_size = 3

    @skip_on_connection_issues()
    @pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
    @pytest.mark.parametrize(
        ["idf", "batch_size"],
        [(False, 1),
         (False, 3)]
    )
    def test_bertscore_additional_sentence(idf: bool, batch_size: int):
        """Tests that BERTscore keeps the same scores for previous inputs
        by adding additional elements to the input lists. This should be the case for idf=False."""
    
        short = "hello there"
        long = "master kenobi"
        longer = "general kenobi"
    
        preds = [long,long]
        targets = [long,short]
    
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
    
        longlong = score["f1"][0]
        longshort = score["f1"][1]
        # First index should be the self-comparison - sorting by length should not shuffle this
        assert longlong > longshort
    
        preds = preds + [short, longer]
        targets = targets + [longer, long]
    
        score = bert_score(preds, targets, idf=idf, lang="en",
                           rescale_with_baseline=False, batch_size=batch_size)
    
        # First two indices should be exactly as in the previous call to metric
        assert score["f1"][0] == pytest.approx(longlong)
>       assert score["f1"][1] == pytest.approx(longshort)
E       assert tensor(1.0000) == approx(0.9663...006 ± 9.7e-07)
E         
E         comparison failed. Mismatched elements: 1 / 1:
E         Max absolute difference: 0.03361707925796509
E         Max relative difference: 0.033617085269168366
E         Index | Obtained  | Expected                    
E         ()    | 0.9999998 | 0.9663827419281006 ± 9.7e-07

unittests/text/test_bertscore.py:289: AssertionError
========================================================= short test summary info ==========================================================
FAILED unittests/text/test_bertscore.py::test_bertscore_most_similar[False-1] - AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
FAILED unittests/text/test_bertscore.py::test_bertscore_most_similar[False-9] - AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
FAILED unittests/text/test_bertscore.py::test_bertscore_most_similar[True-1] - AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
FAILED unittests/text/test_bertscore.py::test_bertscore_most_similar[True-9] - AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
FAILED unittests/text/test_bertscore.py::test_bertscore_symmetry[False-1] - AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
FAILED unittests/text/test_bertscore.py::test_bertscore_symmetry[False-9] - AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
FAILED unittests/text/test_bertscore.py::test_bertscore_symmetry[True-1] - AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
FAILED unittests/text/test_bertscore.py::test_bertscore_symmetry[True-9] - AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
FAILED unittests/text/test_bertscore.py::test_bertscore_additional_sentence[False-1] - assert tensor(1.0000) == approx(0.9663...901 ± 9.7e-07)
FAILED unittests/text/test_bertscore.py::test_bertscore_additional_sentence[False-3] - assert tensor(1.0000) == approx(0.9663...006 ± 9.7e-07)
============================================ 10 failed, 33571 deselected, 71 warnings in 14.56s ============================================

Expected behavior

All tests above should pass.

Environment

TorchMetrics version (if build from source, add commit SHA): 2cd6f6a
Python & PyTorch Version (e.g., 1.0): 3.12.4 & 2.4.1+cu121
Any other relevant information such as OS (e.g., Linux): Linux

Additional context

This may be related to tokenisation or encoding, but I have not confirmed that. One point against this hypothesis is that the failures still occur with batch_size=1.

This seems related to PR #2347. Perhaps the sorting is still done incorrectly?
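
If length-based sorting is involved, the usual pattern is to sort inputs for batching and then apply the inverse permutation to the outputs. The sketch below illustrates that pattern in plain Python. `score_in_length_order` and `fake_score` are hypothetical names, and this is not the torchmetrics implementation.

```python
def score_in_length_order(texts, score_fn):
    """Score texts sorted by length, then return results in the caller's order."""
    # order[k] is the original index of the k-th shortest text
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    sorted_scores = [score_fn(texts[i]) for i in order]

    # Undo the permutation: the k-th sorted score belongs at position order[k]
    restored = [None] * len(texts)
    for k, original_index in enumerate(order):
        restored[original_index] = sorted_scores[k]
    return restored

def fake_score(text):
    # Stand-in for a model call; returns a value that identifies its input.
    return text.upper()

texts = ["general kenobi", "hello there", "master kenobi"]
result = score_in_length_order(texts, fake_score)
```

If the restore step were skipped or applied with the wrong permutation, scores would end up attached to the wrong pairs. That would be consistent with the symptoms above, but it is only a guess.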

I have also checked that some of these tests fail on the original BERTScore implementation. I considered whether these properties are simply not expected to hold, but I found nothing in either the paper or the documentation suggesting that, at least when idf=False and there is no baseline rescaling.

I am happy to submit a PR with the above tests, which currently all fail.


github-actions · 2024-09-09T20:12:40Z

Hi! Thanks for your contribution! Great first issue!

SkafteNicki · 2024-09-11T06:55:13Z

cc: @stancld opinions on this?

Borda · 2025-03-21T09:00:13Z

@rasbt what would be your opinion on this LLM case? 🤔

rasbt · 2025-03-21T14:27:47Z

As far as I know, it should be symmetrical for the F1 score. So things like

AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').

                   f"f1 score for {(preds[i], targets[i])} is not the same as {(preds[j], targets[j])}."
E                   AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
E                   assert tensor(0.9672) == approx(0.9942...835 ± 9.9e-07)
E                     
E                     comparison failed. Mismatched elements: 1 / 1:
E                     Max absolute difference: 0.027047932147979736
E                     Max relative difference: 0.02796625947389248
E                     Index | Obtained | Expected                   
E                     ()    | 0.967163 | 0.994210958480835 ± 9.9e-07

seems unexpected.

Perhaps there is a problem with the matching?

GPPassos added the "bug / fix" and "help wanted" labels Sep 9, 2024

Borda assigned stancld Sep 11, 2024

Borda added the v1.4.x label Sep 11, 2024

GPPassos mentioned this issue Sep 13, 2024

Fix test_bertscore_sorting bug + validate idf arg #2727

Merged

4 tasks

Borda unassigned stancld Mar 21, 2025

Borda added the "good first issue" label Mar 21, 2025
