Skip to content

BUG: Automatic transformation from int to bool with read_excel #62382

@ldouteau

Description

@ldouteau

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from pathlib import Path

import pandas as pd

# Save this into a Excel sheet, adjust path if needed
#    1   True      c
#    a      b   True
# True      1      1
filex = Path(__file__).parent / "intbool.xlsx"

# Read with Excel, getting:
#    0     1     2
# 0  1  True     c
# 1  a     b  True
# 2  1  True  True
dfx = pd.read_excel(filex, header=None, dtype=str)
print(dfx)

Issue Description

Weird interaction between Python & Excel. When both integers and booleans of the same value are present on the same column (1/True or 0/False), read_excel will cast all these to the value that appears first in each series.

  • Behaviour tested with Openpyxl & Calamine.
  • I expected that setting dtype=str would fix the issue, but it had no effect
  • read_csv doesn't have this issue. The data would be read as strings
  • writing Excel file from dataframe of strings will format excel content as string, thus preventing the issue to happen

I tracked the issue down to sanitize_objects coded in Cython. I believe the issue is from this piece of code:

  • Iterations over the content of the column. memo stores the values already known.
  • When current value val is in memo, reuse it.
  • As 1 == True is true, the first value of 1/True found in the series is used for all the upcoming 1/True. Similar behaviour with 0/False

elif val in memo:
values[i] = memo[val]

Expected Behavior

I may not have the full picture, but i guess

  • current behaviour is fine when there are no strings in the series
  • when strings are found in the series, read everything as strings ? Not sure of this one
  • setting dtype=str in read_excel should read the series as containing only strs, thus preventing this conversion

Installed Versions

INSTALLED VERSIONS

commit : 4665c10
python : 3.12.2
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United Kingdom.1252

pandas : 2.3.2
numpy : 1.26.2
pytz : 2023.3.post1
dateutil : 2.8.2
pip : 24.0
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : 3.9.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : 8.0.0
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.0
sqlalchemy : None
tables : None
tabulate : None
xarray : 2024.6.0
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Metadata

Metadata

Assignees

No one assigned

    Labels

    BugDuplicate ReportDuplicate issue or pull requestIO Excelread_excel, to_excelNeeds TriageIssue that has not been reviewed by a pandas team member

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions