
Wrong inferred type depending on how I read the file from S3 #45106

Open
jpdonasolo opened this issue Dec 24, 2024 · 0 comments

Describe the bug, including details regarding any error messages, version, and platform.

If I download the file from S3 to my machine, I can read it using pandas:

>>> df = pd.read_parquet(my_file)
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77255 entries, 0 to 77254
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   doc_id                77255 non-null  Int64  
 1   price_id              77255 non-null  Int64  
 2   ean                   77255 non-null  object 
 3   preco_liquido         77255 non-null  float64
 4   rede                  77255 non-null  object 
 5   cnpj                  77255 non-null  object 
 6   tipo                  77255 non-null  object   <-- column from the error below
 7   codestado             77255 non-null  Int64  
 8   codcidade             77255 non-null  Int64  
 9   descricao             77255 non-null  object 
 10  desconto              77205 non-null  float64
 11  endereco_logradouro   77255 non-null  object 
 12  endereco_numero       77255 non-null  object 
 13  endereco_complemento  77255 non-null  object 
 14  cep                   77255 non-null  object 
 15  bairro                77255 non-null  object 
 16  latitude              77255 non-null  float64
 17  longitude             77255 non-null  float64
 18  nome_fantasia         77255 non-null  object 
 19  telefone              77255 non-null  object 
 20  segmento              77255 non-null  object 
 21  fonte_coleta          77255 non-null  object 
 22  pmc                   0 non-null      float64
 23  composicao            77213 non-null  object 
dtypes: Int64(4), float64(5), object(15)
memory usage: 14.4+ MB

However, if I try to read it directly from S3:

df = pd.read_parquet(f"s3://{bucket}/{my_file}", storage_options=dict(profile=aws_profile))

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pandas/io/parquet.py", line 667, in read_parquet
    return impl.read(
           ^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pandas/io/parquet.py", line 274, in read
    pa_table = self.api.parquet.read_table(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1793, in read_table
    dataset = ParquetDataset(
              ^^^^^^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1371, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 794, in dataset
    return _filesystem_dataset(source, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
    return factory.finish(schema)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 3126, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unable to merge: Field tipo has incompatible types: string vs dictionary<values=string, indices=int32, ordered=0>

EDIT
Versions:
Pandas: 2.2.3
Pyarrow: 18.1.0
OS: Ubuntu 24.04.1 LTS

Component(s)

Python
