
Wrong inferred type depending on how I read the file from S3 #45106

Open
jpdonasolo opened this issue Dec 24, 2024 · 0 comments

Describe the bug, including details regarding any error messages, version, and platform.

If I download the file from S3 to my machine, I can read it using pandas:

>>> df = pd.read_parquet(my_file)
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77255 entries, 0 to 77254
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   doc_id                77255 non-null  Int64  
 1   price_id              77255 non-null  Int64  
 2   ean                   77255 non-null  object 
 3   preco_liquido         77255 non-null  float64
 4   rede                  77255 non-null  object 
 5   cnpj                  77255 non-null  object 
 6   tipo                  77255 non-null  object   <-- column from the error below
 7   codestado             77255 non-null  Int64  
 8   codcidade             77255 non-null  Int64  
 9   descricao             77255 non-null  object 
 10  desconto              77205 non-null  float64
 11  endereco_logradouro   77255 non-null  object 
 12  endereco_numero       77255 non-null  object 
 13  endereco_complemento  77255 non-null  object 
 14  cep                   77255 non-null  object 
 15  bairro                77255 non-null  object 
 16  latitude              77255 non-null  float64
 17  longitude             77255 non-null  float64
 18  nome_fantasia         77255 non-null  object 
 19  telefone              77255 non-null  object 
 20  segmento              77255 non-null  object 
 21  fonte_coleta          77255 non-null  object 
 22  pmc                   0 non-null      float64
 23  composicao            77213 non-null  object 
dtypes: Int64(4), float64(5), object(15)
memory usage: 14.4+ MB

However, if I try to read it directly from S3:

df = pd.read_parquet(f"s3://{bucket}/{my_file}", storage_options=dict(profile=aws_profile))

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pandas/io/parquet.py", line 667, in read_parquet
    return impl.read(
           ^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pandas/io/parquet.py", line 274, in read
    pa_table = self.api.parquet.read_table(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1793, in read_table
    dataset = ParquetDataset(
              ^^^^^^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1371, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 794, in dataset
    return _filesystem_dataset(source, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
    return factory.finish(schema)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 3126, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unable to merge: Field tipo has incompatible types: string vs dictionary<values=string, indices=int32, ordered=0>

EDIT
Versions:
Pandas: 2.2.3
Pyarrow: 18.1.0
OS: Ubuntu 24.04.1 LTS

Component(s)

Python
