If I download the file from s3 to my machine, I can read it using pandas:
>>> df = pd.read_parquet(my_file)
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77255 entries, 0 to 77254
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   doc_id                77255 non-null  Int64
 1   price_id              77255 non-null  Int64
 2   ean                   77255 non-null  object
 3   preco_liquido         77255 non-null  float64
 4   rede                  77255 non-null  object
 5   cnpj                  77255 non-null  object
** 6   tipo                  77255 non-null  object **
 7   codestado             77255 non-null  Int64
 8   codcidade             77255 non-null  Int64
 9   descricao             77255 non-null  object
 10  desconto              77205 non-null  float64
 11  endereco_logradouro   77255 non-null  object
 12  endereco_numero       77255 non-null  object
 13  endereco_complemento  77255 non-null  object
 14  cep                   77255 non-null  object
 15  bairro                77255 non-null  object
 16  latitude              77255 non-null  float64
 17  longitude             77255 non-null  float64
 18  nome_fantasia         77255 non-null  object
 19  telefone              77255 non-null  object
 20  segmento              77255 non-null  object
 21  fonte_coleta          77255 non-null  object
 22  pmc                   0 non-null      float64
 23  composicao            77213 non-null  object
dtypes: Int64(4), float64(5), object(15)
memory usage: 14.4+ MB
However, if I try to read it directly from s3, I get a type error:
df = pd.read_parquet(f"s3://{bucket}/{my_file}", storage_options=dict(profile=aws_profile))
The full traceback is:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pandas/io/parquet.py", line 667, in read_parquet
    return impl.read(
           ^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pandas/io/parquet.py", line 274, in read
    pa_table = self.api.parquet.read_table(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1793, in read_table
    dataset = ParquetDataset(
              ^^^^^^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1371, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 794, in dataset
    return _filesystem_dataset(source, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joao/Documents/ETL-concorrentes/.venv/lib/python3.12/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
    return factory.finish(schema)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 3126, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unable to merge: Field tipo has incompatible types: string vs dictionary<values=string, indices=int32, ordered=0>
EDIT
Versions:
Pandas: 2.2.3
Pyarrow: 18.1.0
OS: Ubuntu 24.04.1 LTS
Component(s): Python