Export blocks and transactions:
> python export_blocks_and_transactions.py --start-block 0 --end-block 500000 \
--ipc-path ~/Library/Ethereum/geth.ipc --blocks-output blocks.csv --transactions-output transactions.csv

Export ERC20 transfers:
> python export_erc20_transfers.py --start-block 0 --end-block 500000 --ipc-path ~/Library/Ethereum/geth.ipc \
--output erc20_transfers.csv

Export ERC20 transfers, filtered by the list of tokens:
> python export_erc20_transfers.py --start-block 0 --end-block 500000 --ipc-path ~/Library/Ethereum/geth.ipc \
--output erc20_transfers.csv --tokens 0x86fa049857e0209aa7d9e616f7eb3b3b78ecfdb0 0x06012c8cf97bead5deae237070f9587f8e7a266d

Read this article: https://medium.com/@medvedev1088/exporting-and-analyzing-ethereum-blockchain-f5353414a94e
blocks.csv
| Column | Type |
|---|---|
| block_number | bigint |
| block_hash | hex_string |
| block_parent_hash | hex_string |
| block_nonce | hex_string |
| block_sha3_uncles | hex_string |
| block_logs_bloom | hex_string |
| block_transactions_root | hex_string |
| block_state_root | hex_string |
| block_miner | address |
| block_difficulty | bigint |
| block_total_difficulty | bigint |
| block_size | bigint |
| block_extra_data | hex_string |
| block_gas_limit | bigint |
| block_gas_used | bigint |
| block_timestamp | bigint |
| block_transaction_count | bigint |
transactions.csv
| Column | Type |
|---|---|
| tx_hash | hex_string |
| tx_nonce | bigint |
| tx_block_hash | hex_string |
| tx_block_number | bigint |
| tx_index | bigint |
| tx_from | address |
| tx_to | address |
| tx_value | bigint |
| tx_gas | bigint |
| tx_gas_price | bigint |
| tx_input | hex_string |
erc20_transfers.csv
| Column | Type |
|---|---|
| erc20_token | address |
| erc20_from | address |
| erc20_to | address |
| erc20_value | bigint |
| erc20_tx_hash | hex_string |
| erc20_block_number | bigint |
Note: for the address type all hex characters are lower-cased.
If you want to export just a few thousand blocks and don't want to sync your own node, refer to https://github.com/medvedev1088/ethereum-scraper.
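For a quick sanity check of the exported files, the schemas above map directly onto pandas dataframes. Below is a minimal sketch, assuming pandas is installed and the default file names from the commands above; it is only an illustration, not part of the tool:

```python
import pandas as pd

# Load the exported CSVs. The value columns can exceed the 64-bit integer range,
# so read them as strings if you need exact arithmetic.
blocks = pd.read_csv('blocks.csv')
transactions = pd.read_csv('transactions.csv', dtype={'tx_value': str})
erc20_transfers = pd.read_csv('erc20_transfers.csv', dtype={'erc20_value': str})

# Attach a timestamp to every ERC20 transfer by joining on the block number.
transfers_with_time = erc20_transfers.merge(
    blocks[['block_number', 'block_timestamp']],
    left_on='erc20_block_number',
    right_on='block_number',
)

# Example aggregate: the ten most active token contracts by transfer count.
print(transfers_with_time.groupby('erc20_token').size().sort_values(ascending=False).head(10))
```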
- Install Python 3.6: https://conda.io/miniconda.html
- Install geth: https://github.com/ethereum/go-ethereum/wiki/Installing-Geth
- Start geth. Make sure it has downloaded the blocks that you need by executing `eth.syncing` in the JS console. You can export blocks below `currentBlock`; there is no need to wait for the full sync, as the state is not needed.
- Clone Ethereum ETL and install the dependencies:

> git clone https://github.com/medvedev1088/ethereum-etl.git
> cd ethereum-etl
> pip install -r requirements.txt
- Export all:

> ./export_all.sh -h
Usage: ./export_all.sh -s <start_block> -e <end_block> -b <batch_size> -i <ipc_path> [-o <output_dir>]

> ./export_all.sh -s 0 -e 5499999 -b 100000 -i ~/Library/Ethereum/geth.ipc -o output

The result will be in the `output` subdirectory, partitioned in Hive style:

output/start_block=00000000/end_block=00099999/blocks_00000000_00099999.csv
output/start_block=00100000/end_block=00199999/blocks_00100000_00199999.csv
...
output/start_block=00000000/end_block=00099999/transactions_00000000_00099999.csv
...
output/start_block=00000000/end_block=00099999/erc20_transfers_00000000_00099999.csv
...
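If you post-process these files from Python, the block range of each file can be recovered from the Hive-style directory names. A small sketch, assuming the `output` layout shown above; the helper name is illustrative, not part of the tool:

```python
import glob
import os
import re

# Match .../start_block=00000000/end_block=00099999/blocks_00000000_00099999.csv
PARTITION_RE = re.compile(r'start_block=(\d+)/end_block=(\d+)')

def list_partitions(output_dir='output', kind='blocks'):
    """Yield (start_block, end_block, path) for every exported file of the given kind."""
    pattern = os.path.join(output_dir, 'start_block=*', 'end_block=*', f'{kind}_*.csv')
    for path in sorted(glob.glob(pattern)):
        match = PARTITION_RE.search(path.replace(os.sep, '/'))
        if match:
            yield int(match.group(1)), int(match.group(2)), path

for start, end, path in list_partitions(kind='transactions'):
    print(f'blocks {start}-{end}: {path}')
```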
Should work with both geth and parity, on Linux, Mac, and Windows. Tested with Python 3.6, geth 1.8.7, and Ubuntu 16.04.4.

Read this article: https://medium.com/@medvedev1088/how-to-export-the-entire-ethereum-blockchain-to-csv-in-2-hours-for-10-69fef511e9a2
Additional steps for running in Windows:
- Install Visual C++ Build Tools: https://landinghub.visualstudio.com/visual-cpp-build-tools
- Install Git Bash with Git for Windows: https://git-scm.com/download/win
- Run in Git Bash:
> ./export_all.sh -s 0 -e 999999 -b 100000 -i '\\.\pipe\geth.ipc' -o output
- Export blocks and transactions:
> python export_blocks_and_transactions.py --start-block 0 --end-block 500000 \
--ipc-path ~/Library/Ethereum/geth.ipc --blocks-output blocks.csv --transactions-output transactions.csv

Omit --blocks-output or --transactions-output options if you don't want to export blocks/transactions.
You can tune --batch-size, --max-workers, --ipc-timeout for performance.
Call python export_blocks_and_transactions.py -h for more details.
- Export ERC20 transfers:
> python export_erc20_transfers.py --start-block 0 --end-block 500000 \
--ipc-path ~/Library/Ethereum/geth.ipc --batch-size 100 --output erc20_transfers.csv

Include --tokens <token1> <token2> to filter only certain tokens, e.g.
> python export_erc20_transfers.py --start-block 0 --end-block 500000 --ipc-path ~/Library/Ethereum/geth.ipc \
--output erc20_transfers.csv --tokens 0x86fa049857e0209aa7d9e616f7eb3b3b78ecfdb0 0x06012c8cf97bead5deae237070f9587f8e7a266d

You can tune --batch-size, --max-workers, --ipc-timeout for performance.
Call python export_erc20_transfers.py -h for more details.
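Both export commands above read their data from the node over IPC: roughly speaking, blocks and transactions come from block queries, and ERC20 transfers are reconstructed from `Transfer` event logs. The sketch below illustrates that idea with web3.py directly; it is a simplified illustration rather than the scripts' actual implementation, and it assumes web3.py is installed and geth is reachable at the IPC path used above (block numbers are arbitrary examples):

```python
import os
from web3 import Web3, IPCProvider

# Connect over the same IPC endpoint that the export scripts use.
w3 = Web3(IPCProvider(os.path.expanduser('~/Library/Ethereum/geth.ipc'), timeout=60))

# --- Blocks and transactions ---
# Fetch a single block together with its full transaction objects.
# (Newer web3.py versions rename getBlock/getLogs to get_block/get_logs.)
block = w3.eth.getBlock(500000, full_transactions=True)
print(block['number'], block['hash'].hex(), block['gasUsed'], len(block['transactions']))
for tx in block['transactions'][:3]:
    print(tx['hash'].hex(), tx['from'], tx['to'], tx['value'])

# --- ERC20 transfers ---
# keccak256("Transfer(address,address,uint256)") -- topic0 of the standard ERC20 Transfer event.
TRANSFER_TOPIC = '0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef'

# Ask the node for Transfer logs in a small block range; adding an 'address' key with a
# list of token contracts restricts the filter, similar to the --tokens option.
logs = w3.eth.getLogs({
    'fromBlock': 499990,
    'toBlock': 500000,
    'topics': [TRANSFER_TOPIC],
})

for log in logs:
    if len(log['topics']) < 3:
        continue  # non-standard token: from/to are not indexed, skip for simplicity
    token = log['address']
    # topics[1]/topics[2] hold the from/to addresses, left-padded to 32 bytes.
    from_address = '0x' + log['topics'][1].hex()[-40:]
    to_address = '0x' + log['topics'][2].hex()[-40:]
    # The transferred value is the unindexed data field.
    data = log['data']
    data_hex = data.hex() if isinstance(data, (bytes, bytearray)) else data
    value = int(data_hex, 16) if data_hex and data_hex != '0x' else 0
    print(token, from_address, to_address, value)
```

The export scripts do essentially this at scale, batching requests (--batch-size) across workers (--max-workers) and writing the CSV columns listed in the schemas above.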
Run the tests:

> pytest -vv

Upload the files to S3:
> cd output
> aws s3 sync . s3://<your_bucket>/athena/lab1/blocks --region ap-southeast-1 --exclude "*" --include "*blocks_*.csv"
> aws s3 sync . s3://<your_bucket>/athena/lab1/transactions --region ap-southeast-1 --exclude "*" --include "*transactions_*.csv"
> aws s3 sync . s3://<your_bucket>/athena/lab1/erc20_transfers --region ap-southeast-1 --exclude "*" --include "*erc20_transfers_*.csv"
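If you prefer scripting the upload instead of using the AWS CLI, roughly the same result can be achieved with boto3. A rough sketch, assuming boto3 is installed and AWS credentials are configured; `<your_bucket>` is a placeholder as in the commands above:

```python
import glob
import os
import boto3

s3 = boto3.client('s3', region_name='ap-southeast-1')

BUCKET = '<your_bucket>'  # placeholder, as in the CLI commands above

# Mirror the aws s3 sync commands: each file kind goes to its own prefix,
# keeping the Hive-style partition directories in the key.
for kind in ('blocks', 'transactions', 'erc20_transfers'):
    pattern = os.path.join('output', 'start_block=*', 'end_block=*', f'{kind}_*.csv')
    for path in glob.glob(pattern):
        relative = path.replace(os.sep, '/').split('/', 1)[1]  # strip the leading output/
        key = 'athena/lab1/{}/{}'.format(kind, relative)
        s3.upload_file(path, BUCKET, key)
        print('uploaded', path, '->', key)
```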
Create database:

CREATE DATABASE lab1;

CREATE EXTERNAL TABLE IF NOT EXISTS blocks (
block_number BIGINT,
block_hash STRING,
block_parent_hash STRING,
block_nonce STRING,
block_sha3_uncles STRING,
block_logs_bloom STRING,
block_transactions_root STRING,
block_state_root STRING,
block_miner STRING,
block_difficulty DECIMAL(38,0),
block_total_difficulty DECIMAL(38,0),
block_size BIGINT,
block_extra_data STRING,
block_gas_limit BIGINT,
block_gas_used BIGINT,
block_timestamp BIGINT,
block_transaction_count BIGINT
)
PARTITIONED BY (start_block BIGINT, end_block BIGINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'escape.delim' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://<your_bucket>/athena/lab1/blocks'
TBLPROPERTIES (
'skip.header.line.count' = '1'
);

CREATE EXTERNAL TABLE IF NOT EXISTS transactions (
tx_hash STRING,
tx_nonce BIGINT,
tx_block_hash STRING,
tx_block_number BIGINT,
tx_index BIGINT,
tx_from STRING,
tx_to STRING,
tx_value DECIMAL(38,0),
tx_gas BIGINT,
tx_gas_price BIGINT,
tx_input STRING
)
PARTITIONED BY (start_block BIGINT, end_block BIGINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'escape.delim' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://<your_bucket>/athena/lab1/transactions'
TBLPROPERTIES (
'skip.header.line.count' = '1'
);

CREATE EXTERNAL TABLE IF NOT EXISTS erc20_transfers (
erc20_token STRING,
erc20_from STRING,
erc20_to STRING,
erc20_value DECIMAL(38,0),
erc20_tx_hash STRING,
erc20_block_number BIGINT
)
PARTITIONED BY (start_block BIGINT, end_block BIGINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'escape.delim' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://<your_bucket>/athena/lab1/erc20_transfers'
TBLPROPERTIES (
'skip.header.line.count' = '1'
);

Add partitions:
MSCK REPAIR TABLE blocks;
MSCK REPAIR TABLE transactions;
MSCK REPAIR TABLE erc20_transfers;

Note that the DECIMAL type is limited to 38 digits in Hive (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-decimal), so values with more than 38 digits will be null.
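For intuition on that limit: on-chain ERC20 values are uint256, which can need up to 78 decimal digits, so the largest values cannot be represented in DECIMAL(38,0) and come back as null. A quick illustration in Python:

```python
# A uint256 value (the on-chain type of ERC20 transfer amounts and of tx_value in wei)
# can be as large as 2**256 - 1, which has 78 decimal digits -- far beyond the 38 digits
# that the Hive/Athena DECIMAL(38,0) columns above can hold.
max_uint256 = 2**256 - 1
print(max_uint256)            # 115792089237316195423570985008687907853269984665640564039457584007913129639935
print(len(str(max_uint256)))  # 78
```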
TODO:

- Unit tests
- Rewrite export_all.sh in Python
- Add HTTPProvider
- Error handling and logging
I'm currently working on a SaaS solution for analysts and developers:
- Web console for running SQL queries, based on Redash (http://demo.redash.io/)
- Built on top of AWS, cost-efficient
- Can provide access to raw CSV data if needed
- Support for internal transactions in the future
- Support for API access in the future
- Support for Bitcoin and other blockchains in the future
Contact me if you would like to contribute: [email protected]
