Skip to content

Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 transfers, receipts, logs, internal transactions (message calls)

License

Notifications You must be signed in to change notification settings

sandro-magenio/ethereum-etl

 
 

Repository files navigation

Ethereum ETL

Join the chat at https://gitter.im/ethereum-eth Build Status

Export blocks and transactions:

> python export_blocks_and_transactions.py --start-block 0 --end-block 500000 \
--ipc-path ~/Library/Ethereum/geth.ipc --blocks-output blocks.csv --transactions-output transactions.csv

Export ERC20 transfers:

> python export_erc20_transfers.py --start-block 0 --end-block 500000 \
--ipc-path ~/Library/Ethereum/geth.ipc --output erc20_transfers.csv

Export receipts and logs (Reference):

> python export_receipts_and_logs.py --tx-hashes tx_hashes.csv \
--ipc-path ~/Library/Ethereum/geth.ipc --receipts-output receipts.csv --logs-output logs.csv

Read this article https://medium.com/@medvedev1088/exporting-and-analyzing-ethereum-blockchain-f5353414a94e

Table of Contents

Schema

blocks.csv

Column Type Description
block_number bigint The block number
block_hash hex_string Hash of the block
block_parent_hash hex_string Hash of the parent block
block_nonce hex_string Hash of the generated proof-of-work
block_sha3_uncles hex_string SHA3 of the uncles data in the block
block_logs_bloom hex_string The bloom filter for the logs of the block. null when its pending block
block_transactions_root hex_string The root of the transaction trie of the block
block_state_root hex_string The root of the final state trie of the block
block_miner address The address of the beneficiary to whom the mining rewards were given
block_difficulty numeric Integer of the difficulty for this block
block_total_difficulty numeric Integer of the total difficulty of the chain until this block
block_size bigint The size of this block in bytes
block_extra_data hex_string The extra data field of this block
block_gas_limit bigint The maximum gas allowed in this block
block_gas_used bigint The total used gas by all transactions in this block
block_timestamp bigint The unix timestamp for when the block was collated
block_transaction_count bigint The number of transactions in the block

transactions.csv

Column Type Description
tx_hash hex_string Hash of the transaction
tx_nonce bigint The number of transactions made by the sender prior to this one
tx_block_hash hex_string Hash of the block where this transaction was in. null when its pending
tx_block_number bigint Block number where this transaction was in. null when its pending
tx_index bigint Integer of the transactions index position in the block. null when its pending
tx_from address Address of the sender
tx_to address Address of the receiver. null when its a contract creation transaction
tx_value numeric Value transferred in Wei
tx_gas bigint Gas provided by the sender
tx_gas_price bigint Gas price provided by the sender in Wei
tx_input hex_string The data send along with the transaction

erc20_transfers.csv

Column Type Description
erc20_token address ERC20 token address
erc20_from address Address of the sender
erc20_to address Address of the receiver
erc20_value numeric Value transferred
erc20_tx_hash hex_string Transaction hash
erc20_log_index bigint Log index in the transaction receipt
erc20_block_number bigint The block number

receipts.csv

Column Type Description
receipt_transaction_hash hex_string Hash of the transaction
receipt_transaction_index bigint Integer of the transactions index position in the block
receipt_block_hash hex_string Hash of the block where this transaction was in
receipt_block_number bigint Block number where this transaction was in
receipt_cumulative_gas_used bigint The total amount of gas used when this transaction was executed in the block
receipt_gas_used bigint The amount of gas used by this specific transaction alone
receipt_contract_address address The contract address created, if the transaction was a contract creation, otherwise null
receipt_root hex_string 32 bytes of post-transaction stateroot (pre Byzantium)
receipt_status bigint Either 1 (success) or 0 (failure)

logs.csv

Column Type Description
log_index bigint Integer of the log index position in the block. null when its pending log
log_transaction_hash hex_string Hash of the transactions this log was created from. null when its pending log
log_transaction_index bigint Integer of the transactions index position log was created from
log_block_hash hex_string Hash of the block where this log was in. null when its pending
log_block_number bigint The block number where this log was in
log_address address Address from which this log originated
log_data hex_string Contains one or more 32 Bytes non-indexed arguments of the log
log_topics string Pipe-separated (| character) string of indexed log arguments (0 to 4 32-byte hex strings). (In solidity: The first topic is the hash of the signature of the event (e.g. Deposit(address,bytes32,uint256)), except you declared the event with the anonymous specifier.)

contracts.csv

Column Type Description
contract_address address Address of the contract
contract_bytecode hex_string Bytecode of the contract

erc20_tokens.csv

Column Type Description
erc20_token_address address The address of the ERC20 token
erc20_token_symbol string The symbol of the ERC20 token
erc20_token_name string The name of the ERC20 token
erc20_token_decimals bigint The number of decimals the token uses - e.g. 8, means to divide the token amount by 100000000 to get its user representation
erc20_token_total_supply numeric The total token supply

Note: erc20_token_symbol, erc20_token_name, erc20_token_decimals, erc20_token_total_supply columns in erc20_tokens.csv can have empty values in case the contract doesn't implement the corresponding methods or implements it incorrectly (e.g. wrong return type).

Note: for the address type all hex characters are lower-cased.

Exporting the Blockchain

  1. Install python 3.5+ https://www.python.org/downloads/

  2. Install geth https://github.com/ethereum/go-ethereum/wiki/Installing-Geth

  3. Start geth. Make sure it downloaded the blocks that you need by executing eth.syncing in the JS console. You can export blocks below currentBlock, there is no need to wait until the full sync as the state is not needed (unless you also need contracts bytecode and token details). You can export blocks below currentBlock, there is no need to wait until the full sync as the state is not needed.

  4. Clone Ethereum ETL and install the dependencies:

    > git clone https://github.com/medvedev1088/ethereum-etl.git
    > cd ethereum-etl
    > pip install -r requirements.txt
  5. Export all:

    > ./export_all.sh -h
    Usage: ./export_all.sh -s <start_block> -e <end_block> -b <batch_size> -i <ipc_path> [-o <output_dir>]
    > ./export_all.sh -s 0 -e 5499999 -b 100000 -i ~/Library/Ethereum/geth.ipc -o output

    The result will be in the output subdirectory, partitioned in Hive style:

    output/blocks/start_block=00000000/end_block=00099999/blocks_00000000_00099999.csv
    output/blocks/start_block=00100000/end_block=00199999/blocks_00100000_00199999.csv
    ...
    output/transactions/start_block=00000000/end_block=00099999/transactions_00000000_00099999.csv
    ...
    output/erc20_transfers/start_block=00000000/end_block=00099999/erc20_transfers_00000000_00099999.csv
    ...

Should work with geth and parity, on Linux, Mac, Windows. Tested with Python 3.6, geth 1.8.7, Ubuntu 16.04.4

If you see weird behavior, e.g. wrong number of rows in the CSV files or corrupted files, check this issue: blockchain-etl#28

Export in 2 Hours

You can use AWS Auto Scaling and Data Pipeline to reduce the exporting time to a few hours. Read this article for details https://medium.com/@medvedev1088/how-to-export-the-entire-ethereum-blockchain-to-csv-in-2-hours-for-10-69fef511e9a2

Running in Windows

Additional steps:

  1. Install Visual C++ Build Tools https://landinghub.visualstudio.com/visual-cpp-build-tools

  2. Install Git Bash with Git for Windows https://git-scm.com/download/win

  3. Run in Git Bash:

    >  ./export_all.sh -s 0 -e 999999 -b 100000 -i '\\.\pipe\geth.ipc' -o output

Command Reference

All the commands accept -h parameter for help, e.g.:

> python export_blocks_and_transactions.py -h

usage: export_blocks_and_transactions.py [-h] [-s START_BLOCK] -e END_BLOCK
                                         [-b BATCH_SIZE] --ipc-path IPC_PATH
                                         [--ipc-timeout IPC_TIMEOUT]
                                         [-w MAX_WORKERS]
                                         [--blocks-output BLOCKS_OUTPUT]
                                         [--transactions-output TRANSACTIONS_OUTPUT]

Export blocks and transactions.

optional arguments:
  -h, --help            show this help message and exit
  -s START_BLOCK, --start-block START_BLOCK
                        Start block
  -e END_BLOCK, --end-block END_BLOCK
                        End block
  -b BATCH_SIZE, --batch-size BATCH_SIZE
                        The number of blocks to export at a time.
  --ipc-path IPC_PATH   The full path to the ipc file.
  --ipc-timeout IPC_TIMEOUT
                        The timeout in seconds for ipc calls.
  -w MAX_WORKERS, --max-workers MAX_WORKERS
                        The maximum number of workers.
  --blocks-output BLOCKS_OUTPUT
                        The output file for blocks. If not provided blocks
                        will not be exported. Use "-" for stdout
  --transactions-output TRANSACTIONS_OUTPUT
                        The output file for transactions. If not provided
                        transactions will not be exported. Use "-" for stdout
export_blocks_and_transactions.py
> python export_blocks_and_transactions.py --start-block 0 --end-block 500000 \
--ipc-path ~/Library/Ethereum/geth.ipc --blocks-output blocks.csv --transactions-output transactions.csv

Omit --blocks-output or --transactions-output options if you want to export only transactions/blocks.

You can tune --batch-size, --max-workers, --ipc-timeout for performance.

export_erc20_transfers.py
> python export_erc20_transfers.py --start-block 0 --end-block 500000 \
--ipc-path ~/Library/Ethereum/geth.ipc --batch-size 100 --output erc20_transfers.csv

Include --tokens <token1> <token2> to filter only certain tokens, e.g.

> python export_erc20_transfers.py --start-block 0 --end-block 500000 --ipc-path ~/Library/Ethereum/geth.ipc \
--output erc20_transfers.csv --tokens 0x86fa049857e0209aa7d9e616f7eb3b3b78ecfdb0 0x06012c8cf97bead5deae237070f9587f8e7a266d

You can tune --batch-size, --max-workers, --ipc-timeout for performance.

export_receipts_and_logs.py

First extract transaction hashes from transactions.csv (Exported with export_blocks_and_transactions.py):

> python extract_csv_column.py --input transactions.csv --column tx_hash --output tx_hashes.csv

Then export receipts and logs:

> python export_receipts_and_logs.py --tx-hashes tx_hashes.csv \
--ipc-path ~/Library/Ethereum/geth.ipc --receipts-output receipts.csv --logs-output logs.csv

Omit --receipts-output or --logs-output options if you want to export only logs/receipts.

You can tune --batch-size, --max-workers, --ipc-timeout for performance.

Upvote this feature request openethereum/parity-ethereum#9075, it will make receipts and logs export much faster.

export_contracts.py

First extract contract addresses from receipts.csv (Exported with export_receipts_and_logs.py):

> python extract_csv_column.py --input receipts.csv --column receipt_contract_address --output contract_addresses.csv

Then export contracts:

> python export_contracts.py --contract-addresses contract_addresses.csv \
--ipc-path ~/Library/Ethereum/geth.ipc --output contracts.csv

You can tune --batch-size, --max-workers, --ipc-timeout for performance.

export_erc20_tokens.py

First extract token addresses from erc20_transfers.csv (Exported with export_erc20_transfers.py):

> python extract_csv_column.py -i erc20_transfers.csv -c erc20_token -o - | sort | uniq > erc20_token_addresses.csv

Then export ERC20 tokens:

> python export_erc20_tokens.py --token-addresses erc20_token_addresses.csv \
--ipc-path ~/Library/Ethereum/geth.ipc --output erc20_tokens.csv

You can tune --max-workers, --ipc-timeout for performance.

Note that there will be duplicate tokens across different partitions, which need to be deduplicated (see Querying in Google BigQuery section).

Upvote this pull request to make erc20_tokens export faster ethereum/web3.py#944 (comment)

get_block_range_for_date.py
> python get_block_range_for_date.py --provider-uri=https://mainnet.infura.io/ --date 2018-01-01
4832686,4838611

Running Tests

> pytest -vv

Querying in Amazon Athena

  • Upload the files to S3:
> cd output
> aws s3 sync . s3://<your_bucket>/ethereumetl/export --region ap-southeast-1
CREATE DATABASE ethereumetl;

Tables for Parquet Files

Read this article on how to convert CSVs to Parquet https://medium.com/@medvedev1088/converting-ethereum-etl-files-to-parquet-399e048ddd30

Note that DECIMAL type is limited to 38 digits in Hive https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-decimal so values greater than 38 decimals will be null.

Querying in Google BigQuery

To upload CSVs to BigQuery:

> cd output
> gsutil -m rsync -r . gs://<your_bucket>/ethereumetl/export
> cd ethereum-etl
> bq --location=US load --replace --source_format=CSV --skip_leading_rows=1 ethereum.blocks gs://<your_bucket>/ethereumetl/export/blocks/*.csv ./schemas/gcp/blocks.json
> bq --location=US load --replace --source_format=CSV --skip_leading_rows=1 ethereum.transactions gs://<your_bucket>/ethereumetl/export/transactions/*.csv ./schemas/gcp/transactions.json
> bq --location=US load --replace --source_format=CSV --skip_leading_rows=1 --max_bad_records=5000 ethereum.erc20_transfers gs://<your_bucket>/ethereumetl/export/erc20_transfers/*.csv ./schemas/gcp/erc20_transfers.json
> bq --location=US load --replace --source_format=CSV --skip_leading_rows=1 ethereum.receipts gs://<your_bucket>/ethereumetl/export/receipts/*.csv ./schemas/gcp/receipts.json
> bq --location=US load --replace --source_format=CSV --skip_leading_rows=1 ethereum.logs gs://<your_bucket>/ethereumetl/export/logs/*.csv ./schemas/gcp/logs.json
> bq --location=US load --replace --source_format=CSV --skip_leading_rows=1 ethereum.contracts gs://<your_bucket>/ethereumetl/export/contracts/*.csv ./schemas/gcp/contracts.json
> bq --location=US load --replace --source_format=CSV --skip_leading_rows=1 --allow_quoted_newlines ethereum.erc20_tokens_duplicates gs://<your_bucket>/ethereumetl/export/erc20_tokens/*.csv ./schemas/gcp/erc20_tokens.json

Note that --max_bad_records is needed for erc20_transfers to avoid 'Error while reading data, error message: Could not parse '68032337690423899710659284523950357745' as numeric for field erc20_value (position 3) starting at location 52895 numeric overflow' for ERC721 transfers.

Join transactions and receipts:

> bq mk --table --description "Exported using https://github.com/medvedev1088/ethereum-etl" --time_partitioning_field block_timestamp_partition ethereum.transactions_join_receipts ./schemas/gcp/transactions_join_receipts.json
> bq --location=US query --replace --destination_table ethereum.transactions_join_receipts --use_legacy_sql=false "$(cat ./schemas/gcp/transactions_join_receipts.sql | tr '\n' ' ')"

Deduplicate erc20_tokens:

> bq mk --table --description "Exported using https://github.com/medvedev1088/ethereum-etl" ethereum.erc20_tokens ./schemas/gcp/erc20_tokens.json
> bq --location=US query --replace --destination_table ethereum.erc20_tokens --use_legacy_sql=false "$(cat ./schemas/gcp/erc20_tokens_deduplicate.sql | tr '\n' ' ')"

Public Dataset

You can query the data that I exported in the public BigQuery dataset https://medium.com/@medvedev1088/ethereum-blockchain-on-google-bigquery-283fb300f579

TODOs

  1. Unit tests
  2. Rewrite export_all.sh in python
  3. Add HTTPProvider
  4. Error handling and logging

SQL for Blockchain

I'm currently working on a SaaS solution for analysts and developers. The MVP will have the following:

  • Built on top of AWS, cost efficient
  • Can provide access to raw CSV data if needed
  • Support for internal transactions in the future
  • Support for Bitcoin and other blockchains in the future
  • ERC20 token metrics in the future

Contact me if you would like to contribute [email protected]

About

Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 transfers, receipts, logs, internal transactions (message calls)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 97.4%
  • Shell 2.6%