science workflows, teams often need access to a shared dataset that stays perfectly synchronized and cannot be modified, e.g., in distributed machine learning environments where multiple teams rely on the exact same feature set.
In this article, I’ll walk through a simple, fee-free method for cryptographically hashing a dataset of any size and storing its hash immutably on the Ethereum blockchain, creating a permanent and verifiable record of the dataset’s integrity.
This method can also easily be extended to model weights, transformations that must be applied consistently, source code, or any other data that needs to be immutable and verifiable.
🤔 Why Integrity Matters
If you’re at least somewhat familiar with data science as a practice, you’re already aware of the importance of data integrity. Even small changes or errors in the input data can collapse a project.
Modern machine learning models are extremely sensitive to their training data. Missing normalization steps, a modified CSV file, shuffled rows, corrupted features, or mismatches between training and validation datasets can produce dramatically different results.
Integrity failures are difficult to detect and can derail a project.
Models may still train and appear to function normally, but metrics degrade slowly, drift accrues, or experiments become impossible to reproduce. Integrity is doubly important when the team is distributed, possibly across different organizations, and needs to work on different versions of the same problem.
🔐 Using a Cryptographic Hash as a “Source of Truth”
A cryptographic hash gives us a simple and very useful mechanism for verifying data integrity.
A brief primer on cryptographic hashes
A hash function takes any amount of input data (bytes) and deterministically produces a fixed-length output known as a hash or digest. Cryptographic hashes are foundational in computer science, as you are most likely already aware.
The key is determinism:
Same data in → same hash out
Even a single byte changed in the input data produces a completely different hash.
Because of this property, hashes act as effectively unique fingerprints for data and are extremely useful for verifying integrity. There are many flavors of hash functions, and some are better suited to this task, as I’ll describe.
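Here is a quick sketch of both properties using Python's built-in hashlib. The two CSV snippets are made-up placeholder data, and blake2b_hex is just a helper name I'm using for illustration.

```python
import hashlib

def blake2b_hex(data: bytes) -> str:
    """Return the BLAKE2b digest of `data` as a hex string."""
    return hashlib.blake2b(data).hexdigest()

original = b"feature_1,feature_2\n0.5,1.25\n"
tampered = b"feature_1,feature_2\n0.5,1.26\n"  # a single byte differs

# Determinism: hashing the same bytes twice gives the same digest
same_again = blake2b_hex(original) == blake2b_hex(original)

# Sensitivity: one changed byte gives a different digest
differs = blake2b_hex(original) != blake2b_hex(tampered)

print("same input, same hash:", same_again)
print("changed input, different hash:", differs)
```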
How does this apply to datasets?
Because of the hash function’s determinism, once applied to a dataset, we can quickly and reliably test whether the dataset is identical to what we are expecting.
This is exceptionally valuable for large datasets that are shared by multiple teams or companies and that move from one version to the next. Team 1 at Research Group Alpha creates features 1-10, Team 2 at Research Group Zeta creates features 11-100, System X consumes version Y, etc.
We no longer need to question the data. We simply compute the hash function over the dataset and compare it to the hash computed at a reference point. If it matches, OK. If not, something has changed.
Hashing is also efficient. The work grows linearly with the size of the data and the file can be read in chunks, so whether the dataset is 10MB or 10TB we end up with a small, fixed-size string that can be shared, stored or published.
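To make the "fixed size" point concrete, here is a small sketch with placeholder data. It hashes a 10-byte input and a 5 MiB input, then hashes the large one again in 1 MiB chunks, the way you would when streaming a file that doesn't fit in memory. The sizes are arbitrary choices for the example.

```python
import hashlib

small = b"x" * 10
large = b"x" * (5 * 1024 * 1024)  # 5 MiB of placeholder data

small_digest = hashlib.blake2b(small).hexdigest()
large_digest = hashlib.blake2b(large).hexdigest()

# Feed the large input in 1 MiB chunks instead of all at once
h = hashlib.blake2b()
chunk_size = 1024 * 1024
for start in range(0, len(large), chunk_size):
    h.update(large[start:start + chunk_size])
streamed_digest = h.hexdigest()

# Both digests have the same length regardless of input size
print(len(small_digest), len(large_digest))
# Chunked hashing gives the same result as hashing in one call
print(streamed_digest == large_digest)
```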
🧐 Why Use Ethereum as an Immutable Store?
This is the really useful piece of this article.
Ethereum, as you are also already aware, is a blockchain. This gives us:
- Immutability: once included, a transaction cannot be changed
- Distributed availability: always accessible without a central authority
- Permanence: once written, it is permanently accessible
But, Ethereum is for transactions? Don’t we need to write a complicated smart contract for this specialized purpose?
You could indeed. But, we don’t need to.
The clever bit is using the rarely used input data field of an Ethereum transaction, sometimes referred to as “calldata.”
But, Ethereum transactions cost real money (gas, fees, etc)?
Also true. On Ethereum, you are charged “gas” for each byte of input data. On mainnet, at a price of $2,000 per ETH, this might cost us between $0.04 – $0.10 per hash, depending on the gas price. This does not include the base gas every transaction pays to be included in a block by a validator, and the total can be hefty depending on the network’s current load.
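For a rough feel of where numbers like these come from, here is a back-of-the-envelope sketch. It assumes the commonly cited schedule of 16 gas per non-zero calldata byte, 4 gas per zero byte, and a 21,000 gas base cost per transaction. Those values, the 20 and 50 gwei gas prices, and the function names are my assumptions for illustration; protocol upgrades can change the schedule, so treat the output as an estimate only.

```python
def calldata_gas(data: bytes, nonzero_byte_gas: int = 16, zero_byte_gas: int = 4) -> int:
    """Gas charged for the calldata bytes alone (assumed per-byte schedule)."""
    zero_bytes = data.count(0)
    nonzero_bytes = len(data) - zero_bytes
    return nonzero_bytes * nonzero_byte_gas + zero_bytes * zero_byte_gas

def gas_to_usd(gas: int, gas_price_gwei: float, eth_price_usd: float) -> float:
    """Convert gas to USD: gas * (gwei per gas) * (ETH per gwei) * (USD per ETH)."""
    return gas * gas_price_gwei * 1e-9 * eth_price_usd

BASE_TX_GAS = 21_000  # assumed fixed cost paid by any transaction

# Stand-in for a 64-byte BLAKE2b digest that happens to contain no zero bytes
digest_bytes = bytes(range(1, 65))
data_gas = calldata_gas(digest_bytes)

for gwei in (20, 50):  # example gas prices, not live values
    data_usd = gas_to_usd(data_gas, gwei, 2_000)
    base_usd = gas_to_usd(BASE_TX_GAS, gwei, 2_000)
    print(f"{gwei} gwei: calldata ${data_usd:.2f}, base transaction ${base_usd:.2f}")
```

Under these assumptions the calldata itself is the cheap part, and the base transaction cost dominates.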
Let’s make this more clever. 🦊
By offloading everything to a “testnet”, which most blockchains have, we can make this entirely free.
Sepolia (an Ethereum testnet) is rarely used unless you are a developer of smart contracts. Sepolia ETH has no monetary value and is publicly available from faucets.
This means we can create a practically unlimited number of transactions on the publicly accessible testnet for free, limited only by how much test ETH the faucets hand out!
As long as our input data is reasonably sized, Sepolia lets us use the blockchain as a free store for as many hashes as we need, with mostly the same properties as the mainnet*
* Testnets like Sepolia aren’t permanent, but can generally be relied on for multiple years. If you need absolute permanence, you’ll need to pay for it using the mainnet.
Remember, we’re not storing the actual data on-chain. Just the fingerprint.
⚙️ The Process
First, we need a way to reliably create transactions on Ethereum.
Despite seeming complex, this is actually extremely simple. We don’t need any additional software or wallet tech. A wallet is nothing more than a public address paired with the private key used to sign transactions sent from it.
To create an Ethereum transaction, we create a Python object with the required keys and format, sign it with our private key, and broadcast it to the network. A validator then picks up our transaction from the “mempool” and includes it in a block.
As long as we include all the required fields, and it checks out, it is typically included in a block within ~12 seconds and becomes a permanent part of the blockchain once that block is finalized.
Step 1: Create the address and private key with the eth_account package (installed alongside web3.py) in a few lines of code
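from eth_account import Account

# Generate a new random private key and its matching address
account = Account.create()
print("Address:", account.address)
# Store the private key somewhere safe; anyone who has it can sign as this account
print("Private Key:", account.key.hex())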
Step 2: Get some ETH on Sepolia. Plug your address into a public Sepolia faucet (Google hosts one) and wait about 12 seconds. Thanks Google!
Step 3: Hash the dataset
As I mentioned, some hashes are better suited to this process. We could use a SHA-256 hash, but BLAKE2b usually has better throughput in software. Really, any cryptographic hash function will work.
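If you want to check the throughput claim on your own machine, a rough benchmark like the one below is enough. This is only a sketch: throughput_mb_per_s is a helper name I made up, the 8 MiB payload is arbitrary, and results vary by CPU (hardware with SHA acceleration can favor SHA-256).

```python
import hashlib
import time

def throughput_mb_per_s(algorithm: str, payload: bytes, repeats: int = 5) -> float:
    """Best-of-N throughput of a hashlib algorithm over `payload`, in MB/s."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.new(algorithm, payload).digest()
        best = min(best, time.perf_counter() - start)
    return len(payload) / best / 1e6

payload = b"\x00" * (8 * 1024 * 1024)  # 8 MiB of placeholder bytes
results = {name: throughput_mb_per_s(name, payload) for name in ("sha256", "blake2b")}

for name, rate in results.items():
    print(f"{name}: {rate:.0f} MB/s")
```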
Use this function to hash the data.
import hashlib
from pathlib import Path

def hash_dataset(dataset, algorithm="blake2b", chunk_size=1024 * 1024):
    """Hash a file (given as a path) or an in-memory object and return the hex digest."""
    h = hashlib.new(algorithm)

    def update(obj):
        # A str or Path naming an existing file is hashed by content, chunk by chunk
        if isinstance(obj, (str, Path)) and Path(obj).exists():
            with open(obj, "rb") as f:
                while chunk := f.read(chunk_size):
                    h.update(chunk)
        elif isinstance(obj, bytes):
            h.update(obj)
        elif isinstance(obj, str):
            h.update(obj.encode("utf-8"))
        elif isinstance(obj, dict):
            # Sort keys so insertion order does not affect the digest
            for k in sorted(obj.keys()):
                update(k)
                update(obj[k])
        elif isinstance(obj, (list, tuple)):
            for item in obj:
                update(item)
        elif isinstance(obj, set):
            # Sets are unordered; sort them, falling back to string order for mixed types
            try:
                for item in sorted(obj):
                    update(item)
            except TypeError:
                for item in sorted(obj, key=str):
                    update(item)
        elif hasattr(obj, "__iter__"):
            for item in obj:
                update(item)
        else:
            # Anything else (numbers, None, ...) is hashed via its repr
            h.update(repr(obj).encode("utf-8"))

    update(dataset)
    return h.hexdigest()

digest = hash_dataset("hugedataset.parquet", algorithm="blake2b")
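One caveat worth knowing if you hash in-memory sequences by feeding items into the hasher back to back: without any framing, different inputs can produce the same byte stream, so ["ab", "c"] and ["a", "bc"] end up with the same digest. A minimal sketch of one common fix, length-prefixing each item, is below. The function names are hypothetical and the example assumes the items are already bytes.

```python
import hashlib

def hash_items_naive(items):
    """Hash items back to back with no separator (ambiguous)."""
    h = hashlib.blake2b()
    for item in items:
        h.update(item)
    return h.hexdigest()

def hash_items_framed(items):
    """Hash items with an 8-byte length prefix before each one (unambiguous)."""
    h = hashlib.blake2b()
    for item in items:
        h.update(len(item).to_bytes(8, "big"))
        h.update(item)
    return h.hexdigest()

a = [b"ab", b"c"]
b = [b"a", b"bc"]

naive_collides = hash_items_naive(a) == hash_items_naive(b)
framed_collides = hash_items_framed(a) == hash_items_framed(b)

print("naive digests equal:", naive_collides)
print("framed digests equal:", framed_collides)
```

For a single file on disk this doesn't matter, since the file's bytes are hashed as one stream.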
Step 4: Write, sign and publish a transaction with the hash of our dataset.
Using the web3.py library, we can structure our transaction as a Python dict, and then publish it to the network.
We need a provider to broadcast our transaction (we don’t have a node). Here we use Infura, but there are others, like Alchemy.
Just note that we add the hex prefix “0x” to the hash calculated on our dataset, so the prefix has to be accounted for when we validate our hash.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://sepolia.infura.io/v3/YOUR_KEY"))
dataset_hash = "0x" + digest
account = w3.eth.account.from_key("YOUR_PRIVATE_KEY")

tx = {
    "to": account.address,  # self-send (no contract required)
    "value": 0,  # no ETH transfer
    "gas": 50_000,
    "maxFeePerGas": w3.to_wei("20", "gwei"),
    "maxPriorityFeePerGas": w3.to_wei("2", "gwei"),
    "nonce": w3.eth.get_transaction_count(account.address),
    "chainId": 11155111,  # Sepolia testnet
    "data": dataset_hash
}
Sign it and send it off. Here, we wait until the transaction is included in a block.
signed_tx = account.sign_transaction(tx)
# The attribute is raw_transaction in newer eth-account releases and rawTransaction in older ones
raw_tx = getattr(signed_tx, "raw_transaction", None) or signed_tx.rawTransaction
tx_hash = w3.eth.send_raw_transaction(raw_tx)
print("Broadcast tx hash:", tx_hash.hex())
# Wait for inclusion in a block
tx_receipt = w3.eth.wait_for_transaction_receipt(tx_hash)
print("Transaction included in block:", tx_receipt["blockNumber"])
print("Status:", tx_receipt["status"])  # 1 = success, 0 = failure
Be sure to retain the transaction hash (tx_hash); we need it later to look up the record.
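Since the digest travels in the data field as a “0x”-prefixed hex string, it helps to handle the prefix in one place. Here is a small sketch of two helpers for that. The names to_calldata and strip_0x are mine, not part of web3.py.

```python
def strip_0x(hex_string: str) -> str:
    """Lowercase a hex string and drop a leading '0x' if present."""
    text = hex_string.lower()
    return text[2:] if text.startswith("0x") else text

def to_calldata(digest_hex: str) -> str:
    """Return the digest as a lowercase, 0x-prefixed hex string for the data field."""
    bare = strip_0x(digest_hex)
    bytes.fromhex(bare)  # raises ValueError if the string is not valid hex
    return "0x" + bare

print(to_calldata("AB12"))
print(strip_0x("0xab12"))
```

Normalizing both sides with strip_0x before comparing avoids false mismatches caused only by the prefix or letter case.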
Step 5: Create a metadata record to store alongside our dataset
Here, we create a simple piece of metadata, which can be stored in a database (DynamoDB, MongoDB) or alongside our data object directly (S3, Google Cloud Storage).
The metadata could look something like so:
{
  "dataset_id": "feature_set_v42",
  "dataset_uri": "s3://ml-bucket/features/v42.parquet",
  "dataset_hash": "0x9f3c...ab21",
  "tx_hash": "0x7c1a...e91d",
  "timestamp_unix": 1730000000,
  "hash_algorithm": "blake2b",
  "creator": "0xabc123...",
  "notes": "normalized features"
}
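A record like this can be assembled with nothing but the standard library. The sketch below is one possible shape, with build_metadata as a hypothetical helper name. The hash, transaction and creator values are the same elided placeholders as in the example record, not real values.

```python
import json
import time

def build_metadata(dataset_id, dataset_uri, dataset_hash, tx_hash,
                   creator, hash_algorithm="blake2b", notes=""):
    """Assemble the metadata record as a plain dict, stamped with the current Unix time."""
    return {
        "dataset_id": dataset_id,
        "dataset_uri": dataset_uri,
        "dataset_hash": dataset_hash,
        "tx_hash": tx_hash,
        "timestamp_unix": int(time.time()),
        "hash_algorithm": hash_algorithm,
        "creator": creator,
        "notes": notes,
    }

record = build_metadata(
    dataset_id="feature_set_v42",
    dataset_uri="s3://ml-bucket/features/v42.parquet",
    dataset_hash="0x9f3c...ab21",  # placeholder
    tx_hash="0x7c1a...e91d",       # placeholder
    creator="0xabc123...",         # placeholder
    notes="normalized features",
)

# Serialize to JSON, ready to store in a database or next to the data object
sidecar_json = json.dumps(record, indent=2, sort_keys=True)
print(sidecar_json)
```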
Step 6: Whenever reading the dataset, validate that its hash matches the original hash recorded on-chain
The final step of the process combines three actions:
- Fetch the Ethereum transaction
- Extract the dataset hash from calldata
- Compare it to a locally recomputed hash
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://sepolia.infura.io/v3/YOUR_KEY"))

def verify_dataset(dataset_path, tx_hash):
    # Fetch the transaction and read its calldata
    tx = w3.eth.get_transaction(tx_hash)
    raw_input = tx["input"]
    onchain_hash = raw_input.hex() if hasattr(raw_input, "hex") else str(raw_input)
    # .hex() may or may not include the "0x" prefix depending on library version, so strip it
    onchain_hash = onchain_hash.lower()
    if onchain_hash.startswith("0x"):
        onchain_hash = onchain_hash[2:]
    computed_hash = hash_dataset(dataset_path).lower()
    if computed_hash != onchain_hash:
        raise ValueError(f"Integrity FAILED: Local {computed_hash} != On-chain {onchain_hash}")
    print("Integrity check PASSED. Dataset matches the blockchain record.")
    return True
That’s it!
An important note: this doesn’t prevent anyone from rewriting our metadata object. However, there are many ways to prevent modification of a small piece of metadata internally, like audit databases or S3 Object Lock.
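One additional safeguard you might consider, sketched here as my own assumption rather than part of the steps above: keep the publisher's address somewhere you trust, and check that the transaction referenced by the metadata was actually sent from it. The function below works on a transaction dict with "from" and "input" keys, which is the shape web3.py returns from get_transaction. The address and hash values are made-up placeholders.

```python
def record_matches_tx(metadata: dict, tx: dict, trusted_creator: str) -> bool:
    """True if the tx came from the trusted address and carries the recorded hash."""
    def norm(value) -> str:
        # Accept bytes-like values or strings, with or without a 0x prefix
        text = value.hex() if isinstance(value, (bytes, bytearray)) else str(value)
        text = text.lower()
        return text[2:] if text.startswith("0x") else text

    sender_ok = norm(tx["from"]) == norm(trusted_creator)
    hash_ok = norm(tx["input"]) == norm(metadata["dataset_hash"])
    return sender_ok and hash_ok

# Placeholder values for illustration only
trusted = "0xAbC0000000000000000000000000000000000001"
tx = {"from": trusted.lower(), "input": bytes.fromhex("ab12")}

good = record_matches_tx({"dataset_hash": "0xAB12"}, tx, trusted)
bad = record_matches_tx({"dataset_hash": "0xab13"}, tx, trusted)
print(good, bad)
```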
Wrapping up
Ultimately, utilizing a cryptographic hash to verify dataset integrity is a lightweight approach to a heavy problem.
Some natural extensions include using this method to verify model weights, or even hashing pieces of source code to ensure preprocessing stays consistent.
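As a rough sketch of the source-code case, the function below hashes every matching file under a directory in sorted order, mixing in each file's relative path and size so that renames and moved content change the digest. hash_directory and the "*.py" pattern are assumptions for the example, and the temporary directory stands in for a real preprocessing package.

```python
import hashlib
import tempfile
from pathlib import Path

def hash_directory(root, pattern="*.py", chunk_size=1024 * 1024):
    """Hash all files matching `pattern` under `root` into one BLAKE2b digest."""
    h = hashlib.blake2b()
    root = Path(root)
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        # Include the relative path and file size so file boundaries are unambiguous
        rel = path.relative_to(root).as_posix().encode("utf-8")
        h.update(len(rel).to_bytes(8, "big"))
        h.update(rel)
        h.update(path.stat().st_size.to_bytes(8, "big"))
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "prep.py"
    script.write_text("SCALE = 1.0\n")
    before = hash_directory(tmp)
    script.write_text("SCALE = 2.0\n")  # simulate a changed preprocessing step
    after = hash_directory(tmp)

print("digest changed after edit:", before != after)
```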
Whether you are collaborating across distributed, open-source teams, building reproducible research, or simply creating an audit trail for compliance, the blockchain is a nice, impartial notary for your data. You don’t need to trust the infrastructure; you just need to trust the math.