# ATproto PDS indexer
> [!IMPORTANT]
> Any use of this code and the data obtained with its help must adhere to Bluesky's Terms of Service and Community Guidelines. In particular, you are not allowed to distribute any of the data without explicit permission from the user that data belongs to.
>
> We also do not condone any use of the data obtained with this code for the purposes of:
>
> - training ML/AI models without explicit consent of the users who own the data
> - aiding any kind of harassment campaign against anyone
This is a bunch of code that can download all of Bluesky into a giant table in PostgreSQL. The structure of that table is roughly `(repo, collection, rkey) -> JSON`, and it is a good idea to partition it by collection.
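To make that concrete, here is a minimal sketch of what such a table can look like, using PostgreSQL declarative list partitioning. Table and column names here are illustrative assumptions, not the exact schema this code creates (the real tables are created by the lister / `make init-db`):

```sql
-- Illustrative sketch only; the actual schema is created by the lister.
CREATE TABLE records (
    repo       text  NOT NULL, -- DID of the account owning the repo
    collection text  NOT NULL, -- record type, e.g. 'app.bsky.feed.post'
    rkey       text  NOT NULL, -- record key within the collection
    content    jsonb NOT NULL, -- the record itself
    PRIMARY KEY (repo, collection, rkey)
) PARTITION BY LIST (collection);

-- One partition per record type you care about, plus a catch-all.
CREATE TABLE records_post PARTITION OF records
    FOR VALUES IN ('app.bsky.feed.post');
CREATE TABLE records_like PARTITION OF records
    FOR VALUES IN ('app.bsky.feed.like');
CREATE TABLE records_default PARTITION OF records DEFAULT;
```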
NOTE: all of this is valid as of December 2024, when Bluesky had ~24M accounts, ~4.7B records in total, and an average daily peak of ~1000 commits/s.
With a SATA SSD dedicated to ScyllaDB it can handle about 6000 commits/s from the firehose. The actual number you get might be lower if your CPU is not fast enough.
## Components

- **Lister**: once a day, gets a list of all repos from all known PDSs and adds any that are missing to the database.
- **Consumer**: connects to the firehose of each PDS and stores all received records in the database. If `CONSUMER_RELAYS` is specified, it will also add to the database any new PDSs that have records coming in through a relay.
- **Record indexer**: goes over all repos that might have missing data, gets a full checkout from the PDS, and adds all missing records to the database.
- **PLC mirror**: allows resolving DIDs locally; a `/${did}` request returns a DID document.

## Setup

- Copy `example.env` to `.env` and edit it to your liking.
  - `POSTGRES_PASSWORD` can be anything; it will be used on the first start of the `postgres` container to initialize the database.
- Copy `docker-compose.override.yml.example` to `docker-compose.override.yml` to change some parts of `docker-compose.yml` without actually editing it (and introducing the possibility of merge conflicts later on).
- Run `make init-db`.
- Set `CONSUMER_RELAYS` in `docker-compose.override.yml` (see the example below).
- Run `make up`.
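For example, a `docker-compose.override.yml` pointing the consumer at the main Bluesky relay might look roughly like this. The service name and the relay URL are assumptions here; check `docker-compose.yml` for the actual names used:

```yaml
# Illustrative override; adjust the service name to match docker-compose.yml.
services:
  consumer:
    environment:
      CONSUMER_RELAYS: "https://bsky.network"
```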
## Useful commands

- `make status` - will show container status and resource usage
- `make psql` - starts up a SQL shell inside the `postgres` container
- `make logs` - streams container logs into your terminal
- `make sqltop` - will show you currently running queries
- `make sqldu` - will show disk space usage for each table and index

The record indexer also exposes a simple HTTP handler that lets you resize its worker pool at runtime:

```sh
curl -s 'http://localhost:11003/pool/resize?size=10'
```
## Partitioning by collection

With partitioning by collection you can have separate indexes for each record type. Heavy processing of a particular record type also becomes faster, because all of those records live in a separate table that PostgreSQL can read sequentially, instead of checking the `collection` column on every row.
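For example, continuing the illustrative schema from above, you could index post content without paying for that index on every like and follow:

```sql
-- Illustrative: a GIN index on just the posts partition.
CREATE INDEX records_post_content_idx
    ON records_post USING GIN (content jsonb_path_ops);
```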
You can do the partitioning at any point, but the more data you already have in the database, the longer it will take.
Before doing this you need to run `lister` at least once in order to create the tables (`make init-db` does this for you as well).
You can run the partitioning from a SQL shell in the `postgres` container (`make psql`). Check the `migrations` dir for any additional migrations you might be interested in.
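For illustration only, a migration that partitions an already-populated table could look roughly like this, assuming the sketch schema from the top of this README (the real migrations in the `migrations` dir will differ):

```sql
-- Illustrative sketch: rebuild the table as a partitioned one.
-- This rewrites all data, which is why it gets slower the more
-- data you already have in the database.
BEGIN;
CREATE TABLE records_new (
    repo       text  NOT NULL,
    collection text  NOT NULL,
    rkey       text  NOT NULL,
    content    jsonb NOT NULL,
    PRIMARY KEY (repo, collection, rkey)
) PARTITION BY LIST (collection);

CREATE TABLE records_new_post PARTITION OF records_new
    FOR VALUES IN ('app.bsky.feed.post');
-- The default partition catches every other collection.
CREATE TABLE records_new_default PARTITION OF records_new DEFAULT;

INSERT INTO records_new SELECT repo, collection, rkey, content FROM records;

ALTER TABLE records RENAME TO records_old;
ALTER TABLE records_new RENAME TO records;
COMMIT;
```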