The IndieWeb is a people-focused alternative to the "corporate" web. Participants use their own personal websites to post, reply, share, organize events and RSVP, and take part in online social networking in ways that have otherwise been limited to centralized silos like Facebook and Twitter.
The Indie Map dataset includes:
- A social network graph of the 2,300 most active IndieWeb sites, including all connections between sites and the number of links in each direction, broken down by type.
- 5.8M web pages, including raw HTML, parsed microformats2, and extracted links with metadata.
- 380GB of HTML and HTTP requests in WARC format.
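One way to picture the social network described in the list above is as a directed graph whose edges carry per-type link counts. The sketch below is illustrative only: the `LinkGraph` class, the `.example` domain names, and the link type labels are hypothetical and are not taken from the dataset's actual schema.

```python
from collections import Counter, defaultdict


class LinkGraph:
    """Directed graph of sites; each edge stores link counts keyed by link type."""

    def __init__(self):
        # edges[source][target] -> Counter({link_type: count})
        self.edges = defaultdict(lambda: defaultdict(Counter))

    def add_link(self, source, target, link_type, count=1):
        """Record `count` links of `link_type` from source to target."""
        self.edges[source][target][link_type] += count

    def links(self, source, target):
        """Return per-type counts for links from source to target."""
        return dict(self.edges.get(source, {}).get(target, {}))

    def total(self, source, target):
        """Return the total number of links from source to target."""
        return sum(self.links(source, target).values())


# Hypothetical sites and link types, for illustration only.
graph = LinkGraph()
graph.add_link("alice.example", "bob.example", "in-reply-to", 3)
graph.add_link("alice.example", "bob.example", "like-of", 5)
graph.add_link("bob.example", "alice.example", "in-reply-to", 1)
```

Keeping the two directions as separate edges means the counts from `alice.example` to `bob.example` can differ from those in the reverse direction, which matches the "number of links in each direction" description.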
The complete dataset of 5.8M HTML pages is available in a publicly accessible Google BigQuery dataset.
The raw sites and pages data can be downloaded as JSON files, one per site, and also as raw WARC files. They're hosted on Google Cloud Storage: https://console.cloud.google.com/storage/browser/indie-map/
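As a starting point for exploring a downloaded file, the sketch below loads one JSON file and reports its top-level shape. It deliberately assumes nothing about the schema beyond the file being a single JSON document, which is itself an assumption; the helper name `summarize_site_file` and the file name in the usage note are hypothetical.

```python
import json
from pathlib import Path


def summarize_site_file(path):
    """Load one downloaded JSON file and describe its top-level structure.

    Assumes the file holds a single JSON document. No field names are
    assumed; the function only inspects what is actually present.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return {"kind": "object", "keys": sorted(data)}
    if isinstance(data, list):
        return {"kind": "array", "length": len(data)}
    return {"kind": type(data).__name__}
```

For example, `summarize_site_file("example.com.json")` (a hypothetical local file name) would list the top-level keys if the file contains a JSON object. Check the full documentation for the actual field names before relying on any of them.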
See the full documentation for more details.