A web scraper for chess coaches.

Joshua Potter 63764a22c4 Transition to a CSV; Postgres can handle that better.		2023-12-04 15:08:17 -07:00
.githooks	Use more robust pre-commit.	2023-12-03 14:31:36 -07:00
app	Transition to a CSV; Postgres can handle that better.	2023-12-04 15:08:17 -07:00
sql	Transition to a CSV; Postgres can handle that better.	2023-12-04 15:08:17 -07:00
.envrc	Initial commit.	2023-11-27 13:09:40 -07:00
.gitignore	Rewrite export as NDJSON and include script to load result into postgres. (#3 )	2023-12-01 10:30:44 -07:00
README.md	Transition to a CSV; Postgres can handle that better.	2023-12-04 15:08:17 -07:00
default.nix	Initial commit.	2023-11-27 13:09:40 -07:00
flake.lock	Initial commit.	2023-11-27 13:09:40 -07:00
flake.nix	Apply pyls-isort.	2023-12-01 16:37:05 -07:00
poetry.lock	Use lxml to speed up parsing.	2023-12-01 07:12:40 -07:00
pyproject.toml	Use lxml to speed up parsing.	2023-12-01 07:12:40 -07:00

README.md

coach-scraper

Caution! Be careful running this script.

We intentionally delay each batch of requests. Make sure any adjustments to this script appropriately rate-limit.
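As a rough illustration of the kind of rate-limiting the caution above refers to, the following sketch processes items in fixed-size batches and pauses between batches. It is not the scraper's actual implementation; `run_in_batches`, `batch_size`, and `delay_seconds` are hypothetical names and defaults chosen for this example.

```python
import time
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


def run_in_batches(
    items: Sequence[T],
    handler: Callable[[T], R],
    batch_size: int = 10,  # hypothetical default
    delay_seconds: float = 5.0,  # hypothetical default
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Apply `handler` to every item, pausing between batches."""
    results: List[R] = []
    for index, batch in enumerate(batched(items, batch_size)):
        if index > 0:
            # Pause before every batch except the first.
            sleep(delay_seconds)
        results.extend(handler(item) for item in batch)
    return results
```

Passing `sleep` as a parameter keeps the delay easy to replace in tests; in real use the default `time.sleep` applies.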

Overview

This is a simple web scraper for coaches listed on supported chess sites (the E2E example below uses the site identifiers chesscom and lichess).

The program searches for coach usernames as well as specific information about each of them (their profile, recent activity, and stats). The result will be found in a newly created data directory with the following structure:

data
├── <site>
│   ├── coaches
│   │   ├── <username>
│   │   │   ├── <username>.html
│   │   │   ├── export.json
│   │   │   └── ...
│   │   ├── ...
└── pages
    ├── <n>.txt
    ├── ...
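To inspect the scraped output, one option is to walk the directory layout shown above. The sketch below assumes only that each coach has an `export.json` under `<site>/coaches/<username>/`; the function name `find_exports` is hypothetical and not part of the project.

```python
from pathlib import Path
from typing import Dict, List


def find_exports(data_dir: Path) -> Dict[str, List[str]]:
    """Map each site directory name to the usernames that have an export.json."""
    found: Dict[str, List[str]] = {}
    # Pattern mirrors the tree: <site>/coaches/<username>/export.json
    for export in sorted(data_dir.glob("*/coaches/*/export.json")):
        username = export.parent.name
        site = export.parent.parent.parent.name
        found.setdefault(site, []).append(username)
    return found
```

For example, `find_exports(Path("data"))` returns a dictionary keyed by site, which can help confirm that a run produced output for every requested site.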

Quickstart

If you have nix available, run:

$ nix run . -- --user-agent <your-email> -s <site> [-s <site> ...]

If not, ensure you have poetry on your machine and instead run the following:

$ poetry run python3 -m app -u <your-email> -s <site> [-s <site> ...]

After running (this may take several hours), a new CSV will be generated at data/export.csv containing all scraped content from the specified <site>s.
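Before loading the CSV anywhere, it can be useful to check that it looks sane. The following sketch previews the first few rows and counts the total; it makes no assumption about column names or whether a header row is present, and assumes UTF-8 encoding. `peek_export` is a hypothetical helper, not something the project provides.

```python
import csv
from itertools import islice
from pathlib import Path
from typing import List, Tuple


def peek_export(path: Path, limit: int = 5) -> Tuple[List[List[str]], int]:
    """Return the first `limit` rows of the CSV and the total row count."""
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        preview = list(islice(reader, limit))
        # Count whatever is left without holding it in memory.
        remaining = sum(1 for _ in reader)
    return preview, len(preview) + remaining
```

Calling `peek_export(Path("data/export.csv"))` after a run gives a quick look at the output without opening the whole file.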

Database

Included in the development shell of this flake is a Postgres client (version 15.5). Generate an empty Postgres cluster in a db directory (relative to the current working directory) by running:

$ pg_ctl -D db init

To start the database, run the following:

$ pg_ctl -D db -l db/logfile -o --unix_socket_directories=@scraper start

In the above command, @scraper refers to an abstract socket name (abstract sockets are a Linux-only feature). Rename it to whatever is appropriate for your use case. To then connect to this database instance, run:

$ psql -h @scraper

To later shut the database down, run:

$ pg_ctl -D db stop

Loading Data

To load all exported coach data into a local Postgres instance, use the provided sql/*.sql files. First initialize the export schema/table:

$ psql -h @scraper -f sql/init.sql

Next, dump exported data into the newly created table:

$ psql -h @scraper -f sql/export.sql -v export="'$PWD/data/export.csv'"

Re-running the sql/export.sql script will create a backup of the coach_scraper.export table. It will then upsert the scraped data. You can view all backups from the psql console like so:

postgres=# \dt coach_scraper.export*
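The backup-then-upsert behavior described above can be illustrated with a small self-contained sketch. Note the substitutions: it uses Python's built-in sqlite3 module in place of Postgres, and the table columns (`site`, `username`, `rating`), the key, and the backup names are hypothetical. It does not reproduce the contents of sql/export.sql.

```python
import sqlite3
from typing import Iterable, Tuple

Row = Tuple[str, str, int]  # (site, username, rating): hypothetical columns


def backup_and_upsert(
    conn: sqlite3.Connection, rows: Iterable[Row], backup_name: str
) -> None:
    """Copy the current table to `backup_name`, then upsert `rows`."""
    if not backup_name.isidentifier():
        raise ValueError("backup_name must be a plain identifier")
    with conn:  # commit on success, roll back on error
        conn.execute(f"CREATE TABLE {backup_name} AS SELECT * FROM export")
        conn.executemany(
            """
            INSERT INTO export (site, username, rating)
            VALUES (?, ?, ?)
            ON CONFLICT (site, username) DO UPDATE SET rating = excluded.rating
            """,
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE export ("
    "site TEXT, username TEXT, rating INTEGER, PRIMARY KEY (site, username))"
)
# First load: the backup is empty because nothing was loaded yet.
backup_and_upsert(conn, [("lichess", "alice", 1500)], "export_bak_1")
# Second load: the backup keeps the old row, and the live row is updated.
backup_and_upsert(
    conn, [("lichess", "alice", 1600), ("chesscom", "bob", 1400)], "export_bak_2"
)
```

After the second call, the live table holds the updated rating for the existing coach plus the new coach, while `export_bak_2` preserves the state from before that load.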

E2E

With the above section on loading files, we now have the individual components necessary to scrape coach data from the chess websites and dump the results into the database in one fell swoop. Assuming the database is running with a socket connection available at @scraper:

$ nix run . -- --user-agent <your-email> -s chesscom -s lichess
$ psql -h @scraper -f sql/init.sql -f sql/export.sql -v export="'$PWD/data/export.csv'"
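If you would rather drive the two steps above from a script, a minimal sketch might look like the following. It only assembles and runs the same two commands shown above; `build_pipeline` and `run_pipeline` are hypothetical helpers, and the sketch assumes it is run from the repository root with nix and psql on the PATH.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def build_pipeline(
    user_agent: str, sites: Sequence[str], root: Path, socket: str = "@scraper"
) -> List[List[str]]:
    """Return the scrape command followed by the load command."""
    scrape = ["nix", "run", ".", "--", "--user-agent", user_agent]
    for site in sites:
        scrape += ["-s", site]
    export_csv = root / "data" / "export.csv"
    load = [
        "psql", "-h", socket,
        "-f", "sql/init.sql",
        "-f", "sql/export.sql",
        # psql receives the path already wrapped in single quotes.
        "-v", f"export='{export_csv}'",
    ]
    return [scrape, load]


def run_pipeline(user_agent: str, sites: Sequence[str], root: Path) -> None:
    """Run each step in order, stopping at the first failure."""
    for command in build_pipeline(user_agent, sites, root):
        subprocess.run(command, check=True, cwd=root)
```

Because `check=True` raises on a non-zero exit code, the load step only runs if scraping succeeded.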

Development

nix is used for development. The included flake.nix file automatically loads in Python (version 3.11.6) with packaging and dependency management handled by poetry (version 1.7.0). direnv can be used to launch a dev shell upon entering this directory (refer to .envrc). Otherwise run via:

$ nix develop

Language Server

The python-lsp-server (version 1.9.0) is included in this flake, along with the python-lsp-black and pyls-isort plugins. Additionally, pylsp is expected to be configured to use:

Refer to your editor for configuration details.

Formatting

Formatting depends on the black (version 23.9.1) tool. A pre-commit hook is included in .githooks that can be used to format all *.py files prior to commit. Install via:

$ git config --local core.hooksPath .githooks/

If running direnv, this hook is installed automatically when entering the directory.