2023-11-30 22:36:44 +00:00
# coach-scraper
2023-11-27 20:09:40 +00:00
2023-11-27 21:13:46 +00:00
**Caution! Be careful running this script.**
2023-11-30 22:36:44 +00:00
We intentionally delay each batch of requests. Make sure any adjustments to this
script appropriately rate-limit.
2023-11-27 21:13:46 +00:00
## Overview
2023-11-30 22:36:44 +00:00
This is a simple web scraper for coaches listed on:
* [chess.com ](https://www.chess.com/coaches )
* [lichess.org ](https://www.lichess.org/coach )
The program searches for coach usernames as well as specific information about
2023-12-05 18:43:13 +00:00
each of them (their profile, recent activity, and stats). Data is streamed into
a Postgres instance. Downloaded content is found in a newly created `data`
directory with the following structure:
2023-11-27 21:13:46 +00:00
```
data
2023-11-30 22:36:44 +00:00
└── < site >
│ ├── coaches
│ │ ├── < username >
│ │ │ ├── < username > .html
│ │ │ └── ...
│ │ ├── ...
2023-11-27 21:13:46 +00:00
└── pages
├── < n > .txt
├── ...
```
2023-12-04 22:08:17 +00:00
## Quickstart
2023-11-28 12:28:21 +00:00
2023-12-01 17:30:44 +00:00
Included in the development shell of this flake is a [Postgres ](https://www.postgresql.org/ )
client (version 15.5). Generate an empty Postgres cluster at `/db` by running
```bash
$ pg_ctl -D db init
```
To start the database, run the following:
```bash
$ pg_ctl -D db -l db/logfile -o --unix_socket_directories=@scraper start
```
In the above command, `@scraper` refers to an [abstract socket name ](https://www.postgresql.org/docs/15/runtime-config-connection.html#GUC-UNIX-SOCKET-DIRECTORIES ).
Rename to whatever is appropriate for your use case. To then connect to this
database instance, run:
```bash
$ psql -h @scraper
```
To later shut the database down, run:
```bash
$ pg_ctl -D db stop
```
2023-12-05 18:43:13 +00:00
Initialize the table that scraped content will be streamed into:
2023-12-01 17:30:44 +00:00
```bash
2023-12-04 15:14:33 +00:00
$ psql -h @scraper -f sql/init.sql
2023-12-01 17:30:44 +00:00
```
2023-12-05 18:43:13 +00:00
If you have nix available, you can now run the scraper via
2023-12-01 17:30:44 +00:00
```bash
2023-12-05 18:43:13 +00:00
$ nix run . -- ...
2023-12-04 22:08:17 +00:00
```
2023-12-05 18:43:13 +00:00
Otherwise, ensure you have [poetry ](https://python-poetry.org/ ) on your machine
and instead run the following:
2023-12-01 17:30:44 +00:00
```bash
2023-12-05 18:43:13 +00:00
$ poetry run python3 -m app ...
2023-11-28 12:28:21 +00:00
```
2023-11-27 21:13:46 +00:00
## Development
2023-11-28 12:28:21 +00:00
[nix ](https://nixos.org/ ) is used for development. The included `flakes.nix`
file automatically loads in Python (version 3.11.6) with packaging and
dependency management handled by poetry (version 1.7.0). [direnv ](https://direnv.net/ )
can be used to a launch a dev shell upon entering this directory (refer to
`.envrc` ). Otherwise run via:
2023-11-27 20:09:40 +00:00
```bash
2023-12-01 17:30:44 +00:00
$ nix develop
2023-11-27 20:09:40 +00:00
```
2023-11-27 21:13:46 +00:00
### Language Server
2023-11-27 20:09:40 +00:00
The [python-lsp-server ](https://github.com/python-lsp/python-lsp-server )
(version v1.9.0) is included in this flake, along with the [python-lsp-black ](https://github.com/python-lsp/python-lsp-black )
2023-12-01 23:36:53 +00:00
and [pyls-isort ](https://github.com/paradoxxxzero/pyls-isort ) plugins.
Additionally, `pylsp` is expected to be configured to use:
* [McCabe ](https://github.com/PyCQA/mccabe ),
* [pycodestyle ](https://pycodestyle.pycqa.org/en/latest/ ), and
* [pyflakes ](https://github.com/PyCQA/pyflakes ).
Refer to your editor for configuration details.
2023-11-27 20:09:40 +00:00
2023-11-27 21:13:46 +00:00
### Formatting
2023-11-27 20:09:40 +00:00
Formatting depends on the [black ](https://black.readthedocs.io/en/stable/index.html )
(version 23.9.1) tool. A `pre-commit` hook is included in `.githooks` that can
be used to format all `*.py` files prior to commit. Install via:
```bash
2023-12-01 17:30:44 +00:00
$ git config --local core.hooksPath .githooks/
2023-11-27 20:09:40 +00:00
```
If running [direnv ](https://direnv.net/ ), this hook is installed automatically
when entering the directory.