# coach-scraper
> **Caution!** Be careful running this script. We intentionally delay each
> batch of requests. Make sure any adjustments to this script appropriately
> rate-limit.
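As a point of reference, the kind of batch delay the caution describes might
look like the following sketch. This is illustrative only, not the project's
actual implementation; the `fetch` callback, batch size, and delay are all
hypothetical.

```python
import time

def fetch_in_batches(urls, fetch, batch_size=10, delay_secs=5.0):
    """Fetch `urls` in fixed-size batches, pausing between batches."""
    results = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start : start + batch_size]
        results.extend(fetch(url) for url in batch)
        if start + batch_size < len(urls):
            time.sleep(delay_secs)  # deliberate pause to rate-limit requests
    return results
```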
## Overview
This is a simple web scraper for coaches listed on:

- chess.com
- lichess.org
The program searches for coach usernames as well as specific information about
each of them (their profile, recent activity, and stats). The result will be
found in a newly created `data` directory with the following structure:
```
data
└── <site>
    ├── coaches
    │   ├── <username>
    │   │   ├── <username>.html
    │   │   ├── export.json
    │   │   └── ...
    │   └── ...
    └── pages
        ├── <n>.txt
        └── ...
```
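For instance, a short script along the following lines could iterate over the
scraped results. This is a sketch that assumes only the layout shown above; it
makes no assumptions about the fields inside each `export.json`.

```python
import json
from pathlib import Path

DATA_DIR = Path("data")

# Walk every <site>/coaches/<username>/export.json produced by the scraper.
for export_path in DATA_DIR.glob("*/coaches/*/export.json"):
    site = export_path.parts[1]      # e.g. "chesscom" or "lichess"
    username = export_path.parts[3]
    with export_path.open() as f:
        record = json.load(f)        # schema intentionally left unspecified
    print(site, username)
```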
## Quickstart
If you have `nix` available, run:

```
$ nix run . -- --user-agent <your-email> -s <site> [-s <site> ...]
```
If not, ensure you have `poetry` on your machine and instead run the following:

```
$ poetry run python3 -m app -u <your-email> -s <site> [-s <site> ...]
```
After running (this may take several hours), a new CSV will be generated at
`data/export.csv` containing all scraped content from the specified `<site>`s.
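As a quick sanity check once the run completes, something like the following
can confirm the CSV exists and count its rows. No particular column names are
assumed here.

```python
import csv
from pathlib import Path

export = Path("data/export.csv")

with export.open(newline="") as f:
    reader = csv.reader(f)
    header = next(reader)             # column names produced by the scraper
    row_count = sum(1 for _ in reader)

print(f"{export}: {len(header)} columns, {row_count} rows")
```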
## Database
Included in the development shell of this flake is a Postgres client (version
15.5). Generate an empty Postgres cluster at `db/` by running:

```
$ pg_ctl -D db init
```
To start the database, run the following:

```
$ pg_ctl -D db -l db/logfile -o --unix_socket_directories=@scraper start
```
In the above command, `@scraper` refers to an abstract socket name. Rename it
to whatever is appropriate for your use case. To then connect to this database
instance, run:

```
$ psql -h @scraper
```
To later shut the database down, run:

```
$ pg_ctl -D db stop
```
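If you prefer to verify the instance from Python instead of `psql`, the
following sketch should work. It assumes the `psycopg2` package is available
(it is not a stated dependency of this project) and relies on libpq's support
for abstract Unix sockets (`@`-prefixed host names, available since PostgreSQL
14).

```python
import psycopg2

# Connect over the abstract Unix socket created above. The "@" prefix is
# passed straight through to libpq; adjust if you renamed the socket.
conn = psycopg2.connect(host="@scraper", dbname="postgres")
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()
```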
### Loading Data
To load all exported coach data into a local Postgres instance, use the
provided `sql/*.sql` files. First initialize the export schema/table:

```
$ psql -h @scraper -f sql/init.sql
```
Next, dump exported data into the newly created table:

```
$ psql -h @scraper -f sql/export.sql -v export="'$PWD/data/export.csv'"
```
Re-running the `sql/export.sql` script will create a backup of the
`coach_scraper.export` table. It will then upsert the scraped data. You can
view all backups from the `psql` console like so:

```
postgres=# \dt coach_scraper.export*
```
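The same listing can be pulled programmatically. A minimal sketch, again
assuming `psycopg2` and making no assumption about how the backup tables are
suffixed beyond the `export` prefix shown above:

```python
import psycopg2

conn = psycopg2.connect(host="@scraper", dbname="postgres")
with conn, conn.cursor() as cur:
    # Roughly equivalent to `\dt coach_scraper.export*` in the psql console.
    cur.execute(
        """
        SELECT table_name
        FROM information_schema.tables
        WHERE table_schema = 'coach_scraper'
          AND table_name LIKE 'export%'
        ORDER BY table_name;
        """
    )
    for (table_name,) in cur.fetchall():
        print(table_name)
conn.close()
```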
## E2E
With the above section on loading files, we now have the individual components
necessary to scrape coach data from the supported chess websites and dump the
results into the database in one fell swoop. Assuming our database is open
with a socket connection available at `@scraper`:

```
$ nix run . -- --user-agent <your-email> -s chesscom -s lichess
$ psql -h @scraper -f sql/init.sql -f sql/export.sql -v export="'$PWD/data/export.csv'"
```
## Development
`nix` is used for development. The included `flake.nix` file automatically
loads in Python (version 3.11.6) with packaging and dependency management
handled by poetry (version 1.7.0). `direnv` can be used to launch a dev shell
upon entering this directory (refer to `.envrc`). Otherwise run via:

```
$ nix develop
```
### Language Server
The `python-lsp-server` (version 1.9.0) is included in this flake, along with
the `python-lsp-black` and `pyls-isort` plugins. Additionally, `pylsp` is
expected to be configured to use:

- McCabe,
- pycodestyle, and
- pyflakes.

Refer to your editor for configuration details.
### Formatting
Formatting depends on the `black` (version 23.9.1) tool. A pre-commit hook is
included in `.githooks` that can be used to format all `*.py` files prior to
commit. Install via:

```
$ git config --local core.hooksPath .githooks/
```

If running `direnv`, this hook is installed automatically when entering the
directory.