Section alignment image suggestions#
Recommend images for Wikipedia article sections based on equivalent section titles across Wikipedia language editions.
Get ready#
You need access to one of the Wikimedia Foundation’s analytics clients, AKA a stat box. Then:
me@my_box:~$ ssh stat1008.eqiad.wmnet # Or pick another one
me@stat1008:~$ export http_proxy=http://webproxy.eqiad.wmnet:8080
me@stat1008:~$ export https_proxy=http://webproxy.eqiad.wmnet:8080
me@stat1008:~$ git clone https://gitlab.wikimedia.org/repos/structured-data/section-image-recs.git sir
me@stat1008:~$ conda-analytics-clone MY_ENV
me@stat1008:~$ source conda-analytics-activate MY_ENV
(MY_ENV) me@stat1008:~$ conda install -c conda-forge pandas=1.5.3
(MY_ENV) me@stat1008:~$ pip install mwparserfromhell==0.6.4
(MY_ENV) me@stat1008:~$ cd sir
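Optionally, sanity-check the pinned dependencies before running anything (a quick one-liner, not part of the official setup):
(MY_ENV) me@stat1008:~/sir$ python -c "import pandas, mwparserfromhell; print(pandas.__version__, mwparserfromhell.__version__)"
1.5.3 0.6.4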
Get set: extract available images#
(MY_ENV) me@stat1008:~/sir$ python imagerec/article_images.py --wikitext-snapshot YYYY-MM --item-page-link-snapshot YYYY-MM-DD --output article_images
--help me#
(MY_ENV) me@stat1008:~/sir$ python imagerec/article_images.py --help
usage: article_images.py [-h] --wikitext-snapshot YYYY-MM --item-page-link-snapshot
YYYY-MM-DD --output /hdfs_path/to/parquet
[--wp-codes-file /path/to/file.json | --wp-codes [wp-code ...]]
Gather images available in Wikipedia articles from wikitext
options:
-h, --help show this help message and exit
--wikitext-snapshot YYYY-MM
"wmf.mediawiki_wikitext_current" Hive monthly snapshot
--item-page-link-snapshot YYYY-MM-DD
"wmf.wikidata_item_page_link" Hive weekly snapshot
--output /hdfs_path/to/parquet
HDFS path to output parquet
--wp-codes-file /path/to/file.json
path to JSON file with a list of Wikipedia language codes to
process. Default: all Wikipedias, see "data/wikipedia.json"
--wp-codes [wp-code ...]
space-separated Wikipedia language codes to process. Example:
ar en zh-yue
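For example, to restrict the extraction to a few Wikipedias, you can pass codes directly via --wp-codes, or point to a JSON file. A minimal sketch, assuming the file holds a plain JSON list of codes as the help text suggests (my_wp_codes.json is a hypothetical name):
(MY_ENV) me@stat1008:~/sir$ echo '["ar", "en", "zh-yue"]' > my_wp_codes.json
(MY_ENV) me@stat1008:~/sir$ python imagerec/article_images.py --wikitext-snapshot YYYY-MM --item-page-link-snapshot YYYY-MM-DD --output article_images --wp-codes-file my_wp_codes.json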
Go: generate suggestions#
(MY_ENV) me@stat1008:~/sir$ python imagerec/recommendation.py --section-images article_images --section-alignments /user/mnz/secmap_results/aligned_sections_subset/aligned_sections_subset_9.0_2022-02.parquet --max-target-images 0 --output suggestions
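Once the job completes, you can check that the output parquet landed on HDFS with standard HDFS commands:
(MY_ENV) me@stat1008:~/sir$ hdfs dfs -ls suggestions
(MY_ENV) me@stat1008:~/sir$ hdfs dfs -du -h suggestions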
Get --help!#
(MY_ENV) me@stat1008:~/sir$ python imagerec/recommendation.py --help
usage: recommendation.py [-h] --section-images /hdfs_path/to/parquet
--section-alignments /hdfs_path/to/parquet
--max-target-images N --output /hdfs_path/to/parquet
[--wp-codes-file /path/to/file.json | --wp-codes [wp-code ...]]
[-t /hdfs_path/to/parquet] [--keep-lists-and-tables]
Generate section-level image suggestions based on section alignments
options:
-h, --help show this help message and exit
--section-images /hdfs_path/to/parquet
HDFS path to parquet of section images, as output by
"article_images.py"
--section-alignments /hdfs_path/to/parquet
HDFS path to parquet of section alignments
--max-target-images N
Maximum number of images that a section being recommended images should contain.
Use 0 if you want to generate recommendations only for unillustrated sections.
--output /hdfs_path/to/parquet
HDFS path to output parquet
--wp-codes-file /path/to/file.json
path to JSON file with a list of Wikipedia language codes to
process. Default: all Wikipedias, see "data/wikipedia.json"
--wp-codes [wp-code ...]
space-separated Wikipedia language codes to process. Example:
ar en zh-yue
-t /hdfs_path/to/parquet, --table-filter /hdfs_path/to/parquet
HDFS path to parquet with a dataframe to exclude, as output by
"https://gitlab.wikimedia.org/repos/structured-data/section-
topics/-/blob/main/scripts/detect_html_tables.py". The
dataframe must include dict_keys(['wiki_db', 'page_id',
'section_title']) columns. Default: ar, bn, cs, es, id, pt, ru
sections with tables, see "20230301_target_wikis_tables" in
the current user home
--keep-lists-and-tables
don't skip sections with at least one standard wikitext list
or table
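For instance, to restrict the run to a couple of Wikipedias while keeping sections that contain lists or tables (flag values are illustrative, reusing the alignments path from the example above):
(MY_ENV) me@stat1008:~/sir$ python imagerec/recommendation.py --section-images article_images --section-alignments /user/mnz/secmap_results/aligned_sections_subset/aligned_sections_subset_9.0_2022-02.parquet --max-target-images 0 --output suggestions --wp-codes ar en --keep-lists-and-tables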
Trigger an Airflow test run#
Follow this walkthrough to simulate a production execution of the pipeline on your stat box. Inspired by this snippet.
Build your artifact#
1. Pick a branch you want to test from the drop-down menu
2. Click on the pipeline status button, it should be a green tick
3. Click on the play button next to publish_conda_env, wait until done
4. On the left sidebar, go to Packages and registries > Package Registry
5. Click on the first item in the list, then copy the Asset URL. It should be something like https://gitlab.wikimedia.org/repos/structured-data/section-image-recs/-/package_files/1322/download
Get your artifact ready#
me@stat1008:~$ mkdir artifacts
me@stat1008:~$ cd artifacts
me@stat1008:~/artifacts$ wget -O MY_ARTIFACT MY_COPIED_ASSET_URL
me@stat1008:~/artifacts$ hdfs dfs -mkdir artifacts
me@stat1008:~/artifacts$ hdfs dfs -copyFromLocal MY_ARTIFACT artifacts
me@stat1008:~/artifacts$ hdfs dfs -chmod -R o+rx artifacts
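Double-check the upload and its permissions before wiring the artifact into Airflow:
me@stat1008:~/artifacts$ hdfs dfs -ls artifacts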
Spin up an Airflow instance#
On your stat box:
me@stat1008:~$ git clone https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags.git
me@stat1008:~$ cd airflow-dags
me@stat1008:~/airflow-dags$ rm -r /tmp/MY_AIRFLOW_HOME # If you've previously run the next command
me@stat1008:~/airflow-dags$ sudo -u analytics-privatedata ./run_dev_instance.sh -m /tmp/MY_AIRFLOW_HOME -p MY_PORT platform_eng
On your local box:
me@my_box:~$ ssh -t -N stat1008.eqiad.wmnet -L MY_PORT:stat1008.eqiad.wmnet:MY_PORT
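Optionally, confirm the tunnel is up before opening the browser (assumes curl is installed on your local box):
me@my_box:~$ curl -I http://localhost:MY_PORT/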
Trigger the DAG run#
1. Go to http://localhost:MY_PORT/ in your browser
2. On the top bar, go to Admin > Variables
3. Click on the middle button (Edit record) next to the platform_eng/dags/section_alignment_image_suggestions_dag.py Key
4. Update { "conda_env" : "hdfs://analytics-hadoop/user/ME/artifacts/MY_ARTIFACT" }
5. Add any other relevant DAG properties
6. Click on the Save button
7. On the top bar, go to DAGs and click on the section_alignment_image_suggestions slider. This should trigger an automatic DAG run
8. Click on section_alignment_image_suggestions
You’re all set!
Release#
1. On the left sidebar, go to CI/CD > Pipelines
2. Click on the play button and select trigger_release
3. If the job went fine, you’ll find a new artifact in the Package Registry
We follow Data Engineering’s workflow_utils:
- the main branch is on a .dev release
- releases are made by removing the .dev suffix and committing a tag
Deploy#
1. On the left sidebar, go to CI/CD > Pipelines
2. Click on the play button and select bump_on_airflow_dags. This will create a merge request at airflow-dags
3. Double-check it and merge
4. Deploy the DAGs:
me@my_box:~$ ssh deployment.eqiad.wmnet
me@deploy1002:~$ cd /srv/deployment/airflow-dags/platform_eng/
me@deploy1002:~$ git pull
me@deploy1002:~$ scap deploy
See the docs for more details.