Extracting tables from Wikipedia using a bookmarklet
Continuing the theme of bookmarklets from my previous post, I decided to create another bookmarklet to extract Wikipedia tables into CSV files. A similar tool can be found here.
Credit goes to this Stack Overflow answer for the code to download JavaScript arrays as CSV files. This project’s GitHub repository is linked here.
Why this bookmarklet?
In my data science projects, I often need data from Wikipedia pages. Instead of manually copying, pasting and formatting the tables one by one, I thought it would be great to have a bookmarklet that downloads the data I need into a CSV file that can be read by libraries such as pandas. It’s also a great way to find out how to use a bookmarklet to download content as a file.
Features
- Automatic extraction of simple tables in Wikipedia into CSV
- Retains structure of tables
- CSV files are named after the caption of each table. If no caption is available, they are named after the table’s enclosing header
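To give a feel for how a feature list like this can be implemented, here is a minimal sketch, not the bookmarklet's actual source. The function names (escapeCsvCell, rowsToCsv, tableToRows, downloadCsv) are hypothetical, and the quoting rules are the common CSV convention of wrapping cells that contain commas, quotes or line breaks in double quotes. The last two functions rely on browser APIs and only run inside a web page.

```javascript
// Quote a single cell for CSV: wrap it in double quotes when it contains
// a comma, a quote or a line break, and double any embedded quotes.
function escapeCsvCell(value) {
  const text = String(value);
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Turn an array of rows (each an array of cell values) into CSV text.
function rowsToCsv(rows) {
  return rows.map((row) => row.map(escapeCsvCell).join(',')).join('\r\n');
}

// Read a <table> element into an array of rows (browser only).
// Spanning cells are NOT handled here, so only simple tables come out right.
function tableToRows(table) {
  return Array.from(table.rows, (tr) =>
    Array.from(tr.cells, (cell) => cell.textContent.trim())
  );
}

// Offer CSV text as a file download by clicking a temporary link (browser only).
function downloadCsv(csvText, filename) {
  const blob = new Blob([csvText], { type: 'text/csv;charset=utf-8;' });
  const url = URL.createObjectURL(blob);
  const link = document.createElement('a');
  link.href = url;
  link.download = filename;
  document.body.appendChild(link);
  link.click();
  document.body.removeChild(link);
  // Release the object URL once the click has been handled.
  setTimeout(() => URL.revokeObjectURL(url), 0);
}
```

In a page, these could be combined as downloadCsv(rowsToCsv(tableToRows(table)), 'some_name.csv') for each table found by a selector such as document.querySelectorAll('table.wikitable'). The selector is an assumption about how the tables are located, not something taken from the project.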
How to get the bookmarklet?
- Toggle your browser’s bookmark bar (CTRL + Shift + B on Windows/Chrome).
- Drag the following link onto your bookmarks bar: wiki-table
- Optional: Rename the new bookmark to whatever you like.
Using the bookmarklet
Example 1
- As an example, suppose I want to extract the tables for this Wikipedia page: Singaporean Mahjong scoring rules. The first table that we see is the following:
- Subsequent tables are as follows:
There are 3 tables in total. Click on the bookmarklet, and it will automatically extract the 3 tables above as 3 separate CSV files. Remember to allow your browser to download multiple files at the same time.
These 3 tables happen to have no caption, and they fall under the same header (“Scoring points”). I decided to name the downloaded files scoring_points.csv, and because of the duplicate file names, the downloaded files are automatically renamed to scoring_points.csv, scoring_points (1).csv and scoring_points (2).csv.
This is how one of the CSV files looks, with 2 columns and 6 rows:
Example 2
- As another example, suppose I want to extract tables from this page: List of songs recorded by Lady Gaga. Here’s an example table:
A caption exists for this table, which is enclosed in the red box above. To make the filename of the CSV file more readable and intuitive, I decided to name the file after the caption of the table. This table is saved as name_of_song_featured_performers_writers_originating_album_and_year_released.csv.
Note that there is a small table above, with the caption Key. In this context, I regarded this small table as non-essential. I used the fontSize property of the caption to automatically ignore this table.
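The file names in both examples follow the same pattern: lowercase words joined by underscores, taken from the caption when there is one and from the enclosing header otherwise. Below is a rough sketch of that rule. The function names toCsvFilename and pickFilename are made up for illustration, the fallback name table.csv is an assumption, and the fontSize check that skips the Key table is not shown.

```javascript
// Turn a caption or heading into a lowercase, underscore-separated file name.
function toCsvFilename(label) {
  const slug = label
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '_') // any run of other characters becomes one "_"
    .replace(/^_+|_+$/g, '');    // drop leading and trailing underscores
  // Assumption: use a generic name when nothing usable is left.
  return (slug || 'table') + '.csv';
}

// Prefer the caption; fall back to the enclosing heading when there is none.
function pickFilename(captionText, headingText) {
  const caption = (captionText || '').trim();
  return toCsvFilename(caption || headingText || '');
}
```

With no caption and the header “Scoring points”, pickFilename returns scoring_points.csv. Note that this simple rule also collapses non-ASCII letters into underscores, which may not be what you want for captions in other languages.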
Gotchas/To-do
For this section, see the original table from this Wikipedia page…
… And its resulting CSV file:
Multi-line content
- When the content of a cell is separated by new lines, I replace each line break with a single pipe (|), in case the line breaks are essential. See the result enclosed in the light blue box. Downstream data processing tasks can recover the separate lines via methods such as split.
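As an illustration of the pipe replacement and how it can be undone later, here is a small sketch. The helper names joinLines and splitLines are hypothetical, and the exact whitespace handling is an assumption rather than the bookmarklet's real behaviour.

```javascript
// Collapse each line break (and the whitespace around it) in a cell into one pipe.
function joinLines(cellText) {
  return cellText.trim().replace(/\s*\n\s*/g, '|');
}

// Downstream: recover the original lines from a piped cell.
function splitLines(csvCell) {
  return csvCell.split('|');
}
```

One caveat with this approach: if a cell already contains a pipe character, splitting cannot tell it apart from a replaced line break.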
Multi-row/multi-column tables
- This bookmarklet cannot properly extract more sophisticated tables. For example, the table above splits the 3rd column into 2 separate rows, and this messes up the resulting CSV file. The CSV file interprets the table as having 4 columns (see row 1 of the result), and interprets the content of the split cells as a separate row (see the content enclosed in the red box).
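One possible way to handle such tables, which the bookmarklet does not currently do, is to expand every spanning cell into all the grid positions it covers before writing the CSV. The sketch below assumes each cell has already been read into a plain object with text, rowspan and colspan fields. That shape and the name expandSpans are illustrative choices only.

```javascript
// Expand cells with rowspan/colspan into a rectangular grid by repeating
// the cell's text in every position it covers.
// Input: rows of { text, rowspan, colspan } objects (spans default to 1).
function expandSpans(rows) {
  const grid = [];
  rows.forEach((row, r) => {
    grid[r] = grid[r] || [];
    let c = 0;
    for (const cell of row) {
      // Skip positions already filled by a rowspan from an earlier row.
      while (grid[r][c] !== undefined) c++;
      const rowspan = cell.rowspan || 1;
      const colspan = cell.colspan || 1;
      for (let dr = 0; dr < rowspan; dr++) {
        grid[r + dr] = grid[r + dr] || [];
        for (let dc = 0; dc < colspan; dc++) {
          grid[r + dr][c + dc] = cell.text;
        }
      }
      c += colspan;
    }
  });
  return grid;
}
```

In a browser, the input objects could be built from each table cell's textContent, rowSpan and colSpan properties. Repeating the text in every covered position is a design choice; leaving the extra positions empty would be an equally reasonable alternative.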
Conclusion
This bookmarklet is a convenient way to extract simple tables into CSV files, with some automatic naming so that one does not have to manually rename the CSV files to something interpretable. For more complex tables, more work will need to be done.
Let me know in the Disqus section below if you’ve used this bookmarklet, and what you think of it!