The Parsing of Wikipedia

Wikipedia is an incredible wealth of knowledge, and in the right context, access to its deep archives can be a neat way to enrich the lives of your users. Take my app, for example. I'm making an astronomy app and I want to provide some context hints when you click on a star. That seemingly simple feature spawned a robust processing pipeline that I'm exceedingly proud of. Let's dive into the technical details of how you can do this using the beautiful datasets that Wikipedia publicly provides.
What does Wikipedia provide?
Well, everything! I'm not even joking. Their pages are intense repositories of structured and unstructured information. English prose mixed with tabulated data carefully aggregated over a lifetime. All you need to do is access the right page and parse out the information.
In my case, I was looking just at astronomy articles. Their astronomy articles conform to a fairly rigorous, well-defined structure. All their stars and nebulae and galaxies are marked up with excellent summaries and further refined through "infoboxes" that contain things like coordinates, apparent magnitude, alternative names, constellations, and more.
It's all right here. https://en.wikipedia.org/wiki/Wikipedia:Database_download.
For the small space commitment of ~25GB of compressed data paired with ~300MB of indexes, you have everything you need to scrape and parse Wikipedia locally. The real trick is finding the right page.
That compressed "multistream" bzip file contains lots of smaller bzip files embedded in it. In order to access a given article, you need to know the starting and ending byte offsets of the "block" your article lives in. Each block holds about 100 articles, and finding the boundaries of a given block is the main purpose of the index file.
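To make that concrete, here's a minimal sketch of pulling one block out of the multistream dump. Each line of the index file is byteOffset:pageId:title, so once you know your block's offset and the offset of the next block, you can decompress just that slice. The readBlock helper and the unbzip2-stream package here are just one way to do it, not the code from my actual pipeline.

import fs from "node:fs";
import bz2 from "unbzip2-stream"; // any npm bzip2 decoder will do

// Decompress a single multistream block, given its byte offset (from the
// index file) and the offset of the next block. Returns the raw <page> XML
// for the ~100 articles inside it.
function readBlock(dumpPath, start, end) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    fs.createReadStream(dumpPath, { start, end: end - 1 })
      .pipe(bz2())
      .on("data", (chunk) => chunks.push(chunk))
      .on("end", () => resolve(Buffer.concat(chunks).toString("utf8")))
      .on("error", reject);
  });
}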
So the name of the game is finding which block you care about. And that's where my solution begins.
The Design
Finding the Blocks
Ultimately, I wanted to create a subset of Wikipedia containing only candidate pages that might have astronomy-related content. In this way, I could quickly iterate on the matching algorithm I created to pair a Wikipedia page with a star in my database.
To do this I first created an on-disk Typesense database with the following schema:
import Typesense from "typesense";

// Connection details are placeholders for a local Typesense instance.
const client = new Typesense.Client({
  nodes: [{ host: "localhost", port: 8108, protocol: "http" }],
  apiKey: "xyz",
});

await client.collections().create({
  name: "wiki",
  fields: [
    { name: "wikipediaId", type: "int32" },
    { name: "title", type: "string", infix: true },
    // byte offsets into a ~25GB dump overflow int32, so these need int64
    { name: "start", type: "int64" },
    { name: "end", type: "int64" },
    { name: "paragraph", type: "string", infix: true },
  ],
});
By ingesting a subset of Wikipedia and storing the block offsets, page titles, and summary information in a full-text database, looking up any given article became trivial. Unfortunately, populating this beast is fairly un-sexy. You just have to crack open every single Wikipedia page, identify the interesting ones, and then index them.
On a modern computer this only took a few hours, so it's not that bad in the grand scheme of things.
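For a flavor of what that loop looks like, here's a sketch of the indexing side. It assumes the client and schema from above, plus a hypothetical iterateDumpPages helper that walks the index file and yields each page's offsets, title, and wikitext; indexPath and dumpPath are stand-ins for the files on disk.

import wtf from "wtf_wikipedia";

// iterateDumpPages is a stand-in for the index-file walker described above;
// it yields { pageId, title, start, end, wikiText } for every page.
for await (const page of iterateDumpPages(indexPath, dumpPath)) {
  if (!isAstronomyArticle(page.wikiText)) continue; // infobox check, next section
  await client.collections("wiki").documents().create({
    wikipediaId: page.pageId,
    title: page.title,
    start: page.start,
    end: page.end,
    // first section of the article doubles as the searchable summary
    paragraph: wtf(page.wikiText).sections()[0]?.text() ?? "",
  });
}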
How do you identify relevant articles for the problem domain?
In my use case, I wanted to index only astronomy-related articles. That was rather easy because of the metadata associated with them. Wikipedia's structured "infoboxes" contain coordinates for stars and other celestial objects, along with some tertiary data.
While populating the Typesense database, I just had to use wtf_wikipedia to perform a quick check of the metadata, looking for these highly niche infoboxes. If present, the article would get indexed. Easy peasy.
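In code, the check looked roughly like this. The set of template names is illustrative; the exact names you filter on are whatever your domain (and the dump itself) actually uses.

import wtf from "wtf_wikipedia";

// Illustrative filter: keep an article only if it carries one of the
// astronomy infobox templates we care about. Check the dump for the real
// template names your target articles use.
const ASTRO_INFOBOXES = new Set(["star", "galaxy", "nebula", "planet"]);

export function isAstronomyArticle(wikiText) {
  const doc = wtf(wikiText);
  return doc
    .infoboxes()
    .some((box) => ASTRO_INFOBOXES.has(box.type().toLowerCase().trim()));
}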
This does mean I'm relying on every article containing a specific infobox and on that infobox being correctly formatted. (Unsurprising hint: they weren't all formatted correctly!) But through trial and error I found a few different ways the properties could be misspelled. For example, sometimes an apostrophe would be a Unicode apostrophe, or sometimes there'd be two spaces instead of one.
Identifying common ways the parser could fail was certainly work, but not technically complicated.
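The fixes themselves were mostly tiny normalizations applied to infobox keys and values before comparing them, along these lines (a sketch, not the exact rules I ended up with):

// Normalize an infobox key or value before matching against expected names,
// smoothing over Unicode apostrophes, doubled spaces, and stray casing.
function normalize(value) {
  return value
    .replace(/[\u2018\u2019\u02BC]/g, "'") // curly or Unicode apostrophes -> ASCII
    .replace(/\s+/g, " ") // collapse repeated whitespace
    .trim()
    .toLowerCase();
}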
After indexing everything into the full-text database, I could look up where information lives, but that alone doesn't match stars to articles. So the next phase of the design is a processing pipeline capable of extracting and transforming domain-specific information from candidate articles for deeper analysis.
To do that, I used the lovely (albeit unmaintained) library simply called wtf_wikipedia.
The Processing Pipeline
Making sense of a Wikipedia article within the context of a problem domain was rather fun. I patterned my solution off a typical processing pipeline design: you register steps into the pipeline, each step has access to a shared context, and the goal is for each step to enrich that context object with information.
import wtf from "wtf_wikipedia";

// Base class for pipeline steps; each step reads the parsed document and
// enriches the shared context object.
export class PipelineStep {
  async execute(reference, context) {
    throw new Error("Not implemented");
  }
}

class Pipeline {
  constructor() {
    this.steps = [];
  }

  registerStep(transformer) {
    this.steps.push(transformer);
    return this;
  }

  async execute(wikiText) {
    // Parse the raw wikitext once and hand the same document to every step.
    const ref = wtf(wikiText);
    const entry = {};
    entry["wiki"] = {
      title: ref.title(),
    };
    for (const step of this.steps) {
      await step.execute(ref, entry);
    }
    return entry;
  }

  build() {
    return this;
  }
}
I created one step to extract the relevant infobox data, another step to transform the J2000 coordinates into numerical angles, and a final step to search my star database for matches at that exact celestial location.
In this way I could empirically match a given article to a given star by using their actual location in the night sky as a key.
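To give a feel for the shape of those steps, here's a sketch of the first two. The class names, the infobox field names ("ra", "dec"), and the conversion helpers are illustrative rather than lifted verbatim from my code.

// Pulls the raw coordinate strings out of the article's infobox.
class ExtractInfoboxStep extends PipelineStep {
  async execute(reference, context) {
    const box = reference.infoboxes()[0]; // first infobox, if the article has one
    if (!box) return;
    context.infobox = {
      ra: box.get("ra")?.text(),
      dec: box.get("dec")?.text(),
    };
  }
}

// Converts J2000 strings like "06h 45m 08.9s" / "-16° 42' 58\"" into degrees.
class ConvertJ2000Step extends PipelineStep {
  async execute(reference, context) {
    const { ra, dec } = context.infobox ?? {};
    if (!ra || !dec) return;
    context.coords = { raDeg: raToDegrees(ra), decDeg: decToDegrees(dec) };
  }
}

// One hour of right ascension is 15 degrees.
function raToDegrees(ra) {
  const [h, m, s] = ra.match(/[\d.]+/g).map(Number);
  return (h + m / 60 + s / 3600) * 15;
}

// Degrees, arcminutes, arcseconds -> decimal degrees, keeping the sign.
function decToDegrees(dec) {
  const sign = /^[-−]/.test(dec.trim()) ? -1 : 1;
  const [d, m, s] = dec.match(/[\d.]+/g).map(Number);
  return sign * (d + m / 60 + s / 3600);
}

// The star-matching step follows the same pattern; wiring it all up:
const pipeline = new Pipeline()
  .registerStep(new ExtractInfoboxStep())
  .registerStep(new ConvertJ2000Step())
  .build();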
The Final Test
After creating the disparate pieces to solve the ultimate question of "which article matches which star", it was time to merge all these concepts together.
All it took was a small script that would generate a number of queries for each star in my database. Queries like %name%star% or %name%constellation%. Then I would use the Typesense full-text search to identify candidate articles that might be relevant to the celestial object, run each one through the pipeline, and, if I got a hit, the matched object would get a reference in my database pointing to the Wikipedia article it relates to.
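The query side is just the Typesense search API pointed at the collection from earlier. Here's roughly what one lookup looks like, with the query strings and field choices as illustrative stand-ins:

// Fetch candidate articles for one star from the "wiki" collection.
async function findCandidates(star) {
  const queries = [`${star.name} star`, `${star.name} ${star.constellation}`];
  const candidates = [];
  for (const q of queries) {
    const result = await client.collections("wiki").documents().search({
      q,
      query_by: "title,paragraph",
      infix: "always,always", // one value per query_by field; enabled in the schema
    });
    candidates.push(...result.hits.map((hit) => hit.document));
  }
  return candidates;
}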
This lets me put a Wikipedia link in my app (and maybe a summary, if I want to update the pipeline a little), instantly adding rich context and direct access to further information for the inquisitive mind.
If you had asked me 15 years ago to parse Wikipedia, I'd have been enthused by the idea, but I'm not sure the pieces were so easily accessible to make it happen. Plus, I probably didn't have enough CDs to store it all!
By combining just a few technologies, we can harness the power of Wikipedia in an afternoon.