Populating Wikipedia: New tool integrating Australian Census data

Revision as of 03:20, 29 July 2022 by MaiaCWilliams (talk | contribs) (Link fix)
Wikimedia Australia develops a new module that automates the updating of population data
By Maia Williams.


There are approximately 15,000 Australian place articles in English Wikipedia. On the right of these pages, population figures appear in an Infobox. Come Census data release day, this data is instantly out of date. Contributors are diligent and quick in updating these, but updating new populations manually is a stupendous task.

In late 2021 a working group of Wikimedians formed to tackle this challenge by exploring options for partly automating the process of updating population data, ahead of the release of the Australian Census data in 2022.[1] They met with Australian Bureau of Statistics representatives, formed a project brief and decided to contract an experienced data specialist to oversee the project. That new person was me, Maia Williams.

The core project team included Wikimedia Australia president Alex Lum, committee member and Lua mentor Sam Wilson, executive officer Caddie Brain and Wikidata expert Toby Hudson, as well as many other Australian editors who contributed to ongoing discussions surrounding the project. Meeting regularly, the group refined the project scope, explored different technical approaches and defined method constraints.

The Lua module, called PopulationFromWikidata, was first developed on Beta Wikipedia and then tested as a sandbox version on the real Wikipedia. It was then connected to the Infobox Australian place template on 29 June 2022.

We’ll be discussing anything related to the module at the talk page here.

What the module does

The PopulationFromWikidata module draws on the latest Census data which was uploaded to Wikidata, Wikimedia’s open, collaborative data knowledgebase. The module is invoked from the Infobox Australian place template and brings the best available population value(s) and associated reference information from the article’s linked Wikidata item.

The module works with some assumptions about the content of the Wikipedia and Wikidata items, and sets some minimum requirements for Wikidata population claims. The ideal claim configuration is detailed here.

Figure 1: Diagram of the PopulationFromWikidata module workflow

The full details of the module's functionality are outlined in the documentation here. The key steps in the module are as follows and are shown in Figure 1 (right).

Step 1: To ensure that the module only outputs valid and well referenced population values, it first checks which population claims in the Wikidata item meet four basic requirements: that they have a date, geographic area type, determination method and some reference data. It ignores population claims that don’t meet these criteria.
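As a rough illustration of Step 1, the filter can be sketched in Python as below. This is not the module's Lua code: the flat dictionary layout, the field names and the sample figures are all made up for the example, and real Wikidata claims are structured differently.

```python
# Hypothetical, simplified stand-in for Wikidata population claims.
# Field names and sample values are assumptions made for this sketch.
REQUIRED_FIELDS = ("date", "applies_to_part", "determination_method", "references")


def is_valid_claim(claim):
    """Return True only if all four required fields are present and non-empty."""
    return all(claim.get(field) for field in REQUIRED_FIELDS)


def shortlist_claims(claims):
    """Keep the claims that meet the four basic requirements; ignore the rest."""
    return [claim for claim in claims if is_valid_claim(claim)]


claims = [
    # Complete claim: kept.
    {"value": 1200, "date": "2021", "applies_to_part": "SAL",
     "determination_method": "census", "references": ["ABS 2021 Census"]},
    # No reference data: ignored.
    {"value": 1100, "date": "2016", "applies_to_part": "SAL",
     "determination_method": "census", "references": []},
    # No date and no geographic area type: ignored.
    {"value": 1000, "determination_method": "census",
     "references": ["ABS 2011 Census"]},
]

shortlisted = shortlist_claims(claims)
```

Only the first sample claim survives the filter, because the other two are each missing at least one of the four requirements.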

Step 2: The module then separates the shortlisted population claims into those which have ‘applies to part’ values correctly mapping to the Infobox Australian place ‘type’ and those that don’t. The mappings of these place types to Australian Bureau of Statistics’ geographic areas are as per Table 1 (for now[2]).

Table 1: Mapping of Infobox Australian place ‘type’ to ABS geographic areas[2][3]
Infobox type | ABS geographic area
City | Urban Centres and Localities (UCL)[4]
Suburb | Suburbs and Localities (SAL)
Town | Urban Centres and Localities (UCL) (or SAL or Indigenous Locations (ILOC))[4][5]
LGA | Local Government Areas (LGA)
Region | Local Government Areas (LGA) (for now)
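The Table 1 mapping and the Step 2 split could be expressed along the following lines. This is a Python sketch only: the dictionary name, the function name, the claim layout and the sample figures are assumptions for illustration, not the module's actual Lua implementation.

```python
# Preferred ABS geographic area for each Infobox 'type', following Table 1.
# Keys are lower-cased so the lookup is case-insensitive (an assumption).
PREFERRED_AREA = {
    "city": "UCL",
    "suburb": "SAL",
    "town": "UCL",    # SAL or ILOC act as fallbacks when no UCL figure exists
    "lga": "LGA",
    "region": "LGA",  # "for now", as noted in Table 1
}


def split_by_type(claims, infobox_type):
    """Split claims into those matching the preferred area for the type and the rest."""
    preferred = PREFERRED_AREA.get(infobox_type.lower())
    matching = [c for c in claims if c["applies_to_part"] == preferred]
    others = [c for c in claims if c["applies_to_part"] != preferred]
    return matching, others


# Made-up shortlisted claims for a hypothetical town.
shortlisted = [
    {"applies_to_part": "UCL", "year": 2016, "value": 950},
    {"applies_to_part": "SAL", "year": 2021, "value": 1200},
]

matching, others = split_by_type(shortlisted, "Town")
```

Keeping the mapping in a single lookup table reflects the note that the mappings remain negotiable: changing a mapping means editing one entry.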

Step 3: To ensure that the module only presents the most recent data to the Wikipedia article Infobox, it looks for the population figure with the latest date for each ‘applies to part’ value from the shortlisted population claims (regardless of their classification from Step 2). This is necessary because there will always be population values from different years in the Wikidata item (e.g. 2011, 2016, 2021 and more).
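The "latest figure per geographic area" selection in Step 3 can be sketched in Python as follows. The claim layout (a plain dictionary with a numeric year) and the sample values are assumptions for the example. The real module works on Wikidata date values in Lua.

```python
def latest_per_area(claims):
    """Return a dict mapping each 'applies to part' value to its most recent claim."""
    latest = {}
    for claim in claims:
        area = claim["applies_to_part"]
        # Keep this claim if it is the first seen for the area or newer than the stored one.
        if area not in latest or claim["year"] > latest[area]["year"]:
            latest[area] = claim
    return latest


# Made-up shortlisted claims spanning several census years.
shortlisted = [
    {"applies_to_part": "SAL", "year": 2011, "value": 1000},
    {"applies_to_part": "SAL", "year": 2021, "value": 1200},
    {"applies_to_part": "SAL", "year": 2016, "value": 1100},
    {"applies_to_part": "UCL", "year": 2016, "value": 950},
]

latest = latest_per_area(shortlisted)
```

In this sample the SAL entry resolves to the 2021 figure, while UCL stays at 2016 because no newer UCL claim exists.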

Step 3A (Output Scenario 1): In this scenario there are valid population claims with ‘applies to part’ values that match the Infobox ‘type’. The most recent of these is passed to the Infobox. The output is a single population figure (with a reference).

Step 3B with type = town (Output Scenario 2): This scenario occurs when there is no Urban Centre and Locality (UCL) population for the town in Wikidata and the module presents the most recent Suburb and Locality (SAL) and/or Indigenous Location (ILOC) population figures to the Infobox instead. These are the second-most useful (and used)[5] geographic areas that often represent towns (after the preferred UCL). The output is one or two population figures (with references).

Step 3B with type ≠ town (Output Scenario 3): This scenario arises when the module doesn’t find a valid population claim with ‘applies to part’ equal to the Infobox ‘type’. In this case it presents a list of the most recent population figures (and references) for each available ‘applies to part’ value. This type of output will not occur often[6].
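The three output scenarios can be read as a small decision procedure. The Python sketch below is one possible reading, not the module's Lua code: the names, data layout and sample figures are hypothetical, and sending a town with no UCL, SAL or ILOC figure to Scenario 3 is an assumption.

```python
# Preferred area per Infobox type (from Table 1); names are assumptions.
PREFERRED_AREA = {"city": "UCL", "suburb": "SAL", "town": "UCL",
                  "lga": "LGA", "region": "LGA"}
TOWN_FALLBACKS = ("SAL", "ILOC")


def select_output(infobox_type, latest_by_area):
    """Return (scenario_number, claims_to_show) from the latest claim per area."""
    kind = infobox_type.lower()
    preferred = PREFERRED_AREA.get(kind)
    # Scenario 1: a claim exists for the preferred area, so show a single figure.
    if preferred in latest_by_area:
        return 1, [latest_by_area[preferred]]
    # Scenario 2: a town without a UCL figure shows SAL and/or ILOC instead.
    if kind == "town":
        fallbacks = [latest_by_area[a] for a in TOWN_FALLBACKS if a in latest_by_area]
        if fallbacks:
            return 2, fallbacks
    # Scenario 3: list the latest figure for every available area
    # (an empty list if there are no valid claims at all).
    return 3, list(latest_by_area.values())


# Made-up 'latest claim per area' lookups for three hypothetical places.
city = {"UCL": {"year": 2016, "value": 50000}, "LGA": {"year": 2021, "value": 80000}}
town = {"SAL": {"year": 2021, "value": 1200}, "ILOC": {"year": 2016, "value": 300}}
region = {"SAL": {"year": 2021, "value": 700}}

city_result = select_output("city", city)
town_result = select_output("town", town)
region_result = select_output("region", region)
```

With this sample data the city yields one UCL figure, the town yields its SAL and ILOC figures, and the region falls through to a list of whatever is available.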

How to see the module outputs

Currently the module will only give a population figure to the Infobox if one has not been manually added via the Infobox Australian place template ‘pop’ field. This means if you want to see the module in action for a particular place article, you should follow these steps:

  1. Pick a Wikipedia place article and check that the linked Wikidata item has a valid population claim (most now do, but some values will be old because not all 2021 Census data has been released yet[1]).
  2. If the Wikidata item looks good, then edit the Infobox Australian place template part of the article. Remove the ‘pop’ value and replace with a comment like: “<!--Leave blank to draw the latest automatically from Wikidata-->”. Remove the 'pop_year' and 'pop_footnotes' fields. Check if the old 'pop_footnotes' reference had been used elsewhere in the article.
  3. Check the output in the article Infobox. If the output is not as expected then edit the Wikidata item or if it’s really broken, get in touch here.
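For anyone scripting step 2 across several articles, the edit to the Infobox wikitext could be approximated as below. This is a hedged Python sketch, not an existing tool: the function name and sample wikitext are invented, and it assumes each template parameter sits on its own line starting with a pipe. It does not check whether the old 'pop_footnotes' reference is reused elsewhere in the article, so that still needs to be checked by hand.

```python
import re

COMMENT = "<!--Leave blank to draw the latest automatically from Wikidata-->"

# Matches '| pop =', '| pop_year =' or '| pop_footnotes =' at the start of a line.
POP_PARAM = re.compile(r"^(\s*\|\s*)(pop|pop_year|pop_footnotes)(\s*=)")


def prepare_infobox(wikitext):
    """Blank the 'pop' value and drop the 'pop_year' and 'pop_footnotes' lines."""
    kept = []
    for line in wikitext.splitlines():
        match = POP_PARAM.match(line)
        if match is None:
            kept.append(line)  # unrelated line: keep unchanged
        elif match.group(2) == "pop":
            # Replace the manual value with the explanatory comment.
            kept.append(f"{match.group(1)}pop{match.group(3)} {COMMENT}")
        # 'pop_year' and 'pop_footnotes' lines are dropped entirely.
    return "\n".join(kept)


# Invented sample wikitext for a hypothetical place article.
sample = "\n".join([
    "{{Infobox Australian place",
    "| type = town",
    "| name = Exampleville",
    "| pop = 1,234",
    "| pop_year = 2016",
    '| pop_footnotes = <ref name="census2016"/>',
    "| pop_density = 5",
    "}}",
])

result = prepare_infobox(sample)
```

Other parameters that merely begin with "pop", such as the 'pop_density' line in the sample, are left untouched because the pattern requires an equals sign directly after the parameter name.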

Here's an example of an article with Infobox using the module, and the diff of the edit made.

Background

Many other initiatives underpinned the development of this module. Much of the 2021 and 2016 Australian Census data has been bulk uploaded to Australian place Wikidata items by Toby Hudson and Alex Lum. This module only makes sense because of their previous hard work in connecting articles and items to each other and to ABS IDs then developing bulk upload methods using QuickStatements. The plan is for Census data to continue to be uploaded in bulk using their established methods and including (at a minimum) the values listed here.

There are no plans to develop bulk uploads of between-census population estimates, although if manually entered with required values they are considered valid population values by the module.

Many other Wikipedians have developed templates for systematically adding in-text population figures to articles, or methods for extracting values from Wikidata. Their work will become part of the next steps toward keeping other Infobox values (e.g. coordinates) and in-text population values up-to-date.

Tracking

Figure 2: Graph showing the number of place articles using the PopulationFromWikidata module

Sam Wilson is tracking some statistics around place articles here. As at 19 July 2022 there are 849 articles using population values from Wikidata (see the list of those articles here). Figure 2 (right) is a graph showing the change in that count since the module was implemented on 29 June 2022.

We will watch the uptake of the module as it is refined and as more data is uploaded to Wikidata.

Where to from here

There is more work to be done on the PopulationFromWikidata module to consider how it interacts with other Infobox Australian place fields and more unusual uses of Infoboxes.

There is also a lot of work still to be done under the scope of keeping population figures up-to-date in place articles. For example, big challenges include how best to use Wikidata to keep article in-text population figures up-to-date, how to retain historic population values as the latest are introduced, and how to tackle associated complications with reference lists.

Some noted bugs and challenges to tackle (at all scales) are here. Please feel free to contribute to the list!

Thank you!

A huge thank you to the mentor team who helped me (Maia) work on this project: Toby Hudson, Sam Wilson, Alex Lum and Caddie Brain. I've learnt so much and I'm completely hooked!

The PopulationFromWikidata module was developed thanks to the support of the Wikimedia Foundation through its Simple APG annual funding. Wikimedia Australia thanks Maia Williams for her work on this project as well as the working group. Wikimedia Australia offers an annual grants program for projects like this, so if you have ideas for additional projects you can read more and apply here.

Footnotes

  1. The first release of the 2021 Australian Census data was on 28 June 2022 (including SAL, ILOC and LGA population counts). The second release will be in October 2022 (including UCL population counts). See the ABS release dates here.
  2. These place mappings were derived using SPARQL queries extracting all Australian place articles, checking Infobox type values and the most common ABS geographic area specified in the linked Wikidata item. The mapping was also informed by experienced Wikipedians and census data users. The mappings remain negotiable and are easy to adjust in the module if need be.
  3. Australian Bureau of Statistics (2021). Australian Statistical Geography Standard (ASGS) Edition 3. ABS Standards. Retrieved 19 July 2022.
    Figure 3: Map showing the difference between UCL, SUA and GCCSA ABS geographic areas for a capital city (where Infobox type = ‘city’)
  4. The UCL value was settled on as the most appropriate geographic area to represent cities (and towns) because it represents where people actually live now as opposed to zoned metropolitan boundaries of where people will live in the future (see Figure 3). UCL boundaries are defined using density thresholds. The other common choices for cities are the Greater Capital City Statistical Areas (GCCSA) and Significant Urban Areas (SUA). The following ABS definitions are useful:
    1. “Urban Centres and Localities (UCLs) are aggregations of SA1s which meet population density criteria or contain other urban infrastructure.”(See here).
    2. “GCCSA boundaries represent labour markets and the functional area of Australian capital cities respectively. They are designed with an emphasis on stability over time to support the time series of statistical releases...”(See here).
    3. “Significant Urban Areas (SUAs) represent individual Urban Centres or clusters of related Urban Centres with a core urban population over 10,000 people.”(See here).
  5. Not all towns have Urban Centre and Locality (UCL) geographic areas defined for them. This is particularly true for some small Aboriginal and Torres Strait Islander communities - in which case Indigenous Locations (ILOC) or Suburbs and Localities (SAL) are the next best options. The main difference between UCL and ILOC population values (in this small community context) is that ILOC counts only include Aboriginal and Torres Strait Islander people. Figure 4 shows the difference between the ILOC and SAL areas for a community without a UCL area.
    Figure 4: Map showing the difference between ILOC and SAL ABS geographic areas for a small community (where Infobox type = ‘town’)


    As at July 2022 ILOC population values have not been bulk uploaded to Wikidata so most small communities have SAL population values.


    For towns which have a UCL area but the population value has not been entered into Wikidata the Suburb and Locality (SAL) area can be used as a substitute, although it generally covers a spatial area that differs from the perceived town area. Figure 5 shows the difference between the UCL, SAL and LGA boundaries for a large town.
    Figure 5: Map showing the difference between UCL, SAL and LGA ABS geographic areas for a large town (where Infobox type = ‘town’)
  6. This Output Scenario 3 will be uncommon but may occur for cities until October 2022 when the 2021 UCL population counts are released and uploaded to Wikidata. Until then other types of geographies and/or old data values may show in the Infobox when using this module for city populations. This scenario will also occur for ill-defined places like regions.