Preprocessing data for harmonizing 1985-2005 occupation classifications

Alan Engel
Last update: 30 November 2021

This image of the top rows from the year 2000 dataset show the columns of interest and tasks that need to be performed to prepare this data for use in harmonizing the occupation classifications and import the data into R.

Top rows of 2000 data file
Top rows of 2000 data file

The codes in columns A, C, E, G and I appear to be specific to e-Stat's query language and API. (An example of using R to query e-Stat is at https://qiita.com/nozma/items/778843c757e42a4a63a4. Using the API will be left to a future task.)

The first subtask is to create a column that can be used as a basis for constructing the harmonized occupation classification and to merge this data with it. This column can be created from the census year (column J) and the occupation numbers given in column H. The cat03_code numbers in column G are an alternative to the occupation numbers, but aren't used here because they appear to be specific to this dataset.

First, correct a "feature" of this file. The occupation numbers of the first 9 occupations are in wide characters. These must be corrected to ASCII numerals. The easiest way to do this is to use a plain text editor such as Notepad. So far, I have only observed this feature in this download from the 2000 census, but researchers should be aware that it may exist elsewhere. (The CSV file in the repository is the corrected version.)

The remainder of this task is done in Microsoft Excel.

The CSV file 2000_totals_minor_occupations.csv is imported into Excel, a new worksheet created with its first three columns labeled 'OccCode', 'Gender' and 'value'.

Import CSV into Excel and create new worksheet
Import CSV into Excel and create new worksheet

Under OccCode, make an occupation code by concatenating the census year, zero padding, and the occupation number.

=CONCATENATE(LEFT('2000_totals_minor_occupations'!J2,4),"00",LEFT('2000_totals_minor_occupations'!H2,1))

Use the conditional IF to select 'MALE' or 'FEMALE' under Gender. Simply copy the value.

=IF('2000_totals_minor_occupations'!F2="男","MALE","FEMALE")

These cells can be replicated down the length of the dataset. The function for computing OccCode must be modified for 2-digit and 3-digit occupation numbers.

=CONCATENATE(LEFT('2000_totals_minor_occupations'!J11,4),"0",LEFT('2000_totals_minor_occupations'!H11,2))
and
=CONCATENATE(LEFT('2000_totals_minor_occupations'!J101,4),LEFT('2000_totals_minor_occupations'!H101,3))
Extracted dataset
Extracted dataset

If the formulas are converted to values, this XSLX file can be used directly in R.