This image of the top rows from the year 2000 dataset show the columns of interest and tasks that need to be performed to prepare this data for use in harmonizing the occupation classifications and import the data into R.
- B gives the name of the statistic, which is number of employees.
- D gives the status of employment, which in this case is the total for all statuses.
- F gives the gender.
- H lists the minor occupations including their numbers as used in the 2000 population census.
- J gives the census year.
- L gives the value of this statistic.
The columns of interest are as follows:
The codes in columns A, C, E, G and I appear to be specific to e-Stat's query language and API. (An example of using R to query e-Stat is at https://qiita.com/nozma/items/778843c757e42a4a63a4. Using the API will be left to a future task.)
The first subtask is to create a column that can be used as a basis for constructing the harmonized occupation classification and to merge this data with it. This column can be created from the census year (column J) and the occupation numbers given in column H. The cat03_code numbers in column G are an alternative to the occupation numbers, but aren't used here because they appear to be specific to this dataset.
First, correct a "feature" of this file. The occupation numbers of the first 9 occupations are in wide characters. These must be corrected to ASCII numerals. The easiest way to do this is to use a plain text editor such as Notepad. So far, I have only observed this feature in this download from the 2000 census, but researchers should be aware that it may exist elsewhere. (The CSV file in the repository is the corrected version.)
The remainder of this task is done in Microsoft Excel.
The CSV file 2000_totals_minor_occupations.csv is imported into Excel, a new worksheet created with its first three columns labeled 'OccCode', 'Gender' and 'value'.
Under OccCode, make an occupation code by concatenating the census year, zero padding, and the occupation number.
=CONCATENATE(LEFT('2000_totals_minor_occupations'!J2,4),"00",LEFT('2000_totals_minor_occupations'!H2,1))
Use the conditional IF to select 'MALE' or 'FEMALE' under Gender. Simply copy the value.
=IF('2000_totals_minor_occupations'!F2="男","MALE","FEMALE")
These cells can be replicated down the length of the dataset. The function for computing OccCode must be modified for 2-digit and 3-digit occupation numbers.
=CONCATENATE(LEFT('2000_totals_minor_occupations'!J11,4),"0",LEFT('2000_totals_minor_occupations'!H11,2))
and
=CONCATENATE(LEFT('2000_totals_minor_occupations'!J101,4),LEFT('2000_totals_minor_occupations'!H101,3))
If the formulas are converted to values, this XSLX file can be used directly in R.