Data Cleaning: From KOSIS Raw Data to Tidy Format

Introduction

This vignette demonstrates how to clean and transform raw Korean census data from the Korean Statistical Information Service (KOSIS) API into a tidy, analysis-ready format. While there are existing R packages for accessing Korean statistical data (such as the kosis package for general KOSIS data access), there are currently no readily available software solutions specifically designed for retrieving and processing Korean census data in a tidy, analysis-friendly format for demographic and socioeconomic research.

The Challenge

Korean census and administrative data presents several challenges for researchers:

Complex API URLs: KOSIS APIs require detailed parameter specifications with cryptic codes
Multiple data sources: Different datasets (population, tax, mortality) come from different API endpoints
Inconsistent formatting: Raw data often comes in wide format with Korean-language column names
Administrative code mapping: Different datasets use different administrative code systems
Data integration: Combining multiple datasets requires careful matching of administrative units

This vignette shows how the tidycensuskr package addresses these challenges by providing pre-processed, tidy datasets and demonstrating the underlying data cleaning workflow.

Raw Data Sources

The data cleaning process involves retrieving data from multiple KOSIS API endpoints:

You need to know following information to retrieve data from KOSIS API:

API Key: You need to register for an API key from KOSIS.
API Parameters: Each API endpoint requires specific parameters such as itmId, objL1, objL2, etc. These parameters define the data you want to retrieve.
Output Format: The API supports various output formats, but we will use JSON for easier processing in R.

API call parameters

Parameter	Type	Description	Required?
`apiKey`	String	Your KOSIS API key	Yes
`orgId`	String	Organization identifier	Yes
`tblId`	String	Table identifier	Yes
`objL1`	String	First class code	Yes
`objL2`-`objL8`	String	Second to eighth class code	Optional
`itmId`	String	Item identifier(s)	Yes
`prdSe`	String	Update cycle	Yes
`format`	String	Output format (e.g., JSON)	Yes

Data variables

To note, an output can include the following variables:

Field name (all caps)	Description	Data type	Remarks
ORG_ID	Organization code	VARCHAR2(40)	Yes
TBL_ID	Table identifier	VARCHAR2(40)
TBL_NM	Table name	VARCHAR2(300)
C1 - C8	Class identifier (1-8)	VARCHAR2(40)	2-8 may be omitted if nonexisting
C1_OBJ_NM - C8_OBJ_NM	Class code (1-8)	VARCHAR2(3000)
C1_OBJ_NM_ENG - C8_OBJ_NM_ENG	Class code in English (1-8)	VARCHAR2(3000)
C1_NM - C8_NM	Class name (1-8)	VARCHAR2(3000)
C1_NM_ENG - C8_NM_ENG	Class name in English (1-8)	VARCHAR2(3000)
ITM_ID	Item identifier	VARCHAR2(40)
ITM_NM	Item name	VARCHAR2(3000)
ITM_NM_ENG	Item name in English	VARCHAR2(3000)
UNIT_ID	Unit identifier	VARCHAR2(40)
UNIT_NM	Unit name	VARCHAR2(1000)
UNIT_NM_ENG	Unit name in English	VARCHAR2(1000)
PRD_SE	Data update cycle	VARCHAR2(20)
PRD_DE	Data period	VARCHAR2(8)
DT	Data value	VARCHAR2(100)
LST_CHN_DE	Date of change	VARCHAR2(8)

It is worth noting that all fields are character, and numeric values are stored as strings. This means that you will need to convert them to numeric types after loading the data.

API Key

To use the KOSIS API, you need to register for an API key. You can obtain one by visiting the KOSIS API registration page and following the instructions.

The webpage is only available in Korean and may not be available outside of Korea.

Example URL

The URL below is an example of how to retrieve general tax data from KOSIS. You will need to replace 인증키없음 with your actual API key. Dissecting this URL can help you understand how to construct your own API calls.

url_tax_general <- 
  "https://kosis.kr/openapi/Param/statisticsParameterData.do?method=getList&apiKey=인증키없음&itmId=T001+&objL1=A0201+A0202+A0203+A0204+A0205+A0206+A0207+A0208+A0209+A0210+A0211+A0212+A0213+A0214+A0215+A0216+A0217+A0218+A0219+A0220+A0221+A0222+A0223+A0224+A0225+A0301+A0302+A0303+A0304+A0305+A0306+A0307+A0308+A0309+A0310+A0401+A0402+A0403+A0404+A0405+A0406+A0407+A0408+A0409+A0410+A0411+A0412+A0413+A0414+A0415+A0416+A0417+A0418+A0419+A0420+A0421+A0422+A0423+A0424+A0425+A0426+A0427+A0428+A0429+A0430+A0431+A0501+A0502+A0503+A0504+A0505+A0506+A0507+A0508+A0509+A0510+A0511+A0512+A0513+A0514+A0515+A0516+A0517+A0518+A0601+A0602+A0603+A0604+A0605+A0701+A0702+A0703+A0704+A0705+A0706+A0707+A0708+A0709+A0710+A0711+A0801+A0802+A0803+A0804+A0805+A0806+A0807+A0808+A0809+A0810+A0811+A0812+A0813+A0814+A0815+A09+A1001+A1002+A1003+A1004+A1005+A1101+A1102+A1103+A1104+A1105+A1106+A1107+A1108+A1109+A1110+A1111+A1112+A1113+A1114+A1201+A1202+A1203+A1204+A1205+A1206+A1207+A1208+A1209+A1210+A1211+A1212+A1213+A1214+A1215+A1216+A1217+A1218+A1219+A1220+A1221+A1222+A1309+A1301+A1302+A1303+A1304+A1305+A1306+A1307+A1308+A1401+A1402+A1403+A1404+A1405+A1406+A1407+A1408+A1409+A1410+A1411+A1412+A1413+A1414+A1415+A1416+A1417+A1418+A1419+A1420+A1421+A1422+A1423+A1501+A1502+A1503+A1504+A1505+A1506+A1507+A1508+A1509+A1510+A1511+A1512+A1513+A1514+A1515+A1516+A1601+A1602+A1603+A1604+A1605+A1701+A1702+A1703+A1704+A1705+A1706+A1707+A1708+A1709+A1710+A1711+A1712+A1713+A1714+A1715+A1716+A1717+A1718+A1802+A1801+&objL2=15133SGH0M+&objL3=&objL4=&objL5=&objL6=&objL7=&objL8=&format=json&jsonVD=Y&prdSe=Y&newEstPrdCnt=1&outputFields=TBL_ID+TBL_NM+OBJ_ID+OBJ_NM+OBJ_NM_ENG+NM+NM_ENG+ITM_ID+ITM_NM+ITM_NM_ENG+UNIT_NM+UNIT_NM_ENG+&orgId=133&tblId=DT_133N_A3212"

Parameter	Value
`apiKey`	“인증키없음” (meaning “no authentication”) as it is for demonstration.
`orgId`	“133”
`tblId`	“DT_133N_A3212”
`objL1`	“A0201”, “A0202”, … , “A1801”
`objL2`	“15133SGH0M”
`itmId`	“T001”
`prdSe`	“Y”
`newEstPrdCnt`	“1”
`outputFields`	“TBL_ID”, “TBL_NM”, … , “UNIT_NM_ENG”

Adding more context, multiple values for objL1 and itmId can be specified by separating them with a plus sign (+). For example, itmId=T001+T002+T003 would retrieve data for multiple items.

Try interpreting the following URL to understand how to construct your own API calls:

url_pop <-
  "https://kosis.kr/openapi/Param/statisticsParameterData.do?method=getList&apiKey=인증키없음&itmId=T00+T60+&objL1=11010+11020+11030+11040+11050+11060+11070+11080+11090+11100+11110+11120+11130+11140+11150+11160+11170+11180+11190+11200+11210+11220+11230+11240+11250+21010+21020+21030+21040+21050+21060+21070+21080+21090+21100+21110+21120+21130+21140+21150+21510+22010+22020+22030+22040+22050+22060+22070+22510+22520+23010+23020+23030+23040+23050+23060+23070+23080+23090+23510+23520+24010+24020+24030+24040+24050+25010+25020+25030+25040+25050+26010+26020+26030+26040+26510+29010+31010+31011+31012+31013+31014+31020+31021+31022+31023+31030+31040+31041+31042+31050+31051+31052+31053+31060+31070+31080+31090+31091+31092+31100+31101+31103+31104+31110+31120+31130+31140+31150+31160+31170+31180+31190+31191+31192+31193+31200+31210+31220+31230+31240+31250+31260+31270+31280+31550+31570+31580+32010+32020+32030+32040+32050+32060+32070+32510+32520+32530+32540+32550+32560+32570+32580+32590+32600+32610+33020+33030+33040+33041+33042+33043+33044+33520+33530+33540+33550+33560+33570+33580+33590+34010+34011+34012+34020+34030+34040+34050+34060+34070+34080+34510+34530+34540+34550+34560+34570+34580+35010+35011+35012+35020+35030+35040+35050+35060+35510+35520+35530+35540+35550+35560+35570+35580+36010+36020+36030+36040+36060+36510+36520+36530+36550+36560+36570+36580+36590+36600+36610+36620+36630+36640+36650+36660+36670+36680+37010+37011+37012+37020+37030+37040+37050+37060+37070+37080+37090+37100+37510+37520+37530+37540+37550+37560+37570+37580+37590+37600+37610+37620+37630+38030+38050+38060+38070+38080+38090+38100+38110+38111+38112+38113+38114+38115+38510+38520+38530+38540+38550+38560+38570+38580+38590+38600+39010+39020+&objL2=ALL&objL3=000+&objL4=&objL5=&objL6=&objL7=&objL8=&format=json&jsonVD=Y&prdSe=Y&newEstPrdCnt=1&outputFields=TBL_ID+TBL_NM+OBJ_ID+OBJ_NM+OBJ_NM_ENG+NM+NM_ENG+ITM_ID+ITM_NM+ITM_NM_ENG+UNIT_NM+UNIT_NM_ENG+&orgId=101&tblId=DT_1IN1509"

Data Retrieval

Setting up API Access

library(tidycensuskr)
library(dplyr)
library(tidyr)
library(kosis)

# Set KOSIS API key (stored in a secure file)
kosiskey <- readLines("~/.kosiskey")[1]
tidycensuskr::set_kosis_key("~/.kosiskey")

Downloading Raw Data

# Download raw datasets from KOSIS API
df_tax <- kosis::getStatDataFromURL(url_tax_general)

# Load administrative district lookup table
sgg_lookup <-
  read.csv(
    system.file(
      file.path("extdata", "lookup_district_code.csv"),
      package = "tidycensuskr"
    ),
    fileEncoding = "EUC-KR"
  )

Data Cleaning Workflow

1. Administrative Code Mapping

The first challenge is creating a consistent mapping between different administrative code systems used across datasets:

# Create administrative code mapping for provinces (sido)
sidocd_range <- tibble::tribble(
  ~sido_kr, ~sido_cd, ~sido_txcd,
  "서울특별시", "11", "02",
  "부산광역시", "21", "15", 
  "대구광역시", "22", "13",
  "인천광역시", "23", "03",
  "광주광역시", "24", "10",
  "대전광역시", "25", "06",
  "울산광역시", "26", "16",
  "세종특별자치시", "29", "09",
  "경기도", "31", "04",
  "강원특별자치도", "32", "05",
  "충청북도", "33", "07",
  "충청남도", "34", "08",
  "전라북도", "35", "11",
  "전라남도", "36", "12",
  "경상북도", "37", "14",
  "경상남도", "38", "17",
  "제주특별자치도", "39", "18"
)

2. Tax Data Processing

Transform raw tax data into a standardized format:

df_tax_compact <- df_tax |>
  dplyr::transmute(
    adm2_code = C1,           # Administrative code
    value = DT   # Tax value in million KRW
  ) |>
  dplyr::inner_join(
    sgg_lookup[, c("sgg_tax_global", "sido_en", "sigungu_1_en", "adm2_code")],
    multiple = "first"
  )

3. Population Data Processing

Clean and reshape population data with gender disaggregation:

df_pop2 <- df_pop |>
  dplyr::mutate(
    sex = plyr::mapvalues(C2, c(0, 1, 2), c("total", "male", "female")),
    type = plyr::mapvalues(ITM_ID, c("T00", "T60"), c("population_total", "population_nonrelative"))
  ) |>
  dplyr::select(C1, C1_NM, sex, type, DT) |>
  tidyr::pivot_wider(
    names_from = c(type, sex),
    values_from = DT
  ) |>
  dplyr::rename(
    sigungu_cd = C1,
    sigungu_kr = C1_NM
  ) |>
  dplyr::mutate(
    adm2_code = as.integer(adm2_code)
  ) |>
  dplyr::inner_join(
    sgg_lookup[, c("adm2_code", "sido_en", "sigungu_1_en")],
    by = "adm2_code",
    multiple = "first"
  )

Transforming to Tidy Format

The key feature of tidycensuskr is converting wide-format, multiple-dataset structure into a single, long-format tidy dataset:

1. Convert Each Dataset to Long Format

# Tax data to long format
df_tax_long <- df_tax_compact |>
  dplyr::select(2:5) |>
  tidyr::pivot_longer(
    cols = "value"
  ) |>
  dplyr::mutate(
    type = "tax",
    class1 = "global",
    class2 = "total",
    unit = "million KRW"
  ) |>
  dplyr::select(-name)

# Population data to long format
df_pop_long <- df_pop2 |>
  dplyr::select(-2) |>
  tidyr::pivot_longer(
    cols = 2:7
  ) |>
  tidyr::separate(col = "name", into = c("type", "class1", "class2"), sep = "_")

2. Combine into Single Tidy Dataset

# Bind all datasets into one comprehensive long-format dataset
censuskor <- dplyr::bind_rows(
    df_tax_long,
    df_pop_long
  ) |>
  dplyr::rename(
    adm1 = sido_en,
    adm2 = sigungu_1_en
  ) |>
  dplyr::mutate(
    year = 2020
  ) |>
  dplyr::select(
    year, adm1, adm2, adm2_other, adm2_code,
    type, class1, class2, unit, value
  )

The Result: A Tidy Dataset

The final censuskor dataset has a consistent structure:

year: Census year (2020)
adm1: Province-level administrative name (in English)
adm2: District-level administrative name (in English)
adm2_code: Numeric administrative code
type: Data type (“population”, “tax”, “mortality”)
class1: First-level classification (e.g., “global”, “income”, “All causes”)
class2: Second-level classification (e.g., “total”, “male”, “female”)
unit: Unit of measurement
value: Numeric value

Example Usage

library(tidycensuskr)
library(dplyr)

# Load the cleaned dataset
data(censuskor)

# View the structure
head(censuskor)

# Filter for Seoul population data
seoul_pop <- censuskor |>
  filter(adm1 == "Seoul", type == "population", class1 == "population") |>
  select(adm2, class2, value) |>
  pivot_wider(names_from = class2, values_from = value)

head(seoul_pop)

These data can be joined with sf objects bundled in the package to make maps and perform spatial analysis

Notes on `adm2_code` changes

Districts (si-gun-gu) are the second-level administrative units in South Korea. Each district is assigned a unique code called adm2_code. However, these codes can change over time due to administrative boundary changes, mergers, or reclassifications.

Notable changes were made in all rural districts (gun) in 2022; the third digit of adm2_code got addition of 2. For example, Ulleung-gun changed from 37430 to 37630. This change may affect longitudinal analyses that span multiple years and pose challenges in updating censuskor dataset by developers. Some datasets in KOSIS are retrospectively updated to reflect the new codes, while most KOSIS datasets retain the old codes. Therefore, developers should be advised to check the adm2_code for each dataset and year with this code:

# imported data
some_census_table

# check adm2_code
any(substr(some_census_table$adm2_code, 3, 3) %in% c("5", "6"))

If TRUE, it indicates that the dataset uses the new adm2_code system (post-2022) regardless of the raw data year. If FALSE, it uses the old system (pre-2022).

Note: This vignette shows the data cleaning process for developers. The actual censuskor dataset in the package is the result of this workflow and is ready for immediate use.