This vignette demonstrates how to clean and transform raw Korean
census data from the Korean Statistical Information Service (KOSIS) API
into a tidy, analysis-ready format. While there are existing R packages
for accessing Korean statistical data (such as the kosis
package for general KOSIS data access), there are currently no
readily available software solutions specifically designed for
retrieving and processing Korean census data in a tidy,
analysis-friendly format for demographic and socioeconomic
research.
Korean census and administrative data presents several challenges for researchers:
This vignette shows how the tidycensuskr package
addresses these challenges by providing pre-processed, tidy datasets and
demonstrating the underlying data cleaning workflow.
The data cleaning process involves retrieving data from multiple KOSIS API endpoints:
You need to know following information to retrieve data from KOSIS API:
itmId, objL1,
objL2, etc. These parameters define the data you want to
retrieve.| Parameter | Type | Description | Required? |
|---|---|---|---|
apiKey |
String | Your KOSIS API key | Yes |
orgId |
String | Organization identifier | Yes |
tblId |
String | Table identifier | Yes |
objL1 |
String | First class code | Yes |
objL2-objL8 |
String | Second to eighth class code | Optional |
itmId |
String | Item identifier(s) | Yes |
prdSe |
String | Update cycle | Yes |
format |
String | Output format (e.g., JSON) | Yes |
To note, an output can include the following variables:
| Field name (all caps) | Description | Data type | Remarks |
|---|---|---|---|
| ORG_ID | Organization code | VARCHAR2(40) | Yes |
| TBL_ID | Table identifier | VARCHAR2(40) | |
| TBL_NM | Table name | VARCHAR2(300) | |
| C1 - C8 | Class identifier (1-8) | VARCHAR2(40) | 2-8 may be omitted if nonexisting |
| C1_OBJ_NM - C8_OBJ_NM | Class code (1-8) | VARCHAR2(3000) | |
| C1_OBJ_NM_ENG - C8_OBJ_NM_ENG | Class code in English (1-8) | VARCHAR2(3000) | |
| C1_NM - C8_NM | Class name (1-8) | VARCHAR2(3000) | |
| C1_NM_ENG - C8_NM_ENG | Class name in English (1-8) | VARCHAR2(3000) | |
| ITM_ID | Item identifier | VARCHAR2(40) | |
| ITM_NM | Item name | VARCHAR2(3000) | |
| ITM_NM_ENG | Item name in English | VARCHAR2(3000) | |
| UNIT_ID | Unit identifier | VARCHAR2(40) | |
| UNIT_NM | Unit name | VARCHAR2(1000) | |
| UNIT_NM_ENG | Unit name in English | VARCHAR2(1000) | |
| PRD_SE | Data update cycle | VARCHAR2(20) | |
| PRD_DE | Data period | VARCHAR2(8) | |
| DT | Data value | VARCHAR2(100) | |
| LST_CHN_DE | Date of change | VARCHAR2(8) |
It is worth noting that all fields are character, and numeric values are stored as strings. This means that you will need to convert them to numeric types after loading the data.
To use the KOSIS API, you need to register for an API key. You can obtain one by visiting the KOSIS API registration page and following the instructions.
The webpage is only available in Korean and may not be available outside of Korea.
The URL below is an example of how to retrieve general tax data from
KOSIS. You will need to replace 인증키없음 with your actual
API key. Dissecting this URL can help you understand how to construct
your own API calls.
url_tax_general <-
"https://kosis.kr/openapi/Param/statisticsParameterData.do?method=getList&apiKey=인증키없음&itmId=T001+&objL1=A0201+A0202+A0203+A0204+A0205+A0206+A0207+A0208+A0209+A0210+A0211+A0212+A0213+A0214+A0215+A0216+A0217+A0218+A0219+A0220+A0221+A0222+A0223+A0224+A0225+A0301+A0302+A0303+A0304+A0305+A0306+A0307+A0308+A0309+A0310+A0401+A0402+A0403+A0404+A0405+A0406+A0407+A0408+A0409+A0410+A0411+A0412+A0413+A0414+A0415+A0416+A0417+A0418+A0419+A0420+A0421+A0422+A0423+A0424+A0425+A0426+A0427+A0428+A0429+A0430+A0431+A0501+A0502+A0503+A0504+A0505+A0506+A0507+A0508+A0509+A0510+A0511+A0512+A0513+A0514+A0515+A0516+A0517+A0518+A0601+A0602+A0603+A0604+A0605+A0701+A0702+A0703+A0704+A0705+A0706+A0707+A0708+A0709+A0710+A0711+A0801+A0802+A0803+A0804+A0805+A0806+A0807+A0808+A0809+A0810+A0811+A0812+A0813+A0814+A0815+A09+A1001+A1002+A1003+A1004+A1005+A1101+A1102+A1103+A1104+A1105+A1106+A1107+A1108+A1109+A1110+A1111+A1112+A1113+A1114+A1201+A1202+A1203+A1204+A1205+A1206+A1207+A1208+A1209+A1210+A1211+A1212+A1213+A1214+A1215+A1216+A1217+A1218+A1219+A1220+A1221+A1222+A1309+A1301+A1302+A1303+A1304+A1305+A1306+A1307+A1308+A1401+A1402+A1403+A1404+A1405+A1406+A1407+A1408+A1409+A1410+A1411+A1412+A1413+A1414+A1415+A1416+A1417+A1418+A1419+A1420+A1421+A1422+A1423+A1501+A1502+A1503+A1504+A1505+A1506+A1507+A1508+A1509+A1510+A1511+A1512+A1513+A1514+A1515+A1516+A1601+A1602+A1603+A1604+A1605+A1701+A1702+A1703+A1704+A1705+A1706+A1707+A1708+A1709+A1710+A1711+A1712+A1713+A1714+A1715+A1716+A1717+A1718+A1802+A1801+&objL2=15133SGH0M+&objL3=&objL4=&objL5=&objL6=&objL7=&objL8=&format=json&jsonVD=Y&prdSe=Y&newEstPrdCnt=1&outputFields=TBL_ID+TBL_NM+OBJ_ID+OBJ_NM+OBJ_NM_ENG+NM+NM_ENG+ITM_ID+ITM_NM+ITM_NM_ENG+UNIT_NM+UNIT_NM_ENG+&orgId=133&tblId=DT_133N_A3212"| Parameter | Value |
|---|---|
apiKey |
“인증키없음” (meaning “no authentication”) as it is for demonstration. |
orgId |
“133” |
tblId |
“DT_133N_A3212” |
objL1 |
“A0201”, “A0202”, … , “A1801” |
objL2 |
“15133SGH0M” |
itmId |
“T001” |
prdSe |
“Y” |
newEstPrdCnt |
“1” |
outputFields |
“TBL_ID”, “TBL_NM”, … , “UNIT_NM_ENG” |
Adding more context, multiple values for objL1 and
itmId can be specified by separating them with a plus sign
(+). For example, itmId=T001+T002+T003 would
retrieve data for multiple items.
Try interpreting the following URL to understand how to construct your own API calls:
url_pop <-
"https://kosis.kr/openapi/Param/statisticsParameterData.do?method=getList&apiKey=인증키없음&itmId=T00+T60+&objL1=11010+11020+11030+11040+11050+11060+11070+11080+11090+11100+11110+11120+11130+11140+11150+11160+11170+11180+11190+11200+11210+11220+11230+11240+11250+21010+21020+21030+21040+21050+21060+21070+21080+21090+21100+21110+21120+21130+21140+21150+21510+22010+22020+22030+22040+22050+22060+22070+22510+22520+23010+23020+23030+23040+23050+23060+23070+23080+23090+23510+23520+24010+24020+24030+24040+24050+25010+25020+25030+25040+25050+26010+26020+26030+26040+26510+29010+31010+31011+31012+31013+31014+31020+31021+31022+31023+31030+31040+31041+31042+31050+31051+31052+31053+31060+31070+31080+31090+31091+31092+31100+31101+31103+31104+31110+31120+31130+31140+31150+31160+31170+31180+31190+31191+31192+31193+31200+31210+31220+31230+31240+31250+31260+31270+31280+31550+31570+31580+32010+32020+32030+32040+32050+32060+32070+32510+32520+32530+32540+32550+32560+32570+32580+32590+32600+32610+33020+33030+33040+33041+33042+33043+33044+33520+33530+33540+33550+33560+33570+33580+33590+34010+34011+34012+34020+34030+34040+34050+34060+34070+34080+34510+34530+34540+34550+34560+34570+34580+35010+35011+35012+35020+35030+35040+35050+35060+35510+35520+35530+35540+35550+35560+35570+35580+36010+36020+36030+36040+36060+36510+36520+36530+36550+36560+36570+36580+36590+36600+36610+36620+36630+36640+36650+36660+36670+36680+37010+37011+37012+37020+37030+37040+37050+37060+37070+37080+37090+37100+37510+37520+37530+37540+37550+37560+37570+37580+37590+37600+37610+37620+37630+38030+38050+38060+38070+38080+38090+38100+38110+38111+38112+38113+38114+38115+38510+38520+38530+38540+38550+38560+38570+38580+38590+38600+39010+39020+&objL2=ALL&objL3=000+&objL4=&objL5=&objL6=&objL7=&objL8=&format=json&jsonVD=Y&prdSe=Y&newEstPrdCnt=1&outputFields=TBL_ID+TBL_NM+OBJ_ID+OBJ_NM+OBJ_NM_ENG+NM+NM_ENG+ITM_ID+ITM_NM+ITM_NM_ENG+UNIT_NM+UNIT_NM_ENG+&orgId=101&tblId=DT_1IN1509"The first challenge is creating a consistent mapping between different administrative code systems used across datasets:
# Create administrative code mapping for provinces (sido)
sidocd_range <- tibble::tribble(
~sido_kr, ~sido_cd, ~sido_txcd,
"서울특별시", "11", "02",
"부산광역시", "21", "15",
"대구광역시", "22", "13",
"인천광역시", "23", "03",
"광주광역시", "24", "10",
"대전광역시", "25", "06",
"울산광역시", "26", "16",
"세종특별자치시", "29", "09",
"경기도", "31", "04",
"강원특별자치도", "32", "05",
"충청북도", "33", "07",
"충청남도", "34", "08",
"전라북도", "35", "11",
"전라남도", "36", "12",
"경상북도", "37", "14",
"경상남도", "38", "17",
"제주특별자치도", "39", "18"
)Transform raw tax data into a standardized format:
Clean and reshape population data with gender disaggregation:
df_pop2 <- df_pop |>
dplyr::mutate(
sex = plyr::mapvalues(C2, c(0, 1, 2), c("total", "male", "female")),
type = plyr::mapvalues(ITM_ID, c("T00", "T60"), c("population_total", "population_nonrelative"))
) |>
dplyr::select(C1, C1_NM, sex, type, DT) |>
tidyr::pivot_wider(
names_from = c(type, sex),
values_from = DT
) |>
dplyr::rename(
sigungu_cd = C1,
sigungu_kr = C1_NM
) |>
dplyr::mutate(
adm2_code = as.integer(adm2_code)
) |>
dplyr::inner_join(
sgg_lookup[, c("adm2_code", "sido_en", "sigungu_1_en")],
by = "adm2_code",
multiple = "first"
)The key feature of tidycensuskr is converting
wide-format, multiple-dataset structure into a single, long-format tidy
dataset:
# Tax data to long format
df_tax_long <- df_tax_compact |>
dplyr::select(2:5) |>
tidyr::pivot_longer(
cols = "value"
) |>
dplyr::mutate(
type = "tax",
class1 = "global",
class2 = "total",
unit = "million KRW"
) |>
dplyr::select(-name)
# Population data to long format
df_pop_long <- df_pop2 |>
dplyr::select(-2) |>
tidyr::pivot_longer(
cols = 2:7
) |>
tidyr::separate(col = "name", into = c("type", "class1", "class2"), sep = "_")# Bind all datasets into one comprehensive long-format dataset
censuskor <- dplyr::bind_rows(
df_tax_long,
df_pop_long
) |>
dplyr::rename(
adm1 = sido_en,
adm2 = sigungu_1_en
) |>
dplyr::mutate(
year = 2020
) |>
dplyr::select(
year, adm1, adm2, adm2_other, adm2_code,
type, class1, class2, unit, value
)The final censuskor dataset has a consistent
structure:
library(tidycensuskr)
library(dplyr)
# Load the cleaned dataset
data(censuskor)
# View the structure
head(censuskor)
# Filter for Seoul population data
seoul_pop <- censuskor |>
filter(adm1 == "Seoul", type == "population", class1 == "population") |>
select(adm2, class2, value) |>
pivot_wider(names_from = class2, values_from = value)
head(seoul_pop)These data can be joined with sf objects bundled in the
package to make maps and perform spatial analysis
adm2_code changesDistricts (si-gun-gu) are the second-level administrative
units in South Korea. Each district is assigned a unique code called
adm2_code. However, these codes can change over time due to
administrative boundary changes, mergers, or reclassifications.
Notable changes were made in all rural districts (gun) in
2022; the third digit of adm2_code got addition of 2. For
example, Ulleung-gun changed from 37430 to
37630. This change may affect longitudinal analyses that
span multiple years and pose challenges in updating
censuskor dataset by developers. Some datasets in KOSIS are
retrospectively updated to reflect the new codes, while
most KOSIS datasets retain the old codes. Therefore, developers should
be advised to check the adm2_code for each dataset and year
with this code:
# imported data
some_census_table
# check adm2_code
any(substr(some_census_table$adm2_code, 3, 3) %in% c("5", "6"))If TRUE, it indicates that the dataset uses the
new adm2_code system (post-2022)
regardless of the raw data year. If FALSE, it uses the
old system (pre-2022).
Note: This vignette shows the data cleaning process
for developers. The actual censuskor dataset in the package
is the result of this workflow and is ready for immediate use.