Questions about Data Use
1. How do I cite CFPS data?
Please indicate the source of the data wherever CFPS data are used. References should take the form below: “The data are from China Family Panel Studies (CFPS), funded by 985 Program of Peking University and carried out by the Institute of Social Science Survey of Peking University.”
2. Does CFPS provide only sequential county codes? I need to use county-level variables and hope to get the national standard codes of CFPS counties.
For confidentiality purposes, CFPS does not provide address information below the province level; only a recoded county code and community code are released. However, we have prepared a county-level file with variables on GDP, GDP per capita, population, employment, average educational level, proportion of working-age population, proportion of elderly population, sex distribution of the population aged 10 to 19, and proportion of non-agricultural population. You can apply for the restricted county-level dataset from here.
3. Why are there many missing values in variables that record specific amounts? What is the “unfolding” approach mentioned in the questionnaire?
When respondents were unwilling to provide a specific amount, we followed up with the unfolding approach. Such respondents therefore have a missing value on the exact-amount variable, while their answers to the unfolding questions bound the amount within a range. For instance, the question on personal income carries a string of thresholds ("2500/5000/7500/12000/18000/27000/40000/60000/90000/140000/210000/320000/480000"). In interviews, we started from the midpoint (40000 in the example above). For amounts less than the midpoint, we asked, “Is your total income below XX?” (XX stands for the amounts to the left of the midpoint.) For amounts greater than the midpoint, we asked, “Is your total income higher than XX?” (XX stands for the amounts greater than the midpoint.)
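The following is a minimal sketch of how the unfolding questions narrow an amount down to a bracket. It is an illustration only: the function name unfold_bracket, the step-by-step walk away from the midpoint, and the handling of amounts exactly equal to a threshold are assumptions, not the actual CFPS interview program.

```python
# Thresholds taken from the example string in the answer above.
THRESHOLDS = [2500, 5000, 7500, 12000, 18000, 27000, 40000,
              60000, 90000, 140000, 210000, 320000, 480000]


def unfold_bracket(true_amount, thresholds=THRESHOLDS):
    """Return the (lower, upper) bounds implied by yes/no answers.

    None marks an open-ended side. The walk order is an assumption.
    """
    mid = len(thresholds) // 2  # index of the midpoint (40000 here)
    if true_amount < thresholds[mid]:
        # Walk left, asking "Is your total income below XX?"
        i = mid
        while i > 0 and true_amount < thresholds[i - 1]:
            i -= 1
        lower = thresholds[i - 1] if i > 0 else None
        return (lower, thresholds[i])
    # Walk right, asking "Is your total income higher than XX?"
    i = mid
    while i < len(thresholds) - 1 and true_amount > thresholds[i + 1]:
        i += 1
    upper = thresholds[i + 1] if i < len(thresholds) - 1 else None
    return (thresholds[i], upper)


bracket = unfold_bracket(10000)  # a respondent whose income is 10000
```

Under these assumptions, a respondent with an income of 10000 ends up in the bracket between 7500 and 12000, which is the kind of range information available when the exact amount is missing.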
4. How can I match the individual information when doing longitudinal analysis?
The variable pid, which is the key identifier of individuals in CFPS, is unique and permanent. It can be used to match individual information across different CFPS datasets.
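As a rough illustration of matching on pid, the sketch below joins two waves held as lists of records. The records, the income field, and the helper name merge_by_pid are made up for the example; only pid comes from the CFPS data.

```python
def merge_by_pid(wave_a, wave_b):
    """Inner-join two lists of records on the 'pid' key."""
    b_by_pid = {row["pid"]: row for row in wave_b}
    merged = []
    for row in wave_a:
        match = b_by_pid.get(row["pid"])
        if match is not None:
            merged.append({"pid": row["pid"], "a": row, "b": match})
    return merged


# Hypothetical records standing in for two adult datasets.
adult_2010 = [{"pid": 101, "income": 12000}, {"pid": 102, "income": 30000}]
adult_2012 = [{"pid": 102, "income": 36000}, {"pid": 103, "income": 8000}]

panel = merge_by_pid(adult_2010, adult_2012)
```

Only individuals present in both waves appear in the result; in practice the same join would usually be done with the merge facility of a statistical package.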
5. Why are there many unmatched cases after merging the 2010 and 2012 child datasets?
There are three reasons: (1) the child is a new member in 2012 who was not in the 2010 sample; (2) a member above 13 years old in 2010 has been included in the adult dataset of 2012; (3) the attrition rate of the child sample is about 15% in 2012.
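One way to see which reason applies to an unmatched case is to compare the sets of pid values in the three datasets. This is a sketch under the assumption that the pid values have already been read into Python sets; the function name and the example values are hypothetical.

```python
def classify_children(child_2010, child_2012, adult_2012):
    """Group pid values by their status across the two waves.

    All three arguments are sets of pid values.
    """
    gone = child_2010 - child_2012  # in the 2010 child data only
    return {
        "matched": child_2010 & child_2012,
        "new_in_2012": child_2012 - child_2010,
        "moved_to_adult": gone & adult_2012,
        # Candidates for attrition: found in neither 2012 dataset.
        "not_found_2012": gone - adult_2012,
    }


groups = classify_children({1, 2, 3, 4}, {1, 5}, {2, 9})
```

The "not_found_2012" group is only a list of candidates for attrition; confirming the reason for each case requires the survey's own tracking information.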
6. How can I connect the information between children and parents?
If only basic information on children and parents is needed (e.g. age, marital status, education, household registration), you can find it in the family roster dataset. For children and parents who have individual questionnaires, find the parents’ or children’s pid in the family roster dataset and then link to the individual questionnaire by pid to obtain additional information.
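The two-step lookup described above might be sketched as follows. The field names pid_father and pid_mother, the education values, and the helper attach_parents are hypothetical placeholders; check the family roster codebook for the actual variable names.

```python
# Hypothetical roster rows: each child with the pid of each parent.
roster = [
    {"pid": 301, "pid_father": 201, "pid_mother": 202},
    {"pid": 302, "pid_father": None, "pid_mother": 202},
]

# Hypothetical adult questionnaire records, indexed by pid.
adults = {201: {"education": "college"}, 202: {"education": "high school"}}


def attach_parents(roster_rows, adult_by_pid):
    """Look up each parent's individual record through the roster pid."""
    linked = []
    for row in roster_rows:
        linked.append({
            "pid": row["pid"],
            # .get returns None when the parent has no individual record.
            "father": adult_by_pid.get(row["pid_father"]),
            "mother": adult_by_pid.get(row["pid_mother"]),
        })
    return linked


linked = attach_parents(roster, adults)
```

A parent without an individual questionnaire simply comes back as None, in which case only the roster information is available for that person.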
7. How can I determine the head of the household in the family questionnaire of 2010?
8. What does “adjusted” mean in the family datasets of 2010, as in “adjusted expense” or “adjusted family income”?
Adjustment mainly refers to the conversion of self-consumption in rural households into family income. See the 2010 Technical Report CFPS-14: China family panel studies of 2010 rural household income adjustment measures (in Chinese) for details.
9. How can the inconsistency between the variable “urban” and household registration status (hukou, wa4) be explained?
The variable “urban” is defined according to the records of the National Bureau of Statistics of China. The variable wa4 is the hukou status reported by respondents themselves. The inconsistency may be caused by changes in urban status over the years or by a mismatch between the official record and the individual report.
10. Is self-employed business income included in personal income in the adult datasets of 2010 and 2012?
The 2010 question on personal income, “How much is your approximate total personal income (all sources of income) last year?”, can be regarded as including self-employed business income.
Personal income in 2012 is the sum of wages, bonuses, benefits, pensions and scholarships. Personal income from farming and home business was not asked in the individual questionnaire, only in the family economy questionnaire, so a large proportion of the sample has no positive personal income.
11. Why do some cases have missing values of fswt_res (individual weight, which is National resampling samples divided by integration sample) in 2010?
fswt_res is positive only for a subsample of CFPS, identified by subsample=1. The weight variables can be used in two ways: (1) use the whole sample with the corresponding national weight (fswt_nat); (2) use the resampling sample with the resample weight (fswt_res). Users may refer to the relevant PPT slides (in Chinese), downloadable from here.
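The two ways of using the weights can be illustrated with a weighted mean. The sketch below uses invented records and a hypothetical helper, weighted_mean; only the variable names fswt_nat, fswt_res and subsample come from the answer above, and skipping non-positive weights is an assumption.

```python
def weighted_mean(rows, value_key, weight_key):
    """Weighted mean that skips missing values and non-positive weights."""
    num = 0.0
    den = 0.0
    for row in rows:
        w = row.get(weight_key)
        v = row.get(value_key)
        if w is None or v is None or w <= 0:
            continue
        num += w * v
        den += w
    return num / den if den > 0 else None


# Invented records; fswt_res is missing outside the resampling sample.
sample = [
    {"income": 10000, "fswt_nat": 2.0, "fswt_res": 1.0, "subsample": 1},
    {"income": 30000, "fswt_nat": 1.0, "fswt_res": None, "subsample": 0},
    {"income": 20000, "fswt_nat": 1.0, "fswt_res": 1.0, "subsample": 1},
]

# Way 1: whole sample with the national weight.
mean_nat = weighted_mean(sample, "income", "fswt_nat")

# Way 2: resampling sample only, with the resample weight.
resample = [r for r in sample if r["subsample"] == 1]
mean_res = weighted_mean(resample, "income", "fswt_res")
```

The point of the example is that each weight goes with its own sample: fswt_nat with all cases, fswt_res with the cases where subsample equals 1.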
12. How do I find the school information of those who attended colleges (variable QC306)?
This information is available in recoded form in the variable “collegetype” in the CFPS 2010 adult dataset.
13. Have the 2010 chronic disease data been released?
We have released the 2010 adult chronic disease information in coded form. The related variable names are QP501ACODE and QP501BCODE; see the codebook for details.
14. How can I match the 2010 and 2012 community IDs, which have different numbers of digits?
The first 5 digits of the community ID in 2012 correspond to those in 2010 and can be used to match the community data.
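A minimal sketch of this matching rule is given below. It assumes that taking the first 5 digits of both IDs yields a common key; the field name cid and the ID values are invented for illustration and do not correspond to real CFPS communities.

```python
def cid_key(cid):
    """First 5 digits of a community ID, used as the matching key."""
    return str(cid)[:5]


def match_communities(comm_2010, comm_2012):
    """Pair 2010 and 2012 community records that share the 5-digit key."""
    by_key = {cid_key(row["cid"]): row for row in comm_2010}
    pairs = []
    for row in comm_2012:
        old = by_key.get(cid_key(row["cid"]))
        if old is not None:
            pairs.append((old["cid"], row["cid"]))
    return pairs


# Invented IDs for illustration only.
pairs = match_communities(
    [{"cid": 12345}, {"cid": 54321}],
    [{"cid": 1234500}, {"cid": 9999900}],
)
```

Before relying on such a key, it is worth checking that it is unique within each year, since a duplicated key would silently drop records in this sketch.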
15. How should I interpret questions in multiple-choice format? Could you explain with an example?
For example, question qa7, “Which of the following organizations did you join?”, in the 2010 adult questionnaire offers 14 choices (multiple answers allowed), stored in qa7_s_1 to qa7_s_14. qa7_s_1 is the first organization the respondent selected, qa7_s_2 is the second organization the respondent selected, and so on.
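Because the variables record selections in order rather than one variable per option, analyses often need them reshaped into one 0/1 indicator per option. The sketch below shows one way to do that; it assumes the options are coded 1 to 14 and that missing or non-positive codes mean "no selection in this slot", and the names selected_options, to_indicators and qa7_opt1 to qa7_opt14 are hypothetical.

```python
def selected_options(row, prefix="qa7_s_", n_slots=14):
    """Collect the option codes stored in qa7_s_1 ... qa7_s_14, in order."""
    chosen = []
    for i in range(1, n_slots + 1):
        code = row.get(f"{prefix}{i}")
        # Assumption: None or non-positive codes are not real selections.
        if code is not None and code > 0:
            chosen.append(code)
    return chosen


def to_indicators(row, n_options=14):
    """One 0/1 flag per option, whatever order the options were selected in."""
    chosen = set(selected_options(row))
    return {f"qa7_opt{k}": int(k in chosen) for k in range(1, n_options + 1)}


# Invented respondent who selected option 3 first and option 7 second.
respondent = {"qa7_s_1": 3, "qa7_s_2": 7, "qa7_s_3": None}
flags = to_indicators(respondent)
```

Check the codebook for the actual option codes and missing-value codes before applying this kind of recoding.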
16. Is there a proxy-respondent variable in the child dataset?
waproxy is the proxy-respondent variable in the child dataset of 2010; it gives the order number of the proxy respondent among the family members.
kz1_b_1 and kz1_b_3 are the proxy-respondent variables in the child dataset of 2012; they also give the order number of the proxy respondent among the family members.
17. What does total household expenditure include?
Total household expenditure in CFPS can be categorized into the following four main types:
(1) Consumption Expenditure: Primarily covers daily household expenses such as food, clothing, housing, transportation, and utilities. Aligned with the National Bureau of Statistics' household expenditure categories, it is further divided into eight sub-items: food; clothing; housing; household equipment and supplies; transportation and communication; healthcare; education, culture, and entertainment; and other consumption expenditures. Household consumption expenditure is the most significant component, accounting for over 85% of total household expenditure.
(2) Transfer Expenditure: Mainly refers to financial support provided by the household to non-co-resident relatives and friends, as well as social donations.
(3) Welfare Expenditure: Primarily includes expenses for purchasing commercial insurance, medical insurance, and contributions to various pension insurance schemes.
(4) House Construction and Purchase Expenditure: Covers expenses related to building or purchasing a house (including mortgage payments).
Total household expenditure is the sum of the above four categories. If a household did not incur an expense in a certain category, it is recorded as 0. In the 2014 questionnaire, CFPS also asked directly about total household expenditure over the past 12 months, or the range of total expenditure. The comprehensive variable for total household expenditure is based on the sum of itemized expenditures. The total expenditure reported by the respondent is used for imputation only when the sum of itemized expenditures is less than 100 or missing (that is, when all expenditure items are recorded as Not Applicable/Refusal/Don't Know).
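The rule for choosing between the itemized sum and the reported total can be sketched as follows. This is an illustration rather than the CFPS production code: the function name and category keys are hypothetical, None stands in for Not Applicable/Refusal/Don't Know, and summing only the non-missing items when some (but not all) are missing is an assumption.

```python
def total_expenditure(items, reported_total):
    """Itemized sum, falling back to the respondent's reported total.

    items maps a category name to an amount, or None when the answer is
    Not Applicable/Refusal/Don't Know.
    """
    valid = [v for v in items.values() if v is not None]
    # The sum counts as missing only when every item is missing.
    itemized_sum = sum(valid) if valid else None
    if itemized_sum is None or itemized_sum < 100:
        return reported_total
    return itemized_sum


household = {"consumption": 30000, "transfer": 0,
             "welfare": 2000, "housing": None}
total = total_expenditure(household, reported_total=50000)
```

In this example the itemized sum is used because it is available and not below 100; the reported total of 50000 would be used only if every item were missing or the sum fell below 100.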
18. How does CFPS collect household expenditure data?
Similar to the survey on household income, CFPS collects household expenditure data based on the recall and responses of the respondent to the household economic questionnaire. Respondents for the household expenditure module are typically of two types: those familiar with the household's finances, or those responsible for food shopping for the household. Respondents should be family members.
Considering the varying frequency of different types of expenditures, CFPS uses three recall periods for household expenditure: the past week, the past month, and the past 12 months. More frequently consumed items use shorter recall periods (e.g., food), while less frequent items use longer recall periods (e.g., healthcare). Since the comprehensive variable for total household expenditure is based on a 12-month period, expenditures asked for the past week are converted to a 52-week equivalent (≈365/7), and those asked for the past month are converted to a 12-month equivalent.
Due to the switching between recall periods, some respondents might mistakenly report expenses for the past month as being for the past 12 months, or vice versa. To address this, CFPS implements soft checks for value ranges in the questionnaire. If a respondent's answer deviates significantly from the normal range, the computer prompts the interviewer to confirm the amount with the respondent. Furthermore, when constructing comprehensive variables for itemized expenditures, we use income quantiles to identify expenditure levels that severely deviate from the income level and adjust these outliers. For example, if a household's reported itemized expenditure for the past month exceeds 12 times the average expenditure for that item among households in the same income quantile, we infer that the household likely reported the expenditure based on a 12-month recall period. In such cases, we divide the reported expenditure by 12 as an adjustment.
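The conversion to a 12-month basis and the recall-period adjustment described above can be illustrated with a short sketch. The function names are hypothetical, and group_mean stands for the average monthly expenditure on the item among households in the same income quantile, which is assumed to have been computed beforehand.

```python
# Multipliers for converting each recall period to a 12-month basis.
ANNUAL_FACTOR = {"week": 52, "month": 12, "year": 1}


def annualize(amount, recall):
    """Convert an amount reported for a recall period to a yearly amount."""
    return amount * ANNUAL_FACTOR[recall]


def adjust_monthly_report(amount, group_mean):
    """Correct a monthly answer that looks like a 12-month answer.

    If the reported amount exceeds 12 times the group mean, it is treated
    as a yearly figure and divided by 12.
    """
    if group_mean is not None and amount > 12 * group_mean:
        return amount / 12
    return amount


food_per_year = annualize(200, "week")           # weekly food spending
suspicious = adjust_monthly_report(24000, 1000)  # far above 12 x 1000
plausible = adjust_monthly_report(3000, 1000)    # left unchanged
```

Here a reported monthly amount of 24000 against a group mean of 1000 is treated as a yearly figure, while 3000 is kept as reported.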
19. I recently encountered an issue using the CFPS composite income variable. I need to use household disposable income in my analysis, but the composite variable provided by CFPS is household net income. According to the definition in the China Statistical Yearbook, disposable income includes wage income, net operational income, net property income, and net transfer income. Comparing this with the CFPS user manual, the difference seems to lie in the definition of transfer income. My question is, how can I calculate household disposable income based on CFPS household net income? Is it simply subtracting transfer expenditures?
Because the questionnaire design of CFPS differs from that of the National Bureau of Statistics, we believe it is difficult to derive a "disposable income" measure that is entirely consistent in scope. Besides the net transfer income you mentioned, CFPS's property income also differs from the "property income" in disposable income; our composite variable calculation does not include "income from assets such as bank deposits and securities." The CFPS questionnaires and calculation methods are available on the project website. We recommend that you construct the most comparable indicator possible based on your research needs.
20. I found that the 2012 total household consumption data has 2,068 missing values, while other years only have around 600. Could you check if there is an error? Why are there so many missing values? Furthermore, the description for this variable in the 2012 data is "Total Household Expenditure," whereas in other years it is "Total Household Expenditure Last Year." As I need this data for validation, could you please look into this?
The difference in the proportion of missing values is related to the calculation method of this composite variable across different years. Starting in 2014, the design of the expenditure section in the questionnaire differed from previous years. In 2012, only itemized expenditures were collected, and total expenditure had to be calculated by summation, which is how the total expenditure in our dataset for that year is derived. Starting in 2014, the questionnaire includes not only itemized expenditures but also a separate question on total expenditure. When itemized expenditures have missing values, the comprehensive variable uses the total expenditure response, leading to differences in the proportion of missing values across years. If you need more consistent calculations, you can generate your own version of total expenditure based on the questionnaire data, ensuring the algorithm remains consistent across years.
21. I noticed that in the 2018 data, there is a variable named fincperadj_p, which, according to the description, reflects the quantile of household per capita income. I read the relevant user manual, but it does not specify the scope for this quantile. For example, does "above 25%" refer to the top 25% nationally, provincially, or within the community? Furthermore, I want the income quantile for each sample at the community level. Apart from applying for restricted data, is there any other way to obtain this?
It refers to the national quantile. CFPS is self-representative only in five major provinces; it is not representative in other provinces or at the community level. If you need quantiles at those levels, please handle them cautiously according to your specific research needs.