Questions about Data Use

1. How do I cite CFPS data?

Please indicate the source of data wherever CFPS data are used. References should take the form below:
“The data are from China Family Panel Studies (CFPS), funded by 985 Program of Peking University and carried out by the Institute of Social Science Survey of Peking University”.

 

2. Does CFPS only provide sequential county codes? I need county-level variables and would like to get the national standard codes of the CFPS counties.

For confidentiality reasons, CFPS does not provide address information below the province level; only a recoded county code and community code are released. However, we have prepared a county-level file with variables on GDP, GDP per capita, population, employment, average educational level, proportion of working-age population, proportion of elderly population, sex distribution of the population aged 10 to 19, and proportion of non-agricultural population. The restricted county-level dataset can be applied for here.

 

3. Why are there many missing values in variables that ask for a specific amount? What is the “unfolding” approach mentioned in the questionnaire?

When respondents were not willing to provide a specific amount, we followed up with the unfolding approach. For instance, the question on personal income carries a string of thresholds ("2500/5000/7500/12000/18000/27000/40000/60000/90000/140000/210000/320000/480000"). In interviews, we started from the midpoint (40000 in the example above). For amounts less than the midpoint, we asked, “Is your total income below XX?” (XX stands for a threshold to the left of the midpoint.) For amounts greater than the midpoint, we asked, “Is your total income higher than XX?” (XX stands for a threshold to the right of the midpoint.)
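The threshold string above can be read programmatically. The following is only a minimal sketch: the function names are made up for illustration, and the order in which the remaining thresholds are asked after the midpoint is not specified here, so the code only separates the thresholds into the two sides.

```python
# The threshold string quoted in the answer above.
BRACKETS = "2500/5000/7500/12000/18000/27000/40000/60000/90000/140000/210000/320000/480000"


def parse_brackets(spec: str) -> list[int]:
    """Split the slash-separated string into sorted integer thresholds."""
    return sorted(int(x) for x in spec.split("/"))


def starting_point(thresholds: list[int]) -> int:
    """Return the middle threshold, where the unfolding sequence starts."""
    return thresholds[len(thresholds) // 2]


thresholds = parse_brackets(BRACKETS)
midpoint = starting_point(thresholds)

# Thresholds that can stand in for XX on each side of the midpoint.
lower = [t for t in thresholds if t < midpoint]   # "Is your total income below XX?"
higher = [t for t in thresholds if t > midpoint]  # "Is your total income higher than XX?"

print(midpoint, lower, higher)
```

With 13 thresholds, the middle element is the 7th one, which matches the midpoint of 40000 given in the example.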

 

4. How can I match the individual information when doing longitudinal analysis?

The variable pid, the key identifier of individuals in CFPS, is unique and permanent. It can be used to match individual records across different CFPS datasets.
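As an illustration of matching on pid, here is a minimal sketch in Python with pandas. The two small tables are toy data standing in for two waves, and "income" is an illustrative column name rather than an actual CFPS variable.

```python
import pandas as pd

# Toy stand-ins for two waves of adult data (values are made up).
adult_2010 = pd.DataFrame({"pid": [101, 102, 103], "income": [12000, 30000, 8000]})
adult_2012 = pd.DataFrame({"pid": [101, 103, 104], "income": [15000, 9000, 20000]})

# pid should identify exactly one row per person within each wave.
assert adult_2010["pid"].is_unique and adult_2012["pid"].is_unique

# Keep only individuals observed in both waves; the suffixes mark the wave.
panel = adult_2010.merge(
    adult_2012, on="pid", how="inner", suffixes=("_2010", "_2012")
)
print(panel)
```

An inner merge keeps only people present in both waves; `how="outer"` would keep everyone and leave missing values for the wave in which a person does not appear.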

 

5. Why are there many unmatched cases after merging the 2010 and 2012 child datasets?

There are three reasons: 1) the child is a new member in 2012 who was not in the 2010 sample; 2) children above 13 years old in 2010 have been moved to the 2012 adult dataset; 3) the attrition rate of the child sample was about 15% in 2012.
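To see which of these reasons applies to each unmatched case, one option is an outer merge with a merge indicator. This is a rough sketch with toy data; the column names other than pid are illustrative assumptions.

```python
import pandas as pd

# Toy child files for the two waves and a toy 2012 adult file.
child_2010 = pd.DataFrame({"pid": [1, 2, 3, 4], "age_2010": [5, 14, 9, 11]})
child_2012 = pd.DataFrame({"pid": [1, 4, 5], "age_2012": [7, 13, 1]})
adult_2012 = pd.DataFrame({"pid": [2, 900]})

# indicator=True adds a "_merge" column: both / left_only / right_only.
merged = child_2010.merge(child_2012, on="pid", how="outer", indicator=True)
counts = merged["_merge"].value_counts()

# right_only: children who first appear in 2012 (reason 1).
new_in_2012 = merged.loc[merged["_merge"] == "right_only", "pid"]

# left_only: 2010 children missing from the 2012 child file.
left_only = merged.loc[merged["_merge"] == "left_only", "pid"]
moved_to_adult = left_only[left_only.isin(adult_2012["pid"])]   # reason 2
not_found = left_only[~left_only.isin(adult_2012["pid"])]       # possible attrition (reason 3)

print(counts)
print(list(new_in_2012), list(moved_to_adult), list(not_found))
```

Checking the unmatched 2010 children against the 2012 adult file separates those who aged into the adult questionnaire from those who were not interviewed in 2012.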

 

6. How can I connect the information between children and parents?

If only basic information on children and parents is needed (e.g. age, marital status, education, household registration), you can find it in the family roster dataset. For children and parents who have individual questionnaires, find the parents’ or children’s pid in the family roster dataset and then link to the individual questionnaire by pid to obtain additional information.
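The two-step link described above (roster first, then individual questionnaires by pid) might look like the sketch below. The roster layout is an assumption made for illustration: the columns "pid_f" and "pid_m" (father's and mother's pid), "edu" and "score" are hypothetical names, so check the actual roster codebook before adapting it.

```python
import pandas as pd

# Hypothetical family roster: one row per child with the parents' pids.
roster = pd.DataFrame({
    "pid":   [12, 13],
    "pid_f": [10, 10],   # father's pid (assumed column name)
    "pid_m": [11, 11],   # mother's pid (assumed column name)
})

# Toy individual questionnaire data.
adult = pd.DataFrame({"pid": [10, 11], "edu": ["college", "high school"]})
child = pd.DataFrame({"pid": [12, 13], "score": [88, 92]})

# Step 1: attach the parents' pids from the roster to each child.
links = child.merge(roster, on="pid", how="left")

# Step 2: attach each parent's questionnaire data by that parent's pid.
father = adult.rename(columns={"pid": "pid_f", "edu": "father_edu"})
mother = adult.rename(columns={"pid": "pid_m", "edu": "mother_edu"})
linked = links.merge(father, on="pid_f", how="left").merge(mother, on="pid_m", how="left")

print(linked)
```

Left merges keep every child, so a child whose parent has no individual questionnaire simply gets missing values in the parent columns.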

 

7. How can I determine the head of the household in the family questionnaire of 2010?


 

8. What does “adjusted” mean in the 2010 family datasets, as in “adjusted expense” or “adjusted family income”?

Adjustment mainly refers to the conversion of self-consumption in rural households into family income. See the 2010 Technical Report CFPS-14: China family panel studies of 2010 rural household income adjustment measures (in Chinese) for details.

 

9. Why is the variable “urban” inconsistent with household registration status (hukou) (wa4)?

The variable “urban” is defined by the records of the National Bureau of Statistics of China, whereas wa4 is the hukou status reported by respondents themselves. The inconsistency may be caused by changes in urban status over the years or by a mismatch between the official record and the individual report.

 

10. Is self-employed business income included in personal income in the 2010 and 2012 adult datasets?

The 2010 question on personal income, “How much is your approximate total personal income (all sources of income) last year?”, can be regarded as including self-employed business income.

Personal income in 2012 is the sum of wages, bonuses, benefits, pensions and scholarships. Personal income from farming and home business was not asked in the individual questionnaires, but only in the family economy questionnaire, so a large proportion of the sample does not have positive personal income.

 

11. Why do some cases have missing values of fswt_res (the individual weight for the national resampling sample) in 2010?

fswt_res is positive only for a subsample of CFPS, identified by subsample=1. The weight variables can be used in two ways: 1. use the whole sample with the corresponding national weight (fswt_nat); 2. use the resampling sample with the resample weight (fswt_res). Users may refer to the relevant slides (in Chinese), downloadable from here.
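The two options can be sketched as follows in Python with pandas. The variable names fswt_nat, fswt_res and subsample come from the answer above; the data values, the "income" column and the helper function are made up for illustration, and a plain weighted mean is used only as the simplest possible estimate.

```python
import pandas as pd

# Toy data: fswt_res is missing outside the resampling subsample.
df = pd.DataFrame({
    "pid": [1, 2, 3, 4],
    "subsample": [1, 0, 1, 0],
    "fswt_nat": [2.0, 1.0, 1.0, 4.0],
    "fswt_res": [3.0, None, 1.0, None],
    "income": [100.0, 200.0, 300.0, 400.0],
})


def weighted_mean(data: pd.DataFrame, value: str, weight: str) -> float:
    """Weighted mean of one column, using another column as the weight."""
    w = data[weight]
    return float((data[value] * w).sum() / w.sum())


# Option 1: whole sample with the national weight.
mean_nat = weighted_mean(df, "income", "fswt_nat")

# Option 2: resampling sample only (subsample == 1) with the resample weight.
resample = df[df["subsample"] == 1]
mean_res = weighted_mean(resample, "income", "fswt_res")

print(mean_nat, mean_res)
```

The point of the sketch is that each weight is paired with its own sample: fswt_res should not be applied to cases outside subsample=1, where it is missing.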

 

12. How do I find the school information of those who attended college (variable QC306)?

Such information is available in recoded form in the variable “collegetype” in the CFPS 2010 adult dataset.

 

13. Have the 2010 chronic disease data been released?

We have released the 2010 adult chronic disease information in coded form. The related variable names are QP501ACODE and QP501BCODE; see the Codebook for details.

 

14. How can I match the 2010 and 2012 community IDs, which have different numbers of digits?

The first 5 digits of the community ID in 2012 correspond to those in 2010 and can be used to match the community data.
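A minimal sketch of this matching in Python with pandas is shown below. The column name "cid", the ID lengths and all values are assumptions made for illustration; only the rule of comparing the first 5 digits comes from the answer above.

```python
import pandas as pd

# Toy community files; the IDs and their lengths are made up.
comm_2010 = pd.DataFrame({"cid": [10001, 10002], "pop_2010": [1200, 800]})
comm_2012 = pd.DataFrame({"cid": [1000100, 1000200, 1000300], "pop_2012": [1250, 790, 500]})


def first5(ids: pd.Series) -> pd.Series:
    """Return the first 5 characters of each ID as a string key."""
    return ids.astype(str).str[:5]


# Build the same 5-digit key in both files, then merge on it.
comm_2010["cid5"] = first5(comm_2010["cid"])
comm_2012["cid5"] = first5(comm_2012["cid"])

matched = comm_2010.merge(
    comm_2012, on="cid5", how="inner", suffixes=("_2010", "_2012")
)
print(matched)
```

Converting the IDs to strings before slicing avoids arithmetic on the codes and keeps the key comparable across the two formats.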

 

15. How should I read questions in multiple-choice format? Could you explain with an example?

For example, question qa7, “Which of the following organizations did you join?”, in the 2010 adult questionnaire contains 14 choices (multiple choices allowed), stored in qa7_s_1 to qa7_s_14. qa7_s_1 is the first organization the respondent selected, qa7_s_2 is the second organization the respondent selected, and so on.
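Because the qa7_s_* variables record selections in order rather than one variable per organization, it is often convenient to convert them into one indicator per organization. The sketch below uses toy data with only two of the 14 slots, and it assumes unused slots are simply missing; the actual data may use special codes for that, so check the codebook first.

```python
import pandas as pd

# Toy data: each slot holds the code of the organization chosen in that position.
df = pd.DataFrame({
    "pid": [1, 2, 3],
    "qa7_s_1": [3, 1, None],     # first organization selected
    "qa7_s_2": [1, None, None],  # second organization selected
})

# Collect all slot columns, whatever their number.
slots = [c for c in df.columns if c.startswith("qa7_s_")]

# Stack the slots into one row per (person, selected organization).
long_df = df.melt(id_vars="pid", value_vars=slots, value_name="org")
long_df = long_df.dropna(subset=["org"])
long_df["org"] = long_df["org"].astype(int)

# One row per person, one column per organization code, 1 if selected.
indicators = pd.crosstab(long_df["pid"], long_df["org"])
indicators = indicators.reindex(df["pid"].tolist(), fill_value=0)

print(indicators)
```

In this toy example respondent 1 selected organizations 3 and 1 (in that order), respondent 2 selected only organization 1, and respondent 3 selected none.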

 

16. Is there a proxy respondent variable in the child datasets?

In the 2010 child dataset, waproxy is the proxy respondent variable; it gives the order number of the proxy respondent among the family members.
In the 2012 child dataset, kz1_b_1 and kz1_b_3 are the proxy respondent variables; they also give the order number of the proxy respondent among the family members.