统计检验箱线图

R
rstatix
统计检验
可视化
Author

Rui

Published

November 28, 2023

统计检验箱线图

数据说明

将统计检验与箱线图结合起来,结果输出形式以图代替表格,可以避免统计检验表格的杂乱,同时更加直观、清晰

数据使用 2017 年和 2019 年的 CHFS 数据,已经按照家庭编号匹配为平衡面板数据。被解释变量为家庭总消费,核心解释变量为家庭总负债,其他控制变量包括家庭总收入、家庭总资产、家庭拥有住房数量、地区、户主性别、户主婚姻状况、户主受教育程度、户主年龄、户主健康状况、户主工作性质等。

数值型的经济变量已进行加一对数化处理。

Code
library(tidyverse)
library(ggplot2)

data <- read.csv("data\\data_panel.csv")
glimpse(data)
## Rows: 10,202
## Columns: 46
## $ hhid                      <int> 2013000001, 2013000001, 2013000042, 20130000…
## $ year                      <int> 2017, 2019, 2017, 2019, 2017, 2019, 2017, 20…
## $ prov                      <chr> "北京", "北京", "北京", "北京", "北京", "北…
## $ gender                    <int> 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,…
## $ education                 <int> 3, 3, 2, 2, 2, 2, 3, 3, 2, 3, 4, 4, 3, 3, 4,…
## $ marriage                  <int> 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1,…
## $ health                    <int> 2, 2, 3, 2, 2, 2, 1, 2, 2, 3, 3, 4, 2, 3, 4,…
## $ total_income              <dbl> 136634.78, 139585.71, 80668.00, 88999.62, 20…
## $ total_consump             <dbl> 91600.00, 108264.83, 28656.00, 133716.73, 12…
## $ total_asset               <dbl> 9421595, 9830520, 409693, 1062253, 8503600, …
## $ total_debt                <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,…
## $ house_asset               <dbl> 8000000.0, 8566460.0, 300000.0, 737667.4, 80…
## $ house_debt                <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,…
## $ shengcun_con              <dbl> 71950.00, 83227.92, 27336.00, 124689.58, 853…
## $ fazhan_con                <dbl> 19650.000, 25036.907, 1320.000, 9027.145, 42…
## $ relationship              <int> 2, 2, 1, 1, 3, 4, 1, 3, 2, 2, 2, 1, 1, 1, 1,…
## $ workplace_type            <int> 0, 0, 5, 5, 0, 0, 0, 0, 1, 5, 0, 0, 0, 0, 5,…
## $ work_type                 <int> 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,…
## $ family_num                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ house_n                   <int> 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2,…
## $ house_dummy               <int> 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1,…
## $ house_price               <dbl> 8000000.0, 8566460.0, 300000.0, 737667.4, 80…
## $ house_loan                <dbl> 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e…
## $ down_payment              <dbl> 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e+00, 0e…
## $ mortgage_per_mon          <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.…
## $ loan_arrears              <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,…
## $ age                       <int> 72, 74, 48, 50, 68, 70, 56, 58, 59, 61, 70, …
## $ loan_confirmation         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ other_asset               <dbl> 1421595.0, 1264060.0, 109693.0, 324585.6, 50…
## $ other_debt                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ logtotal_income           <dbl> 11.82507, 11.84644, 11.29811, 11.39640, 12.2…
## $ logtotal_consump          <dbl> 11.42520, 11.59234, 10.26315, 11.80349, 11.7…
## $ logtotal_asset            <dbl> 16.05852, 16.10100, 12.92317, 13.87590, 15.9…
## $ logtotal_debt             <dbl> 0.00000, 0.00000, 0.00000, 0.00000, 0.00000,…
## $ loghouse_debt             <dbl> 0.00000, 0.00000, 0.00000, 0.00000, 0.00000,…
## $ logother_debt             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ loghouse_asset            <dbl> 15.89495, 15.96337, 12.61154, 13.51125, 15.8…
## $ logother_asset            <dbl> 14.16729, 14.04984, 11.60545, 12.69031, 13.1…
## $ loghouse_price            <dbl> 15.89495, 15.96337, 12.61154, 13.51125, 15.8…
## $ logshengcun_con           <dbl> 11.183741, 11.329350, 10.215996, 11.733591, …
## $ logfazhan_con             <dbl> 9.885884, 10.128146, 7.186144, 9.108102, 10.…
## $ logloan_arrears           <dbl> 0.00000, 0.00000, 0.00000, 0.00000, 0.00000,…
## $ age2                      <int> 5184, 5476, 2304, 2500, 4624, 4900, 3136, 33…
## $ prov_mean_house_price     <dbl> 27497.00, 32190.85, 27497.00, 32190.85, 2749…
## $ log_prov_mean_house_price <dbl> 10.22183, 10.37944, 10.22183, 10.37944, 10.2…
## $ debt_dummy                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…

描述性统计分析

Code
data <- data %>% select(-family_num) # 删去全为 0 的列

summary(data)
##       hhid                year          prov               gender      
##  Min.   :2.013e+09   Min.   :2017   Length:10202       Min.   :0.0000  
##  1st Qu.:2.013e+09   1st Qu.:2017   Class :character   1st Qu.:0.0000  
##  Median :2.015e+09   Median :2018   Mode  :character   Median :0.0000  
##  Mean   :2.015e+09   Mean   :2018                      Mean   :0.2051  
##  3rd Qu.:2.017e+09   3rd Qu.:2019                      3rd Qu.:0.0000  
##  Max.   :2.017e+09   Max.   :2019                      Max.   :1.0000  
##                                                                        
##    education        marriage         health      total_income     
##  Min.   :0.000   Min.   :0.000   Min.   :0.00   Min.   :       0  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:2.00   1st Qu.:   38742  
##  Median :2.000   Median :1.000   Median :2.00   Median :   74392  
##  Mean   :2.589   Mean   :0.883   Mean   :2.45   Mean   :  108935  
##  3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.:3.00   3rd Qu.:  126057  
##  Max.   :6.000   Max.   :1.000   Max.   :4.00   Max.   :11538468  
##  NA's   :1       NA's   :1                                        
##  total_consump      total_asset         total_debt        house_asset      
##  Min.   :   1376   Min.   :     962   Min.   :       0   Min.   :       1  
##  1st Qu.:  35859   1st Qu.:  359730   1st Qu.:       0   1st Qu.:  219190  
##  Median :  57900   Median :  831176   Median :       0   Median :  550000  
##  Mean   :  79172   Mean   : 1684518   Mean   :   64588   Mean   : 1274703  
##  3rd Qu.:  92307   3rd Qu.: 2024338   3rd Qu.:   10000   3rd Qu.: 1478834  
##  Max.   :2218525   Max.   :39770836   Max.   :10470118   Max.   :23500000  
##                                                                            
##    house_debt        shengcun_con       fazhan_con       relationship     
##  Min.   :       0   Min.   :   1285   Min.   :      0   Min.   :   1.000  
##  1st Qu.:       0   1st Qu.:  23056   1st Qu.:   8011   1st Qu.:   1.000  
##  Median :       0   Median :  36741   Median :  18367   Median :   1.000  
##  Mean   :   46620   Mean   :  46346   Mean   :  33846   Mean   :   6.037  
##  3rd Qu.:       0   3rd Qu.:  55111   3rd Qu.:  37978   3rd Qu.:   2.000  
##  Max.   :10470118   Max.   :1616206   Max.   :1126600   Max.   :7777.000  
##                                       NA's   :2493                        
##  workplace_type    work_type        house_n       house_dummy    
##  Min.   :0.000   Min.   :0.000   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.000   Median :1.000   Median :1.000   Median :0.0000  
##  Mean   :1.802   Mean   :1.111   Mean   :1.274   Mean   :0.2351  
##  3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:0.0000  
##  Max.   :7.000   Max.   :3.000   Max.   :5.000   Max.   :1.0000  
##                                                                  
##   house_price         house_loan       down_payment      mortgage_per_mon 
##  Min.   :       0   Min.   :      0   Min.   :       0   Min.   :    0.0  
##  1st Qu.:  209402   1st Qu.:      0   1st Qu.:       0   1st Qu.:    0.0  
##  Median :  523506   Median :      0   Median :       0   Median :    0.0  
##  Mean   : 1251276   Mean   :  31316   Mean   :   25195   Mean   :  316.5  
##  3rd Qu.: 1427743   3rd Qu.:      0   3rd Qu.:       0   3rd Qu.:    0.0  
##  Max.   :21000000   Max.   :3807316   Max.   :13000000   Max.   :33314.0  
##                                                                           
##   loan_arrears          age        loan_confirmation  other_asset      
##  Min.   :      0   Min.   :22.00   Min.   :1.000     Min.   :       0  
##  1st Qu.:      0   1st Qu.:47.00   1st Qu.:1.000     1st Qu.:   59115  
##  Median :      0   Median :56.00   Median :1.000     Median :  184062  
##  Mean   :  33735   Mean   :56.52   Mean   :1.122     Mean   :  409815  
##  3rd Qu.:      0   3rd Qu.:66.00   3rd Qu.:1.000     3rd Qu.:  452397  
##  Max.   :3331401   Max.   :97.00   Max.   :2.000     Max.   :35770836  
##                                                                        
##    other_debt      logtotal_income logtotal_consump logtotal_asset 
##  Min.   :      0   Min.   : 0.00   Min.   : 7.228   Min.   : 6.87  
##  1st Qu.:      0   1st Qu.:10.56   1st Qu.:10.487   1st Qu.:12.79  
##  Median :      0   Median :11.22   Median :10.966   Median :13.63  
##  Mean   :  17967   Mean   :11.01   Mean   :10.962   Mean   :13.59  
##  3rd Qu.:      0   3rd Qu.:11.74   3rd Qu.:11.433   3rd Qu.:14.52  
##  Max.   :5000000   Max.   :16.26   Max.   :14.612   Max.   :17.50  
##                                                                    
##  logtotal_debt    loghouse_debt    logother_debt    loghouse_asset   
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.6688  
##  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:12.2977  
##  Median : 0.000   Median : 0.000   Median : 0.000   Median :13.2177  
##  Mean   : 3.257   Mean   : 2.031   Mean   : 1.727   Mean   :13.1339  
##  3rd Qu.: 9.210   3rd Qu.: 0.000   3rd Qu.: 0.000   3rd Qu.:14.2068  
##  Max.   :16.164   Max.   :16.164   Max.   :15.425   Max.   :16.9725  
##                                                                      
##  logother_asset  loghouse_price  logshengcun_con  logfazhan_con   
##  Min.   : 0.00   Min.   : 0.00   Min.   : 7.159   Min.   : 0.000  
##  1st Qu.:10.99   1st Qu.:12.25   1st Qu.:10.046   1st Qu.: 8.989  
##  Median :12.12   Median :13.17   Median :10.512   Median : 9.818  
##  Mean   :11.94   Mean   :13.11   Mean   :10.475   Mean   : 9.732  
##  3rd Qu.:13.02   3rd Qu.:14.17   3rd Qu.:10.917   3rd Qu.:10.545  
##  Max.   :17.39   Max.   :16.86   Max.   :14.296   Max.   :13.935  
##                                                   NA's   :2493    
##  logloan_arrears       age2      prov_mean_house_price
##  Min.   : 0.000   Min.   : 484   Min.   : 4241        
##  1st Qu.: 0.000   1st Qu.:2209   1st Qu.: 5514        
##  Median : 0.000   Median :3136   Median : 6710        
##  Mean   : 1.293   Mean   :3371   Mean   : 9871        
##  3rd Qu.: 0.000   3rd Qu.:4356   3rd Qu.:11097        
##  Max.   :15.019   Max.   :9409   Max.   :32191        
##                                                       
##  log_prov_mean_house_price   debt_dummy    
##  Min.   : 8.353            Min.   :0.0000  
##  1st Qu.: 8.615            1st Qu.:0.0000  
##  Median : 8.811            Median :0.0000  
##  Mean   : 9.022            Mean   :0.2905  
##  3rd Qu.: 9.314            3rd Qu.:1.0000  
##  Max.   :10.379            Max.   :1.0000  
## 

因为是两年面板数据,也可以将家庭总消费分年份汇总:

Code
data %>%
  select(year, total_debt, total_consump, logtotal_consump) %>%
  mutate(debt_confirmation = if_else(abs(total_debt) < 1, 0, 1)) %>%
  mutate(debt_confirmation = factor(debt_confirmation)) %>%
  group_by(year) %>%
  summarise(count = n(), 
            Min = min(total_consump),
            Q25 = quantile(total_consump, 0.25),
            Mean = mean(total_consump),
            Median = median(total_consump),
            Q75 = quantile(total_consump, 0.75),
            Max = max(total_consump),
            SD = sd(total_consump),
            VAR = var(total_consump),
            IQR = IQR(total_consump)) %>%
  knitr::kable(align = "c")
year count Min Q25 Mean Median Q75 Max SD VAR IQR
2017 5101 1376.000 33524.00 71672.81 53836.00 85120.0 1441060 74751.21 5587743314 51596.00
2019 5101 2379.572 38954.55 86671.91 63058.66 100275.2 2218525 93349.04 8714043484 61320.62

也可以按照是否负债来汇总:

Code
data %>%
  select(year, total_debt, total_consump, logtotal_consump) %>%
  mutate(debt_confirmation = if_else(abs(total_debt) < 1, 0, 1)) %>%
  mutate(debt_confirmation = factor(debt_confirmation)) %>%
  group_by(year, debt_confirmation) %>%
  summarise(count = n(), 
            Min = min(total_consump),
            Q25 = quantile(total_consump, 0.25),
            Mean = mean(total_consump),
            Median = median(total_consump),
            Q75 = quantile(total_consump, 0.75),
            Max = max(total_consump),
            SD = sd(total_consump),
            VAR = var(total_consump),
            IQR = IQR(total_consump)) %>%
  knitr::kable(align = "c")
year debt_confirmation count Min Q25 Mean Median Q75 Max SD VAR IQR
2017 0 3536 1376.000 31847.00 66577.51 51142.00 79837.50 1441060 67877.93 4607413904 47990.50
2017 1 1565 3620.000 37442.00 83185.25 61090.00 97080.00 1144430 87268.37 7615768220 59638.00
2019 0 3697 2379.572 36013.40 75658.82 57595.17 90328.56 1301480 72622.14 5273974884 54315.16
2019 1 1404 2969.706 50218.25 115671.49 80309.13 133274.84 2218525 128924.16 16621438977 83056.59

按照不同年份汇总有负债与无负债家庭的个数:

Code
data %>%
  select(year, total_debt, total_consump, logtotal_consump) %>%
  mutate(debt_confirmation = if_else(abs(total_debt) < 1, 0, 1)) %>%
  mutate(debt_confirmation = factor(debt_confirmation)) %>%
  group_by(year, debt_confirmation) %>%
  summarise(count = n()) %>%
  knitr::kable(align = "c")
year debt_confirmation count
2017 0 3536
2017 1 1565
2019 0 3697
2019 1 1404

不难发现大部分家庭并没有负债

Code
data %>%
  select(year, total_debt, total_consump, logtotal_consump) %>%
  mutate(debt_confirmation = if_else(abs(total_debt) < 1, 0, 1)) %>%
  mutate(debt_confirmation = factor(debt_confirmation)) %>%
  ggplot(aes(x = logtotal_consump, fill = debt_confirmation)) +
  geom_histogram(alpha = 0.5, position = 'identity') +
  labs(x = "家庭总消费(加一对数化)", y = "家庭个数", fill = "是否负债") +
  theme_minimal() +
  facet_grid(.~year)

正态性检验

图像法

CDF 图:

Code
library(ggpubr)

ggdensity(data$logtotal_consump, 
          main = "密度图",
          xlab = "家庭总消费(对数化)")

QQ 图:

Code
qqnorm(data$logtotal_consump, main = "家庭总消费(对数化)QQ图")
qqline(data$logtotal_consump)

由密度图和 QQ 图可以看出,数据中间部分比较符合正态分布,而在双侧尾部距正态分布有一定偏移。

正态性检验方法

  • Shapiro-Wilks 检验:shapiro.test,R语言中要求样本量要在3和5000之间

  • Kolmogorov-Smirnov 检验(K-S检验):

Code
ks <- ks.test(x = data$logtotal_consump, y = "pnorm", 
              mean = mean(data$logtotal_consump), 
              sd = sqrt(var(data$logtotal_consump)))
ks
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data$logtotal_consump
## D = 0.024665, p-value = 8.13e-06
## alternative hypothesis: two-sided

结果显示 p 值小于 0,所以应该拒绝样本分布与理论分布一致的原假设,即样本分布不是正态分布。

Danger

在做单样本 K-S 检验或者正态检验时,有时会有错误提示 “Kolmogorov - Smirnov 检验里不应该有连结”,这是因为 K-S 检验只对连续 CDF 有效,而连续 CDF 中出现相同值的概率为 0,也就是说数据中倘出现相同值,则连续分布的假设不成立,因此 R 会报错。这也提醒我们,在做正态性检验之前,要先对数据进行描述性分析,对数据整体要先有个大致的认识,这也才后续才能选择正确的检验方法。

  • Anderson–Darling test:Anderson–Darling 检验是一种用来检验给定的样本是否来自于某个确定的概率分布的统计检验方法。在R语言中,我们可以从 nortest 包中的 ad.test() 进行检验
Code
library(nortest)

ad <- ad.test(data$logtotal_consump)
ad
## 
##  Anderson-Darling normality test
## 
## data:  data$logtotal_consump
## A = 13.482, p-value < 2.2e-16
  • Lilliefor test:Lilliefor test 是基于 Kolmogorov–Smirnov test 的一种正态性检验,其检验没有确定来自哪一个具体的正态分布。lillie.test() 也在 nortest 包中。
Code
lilliefor <- lillie.test(data$logtotal_consump)
lilliefor
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  data$logtotal_consump
## D = 0.024665, p-value = 2.649e-15
检验名称 p值 结果
Shapiro-Wilks test \ \
K-S test 8.1296653^{-6} 非正态
Anderson–Darling test 3.7^{-24} 非正态
Lilliefor test 2.6485138^{-15} 非正态

以上三种检验的结果均显示样本分布非正态。

非参数检验

Code
library(patchwork) # 拼图
library(ggpubr)
library(qqplotr) # QQ图
library(ggsignif) # 显著性标记
library(rstatix) # 更适合管道操作的统计检验包

不同户主性别的消费水平统计检验

以上检验了总体样本的正态性,还应该检验分样本后各个样本的正态性。

先画出分布图和 Q-Q 图

Code
# 户主性别
data_long <- data %>%
  select(c(gender, logtotal_consump)) %>%
  mutate(gender = if_else(gender == 0, "男性", "女性")) 

# 核密度图
Q1 <- data_long %>% 
  ggplot(aes(y = logtotal_consump, fill = gender)) +
  geom_density(alpha = 0.4,        # 填充颜色透明度
               size = 1,           # 线条粗细
               linetype = 1) +     # 线条类型1是实线,2是虚线
  ylim(c(7, 16)) + 
  #scale_fill_brewer(palette = "Set1") + # 调色
  facet_grid(.~gender) +
  labs(x = "freq", y = "logtotal_consump") +
  theme_bw()

# Q-Q 图
Q2 <- ggplot(data_long, aes(sample = logtotal_consump, fill = gender)) +
  stat_qq_band(#bandType = "ts", 
               alpha = 0.4) +
  stat_qq_point(size = 1) +
  stat_qq_line(linewidth = 1) +
  #scale_fill_brewer(palette = "Set1") + # 调色
  facet_wrap(vars(gender), scales = "free_x") + # x轴刻度不固定
  labs(x = "Theoretical", y = "Sample") +
  theme_bw()

# 拼图
Q1/Q2

K-S 检验与结果可视化:

Code
for (g in unique(data_long$gender)) {
  data_long %>%
    dplyr::filter(gender == g) %>%
    dplyr::select(logtotal_consump) %>%
    pull() %>%
    ks.test(., "pnorm", mean = mean(.), sd = sqrt(.)) %>%
    print()
}
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  .
## D = 0.31515, p-value < 2.2e-16
## alternative hypothesis: two-sided
## 
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  .
## D = 0.3065, p-value < 2.2e-16
## alternative hypothesis: two-sided
Code
ggplot(data_long, aes(sample = logtotal_consump)) +
  geom_qq_band(bandType = "ks", mapping = aes(fill = "KS"), alpha = 0.3) +
  #geom_qq_band(bandType = "ts", mapping = aes(fill = "TS"), alpha = 0.3) +
  #geom_qq_band(bandType = "pointwise", mapping = aes(fill = "Normal"), alpha = 0.3) +
  #geom_qq_band(bandType = "boot", mapping = aes(fill = "Bootstrap"), alpha = 0.3) +
  stat_qq_line() +
  stat_qq_point() +
  facet_wrap(vars(gender), scales = "free_x") +
  labs(x = "Theoretical", y = "Sample") +
  theme_bw()

K-S检验结果显示,两组样本均不服从正态分布,所以不太适合使用 t 检验。

由于户主性别是一个二元分类变量(男女两种水平),所以直接使用一般的 wilcox 检验。

Code
p1 <- data_long %>%
  ggplot(aes(gender, logtotal_consump)) +
  geom_boxplot(aes(fill = gender)) +
  labs(x = "", y = "家庭总消费(加一对数化)", fill = "户主性别") +
  ylim(7, 15.8) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

  
p1 <- p1 + 
  geom_signif(
  # 指定你要比较的组,以list形式传入:
    comparisons = list(c("男性", "女性")),
    map_signif_level = T, # 展示P值或显著性
    textsize = 3, # label大小
    test = "wilcox.test", # wilcox检验
    step_increase = 0.1 # 位置变化
  )
p1

由于空间有限,以下几次的 K-S 检验的结果暂不显示,一律按照非正态分布的情况处理。

不同婚姻状况的消费水平统计检验

由于婚姻状况是一个二元分类变量(已婚、未婚两种水平),所以直接使用一般的 wilcox 检验。

Code
# 婚姻状况
data_long <- data %>%
  select(c(marriage, logtotal_consump)) %>%
  drop_na() %>%
  mutate(marriage = if_else(marriage == 0, "未婚", "已婚")) 

p2 <- data_long %>%
  ggplot(aes(marriage, logtotal_consump)) +
  geom_boxplot(aes(fill = marriage)) +
  labs(x = "", y = "家庭总消费(加一对数化)", fill = "婚姻状况") +
  ylim(7, 15.8) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))


p2 <- p2 + 
  geom_signif(
    # 指定你要比较的组,以list形式传入:
    comparisons = list(c("未婚", "已婚")),
    map_signif_level = T, # 展示P值或显著性
    textsize = 3, # label大小
    test = "wilcox.test", # 检验类型
    step_increase = 0.1 # 位置变化
  )
p2

不同受教育程度的消费水平统计检验

由于受教育程度是一个多分类变量(男女两种水平),所以需要对指定的水平进行两两的 wilcox 检验。

Code
data_long <- data %>%
  select(c(education, logtotal_consump)) %>%
  drop_na() %>%
  mutate(education = as.numeric(education)) %>% # education原本是factor
  mutate(edu = case_when(
    education >= 4 ~ "本科及以上",
    education >= 2 & education < 4 ~ "初中高中",
    education < 2 ~ "小学及以下")) %>%
  select(-education)

p3 <- data_long %>%
  ggplot(aes(edu, logtotal_consump)) +
  geom_boxplot(aes(fill = edu)) +
  labs(x = "", y = "家庭总消费(加一对数化)", fill = "受教育程度") +
  ylim(7, 17.8) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

p3 <- p3 + 
  geom_signif(
    # 指定你要比较的组,以list形式传入:
    comparisons = list(c("本科及以上", "初中高中"), 
                       c("本科及以上", "小学及以下"),
                       c("初中高中", "小学及以下")),
    map_signif_level = T, # 展示P值或显著性
    textsize = 3, # label大小
    test = "wilcox.test", # 检验类型
    step_increase = 0.14 # 位置变化
  )
p3

不同健康程度的消费水平统计检验

Code
# 健康程度
data_long <- data %>%
  select(c(health, logtotal_consump)) %>%
  drop_na() %>%
  #mutate(health = as.numeric(health)) %>% 
  mutate(hea = case_when(
    health == 0 ~ "非常好",
    health == 1 ~ "好",
    health == 2 ~ "一般",
    health == 3 ~ "不好",
    health == 4 ~ "非常不好")) %>%
  mutate(hea = factor(hea, levels = c("非常不好", "不好", "一般", "好", "非常好"))) %>%
  select(-health)

p4 <- data_long %>%
  ggplot(aes(hea, logtotal_consump)) +
  geom_boxplot(aes(fill = hea)) +
  labs(x = "", y = "家庭总消费(加一对数化)", fill = "健康状况") +
  ylim(7, 17.8) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

p4 <- p4 + 
  geom_signif(
    # 指定你要比较的组,以list形式传入:
    comparisons = list(#c("非常不好", "不好"), # 不显著
                       c("非常不好", "一般"),
                       c("非常不好", "好"),
                       c("非常不好", "非常好")),
    map_signif_level = T, # 展示P值或显著性
    textsize = 3, # label大小
    test = t.test, # 检验类型
    step_increase = 0.14 # 位置变化
  )
p4

将以上四幅图拼接在一起:

Code
(p1|p2)/(p3|p4)

不同工作性质的消费水平统计检验

Code
# 工作性质
data_long <- data %>%
  select(c(work_type, logtotal_consump)) %>%
  drop_na() %>%
  mutate(work = case_when(
    work_type == 0 ~ "无工作",
    work_type == 1 ~ "正规劳动合同雇员",
    work_type == 2 ~ "非正规劳动合同雇员",
    work_type == 3 ~ "其他")) %>%
  mutate(work = factor(work, levels = c("无工作", "正规劳动合同雇员", "非正规劳动合同雇员", "其他"))) %>%
  select(-work_type)

p5 <- data_long %>%
  ggplot(aes(work, logtotal_consump)) +
  geom_boxplot(aes(fill = work)) +
  labs(x = "", y = "家庭总消费(加一对数化)", fill = "工作性质") +
  ylim(7, 18) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
  ggtitle("")

p5 <- p5 + 
  geom_signif(
    # 指定你要比较的组,以list形式传入:
    comparisons = list(#c("非常不好", "不好"), # 不显著
      c("正规劳动合同雇员", "无工作"),
      c("正规劳动合同雇员", "非正规劳动合同雇员"),
      c("正规劳动合同雇员", "其他")),
    map_signif_level = T, # 展示P值或显著性
    textsize = 3, # label大小
    test = t.test, # 检验类型
    step_increase = 0.14 # 位置变化
  )
p5

不同单位类型的消费水平统计检验

Code
data_long <- data %>%
  select(c(workplace_type, logtotal_consump)) %>%
  drop_na() %>%
  mutate(workplace = case_when(
    workplace_type == 0 ~ "无工作单位",
    workplace_type == 1 ~ "机关团体、事业单位",
    workplace_type == 2 ~ "国企",
    workplace_type == 4 ~ "个体工商户",
    workplace_type == 5 ~ "私企",
    workplace_type == 3 | workplace_type == 6 | workplace_type == 7 ~ "其他")) %>%
  mutate(workplace = factor(workplace, levels = c("无工作单位", "机关团体、事业单位", "国企", "个体工商户", "私企", "其他"))) %>%
  select(-workplace_type)

p6 <- data_long %>%
  ggplot(aes(workplace, logtotal_consump)) +
  geom_boxplot(aes(fill = workplace)) +
  labs(x = "", y = "家庭总消费(加一对数化)", fill = "工作单位类型") +
  ylim(7, 20) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
  ggtitle("")

p6 <- p6 + 
  geom_signif(
    # 指定你要比较的组,以list形式传入:
    comparisons = list(#c("非常不好", "不好"), # 不显著
                        c("无工作单位", "机关团体、事业单位"),
                        c("无工作单位", "国企"),
                        c("无工作单位", "个体工商户"),
                        c("无工作单位", "私企"),
                        c("无工作单位", "其他")),
    map_signif_level = T, # 展示P值或显著性
    textsize = 3, # label大小
    test = t.test, # 检验类型
    step_increase = 0.14 # 位置变化
  )
p6

将以上两幅图拼在一起:

Code
p5|p6

是否负债家庭的消费水平统计检验(分年份)

除了使用 stats 包中的 wilcox.test 进行统计检验以外。还可以使用 rstatix 包中的 wilcox_test,更加适合管道操作。

准备统计检验需要的数据:

Code
# 是否负债
data_long <- data %>%
  select(year, total_debt, total_consump, logtotal_consump) %>%
  mutate(debt_confirmation = if_else(abs(total_debt) < 1, 0, 1)) %>%
  mutate(debt_confirmation = factor(debt_confirmation)) %>%
  select(c(debt_confirmation, year, logtotal_consump)) %>%
  rename("features" = "year", "values" = "logtotal_consump") %>%
  mutate(features = as.character(features))

使用 wilcox 检验:wilcox_test 函数

Code
stat.test <- data_long %>%
  group_by(features) %>%
  wilcox_test(
    values ~ debt_confirmation,
    paired = FALSE
  )

用管道还可以(更建议)这么写:

Code
stat.test <- data_long %>%
  group_by(features) %>%
  wilcox_test(data =., values ~ debt_confirmation) %>%
  add_significance(
    "p", # 根据统计检验结果中的p列添加显著性标记
    cutpoints = c(0, 0.01, 0.05, 0.1, 1), 
    symbols = c("***", "**", "*", "ns") # 星号的数量可以自己改
  ) 
Tip
  • 若要使用配对样本 t 检验,可用 t_test 替换 wilcox_test
  • 默认参数显著性最高的是4颗星星,不太符合国内研究习惯。在 add_significance 函数中可以通过设置参数 cutsymbols 来自行调整星号的设置

在检验结果中添加显著性标记的 x、y 坐标,方便绘图:

Code
# 添加显著性标记
stat.test <- stat.test %>% add_xy_position(x = "features")

绘制箱线图:

Code
# 绘制分组箱线图
p0 <- data_long %>%  
  ggplot(aes(features, values)) +
  geom_boxplot(aes(fill = debt_confirmation)) +
  ylim(7, 16) +
  labs(x = "年份", y = "家庭总消费(加一对数化)", fill = "是否负债") +
  #ggsci::scale_fill_d3()+
  theme_bw()

# 按照检验结果中的坐标添加显著性标记
p0 <- p0 + geom_signif(
  annotations = stat.test$p.signif, # 根据p.signif这一列绘制显著性标记
  y_position = stat.test$y.position,
  xmin = stat.test$xmin,
  xmax = stat.test$xmax
)

p0