
[Statistical Methods] Variable Selection with LASSO

The most basic package for running LASSO is glmnet (library(glmnet)).

If the goal is a pure LASSO analysis, alpha must be set to 1 (alpha = 0 gives ridge regression; values in between give the elastic net).

Standardize the data: LASSO is sensitive to feature scale, so the data should be standardized (mean 0, variance 1).

Pass the lambda.min or lambda.1se obtained from cv.glmnet to glmnet::glmnet(lambda = ...).
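For example, a minimal sketch (x, y, and cv_fit here refer to the mtcars example below):

# lambda.min minimizes the CV error; lambda.1se is the largest lambda whose
# CV error is within one standard error of that minimum (a sparser model)
fit_min <- glmnet(x, y, alpha = 1, lambda = cv_fit$lambda.min)
fit_1se <- glmnet(x, y, alpha = 1, lambda = cv_fit$lambda.1se)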

library(glmnet)

# Load data (mtcars as an example)
data(mtcars)
x <- as.matrix(mtcars[, -1])  # feature matrix (mpg is the response)
y <- mtcars$mpg

# Cross-validation to choose the optimal lambda (automatic LASSO)
cv_fit <- cv.glmnet(x, y, alpha = 1)
best_lambda <- cv_fit$lambda.min

# Fit the final model with the optimal lambda
final_model <- glmnet(x, y, alpha = 1, lambda = best_lambda)

# Inspect the selected variables (note: "(Intercept)" also appears here)
selected_vars <- rownames(coef(final_model))[coef(final_model)[, 1] != 0]
print(selected_vars)

Manually standardize the feature matrix:
x_scaled <- scale(x)
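Note that glmnet standardizes predictors internally by default (standardize = TRUE) and reports coefficients on the original scale, so manual scaling mainly matters if you want the coefficients themselves on the standardized scale. A minimal sketch:

# scale() centers to mean 0 and scales to SD 1 by default
x_scaled <- scale(x, center = TRUE, scale = TRUE)
fit_scaled <- glmnet(x_scaled, y, alpha = 1, standardize = FALSE)  # already scaled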

Categorical variables: dummy coding vs. numeric coding (a test)


library(glmnet)

data(iris)
str(iris$Species)
df <- iris

# Dummy coding via model.matrix (includes an intercept column)
design_matrix <- model.matrix(~ Species, data = df)
x <- as.matrix(data.frame(Sepal.Width  = df$Sepal.Width,
                          Petal.Length = df$Petal.Length,
                          Petal.Width  = df$Petal.Width,
                          design_matrix))
fit1 <- cv.glmnet(x = x, y = df$Sepal.Length)
fit1
plot(fit1)

# Numeric coding: treat Species as 1/2/3 (added as column 6)
iris$Species_num <- as.numeric(iris$Species)
x2 <- as.matrix(iris[, c(2, 3, 4, 6)])  # column 5 is the factor itself, so use column 6
fit2 <- cv.glmnet(x = x2, y = iris$Sepal.Length)  # original used x here, which made the test vacuous
fit2
plot(fit2)
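To see how the two codings differ, compare the coefficients at the selected lambda (continuing from fit1 and fit2 above):

# Dummy coding: one coefficient per non-reference Species level
coef(fit1, s = "lambda.min")
# Numeric coding: a single slope, which implicitly assumes the three
# species are ordered and equally spaced
coef(fit2, s = "lambda.min")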

An esophageal-cancer example

# -----01-Lasso----
library(glmnet)

# Load the data before splitting it
df <- read.csv("tab.csv")

set.seed(123)
train_index <- caret::createDataPartition(1:nrow(df), p = 0.7, list = TRUE)[["Resample1"]]
test_index <- setdiff(1:nrow(df), train_index)

names(df)
# Columns 4-15 are categorical covariates
df[, 4:15] <- lapply(df[, 4:15], as.factor)
paste(names(df[, 4:15]), collapse = "+")  # helper for writing the formula
design_matrix <- model.matrix(~ Smoking_status + Alcohol_consumption + Tea_consumption +
                                Sex + Ethnic.group + Residence + Education +
                                Marital.status + History_of_diabetes +
                                Family_history_of_cancer + Occupation +
                                Physical_Activity, data = df)

# Standardize the continuous features (columns 16-48)
df[, 16:48] <- scale(df[, 16:48])
summary(df$AAvsEPA); sd(df$AAvsEPA)  # check: mean ~0, SD 1

x <- as.matrix(data.frame(df[, 16:48], design_matrix))

# Parameter search first: 5-fold CV over the lambda path (logistic LASSO)
fit1 <- cv.glmnet(x = x[train_index, ], y = df[train_index, ]$Group,
                  alpha = 1, nfolds = 5, type.measure = "mse", family = "binomial")
plot(fit1)
fit1
mean(fit1$cvm)

best_lambda <- fit1$lambda.1se
coefficients <- coef(fit1, s = best_lambda)
selected_vars <- rownames(coefficients)[coefficients[, 1] != 0]
print("Selected variables in test prediction:")
print(selected_vars)

lasso_pred <- predict(fit1, s = best_lambda, newx = x[test_index, ])
mse <- mean((lasso_pred - df[test_index, ]$Group)^2)
cat("Test MSE:", mse, "\n")

# fit <- glmnet(x, df$Group, family = "cox", maxit = 1000); plot(fit)
# (commented out: family = "cox" requires a survival response such as
# Surv(time, status), not the binary Group)

# Re-run glmnet over the same lambda sequence to draw coefficient paths
final_model <- glmnet(x[train_index, ], df[train_index, ]$Group,
                      family = "binomial",  # match fit1
                      lambda = fit1$lambda, alpha = 1)
plot(final_model, label = TRUE)
plot(final_model, xvar = "lambda", label = TRUE)
plot(final_model, xvar = "dev", label = TRUE)

Feature selection
We found 44 potential features, including demographics and clinical and laboratory variables (Table 1). We performed feature selection using the least absolute shrinkage and selection operator (LASSO), which is among the most widely used feature selection techniques. LASSO constructs a penalty function that compresses some of the regression coefficients, i.e., it forces the sum of the absolute values of the coefficients to be less than some fixed value while setting some regression coefficients at zero, thus obtaining a more refined model. LASSO retains the advantage of subset shrinkage as a biased estimator that deals with data with complex covariance. This algorithm uses LassoCV, a fivefold cross-validation approach, to automatically eliminate factors with zero coefficients (Python version: sklearn 0.22.1).
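An R analogue of that LassoCV workflow with glmnet (the paper itself used Python's sklearn; x and y stand for the 44-feature matrix and outcome, which are not provided here):

cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 5)   # fivefold CV
coefs <- coef(cv_fit, s = "lambda.min")
keep <- rownames(coefs)[as.numeric(coefs) != 0]    # drop zero-coefficient factors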

2.2.2. Feature Selection.
Feature selection was performed using least absolute shrinkage and selection operator (LASSO) regression. The LASSO regression model improves prediction performance by adjusting the hyperparameter λ to shrink the regression coefficients toward zero and selecting the feature set that performs best in DN prediction. To determine the best λ, the value with the minimum mean cross-validated error was selected using 10-fold cross-validation.
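In glmnet terms, that selection rule corresponds to the following sketch (x and y are placeholders for the study's feature matrix and DN outcome, which are not given here):

cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10, family = "binomial")
best_lambda <- cv_fit$lambda.min   # lambda with minimum mean CV error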

Detailed steps were as follows: (1) Screening characteristic factors: First, R software (glmnet 4.1.2) was used to conduct the least absolute shrinkage and selection operator (LASSO) regression analysis, performing variable screening and adjusting model complexity. Then, the LASSO regression results were carried into a multifactor logistic regression analysis in SPSS, and finally the characteristic factors with p < 0.05 were obtained. (2) Data division: The random number method in Python (sklearn 0.22.1) was used to randomly divide the gout patients into a training set and a test set at a ratio of 7:3, with 491 in the training set and 211 in the testing set. (3) Classified multi-model comprehensive analysis: eXtreme Gradient Boosting (XGBoost)
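Step (1), carrying the LASSO-selected variables into a multivariable logistic regression, might look like this in R (the paper used SPSS for this step; outcome and selected_vars are placeholders):

form <- as.formula(paste("outcome ~", paste(selected_vars, collapse = " + ")))
multi_fit <- glm(form, data = df, family = binomial)
summary(multi_fit)   # retain factors with p < 0.05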

