Elastic net is a suitable tool when a regression problem has many correlated features.
In this scenario, if you apply the vanilla LASSO, the resulting coefficients are not stable. Instability here means that as you decrease the regularization parameter $\lambda$ in small steps along the LASSO path, the values of the non-zero coefficients can change substantially in either direction (increase or decrease). So depending on the chosen $\lambda$, the value of a regression coefficient $\beta_j$ (which determines the importance of feature $j$) can be drastically different, and we get no reliable idea of the relative significance of the features.
In a non-degenerate application of the LASSO, you would hope that after a feature is first selected on the LASSO path, its coefficient gradually escapes shrinkage as $\lambda$ decreases and eventually reaches its OLS value.
Such a steady increase or decrease of the selected coefficients does not happen in the correlated-features case, because those features divide the importance weight (i.e., the coefficients) arbitrarily among themselves, whereas it should be distributed almost uniformly among the correlated dimensions.
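A small experiment can make this behaviour visible. The sketch below is an illustration on synthetic data and is not part of the original argument: the three near-duplicate features, the noise levels, and the seed are all assumptions chosen for the demo. It uses scikit-learn's `lasso_path`, whose `alphas` output plays the role of the regularization parameter along the path.

```python
import numpy as np
from sklearn.linear_model import lasso_path

# Synthetic data (assumed for illustration): three noisy copies of one signal z.
rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
X = np.column_stack([z + 0.05 * rng.normal(size=n) for _ in range(3)])
y = 3.0 * z + 0.5 * rng.normal(size=n)

# lasso_path does not fit an intercept, so standardize X and center y first.
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

# alphas are returned in decreasing order; coefs has shape (n_features, n_alphas).
alphas, coefs, _ = lasso_path(X, y)

# Print a few points along the path to see how the weight is split
# among the correlated features as the regularization weakens.
for lam, beta in zip(alphas[::20], coefs.T[::20]):
    print(f"lambda={lam:.4f}  coefficients={np.round(beta, 3)}")
```

Reading the printed rows from top to bottom corresponds to moving along the path from strong to weak regularization, so you can check how evenly (or unevenly) the three correlated features share the weight at each step.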
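The uniformity of coefficients over the correlated features can be induced using an $\ell_2$ penalty (why?). Therefore, elastic net suggests a convex combination of both $\ell_1$ and $\ell_2$ penalties, with a mixing weight $\alpha \in [0, 1]$:

$$\lambda\left[\alpha\,\|\beta\|_1 + \tfrac{1}{2}(1-\alpha)\,\|\beta\|_2^2\right]$$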
Interestingly, the resulting norm ball has sparsity-inducing corners and edges like the $\ell_1$-norm ball, while each facet is curved, encouraging equal values for the selected features on that facet.
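One way to see why a squared $\ell_2$ term evens out the coefficients is the following simplified sketch. It assumes the idealized case of two perfectly identical features, $x_1 = x_2$, which is an assumption made here for illustration. In that case the fit depends only on the sum $s = \beta_1 + \beta_2$, so for a fixed $s > 0$ with $\beta_1, \beta_2 \ge 0$ the two penalties behave very differently:

$$
\|\beta\|_1 = \beta_1 + \beta_2 = s,
\qquad
\|\beta\|_2^2 = \beta_1^2 + \beta_2^2 = \frac{s^2}{2} + \frac{(\beta_1 - \beta_2)^2}{2}.
$$

The $\ell_1$ penalty is the same for every split of $s$, so the LASSO has no preference among them, whereas the squared $\ell_2$ penalty is smallest exactly when $\beta_1 = \beta_2 = s/2$.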
Below is the elastic-net ball (left panel) in $\mathbb{R}^3$, compared to the $\ell_1$-ball (right panel). The curved contours encourage strongly correlated variables to share coefficients. Image is from *Statistical Learning with Sparsity: The Lasso and Generalizations*.
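The coefficient-sharing effect can also be checked numerically. The following is a minimal sketch on synthetic data, not a result from the book: the two near-duplicate features, the penalty strengths, and the solver settings are all assumptions. In scikit-learn, `alpha` is the overall regularization strength and `l1_ratio` is the mixing weight between the two penalties.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

# Synthetic data (assumed for illustration): two near-identical features
# built from the same signal z, plus one independent noise feature.
rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
X = np.column_stack([
    z + 0.01 * rng.normal(size=n),
    z + 0.01 * rng.normal(size=n),
    rng.normal(size=n),
])
y = 2.0 * z + 0.1 * rng.normal(size=n)

# Tight tolerance and a generous iteration cap, since coordinate descent
# converges slowly when features are this strongly correlated.
lasso = Lasso(alpha=0.1, tol=1e-8, max_iter=100_000).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-8, max_iter=100_000).fit(X, y)

# Gap between the coefficients of the two correlated features.
lasso_gap = abs(lasso.coef_[0] - lasso.coef_[1])
enet_gap = abs(enet.coef_[0] - enet.coef_[1])

print("LASSO coefficients:      ", np.round(lasso.coef_, 3))
print("Elastic net coefficients:", np.round(enet.coef_, 3))
print("gap (LASSO):", round(lasso_gap, 4), " gap (elastic net):", round(enet_gap, 4))
```

Comparing the two gaps shows how each penalty splits the weight between the near-duplicate columns. How the LASSO splits it depends on the noise in the particular sample, so rerunning with a different seed is a useful check.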