This release adds support for scikit-learn 1.0, which includes support for feature names. If you pass a pandas dataframe to fit, the estimator will set a feature_names_in_ attribute containing the feature names. When a dataframe is passed to predict, the column names are checked for consistency with those passed to fit. The example below illustrates this feature.
For a full list of changes in scikit-survival 0.17.0, please see the release notes.
Installation
Pre-built conda packages are available for Linux, macOS, and Windows via
conda install -c sebp scikit-survival
Alternatively, scikit-survival can be installed from source following these instructions.
Feature Names Support
Prior to scikit-survival 0.17, you could pass a pandas dataframe to estimators’ fit and predict methods, but the estimator was oblivious to the feature names accessible via the dataframe’s columns attribute. With scikit-survival 0.17, and thanks to scikit-learn 1.0, feature names will be considered when a dataframe is passed.
Let’s illustrate feature names support using the Veteran’s Lung Cancer dataset.
from sksurv.datasets import load_veterans_lung_cancer
X, y = load_veterans_lung_cancer()
X.head(3)
| | Age_in_years | Celltype | Karnofsky_score | Months_from_Diagnosis | Prior_therapy | Treatment |
|---|---|---|---|---|---|---|
| 0 | 69.0 | squamous | 60.0 | 7.0 | no | standard |
| 1 | 64.0 | squamous | 70.0 | 5.0 | yes | standard |
| 2 | 38.0 | squamous | 60.0 | 3.0 | no | standard |
The original data has 6 features, three of which contain strings, which we encode as numeric using OneHotEncoder.
from sksurv.preprocessing import OneHotEncoder
transform = OneHotEncoder()
Xt = transform.fit_transform(X)
Transforms now have a get_feature_names_out() method, which returns the names of the features after the transformation.
transform.get_feature_names_out()
array(['Age_in_years', 'Celltype=large', 'Celltype=smallcell', 'Celltype=squamous', 'Karnofsky_score', 'Months_from_Diagnosis', 'Prior_therapy=yes', 'Treatment=test'], dtype=object)
The transformed data returned by OneHotEncoder is again a dataframe, which can be used to fit Cox’s proportional hazards model.
from sksurv.linear_model import CoxPHSurvivalAnalysis
model = CoxPHSurvivalAnalysis().fit(Xt, y)
Since we passed a dataframe, the feature_names_in_ attribute will contain the column names of the dataframe used when calling fit.
model.feature_names_in_
array(['Age_in_years', 'Celltype=large', 'Celltype=smallcell', 'Celltype=squamous', 'Karnofsky_score', 'Months_from_Diagnosis', 'Prior_therapy=yes', 'Treatment=test'], dtype=object)
This is used during prediction to check that the data matches the format of the training data. For instance, when passing a raw numpy array instead of a dataframe, a warning will be issued.
pred = model.predict(Xt.values)
UserWarning: X does not have valid feature names, but CoxPHSurvivalAnalysis was fitted with feature names
It also checks that the order of the columns matches.
import pandas as pd

# Move the Age_in_years column from the first to the last position.
X_reordered = pd.concat(
    (Xt.drop("Age_in_years", axis=1), Xt.loc[:, "Age_in_years"]),
    axis=1
)
pred = model.predict(X_reordered)
FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised. Feature names must be in the same order as they were in fit.
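A simple way to guard against this is to reorder the columns to match feature_names_in_ before predicting. The helper below, align_to_fit_columns, is a hypothetical sketch and not part of scikit-survival or scikit-learn. To keep it runnable without a fitted model, a SimpleNamespace carrying only a feature_names_in_ attribute stands in for the estimator, and the column names are made up.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def align_to_fit_columns(estimator, X):
    """Return X with its columns in the order recorded in feature_names_in_.

    Raises ValueError if X lacks a training column or has an extra one.
    """
    expected = list(estimator.feature_names_in_)
    missing = [name for name in expected if name not in X.columns]
    extra = [name for name in X.columns if name not in expected]
    if missing or extra:
        raise ValueError(f"column mismatch: missing={missing}, unexpected={extra}")
    return X.loc[:, expected]


# Stand-in for a fitted estimator: only the attribute used here is provided.
fitted = SimpleNamespace(
    feature_names_in_=np.array(["age", "score", "dose"], dtype=object)
)
shuffled = pd.DataFrame(
    {"score": [1.0, 2.0], "dose": [5.0, 6.0], "age": [30.0, 40.0]}
)

aligned = align_to_fit_columns(fitted, shuffled)
print(list(aligned.columns))
```

With the objects from this post, the equivalent call would be model.predict(align_to_fit_columns(model, X_reordered)).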
For more details on feature names support, have a look at the scikit-learn release highlights.