Daily Python: Sebastian Pölsterl: scikit-survival 0.17 released

This release adds support for scikit-learn 1.0, which introduced support for feature names. If you pass a pandas dataframe to fit, the estimator sets a feature_names_in_ attribute containing the feature names. When a dataframe is passed to predict, the estimator checks that its column names are consistent with those passed to fit. The example below illustrates this feature.

For a full list of changes in scikit-survival 0.17.0, please see the release notes.

Installation

Pre-built conda packages are available for Linux, macOS, and Windows via

 conda install -c sebp scikit-survival

Alternatively, scikit-survival can be installed from source following these instructions.

Feature Names Support

Prior to scikit-survival 0.17, you could pass a pandas dataframe to estimators’ fit and predict methods, but the estimator was oblivious to the feature names accessible via the dataframe’s columns attribute. With scikit-survival 0.17, and thanks to scikit-learn 1.0, feature names will be considered when a dataframe is passed.
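To make the idea concrete, here is a minimal pure-Python sketch of the kind of consistency check described above. It is a simplified illustration, not scikit-learn's actual implementation: the function check_feature_names and its arguments are hypothetical, and a mismatch raises an error straight away, which is the behaviour announced for scikit-learn 1.2 rather than the warning issued by 1.0.

```python
import warnings


def check_feature_names(fitted_names, new_names, estimator_name="Estimator"):
    """Simplified sketch: compare names seen in fit with names seen in predict.

    Both arguments are lists of column names, or None if the data had no names.
    """
    if fitted_names is None and new_names is None:
        return  # no names on either side, nothing to validate
    if new_names is None:
        warnings.warn(
            f"X does not have valid feature names, but {estimator_name} "
            "was fitted with feature names"
        )
        return
    if fitted_names is None:
        warnings.warn(
            f"X has feature names, but {estimator_name} "
            "was fitted without feature names"
        )
        return
    # Names must match exactly, including their order.
    if list(fitted_names) != list(new_names):
        raise ValueError(
            "The feature names should match those that were passed during fit."
        )


# Identical names in identical order pass silently.
check_feature_names(["Age_in_years", "Treatment=test"],
                    ["Age_in_years", "Treatment=test"])
```

The comparison is order-sensitive on purpose: two dataframes with the same columns in a different order are treated as inconsistent.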

Let’s illustrate feature names support using the Veteran’s Lung Cancer dataset.

from sksurv.datasets import load_veterans_lung_cancer
X, y = load_veterans_lung_cancer()
X.head(3)

	Age_in_years	Celltype	Karnofsky_score	Months_from_Diagnosis	Prior_therapy	Treatment
0	69.0	squamous	60.0	7.0	no	standard
1	64.0	squamous	70.0	5.0	yes	standard
2	38.0	squamous	60.0	3.0	no	standard

The original data has six features, three of which contain strings. We encode these as numeric features using OneHotEncoder.

from sksurv.preprocessing import OneHotEncoder
transform = OneHotEncoder()
Xt = transform.fit_transform(X)

Transformers now have a get_feature_names_out() method, which returns the names of the features after the transformation.

transform.get_feature_names_out()

array(['Age_in_years', 'Celltype=large', 'Celltype=smallcell',
'Celltype=squamous', 'Karnofsky_score', 'Months_from_Diagnosis',
'Prior_therapy=yes', 'Treatment=test'], dtype=object)
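The returned names can be used to trace each encoded feature back to its source column. The following is a small sketch that assumes encoded names follow the column=category pattern visible in the output above, and that no original column name contains an equals sign. The variable names are illustrative only.

```python
from collections import defaultdict

# Names copied from the get_feature_names_out() output above.
encoded_names = [
    "Age_in_years", "Celltype=large", "Celltype=smallcell",
    "Celltype=squamous", "Karnofsky_score", "Months_from_Diagnosis",
    "Prior_therapy=yes", "Treatment=test",
]

# Assumption: the text before the first "=" is the original column name.
encoded_by_column = defaultdict(list)
for name in encoded_names:
    column, _, _ = name.partition("=")
    encoded_by_column[column].append(name)

for column, names in encoded_by_column.items():
    print(f"{column}: {names}")
```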

The transformed data returned by OneHotEncoder is again a dataframe, which can be used to fit Cox’s proportional hazards model.

from sksurv.linear_model import CoxPHSurvivalAnalysis
model = CoxPHSurvivalAnalysis().fit(Xt, y)

Since we passed a dataframe, the feature_names_in_ attribute will contain the column names of the dataframe passed to fit.

model.feature_names_in_

array(['Age_in_years', 'Celltype=large', 'Celltype=smallcell',
'Celltype=squamous', 'Karnofsky_score', 'Months_from_Diagnosis',
'Prior_therapy=yes', 'Treatment=test'], dtype=object)

This is used during prediction to check that the data matches the format of the training data. For instance, when passing a raw numpy array instead of a dataframe, a warning will be issued.

pred = model.predict(Xt.values)

UserWarning: X does not have valid feature names, but CoxPHSurvivalAnalysis was fitted with feature names
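One way to avoid this warning, sketched here under the assumption that the array's columns are in the same order as during fit, is to wrap the array in a dataframe that reuses the stored names. In the sketch, fitted_names is a hypothetical stand-in for model.feature_names_in_, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for model.feature_names_in_ of a fitted estimator.
fitted_names = np.array(
    ["Age_in_years", "Karnofsky_score", "Prior_therapy=yes"], dtype=object
)

# A raw array without column names (illustrative values, same column order as in fit).
raw = np.array([[69.0, 60.0, 0.0],
                [64.0, 70.0, 1.0]])

# Re-attach the names so that the estimator can validate them in predict.
X_named = pd.DataFrame(raw, columns=fitted_names)
print(list(X_named.columns))
```

This only restores the labels. It cannot detect whether the array's columns really are in the order used during fit, so that remains the caller's responsibility.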

Moreover, the estimator checks that the order of the columns matches the order used during fit.

import pandas as pd

# Move the Age_in_years column from the first to the last position.
X_reordered = pd.concat(
    (Xt.drop("Age_in_years", axis=1), Xt.loc[:, "Age_in_years"]),
    axis=1
)
pred = model.predict(X_reordered)

FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names must be in the same order as they were in fit.
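If a dataframe holds the right columns in the wrong order, selecting the columns by the stored names restores the order used during fit. The sketch below uses a hypothetical fitted_names list in place of model.feature_names_in_ and made-up values, so it runs without a fitted model.

```python
import pandas as pd

# Hypothetical stand-in for model.feature_names_in_ of a fitted estimator.
fitted_names = ["Age_in_years", "Karnofsky_score", "Prior_therapy=yes"]

# Same columns as during fit, but in a different order (illustrative values).
X_reordered = pd.DataFrame({
    "Karnofsky_score": [60.0, 70.0],
    "Prior_therapy=yes": [0.0, 1.0],
    "Age_in_years": [69.0, 64.0],
})

# Selecting by the stored names restores the order used in fit;
# a KeyError is raised if any expected column is missing.
X_aligned = X_reordered.loc[:, fitted_names]
print(list(X_aligned.columns))
```

With a fitted model, the same selection would presumably read X_reordered.loc[:, model.feature_names_in_].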

For more details on feature names support, have a look at the scikit-learn release highlights.

from Planet Python

Sunday, January 9, 2022
