Sunday, January 9, 2022

Sebastian Pölsterl: scikit-survival 0.17 released

This release adds support for scikit-learn 1.0, which includes support for feature names. If you pass a pandas dataframe to fit, the estimator will set a feature_names_in_ attribute containing the feature names. When a dataframe is passed to predict, the estimator checks that its column names are consistent with those passed to fit. The example below illustrates this feature.

For a full list of changes in scikit-survival 0.17.0, please see the release notes.

Installation

Pre-built conda packages are available for Linux, macOS, and Windows via

 conda install -c sebp scikit-survival

Alternatively, scikit-survival can be installed from source following these instructions.

Feature Names Support

Prior to scikit-survival 0.17, you could pass a pandas dataframe to estimators’ fit and predict methods, but the estimator was oblivious to the feature names accessible via the dataframe’s columns attribute. With scikit-survival 0.17, and thanks to scikit-learn 1.0, feature names will be considered when a dataframe is passed.

Let’s illustrate feature names support using the Veteran’s Lung Cancer dataset.

from sksurv.datasets import load_veterans_lung_cancer
X, y = load_veterans_lung_cancer()
X.head(3)
   Age_in_years  Celltype  Karnofsky_score  Months_from_Diagnosis  Prior_therapy  Treatment
0          69.0  squamous             60.0                    7.0             no   standard
1          64.0  squamous             70.0                    5.0            yes   standard
2          38.0  squamous             60.0                    3.0             no   standard

The original data has six features, three of which (Celltype, Prior_therapy, and Treatment) contain strings. We encode these as numeric columns using OneHotEncoder.

from sksurv.preprocessing import OneHotEncoder
transform = OneHotEncoder()
Xt = transform.fit_transform(X)

Transformers now have a get_feature_names_out() method, which returns the names of the features after the transformation.

transform.get_feature_names_out()
array(['Age_in_years', 'Celltype=large', 'Celltype=smallcell',
       'Celltype=squamous', 'Karnofsky_score', 'Months_from_Diagnosis',
       'Prior_therapy=yes', 'Treatment=test'], dtype=object)
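
The names above follow a simple pattern: numeric columns keep their name, while each categorical column contributes one column=category entry for every category except the first. The pure-Python sketch below mimics that naming scheme. The helper encoded_feature_names and the hard-coded category lists are hypothetical stand-ins for illustration, not scikit-survival's implementation, and dropping the first category is an assumption inferred from the output above.

```python
def encoded_feature_names(columns):
    """Return output names for a mapping of column name -> categories.

    `columns` maps each column name to a list of its categories, or to
    None for numeric columns. Hypothetical helper for illustration only.
    """
    names = []
    for column, categories in columns.items():
        if categories is None:
            # Numeric columns pass through unchanged.
            names.append(column)
        else:
            # Assumption: the first category is dropped as the reference level.
            names.extend(f"{column}={category}" for category in categories[1:])
    return names


# Assumed category lists for the Veterans' Lung Cancer data.
columns = {
    "Age_in_years": None,
    "Celltype": ["adeno", "large", "smallcell", "squamous"],
    "Karnofsky_score": None,
    "Months_from_Diagnosis": None,
    "Prior_therapy": ["no", "yes"],
    "Treatment": ["standard", "test"],
}
names = encoded_feature_names(columns)
print(names)
```

Under these assumptions, the sketch produces the same eight names that get_feature_names_out() reports.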

The transformed data returned by OneHotEncoder is again a dataframe, which can be used to fit Cox’s proportional hazards model.

from sksurv.linear_model import CoxPHSurvivalAnalysis
model = CoxPHSurvivalAnalysis().fit(Xt, y)

Since we passed a dataframe, the feature_names_in_ attribute will contain the column names of the dataframe used when calling fit.

model.feature_names_in_
array(['Age_in_years', 'Celltype=large', 'Celltype=smallcell',
       'Celltype=squamous', 'Karnofsky_score', 'Months_from_Diagnosis',
       'Prior_therapy=yes', 'Treatment=test'], dtype=object)

This is used during prediction to check that the data matches the format of the training data. For instance, when passing a raw numpy array instead of a dataframe, a warning will be issued.

pred = model.predict(Xt.values)
UserWarning: X does not have valid feature names, but CoxPHSurvivalAnalysis was fitted with feature names
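
If the data is only available as a numpy array, one way to avoid this warning is to wrap the array in a dataframe that carries the fitted names before calling predict. The sketch below uses a small made-up array and a hard-coded list standing in for model.feature_names_in_, so it runs without a fitted model; both are assumptions for illustration. It is only appropriate when the array's columns are already in the same order as during fit.

```python
import numpy as np
import pandas as pd

# Stand-in for model.feature_names_in_ (hard-coded so the sketch is self-contained).
feature_names = [
    "Age_in_years", "Celltype=large", "Celltype=smallcell",
    "Celltype=squamous", "Karnofsky_score", "Months_from_Diagnosis",
    "Prior_therapy=yes", "Treatment=test",
]

# Made-up values with one column per fitted feature.
values = np.zeros((2, len(feature_names)))

# Attach the fitted names so the columns can be validated by name.
X_named = pd.DataFrame(values, columns=feature_names)
print(list(X_named.columns) == feature_names)
```

With the fitted model from above, the equivalent call would be pd.DataFrame(Xt.values, columns=model.feature_names_in_).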

The estimator also checks that the order of the columns matches the order seen during fit.

import pandas as pd

# Move the Age_in_years column from the first to the last position.
X_reordered = pd.concat(
    (Xt.drop("Age_in_years", axis=1), Xt.loc[:, "Age_in_years"]),
    axis=1
)
pred = model.predict(X_reordered)
FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names must be in the same order as they were in fit.
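
When all expected columns are present and only their order differs, selecting the columns by the list of fitted names restores the original order. The following is a minimal sketch with a made-up three-column dataframe; fitted_names is a hard-coded stand-in for model.feature_names_in_, an assumption that keeps the example runnable on its own.

```python
import pandas as pd

# Stand-in for model.feature_names_in_ (shortened to three names for brevity).
fitted_names = ["Age_in_years", "Karnofsky_score", "Treatment=test"]

# Made-up dataframe in which Age_in_years has been moved to the end.
X_reordered = pd.DataFrame(
    {
        "Karnofsky_score": [60.0, 70.0],
        "Treatment=test": [0.0, 1.0],
        "Age_in_years": [69.0, 64.0],
    }
)

# Selecting by the list of fitted names returns the columns in that order.
X_restored = X_reordered.loc[:, fitted_names]
print(list(X_restored.columns))
```

With the fitted model from above, the same idea would read model.predict(X_reordered.loc[:, model.feature_names_in_]).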

For more details on feature names support, have a look at the scikit-learn release highlights.


