This analysis focuses on Wellness Profiles in Scotland (the data is [here], and an analysis of the data is [here]). The data set contains the following indicators:
- Male life expectancy
- Female life expectancy
- Deaths all ages
- All mortality among 15-44 year olds
- Early deaths from CHD (<75)
- Early deaths from cancer (<75)
- Estimated smoking attributable deaths
- Smoking prevalence (adults 16+)
- Alcohol-related hospital stays
- Deaths from alcohol conditions
- Drug-related hospital stays
- Active travel to work
- New cancer registrations
- Patients hospitalised with (COPD)
- Patients hospitalised with coronary heart disease
- Patients hospitalised with asthma
- Patients with emergency hospitalisations
- Patients (65+) with multiple emergency hospitalisations
- Road traffic accident casualties
- Population prescribed drugs for anxiety/depression/psychosis
- Patients with a psychiatric hospitalisation
- Deaths from suicide
- Adults incapacity benefit/severe disability allow/employment allow
- People aged 65+ with high care needs cared at home
- Children looked after by local authority
- Single adult dwellings
- Average tariff score of all pupils on S4 roll
- Primary school attendance
- Secondary school attendance
We will use a machine learning method to take a sample of the data and learn from it. Then we will make predictions and see if we can predict the number of deaths from alcohol conditions in an area from the other indicators. Let's first read in the data:
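import numpy as np
import pandas as pd
import sys
# train_test_split lives in sklearn.model_selection
# (the older sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Load the wellness profile data
ver = pd.read_csv("well.csv")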
Now let's generate the training data (taking 50% of the data) to train the system. The rows will be taken randomly from the dataset, and we will use "Patients hospitalised with (COPD)", "Patients hospitalised with asthma", "Drug-related hospital stays", "Alcohol-related hospital stays", "Male life expectancy", and "All mortality among 15-44 year olds" as the inputs, with "Deaths from alcohol conditions" as the value to predict:
train, test, y_train, y_test = train_test_split(
    ver[["Patients hospitalised with (COPD)",
         "Patients hospitalised with asthma",
         "Drug-related hospital stays",
         "Alcohol-related hospital stays",
         "Male life expectancy",
         "All mortality among 15-44 year olds"]],
    ver["Deaths from alcohol conditions"],
    test_size=0.5, random_state=1)
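To see what the split returns, here is a minimal sketch on a made-up table. The 32-row frame and the column names `feature_a`, `feature_b` and `target` are assumptions for illustration, not the contents of well.csv. It shows that `test_size=0.5` halves the rows and that the inputs and the target keep matching row labels after shuffling:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for well.csv: 32 rows, two inputs and one target
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature_a": rng.normal(size=32),
    "feature_b": rng.normal(size=32),
    "target": rng.integers(5, 40, size=32),
})

X_train, X_test, y_train, y_test = train_test_split(
    demo[["feature_a", "feature_b"]], demo["target"],
    test_size=0.5, random_state=1)

# Half of the rows land in each part
print(X_train.shape, X_test.shape)
# The shuffled inputs and targets still share the same row labels
print(X_test.index.equals(y_test.index))
```

Because the rows are shuffled, position 0 of the test set is generally not row 0 of the original table, so any later comparison should go through `y_test` or the test index.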
Now we will fit a model using the random forest method:
model = RandomForestRegressor(n_estimators=100, min_samples_leaf=10)
model.fit(train, y_train)
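As a rough illustration of what the random forest is doing, the sketch below fits the same kind of model on synthetic data (an assumption for illustration, not the wellness data) and checks that the forest's prediction is simply the average of the predictions of its individual trees:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only (not the wellness profile data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=10,
                               random_state=0)
forest.fit(X[:100], y[:100])

# Predict with the whole forest, then average the trees by hand
forest_pred = forest.predict(X[100:])
tree_mean = np.mean([tree.predict(X[100:]) for tree in forest.estimators_],
                    axis=0)
print(np.allclose(forest_pred, tree_mean))
```

Here `n_estimators` is the number of trees, and `min_samples_leaf=10` stops a tree from creating a leaf with fewer than 10 training rows, which smooths the predictions.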
We should now have created our model. Let's make predictions on the test data:
predictions = model.predict(test)
This will give us predicted values for deaths from alcohol conditions. Let's process these and define a success as a prediction whose error has a magnitude of 5 or less:
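# Counters must be initialised before the loop
success = 0
failure = 0
print('%22s %s %s %s %s' % ("Area", "Pred", "Actual", "Diff", "Success"))
for x in range(len(predictions)):
    # Compare each prediction with the actual value for the same test row
    # (the test rows are shuffled, so look them up through y_test and test.index)
    actual = y_test.iloc[x]
    area = ver["Area"].loc[test.index[x]]
    error = abs(int(predictions[x]) - actual)
    if error <= 5:
        outcome = "Success"   # avoid shadowing the built-in str
        success += 1
    else:
        outcome = "Failed!"
        failure += 1
    print('%22s %4d %4d %4d %s' % (area, int(predictions[x]), actual, error, outcome))
print('Success: %3d Fail: %3d' % (success, failure))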
If we run the model, here are the results:
Area                   Pred  Actual  Diff  Success
Aberdeen City            19      21     2  Success
Aberdeenshire            16       9     6  Failed!
Angus                    18      16     2  Success
Argyll and Bute          27      22     4  Success
Clackmannanshire         19      19     0  Success
Dumfries and Galloway    13      13     0  Success
Dundee City              20      30    10  Failed!
East Ayrshire            17      23     6  Failed!
East Dunbartonshire      16      14     1  Success
East Renfrewshire        15      14     0  Success
Edinburgh City           30      21     9  Failed!
Falkirk                  19      18     0  Success
Fife                     16      19     3  Success
Glasgow City             18      40    22  Failed!
Highland                 29      22     6  Failed!
Success: 9 Fail: 6
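As a quick check, the success count can be re-derived from the Pred and Actual columns listed above. This is only a sketch: the helper name `count_within_tolerance` is made up for illustration, and it works on the rounded integers shown in the listing rather than on the raw model output.

```python
def count_within_tolerance(predicted, actual, tolerance=5):
    """Return (successes, failures); a success means |pred - actual| <= tolerance."""
    successes = sum(1 for p, a in zip(predicted, actual)
                    if abs(p - a) <= tolerance)
    return successes, len(predicted) - successes

# Pred and Actual columns copied from the listing above, in the same order
pred = [19, 16, 18, 27, 19, 13, 20, 17, 16, 15, 30, 19, 16, 18, 29]
actual = [21, 9, 16, 22, 19, 13, 30, 23, 14, 14, 21, 18, 19, 40, 22]

wins, losses = count_within_tolerance(pred, actual)
print('Success: %3d Fail: %3d' % (wins, losses))
print('Success rate: %.0f%%' % (100.0 * wins / len(pred)))
```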
We actually predicted 9 of the 15 correctly! As a baseline, we can estimate the success rate if we simply guess 14 deaths each time, which gives us 11 values. So the naive guess is around 43%, whereas we achieved 60% from a limited set of parameters ... and that's machine learning!
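The comparison against a naive guess can be sketched as follows. The numbers here are invented for illustration (they are not taken from well.csv), and `success_rate` is a hypothetical helper. The idea is just to score a constant guess with the same "within 5" rule as the model:

```python
def success_rate(predicted, actual, tolerance=5):
    """Fraction of predictions whose absolute error is at most `tolerance`."""
    hits = sum(1 for p, a in zip(predicted, actual)
               if abs(p - a) <= tolerance)
    return hits / len(actual)

# Invented observed values and model predictions, for illustration only
actual = [12, 15, 30, 14, 22, 9, 41, 16]
model_pred = [14, 13, 24, 15, 25, 12, 30, 18]

# Naive baseline: guess the same value for every area
constant_guess = 14
baseline_pred = [constant_guess] * len(actual)

model_rate = success_rate(model_pred, actual)
baseline_rate = success_rate(baseline_pred, actual)
print('Model: %.1f%%  Baseline: %.1f%%' % (100 * model_rate, 100 * baseline_rate))
```

A model is only worth keeping if it beats this kind of baseline on rows it has not seen during training.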
Demo
Here is the demo: