Problem

We have medical insurance data for a set of patients. We will use the independent variables to build a machine learning model that estimates insurance charges. Because the medical charge is a numeric value, this is a regression problem.

The dependent variable is charges. The independent variables are age, sex, bmi, children, smoker, and region.

Load libraries and data

library(dplyr)
library(caret)

insurance <- readRDS("insurance.rds")

Exploratory Data Analysis (EDA)

The full EDA was already done in the data analysis section, so here we only preview the first rows and the structure of the data.

knitr::kable(head(insurance))
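| age | sex    | bmi    | children | smoker | region    | charges   |
|----:|:-------|-------:|---------:|:-------|:----------|----------:|
|  19 | female | 27.900 |        0 | yes    | southwest | 16884.924 |
|  39 | male   | 33.770 |        1 | no     | southeast |  1725.552 |
|  28 | male   | 33.000 |        3 | no     | southeast |  4449.462 |
|  33 | male   | 22.705 |        0 | no     | northwest | 21984.471 |
|  32 | male   | 28.880 |        0 | no     | northwest |  3866.855 |
|  31 | female | 25.740 |        0 | no     | southeast |  3756.622 |
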
str(insurance)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 39 28 33 32 31 46 37 37 60 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

Algorithm

Because this is a regression problem, we will use the multiple linear regression algorithm to build the predictive model for medical insurance charges.

Simple Linear Regression

Simple linear regression is a method for predicting a quantitative value and for studying the relationship between two continuous variables, X and Y. Mathematically, simple linear regression can be written as:

\[Y = a + bX + e\]

where \(Y\) is the dependent variable, \(X\) is the independent variable, \(a\) is the intercept, \(b\) is the slope of \(X\), and \(e\) is the error term.

The main task of linear regression is to find the best-fitting straight line through the (X, Y) points, that is, the line that minimizes the sum of squared differences between the observed and predicted values of Y.
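To make the idea of a best-fitting line concrete, the sketch below fits a simple linear regression to simulated data (not the insurance data) and compares the `lm()` estimates with the closed-form least-squares formulas. The variable names and the true values a = 2 and b = 3 are assumptions chosen only for illustration.

```r
# Simulated data: the true intercept (2) and slope (3) are illustrative assumptions
set.seed(42)
x <- runif(100, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(100, sd = 1)

# Fit Y = a + b*X + e by ordinary least squares
fit_simple <- lm(y ~ x)

# Closed-form least-squares estimates for comparison
b_hat <- cov(x, y) / var(x)           # slope
a_hat <- mean(y) - b_hat * mean(x)    # intercept

coef(fit_simple)   # intercept and slope estimated by lm()
c(a_hat, b_hat)    # the same two numbers from the formulas
```

Both approaches minimize the same sum of squared residuals, so the two sets of estimates should agree up to floating-point error.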

Multiple Linear Regression

Multiple linear regression attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data.

Unlike simple linear regression, it uses more than one predictor. The equation for multiple linear regression is:

\[Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + e\]

where:

\(Y\) is the response (dependent) variable, \(\beta_0\) is the intercept, \(x_1\) and \(x_2\) are the predictors (independent variables), \(\beta_1\) and \(\beta_2\) are the coefficients of \(x_1\) and \(x_2\) respectively, and \(e\) is the error term.
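As a minimal sketch of the equation above, the following code fits a multiple linear regression with two predictors on simulated data and recovers the same coefficients from the normal equations. The data frame `sim`, its column names, and the true coefficients (1, 2, and -0.5) are hypothetical and used only for illustration; this is not the model fitted to the insurance data.

```r
# Simulated data: true beta0 = 1, beta1 = 2, beta2 = -0.5 (illustrative assumptions)
set.seed(123)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.5)
sim <- data.frame(y, x1, x2)

# Fit Y = beta0 + beta1*x1 + beta2*x2 + e by ordinary least squares
fit_multi <- lm(y ~ x1 + x2, data = sim)

# Same estimates from the normal equations: (X'X) beta = X'y
X <- model.matrix(~ x1 + x2, data = sim)   # design matrix with an intercept column
beta_hat <- solve(crossprod(X), crossprod(X, sim$y))

coef(fit_multi)   # estimates from lm()
drop(beta_hat)    # estimates from the normal equations
```

When a predictor is a factor, such as `smoker` or `region`, `lm()` expands it into indicator (dummy) columns automatically, so each non-reference level gets its own coefficient.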

Splitting the data