FINAL PROJECT 1: LINEAR REGRESSION

LukmanPras

FINAL PROJECT 1: LINEAR REGRESSION

LUKMAN PRASETYO NUGROHO (PYTN-KS09-004)
NATHANIA GUNAWAN (PYTN-KS09-006)

A. Perkenalan

Dataset yang kami gunakan yaitu Uber and Lyft Dataset Boston, MA yang berisi histori data perjalanan Uber dan Lyft di Boston. Analisis kami bertujuan untuk memprediksi dan membandingkan, Bagaimana cab type, name, surge multiplier, dan distance mempengaruhi harga Uber dan Lyft.

Dataset yang kami gunakan berasal dari terdiri dari https://www.kaggle.com/datasets/brllrb/uber-and-lyft-dataset-boston-ma 57 kolom dan 693.071 data , yakni:

id, timestamp, hour, day, month, datetime, timezone, source, destination, cab_type, product_id, name, price, distance, surge_multiplier, latitude, longitude, temperature, apparentTemperature, short_summary, long_summary, precipIntensity, precipProbability, humidity, windSpeed, windGust, windGustTime, visibility, temperatureHigh, temperatureHighTime, temperatureLow, temperatureLowTime, apparentTemperatureHigh, apparentTemperatureHighTime, apparentTemperatureLow, apparentTemperatureLowTime, icon, dewPoint, pressure, windBearing, cloudCover, uvIndex, visibility.1, ozone, sunriseTime, sunsetTime, moonPhase, precipIntensityMax, uvIndexTime, temperatureMin, temperatureMinTime, temperatureMax, temperatureMaxTime, apparentTemperatureMin, apparentTemperatureMinTime, apparentTemperatureMax, dan apparentTemperatureMaxTime

B.Import Package

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor
import seaborn as sns
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import sys
from statsmodels.stats.diagnostic import normal_ad

if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")

C. Data Loading

Read Dataframe

from google.colab import drive
drive.mount('/content/drive/')
Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
%cd drive/MyDrive/dataset/uber-and-lyft-dataset-boston-ma
/content/drive/MyDrive/dataset/uber-and-lyft-dataset-boston-ma
df = pd.read_csv('rideshare_kaggle.csv')
df
id timestamp hour day month datetime timezone source destination cab_type ... precipIntensityMax uvIndexTime temperatureMin temperatureMinTime temperatureMax temperatureMaxTime apparentTemperatureMin apparentTemperatureMinTime apparentTemperatureMax apparentTemperatureMaxTime
0 424553bb-7174-41ea-aeb4-fe06d4f4b9d7 1.544953e+09 9 16 12 2018-12-16 09:30:07 America/New_York Haymarket Square North Station Lyft ... 0.1276 1544979600 39.89 1545012000 43.68 1544968800 33.73 1545012000 38.07 1544958000
1 4bd23055-6827-41c6-b23b-3c491f24e74d 1.543284e+09 2 27 11 2018-11-27 02:00:23 America/New_York Haymarket Square North Station Lyft ... 0.1300 1543251600 40.49 1543233600 47.30 1543251600 36.20 1543291200 43.92 1543251600
2 981a3613-77af-4620-a42a-0c0866077d1e 1.543367e+09 1 28 11 2018-11-28 01:00:22 America/New_York Haymarket Square North Station Lyft ... 0.1064 1543338000 35.36 1543377600 47.55 1543320000 31.04 1543377600 44.12 1543320000
3 c2d88af2-d278-4bfd-a8d0-29ca77cc5512 1.543554e+09 4 30 11 2018-11-30 04:53:02 America/New_York Haymarket Square North Station Lyft ... 0.0000 1543507200 34.67 1543550400 45.03 1543510800 30.30 1543550400 38.53 1543510800
4 e0126e1f-8ca9-4f2e-82b3-50505a09db9a 1.543463e+09 3 29 11 2018-11-29 03:49:20 America/New_York Haymarket Square North Station Lyft ... 0.0001 1543420800 33.10 1543402800 42.18 1543420800 29.11 1543392000 35.75 1543420800
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
693066 616d3611-1820-450a-9845-a9ff304a4842 1.543708e+09 23 1 12 2018-12-01 23:53:05 America/New_York West End North End Uber ... 0.0000 1543683600 31.42 1543658400 44.76 1543690800 27.77 1543658400 44.09 1543690800
693067 633a3fc3-1f86-4b9e-9d48-2b7132112341 1.543708e+09 23 1 12 2018-12-01 23:53:05 America/New_York West End North End Uber ... 0.0000 1543683600 31.42 1543658400 44.76 1543690800 27.77 1543658400 44.09 1543690800
693068 64d451d0-639f-47a4-9b7c-6fd92fbd264f 1.543708e+09 23 1 12 2018-12-01 23:53:05 America/New_York West End North End Uber ... 0.0000 1543683600 31.42 1543658400 44.76 1543690800 27.77 1543658400 44.09 1543690800
693069 727e5f07-a96b-4ad1-a2c7-9abc3ad55b4e 1.543708e+09 23 1 12 2018-12-01 23:53:05 America/New_York West End North End Uber ... 0.0000 1543683600 31.42 1543658400 44.76 1543690800 27.77 1543658400 44.09 1543690800
693070 e7fdc087-fe86-40a5-a3c3-3b2a8badcbda 1.543708e+09 23 1 12 2018-12-01 23:53:05 America/New_York West End North End Uber ... 0.0000 1543683600 31.42 1543658400 44.76 1543690800 27.77 1543658400 44.09 1543690800

693071 rows × 57 columns

General Info of Dataframe

#Mengetahui jumlah kolom, serta tipe data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 693071 entries, 0 to 693070
Data columns (total 57 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           693071 non-null  object 
 1   timestamp                    693071 non-null  float64
 2   hour                         693071 non-null  int64  
 3   day                          693071 non-null  int64  
 4   month                        693071 non-null  int64  
 5   datetime                     693071 non-null  object 
 6   timezone                     693071 non-null  object 
 7   source                       693071 non-null  object 
 8   destination                  693071 non-null  object 
 9   cab_type                     693071 non-null  object 
 10  product_id                   693071 non-null  object 
 11  name                         693071 non-null  object 
 12  price                        637976 non-null  float64
 13  distance                     693071 non-null  float64
 14  surge_multiplier             693071 non-null  float64
 15  latitude                     693071 non-null  float64
 16  longitude                    693071 non-null  float64
 17  temperature                  693071 non-null  float64
 18  apparentTemperature          693071 non-null  float64
 19  short_summary                693071 non-null  object 
 20  long_summary                 693071 non-null  object 
 21  precipIntensity              693071 non-null  float64
 22  precipProbability            693071 non-null  float64
 23  humidity                     693071 non-null  float64
 24  windSpeed                    693071 non-null  float64
 25  windGust                     693071 non-null  float64
 26  windGustTime                 693071 non-null  int64  
 27  visibility                   693071 non-null  float64
 28  temperatureHigh              693071 non-null  float64
 29  temperatureHighTime          693071 non-null  int64  
 30  temperatureLow               693071 non-null  float64
 31  temperatureLowTime           693071 non-null  int64  
 32  apparentTemperatureHigh      693071 non-null  float64
 33  apparentTemperatureHighTime  693071 non-null  int64  
 34  apparentTemperatureLow       693071 non-null  float64
 35  apparentTemperatureLowTime   693071 non-null  int64  
 36  icon                         693071 non-null  object 
 37  dewPoint                     693071 non-null  float64
 38  pressure                     693071 non-null  float64
 39  windBearing                  693071 non-null  int64  
 40  cloudCover                   693071 non-null  float64
 41  uvIndex                      693071 non-null  int64  
 42  visibility.1                 693071 non-null  float64
 43  ozone                        693071 non-null  float64
 44  sunriseTime                  693071 non-null  int64  
 45  sunsetTime                   693071 non-null  int64  
 46  moonPhase                    693071 non-null  float64
 47  precipIntensityMax           693071 non-null  float64
 48  uvIndexTime                  693071 non-null  int64  
 49  temperatureMin               693071 non-null  float64
 50  temperatureMinTime           693071 non-null  int64  
 51  temperatureMax               693071 non-null  float64
 52  temperatureMaxTime           693071 non-null  int64  
 53  apparentTemperatureMin       693071 non-null  float64
 54  apparentTemperatureMinTime   693071 non-null  int64  
 55  apparentTemperatureMax       693071 non-null  float64
 56  apparentTemperatureMaxTime   693071 non-null  int64  
dtypes: float64(29), int64(17), object(11)
memory usage: 301.4+ MB
df.dtypes.value_counts()
float64    29
int64      17
object     11
dtype: int64

General information of the data:

  1. 29 data float64
  2. 17 data int64
  3. 13 data object
#Mengecek kolom yang bertipe numerik
df.describe()
timestamp hour day month price distance surge_multiplier latitude longitude temperature ... precipIntensityMax uvIndexTime temperatureMin temperatureMinTime temperatureMax temperatureMaxTime apparentTemperatureMin apparentTemperatureMinTime apparentTemperatureMax apparentTemperatureMaxTime
count 6.930710e+05 693071.000000 693071.000000 693071.000000 637976.000000 693071.000000 693071.000000 693071.000000 693071.000000 693071.000000 ... 693071.000000 6.930710e+05 693071.000000 6.930710e+05 693071.000000 6.930710e+05 693071.000000 6.930710e+05 693071.000000 6.930710e+05
mean 1.544046e+09 11.619137 17.794365 11.586684 16.545125 2.189430 1.013870 42.338172 -71.066151 39.584388 ... 0.037374 1.544044e+09 33.457774 1.544042e+09 45.261313 1.544047e+09 29.731002 1.544048e+09 41.997343 1.544048e+09
std 6.891925e+05 6.948114 9.982286 0.492429 9.324359 1.138937 0.091641 0.047840 0.020302 6.726084 ... 0.055214 6.912028e+05 6.467224 6.901954e+05 5.645046 6.901353e+05 7.110494 6.871862e+05 6.936841 6.910777e+05
min 1.543204e+09 0.000000 1.000000 11.000000 2.500000 0.020000 1.000000 42.214800 -71.105400 18.910000 ... 0.000000 1.543162e+09 15.630000 1.543122e+09 33.510000 1.543154e+09 11.810000 1.543136e+09 28.950000 1.543187e+09
25% 1.543444e+09 6.000000 13.000000 11.000000 9.000000 1.280000 1.000000 42.350300 -71.081000 36.450000 ... 0.000000 1.543421e+09 30.170000 1.543399e+09 42.570000 1.543439e+09 27.760000 1.543399e+09 36.570000 1.543439e+09
50% 1.543737e+09 12.000000 17.000000 12.000000 13.500000 2.160000 1.000000 42.351900 -71.063100 40.490000 ... 0.000400 1.543770e+09 34.240000 1.543727e+09 44.680000 1.543788e+09 30.130000 1.543745e+09 40.950000 1.543788e+09
75% 1.544828e+09 18.000000 28.000000 12.000000 22.500000 2.920000 1.000000 42.364700 -71.054200 43.580000 ... 0.091600 1.544807e+09 38.880000 1.544789e+09 46.910000 1.544814e+09 35.710000 1.544789e+09 44.120000 1.544818e+09
max 1.545161e+09 23.000000 30.000000 12.000000 97.500000 7.860000 3.000000 42.366100 -71.033000 57.220000 ... 0.145900 1.545152e+09 43.100000 1.545192e+09 57.870000 1.545109e+09 40.050000 1.545134e+09 57.200000 1.545109e+09

8 rows × 46 columns

Checking Duplicate Data

df.duplicated().sum()
0

Tidak terdapat data yang duplikat

Filtering Uber and Lyft Data

df_uber = df[(df['cab_type']=='Uber')]
df_lyft = df[(df['cab_type']=='Lyft')]

D. Data Cleaning

Checking Null Values, Filling Missing Data

df.isna().sum()
id                                 0
timestamp                          0
hour                               0
day                                0
month                              0
datetime                           0
timezone                           0
source                             0
destination                        0
cab_type                           0
product_id                         0
name                               0
price                          55095
distance                           0
surge_multiplier                   0
latitude                           0
longitude                          0
temperature                        0
apparentTemperature                0
short_summary                      0
long_summary                       0
precipIntensity                    0
precipProbability                  0
humidity                           0
windSpeed                          0
windGust                           0
windGustTime                       0
visibility                         0
temperatureHigh                    0
temperatureHighTime                0
temperatureLow                     0
temperatureLowTime                 0
apparentTemperatureHigh            0
apparentTemperatureHighTime        0
apparentTemperatureLow             0
apparentTemperatureLowTime         0
icon                               0
dewPoint                           0
pressure                           0
windBearing                        0
cloudCover                         0
uvIndex                            0
visibility.1                       0
ozone                              0
sunriseTime                        0
sunsetTime                         0
moonPhase                          0
precipIntensityMax                 0
uvIndexTime                        0
temperatureMin                     0
temperatureMinTime                 0
temperatureMax                     0
temperatureMaxTime                 0
apparentTemperatureMin             0
apparentTemperatureMinTime         0
apparentTemperatureMax             0
apparentTemperatureMaxTime         0
dtype: int64

Terdapat 55095 data missing value pada kolom price

#Filling missing data in "price" column by it's median
df['price'].fillna(df['price'].median(), inplace=True)

Missing values tidak di drop karena terdapat pada kolom price (pada objective kita ingin memprediksi bagaimana variabel lain mempengaruhi price). Jadi, untuk memperoleh hasil yang lebih akurat, missing values tidak di drop tetapi diisi dengan menggunakan median.

E. Explorasi Data

Rata-rata dan Standar Deviasi Harga Uber vs Lyft

  1. Menghitung rata-rata dan standar deviasi dari Uber dan Lyft
mean_uber = df_uber['price'].mean()
stdev_uber = df_uber['price'].std()

mean_lyft = df_lyft['price'].mean()
stdev_lyft = df_lyft['price'].std()
  1. Membuat Data Visualization dari data di atas
data = {"Uber":[df_uber['price'].mean(), df_lyft['price'].mean()],
        "Lyft":[df_lyft['price'].mean(), df_lyft['price'].std(),]
        };

index = ["Mean", "Standard Deviation"];     

dataFrame = pd.DataFrame(data=data, index=index);

dataFrame.plot.bar(rot=0,title="Perbandingan Rata-rata dan Standar Deviasi dari Harga Uber vs Lyft", color=['crimson','steelblue'],figsize=(10,5));
plt.gca().legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show(block=True);

Rata-rata harga Uber lebih rendah dibandingkan Lyft, sedangkan standar deviasi Uber jauh lebih tinggi daripada Lyft

Rata-rata Harga Uber vs Lyft Per 3 Jam

  1. Menentukan nilai minimum dan maximum dari kolom "hour" pada data
df.hour.max()
23
df.hour.min()
0
  1. Mengelompokan jam menjadi 8 kelompok (per 3 jam)
df.loc[(0<=df['hour']) & (df['hour']<=3), 'hourGrouped'] = '00-03' 
df.loc[(4<=df['hour']) & (df['hour']<=6), 'hourGrouped'] = '04-06'
df.loc[(7<=df['hour']) & (df['hour']<=9), 'hourGrouped'] = '07-09'
df.loc[(10<=df['hour']) & (df['hour']<=12), 'hourGrouped'] = '10-12'
df.loc[(13<=df['hour']) & (df['hour']<=15), 'hourGrouped'] = '13-15'
df.loc[(16<=df['hour']) & (df['hour']<=18), 'hourGrouped'] = '16-18'
df.loc[(19<=df['hour']) & (df['hour']<=21), 'hourGrouped'] = '19-21'
df.loc[(22<=df['hour']) & (df['hour']<=24), 'hourGrouped'] = '22-24'
  1. Mengumpulkan data untuk pengguna Uber
dv11 = df[(df['hourGrouped']=='00-03') & (df['cab_type']=='Uber')]
dv12 = df[(df['hourGrouped']=='04-06') & (df['cab_type']=='Uber')]
dv13 = df[(df['hourGrouped']=='07-09') & (df['cab_type']=='Uber')]
dv14 = df[(df['hourGrouped']=='10-12') & (df['cab_type']=='Uber')]
dv15 = df[(df['hourGrouped']=='13-15') & (df['cab_type']=='Uber')]
dv16 = df[(df['hourGrouped']=='16-18') & (df['cab_type']=='Uber')]
dv17 = df[(df['hourGrouped']=='19-21') & (df['cab_type']=='Uber')]
dv18 = df[(df['hourGrouped']=='22-24') & (df['cab_type']=='Uber')]
  1. Mengumpulkan data untuk pengguna Lyft
dv19 = df[(df['hourGrouped']=='00-03') & (df['cab_type']=='Lyft')]
dv110 = df[(df['hourGrouped']=='04-06') & (df['cab_type']=='Lyft')]
dv111 = df[(df['hourGrouped']=='07-09') & (df['cab_type']=='Lyft')]
dv112 = df[(df['hourGrouped']=='10-12') & (df['cab_type']=='Lyft')]
dv113 = df[(df['hourGrouped']=='13-15') & (df['cab_type']=='Lyft')]
dv114 = df[(df['hourGrouped']=='16-18') & (df['cab_type']=='Lyft')]
dv115 = df[(df['hourGrouped']=='19-21') & (df['cab_type']=='Lyft')]
dv116 = df[(df['hourGrouped']=='22-24') & (df['cab_type']=='Lyft')]
  1. Membuat Data Visualization
data = {"Uber mean":[dv11['price'].mean(), dv12['price'].mean(), dv13['price'].mean(), 
                     dv14['price'].mean(), dv15['price'].mean(), dv16['price'].mean(),
                     dv17['price'].mean(),dv18['price'].mean()],
        "Lyft mean":[dv19['price'].mean(), dv110['price'].mean(), dv111['price'].mean(),
                     dv112['price'].mean(), dv113['price'].mean(), dv114['price'].mean(),
                     dv115['price'].mean(), dv116['price'].mean()]
        };

index = ["00-03", "04-06",'07-09','10-12','13-15','16-18','19-21','22-24'];     

dataFrame = pd.DataFrame(data=data, index=index);

dataFrame.plot.bar(rot=0,title="Rata-rata Harga Uber vs Lyft Per 3 Jam", color=['#f48668','#73a580'],figsize=(15,5));
plt.gca().legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show(block=True);

Berdasarkan visualisasi di atas, dapat disimpulkan bahwa rata-rata harga Uber maupun Lyft tidak berbeda jauh per 3 jam. Selain itu, visualisasi di atas juga menunjukkan bahwa rata-rata harga Lyft cenderung lebih tinggi dibandingkan Uber


Rata-rata Harga Uber vs Lyft Per 5 Derajat Temperatur

  1. Menentukan nilai minimum dan maximum dari kolom "temperature" pada data
df.temperature.min()
18.91
df.temperature.max()
57.22
  1. Mengelompokan temperature menjadi 9 kelompok (per 5 derajat)
df.loc[(15<=df['temperature']) & (df['temperature']<20), 'temperatureGrouped'] = '15-20' 
df.loc[(20<=df['temperature']) & (df['temperature']<25), 'temperatureGrouped'] = '20-25'
df.loc[(25<=df['temperature']) & (df['temperature']<30), 'temperatureGrouped'] = '25-30'
df.loc[(30<=df['temperature']) & (df['temperature']<35), 'temperatureGrouped'] = '30-35'
df.loc[(35<=df['temperature']) & (df['temperature']<40), 'temperatureGrouped'] = '35-40'
df.loc[(40<=df['temperature']) & (df['temperature']<45), 'temperatureGrouped'] = '40-45'
df.loc[(45<=df['temperature']) & (df['temperature']<50), 'temperatureGrouped'] = '45-50'
df.loc[(50<=df['temperature']) & (df['temperature']<55), 'temperatureGrouped'] = '50-55'
df.loc[(55<=df['temperature']) & (df['temperature']<60), 'temperatureGrouped'] = '55-60'
  1. Mengumpulkan data untuk pengguna Uber
dv21 = df[(df['temperatureGrouped']=='15-20') & (df['cab_type']=='Uber')]
dv22 = df[(df['temperatureGrouped']=='20-25') & (df['cab_type']=='Uber')]
dv23 = df[(df['temperatureGrouped']=='25-30') & (df['cab_type']=='Uber')]
dv24 = df[(df['temperatureGrouped']=='30-35') & (df['cab_type']=='Uber')]
dv25 = df[(df['temperatureGrouped']=='35-40') & (df['cab_type']=='Uber')]
dv26 = df[(df['temperatureGrouped']=='40-45') & (df['cab_type']=='Uber')]
dv27 = df[(df['temperatureGrouped']=='45-50') & (df['cab_type']=='Uber')]
dv28 = df[(df['temperatureGrouped']=='50-55') & (df['cab_type']=='Uber')]
dv29 = df[(df['temperatureGrouped']=='553. Mengumpulkan data untuk pengguna Uber-60') & (df['cab_type']=='Uber')]
  1. Mengumpulkan data untuk pengguna Lyft
dv210 = df[(df['temperatureGrouped']=='15-20') & (df['cab_type']=='Lyft')]
dv211 = df[(df['temperatureGrouped']=='20-25') & (df['cab_type']=='Lyft')]
dv212 = df[(df['temperatureGrouped']=='25-30') & (df['cab_type']=='Lyft')]
dv213 = df[(df['temperatureGrouped']=='30-35') & (df['cab_type']=='Lyft')]
dv214 = df[(df['temperatureGrouped']=='35-40') & (df['cab_type']=='Lyft')]
dv215 = df[(df['temperatureGrouped']=='40-45') & (df['cab_type']=='Lyft')]
dv216 = df[(df['temperatureGrouped']=='45-50') & (df['cab_type']=='Lyft')]
dv217 = df[(df['temperatureGrouped']=='50-55') & (df['cab_type']=='Lyft')]
dv218 = df[(df['temperatureGrouped']=='55-60') & (df['cab_type']=='Lyft')]
  1. Membuat Data Visualization
data = {"Uber mean":[dv21['price'].mean(), dv22['price'].mean(), dv23['price'].mean(), 
                     dv24['price'].mean(), dv25['price'].mean(), dv26['price'].mean(),
                     dv27['price'].mean(),dv28['price'].mean(),dv29['price'].mean()],
        "Lyft mean":[dv210['price'].mean(), dv211['price'].mean(),dv212['price'].mean(), 
                     dv213['price'].mean(), dv214['price'].mean(),dv215['price'].mean(), 
                     dv216['price'].mean(),dv217['price'].mean(),dv218['price'].mean(),  ]
        };

index = ["15-20", "20-25",'25-30','30-35','35-40','40-45','45-50','50-55','55-60'];     

dataFrame = pd.DataFrame(data=data, index=index);

dataFrame.plot.bar(rot=0,title="Rata-rata Harga Uber vs Lyft Per 5 Derajat Temperatur", color=['darkorange','royalblue'],figsize=(15,5));
plt.gca().legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show(block=True);

Berdasarkan visualisasi di atas, dapat disimpulkan bahwa rata-rata harga Uber maupun Lyft tidak berbeda jauh jika per 5 derajat kenaikan temperatur.


Proporsi Source dan Destination dari Uber vs Lyft

  1. Filter kolom "Source" dan "Destination" dari data Uber dan Lyft
dv41 = df_uber[['source','destination']]
dv42 = df_lyft[['source','destination']]
  1. Memilih color palette untuk Uber dan Lyft
colors1 = sns.color_palette("Spectral",12)
colors2 = sns.color_palette("rainbow",12)
  1. Grouping dv41 dan dv42
dv411 = dv41.groupby(['source'],as_index=True).agg({'source':'count'}, index=False)
dv412 = dv41.groupby(['destination'],as_index=True).agg({'source':'count'}, index=False)
dv413 = dv42.groupby(['source'],as_index=True).agg({'source':'count'}, index=False)
dv414 = dv42.groupby(['destination'],as_index=True).agg({'source':'count'}, index=False)
  1. Membuat Data Visualization
dv411.plot(kind='pie',figsize=(12, 7),autopct='%1.4f%%',startangle=90,shadow=True,subplots=True,colors=colors1,
           textprops={'fontsize': 8},labels=dv411.index,legend=False,wedgeprops={'linewidth': 2.0, 'edgecolor': 'white'})
plt.title('Proporsi Source Uber', loc='center',size ='15')
plt.axis('off')
plt.show()
dv412.plot(kind='pie',figsize=(12, 7),autopct='%1.4f%%',startangle=90,shadow=True,subplots=True,colors=colors1,
           textprops={'fontsize': 8},labels=dv411.index,legend=False,wedgeprops={'linewidth': 2.0, 'edgecolor': 'white'})
plt.title('Proporsi Destination Uber', loc='center',size ='15')
plt.axis('off')
plt.show()
dv413.plot(kind='pie',figsize=(12, 7),autopct='%1.4f%%',startangle=90,shadow=True,subplots=True,colors=colors2,
           textprops={'fontsize': 8},labels=dv411.index,legend=False,wedgeprops={'linewidth': 2.0, 'edgecolor': 'white'})
plt.title('Proporsi Source Lyft', loc='center',size ='15')
plt.axis('off')
plt.show()
dv414.plot(kind='pie',figsize=(12, 7),autopct='%1.4f%%',startangle=90,shadow=True,subplots=True,colors=colors2,
           textprops={'fontsize': 8},labels=dv411.index,legend=False,wedgeprops={'linewidth': 2.0, 'edgecolor': 'white'})
plt.title('Proporsi Destination Lyft', loc='center',size ='15')
plt.axis('off')
plt.show()

Dari visualisasi-visualisasi di atas, dapat kita lihat bahwa penggunaan jasa Uber dan Lyft sudah cukup merata (sekitar 8,2 - 8.5 %) karena proporsi antara daerah source dan destination tidak berbeda signifikan satu dengan yang lainnya.


Distribusi dari Harga Uber vs Lyft

grid1 = sns.FacetGrid(df, col='cab_type', height=5, aspect=1.6)
grid1.map(sns.distplot, 'price',bins=50, color = 'c')
plt.subplots_adjust(top=0.85)
grid1.fig.suptitle('Distribusi dari Harga Uber vs Lyft')
/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
/usr/local/lib/python3.7/dist-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
Text(0.5, 0.98, 'Distribusi dari Harga Uber vs Lyft')

Selain menggunakan visualisasi di atas, digunakan juga Anderson Darling Test untuk menguji normalitas datanya agar didapatkan hasil yang lebih akurat

# Performing the test on the Uber Price
p_value = normal_ad(df_uber['price'])[1]
print('p-value Uber Price from the test Anderson-Darling test below 0.05 generally means non-normal:', p_value)

# Reporting the normality of the Uber Price
if p_value < 0.05:
    print('Harga Uber tidak normally distributed','\n')
else:
    print('Harga Uber normally distributed','\n')
    
# Performing the test on the Lyft Price    
p_value1 = normal_ad(df_lyft['price'])[1]
print('p-value Lyft Price from the test Anderson-Darling test below 0.05 generally means non-normal:', p_value1)

# Reporting the normality of the Lyft Price
if p_value1 < 0.05:
    print('Harga Lyft tidak normally distributed')
else:
    print('Harga Lyft normally distributed')
p-value Uber Price from the test Anderson-Darling test below 0.05 generally means non-normal: 0.0
Harga Uber tidak normally distributed 

p-value Lyft Price from the test Anderson-Darling test below 0.05 generally means non-normal: 0.0
Harga Lyft tidak normally distributed

Overall Correlations

  1. Drop kolom yang tidak diperlukan
df_cor = df.drop(['id','timestamp','day','month','datetime','timezone',
               'source','destination','cab_type','product_id','name',
               'latitude','longitude','apparentTemperature','precipProbability',
               'windGustTime','temperatureHighTime','temperatureLowTime',
               'apparentTemperatureHighTime','apparentTemperatureHighTime',
               'visibility.1','uvIndexTime','short_summary','long_summary',
               'icon','sunriseTime','sunsetTime','moonPhase',
               'temperatureMinTime','temperatureMaxTime','apparentTemperatureMax',
               'apparentTemperatureMaxTime','apparentTemperatureMinTime'], axis=1)
df_cor
hour price distance surge_multiplier temperature precipIntensity humidity windSpeed windGust visibility ... windBearing cloudCover uvIndex ozone precipIntensityMax temperatureMin temperatureMax apparentTemperatureMin hourGrouped temperatureGrouped
0 9 5.0 0.44 1.0 42.34 0.0000 0.68 8.66 9.17 10.000 ... 57 0.72 0 303.8 0.1276 39.89 43.68 33.73 07-09 40-45
1 2 11.0 0.44 1.0 43.58 0.1299 0.94 11.98 11.98 4.786 ... 90 1.00 0 291.1 0.1300 40.49 47.30 36.20 00-03 40-45
2 1 7.0 0.44 1.0 38.33 0.0000 0.75 7.33 7.33 10.000 ... 240 0.03 0 315.7 0.1064 35.36 47.55 31.04 00-03 35-40
3 4 26.0 0.44 1.0 34.38 0.0000 0.73 5.28 5.28 10.000 ... 310 0.00 0 291.1 0.0000 34.67 45.03 30.30 04-06 30-35
4 3 9.0 0.44 1.0 37.44 0.0000 0.70 9.14 9.14 10.000 ... 303 0.44 0 347.7 0.0001 33.10 42.18 29.11 00-03 35-40
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
693066 23 13.0 1.00 1.0 37.05 0.0000 0.74 2.34 2.87 9.785 ... 133 0.31 0 271.5 0.0000 31.42 44.76 27.77 22-24 35-40
693067 23 9.5 1.00 1.0 37.05 0.0000 0.74 2.34 2.87 9.785 ... 133 0.31 0 271.5 0.0000 31.42 44.76 27.77 22-24 35-40
693068 23 13.5 1.00 1.0 37.05 0.0000 0.74 2.34 2.87 9.785 ... 133 0.31 0 271.5 0.0000 31.42 44.76 27.77 22-24 35-40
693069 23 27.0 1.00 1.0 37.05 0.0000 0.74 2.34 2.87 9.785 ... 133 0.31 0 271.5 0.0000 31.42 44.76 27.77 22-24 35-40
693070 23 10.0 1.00 1.0 37.05 0.0000 0.74 2.34 2.87 9.785 ... 133 0.31 0 271.5 0.0000 31.42 44.76 27.77 22-24 35-40

693071 rows × 27 columns

fig, ax = plt.subplots(figsize=(30,30))
sns.heatmap(df_cor.corr(), annot=True, fmt='.2%',annot_kws={"size": 10},cmap="inferno")
plt.title("Korelasi Antar Variabel", loc='center',size ='15')
plt.show()