Hands-On Project: Kaggle New York City Taxi Data Analysis

Author: posted by 百香果資料 in Sports. Date: 2018-08-19

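Background

New York is full of one-way streets and narrow alleys, and at any given moment the number of pedestrians is almost impossible to count, never mind the cars, motorcycles, and bicycles clogging the roads. In a city like this you can get from one place to another by taxi, Uber, and so on. You don't have to stress about traffic or pedestrians yourself, but that doesn't mean you will arrive on time. So you want your driver to take the shortest route, and by shortest we mean shortest in time. If route A is X kilometers longer than route B but gets you there Y minutes faster, you take route A. To know which route is best, we need to be able to predict how long a trip will last along a particular route. Our goal, then, is to predict the duration of each trip in the test dataset, given its start and end coordinates.

The New York City taxi analysis is one of Kaggle's best-known competition problems and a classic practice case for people new to data analysis. The project data can be downloaded from the Kaggle website or from GitHub. Let's start analyzing the data.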

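Python libraries

Pandas, NumPy, XGBoost, Seaborn

pandas is used to process the data. NumPy is the fundamental package for scientific computing in Python. XGBoost is the gradient-boosting algorithm used for the final prediction (a regression, since trip duration is continuous). Seaborn is a data visualization tool built on top of matplotlib.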

Importing the libraries

from datetime import datetime
from math import radians, cos, sin, asin, sqrt

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Default figure size for every plot in this article
plt.rcParams['figure.figsize'] = [16, 10]

Loading the data

Load the training and test data with pandas:

train = pd.read_csv('input/new-york-city-taxi-with-osrm/train.csv')
test = pd.read_csv('input/new-york-city-taxi-with-osrm/test.csv')

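Inspecting the data

Column descriptions:

id - unique ID of each trip
vendor_id - ID of the provider associated with the trip
pickup_datetime - date and time of the pickup
dropoff_datetime - date and time the meter was stopped
passenger_count - number of passengers in the vehicle (entered by the driver)
pickup_longitude - longitude of the pickup
pickup_latitude - latitude of the pickup
dropoff_longitude - longitude of the dropoff
dropoff_latitude - latitude of the dropoff
store_and_fwd_flag - whether the trip record was stored and forwarded (or sent directly): Y = store and forward, N = not stored
trip_duration - duration of the trip in seconds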

pd.set_option('display.float_format', lambda x: '%.3f' % x)
train.head()

id         vendor_id  pickup_datetime      dropoff_datetime     passenger_count  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  store_and_fwd_flag  trip_duration
id2875421  2          2016-03-14 17:24:55  2016-03-14 17:32:30  1                -73.982           40.768           -73.965            40.766            N                   455
id2377394  1          2016-06-12 00:43:35  2016-06-12 00:54:38  1                -73.980           40.739           -73.999            40.731            N                   663
id3858529  2          2016-01-19 11:35:24  2016-01-19 12:10:48  1                -73.979           40.764           -74.005            40.710            N                   2124
id3504673  2          2016-04-06 19:32:31  2016-04-06 19:39:40  1                -74.010           40.720           -74.012            40.707            N                   429
id2181028  2          2016-03-26 13:30:55  2016-03-26 13:38:10  1                -73.973           40.793           -73.973            40.783            N                   435

train.describe()

       vendor_id    passenger_count  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  trip_duration
count  1458644.000  1458644.000      1458644.000       1458644.000      1458644.000        1458644.000       1458644.000
mean   1.535        1.665            -73.973           40.751           -73.973            40.752            959.492
std    0.499        1.314            0.071             0.033            0.071              0.036             5237.432
min    1.000        0.000            -121.933          34.360           -121.933           32.181            1.000
25%    1.000        1.000            -73.992           40.737           -73.991            40.736            397.000
50%    2.000        1.000            -73.982           40.754           -73.980            40.755            662.000
75%    2.000        2.000            -73.967           40.768           -73.963            40.770            1075.000
max    2.000        9.000            -61.336           51.881           -61.336            43.921            3526282.000

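We can see that trip_duration ranges from a minimum of 1 second to a maximum of 3,526,282 seconds (about 980 hours). Nobody rides a taxi for that long; the fare would be astronomical. And a 1-second trip gets you nowhere. Clearly there are outliers we need to deal with.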
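As a quick sanity check on those numbers, the arithmetic can be sketched as follows. The three constants are copied from the train.describe() output; the variable names are only illustrative.

```python
# Values copied from the train.describe() output (trip_duration column).
max_seconds = 3526282
mean_seconds = 959.492
std_seconds = 5237.432

# Convert the longest trip to hours and days.
max_hours = max_seconds / 3600
max_days = max_hours / 24

# How many standard deviations above the mean the longest trip sits.
z_score = (max_seconds - mean_seconds) / std_seconds

print(f"longest trip: {max_hours:.1f} hours ({max_days:.1f} days)")
print(f"distance from the mean: {z_score:.0f} standard deviations")
```

The longest trip lies hundreds of standard deviations above the mean, which is why a simple standard-deviation cut-off is enough to remove it.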


train.info()

Basic information about the training set:

RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 11 columns):
id                  1458644 non-null object
vendor_id           1458644 non-null int64
pickup_datetime     1458644 non-null object
dropoff_datetime    1458644 non-null object
passenger_count     1458644 non-null int64
pickup_longitude    1458644 non-null float64
pickup_latitude     1458644 non-null float64
dropoff_longitude   1458644 non-null float64
dropoff_latitude    1458644 non-null float64
store_and_fwd_flag  1458644 non-null object
trip_duration       1458644 non-null int64
dtypes: float64(4), int64(3), object(4)
memory usage: 122.4+ MB

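Handling trip-duration outliers

As mentioned above, the trip_duration variable contains outliers, notably the 980-hour maximum and the 1-second minimum. I decided to exclude data lying more than 2 standard deviations from the mean. It may be worth investigating how cutting at 4 standard deviations instead would affect the final result.

m = np.mean(train['trip_duration'])
s = np.std(train['trip_duration'])
# Keep only trips within 2 standard deviations of the mean
train = train[train['trip_duration'] <= m + 2*s]
train = train[train['trip_duration'] >= m - 2*s]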
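To see what this filter does and does not remove, here is a minimal sketch on synthetic data. The helper name filter_by_std and the generated durations are assumptions made for illustration; they are not part of the original kernel.

```python
import numpy as np
import pandas as pd


def filter_by_std(df, column, n_std=2.0):
    """Keep rows whose `column` lies within n_std standard deviations of the mean.

    Hypothetical helper; returns the filtered frame plus the bounds used.
    """
    m = df[column].mean()
    s = df[column].std(ddof=0)  # ddof=0 matches np.std, as used in the article
    lower, upper = m - n_std * s, m + n_std * s
    kept = df[df[column].between(lower, upper)]  # inclusive on both ends
    return kept, lower, upper


# Synthetic durations in seconds: 998 ordinary trips, one 1-second trip,
# and one trip as long as the maximum seen in train.describe().
rng = np.random.default_rng(0)
durations = np.concatenate([rng.integers(300, 1500, size=998), [1, 3526282]])
demo = pd.DataFrame({"trip_duration": durations})

filtered, lower, upper = filter_by_std(demo, "trip_duration")
print(f"bounds: [{lower:.0f}, {upper:.0f}]")
print(f"rows kept: {len(filtered)} of {len(demo)}")
print("1-second trip kept:", bool((filtered["trip_duration"] == 1).any()))
```

Because a single huge value inflates the standard deviation, the lower bound comes out negative and only the long trip is dropped. With the figures from train.describe() (mean about 959 s, standard deviation about 5237 s) the lower bound is also negative, so very short trips would need a separate rule if they are to be removed.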

Cleaning the latitude and longitude data

We restrict the coordinates to New York City and drop every point that falls outside its borders. In coordinates, the city borders look like this:

city_long_border = (-74.03, -73.75)
city_lat_border = (40.63, 40.85)

# Pickup point must lie inside the bounding box
train = train[train['pickup_longitude'] <= -73.75]
train = train[train['pickup_longitude'] >= -74.03]
train = train[train['pickup_latitude'] <= 40.85]
train = train[train['pickup_latitude'] >= 40.63]
# Dropoff point must lie inside the bounding box
train = train[train['dropoff_longitude'] <= -73.75]
train = train[train['dropoff_longitude'] >= -74.03]
train = train[train['dropoff_latitude'] <= 40.85]
train = train[train['dropoff_latitude'] >= 40.63]
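The same bounding-box filter can be written more compactly with a boolean mask. This is only a sketch: the helper within_city and the three sample rows are made up for illustration, with coordinates borrowed from the tables above.

```python
import pandas as pd

CITY_LONG_BORDER = (-74.03, -73.75)
CITY_LAT_BORDER = (40.63, 40.85)


def within_city(df, prefix):
    """Boolean mask: True where the `prefix` point (pickup/dropoff) is inside the box."""
    lon_ok = df[f"{prefix}_longitude"].between(*CITY_LONG_BORDER)  # inclusive bounds
    lat_ok = df[f"{prefix}_latitude"].between(*CITY_LAT_BORDER)
    return lon_ok & lat_ok


demo = pd.DataFrame({
    "pickup_longitude": [-73.982, -121.933, -73.980],
    "pickup_latitude": [40.768, 34.360, 40.739],
    "dropoff_longitude": [-73.965, -121.933, -61.336],
    "dropoff_latitude": [40.766, 32.181, 43.921],
})

# A trip is kept only if both its endpoints are inside the city box.
mask = within_city(demo, "pickup") & within_city(demo, "dropoff")
cleaned = demo[mask]
print(cleaned)
```

Building one mask and indexing once avoids creating eight intermediate copies of the frame, and it keeps the border values in a single place.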

Date processing

As the last data-preparation step, we need to change the format of the date variables:

train['pickup_datetime'] = pd.to_datetime(train.pickup_datetime)
test['pickup_datetime'] = pd.to_datetime(test.pickup_datetime)
train.loc[:, 'pickup_date'] = train['pickup_datetime'].dt.date
test.loc[:, 'pickup_date'] = test['pickup_datetime'].dt.date
# dropoff_datetime exists only in the training data
train['dropoff_datetime'] = pd.to_datetime(train.dropoff_datetime)
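Once a column is a real datetime, the .dt accessor exposes its parts. The sketch below uses two timestamps from the train.head() output. The column names Hour, dayofweek, and Month match the ones used in the group-by plots later in this article, but how the original kernel built them is not shown here, so treat this as an assumption.

```python
import pandas as pd

demo = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35"],
})

# Parse the strings into datetime64 values.
demo["pickup_datetime"] = pd.to_datetime(demo["pickup_datetime"])

# Calendar date only (Python date objects), as in the article.
demo["pickup_date"] = demo["pickup_datetime"].dt.date

# Date parts that are convenient for grouping.
demo["Hour"] = demo["pickup_datetime"].dt.hour
demo["dayofweek"] = demo["pickup_datetime"].dt.dayofweek  # Monday = 0, Sunday = 6
demo["Month"] = demo["pickup_datetime"].dt.month

print(demo)
```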

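Data visualization and analysis

1. A simple histogram of trip duration

Note: the data is placed in 100 bins. Change this number to get a feel for how the data is grouped. Put simply, binning means taking the maximum and minimum of the data, subtracting one from the other to get the range, dividing the range by the number of bins to get the bin width, and grouping the data points into those intervals.

plt.hist(train['trip_duration'].values, bins=100)
plt.xlabel('trip_duration')
plt.ylabel('number of train records')
plt.show()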
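The binning procedure described above can be sketched by hand and compared with NumPy's own histogram. The ten sample values are invented for illustration.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 5, 8, 9, 10, 10, 11], dtype=float)
n_bins = 5

# Step 1: range of the data divided by the number of bins gives the bin width.
width = (values.max() - values.min()) / n_bins

# Step 2: bin edges start at the minimum and advance by one width each.
edges = values.min() + width * np.arange(n_bins + 1)

# Step 3: assign each value to a bin; the maximum belongs to the last bin.
idx = ((values - values.min()) // width).astype(int)
idx = np.minimum(idx, n_bins - 1)
counts = np.bincount(idx, minlength=n_bins)

# Reference result from NumPy (plt.hist uses the same equal-width scheme).
np_counts, np_edges = np.histogram(values, bins=n_bins)

print("manual counts:", counts)
print("numpy counts: ", np_counts)
print("edges:", edges)
```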

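Here we apply a data transformation to check whether any notable pattern shows up once the data is transformed (with a log transform, for example). In this case it makes sense to apply a log transform to the trip duration.

train['log_trip_duration'] = np.log(train['trip_duration'].values + 1)
plt.hist(train['log_trip_duration'].values, bins=100)
plt.xlabel('log(trip_duration)')
plt.ylabel('number of train records')
plt.show()

# sns.distplot is deprecated in recent seaborn releases;
# histplot with kde=True draws a comparable histogram plus density curve
sns.histplot(train["log_trip_duration"], bins=100, kde=True)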
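A small sketch of why the log transform helps and how to undo it. The five durations are taken from the tables above (the minimum, three ordinary trips, and the maximum). np.log1p and np.expm1 are NumPy's built-in versions of log(x + 1) and its inverse.

```python
import numpy as np

# Durations in seconds, spanning more than six orders of magnitude.
durations = np.array([1, 455, 663, 2124, 3526282], dtype=float)

# The article's transform and NumPy's equivalent built-in.
log_manual = np.log(durations + 1)
log_builtin = np.log1p(durations)

# The inverse transform recovers the original seconds,
# which is needed when a model predicts on the log scale.
recovered = np.expm1(log_builtin)

print("log scale:", np.round(log_builtin, 3))
print("raw range:", durations.max() - durations.min())
print("log range:", round(float(log_builtin.max() - log_builtin.min()), 3))
```

On the log scale the whole range fits within a span of about 15 units, so the bulk of the distribution is no longer squeezed into the first histogram bin.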

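It is also interesting to look at the number of trips over time. This can reveal seasonality and trends in the data, point out significant outliers (if the dataset has not already been cleaned), and expose missing values (again, if the dataset has not already been checked and cleaned). To do this we simply plot a time-series line chart of both the test and the training data, looking not only for trends or seasonality but also at whether the two datasets follow the same pattern.

plt.plot(train.groupby('pickup_date').count()[['id']], 'o-', label='train')
plt.plot(test.groupby('pickup_date').count()[['id']], 'o-', label='test')
plt.title('Trips over Time.')
plt.legend(loc=0)
plt.ylabel('Trips')
plt.show()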

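Clearly the test and training datasets follow a very similar shape, as expected. Two things stand out at first glance. Around late January or early February the number of trips drops sharply. About four months later there is a second, somewhat milder drop. The first drop could be seasonal: it is winter in New York, so you would expect fewer trips (who wants to drive around when it is nearly freezing outside?). That explanation seems unlikely, however, because the drop is confined to a single day or a few days. These "outliers" are worth noting.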

Next, let's see how much the two vendors differ in average trip time:

import warnings
warnings.filterwarnings("ignore")

plot_vendor = train.groupby('vendor_id')['trip_duration'].mean()
plt.subplots(1, 1, figsize=(17, 10))
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.barplot(x=plot_vendor.index, y=plot_vendor.values)
# Zoom in on the y-axis so the small difference is visible
plt.ylim(800, 840)
plt.title('Time per Vendor')
plt.ylabel('Time in Seconds')

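So it looks like there is not much difference in trip time between the two vendors. One would think that knowing the fastest route from A to B is no longer much of a secret, and that it is more a trick of the trade than intellectual property. But something doesn't look quite right, so let's use another feature to see whether the average trip time differs significantly: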

snwflag = train.groupby('store_and_fwd_flag')['trip_duration'].mean()
plt.subplots(1, 1, figsize=(17, 10))
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.barplot(x=snwflag.index, y=snwflag.values)
plt.ylim(0, 1100)
plt.title('Time per store_and_fwd_flag')
plt.ylabel('Time in Seconds')

So there is no evident significant difference that could be explained by the number of passengers in the vehicle on any given trip.

2. Pickup locations

We use the New York city-map border coordinates mentioned earlier in the kernel to set up the canvas on which the coordinate points are drawn. To show the actual coordinates, a simple scatter plot is used:

city_long_border = (-74.03, -73.75)
city_lat_border = (40.63, 40.85)

fig, ax = plt.subplots(ncols=2, sharex=True, sharey=True)
# Plot only the first 100,000 points of each dataset to keep the figure light
ax[0].scatter(train['pickup_longitude'].values[:100000], train['pickup_latitude'].values[:100000],
              color='blue', s=1, label='train', alpha=0.1)
ax[1].scatter(test['pickup_longitude'].values[:100000], test['pickup_latitude'].values[:100000],
              color='green', s=1, label='test', alpha=0.1)
fig.suptitle('Train and test area complete overlap.')
ax[0].legend(loc=0)
ax[0].set_ylabel('latitude')
ax[0].set_xlabel('longitude')
ax[1].set_xlabel('longitude')
ax[1].legend(loc=0)
plt.ylim(city_lat_border)
plt.xlim(city_long_border)
plt.show()

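From these two plots we can see that the pickup locations are very similar. The notable difference is simply that the train dataset has more data points.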

3. Distance and direction

The next part is quite interesting. Thanks to Beluga's post, we can determine the distance and direction of a given trip from its pickup and dropoff coordinates. To do so, I wrote three functions:

def haversine_array(lat1, lng1, lat2, lng2):
    # Great-circle distance in km between two points given in degrees
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    AVG_EARTH_RADIUS = 6371  # in km
    lat = lat2 - lat1
    lng = lng2 - lng1
    d = np.sin(lat * 0.5) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(lng * 0.5) ** 2
    h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
    return h

def dummy_manhattan_distance(lat1, lng1, lat2, lng2):
    # Sum of the east-west leg and the north-south leg, each measured with haversine
    a = haversine_array(lat1, lng1, lat1, lng2)
    b = haversine_array(lat1, lng1, lat2, lng1)
    return a + b

def bearing_array(lat1, lng1, lat2, lng2):
    # Initial compass bearing in degrees from point 1 to point 2
    AVG_EARTH_RADIUS = 6371  # in km
    lng_delta_rad = np.radians(lng2 - lng1)
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    y = np.sin(lng_delta_rad) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(lng_delta_rad)
    return np.degrees(np.arctan2(y, x))
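To get a feel for what these functions return, here is a self-contained sketch that repeats compact versions of them and evaluates a few easy cases: one degree of latitude, due-north and due-east bearings, and the first trip from train.head() with its coordinates rounded to three decimals. The test points are chosen for illustration only.

```python
import numpy as np

AVG_EARTH_RADIUS = 6371  # km


def haversine_array(lat1, lng1, lat2, lng2):
    """Great-circle distance in km; inputs in degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    d = (np.sin((lat2 - lat1) * 0.5) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) * 0.5) ** 2)
    return 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))


def dummy_manhattan_distance(lat1, lng1, lat2, lng2):
    """East-west leg plus north-south leg, each measured with haversine."""
    return (haversine_array(lat1, lng1, lat1, lng2)
            + haversine_array(lat1, lng1, lat2, lng1))


def bearing_array(lat1, lng1, lat2, lng2):
    """Initial bearing in degrees: 0 = north, 90 = east, negative = westward."""
    lng_delta = np.radians(lng2 - lng1)
    lat1, lat2 = np.radians(lat1), np.radians(lat2)
    y = np.sin(lng_delta) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(lng_delta)
    return np.degrees(np.arctan2(y, x))


# One degree of latitude along a meridian: 6371 * pi / 180 km.
one_degree = haversine_array(0.0, 0.0, 1.0, 0.0)

# Bearings for simple moves starting at the equator.
north = bearing_array(0.0, 0.0, 1.0, 0.0)
east = bearing_array(0.0, 0.0, 0.0, 1.0)

# First trip in train.head(), coordinates rounded to three decimals.
pickup = (40.768, -73.982)
dropoff = (40.766, -73.965)
trip_km = haversine_array(*pickup, *dropoff)
trip_manhattan_km = dummy_manhattan_distance(*pickup, *dropoff)
trip_bearing = bearing_array(*pickup, *dropoff)

print(f"1 degree of latitude: {one_degree:.2f} km")
print(f"bearing north: {north:.1f}, east: {east:.1f}")
print(f"trip: {trip_km:.2f} km direct, {trip_manhattan_km:.2f} km grid, "
      f"bearing {trip_bearing:.1f} degrees")
```

The grid-style distance is never shorter than the direct great-circle distance, which is the property that makes it a rough stand-in for driving along a street grid.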

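Applying these functions to the test and train data, we can compute the haversine distance, which is the great-circle distance between two points on a sphere given their longitudes and latitudes. We can then compute an approximate Manhattan (grid) distance for the trip. Finally, with some handy trigonometry, we compute the direction (bearing) of the trip. These values are stored as variables in the respective datasets. The next step I decided on was to create neighborhoods from the data, such as Soho or the Upper East Side, and display them.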

train.loc[:, 'distance_haversine'] = haversine_array(train['pickup_latitude'].values, train['pickup_longitude'].values, train['dropoff_latitude'].values, train['dropoff_longitude'].values)
test.loc[:, 'distance_haversine'] = haversine_array(test['pickup_latitude'].values, test['pickup_longitude'].values, test['dropoff_latitude'].values, test['dropoff_longitude'].values)

train.loc[:, 'distance_dummy_manhattan'] = dummy_manhattan_distance(train['pickup_latitude'].values, train['pickup_longitude'].values, train['dropoff_latitude'].values, train['dropoff_longitude'].values)
test.loc[:, 'distance_dummy_manhattan'] = dummy_manhattan_distance(test['pickup_latitude'].values, test['pickup_longitude'].values, test['dropoff_latitude'].values, test['dropoff_longitude'].values)

train.loc[:, 'direction'] = bearing_array(train['pickup_latitude'].values, train['pickup_longitude'].values, train['dropoff_latitude'].values, train['dropoff_longitude'].values)
test.loc[:, 'direction'] = bearing_array(test['pickup_latitude'].values, test['pickup_longitude'].values, test['dropoff_latitude'].values, test['dropoff_longitude'].values)

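4. Creating "neighborhoods"

One might think a map at hand is necessary, but that is not the case. This works intuitively, because KMeans groups the data points into their own neighborhoods. It is fairly straightforward: NumPy helps build a vertically stacked array of pickup and dropoff coordinates, and sklearn's MiniBatchKMeans module makes it easy to set the parameters for creating the clusters. Preparing the data takes three steps: build the coordinate stack, configure the KMeans clustering parameters, and create the actual clusters:

from sklearn.cluster import MiniBatchKMeans

# Stack pickup and dropoff coordinates into one (2n, 2) array
coords = np.vstack((train[['pickup_latitude', 'pickup_longitude']].values,
                    train[['dropoff_latitude', 'dropoff_longitude']].values))

# Fit on a random sample of 500,000 points to keep training fast
sample_ind = np.random.permutation(len(coords))[:500000]
kmeans = MiniBatchKMeans(n_clusters=100, batch_size=10000).fit(coords[sample_ind])

# Assign every pickup and dropoff point to its nearest cluster
train.loc[:, 'pickup_cluster'] = kmeans.predict(train[['pickup_latitude', 'pickup_longitude']].values)
train.loc[:, 'dropoff_cluster'] = kmeans.predict(train[['dropoff_latitude', 'dropoff_longitude']].values)
test.loc[:, 'pickup_cluster'] = kmeans.predict(test[['pickup_latitude', 'pickup_longitude']].values)
test.loc[:, 'dropoff_cluster'] = kmeans.predict(test[['dropoff_latitude', 'dropoff_longitude']].values)

fig, ax = plt.subplots(ncols=1, nrows=1)
ax.scatter(train.pickup_longitude.values[:500000], train.pickup_latitude.values[:500000], s=10, lw=0,
           c=train.pickup_cluster[:500000].values, cmap='autumn', alpha=0.2)
ax.set_xlim(city_long_border)
ax.set_ylim(city_lat_border)
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.show()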

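This gives a nice visual representation of the KMeans clustering algorithm (we used 100 clusters, but feel free to play with this parameter to see how it changes the result). The clusters effectively carve Manhattan into distinct neighborhoods, visible as the boundaries between colors. This should be somewhat intuitive, since traveling from point A to point B differs from one part of New York to another.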
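The three steps can be tried on a small synthetic example before running them on the full dataset. The blob centers, their spread, and the choice of 3 clusters below are assumptions made for illustration; only the MiniBatchKMeans calls mirror the article.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(42)

# Three made-up "neighborhood" centers as (latitude, longitude) pairs.
centers = np.array([[40.72, -74.00], [40.77, -73.97], [40.65, -73.78]])

# Step 1: build the coordinate stack (500 noisy points around each center).
coords = np.vstack([c + rng.normal(scale=0.005, size=(500, 2)) for c in centers])

# Step 2: configure the clustering parameters.
kmeans = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=3, random_state=0)

# Step 3: fit, then label every point with its nearest cluster.
kmeans.fit(coords)
labels = kmeans.predict(coords)

print("cluster centers:\n", np.round(kmeans.cluster_centers_, 3))
print("points per cluster:", np.bincount(labels, minlength=3))
```

The cluster label is just an integer ID. It carries no geographic meaning by itself, but a tree-based model can use it as a coarse location feature.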

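5. Extracting information

First we analyze average speed and how it changes over time. Specifically, we look at how the hour of the day, the day of the week, and the month of the year affect average speed. Note that average speed is a function of distance and time (it is computed from trip_duration, the target), so it adds nothing to the modeling output. We will therefore need to drop it before training the model.

# Average speed in meters per second (distance in km, duration in seconds)
train.loc[:, 'avg_speed_h'] = 1000 * train['distance_haversine'] / train['trip_duration']
train.loc[:, 'avg_speed_m'] = 1000 * train['distance_dummy_manhattan'] / train['trip_duration']

# Date-part columns needed by the group-bys below
train['Hour'] = train['pickup_datetime'].dt.hour
train['dayofweek'] = train['pickup_datetime'].dt.dayofweek
train['Month'] = train['pickup_datetime'].dt.month

fig, ax = plt.subplots(ncols=3, sharey=True)
ax[0].plot(train.groupby('Hour')['avg_speed_h'].mean(), 'bo-', lw=2, alpha=0.7)
ax[1].plot(train.groupby('dayofweek')['avg_speed_h'].mean(), 'go-', lw=2, alpha=0.7)
ax[2].plot(train.groupby('Month')['avg_speed_h'].mean(), 'ro-', lw=2, alpha=0.7)
ax[0].set_xlabel('Hour of Day')
ax[1].set_xlabel('Day of Week')
ax[2].set_xlabel('Month of Year')
ax[0].set_ylabel('Average Speed')
fig.suptitle('Average Traffic Speed by Date-part')
plt.show()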
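The units are easy to lose track of here, so a tiny worked example may help. Distances are in kilometers and durations in seconds, so 1000 * km / s gives meters per second. The four rows and the extra avg_speed_kmh column are invented for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "Hour": [8, 8, 23, 23],
    "distance_haversine": [1.2, 2.4, 3.0, 6.0],  # km
    "trip_duration": [600, 1200, 300, 600],      # seconds
})

# Same formula as the article: km -> m, divided by seconds = m/s.
demo["avg_speed_h"] = 1000 * demo["distance_haversine"] / demo["trip_duration"]

# Multiply by 3.6 to read the same speed in km/h.
demo["avg_speed_kmh"] = demo["avg_speed_h"] * 3.6

# Select the speed columns before aggregating so only they are averaged.
by_hour = demo.groupby("Hour")[["avg_speed_h", "avg_speed_kmh"]].mean()
print(by_hour)
```

In this toy data the rush-hour trips average 2 m/s (7.2 km/h) and the late-night trips 10 m/s (36 km/h), the kind of gap the hour-of-day plot is meant to reveal.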

Next we use the location and average-speed data to plot average speed by location.

# Round coordinates to 3 decimals so nearby pickups share a bin
train.loc[:, 'pickup_lat_bin'] = np.round(train['pickup_latitude'], 3)
train.loc[:, 'pickup_long_bin'] = np.round(train['pickup_longitude'], 3)

# Average speed for regions
gby_cols = ['pickup_lat_bin', 'pickup_long_bin']
coord_speed = train.groupby(gby_cols)[['avg_speed_h']].mean().reset_index()
coord_count = train.groupby(gby_cols)[['id']].count().reset_index()
coord_stats = pd.merge(coord_speed, coord_count, on=gby_cols)
# Keep only bins with more than 100 trips
coord_stats = coord_stats[coord_stats['id'] > 100]

fig, ax = plt.subplots(ncols=1, nrows=1)
ax.scatter(train.pickup_longitude.values[:500000], train.pickup_latitude.values[:500000], color='black', s=1, alpha=0.5)
ax.scatter(coord_stats.pickup_long_bin.values, coord_stats.pickup_lat_bin.values, c=coord_stats.avg_speed_h.values,
           cmap='RdYlGn', s=20, alpha=0.5, vmin=1, vmax=8)
ax.set_xlim(city_long_border)
ax.set_ylim(city_lat_border)
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.title('Average speed')
plt.show()

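Clearly, average speed changes from neighborhood to neighborhood. Broadly, the city center is the busiest (as we would expect, since most of a big city's activity is concentrated downtown), while average speed picks up nicely toward the outskirts.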
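The binning and aggregation behind that plot can be sketched on a handful of rows. Named aggregation computes the mean speed and the trip count in one pass instead of two group-bys and a merge. The five rows, the trips column name, and the min_trips threshold are assumptions for illustration (the article keeps bins with more than 100 trips).

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e"],
    "pickup_latitude": [40.7681, 40.7684, 40.7679, 40.7391, 40.7394],
    "pickup_longitude": [-73.9821, -73.9823, -73.9819, -73.9801, -73.9803],
    "avg_speed_h": [2.0, 4.0, 6.0, 10.0, 12.0],  # m/s
})

# Rounding to 3 decimals snaps nearby points onto a shared grid cell.
demo["pickup_lat_bin"] = np.round(demo["pickup_latitude"], 3)
demo["pickup_long_bin"] = np.round(demo["pickup_longitude"], 3)

# One pass: mean speed and number of trips per grid cell.
coord_stats = (
    demo.groupby(["pickup_lat_bin", "pickup_long_bin"])
        .agg(avg_speed_h=("avg_speed_h", "mean"), trips=("id", "count"))
        .reset_index()
)

# Drop sparsely populated cells, whose averages would be noisy.
min_trips = 3  # hypothetical threshold for this toy data
busy = coord_stats[coord_stats["trips"] >= min_trips]

print(coord_stats)
print(busy)
```

At New York's latitude, a cell of 0.001 degrees is roughly 100 meters on a side, so each bin covers about a city block.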

For the full details, see the original kernel:

https://www.kaggle.com/karelrv/nyct-from-a-to-z-with-xgboost-tutorial/notebook

