Data Collection
When starting a new project, we’ll load all raw data into the data/raw-data/
directory. Define the relative path here.
raw_data = '../data/raw-data'
Data
All of our data came from https://av-info.faa.gov/dd_sublevel.asp?Folder=%5CAID, which provides text files of flight incident records from 1975 to 2022 in five-year increments. We then converted the text files to CSV files using Excel.
Let’s first import pandas and define the path to each dataset in our raw data directory.
import pandas as pd  # used below to read the CSV files

dataset1 = f'{raw_data}/a1975_79.csv'
dataset2 = f'{raw_data}/a1980_84.csv'
dataset3 = f'{raw_data}/a1985_89.csv'
dataset4 = f'{raw_data}/a1990_94.csv'
dataset5 = f'{raw_data}/a1995_99.csv'
dataset6 = f'{raw_data}/a2000_04.csv'
dataset7 = f'{raw_data}/a2005_09.csv'
dataset8 = f'{raw_data}/a2010_14.csv'
dataset9 = f'{raw_data}/a2015_19.csv'
dataset0 = f'{raw_data}/a2020_25.csv'
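The ten path variables follow one naming pattern, so they could also be built in a loop. The following is only a sketch of that alternative, not what this notebook runs: the names `periods`, `dataset_paths`, and `missing` are hypothetical, and it assumes the file names match the ones listed above.

```python
from pathlib import Path

# Relative path to the raw data directory, as defined earlier in this notebook.
raw_data = Path("../data/raw-data")

# File-name suffixes for each five-year extract, in chronological order.
# (Hypothetical helper list; the values are copied from the paths above.)
periods = ["1975_79", "1980_84", "1985_89", "1990_94", "1995_99",
           "2000_04", "2005_09", "2010_14", "2015_19", "2020_25"]

# One path per period, e.g. ../data/raw-data/a1975_79.csv
dataset_paths = [raw_data / f"a{period}.csv" for period in periods]

# Optional sanity check: collect any files that are not where we expect them.
missing = [p.name for p in dataset_paths if not p.exists()]
print(f"{len(dataset_paths)} paths built, {len(missing)} missing")
```

Keeping the paths in a list makes it harder to mistype one file name and lets later steps iterate over the files instead of repeating a line per dataset.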
We then read each dataset into a pandas DataFrame.
df1 = pd.read_csv(dataset1, header = 0)
df2 = pd.read_csv(dataset2, header = 0)
df3 = pd.read_csv(dataset3, header = 0)
df4 = pd.read_csv(dataset4, header = 0)
df5 = pd.read_csv(dataset5, header = 0)
df6 = pd.read_csv(dataset6, header = 0)
df7 = pd.read_csv(dataset7, header = 0)
df8 = pd.read_csv(dataset8, header = 0)
df9 = pd.read_csv(dataset9, header = 0)
df0 = pd.read_csv(dataset0, header = 0)
/tmp/ipykernel_439980/4246310983.py:1: DtypeWarning: Columns (3,43,56,57,60,61,67,68,69,70,71,72,73,74,75,76,77,133) have mixed types. Specify dtype option on import or set low_memory=False.
df1 = pd.read_csv(dataset1, header = 0)
/tmp/ipykernel_439980/4246310983.py:2: DtypeWarning: Columns (2,3,58,74,77,80) have mixed types. Specify dtype option on import or set low_memory=False.
df2 = pd.read_csv(dataset2, header = 0)
/tmp/ipykernel_439980/4246310983.py:3: DtypeWarning: Columns (2,3,74,75,80,81) have mixed types. Specify dtype option on import or set low_memory=False.
df3 = pd.read_csv(dataset3, header = 0)
/tmp/ipykernel_439980/4246310983.py:4: DtypeWarning: Columns (80,81) have mixed types. Specify dtype option on import or set low_memory=False.
df4 = pd.read_csv(dataset4, header = 0)
/tmp/ipykernel_439980/4246310983.py:5: DtypeWarning: Columns (74,80,81,177,178,180,181,182,183,184,185,188) have mixed types. Specify dtype option on import or set low_memory=False.
df5 = pd.read_csv(dataset5, header = 0)
/tmp/ipykernel_439980/4246310983.py:6: DtypeWarning: Columns (2,3,68,74,80) have mixed types. Specify dtype option on import or set low_memory=False.
df6 = pd.read_csv(dataset6, header = 0)
/tmp/ipykernel_439980/4246310983.py:7: DtypeWarning: Columns (2,3,10,43,59,67,69,70,72,76,111) have mixed types. Specify dtype option on import or set low_memory=False.
df7 = pd.read_csv(dataset7, header = 0)
/tmp/ipykernel_439980/4246310983.py:8: DtypeWarning: Columns (5,6,7,8,9,22,29,30,42,58,59,62,63,66,76,81,177,178,180,183) have mixed types. Specify dtype option on import or set low_memory=False.
df8 = pd.read_csv(dataset8, header = 0)
/tmp/ipykernel_439980/4246310983.py:9: DtypeWarning: Columns (5,6,7,8,9,22,29,30,42,58,59,62,63,66,178,182) have mixed types. Specify dtype option on import or set low_memory=False.
df9 = pd.read_csv(dataset9, header = 0)
/tmp/ipykernel_439980/4246310983.py:10: DtypeWarning: Columns (5,6,7,8,9,22,29,30,42,58,59,62,63,66,178,181,182,183) have mixed types. Specify dtype option on import or set low_memory=False.
df0 = pd.read_csv(dataset0, header = 0)
As the warnings show, this data is not clean: many columns mix value types. We’ll clean it in our next step.
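The DtypeWarning messages suggest two options: pass `dtype` on import or set `low_memory=False`. As a minimal sketch of the first option, the example below reads every column as text so that pandas does not have to infer a type for each chunk of the file. The in-memory `sample_csv` is a made-up stand-in for one of the raw files, and `df_sample` is a hypothetical name; the real files have far more columns.

```python
import io

import pandas as pd

# Tiny stand-in for one raw file; the values are illustrative only.
sample_csv = io.StringIO("c1,c2,c3\nA,0.4,1975\nI,O,2022\nI,91,2022\n")

# dtype=str reads every column as text, so a column that mixes numbers and
# letters (like c2 here) is loaded consistently instead of as mixed types.
df_sample = pd.read_csv(sample_csv, header=0, dtype=str)

print(df_sample["c2"].tolist())
```

Reading everything as text postpones type decisions to the cleaning step, where each column can be converted deliberately. Passing `low_memory=False` instead would let pandas infer one type per column from the whole file, at the cost of more memory while parsing.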
First, concatenate all the data into one dataframe so that it is easier to work with.
df = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8, df9, df0], ignore_index=True)
df
| | c5 | c1 | c2 | c3 | c4 | c6 | c7 | c8 | c9 | c10 | ... | 32 | Unnamed: 180 | Unnamed: 181 | Unnamed: 182 | Unnamed: 183 | Unnamed: 184 | Unnamed: 185 | Unnamed: 186 | Unnamed: 187 | Unnamed: 188 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19750101000049A | A | 0.4 | 1975 | 1 | 1 | 19750101 | | | | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 19750101000129A | A | 0.4 | 1975 | 1 | 1 | 19750101 | | | | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 19750101000139A | A | 0.4 | 1975 | 1 | 1 | 19750101 | | | | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 19750101000219A | A | 0.4 | 1975 | 1 | 1 | 19750101 | | | | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 19750101000229A | A | 0.4 | 1975 | 1 | 1 | 19750101 | | | | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 215303 | 20220207001489I | I | 91 | 2022 | 2 | 7 | 20220207 | 1505 | | | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 215304 | 20220206001469I | I | 91 | 2022 | 2 | 6 | 20220206 | 930 | | | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 215305 | 20220206001479I | I | 91 | 2022 | 2 | 6 | 20220206 | 1710 | | | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 215306 | 20220131001459I | I | O | 2022 | 1 | 31 | 20220131 | 832 | | | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 215307 | 20211118022049A | A | 91 | 2021 | 11 | 18 | 20211118 | 1130 | | | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
215308 rows × 190 columns
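Once the frames are stacked, nothing in the combined table records which file a row came from. If that matters for later cleaning, one possible approach is to tag each frame before concatenating. This is a sketch only: the `source_file` column, the `frames` dictionary, and the two tiny frames are hypothetical stand-ins, not part of the original data.

```python
import pandas as pd

# Two tiny stand-in frames with partly different columns, as in the real files.
df_a = pd.DataFrame({"c5": ["19750101000049A"], "c3": [1975]})
df_b = pd.DataFrame({"c5": ["20220207001489I"], "c3": [2022], "extra": ["x"]})

# Map a label for each source file to its frame (labels are assumptions).
frames = {"a1975_79": df_a, "a2020_25": df_b}

# Add the label as a column, then stack the frames with a fresh 0..n-1 index.
combined = pd.concat(
    [frame.assign(source_file=name) for name, frame in frames.items()],
    ignore_index=True,
)

print(combined)
```

Columns that exist in only some frames are filled with NaN for the other rows, which is the same behaviour that produces the many NaN values in the table above.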
Then, save the concatenated dataframe to our working-data/
directory.
df.to_csv(f'../data/Concatenated_Orig_data.csv')
df.shape
(215308, 190)
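By default, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below shows one way to avoid that and to confirm that the saved file round-trips with the same shape. It writes to a temporary directory rather than the project path, and `df_small`, `out_path`, and `reloaded` are hypothetical names used only for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for the concatenated dataframe.
df_small = pd.DataFrame(
    {"c5": ["19750101000049A", "20220207001489I"], "c3": [1975, 2022]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "Concatenated_Orig_data.csv"

    # index=False leaves the row index out of the file.
    df_small.to_csv(out_path, index=False)

    # Read the file back as text to check that nothing was added or lost.
    reloaded = pd.read_csv(out_path, header=0, dtype=str)

print(reloaded.shape == df_small.shape)
```

Comparing shapes after reloading is a cheap check that the saved file has the expected number of rows and columns before the cleaning step starts from it.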