Data Collection

When starting a new project, we’ll load all raw data into the data/raw-data/ directory. Define the relative path here.

import pandas as pd

raw_data = '../data/raw-data'

Data

All of our data came from https://av-info.faa.gov/dd_sublevel.asp?Folder=%5CAID, which provides text files of flight incident records from 1975 to 2022 in five-year increments. We then converted the text files to CSV files using Excel.

Let's first define the path to each dataset in our raw data directory.

dataset1 = f'{raw_data}/a1975_79.csv'
dataset2 = f'{raw_data}/a1980_84.csv'
dataset3 = f'{raw_data}/a1985_89.csv'
dataset4 = f'{raw_data}/a1990_94.csv'
dataset5 = f'{raw_data}/a1995_99.csv'
dataset6 = f'{raw_data}/a2000_04.csv'
dataset7 = f'{raw_data}/a2005_09.csv'
dataset8 = f'{raw_data}/a2010_14.csv'
dataset9 = f'{raw_data}/a2015_19.csv'
dataset0 = f'{raw_data}/a2020_25.csv'
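
The ten assignments above follow one naming pattern, so the paths could also be generated in a loop. This is only a minimal sketch: the names `period_labels` and `dataset_paths` are hypothetical and not part of the original notebook, and the file-name stems are copied from the assignments above.

```python
from pathlib import Path

raw_data = Path('../data/raw-data')

# File-name stems copied from the assignments above (hypothetical list name)
period_labels = [
    '1975_79', '1980_84', '1985_89', '1990_94', '1995_99',
    '2000_04', '2005_09', '2010_14', '2015_19', '2020_25',
]

# One path per five-year file, in chronological order
dataset_paths = [raw_data / f'a{label}.csv' for label in period_labels]
```

A list like this keeps the files in chronological order and can be passed to a single loop instead of ten separate variables.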

We then read each dataset into a pandas dataframe.

df1 = pd.read_csv(dataset1, header = 0)
df2 = pd.read_csv(dataset2, header = 0)
df3 = pd.read_csv(dataset3, header = 0)
df4 = pd.read_csv(dataset4, header = 0)
df5 = pd.read_csv(dataset5, header = 0)
df6 = pd.read_csv(dataset6, header = 0)
df7 = pd.read_csv(dataset7, header = 0)
df8 = pd.read_csv(dataset8, header = 0)
df9 = pd.read_csv(dataset9, header = 0)
df0 = pd.read_csv(dataset0, header = 0)
/tmp/ipykernel_439980/4246310983.py:1: DtypeWarning: Columns (3,43,56,57,60,61,67,68,69,70,71,72,73,74,75,76,77,133) have mixed types. Specify dtype option on import or set low_memory=False.
  df1 = pd.read_csv(dataset1, header = 0)
/tmp/ipykernel_439980/4246310983.py:2: DtypeWarning: Columns (2,3,58,74,77,80) have mixed types. Specify dtype option on import or set low_memory=False.
  df2 = pd.read_csv(dataset2, header = 0)
/tmp/ipykernel_439980/4246310983.py:3: DtypeWarning: Columns (2,3,74,75,80,81) have mixed types. Specify dtype option on import or set low_memory=False.
  df3 = pd.read_csv(dataset3, header = 0)
/tmp/ipykernel_439980/4246310983.py:4: DtypeWarning: Columns (80,81) have mixed types. Specify dtype option on import or set low_memory=False.
  df4 = pd.read_csv(dataset4, header = 0)
/tmp/ipykernel_439980/4246310983.py:5: DtypeWarning: Columns (74,80,81,177,178,180,181,182,183,184,185,188) have mixed types. Specify dtype option on import or set low_memory=False.
  df5 = pd.read_csv(dataset5, header = 0)
/tmp/ipykernel_439980/4246310983.py:6: DtypeWarning: Columns (2,3,68,74,80) have mixed types. Specify dtype option on import or set low_memory=False.
  df6 = pd.read_csv(dataset6, header = 0)
/tmp/ipykernel_439980/4246310983.py:7: DtypeWarning: Columns (2,3,10,43,59,67,69,70,72,76,111) have mixed types. Specify dtype option on import or set low_memory=False.
  df7 = pd.read_csv(dataset7, header = 0)
/tmp/ipykernel_439980/4246310983.py:8: DtypeWarning: Columns (5,6,7,8,9,22,29,30,42,58,59,62,63,66,76,81,177,178,180,183) have mixed types. Specify dtype option on import or set low_memory=False.
  df8 = pd.read_csv(dataset8, header = 0)
/tmp/ipykernel_439980/4246310983.py:9: DtypeWarning: Columns (5,6,7,8,9,22,29,30,42,58,59,62,63,66,178,182) have mixed types. Specify dtype option on import or set low_memory=False.
  df9 = pd.read_csv(dataset9, header = 0)
/tmp/ipykernel_439980/4246310983.py:10: DtypeWarning: Columns (5,6,7,8,9,22,29,30,42,58,59,62,63,66,178,181,182,183) have mixed types. Specify dtype option on import or set low_memory=False.
  df0 = pd.read_csv(dataset0, header = 0)

As the DtypeWarning messages above show, many columns contain mixed types, so this data is not clean. We'll clean it in our next step.
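
The warnings above name two options: set low_memory=False or specify dtype on import. The sketch below tries both on a tiny in-memory CSV rather than the real files. The toy rows and the names `csv_text`, `df_whole`, and `df_text` are assumptions made for illustration. Which option suits the FAA files depends on the cleaning step that follows.

```python
import io

import pandas as pd

# Toy stand-in for one raw file: column c2 mixes a number and a letter
csv_text = "c5,c1,c2\n19750101000049A,A,0.4\n20220131001459I,I,O\n"

# low_memory=False lets pandas infer each column's type from the whole file at once
df_whole = pd.read_csv(io.StringIO(csv_text), header=0, low_memory=False)

# dtype=str skips type inference and keeps every value as text
df_text = pd.read_csv(io.StringIO(csv_text), header=0, dtype=str)

print(df_text['c2'].tolist())
```

Reading everything as text defers all type decisions to the cleaning step, at the cost of converting numeric columns by hand later.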

First, concatenate all the data into one dataframe so it is easier to work with.

df = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8, df9, df0], ignore_index=True)
df
c5 c1 c2 c3 c4 c6 c7 c8 c9 c10 ... 32 Unnamed: 180 Unnamed: 181 Unnamed: 182 Unnamed: 183 Unnamed: 184 Unnamed: 185 Unnamed: 186 Unnamed: 187 Unnamed: 188
0 19750101000049A A 0.4 1975 1 1 19750101 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 19750101000129A A 0.4 1975 1 1 19750101 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 19750101000139A A 0.4 1975 1 1 19750101 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 19750101000219A A 0.4 1975 1 1 19750101 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 19750101000229A A 0.4 1975 1 1 19750101 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
215303 20220207001489I I 91 2022 2 7 20220207 1505 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
215304 20220206001469I I 91 2022 2 6 20220206 930 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
215305 20220206001479I I 91 2022 2 6 20220206 1710 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
215306 20220131001459I I O 2022 1 31 20220131 832 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
215307 20211118022049A A 91 2021 11 18 20211118 1130 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

215308 rows × 190 columns
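
The many NaN values in the trailing columns are what pd.concat produces when the input frames do not share the same columns: the result holds the union of all columns, and rows from a frame that lacks a column are filled with NaN. A small sketch with toy frames (`a`, `b`, and `combined` are hypothetical names, and the values are made up) illustrates the behavior. Whether this is the cause in the FAA files is an assumption that the cleaning step would need to confirm.

```python
import pandas as pd

# Two toy frames; the second has one extra column
a = pd.DataFrame({'c5': ['x1', 'x2'], 'c1': ['A', 'A']})
b = pd.DataFrame({'c5': ['y1'], 'c1': ['I'], 'Unnamed: 2': ['extra']})

# ignore_index=True discards the per-frame indexes and renumbers rows 0..n-1
combined = pd.concat([a, b], ignore_index=True)

# Fraction of missing values per column, useful for spotting mostly empty columns
missing_share = combined.isna().mean()
print(missing_share)
```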

Then, save the concatenated dataframe as Concatenated_Orig_data.csv in the data/ directory.

df.to_csv('../data/Concatenated_Orig_data.csv')
df.shape
(215308, 190)
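
One detail to keep in mind when reloading this file: by default to_csv also writes the row index, which comes back as an extra "Unnamed: 0" column on the next read_csv. The sketch below shows the difference on a toy frame written to a temporary directory. The frame and file names are hypothetical, and passing index=False is an alternative to consider, not what the notebook above does.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({'c5': ['19750101000049A', '20220207001489I'], 'c1': ['A', 'I']})

with tempfile.TemporaryDirectory() as tmp:
    with_index = Path(tmp) / 'with_index.csv'
    without_index = Path(tmp) / 'without_index.csv'

    toy.to_csv(with_index)                   # default: row index written as the first column
    toy.to_csv(without_index, index=False)   # index omitted

    reread_with = pd.read_csv(with_index)
    reread_without = pd.read_csv(without_index)

print(reread_with.shape, reread_without.shape)
```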