Feature Extraction

At this point our workflow splits as we prepare for different modeling techniques. This notebook preprocesses the data for neural-net training and inference in 03a_Training_Model.ipynb.

First, let’s import our cleaned, concatenated data from 01_Cleaning_Data.ipynb

import os

import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("../data/Concatenated_Clean_data.csv")
/tmp/ipykernel_440158/2483919307.py:1: DtypeWarning: Columns (5,6,11,15,16,17,18,19,27,29,30,33,34,35,36,43,45,48,49,50,51,52,58,59,60,61,62,63,69,70,71,72,73,74,75,76,77,78,79,82,83,117,118,122,123,127,131,135,136,142,155,158,167,168,172,173,175,176) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv("../data/Concatenated_Clean_data.csv")
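As the warning suggests, mixed-type columns can be avoided at read time by specifying a dtype (or passing low_memory=False so pandas infers types over the whole file at once). A minimal sketch on a made-up two-column CSV:

```python
import io
import pandas as pd

# Toy CSV (made up for illustration) with a column that mixes
# numeric-looking and free-text values -- the situation that triggers
# pandas' chunked type inference and the DtypeWarning above.
csv = io.StringIO("c78,c119\nII,1234\nME,engine text\n")

# Reading everything as str sidesteps mixed-type inference entirely;
# low_memory=False would instead infer each column from the whole file.
df = pd.read_csv(csv, dtype=str)
print(df.dtypes)
```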

Feature Extraction

In this study, eight maintenance codes were identified as relevant; we’ll extract those rows now

maintenance_codes = ['AF', 'DE', 'AI', 'AP', 'AU', 'EQ', 'II', 'ME']
df = df[df['c78'].isin(maintenance_codes)]
df['c78'].value_counts()
II    1951
ME     377
AU     246
AF      92
DE      57
EQ      24
AI      15
AP       1
Name: c78, dtype: int64
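The isin filter keeps only rows whose c78 code appears in the whitelist; a minimal sketch with made-up codes:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the FAA data.
df = pd.DataFrame({"c78": ["II", "XX", "ME", "ZZ", "II"]})
maintenance_codes = ["AF", "DE", "AI", "AP", "AU", "EQ", "II", "ME"]

# Boolean mask keeps only rows whose code is in the whitelist;
# "XX" and "ZZ" are dropped.
df = df[df["c78"].isin(maintenance_codes)]
print(df["c78"].value_counts().to_dict())  # {'II': 2, 'ME': 1}
```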

Next, we identify and select relevant data and label columns

text_columns = ['c119','c77','c79','c81', 'c85', 'c87', 'c89', 'c91', 'c93', 'c95', 'c97', 'c99', 'c101', 'c103', 'c105', 'c107', 'c109', 'c131', 'c133', 'c135', 'c137', 'c146', 'c148', 'c150', 'c154','c161', 'c163', 'c183', 'c191']
label_columns = ['c78', 'c80', 'c86', 'c5']

columns_to_keep = text_columns + label_columns
df.drop(columns=[col for col in df if col not in columns_to_keep], inplace=True)
df.shape
(2763, 33)
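As an aside, indexing with the list of kept columns is an equivalent, arguably cleaner alternative to dropping everything else in place; a sketch with hypothetical column names:

```python
import pandas as pd

# Made-up frame: c999 stands in for the ~150 columns we don't need.
df = pd.DataFrame({"c78": ["II"], "c119": ["text"], "c999": ["noise"]})
columns_to_keep = ["c119", "c78"]

# Indexing with a list of names returns just those columns,
# in the order given, without mutating the frame in place.
df = df[columns_to_keep]
print(df.columns.tolist())  # ['c119', 'c78']
```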

This is our maintenance text CSV; we’ll save it now

df.to_csv("../data/cleaned-data/Maintenance_Text_data.csv", index=False)

For our NLP classification we only need two columns: c119, the free-text description of an issue, and c78, the label that classifies it.

We’ll extract those now

data = pd.DataFrame()
data['text'] = df['c119']
data['label'] = df['c78']
data
text label
535 TAILWHEEL COCKED RIGHT PRIOR TO TKOF. ... AU
864 TOW PLANE BECAME AIRBORNE THEN SETTLED.STUDENT... ME
2195 2ND ILS APCH,ACFT'S G/S INOP.LOM TUNED TO WRON... AU
2476 PLT NOTED SOFT R BRAKE PEDAL DRG TAXI TO TKOF.... AU
2916 TAXI OFF HARD SFC DUE TFC R MAIN GR BROKE THRO... AF
... ... ...
113835 (-23) A/C RELOCATED TO NEW HANGAR TO CHECK SIZ... II
113838 (-23) ON 2/23/08 @ APPROXIMATELY 2130 DURING T... AF
113840 (-23) PILOT TOOK OFF FOR LEESBURG AIRPORT AND ... II
113869 (-23) OWNER FORGOT TO FASTEN THE LOWER LEFT 4 ... II
113902 (-23) THE AIRCRAFT EXPERIENCED SEVERE TURBULAN... ME

2763 rows × 2 columns

Cleaning Dataframe

Even after the earlier cleaning, let’s make sure the dataframe is free of missing values

data.isna().sum()
text     15
label     0
dtype: int64

Remove rows with missing text

data = data.dropna(subset=['text'])

Check there are no missing values left

data.isna().sum()
text     0
label    0
dtype: int64
data.head(10)
text label
535 TAILWHEEL COCKED RIGHT PRIOR TO TKOF. ... AU
864 TOW PLANE BECAME AIRBORNE THEN SETTLED.STUDENT... ME
2195 2ND ILS APCH,ACFT'S G/S INOP.LOM TUNED TO WRON... AU
2476 PLT NOTED SOFT R BRAKE PEDAL DRG TAXI TO TKOF.... AU
2916 TAXI OFF HARD SFC DUE TFC R MAIN GR BROKE THRO... AF
3151 ACFT BEING TAXIED ON GRASS TAXIWAY NOSE WHEEL ... AF
3332 DEP FOR DEST WITH KNOWN ELEC PROB. DIDNT USE E... AU
3943 MTNS OBSCURED.FLT TO CK VOR REC REPTD INOP PRI... AU
4176 SUFFICIENT OPPORTUNITY EXISTED TO RELEASE WHEN... ME
4442 MAINT NOT PERFORMED DUE PARTS NOT AVAILABLE. T... AU

Remove labels that occur only once

counts = data['label'].value_counts()
data = data[data['label'].isin(counts[counts > 1].index)]
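This drops any class with a single example (here, AP), since stratified splitting requires at least two members per class. A toy sketch of the same filter:

```python
import pandas as pd

# Toy labels: "AP" occurs once and should be dropped, mirroring the
# single-occurrence class in the FAA counts above.
data = pd.DataFrame({"text": list("abcde"),
                     "label": ["II", "II", "ME", "ME", "AP"]})

# Keep only labels whose total count exceeds one.
counts = data["label"].value_counts()
data = data[data["label"].isin(counts[counts > 1].index)]
print(sorted(data["label"].unique()))  # ['II', 'ME']
```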

Splitting Data

X, y = data['text'], data['label']

We’ll split the data into training (64%), validation (16%), and testing (20%) sets: StratifiedShuffleSplit holds out 20% of the rows for testing, and train_test_split then takes 20% of the remainder for validation

ss = StratifiedShuffleSplit(n_splits=10, test_size=0.20, random_state=0)
ss.get_n_splits(X, y)
10
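To see what StratifiedShuffleSplit buys us, here is a toy sketch (made-up labels) showing that every fold’s test set preserves the 3:1 class ratio of the full data:

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up labels with an exact 3:1 class ratio.
y = pd.Series(["II"] * 30 + ["ME"] * 10)
X = pd.Series(range(40))

# Each fold's 8-row test set keeps the 3:1 ratio: 6 II, 2 ME.
ss = StratifiedShuffleSplit(n_splits=3, test_size=0.20, random_state=0)
for _, test_idx in ss.split(X, y):
    print(y.iloc[test_idx].value_counts().to_dict())
```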
output_dir = '../data/splits'

Create Output Dirs

for subdir in ['', '/train', '/test', '/val', '/actual']:
    os.makedirs(output_dir + subdir, exist_ok=True)

for i, (train_index, test_index) in enumerate(ss.split(X, y)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.20, random_state=0)

    # Fit the encoder on the training labels only, then reuse that
    # mapping for test and validation so the integer codes agree.
    encoder = LabelEncoder()
    y_train_encode = encoder.fit_transform(y_train)
    y_test_encode = encoder.transform(y_test)
    y_val_encode = encoder.transform(y_val)

    final_train = pd.DataFrame({'text': X_train, 'label': y_train_encode})
    final_test = pd.DataFrame({'text': X_test, 'label': y_test_encode})
    final_val = pd.DataFrame({'text': X_val, 'label': y_val_encode})

    final_train.to_csv(f'{output_dir}/train/FAA-{i}.csv', index=False)
    final_test.to_csv(f'{output_dir}/test/FAA-{i}.csv', index=False)
    final_val.to_csv(f'{output_dir}/val/FAA-{i}.csv', index=False)
    y_val.to_csv(f'{output_dir}/actual/FAA-{i}.csv', index=False)
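A caveat on label encoding: fitting a fresh LabelEncoder on each subset can silently assign different integers when a subset is missing a class, so the mapping should be fit once on the training labels and reused via transform. A toy sketch with made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["AF", "II", "ME", "II"]
test_labels = ["II", "ME"]  # "AF" happens to be absent here

enc = LabelEncoder()
enc.fit(train_labels)

# transform() reuses the training mapping: AF->0, II->1, ME->2.
print(enc.transform(test_labels).tolist())  # [1, 2]

# Refitting on the test labels alone remaps II->0, ME->1,
# silently disagreeing with the training encoding.
print(LabelEncoder().fit_transform(test_labels).tolist())  # [0, 1]
```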