Feature Extraction

At this point our workflow splits as we prepare for different modeling techniques. This notebook preprocesses the data for neural-net training and inference in 03a_Training_Model.ipynb.

First, let’s import our cleaned, concatenated data from 01_Cleaning_Data.ipynb

import os

import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("../data/Concatenated_Clean_data.csv")
/tmp/ipykernel_440158/2483919307.py:1: DtypeWarning: Columns (5,6,11,15,16,17,18,19,27,29,30,33,34,35,36,43,45,48,49,50,51,52,58,59,60,61,62,63,69,70,71,72,73,74,75,76,77,78,79,82,83,117,118,122,123,127,131,135,136,142,155,158,167,168,172,173,175,176) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv("../data/Concatenated_Clean_data.csv")
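As the warning suggests, mixed-type columns can be avoided at read time by specifying a dtype (or passing low_memory=False so pandas infers types over the whole file at once). A minimal sketch on a made-up two-column CSV:

```python
import io
import pandas as pd

# Toy CSV (made up for illustration) with a column that mixes
# numeric-looking and free-text values -- the situation that triggers
# pandas' chunked type inference and the DtypeWarning above.
csv = io.StringIO("c78,c119\nII,1234\nME,engine text\n")

# Reading everything as str sidesteps mixed-type inference entirely;
# low_memory=False would instead infer each column from the whole file.
df = pd.read_csv(csv, dtype=str)
print(df.dtypes)
```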

Feature Extraction

In this study, eight maintenance codes were identified as relevant; we’ll extract those rows now

maintenance_codes = ['AF', 'DE', 'AI', 'AP', 'AU', 'EQ', 'II', 'ME']
df = df[df['c78'].isin(maintenance_codes)]
df['c78'].value_counts()
II    1951
ME     377
AU     246
AF      92
DE      57
EQ      24
AI      15
AP       1
Name: c78, dtype: int64
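The isin filter keeps only rows whose c78 code appears in the whitelist; a minimal sketch with made-up codes:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the FAA data.
df = pd.DataFrame({"c78": ["II", "XX", "ME", "ZZ", "II"]})
maintenance_codes = ["AF", "DE", "AI", "AP", "AU", "EQ", "II", "ME"]

# Boolean mask keeps only rows whose code is in the whitelist;
# "XX" and "ZZ" are dropped.
df = df[df["c78"].isin(maintenance_codes)]
print(df["c78"].value_counts().to_dict())  # {'II': 2, 'ME': 1}
```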

Next, we identify and select relevant data and label columns

text_columns = ['c119','c77','c79','c81', 'c85', 'c87', 'c89', 'c91', 'c93', 'c95', 'c97', 'c99', 'c101', 'c103', 'c105', 'c107', 'c109', 'c131', 'c133', 'c135', 'c137', 'c146', 'c148', 'c150', 'c154','c161', 'c163', 'c183', 'c191']
label_columns = ['c78', 'c80', 'c86', 'c5']

columns_to_keep = text_columns + label_columns
df.drop(columns=[col for col in df if col not in columns_to_keep], inplace=True)
df.shape
(2763, 33)
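As an aside, indexing with the list of kept columns is an equivalent, arguably cleaner alternative to dropping everything else in place; a sketch with hypothetical column names:

```python
import pandas as pd

# Made-up frame: c999 stands in for the ~150 columns we don't need.
df = pd.DataFrame({"c78": ["II"], "c119": ["text"], "c999": ["noise"]})
columns_to_keep = ["c119", "c78"]

# Indexing with a list of names returns just those columns,
# in the order given, without mutating the frame in place.
df = df[columns_to_keep]
print(df.columns.tolist())  # ['c119', 'c78']
```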

This is our maintenance text CSV; we’ll save it now

df.to_csv("../data/cleaned-data/Maintenance_Text_data.csv", index=False)

For our NLP classification we only need two columns: c119, the free-text description of an issue, and c78, the label that classifies it.

We’ll extract those now

data = pd.DataFrame()
data['text'] = df['c119']
data['label'] = df['c78']
data
text label
535 TAILWHEEL COCKED RIGHT PRIOR TO TKOF. ... AU
864 TOW PLANE BECAME AIRBORNE THEN SETTLED.STUDENT... ME
2195 2ND ILS APCH,ACFT'S G/S INOP.LOM TUNED TO WRON... AU
2476 PLT NOTED SOFT R BRAKE PEDAL DRG TAXI TO TKOF.... AU
2916 TAXI OFF HARD SFC DUE TFC R MAIN GR BROKE THRO... AF
... ... ...
113835 (-23) A/C RELOCATED TO NEW HANGAR TO CHECK SIZ... II
113838 (-23) ON 2/23/08 @ APPROXIMATELY 2130 DURING T... AF
113840 (-23) PILOT TOOK OFF FOR LEESBURG AIRPORT AND ... II
113869 (-23) OWNER FORGOT TO FASTEN THE LOWER LEFT 4 ... II
113902 (-23) THE AIRCRAFT EXPERIENCED SEVERE TURBULAN... ME

2763 rows × 2 columns

Cleaning Dataframe

Even after the earlier cleaning, let’s make sure the dataframe is free of missing values

data.isna().sum()
text     15
label     0
dtype: int64

Remove rows with missing text

data = data.dropna(subset=['text'])

Check there are no missing values left

data.isna().sum()
text     0
label    0
dtype: int64
data.head(10)
text label
535 TAILWHEEL COCKED RIGHT PRIOR TO TKOF. ... AU
864 TOW PLANE BECAME AIRBORNE THEN SETTLED.STUDENT... ME
2195 2ND ILS APCH,ACFT'S G/S INOP.LOM TUNED TO WRON... AU
2476 PLT NOTED SOFT R BRAKE PEDAL DRG TAXI TO TKOF.... AU
2916 TAXI OFF HARD SFC DUE TFC R MAIN GR BROKE THRO... AF
3151 ACFT BEING TAXIED ON GRASS TAXIWAY NOSE WHEEL ... AF
3332 DEP FOR DEST WITH KNOWN ELEC PROB. DIDNT USE E... AU
3943 MTNS OBSCURED.FLT TO CK VOR REC REPTD INOP PRI... AU
4176 SUFFICIENT OPPORTUNITY EXISTED TO RELEASE WHEN... ME
4442 MAINT NOT PERFORMED DUE PARTS NOT AVAILABLE. T... AU

Remove labels that occur only once

counts = data['label'].value_counts()
data = data[data['label'].isin(counts[counts > 1].index)]
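This drops any class with a single example (here, AP), since stratified splitting requires at least two members per class. A toy sketch of the same filter:

```python
import pandas as pd

# Toy labels: "AP" occurs once and should be dropped, mirroring the
# single-occurrence class in the FAA counts above.
data = pd.DataFrame({"text": list("abcde"),
                     "label": ["II", "II", "ME", "ME", "AP"]})

# Keep only labels whose total count exceeds one.
counts = data["label"].value_counts()
data = data[data["label"].isin(counts[counts > 1].index)]
print(sorted(data["label"].unique()))  # ['II', 'ME']
```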

Splitting Data

X, y = data['text'], data['label']

We’ll split the data into training (64%), validation (16%), and testing (20%) sets: StratifiedShuffleSplit holds out 20% of the rows for testing, and train_test_split then takes 20% of the remainder for validation

ss = StratifiedShuffleSplit(n_splits=10, test_size=0.20, random_state=0)
ss.get_n_splits(X, y)
10
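To see what StratifiedShuffleSplit buys us, here is a toy sketch (made-up labels) showing that every fold’s test set preserves the 3:1 class ratio of the full data:

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up labels with an exact 3:1 class ratio.
y = pd.Series(["II"] * 30 + ["ME"] * 10)
X = pd.Series(range(40))

# Each fold's 8-row test set keeps the 3:1 ratio: 6 II, 2 ME.
ss = StratifiedShuffleSplit(n_splits=3, test_size=0.20, random_state=0)
for _, test_idx in ss.split(X, y):
    print(y.iloc[test_idx].value_counts().to_dict())
```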
output_dir = '../data/splits'

Create Output Dirs

for subdir in ['', '/train', '/test', '/val', '/actual']:
    os.makedirs(output_dir + subdir, exist_ok=True)

for i, (train_index, test_index) in enumerate(ss.split(X, y)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.20, random_state=0)

    # Fit the encoder on the training labels only, then reuse that
    # mapping for test and validation so the integer codes agree.
    encoder = LabelEncoder()
    y_train_encode = encoder.fit_transform(y_train)
    y_test_encode = encoder.transform(y_test)
    y_val_encode = encoder.transform(y_val)

    final_train = pd.DataFrame({'text': X_train, 'label': y_train_encode})
    final_test = pd.DataFrame({'text': X_test, 'label': y_test_encode})
    final_val = pd.DataFrame({'text': X_val, 'label': y_val_encode})

    final_train.to_csv(f'{output_dir}/train/FAA-{i}.csv', index=False)
    final_test.to_csv(f'{output_dir}/test/FAA-{i}.csv', index=False)
    final_val.to_csv(f'{output_dir}/val/FAA-{i}.csv', index=False)
    y_val.to_csv(f'{output_dir}/actual/FAA-{i}.csv', index=False)
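A caveat on label encoding: fitting a fresh LabelEncoder on each subset can silently assign different integers when a subset is missing a class, so the mapping should be fit once on the training labels and reused via transform. A toy sketch with made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["AF", "II", "ME", "II"]
test_labels = ["II", "ME"]  # "AF" happens to be absent here

enc = LabelEncoder()
enc.fit(train_labels)

# transform() reuses the training mapping: AF->0, II->1, ME->2.
print(enc.transform(test_labels).tolist())  # [1, 2]

# Refitting on the test labels alone remaps II->0, ME->1,
# silently disagreeing with the training encoding.
print(LabelEncoder().fit_transform(test_labels).tolist())  # [0, 1]
```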