Cleaning Data

We will be using the Concatenated_Orig_data.csv, that was created using the 00_Collecting_Data.ipynb

df = pd.read_csv(f"../data/Concatenated_Orig_data.csv",encoding='latin-1', header=0)
/tmp/ipykernel_440135/286995959.py:1: DtypeWarning: Columns (3,4,6,7,8,9,10,11,23,30,31,43,44,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,81,82,104,106,111,112,113,134,178,179,182,183,184,185,186,187,190) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(f"../data/Concatenated_Orig_data.csv",encoding='latin-1', header=0)

Preprocessing

First, lets examine our data:

df.head()
Unnamed: 0 c5 c1 c2 c3 c4 c6 c7 c8 c9 ... 32 Unnamed: 180 Unnamed: 181 Unnamed: 182 Unnamed: 183 Unnamed: 184 Unnamed: 185 Unnamed: 186 Unnamed: 187 Unnamed: 188
0 0 19750101000049A A 0.4 1975 1 1 19750101 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1 19750101000129A A 0.4 1975 1 1 19750101 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2 19750101000139A A 0.4 1975 1 1 19750101 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 3 19750101000219A A 0.4 1975 1 1 19750101 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 4 19750101000229A A 0.4 1975 1 1 19750101 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 191 columns

Let’s clean the data set by dropping duplicate entries by the identification number and converting blank cells to NaN

df = df.drop_duplicates(subset=['c5'])
df.replace(r'^\s*$', np.nan, regex=True, inplace=True)

Selecting Opcodes

Now, let’s get a better sense of the target column c78. Specifcally, we’d like to know how many instances there are of each label, and limit ourselves to only the most frequent labels to reduce numerosity.

Let’s limit the data to those entries with the value of c78 appearing in the AIDCODE.csv

df2 = pd.read_csv("../data/raw-data/AIDCODE.csv")
list_code = df2['CODE'].unique()
list_code = list_code.tolist()
#drop rows that contain any value in the list
df = df[df.c78.isin(list_code) == True]
df.dropna(subset=['c78'],inplace=True)
df['c78'].value_counts()
GC    14100
LO     9009
HO     8832
GN     6085
AS     5800
      ...  
FC        3
IF        1
FE        1
AP        1
CS        1
Name: c78, Length: 115, dtype: int64
df.shape[0]
113942

Saving

df.to_csv("../data/Concatenated_Clean_data.csv")