Cleaning Data

We will be using the Concatenated_Orig_data.csv, that was created using the 00_Collecting_Data.ipynb

df = pd.read_csv(f"../data/Concatenated_Orig_data.csv",encoding='latin-1', header=0)

/tmp/ipykernel_440135/286995959.py:1: DtypeWarning: Columns (3,4,6,7,8,9,10,11,23,30,31,43,44,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,81,82,104,106,111,112,113,134,178,179,182,183,184,185,186,187,190) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv(f"../data/Concatenated_Orig_data.csv",encoding='latin-1', header=0)

Preprocessing

First, lets examine our data:

df.head()

	Unnamed: 0	c5	c1	c3	c6	c7	c8	c9	...	32	Unnamed: 180	Unnamed: 181	Unnamed: 182	Unnamed: 183	Unnamed: 184	Unnamed: 185	Unnamed: 186	Unnamed: 187	Unnamed: 188
0	0	19750101000049A	A	0.4	1975	1	1	19750101	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	1	19750101000129A	A	0.4	1975	1	1	19750101	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	2	19750101000139A	A	0.4	1975	1	1	19750101	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	3	19750101000219A	A	0.4	1975	1	1	19750101	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	4	19750101000229A	A	0.4	1975	1	1	19750101	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 191 columns

Let’s clean the data set by dropping duplicate entries by the identification number and converting blank cells to NaN

df = df.drop_duplicates(subset=['c5'])
df.replace(r'^\s*$', np.nan, regex=True, inplace=True)

Selecting Opcodes

Now, let’s get a better sense of the target column c78. Specifcally, we’d like to know how many instances there are of each label, and limit ourselves to only the most frequent labels to reduce numerosity.

Let’s limit the data to those entries with the value of c78 appearing in the AIDCODE.csv

df2 = pd.read_csv("../data/raw-data/AIDCODE.csv")
list_code = df2['CODE'].unique()
list_code = list_code.tolist()

#drop rows that contain any value in the list
df = df[df.c78.isin(list_code) == True]

df.dropna(subset=['c78'],inplace=True)

df['c78'].value_counts()

GC    14100
LO     9009
HO     8832
GN     6085
AS     5800
      ...  
FC        3
IF        1
FE        1
AP        1
CS        1
Name: c78, Length: 115, dtype: int64

df.shape[0]

Saving

df.to_csv("../data/Concatenated_Clean_data.csv")