Learning Goals of this Project:
- Learning basic Pandas Dataframe manipulations
- Learning more about Machine Learning (ML) Classification models and how they are used in a Cybersecurity context.
- Learning about basic data pipelines and transformations.
- Learning how to write and use unit tests when developing Python code.
Important Highlights
- You can do this project on your host, you do not need to use the VM.
- Please see the Setup page for videos and instructions about project setup.
- Keep the VM around for the final project (Summer 24), Web Security.
- Please watch the provided videos below to see how to set up your environment; we can't provide broad support here.
- There are only 25 submissions allowed! This is because Gradescope is a limited resource. It's improper to test your code against Gradescope.
- We have provided a local testing suite, be sure to pass that completely before you submit to Gradescope.
Important Reference Materials:
Project Overview Video
This is a 16-minute video by the project creator that covers project concepts: https://www.youtube.com/embed/kYoQiAamIpQ?si=HRUz0RA4-8IuDhdn
There are other videos on the Setup page that cover installation and other subjects.
BACKGROUND
Many of the projects in CS6035 are focused on offensive security tasks. These are related to Red Team activities/tasks that many of us may associate with cybersecurity. This project will be focused on defensive security tasks, which are usually considered Blue Team activities that are done by many corporate teams.
Historically, many defensive security professionals have investigated malicious activity, files, and code. They investigate these to create patterns (often called signatures) that can be used to detect (and prevent) malicious activity, files, and code when that pattern is used again. What this means is that these simple methods were effective only against known threats.
This approach was relatively effective in preventing known malware from infecting systems, but it did nothing to protect against novel attacks. As attackers became more sophisticated, they learned to tweak or simply encode their malicious activity, files, or code to avoid detection from these simple pattern matching detections.
With this background information, it would be nice if a more general solution could give a score to activity, files, and code that pass through corporate systems every day. This solution would inform the security team that while a certain pattern may not exactly fit a signature of known malicious activity, files, or code it appears to be very similar to examples that were seen in the past that were malicious.
Luckily machine learning models can do exactly that if provided with proper training data! Thus, it is no surprise that one of the most powerful tools in the hands of defensive cybersecurity professionals is Machine Learning. Modern detection systems usually use a combination of machine learning models and pattern matching (regular expressions) to detect and prevent malicious activity on networks and devices.
This project will focus on teaching the fundamentals of data analysis and building/testing your own machine learning models in Python. You'll be using the open source libraries pandas and scikit-learn.
Cybersecurity Machine Learning Careers and Trends
- Machine learning in cybersecurity is a growing field. The area was considered among top trends by McKinsey in 2022.
- The CompTIA State of Cybersecurity 2024 report says that last year there were 660,000 unfilled cybersecurity positions. In the section titled Product: AI Drives the Cybersecurity Product Set to New Heights, it also notes that 56% of respondents use AI and machine learning for cybersecurity.
Additional Information
- ML in Cybersecurity – CrowdStrike
- AI for Cybersecurity – IBM
- Future of Cybersecurity and AI – Deloitte
Task 1 (15 points)
For the first task, let's get familiar with some pandas basics. pandas is a Python library that deals with DataFrames, which you can think of as a Python class that handles tabular data. In the real world, you would create graphics and other visuals to better understand the dataset you are working with. You would also use plotting tools like Power BI, Tableau, Data Studio, and Matplotlib. This step is generally known as Exploratory Data Analysis. Since we are using an autograder for this class, we will skip the plotting for this project.
For this task, we have released a local test suite. If you are struggling to understand the expected input and outputs for a function, please set up the test suite and use it to debug your function. Please note that the return lines for the provided skeleton functions are placeholders for the data types that the tests are expecting.
It's critical you pass all tests locally before you submit to Gradescope for credit. Do not use Gradescope for debugging.
Theory
In this Task, we're not yet getting into theory. It's more nuts and bolts: you will learn the basics of pandas. pandas DataFrames are something of a glorified list of lists, mixed in with a dictionary. You get a table of values with rows and columns, and you can modify the column names and index values for the rows. There are numerous functions built into pandas to let you manipulate the data in the DataFrame.
To be clear, pandas is not part of Python, so when you look up docs, you'll specifically want the pandas docs. Note that we linked to the API docs here; this is the core of the docs you'll be looking at.
You can always get started trying to solve a problem by looking at Stack Overflow posts in Google search results. There you'll find ideas about how to use the pandas library. In the end, however, you should find yourself in the habit of looking directly at the docs for whichever library you are using, pandas in this case.
For those who might need a concrete example to get started, here's how you would take a pandas DataFrame column and return the average of its values:
import pandas as pd
# create a dataframe from a Python dict
df = pd.DataFrame({"color":["yellow", "green", "purple", "red"], "weight":[124,4.56,384,-2]})
df # shows the dataframe
Note that the column names are ["color", "weight"] while the index is [0, 1, 2, 3, …], where the brackets […] denote a list.
Now that we have created a DataFrame, we can find the average weight by summing the values under "weight" and dividing the sum by the number of values:
average = df['weight'].sum() / len(df['weight'])
average # if you put a variable as the last line, the variable is printed
127.63999999999999
Note: In the example above, we're not paying attention to rounding. You will need to round your answers to the precision asked for in each Task.
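As a side note on the same example, pandas also offers a built-in `Series.mean` method, and Python's `round` can trim the floating point noise. The sketch below reuses the example DataFrame; the two-decimal precision is an arbitrary choice for illustration, not a requirement of any task.

```python
import pandas as pd

df = pd.DataFrame({"color": ["yellow", "green", "purple", "red"],
                   "weight": [124, 4.56, 384, -2]})

# Series.mean() computes sum / count in a single call
average = df["weight"].mean()

# round() trims floating point noise to the chosen precision (2 is illustrative)
rounded_average = round(average, 2)
print(rounded_average)  # 127.64
```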
Also note that we are using slightly older versions of pandas, Python, and other libraries, so be sure to look at the docs for the appropriate library version. Often there's a drop-down at the top of docs sites to select the older version.
Refer to the page for details about submitting your work.
Useful Links:
Deliverables:
- Complete the functions in task1.py
- For this task we have released a local test suite. Please set it up and use it to debug your functions.
- Submit task1.py to Gradescope
Instructions:
The task1.py file has function skeletons that you will complete with Python code, mostly using the pandas library. The goal of each of these functions is to give you familiarity with the pandas library and some general Python concepts like classes, which you may not have seen before. See information about each function's inputs, outputs, and skeletons below.
Example of Solving Task1 subtasks:
Subtask Description
In this function you will take a dataset, a random state, and a number n. You will return n sampled rows from the dataset using the random state to ensure reproducibility.
Useful Resource
Inputs
- dataset – a pandas DataFrame.
- n – integer, number of rows to return.
- random_state – integer, seed value to get repeatable results.
Outputs
- A pandas DataFrame containing n randomly selected rows.
Function Skeleton
def get_random_sample(dataset: pd.DataFrame, n: int, random_state: int) -> pd.DataFrame:
    return pd.DataFrame()
Expected Result
def get_random_sample(dataset: pd.DataFrame, n: int, random_state: int) -> pd.DataFrame:
    return dataset.sample(n=n,random_state=random_state)
Explanation
Reviewing the link in the Useful Resource, we can see that pandas has a DataFrame.sample method that samples from a pd.DataFrame and returns a pd.DataFrame. The subtask description says to use n and random_state to return the expected pd.DataFrame, and the method uses the same names for those parameters. So we can call DataFrame.sample on the dataset variable (which is a pd.DataFrame) and pass along the input parameters from our skeleton to return the expected result. This result can then be tested in the notebook by comparing it with an expected result for the sample data, and in the unit tests, which pass sample inputs to the function and compare its outputs to the expected outputs.
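To make the testing idea concrete, here is a minimal sketch of the kind of check a unit test might perform on this function. The toy DataFrame and its column names are hypothetical and are not the project's test data; `pd.testing.assert_frame_equal` is a real pandas helper that raises an error when two DataFrames differ.

```python
import pandas as pd

def get_random_sample(dataset: pd.DataFrame, n: int, random_state: int) -> pd.DataFrame:
    return dataset.sample(n=n, random_state=random_state)

# Hypothetical toy data, not the project dataset
toy = pd.DataFrame({"port": [22, 80, 443, 8080, 3389],
                    "bytes": [100, 2500, 1800, 40, 975]})

first = get_random_sample(toy, n=3, random_state=42)
second = get_random_sample(toy, n=3, random_state=42)

# Same seed -> same rows in the same order; raises AssertionError otherwise
pd.testing.assert_frame_equal(first, second)
print(len(first))  # 3
```

Because the seed is fixed, the comparison is repeatable, which is what lets an autograder check sampled output at all.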
Local Test Dataset Information
For this task, the local test dataset is a very simple example dataset, which contains 10 rows and 6 columns of data related to network traffic.
In this task we will guide you through splitting datasets into train and test sets as well as preprocessing datasets using scikit-learn encoders, scalers and dimensionality reduction techniques.
find_data_type
In this function you will take a dataset and the name of a column in it. You will return the column's data type.
Useful Resources
INPUTS
- dataset – a pandas DataFrame that contains some data
- column_name – a Python string (str)
OUTPUTS
np.dtype – data type of the column
Function Skeleton
def find_data_type(dataset:pd.DataFrame,column_name:str) -> np.dtype:
    return np.dtype()
set_index_col
In this function you will take a dataset and a series and set the index of the dataset to be the series
Useful Resources
INPUTS
- dataset – a pandas DataFrame that contains some data
- index – a pandas Series that contains an index for the dataset
OUTPUTS
a pandas DataFrame indexed by the given index series
Function Skeleton
def set_index_col(dataset:pd.DataFrame,index:pd.Series) -> pd.DataFrame:
    return pd.DataFrame()
reset_index_col
In this function you will take a dataset with an index already set and reindex the dataset from 0 to n-1, where n is the number of rows in the dataset, dropping the old index
Useful Resources
INPUTS
- dataset – a pandas DataFrame that contains some data
OUTPUTS
a pandas DataFrame indexed from 0 to n-1
Function Skeleton
def reset_index_col(dataset:pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame()
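For readers new to pandas indexes, the sketch below illustrates the two index operations described in the last two subtasks on a hypothetical three-row DataFrame. It is an illustration of `DataFrame.set_index` and `DataFrame.reset_index`, not a statement of how the graded functions must be written.

```python
import pandas as pd

toy = pd.DataFrame({"port": [22, 80, 443]})
labels = pd.Series(["a", "b", "c"])  # hypothetical index labels

# set_index accepts a Series with one entry per row of the frame
indexed = toy.set_index(labels)
print(list(indexed.index))   # ['a', 'b', 'c']

# drop=True discards the old index instead of keeping it as a column
restored = indexed.reset_index(drop=True)
print(list(restored.index))  # [0, 1, 2]
```

Both methods return a new DataFrame by default, so the original `toy` is left untouched.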
set_col_type
In this function you will be given a DataFrame, a column name, and a column type. You will convert the named column to the given type and return the updated dataset.
Useful Resources
INPUTS
- dataset – a pandas DataFrame that contains some data
- column_name – a string containing the name of a column
- new_col_type – a Python type to change the column to
OUTPUTS
a pandas DataFrame with the column in column_name changed to the type in new_col_type
Function Skeleton
# Set astype (string, int, datetime)
def set_col_type(dataset:pd.DataFrame,column_name:str,new_col_type:type) -> pd.DataFrame:
    return pd.DataFrame()
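As an illustration of type conversion in pandas, the sketch below converts a column of numeric strings to integers with `Series.astype`. The toy column is hypothetical; the point is only that the converted Series has to be assigned back for the DataFrame to change.

```python
import pandas as pd

toy = pd.DataFrame({"port": ["22", "80", "443"]})  # ports stored as strings

# astype returns a new Series; assign it back to change the column
toy["port"] = toy["port"].astype(int)

# Numeric addition now works instead of string concatenation
print(toy["port"].sum())  # 545
```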
make_DF_from_2d_array
In this function you will take data in an array as well as column and row labels and use that information to create a pandas DataFrame
Useful Resources
INPUTS
- array_2d – a 2 dimensional numpy array of values
- column_name_list – a list of strings holding column names
- index – a pandas Series holding the row indexes
OUTPUTS
a pandas DataFrame with columns set from column_name_list, row index set from index and data set from array_2d
Function Skeleton
# Take Matrix of numbers and make it into a DataFrame with column name and index numbering
def make_DF_from_2d_array(array_2d:np.array,column_name_list:list[str],index:pd.Series) -> pd.DataFrame:
    return pd.DataFrame()
sort_DF_by_column
In this function, you are given a dataset and column name. You will return a sorted dataset (sorting rows by the value of the specified column) either in descending or ascending order, depending on the value in the descending variable.
Useful Resources
INPUTS
- dataset – a pandas DataFrame that contains some data
- column_name – a string that contains the column name to sort the data on
- descending – a boolean value (True or False) for if the column should be sorted in descending order
OUTPUTS
a pandas DataFrame sorted by the given column name and in descending or ascending order depending on the value of the descending variable
Function Skeleton
# Sort DataFrame by values
def sort_DF_by_column(dataset:pd.DataFrame,column_name:str,descending:bool) -> pd.DataFrame:
    return pd.DataFrame()
drop_NA_cols
In this function you are given a DataFrame. You will return a DataFrame with any columns containing NA values dropped (meaning the returned DataFrame will only contain columns from the input DataFrame that do not contain any NA values).
Useful Resources
INPUTS
- dataset – a pandas DataFrame that contains some data
OUTPUTS
a pandas DataFrame with any columns that contain an NA value dropped
Function Skeleton
# Drop NA values in DataFrame Columns
def drop_NA_cols(dataset:pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame()
Example
Input DataFrame:
Output DataFrame:
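As a small illustration of the intended behavior, the sketch below applies `DataFrame.dropna` to a hypothetical toy DataFrame (the column names and values are made up). The `axis` argument chooses whether columns or rows containing NA values are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two columns
toy = pd.DataFrame({"src_port": [22, 80, 443],
                    "dst_port": [5000, np.nan, 5002],
                    "protocol": ["tcp", "udp", None]})

no_na_cols = toy.dropna(axis=1)  # drop columns that contain any NA
no_na_rows = toy.dropna(axis=0)  # drop rows that contain any NA

print(list(no_na_cols.columns))  # ['src_port']
print(list(no_na_rows.index))    # [0]
```

Note that both `np.nan` and `None` count as NA here, and that row dropping keeps the original index labels of the surviving rows.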
drop_NA_rows
In this function you are given a DataFrame. You will return a DataFrame with any rows containing NA values dropped.
Useful Resources
INPUTS
- dataset – a pandas DataFrame that contains some data
OUTPUTS
a pandas DataFrame with any rows that contain an NA value dropped
Function Skeleton
def drop_NA_rows(dataset:pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame()
make_new_column
This function adds a new column to a DataFrame using a provided list of values, where each value corresponds to a row in the dataset. The new column is named according to the specified new_column_name.
Useful Resources
INPUTS
- dataset – A pandas DataFrame containing existing data.
- new_column_name – A string specifying the name of the new column to create.
- new_column_value – A list of values where each element represents the value for the new column in the corresponding row of the DataFrame. The length of this list must match the number of rows in the dataset.
OUTPUTS
A pandas DataFrame with the new column added. The new column, named new_column_name, contains the values from the new_column_value list in the order they are provided. Each row's value in the new column corresponds to the element at the same index in the list.
Function Skeleton
def make_new_column(dataset:pd.DataFrame,new_column_name:str,new_column_value:list) -> pd.DataFrame:
    return pd.DataFrame()
left_merge_DFs_by_column
In this function you are given two datasets and the name of a column on which you will left join them using the pandas merge method. In the inputs below, left_dataset is the left dataset and right_dataset is the right dataset.
Useful Resources
INPUTS
- left_dataset – a pandas DataFrame that contains some data
- right_dataset – a pandas DataFrame that contains some data
- join_col_name – a string containing the column name to join the two DataFrames on
OUTPUTS
a pandas DataFrame containing the two datasets left joined together on the given column name
Function Skeleton
def left_merge_DFs_by_column(left_dataset:pd.DataFrame,right_dataset:pd.DataFrame,join_col_name:str) -> pd.DataFrame:
    return pd.DataFrame()
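If left joins are unfamiliar, the sketch below shows what one does on two hypothetical toy DataFrames: every row of the left frame is kept, and columns from the right frame are filled with NaN where no match exists. The data and column names are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"host": ["a", "b", "c"], "bytes": [10, 20, 30]})
right = pd.DataFrame({"host": ["a", "c", "d"], "alerts": [1, 5, 9]})

# how="left" keeps all rows of `left`; host "d" from `right` is discarded
merged = left.merge(right, how="left", on="host")
print(merged)
```

Host "b" has no match in the right frame, so its `alerts` value is NaN, which also turns that column into a float column.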
simpleClass
This project will require you to work with Python Classes. If you are not familiar with them we suggest learning a bit more about them.
You will take the inputs into the class initialization and set them as instance variables (of the same name) in the Python class.
Useful Resources
INPUTS
- length – an integer
- width – an integer
- height – an integer
OUTPUTS
None, just set up the init method in the class.
Function Skeleton
class simpleClass():
    def __init__(self, length:int, width:int, height:int):
        pass
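For anyone who has not written a Python class before, here is a minimal sketch of how `__init__` stores its arguments as instance variables. The `Packet` class and its attribute names are hypothetical and unrelated to the graded class.

```python
class Packet:
    """Hypothetical example class, unrelated to the graded simpleClass."""

    def __init__(self, src_port: int, dst_port: int):
        # Assigning to self.<name> creates an instance variable
        self.src_port = src_port
        self.dst_port = dst_port


pkt = Packet(src_port=22, dst_port=5000)
print(pkt.src_port, pkt.dst_port)  # 22 5000
```

Each instance keeps its own copy of these values, which is why later methods can read them back through `self`.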
find_dataset_statistics
Now that you have learned a bit about pandas DataFrames, we will use them to generate some simple summary statistics for a DataFrame. You will be given the dataset as an input variable, as well as the name of a column in the dataset that serves as the label column. This label column contains binary values (0 and 1); it is both a column you will summarize and the variable to predict.
In this context:
- 0 represents a "negative" sample (e.g. if the column is IsAVirus and we think it is false)
- 1 represents a "positive" sample (e.g. if the column is IsAVirus and we think it is true)
This type of binary classification is common in machine learning tasks where we want to be able to predict the field. An example of where this could be useful would be if we were looking at network data, and the label column was IsVirus. We could then analyze the network data of Georgia Tech services and predict if incoming files look like a virus (and if we should alert the security team).
Useful Resources
INPUTS
- dataset – a pandas DataFrame that contains some data
- label_col – a string containing the name of the label column
OUTPUTS
- n_records (int) – the number of rows in the dataset
- n_columns (int) – the number of columns in the dataset
- n_negative (int) – the number of "negative" samples in the dataset (the label column equals 0)
- n_positive (int) – the number of "positive" samples in the dataset (the label column equals 1)
- perc_positive (int) – the percentage (out of 100%) of positive samples in the dataset; truncate anything after the decimal
Hint: Consider using the int function to type cast decimals
Function Skeleton
def find_dataset_statistics(dataset:pd.DataFrame,label_col:str) -> tuple[int,int,int,int,int]:
    n_records = #TODO
    n_columns = #TODO
    n_negative = #TODO
    n_positive = #TODO
    perc_positive = #TODO
    return n_records,n_columns,n_negative,n_positive,perc_positive
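To illustrate the hint about truncation, the sketch below computes a percentage on a hypothetical six-row DataFrame and shows that `int` drops the fractional part rather than rounding. The column names are invented and this is only one possible way to count rows and labels.

```python
import pandas as pd

# Hypothetical toy data: 2 positive labels out of 6 rows
toy = pd.DataFrame({"bytes": [10, 20, 30, 40, 50, 60],
                    "IsVirus": [0, 1, 0, 0, 1, 0]})

n_rows, n_cols = toy.shape                     # (rows, columns)
n_positive = int((toy["IsVirus"] == 1).sum())  # count rows where label is 1
perc = n_positive / n_rows * 100               # 33.33...

print(n_rows, n_cols, n_positive, int(perc))   # 6 2 2 33
```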
Task 2 (25 points)
Now that you have a basic understanding of pandas and the dataset, it is time to dive into some more complex data processing tasks.
Theory
In machine learning a common goal is to train a model on one set of data. Then we validate the model on a similarly structured but different set of data. You could, for example, train the model on data you have collected historically. Then you would validate the model against real-time data as it comes in, seeing how well it predicts the new data coming in.
If you're looking at a past dataset as we are in these tasks, we need to treat different parts of the data differently to be able to develop and test models. We segregate the data into test and training portions. We train the model on the training data and test the developed model on the test data to see how well it predicts the results.
You should never train your models on test data, only on training data.
Notes
At a high level it is important to hold out a subset of your data when you train a model. That way, you can see what the expected performance is on unseen samples and determine whether the resulting model is overfit (performs much better on training data than on test data).
Preprocessing data is essential because most models only take in numerical values. Therefore, categorical features need to be "encoded" to numerical values so that models can use them. A machine learning model may not be able to make sense of "green," "blue" and "red." In preprocessing, we'll convert those to integer values 1, 2 and 3, for example. It's an interesting question as to what happens when you have training data that has "green," "red" and "blue," but your testing data says "yellow."
Numerical scaling can be more or less useful depending on the type of model used, but it is especially important in linear models. Numerical scaling typically takes numerical values and "compresses" them into a range between 0 and 1 (inclusive) that retains the relationships among the original data.
These preprocessing techniques will provide you with options to augment your dataset and improve model performance.
Useful Links:
Deliverables:
- Complete the functions and methods in task2.py
- For this task we have released a local test suite. Please set it up and use it to debug your functions.
- Submit task2.py to Gradescope when you pass all local tests. Refer to the page for details.
Instructions:
The task2.py file has function skeletons that you will complete with Python code (mostly using the pandas and scikit-learn libraries). The goal of each of these functions is to give you familiarity with the applied concepts of splitting and preprocessing data. See information about each function's inputs, outputs, and skeletons below.
Local Test Dataset Information
For this task, the local test dataset is a very simple example dataset, which contains 10 rows and 6 columns of data related to network traffic.
In this task we will guide you through splitting datasets into train and test sets as well as preprocessing datasets using scikit-learn encoders, scalers and dimensionality reduction techniques.
tts
In this function, you will take:
- a dataset
- the name of its label column
- a percentage of the data to put into the test set
- whether you should stratify on the label column
- a random state to set the scikit-learn function
You will return features and labels for the training and test sets.
At a high level, you can separate the task into two subtasks. The first is splitting your dataset into both features and labels (by columns), and the second is splitting your dataset into training and test sets (by rows). You should use the scikit-learn train_test_split function but will have to write wrapper code around it based on the input values we give you.
Useful Resources
INPUTS
- dataset – a pandas DataFrame that contains some data
- label_col – a string containing the name of the column that contains the label values (what our model wants to predict)
- test_size – a float containing the decimal value of the percentage of the number of rows that the test set should be out of the dataset
- should_stratify – a boolean (True or False) value indicating if the resulting train/test split should be stratified or not
- random_state – an integer value to set the randomness of the function (useful for repeatability especially when autograding)
OUTPUTS
- train_features – a pandas DataFrame that contains the train rows and the feature columns
- test_features – a pandas DataFrame that contains the test rows and the feature columns
- train_labels – a pandas Series that contains the train rows and the label column
- test_labels – a pandas Series that contains the test rows and the label column
Function Skeleton
def tts(dataset: pd.DataFrame,
        label_col: str,
        test_size: float,
        should_stratify: bool,
        random_state: int) -> tuple[pd.DataFrame,pd.DataFrame,pd.Series,pd.Series]:
    # TODO
    return train_features,test_features,train_labels,test_labels
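The two subtasks described above (split by columns, then split by rows) can be sketched on a hypothetical ten-row DataFrame as follows. This is an illustration of scikit-learn's `train_test_split`, not the graded wrapper; the toy data, the fixed `test_size`, and the always-on stratification are assumptions made for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a balanced binary label
toy = pd.DataFrame({"bytes": range(10),
                    "label": [0, 1] * 5})

# Split by columns: features vs. labels
features = toy.drop(columns="label")
labels = toy["label"]

# Split by rows: stratify=labels keeps the class balance in both parts;
# passing stratify=None would give a plain random split instead
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0)

print(len(X_train), len(X_test))  # 8 2
print(sorted(y_test))             # [0, 1]
```

Note the order of the returned values: both feature sets come first, followed by both label sets.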
PreprocessDataset
The PreprocessDataset class contains a code skeleton with nine methods for you to implement. Most methods will be split into two parts: one that will be run on the training dataset and one that will be run on the test dataset. In Data Science/Machine Learning, this is done to avoid something called data leakage.
For this assignment, we don't expect you to understand the nuances of the concept, but we will have you follow principles that will minimize the chances of it occurring. You will accomplish this by splitting data into training and test datasets and processing those datasets in slightly different ways.
Generally, for everything you do in this project, and if you do any ML or Data Science work in the future, you should train/fit on the training data first, then predict/transform on the training and test data. That holds up for basic preprocessing steps like task 2 and for complex models like you will see in tasks 3 and 4.
For the purposes of this project, you should never train or fit on the test data (and more generally in any ML project) because your test data is expected to give you an understanding of how your model/predictions will perform on unseen data. If you fit even a preprocessing step to your test data, then you are either giving the model information about the test set it wouldn't have about unseen data (if you combine train and test and fit to both), or you are providing a different preprocessing than the model is expecting (if you fit a different preprocessor to the test data), and your model would not be expected to perform well.
Note: You should train/fit using the train dataset; then, once you have a fit encoder/scaler/pca/model instance, you can transform/predict on the training and test data.
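The fit-on-train, transform-on-both pattern can be sketched with any scikit-learn preprocessor. The example below uses `MinMaxScaler` on hypothetical single-column data; the values are chosen so that the effect of reusing the training minimum and maximum is easy to see.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data
train = pd.DataFrame({"bytes": [0.0, 50.0, 100.0]})
test = pd.DataFrame({"bytes": [25.0, 200.0]})

scaler = MinMaxScaler()
scaler.fit(train)                       # learn min/max from training data only

train_scaled = scaler.transform(train)  # uses the learned min=0, max=100
test_scaled = scaler.transform(test)    # reuses the same min/max, no refit

print(test_scaled.ravel())
```

The test value 200 maps to 2.0, outside [0, 1], because the scaler only knows the training range. That is expected: the test set is treated like unseen data.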
You will also notice that we are only preprocessing the features and not the labels. There are a few cases where preprocessing steps on labels may be helpful in modeling, but they are definitely more advanced and out of the scope of this introduction. Generally, you will not need to do any preprocessing to your labels beyond potentially encoding a string value (i.e., "Malware" or "Benign") into an integer value (0 or 1), which is called label encoding.
PreprocessDataset:__init__
Similar to the Task1 simpleClass subtask you previously completed you will initialize the class by adding instance variables (add all the inputs to the class).
Useful Resources
INPUTS
- one_hot_encode_cols – a list of column names (strings) that should be one hot encoded by the one hot encode methods
- min_max_scale_cols – a list of column names (strings) that should be min/max scaled by the min/max scaling methods
- n_components – an int that contains the number of components that should be used in Principal Component Analysis
- feature_engineering_functions – a dictionary that contains feature name and function to create that feature as a key value pair (example shown below)
Example of feature_engineering_functions:
def double_height(dataframe:pd.DataFrame):
    return dataframe["height"] * 2
def half_height(dataframe:pd.DataFrame):
    return dataframe["height"] / 2
feature_engineering_functions = {"double_height":double_height,"half_height":half_height}
Don't worry about copying it. We also have examples in the local test cases; this is just provided as an illustration of what to expect in your function.
OUTPUTS
None, just assign all the input parameters to instance variables.
Also, per the instructions below, you'll return here and create another instance variable: a scikit-learn OneHotEncoder with any parameters you may need later.
Function Skeleton
def __init__(self,
             one_hot_encode_cols:list[str],
             min_max_scale_cols:list[str],
             n_components:int,
             feature_engineering_functions:dict
             ):
    # TODO: Add any instance variables you may need to make your functions work
    return
PreprocessDataset:one_hot_encode_columns_train and one_hot_encode_columns_test
One Hot Encoding is the process of taking a column and returning a binary vector representing the various values within it. There is a separate function for the training and test datasets since they should be handled separately to avoid data leakage (see the 3rd link in Useful Resources for a little more info on how to handle them).
Pseudocode
one_hot_encode_columns_train()
1. In the PreprocessDataset __init__() method, initialize an instance variable containing a scikit-learn OneHotEncoder with any parameters you may need.
2. Split train_features into two DataFrames: one with only the columns you want to one hot encode (using one_hot_encode_cols) and another with all the other columns.
3. Fit the OneHotEncoder using the DataFrame you split from train_features with the columns you want to encode.
4. Transform the DataFrame you split from train_features with the columns you want to encode using the fitted OneHotEncoder.
5. Create a DataFrame from the 2D array of data that the output from step 4 gave you, with column names in the form of columnName_categoryName (there should be an attribute in OneHotEncoder that can help you with this) and the same index that train_features had.
6. Concatenate the DataFrame you made in step 5 with the DataFrame of other columns from step 2.
one_hot_encode_columns_test()
1. Split test_features into two DataFrames: one with only the columns you want to one hot encode (using one_hot_encode_cols) and another with all the other columns.
2. Transform the DataFrame you split from test_features with the columns you want to encode using the OneHotEncoder you fit in one_hot_encode_columns_train().
3. Create a DataFrame from the 2D array of data that the output from step 2 gave you, with column names in the form of columnName_categoryName (there should be an attribute in OneHotEncoder that can help you with this) and the same index that test_features had.
4. Concatenate the DataFrame you made in step 3 with the DataFrame of other columns from step 1.
Example Walkthrough (from Local Testing suite):
INPUTS:
one_hot_encode_cols
["src_ip","protocol"]
Train Features
Test Features
TRAIN DATAFRAMES AT EACH STEP:
2. DataFrame with columns to encode, and DataFrame with other columns
4. One hot encoded 2D array
5. One hot encoded DataFrame with index and column names
6. Final DataFrame with passthrough/other columns joined back
TEST DATAFRAMES AT EACH STEP:
1. DataFrame with columns to encode, and DataFrame with other columns
2. One hot encoded 2D array
3. One hot encoded DataFrame with index and column names
4. Final DataFrame with passthrough columns joined back
Note: For the local tests and autograder use the column naming scheme of joining the previous column name and the column value with an underscore (similar to above where Type -> Type_Fruit and Type_Vegetable)
Note 2: Since you should only be fitting your encoder on the training data, if there are values in your test set that are different than those in the training set, you will denote that with 0s. In the example above, let's say we have a row in the test set with pizza, which is neither a fruit nor a vegetable. It should result in a 0 for both the Type_Fruit and Type_Vegetable columns. If you don't handle these properly, you may get errors like Test Failed: Found unknown categories.
Note 3: You may be tempted to use the pandas function get_dummies to solve this task, but it's a trap. It seems easier, but you will have to do a lot more work to make it handle a train/test split. So, we suggest you use scikit-learn's OneHotEncoder.
Useful Resources
INPUTS
- Use the needed instance variables you set in the __init__ method
- train_features – a dataset split by a function similar to tts which should be used in the training/fitting steps
- test_features – a dataset split by a function similar to tts which should be used in the test steps
OUTPUTS
a pandas DataFrame with the columns listed in one_hot_encode_cols one hot encoded and all other columns in the DataFrame unchanged
Function Skeleton
def one_hot_encode_columns_train(self,train_features:pd.DataFrame) -> pd.DataFrame:
    one_hot_encoded_dataset = pd.DataFrame()
    return one_hot_encoded_dataset
def one_hot_encode_columns_test(self,test_features:pd.DataFrame) -> pd.DataFrame:
    one_hot_encoded_dataset = pd.DataFrame()
    return one_hot_encoded_dataset
PreprocessDataset:min_max_scaled_columns_train and min_max_scaled_columns_test
Min/Max Scaling is a process to transform numerical features to a specific range, typically [0, 1], to ensure that input values are comparable (similar to how you may have heard of "normalizing" data) and is a crucial preprocessing step for many machine learning algorithms. In particular, this scaling is essential for algorithms like linear regression, logistic regression, k-means, and neural networks, which can be sensitive to the scale of input features, whereas some algorithms like decision trees are less impacted.
By applying Min/Max Scaling, we prevent feature dominance, ideally improving the performance and accuracy of these algorithms and their training convergence. It's a recommended step to ensure your models are trained on consistent and standardized data.
For the provided assignment you should use the scikit-learn MinMaxScaler class (linked in the resources below) rather than attempting to implement your own scaling function.
The rough implementation of the scikit-learn function is provided below for educational purposes, where min and max are the bounds of the target range (0 and 1 by default).
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
Note: There are separate functions for the training and test datasets to help avoid data leakage between the test/train datasets. Please refer to the 3rd link in Useful Resources for more information on how to handle this; namely, we should still scale the test data based on our "knowledge" of the train dataset.
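To make the train/test point concrete, here is a minimal sketch that fits a MinMaxScaler on a toy training frame and then applies the same learned minimum and maximum to a toy test frame. The data and the "Item" column are made up for illustration, and this is not the required class implementation.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frames; only "Price" is scaled, "Item" is left untouched.
train = pd.DataFrame({"Price": [10.0, 20.0, 30.0], "Item": ["a", "b", "c"]})
test = pd.DataFrame({"Price": [15.0, 40.0], "Item": ["d", "e"]})
cols = ["Price"]

scaler = MinMaxScaler()      # default feature_range is (0, 1)
scaler.fit(train[cols])      # learns min=10 and max=30 from the training split only

train_scaled = train.copy()
test_scaled = test.copy()
train_scaled[cols] = scaler.transform(train[cols])
# The test split reuses the training min/max: 15 maps to about 0.25,
# and 40 maps to about 1.5, which falls outside [0, 1].
test_scaled[cols] = scaler.transform(test[cols])
```

Test values outside the training range landing outside [0, 1] is expected behavior here; refitting the scaler on the test split to "fix" that would leak information about the test data.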
Example Dataframe:
Example Min Max Scaled Dataframe (rounded to 4 decimal places):
Note: For the Autograder use the same column name as the original column (ex: Price -> Price)
Useful Resources
INPUTS
- Use the needed instance variables you set in the __init__ method
- train_features: a dataset split by a function similar to tts which should be used in the training/fitting steps
- test_features: a dataset split by a function similar to tts which should be used in the test steps
OUTPUTS
a pandas DataFrame with the columns listed in min_max_scale_cols min/max scaled and all other columns in the DataFrame unchanged
Function Skeleton
def min_max_scaled_columns_train(self, train_features: pd.DataFrame) -> pd.DataFrame:
    min_max_scaled_dataset = pd.DataFrame()
    return min_max_scaled_dataset
def min_max_scaled_columns_test(self, test_features: pd.DataFrame) -> pd.DataFrame:
    min_max_scaled_dataset = pd.DataFrame()
    return min_max_scaled_dataset
PreprocessDataset:pca_train and pca_test
Principal Component Analysis is a dimensionality reduction technique (column reduction). It aims to take the variance in your input columns and map the columns into N columns that contain as much of the variance as it can. This technique can be useful if you are trying to train a model faster and has some more advanced uses, especially when training models on data which has many columns but few rows. There is a separate function for the training and test datasets because they should be handled separately to avoid data leakage (see the 3rd link in Useful Resources for a little more info on how to handle them).
Note 1: For the local tests and autograder, use the column naming scheme of column names: component_1, component_2 .. component_n for the n_components passed into the __init__ method.
Note 2: For your PCA outputs to match the local tests and autograder, make sure you set the seed using a random state of 0 when you initialize the PCA function.
Note 3: Since PCA does not work with NA values, make sure you drop any columns that have NA values before running PCA.
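The notes above can be sketched on random toy data as follows. Deciding which columns to drop from the training split alone, and preserving the original index, are assumptions of this sketch rather than confirmed requirements of the tests.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Random toy data with four illustrative columns.
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(20, 4)), columns=list("abcd"))
test = pd.DataFrame(rng.normal(size=(5, 4)), columns=list("abcd"))
train.loc[0, "d"] = np.nan  # one missing value, so column "d" is dropped

n_components = 2
# Assumption: the columns to keep are decided on the training split.
kept_cols = train.dropna(axis=1).columns

pca = PCA(n_components=n_components, random_state=0)
names = [f"component_{i + 1}" for i in range(n_components)]

# Fit on train, then reuse the fitted projection on test.
train_pca = pd.DataFrame(
    pca.fit_transform(train[kept_cols]), columns=names, index=train.index
)
test_pca = pd.DataFrame(
    pca.transform(test[kept_cols]), columns=names, index=test.index
)
```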
Useful Resources
INPUTS
- Use the needed instance variables you set in the __init__ method
- train_features: a dataset split by a function similar to tts which should be used in the training/fitting steps
- test_features: a dataset split by a function similar to tts which should be used in the test steps
OUTPUTS
a pandas DataFrame with the generated pca values and using column names: component_1, component_2 .. component_n
Function Skeleton
def pca_train(self, train_features: pd.DataFrame) -> pd.DataFrame:
    # TODO: Read the function description in https://github.gatech.edu/pages/cs6035-tools/cs6035-tools.github.io/Projects/Machine_Learning/Task2.html and implement the function as described
    pca_dataset = pd.DataFrame()
    return pca_dataset
def pca_test(self, test_features: pd.DataFrame) -> pd.DataFrame:
    # TODO: Read the function description in https://github.gatech.edu/pages/cs6035-tools/cs6035-tools.github.io/Projects/Machine_Learning/Task2.html and implement the function as described
    pca_dataset = pd.DataFrame()
    return pca_dataset
PreprocessDataset:feature_engineering_train, feature_engineering_test
Feature Engineering is a process of using domain knowledge (physics, geometry, sports statistics, business metrics, etc.) to create new features (columns) out of the existing data. This could mean creating an area feature when given the length and width of a rectangle, extracting the major and minor version numbers from a software version, or more complex logic depending on the scenario.
In cybersecurity in particular, feature engineering is crucial for using a domain expert's (e.g. a security analyst's) experience to identify anomalous behavior that might signify a security breach. This could involve creating features that represent deviations from established baselines, such as unusual file access patterns, unexpected network connections, or sudden spikes in CPU usage. These anomaly-based features can help distinguish malicious activity from normal system operations, but the system does not know offhand which data patterns are anomalous; that is where you as the domain expert can help by creating features.
These methods utilize a dictionary, feature_engineering_functions, passed to the class constructor (__init__). This dictionary defines how to generate new features:
- Keys: Strings representing new column names.
- Values: Functions that:
- Take a DataFrame as input.
- Return a Pandas Series (the new columnโs values).
Example of what could be passed as the feature_engineering_functions dictionary to __init__:
import pandas as pd
def double_height(dataframe: pd.DataFrame) -> pd.Series:
    return dataframe["height"] * 2
def half_height(dataframe: pd.DataFrame) -> pd.Series:
    return dataframe["height"] / 2
example_feature_engineering_functions = {
    "double_height": double_height,  # Note that functions in python can be passed around and used just like data!
    "half_height": half_height
}
# and the class may have been created like this...
# preprocessor = PreprocessDataset(..., feature_engineering_functions=example_feature_engineering_functions, ...)
In particular for this method, you will be taking in a dictionary with a column name and a function that takes in a DataFrame and returns a column. You'll be using that to create a new column with the name in the dictionary key. Therefore, if you were given the above functions, you would create two new columns named "double_height" and "half_height" in your DataFrame.
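One way to picture the dictionary-driven approach is the standalone sketch below. The `add_features` helper is hypothetical and is not part of the provided skeleton; it only shows how each key becomes a column name and each value is called with the DataFrame.

```python
import pandas as pd


def double_height(dataframe: pd.DataFrame) -> pd.Series:
    return dataframe["height"] * 2


def half_height(dataframe: pd.DataFrame) -> pd.Series:
    return dataframe["height"] / 2


feature_engineering_functions = {
    "double_height": double_height,
    "half_height": half_height,
}


def add_features(frame: pd.DataFrame, functions: dict) -> pd.DataFrame:
    """Hypothetical helper: return a copy of `frame` with one new column per entry."""
    result = frame.copy()
    for column_name, function in functions.items():
        # Each function receives the original frame and returns a Series.
        result[column_name] = function(frame)
    return result


people = pd.DataFrame({"height": [150, 180]})
engineered = add_features(people, feature_engineering_functions)
```

Working on a copy keeps the input frame unchanged, and passing the original frame to every function means the functions do not depend on the order of the dictionary entries.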
Useful Resources
INPUTS
- Use the needed instance variables you set in the __init__ method
- train_features: a dataset split by a function similar to tts which should be used in the training/fitting steps
- test_features: a dataset split by a function similar to tts which should be used in the test steps
OUTPUTS
a pandas DataFrame with the features described in feature_engineering_functions added as new columns and all other columns in the DataFrame unchanged
Function Skeleton
def feature_engineering_train(self, train_features: pd.DataFrame) -> pd.DataFrame:
    feature_engineered_dataset = pd.DataFrame()
    return feature_engineered_dataset
def feature_engineering_test(self, test_features: pd.DataFrame) -> pd.DataFrame:
    feature_engineered_dataset = pd.DataFrame()
    return feature_engineered_dataset
PreprocessDataset:preprocess_train, preprocess_test
Now, we will put three of the above methods together into a preprocess function. This function will take in a dataset and perform encoding, scaling, and feature engineering using the above methods and their respective columns. You should not perform PCA for this function.
Useful Resources
See resources for one hot encoding, min/max scaling and feature engineering above
INPUTS
- Use the needed instance variables you set in the __init__ method
- train_features: a dataset split by a function similar to tts which should be used in the training/fitting steps
- test_features: a dataset split by a function similar to tts which should be used in the test steps
OUTPUTS
a pandas dataframe for both test and train features with the columns in one_hot_encode_cols encoded, the columns in min_max_scale_cols scaled and the columns described in feature_engineering_functions engineered. You do not need to use PCA here.
Function Skeleton
def preprocess_train(self, train_features: pd.DataFrame) -> pd.DataFrame:
    preprocessed_dataset = pd.DataFrame()
    return train_features
def preprocess_test(self, test_features: pd.DataFrame) -> pd.DataFrame:
    preprocessed_dataset = pd.DataFrame()
    return test_features
Task 3 (15 points)
In Task 2 you learned how to split a dataset into training and testing components. Now it's time to learn about using a K-means model. We will run a basic model on the data to cluster files (rows) with similar attributes together. We will use an unsupervised model.
Theory
An unsupervised model has no label column. By contrast, in supervised learning (which you'll see in Task 4) the data has features and targets/labels. These labels are effectively an answer key to the data in the feature columns. You don't have this answer key in unsupervised learning; instead you're working on data without labels. You'll need to use algorithms that learn directly from the data without the benefit of labels.
We start with K-means because the algorithm is simple to understand. For the mathematical people, you can look at the underlying data structure. Based on squared Euclidean distances, K-means creates clusters of similar datapoints. Each cluster has a centroid. The idea is that each sample is associated/clustered with the centroid that is the "closest."
Closest is an interesting concept in higher dimensions. You can think of each feature in a dataset as a dimension in the data. If itโs 2D or 3D, we can visualize it easily. Concepts of distance are easy to visualize in 2D and 3D, and they extend similarly to higher dimensions.
If you read the Wikipedia articles for K-means you'll see a discussion of the use of "squared Euclidean distances" in K-means. This is compared with simple Euclidean distances in the Weber problem, and better approaches resulting from k-medians and k-medoids are discussed.
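To make "closest centroid by squared Euclidean distance" concrete, here is a small NumPy sketch of one assignment step and one update step on made-up 2D points. It illustrates the idea only; scikit-learn's KMeans handles initialization, iteration, and convergence for you.

```python
import numpy as np

# Four toy points and two toy centroids in 2D.
points = np.array([[0.0, 0.0], [1.0, 0.0], [9.0, 9.0], [10.0, 10.0]])
centroids = np.array([[0.5, 0.0], [9.5, 9.5]])

# Pairwise squared Euclidean distances, shape (n_points, n_centroids).
diffs = points[:, None, :] - centroids[None, :, :]
squared_distances = (diffs ** 2).sum(axis=2)

# Assignment step: each point joins the cluster of its closest centroid.
assignments = squared_distances.argmin(axis=1)

# Update step: each centroid moves to the mean of its assigned points.
new_centroids = np.array(
    [points[assignments == c].mean(axis=0) for c in range(len(centroids))]
)
```

The same two steps work unchanged with more columns, since each feature is just another dimension in the distance sum.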
Please use scikit-learn to create the model and Yellowbrick to determine the optimal value of k for the dataset.
So far, we have functions to split the data and preprocess it. Now, we will run a basic model on the data to cluster files (rows) with similar attributes together. We will use an unsupervised model (model with no label column), K-means. Again, use scikit-learn to create the model and Yellowbrick to determine the optimal value of k for the dataset.
Refer to the page for details about submitting your work.
Useful Links:
Deliverables:
- Complete the KmeansClustering class in task3.py.
- For this task we have released a local test suite; please set it up and use it to debug your functions.
- Submit task3.py to Gradescope when you pass all local tests. Refer to the page for details.
Local Test Dataset Information
For this task the local test dataset we are using is the NATICUSdroid dataset, which contains 86 columns of data related to android permissions used by benign and malicious Android applications released between 2010 and 2019. For more information such as the introductory paper and the Citations/Acknowledgements you can view the site in the UCI ML repository. In this specific case clustering can be a useful tool to group apps that request similar permissions together. The team that created this dataset hypothesized that malicious apps would exhibit distinct patterns in the types of permissions they request compared to benign apps. This difference in permission request patterns could potentially be used to distinguish between malicious and benign applications.
KmeansClustering
The KmeansClustering Class contains a code skeleton with 4 methods for you to implement.
Note: You should train/fit using the train dataset then once you have a Yellowbrick/K-means model instance you can transform/predict on the training and test data.
KmeansClustering:__init__
Similar to Task 1, you will initialize the class by adding instance variables as needed.
Useful Resources
INPUTS
These inputs are all expected to be passed to the sklearn.cluster.KMeans class so review that documentation for more details on what the parameters are used for.
- random_state: an integer that should be used to set the scikit-learn randomness
- init: a string denoting what method to use to initialize the K-means cluster centroids
- n_init: an integer used to set the number of times the K-means algorithm is run with different centroid seeds
- max_iter: an integer used to set the maximum number of iterations of the K-means algorithm for a single run
- algorithm: a string denoting what K-means algorithm to use
- tol: a float used to set the threshold for relative tolerance for the difference in the cluster centers of two consecutive iterations to declare convergence
Each of these is set so the model results will be repeatable, which is required for the tests and autograder.
OUTPUTS
None
Function Skeleton
def __init__(self,
             random_state: int,
             init: str,
             n_init: int,
             max_iter: int,
             algorithm: str,
             tol: float
             ):
    # TODO: Add any state variables you may need to make your functions work
    pass
KmeansClustering:kmeans_get_n_clusters
K-means clustering is a process of grouping similar rows together and assigning them to a cluster. For this method you will use the training data to fit an optimal K-means cluster on the data.
To help you get started we have provided a list of subtasks to complete for this task:
- Initialize a scikit-learn K-means model with the parameters passed to the __init__ method
  - init
  - n_init
  - max_iter
  - random_state
  - algorithm
  - tol
- Try to find the best "k" to use for the KMeans Clustering.
- Initialize a Yellowbrick KElbowVisualizer with the K-means model.
- Use that visualizer to search for the optimal value of k between 2 (inclusive) and 10 (exclusive); in mathematical notation that would be [2,10).
- If you are stuck on this step, review the Yellowbrick docs.
- Train the KElbowVisualizer on the training data and determine the optimal k value.
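For intuition about what an elbow search scans, the sketch below loops over k in [2, 10) with plain scikit-learn and records each model's inertia on synthetic data. This is a swapped-in manual loop, not Yellowbrick's KElbowVisualizer: it does not pick an elbow and does not replace the visualizer the task asks you to use. The data and parameter values are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

inertias = {}
for k in range(2, 10):  # the half-open interval [2, 10)
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X)
    # Inertia is the sum of squared distances from points to their centroid.
    inertias[k] = model.inertia_
```

Inertia keeps shrinking as k grows, so the smallest value is not the answer by itself; the "elbow" is where adding clusters stops paying off, which is the judgment the visualizer automates.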
Useful Resources
INPUTS
- Use the needed instance variables you set in the __init__ method
- train_features: a dataset split by a function similar to tts which should be used in the training/fitting steps
OUTPUTS
The optimal value of K for the input training features
Function Skeleton
def kmeans_get_n_clusters(self, train_features: pd.DataFrame) -> int:
    k = int
    return k
KmeansClustering:kmeans_train
K-means clustering is a process of grouping similar rows together and assigning them to a cluster. For this method you will use the training data to fit an optimal K-means cluster on the data.
- Train a K-means model with the proper initialization, using the value of k passed as an input or, if no value is passed, the optimal value of k determined by the function you previously wrote (kmeans_get_n_clusters).
- Return the cluster ids for each row of the training set as a list.
Useful Resources
INPUTS
- Use the needed instance variables you set in the __init__ method
- train_features: a dataset split by a function similar to tts which should be used in the training/fitting steps
- k (optional): an int value of the number of K-means clusters or None
OUTPUTS
a list of cluster ids that the K-means model has assigned for each row in the train dataset
Function Skeleton
def kmeans_train(self, train_features: pd.DataFrame, k: int = None) -> list:
    cluster_ids = list()
    return cluster_ids
KmeansClustering:kmeans_test
K-means clustering is a process of grouping similar rows together and assigning them to a cluster. For this method you will use the model trained in the previous method to assign each row of the test data to a cluster, without refitting the model.
To help you get started, we have provided a list of subtasks to complete for this task:
- Use the model you trained in the kmeans_train method to generate cluster ids for each row of the test dataset.
  - If you are stuck here, review the scikit-learn docs for KMeans and make sure you are using a method that will return cluster ids for the test data you input.
- Return the cluster ids for each row of the test set as a list.
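As a minimal sketch of the fit-then-predict pattern on synthetic data: the model is fit once on the training rows and then only queried for the test rows. The data, the column names, and the fixed k=3 are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic feature table, split by position into train and test rows.
X, _ = make_blobs(n_samples=60, centers=3, n_features=4, random_state=0)
features = pd.DataFrame(X, columns=[f"f{i}" for i in range(4)])
train_features, test_features = features.iloc[:45], features.iloc[45:]

model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(train_features)  # centroids are learned from the training rows only

# labels_ holds the cluster id of every training row.
train_cluster_ids = model.labels_.tolist()
# predict assigns test rows to the nearest learned centroid without refitting.
test_cluster_ids = model.predict(test_features).tolist()
```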
Useful Resources
INPUTS
- Use the needed instance variables you set in the __init__ method
- test_features: a dataset split by a function similar to tts which should be used in the test steps
OUTPUTS
a list of cluster ids that the K-means model has assigned for each row in the test dataset
Function Skeleton
def kmeans_test(self, test_features: pd.DataFrame) -> list:
    cluster_ids = list()
    return cluster_ids
Task 4 (25 points)
Now let's try a few supervised classification models. We have chosen a few commonly used models for you to use here, but there are many options. In the real world, specific algorithms may fit a specific dataset better than other algorithms.
You won't be doing any hyperparameter tuning yet, so you can better focus on writing the basic code. You will:
- Train a model using the training set.
- Predict on the training/test sets.
- Calculate performance metrics.
- Return a ModelMetrics object and trained scikit-learn model from each model function.
Important Note on Feature Selection: You should ONLY use RFE (Recursive Feature Elimination) for determining feature importance of your Logistic Regression model. Do NOT use RFE for any tree-based models (Decision Tree, Random Forest, or Gradient Boosting). For tree-based models, use their built-in feature importance values instead.
Useful Links:
Deliverables:
- Complete the functions and methods in task4.py
- For this task we have released a local test suite; please set it up and use it to debug your functions.
- Submit task4.py to Gradescope when you pass all local tests. Refer to the page for details.
Local Test Dataset Information
For this task the local test dataset we are using is the NATICUSdroid dataset, which contains 86 columns of data related to android permissions used by benign and malicious Android applications released between 2010 and 2019. For more information such as the introductory paper and the Citations/Acknowledgements you can view the site in the UCI ML repository. In the paper that the dataset creators wrote from their research, they trained a variety of different models, including Random Forest, Logistic Regression, and XGBoost, and calculated a variety of metrics related to training and detection performance. In this task we will guide you through training ML models and calculating performance metrics to compare the predictive abilities of different models.
Instructions:
The task4.py file has function skeletons that you will complete with Python code (mostly using the pandas and scikit-learn libraries).
The goal of each of these functions is to give you familiarity with the applied concepts of training a model, using it to score records and calculating performance metrics for it. See information about the function inputs, outputs and skeletons below.
ModelMetrics
- In order to simplify the autograding we have created a class that will hold the metrics and feature importances for a model you trained.
- You should not modify this class but are expected to use it in your return statements.
- This means you put your training and test metrics dictionaries and feature importance DataFrames inside a ModelMetrics object for the autograder to handle. This is for each of the Logistic Regression, Gradient Boosting and Random Forest models you will create.
- You do not need to return a feature importance DataFrame in the ModelMetrics value for the naive model you will create, just return None in that position of the return statement, as the given code does.
calculate_naive_metrics
A naive model is a very simple model/prediction that can help to frame how well a more sophisticated model is doing. At best, such a model has random competence at predicting things. At worst, it's wrong all the time.
Since a naive model is incredibly basic (often a constant or randomly selected result), we can expect that any more sophisticated model that we train should outperform it. If the naive model beats our trained model, it can mean that additional data (rows or columns) is needed in the dataset to improve our model. It can also mean that the dataset doesn't have a strong enough signal for the target we want to predict.
In this function, you'll implement a simple model that always predicts a constant (function-provided) number, regardless of the input values. Specifically, you'll use a given constant integer, provided as the parameter naive_assumption, as the model's prediction. This means the model will always output this constant value, without considering the actual data. Afterward, you will calculate four metrics (accuracy, recall, precision, and F1-score) for both the training and test datasets.
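A constant-prediction baseline can be sketched with scikit-learn's metric functions as shown below. The `constant_prediction_metrics` helper, the toy targets, and the use of zero_division=0 are assumptions of this example; check the local tests for the behavior the autograder actually expects.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def constant_prediction_metrics(targets: pd.Series, naive_assumption: int) -> dict:
    """Hypothetical helper: score a 'model' that always predicts the same value."""
    predictions = [naive_assumption] * len(targets)
    return {
        "accuracy": round(accuracy_score(targets, predictions), 4),
        # zero_division=0 avoids warnings when a class is never predicted.
        "recall": round(recall_score(targets, predictions, zero_division=0), 4),
        "precision": round(precision_score(targets, predictions, zero_division=0), 4),
        "fscore": round(f1_score(targets, predictions, zero_division=0), 4),
    }


# Toy targets: four positives and six negatives.
targets = pd.Series([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])
metrics_always_one = constant_prediction_metrics(targets, naive_assumption=1)
metrics_always_zero = constant_prediction_metrics(targets, naive_assumption=0)
```

Always predicting 1 on these targets gives perfect recall but only 0.4 accuracy and precision, while always predicting 0 gives 0.6 accuracy with zero recall. That gap is why accuracy alone is a weak yardstick on imbalanced data.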
[1] Refer to the resources below.
Useful Resources
INPUTS
- train_features: a dataset split by a function similar to the tts function you created in task2
- test_features: a dataset split by a function similar to the tts function you created in task2
- train_targets: a dataset split by a function similar to the tts function you created in task2
- test_targets: a dataset split by a function similar to the tts function you created in task2
- naive_assumption: an integer that should be used as the result from the naive model you will create
OUTPUTS
A completed ModelMetrics object with a training and test metrics dictionary with each one of the metrics rounded to 4 decimal places
Function Skeleton
def calculate_naive_metrics(train_features: pd.DataFrame, test_features: pd.DataFrame, train_targets: pd.Series, test_targets: pd.Series, naive_assumption: int) -> ModelMetrics:
    train_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0
    }
    test_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0
    }
    naive_metrics = ModelMetrics("Naive", train_metrics, test_metrics, None)
    return naive_metrics
calculate_logistic_regression_metrics
Important Note on Feature Selection: RFE (Recursive Feature Elimination) should ONLY be used for Logistic Regression. Do NOT use RFE for any tree-based models (Decision Tree, Random Forest, or Gradient Boosting); use their built-in feature importance values instead.
A logistic regression model is a simple and more explainable statistical model that can be used to estimate the probability of an event. At a high level, a logistic regression model uses data in the training set to estimate a column's weight in a linear approximation function. Conceptually this is similar to estimating m for each column in the line formula you probably know well from geometry: y = m*x + b. If you are interested in learning more, you can read about the math behind how this works. For this project, we are more focused on showing you how to apply these models, so you can simply use a scikit-learn Logistic Regression model in your code.
For this task use scikit-learn's LogisticRegression class and complete the following subtasks:
- Train a Logistic Regression model (initialized using the kwargs passed into the function)
- Predict scores for training and test datasets and calculate the 7 metrics listed below for the training and test datasets using predictions from the fit model. (All rounded to 4 decimal places)
  - accuracy
  - recall
  - precision
  - fscore
  - false positive rate (fpr)
  - false negative rate (fnr)
  - Area Under the Curve of Receiver Operating Characteristics Curve (roc_auc)
- Use RFE to select the top n_feat_importance features
- Train a Logistic Regression model using these selected features (initialized using the kwargs passed into the function)
- Create a Feature Importance DataFrame from the model trained on the top n_feat_importance features:
  - Use the top N features (where N is configured via the n_feat_importance input variable) and sort by absolute value of the coefficient from biggest to smallest.
  - Make sure you use the same feature and importance column names as set in ModelMetrics in feat_name_col [Feature] and imp_col [Importance].
  - Round the importances to 4 decimal places (do this step after you have sorted by Importance)
  - Reset the index to 0->(N-1). You can do this the same way you did in task1.
NOTE: Make sure you use the predicted probabilities for roc_auc.
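The less familiar pieces of these subtasks (fpr and fnr from a confusion matrix, roc_auc from predicted probabilities, and an RFE-based importance table) can be sketched on synthetic data as follows. The data, the kwargs values, and keeping signed coefficients while sorting by their absolute value are assumptions of this sketch; it only scores the training data and is not the full function.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=0)
features = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])
targets = pd.Series(y)
logreg_kwargs = {"random_state": 0, "max_iter": 1000}  # illustrative values

model = LogisticRegression(**logreg_kwargs).fit(features, targets)
predictions = model.predict(features)
probabilities = model.predict_proba(features)[:, 1]  # probability of class 1

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(targets, predictions).ravel()
fpr = round(fp / (fp + tn), 4)
fnr = round(fn / (fn + tp), 4)
roc_auc = round(roc_auc_score(targets, probabilities), 4)

# RFE keeps the top n features; a second model is then fit on just those columns.
n_feat_importance = 3
selector = RFE(LogisticRegression(**logreg_kwargs), n_features_to_select=n_feat_importance)
selector.fit(features, targets)
selected = features.columns[selector.support_]
top_model = LogisticRegression(**logreg_kwargs).fit(features[selected], targets)

importance = pd.DataFrame({"Feature": selected, "Importance": top_model.coef_[0]})
# Sort by absolute coefficient size, then round, then reset the index.
order = importance["Importance"].abs().sort_values(ascending=False).index
importance = importance.loc[order].reset_index(drop=True)
importance["Importance"] = importance["Importance"].round(4)
```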
Useful Resources
INPUTS
The first 4 are similar to the tts function you created in Task 2:
- train_features: a Pandas DataFrame with training features
- test_features: a Pandas DataFrame with test features
- train_targets: a Pandas DataFrame with training targets
- test_targets: a Pandas DataFrame with test targets
- n_feat_importance: an int with how many features to return in the feature importance DataFrame
- logreg_kwargs: a dictionary with keyword arguments that can be passed directly to the scikit-learn Logistic Regression class
OUTPUTS
- A completed ModelMetrics object with a training and test metrics dictionary with each one of the metrics rounded to 4 decimal places
- A scikit-learn Logistic Regression model object fit on the training set
Function Skeleton
def calculate_logistic_regression_metrics(train_features: pd.DataFrame, test_features: pd.DataFrame, train_targets: pd.Series, test_targets: pd.Series, n_feat_importance: int, logreg_kwargs) -> tuple[ModelMetrics, LogisticRegression]:
    model = LogisticRegression()
    train_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    test_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    log_reg_importance = pd.DataFrame()
    log_reg_metrics = ModelMetrics("Logistic Regression", train_metrics, test_metrics, log_reg_importance)
    return log_reg_metrics, model
Example of Feature Importance DataFrame (with n_feat_importance=10)
calculate_decision_tree_metrics
Important Note on Feature Selection: RFE (Recursive Feature Elimination) should ONLY be used for Logistic Regression. Do NOT use RFE for any tree-based models (Decision Tree, Random Forest, or Gradient Boosting). For tree-based models, use their built-in feature importance values instead.
A Decision Tree (DT) is a supervised learning algorithm used for both classification and regression tasks. It works by recursively splitting the data into subsets based on the feature that results in the best separation of classes, typically measured using Gini impurity or entropy. Decision trees are interpretable, as the learned model can be visualized as a flowchart-like structure.
If you are interested in learning more, you can read about the math behind how decision trees work.
For this project, we are more focused on showing you how to apply these models, so you can simply use a scikit-learn DecisionTreeClassifier in your code.
For this task, use scikit-learn's DecisionTreeClassifier class and complete the following subtasks:
- Train a DT model (initialized using the kwargs passed into the function).
- Predict scores for training and test datasets and calculate the 7 metrics listed below for the training and test datasets using predictions from the fit model. (All rounded to 4 decimal places)
  - accuracy
  - recall
  - precision
  - fscore
  - false positive rate (fpr)
  - false negative rate (fnr)
  - Area Under the Curve of Receiver Operating Characteristics Curve (roc_auc)
- Create a Feature Importance DataFrame from the trained model:
  - Do Not Use RFE for feature selection
  - Use the top N (where N is configured via the n_feat_importance input variable) features selected by the built-in method (sorted from biggest to smallest)
  - Make sure you use the same feature and importance column names as ModelMetrics shows in feat_name_col [Feature] and imp_col [Importance]
  - Round the importances to 4 decimal places (do this step after you have sorted by Importance)
  - Reset the index to 0->(N-1). You can do this the same way you did in task1.
NOTE: Make sure you use the predicted probabilities for roc_auc.
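For tree-based models, the built-in importances live on the fitted estimator's feature_importances_ attribute. The sketch below builds a top-N table from it on synthetic data; the data and kwargs values are illustrative assumptions, and how ties between equal importances should be ordered is not specified here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=0)
features = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])
dt_kwargs = {"random_state": 0, "max_depth": 4}  # illustrative values

model = DecisionTreeClassifier(**dt_kwargs).fit(features, y)

n_feat_importance = 5
importance = (
    pd.DataFrame({"Feature": features.columns, "Importance": model.feature_importances_})
    .sort_values("Importance", ascending=False)  # biggest to smallest
    .head(n_feat_importance)                     # keep the top N rows
    .reset_index(drop=True)                      # index becomes 0..N-1
)
# Round only after sorting, as the task description asks.
importance["Importance"] = importance["Importance"].round(4)
```

The same attribute exists on scikit-learn's RandomForestClassifier and GradientBoostingClassifier, so the table-building steps carry over to the other tree-based models.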
Useful Resources
INPUTS
The first 4 are similar to the tts function you created in Task 2:
- train_features: a Pandas DataFrame with training features
- test_features: a Pandas DataFrame with test features
- train_targets: a Pandas DataFrame with training targets
- test_targets: a Pandas DataFrame with test targets
- n_feat_importance: an int with how many features to return in the feature importance DataFrame
- dt_kwargs: a dictionary with keyword arguments that can be passed directly to the scikit-learn DecisionTreeClassifier class
OUTPUTS
- A completed ModelMetrics object with a training and test metrics dictionary with each one of the metrics rounded to 4 decimal places
- A scikit-learn DecisionTreeClassifier model object fit on the training set
Function Skeleton
def calculate_decision_tree_metrics(train_features: pd.DataFrame, test_features: pd.DataFrame, train_targets: pd.Series, test_targets: pd.Series, n_feat_importance: int, dt_kwargs) -> tuple[ModelMetrics, DecisionTreeClassifier]:
    model = DecisionTreeClassifier(**dt_kwargs)
    train_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    test_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0
    }
    dt_importance = pd.DataFrame()
    dt_metrics = ModelMetrics("Decision Tree", train_metrics, test_metrics, dt_importance)
    return dt_metrics, model
Example of Feature Importance DataFrame (with n_feat_importance=5)
calculate_gradient_boosting_metrics
Important Note on Feature Selection: RFE (Recursive Feature Elimination) should ONLY be used for Logistic Regression. Do NOT use RFE for any tree-based models (Decision Tree, Random Forest, or Gradient Boosting). For tree-based models, use their built-in feature importance values instead.
A Gradient Boosted model is more complex than the Naive and Logistic Regression models and similar in structure to the Decision Tree model you trained. A Gradient Boosted model expands on the tree-based model by using its additional trees to predict the errors from the previous tree. For this project, we are more focused on showing you how to apply these models, so you can simply use the scikit-learn Gradient Boosted Model in your code.
For this task use scikit-learn's Gradient Boosting Classifier class and complete the following subtasks:
- Train a Gradient Boosted model (initialized using the kwargs passed into the function)
- Predict scores for training and test datasets and calculate the 7 metrics listed below for the training and test datasets using predictions from the fit model. (All rounded to 4 decimal places)
  - accuracy
  - recall
  - precision
  - fscore
  - false positive rate (fpr)
  - false negative rate (fnr)
  - Area Under the Curve of Receiver Operating Characteristics Curve (roc_auc)
- Create a Feature Importance DataFrame from the trained model:
  - Do Not Use RFE for feature selection
  - Use the top N (where N is configured via the n_feat_importance input variable) features selected by the built-in method (sorted from biggest to smallest)
  - Make sure you use the same feature and importance column names as ModelMetrics shows in feat_name_col [Feature] and imp_col [Importance]
  - Round the importances to 4 decimal places (do this step after you have sorted by Importance)
  - Reset the index to 0->(N-1). You can do this the same way you did in task1.
NOTE: Make sure you use the predicted probabilities for roc_auc.
Refer to the page for details about submitting your work.
Useful Resources
INPUTS
- train_features: a dataset split by a function similar to the tts function you created in task2
- test_features: a dataset split by a function similar to the tts function you created in task2
- train_targets: a dataset split by a function similar to the tts function you created in task2
- test_targets: a dataset split by a function similar to the tts function you created in task2
- n_feat_importance: an int with how many features to return in the feature importance DataFrame
- gb_kwargs: a dictionary with keyword arguments that can be passed directly to the scikit-learn GradientBoostingClassifier class
OUTPUTS
- A completed ModelMetrics object with a training and test metrics dictionary with each one of the metrics rounded to 4 decimal places
- A scikit-learn Gradient Boosted model object fit on the training set
Function Skeleton
def calculate_gradient_boosting_metrics(
    train_features: pd.DataFrame,
    test_features: pd.DataFrame,
    train_targets: pd.Series,
    test_targets: pd.Series,
    n_feat_importance: int,
    gb_kwargs,
) -> tuple[ModelMetrics, GradientBoostingClassifier]:
    model = GradientBoostingClassifier()
    train_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0,
    }
    test_metrics = {
        "accuracy": 0,
        "recall": 0,
        "precision": 0,
        "fscore": 0,
        "fpr": 0,
        "fnr": 0,
        "roc_auc": 0,
    }
    gb_importance = pd.DataFrame()
    gb_metrics = ModelMetrics("Gradient Boosting", train_metrics, test_metrics, gb_importance)
    return gb_metrics, model
Example of Feature Importance DataFrame (with n_feat_importance=10)
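A minimal sketch of how such a DataFrame could be assembled, using synthetic data and n_feat_importance=3 for brevity. The feature names `f0` to `f5`, the model settings, and the variable `importance_df` are made up for illustration. Only the column names `Feature` and `Importance` come from the task description.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data; the column names f0..f5 are invented for this sketch.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
features = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

model = GradientBoostingClassifier(n_estimators=20, random_state=0)
model.fit(features, y)

n_feat_importance = 3
importance_df = (
    pd.DataFrame(
        {"Feature": features.columns, "Importance": model.feature_importances_}
    )
    .sort_values("Importance", ascending=False)  # biggest first
    .head(n_feat_importance)  # keep the top N rows
    .round({"Importance": 4})  # round only after sorting
    .reset_index(drop=True)  # index runs 0..N-1
)
print(importance_df)
```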
Task 5: Model Training and Evaluation (20 points)
Now that you have written functions for different steps of the model-building process, you will put it all together. You will write code that trains a model with hyperparameters you determine (you should do any tuning locally or in a notebook, i.e., don't tune your model in Gradescope, since the autograder will likely time out).
- Refer to the page for details about submitting your work.
Important: Conduct hyperparameter tuning locally or in a separate notebook. Avoid tuning within Gradescope to prevent autograder timeouts.
Develop your own local tests to ensure your code functions correctly before submitting to Gradescope. Do not share these tests with other students.
train_model_return_scores_clamp (ClaMP Dataset)
Instructions (5 points):
This function focuses on training a model using the ClaMP dataset and evaluating its performance on a test set.
- Input:
  - train_df: A Pandas DataFrame containing the ClaMP training data. This includes the 'label' column, which serves as your target variable (0 for benign, 1 for malicious).
  - test_df: A Pandas DataFrame containing the ClaMP test data. The 'label' column is intentionally omitted from this set.
- Model Training:
  - Train a machine learning model using the train_df dataset.
  - You may use any techniques covered in this project.
  - Set a random seed for reproducibility.
  - Perform hyperparameter tuning (if needed) to optimize your model's performance.
    - Tip: putting comments on the ranges you select for hyperparameters will help the graders understand how you chose them.
  - Document your hyperparameter tuning strategy (if any) in the document_hyperparameter_tuning_clamp function. If you used grid search or some other automated search, then running that function should generate the same parameters you used for your submission.
- Prediction:
  - Use your trained model to predict the probability of malware for each row in the test_df.
  - Output these probabilities as values between 0 and 1. A value closer to 0 indicates a lower likelihood of malware, while a value closer to 1 indicates a higher likelihood.
- Output:
  - Return a Pandas DataFrame with two columns:
    - index: The index from the input test_df.
    - prob_label_1: The predicted malware probabilities.
- Evaluation:
  - The autograder will evaluate your predictions using the ROC AUC score.
  - You must achieve a ROC AUC score of 0.9 or higher on the test set to receive full credit.
Local Test Case
For this task we do not provide a local test case. One of the learning objectives of this project is becoming familiar with using and creating Python unit tests. You have used the tests we created for you in tasks 1-4, and now you will practice creating your own.
- Load the training data you were provided in the Task5 folder from the CSV file.
- Split that file into a train df and a test df (a common split is 80% of the data for the train set and 20% for the test set). Review Task 2 and the scikit-learn docs for useful parameters to ensure a balanced split.
- Make a copy of the test df with the label column removed to mimic the environment your model will see in the autograder (AG).
- Run your train_model_return_scores_clamp function on those datasets to get your predictions.
- Run Output Checks:
  - Make sure the number of rows in your predictions matches the expected number of rows (compare with the rows of your test set).
  - Check that the returned DataFrame has exactly the columns ['index', 'prob_label_1']. Hint: DataFrame.columns returns the columns of a pandas DataFrame.
  - Calculate the ROC AUC score between your prob_label_1 and the true class labels (compare with the test set). You can use the thresholds described in the evaluation section above to consider your model "passing" this test.
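The checks above could be organized into a unittest class along the following lines. This is only a sketch under several assumptions: `train_model_stub` is a hypothetical stand-in for your own Task 5 function, the logistic regression inside it is an arbitrary placeholder model, and synthetic data from `make_classification` replaces the CSV you would actually load from the Task5 folder.

```python
import unittest

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def train_model_stub(train_df, test_df):
    """Hypothetical stand-in for a Task 5 training function."""
    model = LogisticRegression(max_iter=1000)
    model.fit(train_df.drop(columns="label"), train_df["label"])
    probs = model.predict_proba(test_df)[:, 1]  # probability of label 1
    return pd.DataFrame({"index": test_df.index, "prob_label_1": probs})


class TestTask5Output(unittest.TestCase):
    def setUp(self):
        # Synthetic, well-separated data stands in for the provided CSV.
        X, y = make_classification(
            n_samples=400, n_features=8, n_clusters_per_class=1,
            class_sep=2.0, flip_y=0, random_state=0,
        )
        df = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])
        df["label"] = y
        train_df, test_df = train_test_split(
            df, test_size=0.2, stratify=df["label"], random_state=0
        )
        self.true_labels = test_df["label"]
        # Drop the label to mimic what the autograder passes in.
        self.test_df = test_df.drop(columns="label")
        self.scores = train_model_stub(train_df, self.test_df)

    def test_row_count(self):
        self.assertEqual(len(self.scores), len(self.test_df))

    def test_columns(self):
        self.assertListEqual(list(self.scores.columns), ["index", "prob_label_1"])

    def test_roc_auc_threshold(self):
        auc = roc_auc_score(self.true_labels, self.scores["prob_label_1"])
        self.assertGreaterEqual(auc, 0.9)
```

Saved in a file such as `test_task5.py` (a hypothetical name), the class can be run with `python -m unittest test_task5`.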
Sample Submission (ClaMP):
Function Skeleton (ClaMP):
import pandas as pd


def document_hyperparameter_tuning_clamp(train_df, test_df):
    """
    Please document the hyperparameter tuning process you used to tune your machine learning model for Task 5.
    You should copy and paste the hyperparameter process you conducted here.
    Place all parameters tuned and values in the hyperparameters dictionary.
    If we run your code and your hyperparameter function, it must generate the same hyperparameters you used for your function.
    You do not need to return anything specific, just document how you did your tuning.
    We will not run it in the autograder, it is just an additional check on our end.
    """
    hyperparameters = {}
    return hyperparameters


def train_model_return_scores_clamp(train_df, test_df) -> pd.DataFrame:
    """
    Trains a model on the ClaMP training data and returns predicted probabilities
    for the test data.

    Args:
        train_df (pd.DataFrame): ClaMP training data with 'label' column.
        test_df (pd.DataFrame): ClaMP test data without 'label' column.

    Returns:
        pd.DataFrame: DataFrame with 'index' and 'prob_label_1' columns.
    """
    # TODO: Implement the model training and prediction logic as described above.
    test_scores = pd.DataFrame()  # Replace with your implementation
    return test_scores
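One way the tuning documentation could look is a small, seeded grid search that returns the chosen parameters. The sketch below is an assumption-laden illustration, not the expected solution: the function name `document_tuning_sketch`, the random forest model, the parameter grid, and the synthetic data are all hypothetical choices.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def document_tuning_sketch(train_df):
    """Hypothetical example of a reproducible grid search."""
    param_grid = {
        "n_estimators": [25, 50],  # small values keep the search fast
        "max_depth": [3, 5],  # shallow trees to limit overfitting
    }
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),  # fixed seed for reproducibility
        param_grid,
        scoring="roc_auc",  # the metric the autograder evaluates
        cv=3,
    )
    search.fit(train_df.drop(columns="label"), train_df["label"])
    return search.best_params_


# Synthetic stand-in for the training data (invented feature names).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
train_df = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])
train_df["label"] = y
hyperparameters = document_tuning_sketch(train_df)
print(hyperparameters)
```

Because both the model seed and the cross-validation folds are fixed, rerunning the function on the same data should reproduce the same dictionary, which is the property the docstring above asks for.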
ClaMP Dataset
- The ClaMP (Classification of Malware with PE Headers) dataset is used for malware classification.
- It is based on the header fields of Portable Executable (PE) files.
- Learn more about PE files:
- ClaMP Dataset GitHub Repository:
- This project uses the ClaMP_Raw-5184.csv file (55 features).
train_model_return_scores_unsw (UNSW-NB15 Dataset)
Instructions (10 points):
This function focuses on training a model using the UNSW-NB15 dataset and evaluating its performance on a test set. It will likely require exploring and understanding the dataset, data preprocessing, model selection, and hyperparameter tuning to achieve full credit.
- Input:
  - train_df: A Pandas DataFrame containing the UNSW-NB15 training data (including the 'label' column).
  - test_df: A Pandas DataFrame containing the UNSW-NB15 test data (without the 'label' column).
- Model Training:
  - Train a machine learning model using the train_df.
  - You can use any techniques from this project.
  - Set a random seed for reproducibility.
  - Document your hyperparameter tuning strategy in the document_hyperparameter_tuning_unsw function. If you used grid search or some other automated search, then running that function should generate the same parameters you used for your submission.
- Prediction:
  - Predict the probability of label=1 for each row in test_df.
  - Output probabilities between 0 and 1, where values closer to 1 indicate a higher likelihood of being label=1.
- Output:
  - Return a Pandas DataFrame with two columns:
    - index: The index from the input test_df.
    - prob_label_1: The predicted probabilities of label=1.
- Evaluation:
  - The autograder will evaluate your predictions using the ROC AUC score.
  - Full credit (10 points) will be given for 0.76 and above, 5 points for 0.75 and above, and 2.5 points for 0.55 and above.
  - Parameter tuning will likely be necessary to achieve higher scores.
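If the data contains non-numeric columns, one common preprocessing approach is to one-hot encode them inside a scikit-learn pipeline before the model. The sketch below assumes that situation. The toy frames and the column names `proto_like` and `bytes_like` are invented for illustration and are not taken from the provided dataset, and the gradient boosting model is just one possible choice.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frames; "proto_like" and "bytes_like" are invented column names.
train_df = pd.DataFrame(
    {
        "proto_like": ["tcp", "udp", "tcp", "udp", "tcp", "udp"],
        "bytes_like": [10, 200, 15, 180, 12, 220],
        "label": [0, 1, 0, 1, 0, 1],
    }
)
test_df = pd.DataFrame({"proto_like": ["tcp", "icmp"], "bytes_like": [11, 210]})

features = train_df.drop(columns="label")
categorical = features.select_dtypes(exclude="number").columns.tolist()

pipeline = Pipeline(
    [
        (
            "encode",
            ColumnTransformer(
                # One-hot encode text columns; unseen categories become all zeros.
                [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
                remainder="passthrough",  # numeric columns pass through unchanged
            ),
        ),
        ("model", GradientBoostingClassifier(random_state=0)),
    ]
)
pipeline.fit(features, train_df["label"])

# Build the two-column output described in the Output section.
test_scores = pd.DataFrame(
    {"index": test_df.index, "prob_label_1": pipeline.predict_proba(test_df)[:, 1]}
)
print(test_scores)
```

Setting `handle_unknown="ignore"` matters here because the test set may contain category values that never appeared in training, such as the `icmp` row in this toy example.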
Local Test Case
For this task we do not provide a local test case. One of the learning objectives of this project is becoming familiar with using and creating Python unit tests. You have used the tests we created for you in tasks 1-4, and now you will practice creating your own.
- Load the training data you were provided in the Task5 folder from the CSV file.
- Split that file into a train df and a test df (a common split is 80% of the data for the train set and 20% for the test set). Review Task 2 and the scikit-learn docs for useful parameters to ensure a balanced split.
- Make a copy of the test df with the label column removed to mimic the environment your model will see in the autograder (AG).
- Run your train_model_return_scores_unsw function on those datasets to get your predictions.
- Run Output Checks:
  - Make sure the number of rows in your predictions matches the expected number of rows (compare with the rows of your test set).
  - Check that the returned DataFrame has exactly the columns ['index', 'prob_label_1']. Hint: DataFrame.columns returns the columns of a pandas DataFrame.
  - Calculate the ROC AUC score between your prob_label_1 and the true class labels (compare with the test set). You can use the thresholds described in the evaluation section above to consider your model "passing" this test.
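For the splitting step, a balanced split usually means keeping the class ratio the same in both parts. A minimal sketch using the `stratify` parameter of scikit-learn's `train_test_split` follows. The toy DataFrame and the names `holdout_df`, `holdout_labels`, and `holdout_features` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 80 rows of label 0 and 20 rows of label 1.
df = pd.DataFrame({"feature": range(100), "label": [0] * 80 + [1] * 20})

train_df, holdout_df = train_test_split(
    df,
    test_size=0.2,  # 80% train, 20% held out
    stratify=df["label"],  # keep the class ratio the same in both parts
    random_state=0,  # fixed seed so the split is repeatable
)

# Keep the true labels aside, then drop them to mimic the autograder input.
holdout_labels = holdout_df["label"]
holdout_features = holdout_df.drop(columns="label")

print(train_df["label"].mean(), holdout_labels.mean())
```

Without `stratify`, a random split of imbalanced data can leave the held-out set with too few positive rows for the ROC AUC estimate to be stable.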
Sample Submission (UNSW-NB15):
Function Skeleton (UNSW-NB15):
import pandas as pd


def document_hyperparameter_tuning_unsw(train_df, test_df):
    """
    Please document the hyperparameter tuning process you used to tune your machine learning model for Task 5.
    You should copy and paste the hyperparameter process you conducted here.
    Place all parameters tuned and values in the hyperparameters dictionary.
    If we run your code and your hyperparameter function, it must generate the same hyperparameters you used for your function.
    You do not need to return anything specific, just document how you did your tuning.
    We will not run it in the autograder, it is just an additional check on our end.
    """
    hyperparameters = {}
    return hyperparameters


def train_model_return_scores_unsw(train_df, test_df) -> pd.DataFrame:
    """
    Trains a model on the UNSW-NB15 training data and returns predicted
    probabilities for the test data.

    Args:
        train_df (pd.DataFrame): UNSW-NB15 training data with 'label' column.
        test_df (pd.DataFrame): UNSW-NB15 test data without 'label' column.

    Returns:
        pd.DataFrame: DataFrame with 'index' and 'prob_label_1' columns.
    """
    # TODO: Implement the model training and prediction logic as described above.
    test_scores = pd.DataFrame()  # Replace with your implementation
    return test_scores
UNSW-NB15 Dataset
- The UNSW-NB15 dataset was created using the IXIA PerfectStorm tool to simulate real-world network traffic and attack scenarios.
- Dataset Website:
- Note: This project does not use all features or classes from the original UNSW-NB15 dataset.
Additional Resources (optional)
train_model_return_scores_phiusiil (PhiUSIIL Phishing URL Dataset)
Instructions (5 points):
This function focuses on training a model using the PhiUSIIL Phishing URL dataset and evaluating its performance on a test set. It will likely require exploring and understanding the dataset, feature creation, data preprocessing, model selection, and hyperparameter tuning to achieve full credit.
- Input:
  - train_df: A Pandas DataFrame containing the PhiUSIIL Phishing URL training data (including the 'label' column).
  - test_df: A Pandas DataFrame containing the PhiUSIIL Phishing URL test data (without the 'label' column).
- Model Training:
  - You will need to create your own feature engineering functions to pass this task, since you are only given a single text column. Review the resources for tips on generating features from a URL.
  - Train a machine learning model using the train_df.
  - You can use any techniques from this project.
  - Set a random seed for reproducibility.
  - Document your hyperparameter tuning strategy (if any) in the document_hyperparameter_tuning_phiusiil function. If you used grid search or some other automated search, then running that function should generate the same parameters you used for your submission.
- Prediction:
  - Predict the probability of label=1 for each row in test_df.
  - Output probabilities between 0 and 1, where values closer to 1 indicate a higher likelihood of being label=1.
- Output:
  - Return a Pandas DataFrame with two columns:
    - index: The index from the input test_df.
    - prob_label_1: The predicted probabilities of label=1.
- Evaluation:
  - The autograder will evaluate your predictions using the ROC AUC score.
  - Full credit (5 points) will be given for 0.85 and above, 2.5 points for 0.8 and above, and 1.25 points for 0.55 and above.
  - Parameter tuning will likely be necessary to achieve higher scores.
Local Test Case
For this task we do not provide a local test case. One of the learning objectives of this project is becoming familiar with using and creating Python unit tests. You have used the tests we created for you in tasks 1-4, and now you will practice creating your own.
- Load the training data you were provided in the Task5 folder from the CSV file.
- Split that file into a train df and a test df (a common split is 80% of the data for the train set and 20% for the test set). Review Task 2 and the scikit-learn docs for useful parameters to ensure a balanced split.
- Make a copy of the test df with the label column removed to mimic the environment your model will see in the autograder (AG).
- Run your train_model_return_scores_phiusiil function on those datasets to get your predictions.
- Run Output Checks:
  - Make sure the number of rows in your predictions matches the expected number of rows (compare with the rows of your test set).
  - Check that the returned DataFrame has exactly the columns ['index', 'prob_label_1']. Hint: DataFrame.columns returns the columns of a pandas DataFrame.
  - Calculate the ROC AUC score between your prob_label_1 and the true class labels (compare with the test set). You can use the thresholds described in the evaluation section above to consider your model "passing" this test.
Sample Submission (PhiUSIIL):
Function Skeleton (PhiUSIIL):
import pandas as pd


def document_hyperparameter_tuning_phiusiil(train_df, test_df):
    """
    Please document the hyperparameter tuning process you used to tune your machine learning model for Task 5.
    You should copy and paste the hyperparameter process you conducted here.
    Place all parameters tuned and values in the hyperparameters dictionary.
    If we run your code and your hyperparameter function, it must generate the same hyperparameters you used for your function.
    You do not need to return anything specific, just document how you did your tuning.
    We will not run it in the autograder, it is just an additional check on our end.
    """
    hyperparameters = {}
    return hyperparameters


def train_model_return_scores_phiusiil(train_df, test_df) -> pd.DataFrame:
    """
    Trains a model on the PhiUSIIL Phishing URL training data and returns predicted
    probabilities for the test data.

    Args:
        train_df (pd.DataFrame): PhiUSIIL Phishing URL training data with 'label' column.
        test_df (pd.DataFrame): PhiUSIIL Phishing URL test data without 'label' column.

    Returns:
        pd.DataFrame: DataFrame with 'index' and 'prob_label_1' columns.
    """
    # TODO: Implement the model training and prediction logic as described above.
    test_scores = pd.DataFrame()  # Replace with your implementation
    return test_scores
PhiUSIIL Phishing URL Dataset
- The PhiUSIIL Phishing URL dataset was created from 235,795 legitimate and phishing URLs. For this class we give you only the URLs and want you to engineer your own predictive features to use in an ML model. This task is open-ended: you should do your own research to determine what features to generate from a URL. You can try to mimic the features the authors created or create your own. The only limitation is that your features must be created from the URL alone (without internet access), so more complex strategies that try to scrape the URL will not work in the autograder.
- Dataset Website:
- Note: This project does not use all features or classes from the original PhiUSIIL Phishing URL dataset.
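To make the idea of URL-only features concrete, here is a small sketch that derives a handful of numeric features using just the Python standard library. The function name `url_features` and the specific features chosen are hypothetical examples, not the features used by the dataset authors or expected by the autograder.

```python
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Derive a few numeric features from the URL string alone (no network access)."""
    # Add a scheme when missing so urlparse can locate the hostname.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        "has_at_symbol": int("@" in url),
        "is_https": int(parsed.scheme == "https"),
        "path_depth": len([part for part in parsed.path.split("/") if part]),
    }


example = url_features("https://www.example.com/login/verify?id=123")
print(example)
```

Applied to a whole column, the returned dictionaries can be collected into a DataFrame of numeric features that a classifier can consume.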
Additional Resources (optional)
Deliverables
- Local Testing: While it is not a deliverable that we will check in your Gradescope submission, we strongly encourage you to thoroughly test your code locally using the provided datasets. Create your own test sets by splitting the training data as described above.
- Gradescope Submission: Once you are confident in your solution, submit your task5.py file (containing all functions) to Gradescope.











