KTamas03/CryptoClustering

CryptoClustering

Module 19 Challenge - Unsupervised Machine Learning

In this scenario, I used Python and unsupervised machine learning to predict whether cryptocurrencies are influenced by changes in prices over different time periods. I performed all the work within a Jupyter Notebook.

Repository Folders and Contents:

  • Resources:
    • crypto_market_data.csv
  • Crypto_Clustering.ipynb

Table of Contents

About

Step 1.

First, I imported the data into a pandas DataFrame and ran summary statistics. Then, I created a line chart to visualize the data. The cryptocurrency with the highest price change percentage over both the 200-day and one-year periods was 'ethlend', followed by 'celsius-degree-token'. 'theta-token' and 'havven' were also notable standouts.
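A minimal sketch of the import and summary step. The two-row sample below is hypothetical and stands in for Resources/crypto_market_data.csv; it assumes the real file has a 'coin_id' column and price-change columns named like the two shown. The notebook's line chart was drawn with hvPlot and is left out here so the sketch needs only pandas.

```python
import io

import pandas as pd

# Hypothetical sample standing in for Resources/crypto_market_data.csv.
# The coin names and values are made up purely for illustration.
sample_csv = io.StringIO(
    "coin_id,price_change_percentage_24h,price_change_percentage_7d\n"
    "coin-a,1.0,7.5\n"
    "coin-b,-0.5,10.0\n"
)

# With the real file this would read "Resources/crypto_market_data.csv" instead.
df_market_data = pd.read_csv(sample_csv, index_col="coin_id")

# Summary statistics (count, mean, std, min, quartiles, max) per column.
summary = df_market_data.describe()

# For each column, the coin with the largest price change percentage.
top_coins = df_market_data.idxmax()
```

Calling idxmax() on the full dataset is one quick way to confirm which coin leads each time period before looking at the chart.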

Step 2.

Next, I standardized the data with scikit-learn's StandardScaler so that every feature in the dataframe carried equal weight. I then created a new scaled dataframe with the cryptocurrency names ('coin_id') as the index.
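The scaling step can be sketched as follows. The four-coin dataframe is made-up illustration data, and the variable names are assumptions rather than the notebook's own. StandardScaler rescales each column to zero mean and unit variance.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the market data; values are illustrative only.
df_market_data = pd.DataFrame(
    {
        "price_change_percentage_24h": [1.0, -0.5, 2.0, 0.5],
        "price_change_percentage_7d": [7.5, 10.0, -3.0, 4.0],
    },
    index=pd.Index(["coin-a", "coin-b", "coin-c", "coin-d"], name="coin_id"),
)

# fit_transform returns a NumPy array, so the column names and the
# 'coin_id' index are re-attached when building the scaled dataframe.
scaled_values = StandardScaler().fit_transform(df_market_data)
df_market_data_scaled = pd.DataFrame(
    scaled_values,
    columns=df_market_data.columns,
    index=df_market_data.index,
)
```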

Step 3.

Afterwards, I determined the optimal value for 'k'. I calculated inertia values for 'k' values ranging from 1 to 10 using the K-Means model. Plotting an elbow curve showed that the ideal number of clusters for the scaled dataset was 4.
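A sketch of the elbow calculation, run on synthetic points rather than the scaled market data. The blob layout, the n_init value and random_state=1 are assumptions chosen for illustration, not settings confirmed by the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Fit one K-Means model per candidate k and record its inertia
# (the sum of squared distances from each point to its cluster centre).
k_values = list(range(1, 11))
inertia = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=1)
    model.fit(points)
    inertia.append(model.inertia_)

df_elbow = pd.DataFrame({"k": k_values, "inertia": inertia})
```

Plotting 'inertia' against 'k' gives the elbow curve; the elbow is the point after which adding clusters reduces inertia only slightly.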

Step 4.

Using the value of 'k' found from the elbow curve, I predicted the cluster assignment for each cryptocurrency and created a scatter plot of 'price_change_percentage_24h' versus 'price_change_percentage_7d'.
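The cluster prediction can be sketched like this. The data is synthetic (four tight groups of five hypothetical coins), and the 'cluster' column name and model settings are assumptions. The commented line shows roughly how a scatter plot could be drawn with hvPlot if it is installed.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic scaled data: four tight groups of five hypothetical coins.
rng = np.random.default_rng(1)
centers = np.array([[-2.0, -2.0], [-2.0, 2.0], [2.0, -2.0], [2.0, 2.0]])
values = np.vstack([c + rng.normal(scale=0.1, size=(5, 2)) for c in centers])
df_scaled = pd.DataFrame(
    values,
    columns=["price_change_percentage_24h", "price_change_percentage_7d"],
    index=pd.Index([f"coin-{i}" for i in range(20)], name="coin_id"),
)

# Fit K-Means with k=4 and store each coin's cluster label on a copy,
# so the scaled dataframe itself stays unchanged.
model = KMeans(n_clusters=4, n_init=10, random_state=1)
df_clusters = df_scaled.copy()
df_clusters["cluster"] = model.fit_predict(df_scaled)

# With `import hvplot.pandas` available, a plot along these lines is possible:
# df_clusters.hvplot.scatter(x="price_change_percentage_24h",
#                            y="price_change_percentage_7d", by="cluster")
```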

Step 5.

I then optimized the clusters using Principal Component Analysis (PCA), reducing the scaled data to 3 components. I also set the random seed to 1 so that the results are reproducible. The three principal components together explained 88.94% (0.8894) of the total variance, so most of the information in the scaled data is retained.
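A sketch of the PCA step on random stand-in data (30 hypothetical coins, 5 features), so the explained variance here will not match the 88.94% reported above. Note that in scikit-learn, random_state only affects PCA when a randomized or ARPACK solver is used; with the exact solver the result is deterministic either way.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Random stand-in for the scaled market data.
rng = np.random.default_rng(2)
scaled = rng.normal(size=(30, 5))

# Reduce to three principal components.
pca = PCA(n_components=3, random_state=1)
components = pca.fit_transform(scaled)

# Fraction of the total variance captured by each component, and their sum.
explained = pca.explained_variance_ratio_
total_explained = explained.sum()

df_pca = pd.DataFrame(components, columns=["PC1", "PC2", "PC3"])
```

On the real data, total_explained is the figure to compare against the reported 0.8894.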

Step 6.

Subsequently, I generated a new dataframe containing the PCA data and determined the optimal value for 'k' using the same elbow-curve approach. The ideal number of clusters for the PCA dataset was also 4.

Step 7.

Once again relying on the elbow curve, I predicted the cluster assignments for each cryptocurrency based on the PCA data. I then generated a scatter plot of Principal Component 1 (PC1) versus Principal Component 2 (PC2).

Step 8.

Finally, I generated a composite plot displaying the elbow curves for both the scaled data and the PCA data. I also created a second composite plot with scatter plots showing the clusters formed from the scaled data and from the PCA data. In summary, the optimal number of clusters was the same for both the scaled data and the PCA data: 4 clusters.

Furthermore, visual inspection of the results showed that clustering the data with K-Means on fewer features produced more distinct clusters on the scatter plot. PCA does this by condensing the original features into a small number of components that retain most of the variance in the original dataset. The second scatter plot, based on the PCA data, clearly shows four distinct clusters, with 'ethlend' marked in orange and 'celsius-degree-token' in yellow. These two cryptocurrencies had significantly higher percentage price changes at the 200-day and 1-year marks than all other cryptocurrencies, so isolating them in their own clusters makes sense.

Resource File I Used:

  • crypto_market_data.csv

My Jupyter Notebook Python Script:

  • Crypto_Clustering.ipynb

Tools/Libraries I Imported:

  • pandas library: for data manipulation and analysis
  • hvplot.pandas library: to create plots
  • sklearn.cluster: to create clusters using k-means
  • sklearn.decomposition: for principal component analysis (PCA)
  • sklearn.preprocessing: used for normalising the data

Getting Started

Programs/software I used:

  • Jupyter Notebook: a Python programming tool, used for data manipulation and consolidation.

To activate dev environment and open Jupyter Notebook:

  • Open Anaconda Prompt
  • Activate the dev environment: type 'conda activate dev'
  • Navigate to the folder where the repository is saved on your local drive
  • Open Jupyter Notebook: type 'jupyter notebook'

Installing

Install scikit-learn library

Install hvPlot

  • After activating the dev environment (see Getting Started), in terminal type 'conda install -c pyviz hvplot'.

Contributing
