Working with SageMaker - Part I

Hey ya everyone!

Before I move ahead, I just wanted to wish you all a safe stay at home and hope this pandemic is over soon. I miss India, I miss my parents, and I badly want to get a COVID-19 vaccine shot.

Anyway.

I had a chance to go through the Machine Learning Engineer Nanodegree from Udacity and, to be really honest, it's awesome. Udacity never fails to make learning addictive for me.

Let’s start with the whats and hows.

Amazon SageMaker is a really cool platform where you can do pretty much everything in a machine learning workflow: data processing, exploration, model building, hyperparameter tuning, deployment, and updating models. It offers four services:

  • Ground Truth – Need a large dataset labeled? No worries: distribute a chunk of your data to people for labeling, then hand it to an active learning agent that labels the rest of the data on its own.
  • Notebook – Spin up your Jupyter notebook instances and attach your Git repositories to them.
  • Training – Select machine learning algorithms, run training jobs, and tune those hyperparameters.
  • Inference – Configure endpoints and deploy your models.

Now, before I begin, I am assuming you have an AWS account with sufficient credit (pretty bad assumption, huh?).

Ok, so first of all, go to console.aws.amazon.com, type Amazon SageMaker into the AWS services search bar, and select it. You should land on the SageMaker console.

You can see the tools on the left. Select Notebook instances and you can pretty much start building things. Click Create notebook instance, give it a name, and pick a notebook instance type based on your requirements (the AWS documentation has more information on the available types). In the Permissions and encryption section, if the instance is just for your own use, go to Create a new role -> None (in the access option). There is also an option to add Git repositories, which pre-loads the instance with the folders and scripts you need. Then create the instance.
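
If you would rather script this step than click through the console, the same form fields map onto the boto3 create_notebook_instance call. This is only a minimal sketch and not what I did above: the instance name, the ml.t2.medium type, and the role ARN are placeholder assumptions, and the two helper functions are names I made up for illustration.

```python
def build_notebook_request(name, instance_type, role_arn, repo_url=None):
    """Collect the console form fields into boto3 keyword arguments."""
    params = {
        "NotebookInstanceName": name,
        "InstanceType": instance_type,
        "RoleArn": role_arn,
    }
    # Optional: pre-load the instance with a Git repository.
    if repo_url is not None:
        params["DefaultCodeRepository"] = repo_url
    return params


def create_notebook(params):
    """Send the request; this needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helper above works without it

    client = boto3.client("sagemaker")
    return client.create_notebook_instance(**params)


request = build_notebook_request(
    name="my-first-notebook",  # placeholder name
    instance_type="ml.t2.medium",  # placeholder instance type
    role_arn="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder ARN
)
print(request)
```

Calling create_notebook(request) would actually create (and start billing for) the instance, so the sketch stops at printing the request.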

After some time you should see the status of your instance change from Pending to InService. You can then choose between Jupyter and JupyterLab. JupyterLab is basically a more organized, IDE-like interface, similar to PyCharm.

Once you have opened it, you can select a kernel for running your scripts. I prefer conda_mxnet_p36: mxnet is the MXNet environment that SageMaker provides, and p36 indicates Python 3.6. You can also use a terminal, which is accessible via New -> Terminal (the last option; yes, I missed it). Make sure to cd SageMaker/ before you access or clone other repos.

First, it’s always data:

To show you how you can import data, store it locally, and upload it to S3 (the AWS storage service), here is one minimal example:

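import os
import numpy as np
import pandas as pd
import sklearn.model_selection
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version
from sklearn.datasets import load_boston
import sagemaker
from sagemaker import get_execution_role

# The session object wraps the AWS calls used below, such as uploading to S3
session = sagemaker.Session()

# Load the Boston housing data into feature and target DataFrames
boston = load_boston()
X_bos_pd = pd.DataFrame(boston.data, columns=boston.feature_names)
Y_bos_pd = pd.DataFrame(boston.target)

# Hold out a third of the rows for testing
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X_bos_pd, Y_bos_pd, test_size=0.33)

# Create the local data directory if it does not exist yet
data_dir = '../data/boston'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

# Save without header or index; the training file has the target as its first column
X_test.to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)
pd.concat([Y_train, X_train], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)

# Key prefix under which the files are stored in the bucket
prefix = 'boston-xgboost-LL'

# Upload both files to S3; each call returns the S3 location of the uploaded file
test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)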
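
A quick note on why train.csv is written with the target first and without a header: that is the CSV layout SageMaker's built-in XGBoost algorithm expects for training data, and the prefix above hints that this is where the series is heading. Here is a small sketch of that layout using only the standard library and made-up numbers; the helper names and the toy values are mine, purely for illustration.

```python
import csv
import os
import tempfile


def write_training_csv(path, labels, features):
    """Write rows as label, feature_1, ..., feature_n with no header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for label, row in zip(labels, features):
            writer.writerow([label] + list(row))


def read_training_csv(path):
    """Read the file back, splitting the first column off as the label."""
    labels, features = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            labels.append(float(row[0]))
            features.append([float(value) for value in row[1:]])
    return labels, features


# Made-up toy data: two rows with three features each.
toy_labels = [24.0, 21.6]
toy_features = [[0.1, 18.0, 2.3], [0.2, 0.0, 7.1]]

tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "train.csv")
write_training_csv(csv_path, toy_labels, toy_features)

read_labels, read_features = read_training_csv(csv_path)
print(read_labels)
print(read_features)
```

The test file has no target column at all, which is why test.csv above is written from X_test alone.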

SageMaker offers various APIs to make the ML process easy. To upload data to S3, you first have to save it locally: create a directory with os.makedirs(path_to_directory), write your files there, and check your csv/txt files by navigating to them in Jupyter. Once that is done, the session object's upload_data method uploads the data; since no bucket is passed here, the files go to the session's default bucket. You can check your data in the bucket by going to AWS services, typing ‘S3’ in the search bar, and clicking on it. You should see your bucket, and on clicking it, the data you intend to use should be visible under the prefix you chose.
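
To make the bucket and prefix relationship concrete: for a single file, upload_data returns an S3 URI made of the bucket name, your key prefix, and the file name. The helper below just rebuilds that string locally so you know where to look in the S3 console. The function name and the example bucket name are my own placeholders, and nothing here talks to AWS.

```python
import os


def expected_s3_uri(bucket, key_prefix, local_path):
    """Rebuild the URI that upload_data reports for a single uploaded file."""
    filename = os.path.basename(local_path)
    return "s3://{}/{}/{}".format(bucket, key_prefix, filename)


# Placeholder bucket name; in a notebook, session.default_bucket() gives the real one.
bucket = "sagemaker-us-east-1-123456789012"
prefix = "boston-xgboost-LL"

print(expected_s3_uri(bucket, prefix, "../data/boston/train.csv"))
# prints: s3://sagemaker-us-east-1-123456789012/boston-xgboost-LL/train.csv
```

In the notebook itself, printing train_location and test_location shows the real locations.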

In the upcoming articles, I plan to cover the training and deployment process in SageMaker. Cheers!

