Machine Learning 101 – Top 200 AWS and Google Certified Machine Learning Specialty Questions and Answers Dumps

AWS Machine Learning Specialty Exam Prep MLS-C01


What are the Top 200 AWS and Google Certified Machine Learning Specialty Questions and Answers Dumps?

This blog is the best way to prepare for your upcoming AWS Certified Machine Learning Specialty and Google Certified Professional Machine Learning Engineer exams. With over 100 questions and answers, it provides quizzes that are very similar to the real exam, with the option to show and hide answers. It also includes machine learning interview questions with detailed answers, as well as cheat sheets and illustrations, so you can make sure you are well prepared for your AWS Certified Machine Learning Specialty exam.

2023 AWS Certified Machine Learning Specialty (MLS-C01) Practice Exams

The typical Google Machine Learning Engineer salary is $147,218. Machine Learning Engineer salaries at Google can range from $110,000 – $152,183.

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

  • By the end of 2020, 85% of customer interactions will be handled without a human (Call Center, Chatbot, etc…)
  • 61% of marketers say artificial intelligence is the most important aspect of their data strategy.
  • 80% of business and tech leaders say AI already boosts productivity (Robotic Process Automation, Power Automate, etc..)
  • Current AI technology can boost business productivity by up to 40%

AWS Machine Learning Certification Specialty Exam Prep for iOS, Android, and Windows 10/11

AWS Machine Learning Specialty Exam Prep MLS-C01

GCP Professional Machine Learning Engineer for iOS, Android, and Windows 10/11

Quizzes, Practice Exams: Framing, Architecting, Designing, Developing ML Problems & Solutions, ML Jobs Interview Q&A

GCP Professional Machine Learning Engineer

 



Azure AI Fundamentals AI-900 Exam Prep App for iOS, Android, and Windows 10/11

Basics and Advanced Machine Learning Quizzes on Azure, Azure Machine Learning Job Interview Questions and Answers, ML Cheat Sheets

Azure AI Fundamentals AI-900 Exam Prep

Machine Learning For Dummies App for iOS, Android, and Windows 10/11

Use this app to learn about machine learning and elevate your brain with machine learning quizzes, cheat sheets, and ML job interview questions and answers, updated daily.

Machine Learning For Dummies

What does a Professional Machine Learning Engineer do?

A Professional Machine Learning Engineer designs, builds, and productionizes ML models to solve business challenges using Google Cloud technologies and knowledge of proven ML models and techniques. The ML Engineer collaborates closely with other job roles to ensure the long-term success of models, and should be proficient in all aspects of model architecture, data pipeline interaction, and metrics interpretation. The ML Engineer needs familiarity with application development, infrastructure management, data engineering, and security. Through an understanding of training, retraining, deploying, scheduling, monitoring, and improving models, the ML Engineer designs and creates scalable solutions for optimal performance.



The AWS Certified Machine Learning – Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate’s ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.

This blog covers Machine Learning 101, Top 20 AWS Certified Machine Learning Specialty Questions and Answers, Top 20 Google Professional Machine Learning Engineer Sample Questions, Machine Learning Quizzes, Machine Learning Q&A, Top 10 Machine Learning Algorithms, Machine Learning Latest Hot News, and Machine Learning Demos (e.g., TensorFlow demos).

Question1: A machine learning team has several large CSV datasets in Amazon S3. Historically, models built with the Amazon SageMaker Linear Learner algorithm have taken hours to train on similar-sized datasets. The team’s leaders need to accelerate the training process. What can a machine learning specialist do to address this concern?

A) Use Amazon SageMaker Pipe mode.
B) Use Amazon Machine Learning to train the models.
C) Use Amazon Kinesis to stream the data to Amazon SageMaker.
D) Use AWS Glue to transform the CSV dataset to the JSON format.
ANSWER1:

A

Notes/Hint1:


Amazon SageMaker Pipe mode streams the data directly to the container, which improves the performance of training jobs. (Refer to this link for supporting information.) In Pipe mode, your training job streams data directly from Amazon S3. Streaming can provide faster start times for training jobs and better throughput. With Pipe mode, you also reduce the size of the Amazon EBS volumes for your training instances. B would not apply in this scenario. C is a streaming ingestion solution, but is not applicable in this scenario. D transforms the data structure.

Reference1: Amazon SageMaker
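As a minimal sketch of the chosen answer (assuming the SageMaker Python SDK; the role ARN, S3 paths, and hyperparameters below are placeholders, not part of the question), Pipe mode is requested through the estimator's input_mode parameter:

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("linear-learner", session.boto_region_name),
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",      # stream training data from S3 instead of copying it to EBS first
    sagemaker_session=session,
)
estimator.set_hyperparameters(predictor_type="binary_classifier")
estimator.fit({"train": TrainingInput("s3://my-bucket/train/", content_type="text/csv")})
```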

Question 2) A local university wants to track cars in a parking lot to determine which students are parking in the lot. The university wants to ingest video of the cars parking in near-real time, use machine learning to identify license plates, and store that data in an AWS data store. Which solution meets these requirements with the LEAST amount of development effort?

A) Use Amazon Kinesis Data Streams to ingest the video in near-real time, use the Kinesis Data Streams consumer integrated with Amazon Rekognition Video to process the license plate information, and then store results in DynamoDB.

B) Use Amazon Kinesis Video Streams to ingest the videos in near-real time, use the Kinesis Video Streams integration with Amazon Rekognition Video to identify the license plate information, and then store the results in DynamoDB.

C) Use Amazon Kinesis Data Streams to ingest videos in near-real time, call Amazon Rekognition to identify license plate information, and then store results in DynamoDB.


D) Use Amazon Kinesis Firehose to ingest the video in near-real time and output the results to S3. Set up a Lambda function that triggers when a new video is PUT onto S3, send the results to Amazon Rekognition to identify license plate information, and then store the results in DynamoDB.

Answer 2)

B

Notes/Hint2)

Kinesis Video Streams is used to stream videos in near-real time. Amazon Rekognition Video uses Amazon Kinesis Video Streams to receive and process a video stream. After the videos have been processed by Rekognition, the results can be stored in DynamoDB.

Reference: Kinesis Video Streams
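For illustration only, here is a minimal boto3 sketch of the wiring the answer describes: a Kinesis Video Stream feeding an Amazon Rekognition Video stream processor. All names and ARNs are placeholders, and the stream processor settings shown (face search) are only the documented example of the API shape; a license plate workflow would configure its analysis differently.

```python
import boto3

kvs = boto3.client("kinesisvideo")
rekognition = boto3.client("rekognition")

# Placeholder video stream for the parking-lot cameras
kvs.create_stream(StreamName="parking-lot-video", DataRetentionInHours=24)

# Stream processor that reads from Kinesis Video Streams and writes results
# to a Kinesis Data Stream (ARNs below are placeholders)
rekognition.create_stream_processor(
    Name="parking-lot-processor",
    Input={"KinesisVideoStream": {"Arn": "arn:aws:kinesisvideo:us-east-1:123456789012:stream/parking-lot-video/1"}},
    Output={"KinesisDataStream": {"Arn": "arn:aws:kinesis:us-east-1:123456789012:stream/plate-results"}},
    RoleArn="arn:aws:iam::123456789012:role/RekognitionKvsRole",
    Settings={"FaceSearch": {"CollectionId": "example-collection"}},  # illustrative settings only
)
rekognition.start_stream_processor(Name="parking-lot-processor")
# A downstream consumer (for example, a Lambda function) would then write detections to DynamoDB.
```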

Question 3) A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences:

1. Please call the number below.
2. Please do not call us.
What are the dimensions of the tf–idf matrix?
A) (2, 16)
B) (2, 8)
C) (2, 10)
D) (8, 10)

ANSWER3:

A

Notes/Hint3:

There are 2 sentences, 8 unique unigrams, and 8 unique bigrams, so the result is (2, 16). The sentences are "Please call the number below" and "Please do not call us." The unique unigrams are "Please," "call," "the," "number," "below," "do," "not," and "us." The unique bigrams are "Please call," "call the," "the number," "number below," "Please do," "do not," "not call," and "call us." The tf–idf vectorizer is described at this link.


Reference3: tf-idf vectorizer
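You can verify the (2, 16) answer with scikit-learn's TfidfVectorizer (the question does not name a library; this is just one way to check the counts):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Please call the number below.", "Please do not call us."]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
matrix = vectorizer.fit_transform(corpus)

print(matrix.shape)                    # (2, 16): 2 documents, 8 unigrams + 8 bigrams
print(sorted(vectorizer.vocabulary_))  # the 16 n-gram features
```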

Question 4: A company is setting up a system to manage all of the datasets it stores in Amazon S3. The company would like to automate running transformation jobs on the data and maintaining a catalog of the metadata concerning the datasets. The solution should require the least amount of setup and maintenance. Which solution will allow the company to achieve its goals? 

A) Create an Amazon EMR cluster with Apache Hive installed. Then, create a Hive metastore and a script to run transformation jobs on a schedule.
B) Create an AWS Glue crawler to populate the AWS Glue Data Catalog. Then, author an AWS Glue ETL job, and set up a schedule for data transformation jobs.
C) Create an Amazon EMR cluster with Apache Spark installed. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
D) Create an AWS Data Pipeline that transforms the data. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
 

ANSWER4:

B

Notes/Hint4:

AWS Glue is the correct answer because this option requires the least amount of setup and maintenance since it is serverless, and it does not require management of the infrastructure. Refer to this link for supporting information. A, C, and D are all solutions that can solve the problem, but require more steps for configuration, and require higher operational overhead to run and maintain.
Reference4:  Glue
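A minimal boto3 sketch (placeholders throughout) of the Glue pieces the answer describes: a crawler to populate the Data Catalog plus a scheduled ETL job.

```python
import boto3

glue = boto3.client("glue")

# Crawler that catalogs the datasets stored in S3 (bucket and role are placeholders)
glue.create_crawler(
    Name="datasets-crawler",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    DatabaseName="datasets_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/datasets/"}]},
)
glue.start_crawler(Name="datasets-crawler")

# Serverless ETL job plus a schedule to run it
glue.create_job(
    Name="transform-datasets",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/transform.py"},
)
glue.create_trigger(
    Name="nightly-transform",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",      # run daily at 02:00 UTC
    Actions=[{"JobName": "transform-datasets"}],
    StartOnCreation=True,
)
```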

Question 5) Which service in the Kinesis family allows you to easily load streaming data into data stores and analytics tools?

A) Kinesis Firehose
B) Kinesis Streams
C) Kinesis Data Analytics
D) Kinesis Video Streams
 

ANSWER5:

A

Notes/Hint5:

Kinesis Firehose is perfect for streaming data into AWS and sending it directly to its final destination: places like S3, Redshift, Elasticsearch, and Splunk instances.

Reference 5): Kinesis Firehose
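As a rough sketch of the "load streaming data into a data store" pattern (ARNs and names are placeholders), a Firehose delivery stream that lands records in S3 can be created and written to with boto3:

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/FirehoseRole",  # placeholder role
        "BucketARN": "arn:aws:s3:::my-analytics-bucket",           # placeholder bucket
    },
)

# Send a record; Firehose buffers and delivers it to S3
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": b'{"event": "page_view"}\n'},
)
```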

Question 6) A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the data scientist do to improve the training process? 
A) Increase the learning rate. Keep the batch size the same.
B) Reduce the batch size. Decrease the learning rate.
C) Keep the batch size the same. Decrease the learning rate.
D) Do not change the learning rate. Increase the batch size.
 
Answer  6)
B
 

Notes 6)

It is most likely that the loss function is very curvy and has multiple local minima where the training is getting stuck. Decreasing the batch size would help the data scientist stochastically get out of the local minima saddles. Decreasing the learning rate would prevent overshooting the global loss function minimum. Refer to the paper at this link for an explanation.
Reference 6) : Here
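The question does not name a framework, but as a rough sketch of the recommendation (smaller batch size, lower learning rate), here is what the change might look like in Keras on synthetic data:

```python
import numpy as np
import tensorflow as tf

# Synthetic data standing in for the real training set
X = np.random.rand(256, 20).astype("float32")
y = np.random.rand(256, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # lower learning rate to avoid overshooting
    loss="mse",
)
model.fit(X, y, batch_size=16, epochs=5)  # smaller batches add gradient noise that helps escape local minima
```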

Question 7) Your organization has a standalone JavaScript (Node.js) application that streams data into AWS using Kinesis Data Streams. You notice that it uses the Kinesis API (AWS SDK) instead of the Kinesis Producer Library (KPL). What might be the reasoning behind this?
A) The Kinesis API (AWS SDK) provides greater functionality over the Kinesis Producer Library.
B) The Kinesis API (AWS SDK) runs faster in JavaScript applications than the Kinesis Producer Library.
C) The Kinesis Producer Library must be installed as a Java application to use with Kinesis Data Streams.
D) The Kinesis Producer Library cannot be integrated with a JavaScript application because of its asynchronous architecture.
Answer 7)
C
Notes/Hint7:
The KPL must be installed as a Java application before it can be used with your Kinesis Data Streams. There are ways to process KPL-serialized data within AWS Lambda, in Java, Node.js, and Python, but none of the answer options mention Lambda.
Reference 7) KPL
 
 
Question 8) A data scientist is evaluating different binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result. The models should be evaluated based on the following criteria: 
1) Must have a recall rate of at least 80%
2) Must have a false positive rate of 10% or less
3) Must minimize business costs
After creating each binary classification model, the data scientist generates the corresponding confusion matrix. Which confusion matrix represents the model that satisfies the requirements?
A) TN = 91, FP = 9, FN = 22, TP = 78
B) TN = 99, FP = 1, FN = 21, TP = 79
C) TN = 96, FP = 4, FN = 10, TP = 90
D) TN = 98, FP = 2, FN = 18, TP = 82
 
Answer 8): 
D
 

Notes/Hint 8)


The following values are required for each option: Recall = TP / (TP + FN), False Positive Rate (FPR) = FP / (FP + TN), and Cost = 5 × FP + FN (where TP = true positives, FP = false positives, FN = false negatives, and TN = true negatives).

Option A: Recall = 78 / (78 + 22) = 0.78, FPR = 9 / (9 + 91) = 0.09, Cost = 5 × 9 + 22 = 67
Option B: Recall = 79 / (79 + 21) = 0.79, FPR = 1 / (1 + 99) = 0.01, Cost = 5 × 1 + 21 = 26
Option C: Recall = 90 / (90 + 10) = 0.90, FPR = 4 / (4 + 96) = 0.04, Cost = 5 × 4 + 10 = 30
Option D: Recall = 82 / (82 + 18) = 0.82, FPR = 2 / (2 + 98) = 0.02, Cost = 5 × 2 + 18 = 28

Options C and D have a recall greater than 80% and an FPR less than 10%, but D is the most cost effective. For supporting information, refer to this link.
Reference 8: Here
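The arithmetic in the note above can be verified with a few lines of plain Python:

```python
options = {
    "A": {"TN": 91, "FP": 9, "FN": 22, "TP": 78},
    "B": {"TN": 99, "FP": 1, "FN": 21, "TP": 79},
    "C": {"TN": 96, "FP": 4, "FN": 10, "TP": 90},
    "D": {"TN": 98, "FP": 2, "FN": 18, "TP": 82},
}
for name, m in options.items():
    recall = m["TP"] / (m["TP"] + m["FN"])
    fpr = m["FP"] / (m["FP"] + m["TN"])
    cost = 5 * m["FP"] + m["FN"]            # false positives are 5x more expensive
    print(f"{name}: recall={recall:.2f}, FPR={fpr:.2f}, cost={cost}")

# Only C and D satisfy recall >= 0.80 and FPR <= 0.10; D has the lower cost (28 vs 30).
```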

 
 
Question 9) A data scientist uses logistic regression to build a fraud detection model. While the model accuracy is 99%, 90% of the fraud cases are not detected by the model. What action will definitely help the model detect more than 10% of fraud cases? 
A) Using undersampling to balance the dataset
B) Decreasing the class probability threshold
C) Using regularization to reduce overfitting
D) Using oversampling to balance the dataset
 

Answer  9)

B

 

Notes 9)


Decreasing the class probability threshold makes the model more sensitive and, therefore, marks more cases as the positive class, which is fraud in this case. This will increase the likelihood of fraud detection. However, it comes at the price of lowering precision. This is covered in the Discussion section of the paper at this link.
Reference 9: Here
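As a small illustration of the idea (scikit-learn and synthetic data are assumptions, not part of the question), lowering the probability threshold flags more cases as the fraud class:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Highly imbalanced synthetic data: ~1% positive (fraud) class
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

probs = model.predict_proba(X)[:, 1]          # probability of the fraud class
default_preds = (probs >= 0.5).astype(int)    # default threshold
lowered_preds = (probs >= 0.2).astype(int)    # lowered threshold -> higher recall, lower precision

print("flagged at threshold 0.5:", default_preds.sum())
print("flagged at threshold 0.2:", lowered_preds.sum())
```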

 

 
Question 10) A company is interested in building a fraud detection model. Currently, the data scientist does not have a sufficient amount of information due to the low number of fraud cases. Which method is MOST likely to detect the GREATEST number of valid fraud cases?
A) Oversampling using bootstrapping
B) Undersampling
C) Oversampling using SMOTE
D) Class weight adjustment
 

Answer  10)

C

 
Notes 10)

With datasets that are not fully populated, the Synthetic Minority Over-sampling Technique (SMOTE) adds new information by adding synthetic data points to the minority class. This technique would be the most effective in this scenario. Refer to Section 4.2 at this link for supporting information.
Reference 10) : Here
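A minimal sketch of SMOTE using the imbalanced-learn library (the library choice and synthetic data are assumptions; any SMOTE implementation follows the same pattern):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Imbalanced synthetic dataset: ~2% minority (fraud) class
X, y = make_classification(n_samples=2000, weights=[0.98, 0.02], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class points between existing neighbors
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```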
 
Question 11) A machine learning engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm. The ML engineer notices the target label classes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%. What should the ML engineer do to minimize bias due to missing values? 
 
A) Replace each missing value by the mean or median across non-missing values in the same row.
B) Delete observations that contain missing values because these represent less than 5% of the data.
C) Replace each missing value by the mean or median across non-missing values in the same column.
D) For each feature, approximate the missing values using supervised learning based on other features.
 

Answer  11)

D

 

Notes 11)

Use supervised learning to predict missing values based on the values of other features. Different supervised learning approaches might have different performances, but any properly implemented supervised learning approach should provide the same or better approximation than mean or median approximation, as proposed in responses A and C. Supervised learning applied to the imputation of missing values is an active field of research. Refer to this link for an example.
Reference 11): Here
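One concrete (and purely illustrative) way to impute missing values with a supervised model per feature is scikit-learn's IterativeImputer, which regresses each column on the others:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates the estimator)
from sklearn.impute import IterativeImputer

# Tiny toy matrix with missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [5.0, 6.0, 9.0],
    [7.0, 8.0, 12.0],
])

imputer = IterativeImputer(random_state=0)   # models each feature as a function of the others
print(imputer.fit_transform(X))
```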

 
Question 12) A company has collected customer comments on its products, rating them as safe or unsafe, using decision trees. The training dataset has the following features: id, date, full review, full review summary, and a binary safe/unsafe tag. During training, any data sample with missing features was dropped. In a few instances, the test set was found to be missing the full review text field. For this use case, which is the most effective course of action to address test data samples with missing features? 
A) Drop the test samples with missing full review text fields, and then run through the test set.
B) Copy the summary text fields and use them to fill in the missing full review text fields, and then run through the test set.
C) Use an algorithm that handles missing data better than decision trees.
D) Generate synthetic data to fill in the fields that are missing data, and then run through the test set.
 
Answer  12)
B

 

 

Notes 12) 

In this case, a full review summary usually contains the most descriptive phrases of the entire review and is a valid stand-in for the missing full review text field. For supporting information, refer to page 1627 at this link, and this link and this link.

Reference 12) Here
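In pandas this substitution is a one-liner (the column names here are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "full_review": ["Great product, works as advertised.", None],
    "summary": ["Great product", "Stopped working after a week"],
})

# Use the summary text wherever the full review text is missing
df["full_review"] = df["full_review"].fillna(df["summary"])
print(df)
```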

 

 
Question 13) An insurance company needs to automate claim compliance reviews because human reviews are expensive and error-prone. The company has a large set of claims and a compliance label for each. Each claim consists of a few sentences in English, many of which contain complex related information. Management would like to use Amazon SageMaker built-in algorithms to design a machine learning supervised model that can be trained to read each claim and predict if the claim is compliant or not. Which approach should be used to extract features from the claims to be used as inputs for the downstream supervised task? 
A) Derive a dictionary of tokens from claims in the entire dataset. Apply one-hot encoding to tokens found in each claim of the training set. Send the derived features space as inputs to an Amazon SageMaker builtin supervised learning algorithm.
B) Apply Amazon SageMaker BlazingText in Word2Vec mode to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
C) Apply Amazon SageMaker BlazingText in classification mode to labeled claims in the training set to derive features for the claims that correspond to the compliant and non-compliant labels, respectively.
D) Apply Amazon SageMaker Object2Vec to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
 

Answer  13)

D

 

Notes 13)

Amazon SageMaker Object2Vec generalizes the Word2Vec embedding technique for words to more complex objects, such as sentences and paragraphs. Since the supervised learning task is at the level of whole claims, for which there are labels, and no labels are available at the word level, Object2Vec needs to be used instead of Word2Vec.

Reference 13) Amazon SageMaker Object2Vec

Question 14) You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to immediately be processed before operations can continue. The second event type includes data of less importance, but operations can continue without immediately processing. What is the most appropriate solution to record these different types of events?

A) Capture both events with the PutRecords API call.
B) Capture both event types using the Kinesis Producer Library (KPL).
C) Capture the mission critical events with the PutRecords API call and the second event type with the Kinesis Producer Library (KPL).
D) Capture the mission critical events with the Kinesis Producer Library (KPL) and the second event type with the PutRecords API call.
 

Answer  14)

C

 

Notes 14)

The question is about sending data to Kinesis synchronously vs. asynchronously. PutRecords is a synchronous send function, so it must be used for the first event type (critical events). The Kinesis Producer Library (KPL) implements an asynchronous send function, so it can be used for the second event type. The reason to prefer PutRecords over the KPL for the critical events is that the KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable). Larger values of RecordMaxBufferedTime result in higher packing efficiencies and better performance, but applications that cannot tolerate this additional delay may need to use the AWS SDK directly. For more information about using the AWS SDK with Kinesis Data Streams, see Developing Producers Using the Amazon Kinesis Data Streams API with the AWS SDK for Java. For more information about RecordMaxBufferedTime and other user-configurable properties of the KPL, see Configuring the Kinesis Producer Library.

Reference 14: KPL vs PutRecords
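A minimal boto3 sketch of the synchronous PutRecords path recommended for the mission-critical events (the stream name and payload are placeholders):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

response = kinesis.put_records(
    StreamName="critical-events",   # placeholder stream name
    Records=[
        {"Data": json.dumps({"event_id": 1}).encode(), "PartitionKey": "device-1"},
    ],
)
print(response["FailedRecordCount"])   # check and retry any failed records
```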

 

Question 15) You are collecting clickstream data from an e-commerce website to make near-real time product suggestions for users actively using the site. Which combination of tools can be used to achieve the quickest recommendations and meet all of the requirements?

A) Use Kinesis Data Streams to ingest clickstream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
B) Use Kinesis Data Firehose to ingest click stream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions, then use Lambda to load these results into S3.
C) Use Kinesis Data Streams to ingest clickstream data, then use Lambda to process that data and write it to S3. Once the data is on S3, use Athena to query that data based on conditions and make real time recommendations to users.
D) Use the Kinesis Data Analytics to ingest the clickstream data directly and run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
 

Answer  15)

A

 

Notes 15)

Kinesis Data Analytics gets its input streaming data from Kinesis Data Streams or Kinesis Data Firehose. You can use Kinesis Data Analytics to run real-time SQL queries on your data. Once certain conditions are met you can trigger Lambda functions to make real time product suggestions to users. It is not important that we store or persist the clickstream data.

Reference 15: Kinesis Data Analytics

Question 16) Which service built by AWS makes it easy to set up a retry mechanism, aggregate records to improve throughput, and automatically submit CloudWatch metrics?

A) Kinesis API (AWS SDK)
B) Kinesis Producer Library (KPL)
C) Kinesis Consumer Library
D) Kinesis Client Library (KCL)

Answer  16)

B

 

Notes 16)

Although the Kinesis API built into the AWS SDK can be used for all of this, the Kinesis Producer Library (KPL) makes it easy to integrate all of this into your applications.

Reference 16:  Kinesis Producer Library (KPL) 


Question 17) You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data you are ingesting consists of player controller inputs in JSON format, captured every second for up to 10 players in a game. The data needs to be ingested through Kinesis Data Streams, and each JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data?

A) 10 shards
B) Greater than 500 shards, so you’ll need to request more shards from AWS
C) 1 shard
D) 100 shards

Answer  17)

C

 

Notes 17)

In this scenario, there will be a maximum of 10 records per second, for a total payload of 1,000 KB (10 records × 100 KB) written to the shard each second. A single shard can ingest up to 1 MB of data per second, which is enough to ingest the 1,000 KB from the streaming gameplay. Therefore, 1 shard is enough to handle the streaming data.

Reference 17: shards
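The shard arithmetic from the note, written out in Python so the limits are explicit (the 1 MB/s and 1,000 records/s figures are the standard per-shard write limits):

```python
import math

records_per_second = 10          # up to 10 players, one input per second
record_size_kb = 100             # JSON blob size
throughput_kb = records_per_second * record_size_kb   # 1,000 KB/s

shard_ingest_limit_kb = 1024     # 1 MB/s write capacity per shard
shard_record_limit = 1000        # 1,000 records/s per shard

shards_needed = max(
    math.ceil(throughput_kb / shard_ingest_limit_kb),
    math.ceil(records_per_second / shard_record_limit),
)
print(shards_needed)   # 1
```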

Question 18) Which service in the Kinesis family allows you to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time?

A) Kinesis Streams
B) Kinesis Firehose
C) Kinesis Video Streams
D) Kinesis Data Analytics

Answer  18)

D

 

Notes 18)

Kinesis Data Analytics allows you to run real-time SQL queries on your data to gain insights and respond to events in real time.

Reference 18: Kinesis Data Analytics

 

Question 19) You are an ML specialist needing to collect data from Twitter tweets. Your goal is to collect tweets that include only the name of your company and the tweet body, and store them in a data store in AWS. What set of tools can you use to stream, transform, and load the data into AWS with the LEAST amount of effort?

A) Set up a Kinesis Data Firehose for data ingestion and immediately write that data to S3. Next, set up a Lambda function to trigger when data lands in S3 to transform it and finally write it to DynamoDB.
B) Set up a Kinesis Data Stream for data ingestion, and set up EC2 instances as data consumers to poll and transform the data from the stream. Once the data is transformed, make an API call to write the data to DynamoDB.
C) Set up Kinesis Data Streams for data ingestion. Next, set up Kinesis Data Firehose to load that data into Redshift. Next, set up a Lambda function to query data using Redshift Spectrum and store the results in DynamoDB.
D) Create a Kinesis Data Stream to ingest the data. Next, set up a Kinesis Data Firehose and use Lambda to transform the data from the Kinesis Data Stream, then use Lambda to write the data to DynamoDB. Finally, use S3 as the data destination for Kinesis Data Firehose.
 

Answer 19)

A

Notes 19)

All of these could be used to stream, transform, and load the data into an AWS data store. The setup that requires the LEAST amount of effort and moving parts involves setting up a Kinesis Data Firehose to stream the data into S3, have it transformed by Lambda with an S3 trigger, and then written to DynamoDB.

Reference 19: Kinesis Data Firehose to stream the data into S3
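As a hedged sketch of the Lambda step in the chosen answer, here is what an S3-triggered function that extracts the relevant tweet fields and writes them to DynamoDB might look like (the bucket, table, and field names are assumptions for illustration):

```python
import json
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("tweets")   # placeholder table name

def handler(event, context):
    for record in event["Records"]:                  # standard S3 event notification format
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        for line in body.decode().splitlines():      # Firehose batches multiple records per object
            tweet = json.loads(line)
            table.put_item(Item={
                "id": tweet["id_str"],               # assumed tweet fields
                "company": "MyCompany",              # placeholder company name
                "text": tweet["text"],
            })
```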

Question 20) Which service in the Kinesis family allows you to build custom applications that process or analyze streaming data for specialized needs?

A) Kinesis Firehose
B) Kinesis Streams
C) Kinesis Video Streams
D) Kinesis Data Analytics

Answer 20)

B

Notes 20)

Kinesis Streams allows you to stream data into AWS and build custom applications around that streaming data.

Reference 20: Kinesis Streams

Question21:

Answer21:

What are the Top 100 AWS and Google Certified Machine Learning Specialty Questions and Answers Dumps?

This blog is the best way  is the best way to prepare for your upcoming  AWS Certified Machine Learning Specialty and Google Certified Professional Machine Learning Engineer exam. With over 100 questions and answers, this blog provides quizzes similar  that are very similar to the real exam. It also includes  the option to show and hide answers. Additionally, there are machine learning interview questions and detailed answers, as well as cheat sheets and illustrations. This blog is the best way to make sure you are well-prepared for your AWS Certified Machine Learning Specialty Exam.

The typical Google Machine Learning Engineer salary is $147,218. Machine Learning Engineer salaries at Google can range from $110,000 – $152,183.

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

  • By the end of 2020, 85% of customer interactions will be handled without a human (Call Center, Chatbot, etc…)
  • 61% of marketers say artificial intelligence is the most important aspect of their data strategy.
  • 80% of business and tech leaders say AI already boosts productivity (Robotic Process Automation, Power Automate, etc..)
  • Current AI technology can boost business productivity by up to 40%

AWS Machine Learning Certification Specialty Exam Prep for iOs Android Windows10/11

AWS machine Learning Specialty Exam Prep MLS-C01 - Top 200 AWS and Google Certified Machine Learning Specialty Questions and Answers Dumps
AWS machine Learning Specialty Exam Prep MLS-C01

GCP Professional Machine Learning Engineer for iOs, Android, Windows 10/11

Quizzes, Practice Exams: Framing, Architecting, Designing, Developing ML Problems & Solutions, ML Jobs Interview Q&A

GCP Professional Machine Learning Engineer
GCP Professional Machine Learning Engineer

 

Azure AI Fundamentals AI-900 Exam Prep App for iOS, Android, Windows10/11

Basics and Advanced Machine Learning Quizzes on Azure, Azure Machine Learning Job Interviews Questions and Answer, ML Cheat Sheets

Azure AI Fundamentals AI-900 Exam Prep
Azure AI Fundamentals AI-900 Exam Prep

Machine Learning For Dummies App for iOs, Android, Windows10/11

Use this App to learn about Machine Learning and Elevate your Brain with Machine Learning Quizzes, Cheat Sheets, Ml Jobs Interview Questions and Answers updated daily.

Machine Learning For Dummies
Machine Learning For Dummies

What does a Professional Machine Learning Engineer do?

Professional Machine Learning Engineer designs, builds, and productionizes ML models to solve business challenges using Google Cloud technologies and knowledge of proven ML models and techniques. The ML Engineer collaborates closely with other job roles to ensure long-term success of models. The ML Engineer should be proficient in all aspects of model architecture, data pipeline interaction, and metrics interpretation. The ML Engineer needs familiarity with application development, infrastructure management, data engineering, and security. Through an understanding of training, retraining, deploying, scheduling, monitoring, and improving models, they design and create scalable solutions for optimal performance.

The AWS Certified Machine Learning – Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate’s ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.

This blog covers Machine Learning 101, Top 20 AWS Certified Machine Learning Specialty Questions and Answers, Top 20 Google Professional Machine Learning Engineer Sample Questions, Machine Learning Quizzes, Machine Learning Q&A, Top 10 Machine Learning Algorithms, Machine Learning Latest Hot News, Machine Learning Demos (Ex: Tensorflow Demos)

Below are the Top 100 AWS Certified Machine Learning Specialty Questions and Answers Dumps.

Top

 

Question1: A machine learning team has several large CSV datasets in Amazon S3. Historically, models built with the Amazon SageMaker Linear Learner algorithm have taken hours to train on similar-sized datasets. The team’s leaders need to accelerate the training process. What can a machine learning specialist do to address this concern?

A) Use Amazon SageMaker Pipe mode.
B) Use Amazon Machine Learning to train the models.
C) Use Amazon Kinesis to stream the data to Amazon SageMaker.
D) Use AWS Glue to transform the CSV dataset to the JSON format.
ANSWER1:

A

Notes/Hint1:


Amazon SageMaker Pipe mode streams the data directly to the container, which improves the performance of training jobs. (Refer to this link for supporting information.) In Pipe mode, your training job streams data directly from Amazon S3. Streaming can provide faster start times for training jobs and better throughput. With Pipe mode, you also reduce the size of the Amazon EBS volumes for your training instances. B would not apply in this scenario. C is a streaming ingestion solution, but is not applicable in this scenario. D transforms the data structure.

Reference1: Amazon SageMaker

Question 2) A local university wants to track cars in a parking lot to determine which students are parking in the lot. The university is wanting to ingest videos of the cars parking in near-real time, use machine learning to identify license plates, and store that data in an AWS data store. Which solution meets these requirements with the LEAST amount of development effort?

A) Use Amazon Kinesis Data Streams to ingest the video in near-real time, use the Kinesis Data Streams consumer integrated with Amazon Rekognition Video to process the license plate information, and then store results in DynamoDB.

B) Use Amazon Kinesis Video Streams to ingest the videos in near-real time, use the Kinesis Video Streams integration with Amazon Rekognition Video to identify the license plate information, and then store the results in DynamoDB.

C) Use Amazon Kinesis Data Streams to ingest videos in near-real time, call Amazon Rekognition to identify license plate information, and then store results in DynamoDB.

D) Use Amazon Kinesis Firehose to ingest the video in near-real time and outputs results onto S3. Set up a Lambda function that triggers when a new video is PUT onto S3 to send results to Amazon Rekognition to identify license plate information, and then store results in DynamoDB.

Answer 2)

B

Notes/Hint2)

Kinesis Video Streams is used to stream videos in near-real time. Amazon Rekognition Video uses Amazon Kinesis Video Streams to receive and process a video stream. After the videos have been processed by Rekognition we can output the results in DynamoDB.

Reference: Kinesis Video Streams

Question 3) A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences:

1. Please call the number below.
2. Please do not call us. What are the dimensions of the tf–idf matrix?
A) (2, 16)
B) (2, 8)
C) (2, 10)
D) (8, 10)

ANSWER3:

A

Notes/Hint3:

There are 2 sentences, 8 unique unigrams, and 8 unique bigrams, so the result would be (2,16). The phrases are “Please call the number below” and “Please do not call us.” Each word individually (unigram) is “Please,” “call,” ”the,” ”number,” “below,” “do,” “not,” and “us.” The unique bigrams are “Please call,” “call the,” ”the number,” “number below,” “Please do,” “do not,” “not call,” and “call us.” The tf–idf vectorizer is described at this link.

Reference3:  tf-idf vertorizer

Question 4: A company is setting up a system to manage all of the datasets it stores in Amazon S3. The company would like to automate running transformation jobs on the data and maintaining a catalog of the metadata concerning the datasets. The solution should require the least amount of setup and maintenance. Which solution will allow the company to achieve its goals? 

A) Create an Amazon EMR cluster with Apache Hive installed. Then, create a Hive metastore and a script to run transformation jobs on a schedule.
B) Create an AWS Glue crawler to populate the AWS Glue Data Catalog. Then, author an AWS Glue ETL job, and set up a schedule for data transformation jobs.
C) Create an Amazon EMR cluster with Apache Spark installed. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule. D) Create an AWS Data Pipeline that transforms the data. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
 

ANSWER4:

B

Notes/Hint4:

AWS Glue is the correct answer because this option requires the least amount of setup and maintenance since it is serverless, and it does not require management of the infrastructure. Refer to this link for supporting information. A, C, and D are all solutions that can solve the problem, but require more steps for configuration, and require higher operational overhead to run and maintain.
Reference4:  Glue

Question 5) Which service in the Kinesis family allows you to easily load streaming data into data stores and analytics tools?

A) Kinesis Firehose
B) Kinesis Streams
C) Kinesis Data Analytics
D) Kinesis Video Streams
 

ANSWER5:

A

Notes/Hint5:

Kinesis Firehose is perfect for streaming data into AWS and sending it directly to its final destination – places like S3, Redshift, Elastisearch, and Splunk Instances.

Reference 5): Kinesis Firehose

Question 6) A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the data scientist do to improve the training process? 
A) Increase the learning rate. Keep the batch size the same.
B) Reduce the batch size. Decrease the learning rate.
C) Keep the batch size the same. Decrease the learning rate.
D) Do not change the learning rate. Increase the batch size.
 
Answer  6)
B
 

Notes 6)

It is most likely that the loss function is very curvy and has multiple local minima where the training is getting stuck. Decreasing the batch size would help the data scientist stochastically get out of the local minima saddles. Decreasing the learning rate would prevent overshooting the global loss function minimum. Refer to the paper at this link for an explanation.
Reference 6) : Here

Question 7) Your organization has a standalone Javascript (Node.js) application that streams data into AWS using Kinesis Data Streams. You notice that they are using the Kinesis API (AWS SDK) over the Kinesis Producer Library (KPL). What might be the reasoning behind this?
A) The Kinesis API (AWS SDK) provides greater functionality over the Kinesis Producer Library.
B) The Kinesis API (AWS SDK) runs faster in Javascript applications over the Kinesis Producer Library.
C) The Kinesis Producer Library must be installed as a Java application to use with Kinesis Data Streams.
D) The Kinesis Producer Library cannot be integrated with a Javascript application because of its asynchronous architecture.
Answer 7)
C
Notes/Hint7:
The KPL must be installed as a Java application before it can be used with your Kinesis Data Streams. There are ways to process KPL serialized data within AWS Lambda, in Java, Node.js, and Python, but not if these answers mentions Lambda.
Reference 7) KPL
 
 
Question 8) A data scientist is evaluating different binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result. The models should be evaluated based on the following criteria: 
1) Must have a recall rate of at least 80%
2) Must have a false positive rate of 10% or less
3) Must minimize business costs After creating each binary classification model, the data scientist generates the corresponding confusion matrix. Which confusion matrix represents the model that satisfies the requirements?
A) TN = 91, FP = 9 FN = 22, TP = 78
 B) TN = 99, FP = 1 FN = 21, TP = 79
C) TN = 96, FP = 4 FN = 10, TP = 90
D) TN = 98, FP = 2 FN = 18, TP = 82
 
Answer 8): 
D
 

Notes/Hint 8)


The following calculations are required: TP = True Positive FP = False Positive FN = False Negative TN = True Negative FN = False Negative Recall = TP / (TP + FN) False Positive Rate (FPR) = FP / (FP + TN) Cost = 5 * FP + FN A B C D Recall 78 / (78 + 22) = 0.78 79 / (79 + 21) = 0.79 90 / (90 + 10) = 0.9 82 / (82 + 18) = 0.82 False Positive Rate 9 / (9 + 91) = 0.09 1 / (1 + 99) = 0.01 4 / (4 + 96) = 0.04 2 / (2 + 98) = 0.02 Costs 5 * 9 + 22 = 67 5 * 1 + 21 = 26 5 * 4 + 10 = 30 5 * 2 + 18 = 28 Options C and D have a recall greater than 80% and an FPR less than 10%, but D is the most cost effective. For supporting information, refer to this link.
Reference 8: Here

 
 
Question 9) A data scientist uses logistic regression to build a fraud detection model. While the model accuracy is 99%, 90% of the fraud cases are not detected by the model. What action will definitely help the model detect more than 10% of fraud cases? 
A) Using undersampling to balance the dataset
B) Decreasing the class probability threshold
C) Using regularization to reduce overfitting
D) Using oversampling to balance the dataset
 

Answer  9)

B

 

Notes 9)


Decreasing the class probability threshold makes the model more sensitive and, therefore, marks more cases as the positive class, which is fraud in this case. This will increase the likelihood of fraud detection. However, it comes at the price of lowering precision. This is covered in the Discussion section of the paper at this link
Reference 9: Here

 
 
Question 10) A company is interested in building a fraud detection model. Currently, the data scientist does not have a sufficient amount of information due to the low number of fraud cases. Which method is MOST likely to detect the GREATEST number of valid fraud cases?
A) Oversampling using bootstrapping
B) Undersampling
C) Oversampling using SMOTE
D) Class weight adjustment
 

Answer  10)

C

 
Notes 10)

With datasets that are not fully populated, the Synthetic Minority Over-sampling Technique (SMOTE) adds new information by adding synthetic data points to the minority class. This technique would be the most effective in this scenario. Refer to Section 4.2 at this link for supporting information.
Reference 10) : Here
 
Question 11) A machine learning engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm. The ML engineer notices the target label classes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%. What should the ML engineer do to minimize bias due to missing values? 
 
A) Replace each missing value by the mean or median across non-missing values in same row.
B) Delete observations that contain missing values because these represent less than 5% of the data.
C) Replace each missing value by the mean or median across non-missing values in the same column.
D) For each feature, approximate the missing values using supervised learning based on other features.
 

Answer  11)

D

 

Notes 11)

Use supervised learning to predict missing values based on the values of other features. Different supervised learning approaches might have different performances, but any properly implemented supervised learning approach should provide the same or better approximation than mean or median approximation, as proposed in responses A and C. Supervised learning applied to the imputation of missing values is an active field of research. Refer to this link for an example.
Reference 11): Here

 
Question 12) A company has collected customer comments on its products, rating them as safe or unsafe, using decision trees. The training dataset has the following features: id, date, full review, full review summary, and a binary safe/unsafe tag. During training, any data sample with missing features was dropped. In a few instances, the test set was found to be missing the full review text field. For this use case, which is the most effective course of action to address test data samples with missing features? 
A) Drop the test samples with missing full review text fields, and then run through the test set.
B) Copy the summary text fields and use them to fill in the missing full review text fields, and then run through the test set.
C) Use an algorithm that handles missing data better than decision trees.
D) Generate synthetic data to fill in the fields that are missing data, and then run through the test set.
 
Answer  12)
B

 

 

Notes 12) 

In this case, a full review summary usually contains the most descriptive phrases of the entire review and is a valid stand-in for the missing full review text field. For supporting information, refer to page 1627 at this link, and this link and this link.

Reference 12) Here

 

 
Question 13) An insurance company needs to automate claim compliance reviews because human reviews are expensive and error-prone. The company has a large set of claims and a compliance label for each. Each claim consists of a few sentences in English, many of which contain complex related information. Management would like to use Amazon SageMaker built-in algorithms to design a machine learning supervised model that can be trained to read each claim and predict if the claim is compliant or not. Which approach should be used to extract features from the claims to be used as inputs for the downstream supervised task? 
A) Derive a dictionary of tokens from claims in the entire dataset. Apply one-hot encoding to tokens found in each claim of the training set. Send the derived features space as inputs to an Amazon SageMaker builtin supervised learning algorithm.
B) Apply Amazon SageMaker BlazingText in Word2Vec mode to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
C) Apply Amazon SageMaker BlazingText in classification mode to labeled claims in the training set to derive features for the claims that correspond to the compliant and non-compliant labels, respectively.
D) Apply Amazon SageMaker Object2Vec to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
 

Answer  13)

D

 

Notes 13)

Amazon SageMaker Object2Vec generalizes the Word2Vec embedding technique for words to more complex objects, such as sentences and paragraphs. Since the supervised learning task is at the level of whole claims, for which there are labels, and no labels are available at the word level, Object2Vec needs be used instead of Word2Vec.

Reference 13)  Amazon SageMaker
Object2Vec 

Question 14) You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to immediately be processed before operations can continue. The second event type includes data of less importance, but operations can continue without immediately processing. What is the most appropriate solution to record these different types of events?

A) Capture both events with the PutRecords API call.
B) Capture both event types using the Kinesis Producer Library (KPL).
C) Capture the mission critical events with the PutRecords API call and the second event type with the Kinesis Producer Library (KPL).
D) Capture the mission critical events with the Kinesis Producer Library (KPL) and the second event type with the Putrecords API call.
 

Answer  14)

C

 

Notes 14)

The question is about sending data to Kinesis synchronously vs. asynchronously. PutRecords is a synchronous send function, so it must be used for the first event type (critical events). The Kinesis Producer Library (KPL) implements an asynchronous send function, so it can be used for the second event type. In this scenario, the reason to use the KPL over the PutRecords API call is because: KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable). Larger values of RecordMaxBufferedTime results in higher packing efficiencies and better performance. Applications that cannot tolerate this additional delay may need to use the AWS SDK directly. For more information about using the AWS SDK with Kinesis Data Streams, see Developing Producers Using the Amazon Kinesis Data Streams API with the AWS SDK for Java. For more information about RecordMaxBufferedTime and other user-configurable properties of the KPL, see Configuring the Kinesis Producer Library.

Reference 14: KCL vs PutRecords

 

Question 15) You are collecting clickstream data from an e-commerce website to make near-real time product suggestions for users actively using the site. Which combination of tools can be used to achieve the quickest recommendations and meets all of the requirements?

A) Use Kinesis Data Streams to ingest clickstream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
B) Use Kinesis Data Firehose to ingest click stream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions, then use Lambda to load these results into S3.
C) Use Kinesis Data Streams to ingest clickstream data, then use Lambda to process that data and write it to S3. Once the data is on S3, use Athena to query based on conditions that data and make real time recommendations to users.
D) Use the Kinesis Data Analytics to ingest the clickstream data directly and run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
 

Answer  15)

A

 

Notes 15)

Kinesis Data Analytics gets its input streaming data from Kinesis Data Streams or Kinesis Data Firehose. You can use Kinesis Data Analytics to run real-time SQL queries on your data. Once certain conditions are met you can trigger Lambda functions to make real time product suggestions to users. It is not important that we store or persist the clickstream data.

Reference 15: Kinesis Data Analytics

Question 16) Which service built by AWS makes it easy to set up a retry mechanism, aggregate records to improve throughput, and automatically submits CloudWatch metrics?

A) Kinesis API (AWS SDK)
B) Kinesis Producer Library (KPL)
C) Kinesis Consumer Library
D) Kinesis Client Library (KCL)

Answer  16)

B

 

Notes 16)

Although the Kinesis API built into the AWS SDK can be used for all of this, the Kinesis Producer Library (KPL) makes it easy to integrate all of this into your applications.

Reference 16:  Kinesis Producer Library (KPL) 

[appbox appstore 1611045854-iphone screenshots]

[appbox microsoftstore  9n8rl80hvm4t-mobile screenshots]

Question 17) You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data that you are ingesting is players controller inputs every 1 second (up to 10 players in a game) that is in JSON format. The data needs to be ingested through Kinesis Data Streams and the JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data?

A) 10 shards
B) Greater than 500 shards, so you’ll need to request more shards from AWS
C) 1 shard
D) 100 shards

Answer  17)

C

 

Notes 17)

In this scenario, there will be a maximum of 10 records per second with a max payload size of 1000 KB (10 records x 100 KB = 1000KB) written to the shard. A single shard can ingest up to 1 MB of data per second, which is enough to ingest the 1000 KB from the streaming game play. Therefor 1 shard is enough to handle the streaming data.

Reference 17: shards

Question 18) Which services in the Kinesis family allows you to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time?

A) Kinesis Streams
B) Kinesis Firehose
C) Kinesis Video Streams
D) Kinesis Data Analytics

Answer  18)

D

 

Notes 18)

Kinesis Data Analytics allows you to run real-time SQL queries on your data to gain insights and respond to events in real time.

Reference 18: Kinesis Data Analytics

 

Question 19) You are a ML specialist needing to collect data from Twitter tweets. Your goal is to collect tweets that include only the name of your company and the tweet body, and store it off into a data store in AWS. What set of tools can you use to stream, transform, and load the data into AWS with the LEAST amount of effort?

A) Setup a Kinesis Data Firehose for data ingestion and immediately write that data to S3. Next, setup a Lambda function to trigger when data lands in S3 to transform it and finally write it to DynamoDB.
B) Setup A Kinesis Data Stream for data ingestion, setup EC2 instances as data consumers to poll and transform the data from the stream. Once the data is transformed, make an API call to write the data to DynamoDB.
C) Setup Kinesis Data Streams for data ingestion. Next, setup Kinesis Data Firehouse to load that data into RedShift. Next, setup a Lambda function to query data using RedShift spectrum and store the results onto DynamoDB.
D) Create a Kinesis Data Stream to ingest the data. Next, setup a Kinesis Data Firehose and use Lambda to transform the data from the Kinesis Data Stream, then use Lambda to write the data to DynamoDB. Finally, use S3 as the data destination for Kinesis Data Firehose.
 

Answer 19)

A

Notes 19)

All of these could be used to stream, transform, and load the data into an AWS data store. The setup that requires the LEAST amount of effort and moving parts involves setting up a Kinesis Data Firehose to stream the data into S3, have it transformed by Lambda with an S3 trigger, and then written to DynamoDB.

Reference 19: Kinesis Data Firehose to stream the data into S3

Question 20) Which service in the Kinesis family allows you to build custom applications that process or analyze streaming data for specialized needs?

A) Kinesis Firehose
B) Kinesis Streams
C) Kinesis Video Streams
D) Kinesis Data Analytics

Answer 20)

B

Notes 20)

Kinesis Streams allows you to stream data into AWS and build custom applications around that streaming data.

Reference 20: Kinesis Streams

Question21: Of the following, which is an example of machine learning? (Select TWO.)

A) Calculating the shortest route from current location to the destination

B) Optimizing product pricing based on real-time sales data

C) Sentiment analysis of text on product reviews

D) A loan approval system that classifies applicants entirely based on credit score

Answer21:

B and C

Notes 21: 

Optimizing product pricing based on real-time sales data and performing sentiment analysis on product reviews both require learning patterns from data. Calculating the shortest route to a destination and approving loans purely on a credit-score threshold are deterministic, rule-based computations, so they are not examples of machine learning.
 

Question22:Which of the following is an appropriate use case for unsupervised learning?

A) Partitioning an image of a street scene into multiple segments

B) Finding an optimal path out of a maze

C) Identifying clusters of housing sales based on related data points

D) Analyzing sentiment of social media posts

Answer22:

C

Notes 22: 

Identifying clusters of housing sales based on related data points is a clustering task: there are no labels, so the algorithm must discover structure in the data on its own, which is exactly what unsupervised learning does. The other options are typically framed as supervised learning (image segmentation, sentiment analysis) or as search/reinforcement-learning problems (finding a path out of a maze).
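A short, hedged illustration of option C: clustering housing sales with scikit-learn's k-means. The features and values below are made up purely to show that no labels are involved.

```python
# Illustrative clustering of housing sales (synthetic data, hypothetical features).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Each row: [sale_price, square_feet, distance_to_city_center_km]
sales = np.array([
    [250_000, 1_200, 12.0],
    [265_000, 1_350, 11.5],
    [720_000, 2_800,  2.0],
    [690_000, 2_600,  2.5],
    [410_000, 1_900,  6.0],
    [395_000, 1_850,  6.5],
])

# Scale features so price does not dominate the distance metric.
X = StandardScaler().fit_transform(sales)

# No labels are involved; the algorithm discovers the groups itself.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)
```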

Question23

Answer23:

 

Notes 23: 

Question24: Djamgatech, a retail company, wants to deploy a machine learning model to predict the demand for a product using sales data from the past 5 years. What is the MOST efficient solution that the company should implement first?

A) Regression

B) Multi-class classification

C) Binary class classification

D) N/A

Answer24:

A

Notes 24:

Demand for a product is a continuous numeric quantity, so predicting it from historical sales data is a regression problem. Binary and multi-class classification predict discrete categories and do not fit this use case.
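As a minimal sketch of framing demand prediction as regression, the example below fits a linear model to a few synthetic rows of sales history; the feature names and numbers are placeholders, not real data.

```python
# Toy regression example for demand forecasting (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: [month_index, price, promotion_flag]; target: units sold.
X = np.array([[1, 9.99, 0], [2, 9.99, 1], [3, 8.99, 0], [4, 8.99, 1], [5, 9.49, 0]])
y = np.array([120, 180, 150, 210, 140])

model = LinearRegression().fit(X, y)
print(model.predict([[6, 9.49, 1]]))  # predicted demand for month 6
```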

Question25: In which phase of the ML pipeline do you analyze the business requirements and reframe that information into a machine learning context?

A) Problem formulation

B) Model training

C) Deployment

D) Data preprocessing

Answer25:

A

Notes 25:

Problem formulation is the phase in which you analyze the business requirements and reframe them as a machine learning problem, for example deciding whether the task is regression or classification and what the target variable should be. Data preprocessing, model training, and deployment all come later in the pipeline.

AWS machine Learning Specialty Exam Prep MLS-C01

iOs: https://apps.apple.com/ca/app/aws-machine-learning-prep-pro/id1611045854

Windows: https://www.microsoft.com/en-ca/p/aws-machine-learning-mls-c01-specialty-certification-exam-prep/9n8rl80hvm4t

Android/Amazon: https://www.amazon.com/gp/product/B09TZ4H8V6

AWS MLS-C01 Machine Learning Exam Prep

Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A

Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.

Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.

The App provides hundreds of quizzes and practice exam about:

– Machine Learning Operation on AWS

– Modelling

– Data Engineering

– Computer Vision,

– Exploratory Data Analysis,

– ML implementation & Operations

– Machine Learning Basics Questions and Answers

– Machine Learning Advanced Questions and Answers

– Scorecard

– Countdown timer

– Machine Learning Cheat Sheets

– Machine Learning Interview Questions and Answers

– Machine Learning Latest News

The App covers Machine Learning Basics and Advanced topics including: NLP, Computer Vision, Python, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.

Domain 1: Data Engineering

Create data repositories for machine learning.

Identify data sources (e.g., content and location, primary sources such as user data)

Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)

Identify and implement a data ingestion solution.

Data job styles/types (batch load, streaming)

Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.

Domain 2: Exploratory Data Analysis

Sanitize and prepare data for modeling.

Perform feature engineering.

Analyze and visualize data for machine learning.

Domain 3: Modeling

Frame business problems as machine learning problems.

Select the appropriate model(s) for a given machine learning problem.

Train machine learning models.

Perform hyperparameter optimization.

Evaluate machine learning models.

Domain 4: Machine Learning Implementation and Operations

Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.

Recommend and implement the appropriate machine learning services and features for a given problem.

Apply basic AWS security practices to machine learning solutions.

Deploy and operationalize machine learning solutions.

Machine Learning Services covered:

Amazon Comprehend

AWS Deep Learning AMIs (DLAMI)

AWS DeepLens

Amazon Forecast

Amazon Fraud Detector

Amazon Lex

Amazon Polly

Amazon Rekognition

Amazon SageMaker

Amazon Textract

Amazon Transcribe

Amazon Translate

Other Services and topics covered are:

Ingestion/Collection

Processing/ETL

Data analysis/visualization

Model training

Model deployment/inference

Operational

AWS ML application services

Language relevant to ML (for example, Python, Java, Scala, R, SQL)

Notebooks and integrated development environments (IDEs),

Amazon S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, and Glue; data formats such as CSV, JSON, images, and Parquet; and databases

Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service, Amazon Redshift

SageMaker API Explained:

SageMaker API

AWS Certified Machine Learning Engineer Specialty Questions and Answers:

Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements with a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, with minimal changes to the existing code. How can it do this?

Answer1: Use the Build Your Own Container (BYOC) Amazon SageMaker option.
Create a new Docker container with the existing code, register the container in Amazon Elastic Container Registry (ECR), and finally run the training and inference jobs using this container.
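For illustration, here is a minimal sketch of running such a custom container with the SageMaker Python SDK (v2 assumed); the ECR image URI, S3 paths, and instance types are hypothetical placeholders, not values from the question.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role assumed by the training job

# Custom R/XGBoost image previously pushed to Amazon ECR (hypothetical URI).
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/r-xgboost:latest",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model-artifacts/",
    sagemaker_session=session,
)

# Training data channel (hypothetical S3 prefix); SageMaker mounts it inside the container.
estimator.fit({"train": "s3://my-bucket/train/"})

# Deploy the same container behind a real-time endpoint for inference.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```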

Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?

 

Answer2: Amazon SageMaker notebook instances

Amazon SageMaker enables developers and data scientists to build, train, tune, and deploy machine learning (ML) models at scale. You can deploy trained ML models for real-time or batch predictions on unseen data, a process known as inference. However, in most cases the raw input data must be preprocessed and can’t be used directly for making predictions. This is because most ML models expect the data in a predefined format, so the raw data needs to first be cleaned and formatted in order for the ML model to process it. You can use the Amazon SageMaker built-in Scikit-learn library for preprocessing input data and then use the Amazon SageMaker built-in Linear Learner algorithm for predictions.
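As a minimal local sketch of that preprocessing step (the column names and values are made up; in a real SageMaker workflow this code would typically run in the built-in Scikit-learn container or a Processing job before handing the cleaned output to Linear Learner):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data with missing values and a categorical column.
raw = pd.DataFrame({
    "age": [34, None, 52],
    "income": [48_000, 61_000, None],
    "segment": ["a", "b", "a"],
})

numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("encode", OneHotEncoder(handle_unknown="ignore"))])

preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["segment"]),
])

clean = preprocess.fit_transform(raw)  # cleaned matrix ready for a Linear Learner training job
print(clean.shape)
```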

Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?

Answer3: Lifecycle configuration

Question4: How do you choose the right SageMaker built-in algorithm?

How to choose the right built-in algorithm in SageMaker

Guide to choosing the right unsupervised learning algorithm

Choosing the right ML algorithm based on data type

This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have. 
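To make the guide concrete, here is a rough, non-exhaustive Python mapping from problem type to commonly used SageMaker built-in algorithms; the dictionary and helper function are purely illustrative.

```python
# Illustrative mapping from problem type to SageMaker built-in algorithms.
BUILTIN_ALGO_BY_PROBLEM = {
    "tabular classification/regression": ["XGBoost", "Linear Learner"],
    "time-series forecasting": ["DeepAR"],
    "clustering": ["K-Means"],
    "dimensionality reduction": ["PCA"],
    "topic modeling": ["LDA", "Neural Topic Model"],
    "text classification / embeddings": ["BlazingText"],
    "image classification": ["Image Classification"],
    "object detection": ["Object Detection"],
    "anomaly detection": ["Random Cut Forest"],
    "recommendation": ["Factorization Machines"],
}

def suggest(problem):
    """Return candidate built-in algorithms for a named problem type."""
    return BUILTIN_ALGO_BY_PROBLEM.get(problem, ["Bring Your Own Container"])

print(suggest("clustering"))  # ['K-Means']
```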

 


Top 10 Google Professional Machine Learning Engineer Sample Questions

Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?

A. Use K-fold cross validation to understand how the model performs on different test datasets.

B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.

C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.

D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.

Answer 1)

B

Notes 1)

B is correct because it identifies the pixels of the input image that lead to the classification of the image itself.
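For readers curious what Integrated Gradients actually computes, here is a minimal hand-rolled TensorFlow sketch; it assumes a trained Keras image classifier named `model`, and in practice you would normally rely on an explainability library rather than this.

```python
import tensorflow as tf

def integrated_gradients(model, image, target_class, baseline=None, steps=50):
    """Approximate per-pixel attributions for one image via Integrated Gradients."""
    image = tf.convert_to_tensor(image, dtype=tf.float32)
    if baseline is None:
        baseline = tf.zeros_like(image)              # black image as the reference point
    # Interpolate between the baseline and the input in `steps` increments.
    alphas = tf.reshape(tf.linspace(0.0, 1.0, steps), (steps, 1, 1, 1))
    interpolated = baseline[None] + alphas * (image - baseline)[None]
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        probs = model(interpolated)[:, target_class]  # score of the predicted class
    grads = tape.gradient(probs, interpolated)        # gradient at each interpolation step
    avg_grads = tf.reduce_mean(grads, axis=0)         # average over the path
    return (image - baseline) * avg_grads             # attribution per pixel
```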

Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?

A. Train the model for a few iterations, and check for NaN values.
B. Train the model for a few iterations, and verify that the loss is constant.
C. Train a simple linear model, and determine if the DNN model outperforms it.
D. Train the model with no regularization, and verify that the loss function is close to zero.
 

Answer 2)

D

Notes 2)

D is correct because the test can check that the model has enough parameters to memorize the task.
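A minimal Keras sketch of that memorization check follows; the architecture, batch size, and loss threshold are illustrative assumptions, not part of the question.

```python
import numpy as np
import tensorflow as tf

def can_memorize(model, n=32, input_dim=20, epochs=500, tol=0.05):
    """Train on a tiny random batch with no regularization; a model with enough
    parameters should drive the training loss close to zero (threshold is illustrative)."""
    x = np.random.randn(n, input_dim).astype("float32")
    y = np.random.randint(0, 2, size=(n, 1)).astype("float32")
    history = model.fit(x, y, epochs=epochs, verbose=0)
    return history.history["loss"][-1] < tol

# Example model under test: built without dropout or weight decay.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
print(can_memorize(model))
```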

[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]

Question 3: Your team is using a TensorFlow Inception-v3 CNN model pretrained on ImageNet for an image classification prediction challenge on 10,000 images. You will use AI Platform to perform the model training. What TensorFlow distribution strategy and AI Platform training job configuration should you use to train the model and optimize for wall-clock time?

 

A. Default Strategy; Custom tier with a single master node and four v100 GPUs.
B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.
C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.
D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.
 

Answer 3)

D

Notes 3)

D is correct because, among the options, this is the only strategy that can perform distributed training, albeit with only a single copy of the variables on the CPU host.
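For reference, a minimal TensorFlow 2.x sketch of training under that strategy scope; the model and compile settings are placeholders, and on AI Platform the four GPUs come from the custom-tier machine configuration.

```python
import tensorflow as tf

# Central Storage Strategy keeps one copy of the variables on the CPU host
# while compute runs on the attached GPUs.
strategy = tf.distribute.experimental.CentralStorageStrategy()

with strategy.scope():
    model = tf.keras.applications.InceptionV3(weights=None, classes=10)
    model.compile(optimizer="adam", loss="categorical_crossentropy")

# model.fit(train_dataset, ...) then distributes each training step across the GPUs.
```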

Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?

 
A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version
 
B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version
 
C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment
D. Create a new AI Platform model version -> Deploy model in test environment -> Validate model
 
Answer 4)
A
 
Notes 4)
A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.
 
Question 5: You work for a maintenance company and have built and trained a deep learning model that identifies defects based on thermal images of underground electric cables. Your dataset contains 10,000 images, 100 of which contain visible defects. How should you evaluate the performance of the model on a test dataset?
 
A. Calculate the Area Under the Curve (AUC) value.
 
B. Calculate the number of true positive results predicted by the model.
C. Calculate the fraction of images predicted by the model to have a visible defect.
D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.
 
Answer 5)
A
 
Notes 5)
A is correct because it is scale-invariant. AUC measures how well predictions are ranked, rather than their absolute values. AUC is also classification-threshold invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.
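As a small, self-contained sketch of why AUC is the right lens for such an imbalanced test set (the scores below are synthetic stand-ins for a model’s predicted probabilities):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([0] * 990 + [1] * 10)                   # ~1% defects, as in the scenario
y_score = np.concatenate([rng.uniform(0.0, 0.6, 990),     # model scores for non-defects
                          rng.uniform(0.4, 1.0, 10)])     # higher scores for defects

print("AUC:", roc_auc_score(y_true, y_score))             # threshold- and scale-invariant
print("Accuracy of always predicting 'no defect':", 990 / 1000)  # 99%, yet useless
```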
 
Question 6: You work for a manufacturing company that owns a high-value machine which has several machine settings and multiple sensors. A history of the machine’s hourly sensor readings and known failure event data are stored in BigQuery. You need to predict if the machine will fail within the next 3 days in order to schedule maintenance before the machine fails. Which data preparation and model training steps should you take?

 

A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML
 
B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False
D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
Answer 6)
D
 
Notes 6)
D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
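A minimal pandas sketch of the rolling-average feature engineering in option D (in the question this step is done with DataPrep and the class imbalance is handled by BQML’s AUTO_CLASS_WEIGHTS; the sensor column and 24-hour window here are hypothetical):

```python
import pandas as pd

# Hypothetical hourly sensor readings.
readings = pd.DataFrame({
    "ts": pd.date_range("2023-01-01", periods=72, freq="H"),
    "vibration": range(72),
})

# Rolling 24-hour average smooths noisy hourly readings into a trend feature.
readings["vibration_roll_24h"] = (
    readings["vibration"].rolling(window=24, min_periods=1).mean()
)
print(readings.tail())
```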
 
 
Question 7: You are an ML engineer at a media company. You need to build an ML model to analyze video content frame-by-frame, identify objects, and alert users if there is inappropriate content. Which Google Cloud products should you use to build this project?

 

A. Pub/Sub, Cloud Function, Cloud Vision API
 
B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging
C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging
D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging
 
Answer 7)
C
 
Notes 7)
C is correct as Video Intelligence API can find inappropriate components and other components satisfy the requirements of real-time processing and notification.
 
Question 8: You work for a large retailer. You want to use ML to forecast future sales leveraging 10 years of historical sales data. The historical data is stored in Cloud Storage in Avro format. You want to rapidly experiment with all the available data. How should you build and train your model for the sales forecast?
 
A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.
B. Convert the data into CSV format and create a regression model on AutoML Tables.
C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.
D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.
 
Answer 8)
A
 
Notes 8)
A is correct because BigQuery ML is designed for fast and rapid experimentation and it is possible to use federated queries to read data directly from Cloud Storage. Moreover, ARIMA is considered one of the best in class for time series forecasting.
 
Question 9) You need to build an object detection model for a small startup company to identify if and where the company’s logo appears in an image. You were given a large repository of images, some with logos and some without. These images are not yet labelled. You need to label these pictures, and then train and deploy the model. What should you do?

 

A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.
B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.
 
C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.
D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.
 
Answer 9)
A
 
Notes 9)
A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.
 

Question 10) You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer’s stated intention for contacting customer service. About 70% of customer inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer and more complicated requests. Which intents should you automate first?

A. Automate a blend of the shortest and longest intents to be representative of all intents.
B. Automate the more complicated requests first because those require more of the agents’ time.
C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.
 
D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.
Answer 10)
C
 
Notes 10)
C is correct because automating the 10 intents that cover 70% of requests delivers the most value quickly and frees live agents to handle the longer, more complicated inquiries.


Machine Learning Q&A Part I:

Google.

Azure and AWS are second-class citizens in this area.

Sure, AWS has 70% of the market.

Sure, Azure is the easiest turnkey option and super user friendly.

But, the king of machine learning in the cloud is GCP.

GCP = Google Cloud Platform

Google has the largest data science team in the world, not to mention they have Hinton.

Let’s forget for a minute that they created TensorFlow and gave it away.

Let’s just talk about building a real-world model with data that doesn’t fit into an Excel spreadsheet.

The vast majority of applied machine learning is supervised and that means we need data.

Not just normal data, we need very clean highly structured data.

Where’s the easiest place in the world to upload and model a petabyte of structured data? BigQuery, of course.

Why BigQuery? I don’t have to do anything but upload my data. No spinning up Redshift clusters or whatever I have to do in Azure; I just upload and massage data with my familiar SQL. If I do have to wrangle my data, it won’t take me six months to update five rows; it usually takes minutes.

Then, you’ll need a front end. Cloud Datalab is a Jupyter notebook, which is good because I don’t want, nor do I need, anything else.

Then, with a single line of code, I connect my Datalab (Jupyter) notebook to my data in BigQuery and build away.
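For example, with the google-cloud-bigquery client library installed in the notebook, a query like the following (project, dataset, and table names are hypothetical) pulls data straight into a pandas DataFrame:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table; the query result lands directly in a pandas DataFrame.
df = client.query(
    "SELECT user_id, clicked FROM `my_project.ads.training_data` LIMIT 1000"
).to_dataframe()

print(df.head())
```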

I’ve worked in all three, and the only thing I care about is getting my job done the fastest; right now, that means I build my models in GCP.

If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.

The course below is free to the first 20.

The Complete Python Course for Machine Learning Engineers

Here, I want to share the best research paper on Machine Learning classification methods, titled ‘Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?’, published in the ‘Journal of Machine Learning Research’.

This paper nicely explains 179 classification techniques applied to 121 data sets; a short summary of the paper follows:

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

 
 
 

The paper evaluated 179 classifiers arising from 17 ML families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R (with and without the caret package), C and Matlab, including all the relevant classifiers available today.

The experiments used a total of 121 data sets, which represent the whole UCI database (excluding the large-scale problems) plus other real-world problems, in order to achieve significant conclusions about classifier behaviour that do not depend on the data set collection.

The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz

The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy, exceeding 90% on 84.3% of the data sets. However, the difference is not statistically significant compared with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the rest: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0, and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package).

The random forest is clearly the best family of classifiers (3 of the 5 best classifiers are RF), followed by SVM (4 classifiers in the top 10), neural networks and boosting ensembles (5 and 3 members in the top 20, respectively).

You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt
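This is not the paper’s protocol, but as a quick illustration of how such a comparison is typically set up, here is a minimal scikit-learn sketch pitting the two top-ranked families, random forest and an RBF-kernel SVM, against each other with cross-validation on a toy dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

# 5-fold cross-validated accuracy for each family.
print("RF accuracy :", cross_val_score(rf, X, y, cv=5).mean())
print("SVM accuracy:", cross_val_score(svm, X, y, cv=5).mean())
```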

I hope it will be helpful for Statistics and Machine Learning aspirants!

Thank you!

 
 
 

At a high level, these skills are a combination of software and data engineering.

The people best suited to this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

  • Well-structured code: it doesn’t need to be perfect, but it should at least be understandable and maintainable by other team members. Avoid spaghetti code[1] like the plague.
  • Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at all costs.
  • Model versioning: add a hash key to your different models. You will thank me later.
  • Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, features used, CV scores, and so on). You will thank me later, again.
  • Monitor performance: track the execution time and statistical scores of your models.
  • Data and model management: store the necessary data and models somewhere that is available to everyone (S3[3], for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read: often). A small sketch combining a few of these habits follows this list. Read more here …..
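As a minimal sketch combining three of the habits above (structured logging instead of print, a hash key per model artifact, and a metadata record saved next to the model), assuming local file paths and hypothetical names:

```python
import hashlib
import json
import logging
from pathlib import Path

# Structured logging instead of print statements.
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("training")

# Stand-in for a real serialized model artifact.
model_path = Path("model.bin")
model_path.write_bytes(b"fake serialized model")

# Hash key for model versioning.
model_hash = hashlib.sha256(model_path.read_bytes()).hexdigest()[:12]

# Metadata record saved alongside the model (hypothetical values).
metadata = {
    "model_hash": model_hash,
    "hyperparameters": {"max_depth": 6, "eta": 0.1},
    "cv_score": 0.87,
    "features": ["age", "income", "segment"],
}
Path(f"metadata_{model_hash}.json").write_text(json.dumps(metadata, indent=2))

log.info("Saved model %s with metadata", model_hash)
```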

Some of the mistakes that can occur while building a machine learning model (that I can think of) are listed here:

  1. Not understanding the structure of the dataset
  2. Not taking proper care during feature selection
  3. Leaving out categorical features and considering only numerical variables
  4. Falling into the dummy variable trap
  5. Selecting an inefficient machine learning algorithm
  6. Not trying out various ML algorithms for building the model based on the structure of the data
  7. Improper tuning of model parameters
  8. Most importantly: building a naive, imperfect model. Suppose we have a classification problem with a 99% chance of an example falling into class 1 and the rest into class 2. The built model may learn a mapping function that predicts class 1 for every input. One might say the model has 99% accuracy, but in reality the 1% of class 2 cases is never captured by the model, so this must be taken into consideration (see the sketch after this list).
  9. Read more here…
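A minimal sketch of the 99%/1% trap in point 8: a model that always predicts the majority class scores 99% accuracy while recalling none of class 2 (labels and data below are synthetic).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

X = np.zeros((1000, 3))
y = np.array([0] * 990 + [1] * 10)        # 99% class 1 (label 0), 1% class 2 (label 1)

# A "model" that always predicts the majority class.
always_majority = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = always_majority.predict(X)

print("accuracy:", accuracy_score(y, pred))           # 0.99 -- looks great
print("recall for class 2:", recall_score(y, pred))   # 0.0  -- useless for class 2
```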


Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output is patterns or trends, which don’t require coming up with a statement or fact to test.

However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis, on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….

The data science life cycle is not well-defined like the software development life cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life cycle of a data science project depends on various data scientist skills and data science tools. The typical life cycle of a data science project involves jumping back and forth among various interdependent data science tasks using a variety of tools, techniques, programming languages, etc.

Thus, the data science life-cycle can include the following steps:

  1. Business requirement understanding.
  2. Data collection.
  3. Data cleaning.
  4. Data analysis.
  5. Modeling.
  6. Performance evaluation.
  7. Communicating with stakeholders.
  8. Deployment.
  9. Real-world testing.
  10. Business buy-in.
  11. Support and maintenance.

Looks neat, but here is a scheme that visualizes how it happens in reality:

Agile development processes, especially continuous delivery, lend themselves well to the data science project life cycle. Early comparison helps the data science team change approaches, refine hypotheses, and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build them.

Read more here….

 



 

AWS machine Learning Specialty Exam Prep MLS-C01

iOs: https://apps.apple.com/ca/app/aws-machine-learning-prep-pro/id1611045854

Windows: https://www.microsoft.com/en-ca/p/aws-machine-learning-mls-c01-specialty-certification-exam-prep/9n8rl80hvm4t

Android/Amazon: https://www.amazon.com/gp/product/B09TZ4H8V6

AWS MLS-C01 Machine Learning Exam Prep

Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A

Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.

Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.

The App provides hundreds of quizzes and practice exam about:

– Machine Learning Operation on AWS

– Modelling

– Data Engineering

– Computer Vision,

– Exploratory Data Analysis,

– ML implementation & Operations

– Machine Learning Basics Questions and Answers

– Machine Learning Advanced Questions and Answers

– Scorecard

– Countdown timer

– Machine Learning Cheat Sheets

– Machine Learning Interview Questions and Answers

– Machine Learning Latest News

The App covers Machine Learning Basics and Advanced topics including: NLP, Computer Vision, Python, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.

Domain 1: Data Engineering

Create data repositories for machine learning.

Identify data sources (e.g., content and location, primary sources such as user data)

Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)

Identify and implement a data ingestion solution.

Data job styles/types (batch load, streaming)

Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.

Domain 2: Exploratory Data Analysis

Sanitize and prepare data for modeling.

Perform feature engineering.

Analyze and visualize data for machine learning.

Domain 3: Modeling

Frame business problems as machine learning problems.

Select the appropriate model(s) for a given machine learning problem.

Train machine learning models.

Perform hyperparameter optimization.

Evaluate machine learning models.

Domain 4: Machine Learning Implementation and Operations

Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.

Recommend and implement the appropriate machine learning services and features for a given problem.

Apply basic AWS security practices to machine learning solutions.

Deploy and operationalize machine learning solutions.

Machine Learning Services covered:

Amazon Comprehend

AWS Deep Learning AMIs (DLAMI)

AWS DeepLens

Amazon Forecast

Amazon Fraud Detector

Amazon Lex

Amazon Polly

Amazon Rekognition

Amazon SageMaker

Amazon Textract

Amazon Transcribe

Amazon Translate

Other Services and topics covered are:

Ingestion/Collection

Processing/ETL

Data analysis/visualization

Model training

Model deployment/inference

Operational

AWS ML application services

Language relevant to ML (for example, Python, Java, Scala, R, SQL)

Notebooks and integrated development environments (IDEs),

S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue, SageMaker, CSV, JSON, IMG, parquet or databases, Amazon Athena

Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service , Amazon Redshift

Sagemaker API Explained:

SageMaker API

AWS Certified Machine Learning Engineer Specialty Questions and Answers:

Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.

Answer1: Use the Build Your Own Container (BYOC) Amazon Sagemaker option.
Create a new docker container with the existing code. Register the container in Amazon Elastic Container registry. with the existing code. Register the container in Amazon Elastic Container Registry. Finally run the training and inference jobs using this container.

Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?

 

Answer2: Amazon Sagemaker Notebook instances

Amazon SageMaker enables developers and data scientists to build, train, tune, and deploy machine learning (ML) models at scale. You can deploy trained ML models for real-time or batch predictions on unseen data, a process known as inference. However, in most cases, the raw input data must be preprocessed and can’t be used directly for making predictions. This is because most ML models expect the data in a predefined format, so the raw data needs to be first cleaned and formatted in order for the ML model to process the data.  You can use the Amazon SageMaker built-in Scikit-learn library for preprocessing input data and then use the Amazon SageMaker built-in Linear Learner algorithm for predictions.

Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?

Answer3: LifeCycle Configuration

Question4: How to Choose the right Sagemaker built-in algorithm?

How to chose the right built in algorithm in SageMaker?
How to chose the right built in algorithm in SageMaker?
Guide to choosing the right unsupervised learning algorithm
Guide to choosing the right unsupervised learning algorithm

 

Choosing the right  ML algorithm based on Data Type
Choosing the right ML algorithm based on Data Type

 

Choosing the right ML algo based on data type
Choosing the right ML algo based on data type

This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have. 

 

Top

Top 10 Google Professional Machine Learning Engineer Sample Questions

Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?

A. Use K-fold cross validation to understand how the model performs on different test datasets.

B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.

C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.

D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.

Answer 1)

B

Notes 1)

B is correct because it identifies the pixel of the input image that leads to the classification of the image itself.

Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?

A. Train the model for a few iterations, and check for NaN values.
B. Train the model for a few iterations, and verify that the loss is constant.
C. Train a simple linear model, and determine if the DNN model outperforms it.
D. Train the model with no regularization, and verify that the loss function is close to zero.
 

Answer 2)

D

Notes 2)

D is correct because the test can check that the model has enough parameters to memorize the task.

[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]

Question 3: Your team is using a TensorFlow Inception-v3 CNN model pretrained on ImageNet for an image classification prediction challenge on 10,000 images. You will use AI Platform to perform the model training. What TensorFlow distribution strategy and AI Platform training job configuration should you use to train the model and optimize for wall-clock time?

 

A. Default Strategy; Custom tier with a single master node and four v100 GPUs.
B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.
C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.
D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.
 

Answer 3)

D

Notes 3)

D is correct because this is the only strategy that can perform distributed training; albeit there is only a single copy of the variables on the CPU host.

Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?

 
A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version
 
B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version
 
C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment
D. Create a new AI Platform model version – > Deploy model in test environment -> Validate model
 
Answer 4)
A
 
Notes 4)
A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.
 
Question 5: You work for a maintenance company and have built and trained a deep learning model that identifies defects based on thermal images of underground electric cables. Your dataset contains 10,000 images, 100 of which contain visible defects. How should you evaluate the performance of the model on a test dataset?
 
A. Calculate the Area Under the Curve (AUC) value.
 
B. Calculate the number of true positive results predicted by the model.
C. Calculate the fraction of images predicted by the model to have a visible defect.
D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.
 
Answer 5)
A
 
Notes 5)
A is correct because it is scale-invariant. AUC measures how well predictions are ranked, rather than their absolute values. AUC is also classification-threshold invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.
 
Question 6: You work for a manufacturing company that owns a high-value machine which has several machine settings and multiple sensors. A history of the machine’s hourly sensor readings and known failure event data are stored in BigQuery. You need to predict if the machine will fail within the next 3 days in order to schedule maintenance before the machine fails. Which data preparation and model training steps should you take?

 

A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML
 
B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False
D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
Answer 6)
D
 
Notes 6)
D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
 
 
Question 7: You are an ML engineer at a media company. You need to build an ML model to analyze video content frame-by-frame, identify objects, and alert users if there is inappropriate content. Which Google Cloud products should you use to build this project?

 

A. Pub/Sub, Cloud Function, Cloud Vision API
 
B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging
C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging
D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging
 
Answer 7)
C
 
Notes 7)
C is correct as Video Intelligence API can find inappropriate components and other components satisfy the requirements of real-time processing and notification.
 
Question 8: You work for a large retailer. You want to use ML to forecast future sales leveraging 10 years of historical sales data. The historical data is stored in Cloud Storage in Avro format. You want to rapidly experiment with all the available data. How should you build and train your model for the sales forecast?
 
A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.
B. Convert the data into CSV format and create a regression model on AutoML Tables.
C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.
D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.
 
Answer 8)
A
 
Notes 8)
A is correct because BigQuery ML is designed for fast and rapid experimentation and it is possible to use federated queries to read data directly from Cloud Storage. Moreover, ARIMA is considered one of the best in class for time series forecasting.
 
Question 9) You need to build an object detection model for a small startup company to identify if and where the company’s logo appears in an image. You were given a large repository of images, some with logos and some without. These images are not yet labelled. You need to label these pictures, and then train and deploy the model. What should you do?

 

A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.
B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.
 
C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.
D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.
 
Answer 9)
A
 
Notes 9)
A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.
 

Question 10) You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer’s stated intention for contacting customer service. About 70% of customer inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer and more complicated requests. Which intents should you automate first?

A. Automate a blend of the shortest and longest intents to be representative of all intents.
B. Automate the more complicated requests first because those require more of the agents’ time.
C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.
 
D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.
Answer 10)
C
 
Notes 10)

[appbox appstore 1611045854-iphone screenshots]

[appbox microsoftstore  9n8rl80hvm4t-mobile screenshots]

Machine Learning Q&A Part I:

Google.

Azure and AWS are second class citizens in this area.

Sure, AWS has 70% of the market.

Sure, Azure is the easiest turn key and super user friendly.

But, the king of machine learning in the cloud is GCP.

GCP = Google Cloud Platform

Google has the largest data science team in the world, not mention they have Hinton.

Let’s forgot for a minute they created TensorFlow and give it away.

Let’s just talk about building a real world model with data that doesn’t fit into a excel spreadsheet.

The vast majority of applied machine learning is supervised and that means we need data.

Not just normal data, we need very clean highly structured data.

Where’s the easiest place in the world to upload and model a Petabyte of structured dataBigQuery of course.

Why BigQuery? I don’t have to do anything but upload my data. No spinning up RedShit clusters or whatever I have to do in Azure, just upload and massage data with my familiar SQL. If I do have to wrangle my data it won’t take my six months to update 5 rows here, minutes usually.

Then, you’ll need a front end. Cloud datalab is a Jupyter notebook, which is good because I don’t want nor do I need anything else.

Then, with a single line of code I connect by datalab (Jupyter) notebook to my data in BigQuery and build away.

I’ve worked in all three and the only thing I care about is getting to my job the fastest and right now that means I build my models in GCP.

If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.

The course below is free to the first 20.

The Complete Python Course for Machine Learning Engineers

Here, I want to share the best research paper on Machine Learning classification methods, titled ‘Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?’, published in the ‘Journal of Machine Learning Research’.

This paper nicely explained 179 classification techniques and applied them on 121 data sets thus sharing small summary of the paper:

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

 
 
 

The paper evaluated 179 classifiers arising from 17 ML families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R ( with and without the caret package), C and Matlab, including all the relevant classifiers available today.

Experiments used total 121 data sets , which represent the whole UCI data base (excluding the large-scale problems) and other own real problems, in order to achieve significant conclusions about the classifier behaviour, not dependent on the data set collection.

The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz

The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets. However, the difference is not statistically significant with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package).

The random forest is clearly the best family of classifiers (3 out of 5 bests classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).

You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt

I hope it will be helpful for Statistic and Machine Leaning aspirants!

Thank you!

 
 
 

At a high level, these skills are a combination of software and data engineering.

The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

  • Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
  • Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
  • Model versioning: add a hash key to your different models. You will thank me later.
  • Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
  • Monitor performances: execution time and statistical scores of your models.
  • Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..

Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:

  1. Not understanding the structure of the dataset
  2. Not giving proper care during features selection
  3. Leaving out categorical features and considering just numerical variables
  4. Falling into dummy variable trap
  5. Selection of inefficient machine learning algorithm
  6. Not trying out various ML algorithms for building the model based on structure of data.
  7. Improper tuning of model parameters
  8. Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
  9. Read more here…

[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]

Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.

However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….

The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.

Thus, the data science life-cycle can include the following steps:

  1. Business requirement understanding.
  2. Data collection.
  3. Data cleaning.
  4. Data analysis.
  5. Modeling.
  6. Performance evaluation.
  7. Communicating with stakeholders.
  8. Deployment.
  9. Real-world testing.
  10. Business buy-in.
  11. Support and maintenance.

Looks neat, but here is the scheme to visualize how it is happening in reality:

Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.

Read more here….

 

Top

[appbox appstore 1611045854-iphone screenshots]

[appbox microsoftstore  9n8rl80hvm4t-mobile screenshots]

Machine Learning Q&A -Part II:

 
 
 

At a high level, these skills are a combination of software and data engineering.

The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

  • Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
  • Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
  • Model versioning: add a hash key to your different models. You will thank me later.
  • Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
  • Monitor performances: execution time and statistical scores of your models.
  • Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..

Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:

  1. Not understanding the structure of the dataset
  2. Not giving proper care during features selection
  3. Leaving out categorical features and considering just numerical variables
  4. Falling into dummy variable trap
  5. Selection of inefficient machine learning algorithm
  6. Not trying out various ML algorithms for building the model based on structure of data.
  7. Improper tuning of model parameters
  8. Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
  9. Read more here…

Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.

However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….

The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.

Thus, the data science life-cycle can include the following steps:

  1. Business requirement understanding.
  2. Data collection.
  3. Data cleaning.
  4. Data analysis.
  5. Modeling.
  6. Performance evaluation.
  7. Communicating with stakeholders.
  8. Deployment.
  9. Real-world testing.
  10. Business buy-in.
  11. Support and maintenance.

Looks neat, but here is the scheme to visualize how it is happening in reality:

Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.

Read more here….

 

Top

AWS machine Learning Specialty Exam Prep MLS-C01

iOs: https://apps.apple.com/ca/app/aws-machine-learning-prep-pro/id1611045854

Windows: https://www.microsoft.com/en-ca/p/aws-machine-learning-mls-c01-specialty-certification-exam-prep/9n8rl80hvm4t

Android/Amazon: https://www.amazon.com/gp/product/B09TZ4H8V6

AWS MLS-C01 Machine Learning Exam Prep

Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A

Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.

Earning the AWS Certified Machine Learning Specialty certification validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.

The App provides hundreds of quizzes and practice exam questions about:

– Machine Learning Operation on AWS

– Modelling

– Data Engineering

– Computer Vision

– Exploratory Data Analysis

– ML implementation & Operations

– Machine Learning Basics Questions and Answers

– Machine Learning Advanced Questions and Answers

– Scorecard

– Countdown timer

– Machine Learning Cheat Sheets

– Machine Learning Interview Questions and Answers

– Machine Learning Latest News

The App covers Machine Learning Basics and Advanced topics including: NLP, Computer Vision, Python, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.

Domain 1: Data Engineering

Create data repositories for machine learning.

Identify data sources (e.g., content and location, primary sources such as user data)

Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)

Identify and implement a data ingestion solution.

Data job styles/types (batch load, streaming)

Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.

Domain 2: Exploratory Data Analysis

Sanitize and prepare data for modeling.

Perform feature engineering.

Analyze and visualize data for machine learning.

Domain 3: Modeling

Frame business problems as machine learning problems.

Select the appropriate model(s) for a given machine learning problem.

Train machine learning models.

Perform hyperparameter optimization.

Evaluate machine learning models.

Domain 4: Machine Learning Implementation and Operations

Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.

Recommend and implement the appropriate machine learning services and features for a given problem.

Apply basic AWS security practices to machine learning solutions.

Deploy and operationalize machine learning solutions.

Machine Learning Services covered:

Amazon Comprehend

AWS Deep Learning AMIs (DLAMI)

AWS DeepLens

Amazon Forecast

Amazon Fraud Detector

Amazon Lex

Amazon Polly

Amazon Rekognition

Amazon SageMaker

Amazon Textract

Amazon Transcribe

Amazon Translate

Other Services and topics covered are:

Ingestion/Collection

Processing/ETL

Data analysis/visualization

Model training

Model deployment/inference

Operational

AWS ML application services

Language relevant to ML (for example, Python, Java, Scala, R, SQL)

Notebooks and integrated development environments (IDEs),

S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue; data formats such as CSV, JSON, IMG, and Parquet, or databases

Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service, Amazon Redshift

SageMaker API Explained:

SageMaker API

AWS Certified Machine Learning Specialty (MLS-C01) Questions and Answers:

Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker with minimal changes to the existing code. How can the company achieve this?

Answer1: Use the Amazon SageMaker Bring Your Own Container (BYOC) option.
Create a new Docker container with the existing code, register the container in Amazon Elastic Container Registry (ECR), and then run the training and inference jobs using this container.
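
As a rough illustration of that BYOC flow (not the company's actual setup; the ECR image URI, IAM role, and S3 paths below are placeholders), the custom container can be launched with the SageMaker Python SDK once it has been pushed to ECR:

```python
# Minimal sketch of running a training job with a custom (BYOC) image via the SageMaker Python SDK.
# The ECR image URI, IAM role, and S3 paths are placeholders.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/r-xgboost:latest",  # custom R/XGBoost image
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model-artifacts/",
    sagemaker_session=session,
)

# The container's own training entry point reads this channel from /opt/ml/input/data/train.
estimator.fit({"train": "s3://my-bucket/training-data/"})
```

Inference follows the same pattern: the same image can then be deployed behind a real-time endpoint or used for batch transform jobs.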

Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?

 

Answer2: Amazon SageMaker notebook instances

Amazon SageMaker enables developers and data scientists to build, train, tune, and deploy machine learning (ML) models at scale. You can deploy trained ML models for real-time or batch predictions on unseen data, a process known as inference. However, in most cases, the raw input data must be preprocessed and can’t be used directly for making predictions. This is because most ML models expect the data in a predefined format, so the raw data needs to be first cleaned and formatted in order for the ML model to process the data.  You can use the Amazon SageMaker built-in Scikit-learn library for preprocessing input data and then use the Amazon SageMaker built-in Linear Learner algorithm for predictions.
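
For example, a minimal preprocessing pass you might run in a SageMaker notebook instance before handing the data to Linear Learner could look like the sketch below; the file path, column handling, and cleaning steps are illustrative assumptions, not a prescribed recipe:

```python
# Minimal preprocessing sketch: clean and scale raw tabular data before training.
# File path and cleaning choices are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("raw_data.csv")

# Basic cleaning: drop duplicate rows and fill missing numeric values with the column median.
df = df.drop_duplicates()
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Scale numeric features so they are on comparable ranges before feeding them to Linear Learner.
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

df.to_csv("preprocessed_data.csv", index=False)
```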

Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?

Answer3: LifeCycle Configuration
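
A lifecycle configuration is simply a script that runs when the notebook instance is created or started. As a rough sketch (the configuration name, library, and S3 path are placeholders), it can be registered with boto3 like this:

```python
# Sketch: register a notebook-instance lifecycle configuration that installs a library
# and copies a dataset from S3 on start. Names and the S3 path are placeholders.
import base64
import boto3

on_start_script = """#!/bin/bash
set -e
pip install --upgrade xgboost
aws s3 cp s3://my-bucket/datasets/train.csv /home/ec2-user/SageMaker/train.csv
"""

sm = boto3.client("sagemaker")
sm.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName="install-libs-and-data",
    OnStart=[{"Content": base64.b64encode(on_start_script.encode("utf-8")).decode("utf-8")}],
)
```

The configuration is then attached to the notebook instance when it is created, so the libraries and data are in place every time the instance starts.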

Question4: How to choose the right SageMaker built-in algorithm?

How to choose the right built-in algorithm in SageMaker
Guide to choosing the right unsupervised learning algorithm

Choosing the right ML algorithm based on Data Type

Choosing the right ML algorithm based on data type

These are general guides for choosing which algorithm to use depending on the business problem and the data you have.

 

Top

Top 10 Google Professional Machine Learning Engineer Sample Questions

Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?

A. Use K-fold cross validation to understand how the model performs on different test datasets.

B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.

C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.

D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.

Answer 1)

B

Notes 1)

B is correct because it identifies the pixels of the input image that lead to the classification of the image itself.

Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?

A. Train the model for a few iterations, and check for NaN values.
B. Train the model for a few iterations, and verify that the loss is constant.
C. Train a simple linear model, and determine if the DNN model outperforms it.
D. Train the model with no regularization, and verify that the loss function is close to zero.
 

Answer 2)

D

Notes 2)

D is correct because the test can check that the model has enough parameters to memorize the task.


Question 3: Your team is using a TensorFlow Inception-v3 CNN model pretrained on ImageNet for an image classification prediction challenge on 10,000 images. You will use AI Platform to perform the model training. What TensorFlow distribution strategy and AI Platform training job configuration should you use to train the model and optimize for wall-clock time?

 

A. Default Strategy; Custom tier with a single master node and four v100 GPUs.
B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.
C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.
D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.
 

Answer 3)

D

Notes 3)

D is correct because this is the only strategy that can perform distributed training; albeit there is only a single copy of the variables on the CPU host.
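
For context, a minimal TensorFlow sketch of what training under the Central Storage strategy looks like (the model and compile settings here are illustrative, not the exam scenario's exact configuration):

```python
# Sketch: Central Storage strategy keeps variables on the CPU host and mirrors compute across GPUs.
import tensorflow as tf

strategy = tf.distribute.experimental.CentralStorageStrategy()

with strategy.scope():
    # Inception-v3 sized model; weights=None just keeps the sketch self-contained.
    model = tf.keras.applications.InceptionV3(weights=None, classes=10)
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )

# model.fit(train_dataset, epochs=...) would then run distributed across the attached GPUs.
```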

Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?

 
A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version
 
B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version
 
C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment
D. Create a new AI Platform model version -> Deploy model in test environment -> Validate model
 
Answer 4)
A
 
Notes 4)
A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.
 
Question 5: You work for a maintenance company and have built and trained a deep learning model that identifies defects based on thermal images of underground electric cables. Your dataset contains 10,000 images, 100 of which contain visible defects. How should you evaluate the performance of the model on a test dataset?
 
A. Calculate the Area Under the Curve (AUC) value.
 
B. Calculate the number of true positive results predicted by the model.
C. Calculate the fraction of images predicted by the model to have a visible defect.
D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.
 
Answer 5)
A
 
Notes 5)
A is correct because it is scale-invariant. AUC measures how well predictions are ranked, rather than their absolute values. AUC is also classification-threshold invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.
 
Question 6: You work for a manufacturing company that owns a high-value machine which has several machine settings and multiple sensors. A history of the machine’s hourly sensor readings and known failure event data are stored in BigQuery. You need to predict if the machine will fail within the next 3 days in order to schedule maintenance before the machine fails. Which data preparation and model training steps should you take?

 

A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML
 
B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False
D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
Answer 6)
D
 
Notes 6)
D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
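
For readers who want to see what option D roughly looks like in practice, here is a sketch using the BigQuery Python client; the project, dataset, table, and column names are hypothetical, and the 24-row window assumes hourly readings:

```python
# Sketch: rolling-average feature plus a class-weighted BQML logistic regression.
# Project, dataset, table, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

create_model_sql = """
CREATE OR REPLACE MODEL `my_project.maintenance.failure_model`
OPTIONS (
  model_type = 'logistic_reg',
  auto_class_weights = TRUE,
  input_label_cols = ['failed_within_3_days']
) AS
SELECT
  failed_within_3_days,
  AVG(sensor_reading) OVER (
    PARTITION BY machine_id
    ORDER BY reading_ts
    ROWS BETWEEN 23 PRECEDING AND CURRENT ROW
  ) AS rolling_avg_24h
FROM `my_project.maintenance.sensor_readings`
"""

client.query(create_model_sql).result()  # waits for the training job to finish
```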
 
 
Question 7: You are an ML engineer at a media company. You need to build an ML model to analyze video content frame-by-frame, identify objects, and alert users if there is inappropriate content. Which Google Cloud products should you use to build this project?

 

A. Pub/Sub, Cloud Function, Cloud Vision API
 
B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging
C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging
D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging
 
Answer 7)
C
 
Notes 7)
C is correct because the Video Intelligence API can detect inappropriate content, and the other components satisfy the requirements for real-time processing and notification.
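
As a rough sketch of the core piece of option C (the Cloud Storage path is a placeholder), explicit content detection with the Video Intelligence API's Python client might look like this:

```python
# Sketch: run explicit content detection on a video stored in Cloud Storage.
# The bucket/object path is a placeholder.
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()

operation = client.annotate_video(
    request={
        "input_uri": "gs://my-bucket/videos/sample.mp4",
        "features": [videointelligence.Feature.EXPLICIT_CONTENT_DETECTION],
    }
)
result = operation.result(timeout=300)

# Each annotated frame carries a likelihood that it contains explicit content.
for frame in result.annotation_results[0].explicit_annotation.frames:
    print(frame.time_offset, frame.pornography_likelihood)
```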
 
Question 8: You work for a large retailer. You want to use ML to forecast future sales leveraging 10 years of historical sales data. The historical data is stored in Cloud Storage in Avro format. You want to rapidly experiment with all the available data. How should you build and train your model for the sales forecast?
 
A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.
B. Convert the data into CSV format and create a regression model on AutoML Tables.
C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.
D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.
 
Answer 8)
A
 
Notes 8)
A is correct because BigQuery ML is designed for fast and rapid experimentation and it is possible to use federated queries to read data directly from Cloud Storage. Moreover, ARIMA is considered one of the best in class for time series forecasting.
 
Question 9) You need to build an object detection model for a small startup company to identify if and where the company’s logo appears in an image. You were given a large repository of images, some with logos and some without. These images are not yet labelled. You need to label these pictures, and then train and deploy the model. What should you do?

 

A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.
B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.
 
C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.
D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.
 
Answer 9)
A
 
Notes 9)
A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.
 

Question 10) You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer’s stated intention for contacting customer service. About 70% of customer inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer and more complicated requests. Which intents should you automate first?

A. Automate a blend of the shortest and longest intents to be representative of all intents.
B. Automate the more complicated requests first because those require more of the agents’ time.
C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.
 
D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.
Answer 10)
C
 
Notes 10)

C is correct because automating the 10 simple, high-volume intents first delivers the most value quickly and frees live agents to focus on the longer, more complicated requests.


Machine Learning Q&A Part I:

Google.

Azure and AWS are second-class citizens in this area.

Sure, AWS has 70% of the market.

Sure, Azure is the easiest turnkey option and is super user-friendly.

But, the king of machine learning in the cloud is GCP.

GCP = Google Cloud Platform

Google has the largest data science team in the world, not to mention they have Hinton.

Let’s forget for a minute that they created TensorFlow and gave it away.

Let’s just talk about building a real-world model with data that doesn’t fit into an Excel spreadsheet.

The vast majority of applied machine learning is supervised and that means we need data.

Not just normal data, we need very clean highly structured data.

Where’s the easiest place in the world to upload and model a petabyte of structured data? BigQuery, of course.

Why BigQuery? I don’t have to do anything but upload my data. No spinning up Redshift clusters or whatever I have to do in Azure; I just upload and massage the data with my familiar SQL. If I do have to wrangle my data, it won’t take me six months to update 5 rows; it usually takes minutes.

Then, you’ll need a front end. Cloud Datalab is a Jupyter notebook, which is good because I don’t want nor need anything else.

Then, with a single line of code I connect my Datalab (Jupyter) notebook to my data in BigQuery and build away; a minimal sketch of what that looks like follows below.
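
In a notebook, that single line is essentially a query through the BigQuery Python client; a minimal sketch (the project, dataset, and table names are hypothetical) looks like this:

```python
# Sketch: pull a BigQuery result straight into a DataFrame from a notebook.
# The project, dataset, and table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
df = client.query("SELECT * FROM `my_project.analytics.events` LIMIT 1000").to_dataframe()
print(df.head())
```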

I’ve worked in all three, and the only thing I care about is getting my job done the fastest; right now that means I build my models in GCP.

If you’re new to machine learning, don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.

The course below is free to the first 20.

The Complete Python Course for Machine Learning Engineers

Here, I want to share the best research paper on Machine Learning classification methods, titled ‘Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?’, published in the ‘Journal of Machine Learning Research’.

The paper evaluates 179 classification techniques applied to 121 data sets; here is a short summary:

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

 
 
 

The paper evaluated 179 classifiers arising from 17 ML families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R (with and without the caret package), C and Matlab, including all the relevant classifiers available today.

Experiments used a total of 121 data sets, which represent the whole UCI database (excluding the large-scale problems) plus other real-world problems, in order to reach significant conclusions about classifier behaviour that do not depend on the data set collection.

The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz

The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy, exceeding 90% on 84.3% of the data sets. However, the difference is not statistically significant compared with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the rest: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0, and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package).

Random forest is clearly the best family of classifiers (3 of the 5 best classifiers are RF versions), followed by SVM (4 classifiers in the top 10), then neural networks and boosting ensembles (5 and 3 members in the top 20, respectively).

You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt
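
As a tiny illustration of the kind of comparison the paper runs at much larger scale (this is not a reproduction of its benchmark), one can cross-validate a random forest against a Gaussian-kernel SVM on a small scikit-learn dataset:

```python
# Minimal sketch: compare a random forest and an RBF-kernel SVM with 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

models = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```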

I hope it will be helpful for statistics and machine learning aspirants!

Thank you!

 
 
 

At a high level, these skills are a combination of software and data engineering.

The people most appropriate for this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

  • Well-structured code: it doesn’t need to be perfect, but it should at least be understandable and updatable by other team members. Avoid spaghetti code[1] like the plague.
  • Add logs: if you are a Python user, the logging[2] module is your friend; avoid print statements at all costs (a minimal logging-and-versioning sketch follows this list).
  • Model versioning: add a hash key to your different models. You will thank me later.
  • Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
  • Monitor performances: execution time and statistical scores of your models.
  • Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..
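
Here is a minimal sketch of the logging and model-versioning habits above; the logger name and the model file path are hypothetical:

```python
# Sketch: basic logging setup plus a content hash that can serve as a model version key.
# The logger name and model file path are placeholders.
import hashlib
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("training")

def model_version(path: str) -> str:
    """Return a short content hash that uniquely identifies this model artifact."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:12]

logger.info("Training finished")
logger.info("Model version: %s", model_version("model.pkl"))
```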


 

Top


 

AWS machine Learning Specialty Exam Prep MLS-C01

iOs: https://apps.apple.com/ca/app/aws-machine-learning-prep-pro/id1611045854

Windows: https://www.microsoft.com/en-ca/p/aws-machine-learning-mls-c01-specialty-certification-exam-prep/9n8rl80hvm4t

Android/Amazon: https://www.amazon.com/gp/product/B09TZ4H8V6

AWS MLS-C01 Machine Learning Exam Prep

Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A

Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.

Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.

The App provides hundreds of quizzes and practice exam about:

– Machine Learning Operation on AWS

– Modelling

– Data Engineering

– Computer Vision,

– Exploratory Data Analysis,

– ML implementation & Operations

– Machine Learning Basics Questions and Answers

– Machine Learning Advanced Questions and Answers

– Scorecard

– Countdown timer

– Machine Learning Cheat Sheets

– Machine Learning Interview Questions and Answers

– Machine Learning Latest News

The App covers Machine Learning Basics and Advanced topics including: NLP, Computer Vision, Python, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.

Domain 1: Data Engineering

Create data repositories for machine learning.

Identify data sources (e.g., content and location, primary sources such as user data)

Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)

Identify and implement a data ingestion solution.

Data job styles/types (batch load, streaming)

Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.

Domain 2: Exploratory Data Analysis

Sanitize and prepare data for modeling.

Perform feature engineering.

Analyze and visualize data for machine learning.

Domain 3: Modeling

Frame business problems as machine learning problems.

Select the appropriate model(s) for a given machine learning problem.

Train machine learning models.

Perform hyperparameter optimization.

Evaluate machine learning models.

Domain 4: Machine Learning Implementation and Operations

Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.

Recommend and implement the appropriate machine learning services and features for a given problem.

Apply basic AWS security practices to machine learning solutions.

Deploy and operationalize machine learning solutions.

Machine Learning Services covered:

Amazon Comprehend

AWS Deep Learning AMIs (DLAMI)

AWS DeepLens

Amazon Forecast

Amazon Fraud Detector

Amazon Lex

Amazon Polly

Amazon Rekognition

Amazon SageMaker

Amazon Textract

Amazon Transcribe

Amazon Translate

Other Services and topics covered are:

Ingestion/Collection

Processing/ETL

Data analysis/visualization

Model training

Model deployment/inference

Operational

AWS ML application services

Language relevant to ML (for example, Python, Java, Scala, R, SQL)

Notebooks and integrated development environments (IDEs),

S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue, SageMaker, CSV, JSON, IMG, parquet or databases, Amazon Athena

Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service , Amazon Redshift

Sagemaker API Explained:

SageMaker API

AWS Certified Machine Learning Engineer Specialty Questions and Answers:

Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.

Answer1: Use the Build Your Own Container (BYOC) Amazon Sagemaker option.
Create a new docker container with the existing code. Register the container in Amazon Elastic Container registry. with the existing code. Register the container in Amazon Elastic Container Registry. Finally run the training and inference jobs using this container.

Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?

 

Answer2: Amazon Sagemaker Notebook instances

Amazon SageMaker enables developers and data scientists to build, train, tune, and deploy machine learning (ML) models at scale. You can deploy trained ML models for real-time or batch predictions on unseen data, a process known as inference. However, in most cases, the raw input data must be preprocessed and can’t be used directly for making predictions. This is because most ML models expect the data in a predefined format, so the raw data needs to be first cleaned and formatted in order for the ML model to process the data.  You can use the Amazon SageMaker built-in Scikit-learn library for preprocessing input data and then use the Amazon SageMaker built-in Linear Learner algorithm for predictions.

Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?

Answer3: LifeCycle Configuration

Question4: How to Choose the right Sagemaker built-in algorithm?

How to chose the right built in algorithm in SageMaker?
How to chose the right built in algorithm in SageMaker?
Guide to choosing the right unsupervised learning algorithm
Guide to choosing the right unsupervised learning algorithm

 

Choosing the right  ML algorithm based on Data Type
Choosing the right ML algorithm based on Data Type

 

Choosing the right ML algo based on data type
Choosing the right ML algo based on data type

This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have. 

 

Top

Top 10 Google Professional Machine Learning Engineer Sample Questions

Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?

A. Use K-fold cross validation to understand how the model performs on different test datasets.

B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.

C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.

D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.

Answer 1)

B

Notes 1)

B is correct because it identifies the pixel of the input image that leads to the classification of the image itself.

Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?

A. Train the model for a few iterations, and check for NaN values.
B. Train the model for a few iterations, and verify that the loss is constant.
C. Train a simple linear model, and determine if the DNN model outperforms it.
D. Train the model with no regularization, and verify that the loss function is close to zero.
 

Answer 2)

D

Notes 2)

D is correct because the test can check that the model has enough parameters to memorize the task.

[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]

Question 3: Your team is using a TensorFlow Inception-v3 CNN model pretrained on ImageNet for an image classification prediction challenge on 10,000 images. You will use AI Platform to perform the model training. What TensorFlow distribution strategy and AI Platform training job configuration should you use to train the model and optimize for wall-clock time?

 

A. Default Strategy; Custom tier with a single master node and four v100 GPUs.
B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.
C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.
D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.
 

Answer 3)

D

Notes 3)

D is correct because this is the only strategy that can perform distributed training; albeit there is only a single copy of the variables on the CPU host.

Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?

 
A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version
 
B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version
 
C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment
D. Create a new AI Platform model version – > Deploy model in test environment -> Validate model
 
Answer 4)
A
 
Notes 4)
A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.
 
Question 5: You work for a maintenance company and have built and trained a deep learning model that identifies defects based on thermal images of underground electric cables. Your dataset contains 10,000 images, 100 of which contain visible defects. How should you evaluate the performance of the model on a test dataset?
 
A. Calculate the Area Under the Curve (AUC) value.
 
B. Calculate the number of true positive results predicted by the model.
C. Calculate the fraction of images predicted by the model to have a visible defect.
D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.
 
Answer 5)
A
 
Notes 5)
A is correct because it is scale-invariant. AUC measures how well predictions are ranked, rather than their absolute values. AUC is also classification-threshold invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.
 
Question 6: You work for a manufacturing company that owns a high-value machine which has several machine settings and multiple sensors. A history of the machine’s hourly sensor readings and known failure event data are stored in BigQuery. You need to predict if the machine will fail within the next 3 days in order to schedule maintenance before the machine fails. Which data preparation and model training steps should you take?

 

A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML
 
B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False
D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
Answer 6)
D
 
Notes 6)
D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
 
 
Question 7: You are an ML engineer at a media company. You need to build an ML model to analyze video content frame-by-frame, identify objects, and alert users if there is inappropriate content. Which Google Cloud products should you use to build this project?

 

A. Pub/Sub, Cloud Function, Cloud Vision API
 
B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging
C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging
D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging
 
Answer 7)
C
 
Notes 7)
C is correct as Video Intelligence API can find inappropriate components and other components satisfy the requirements of real-time processing and notification.
 
Question 8: You work for a large retailer. You want to use ML to forecast future sales leveraging 10 years of historical sales data. The historical data is stored in Cloud Storage in Avro format. You want to rapidly experiment with all the available data. How should you build and train your model for the sales forecast?
 
A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.
B. Convert the data into CSV format and create a regression model on AutoML Tables.
C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.
D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.
 
Answer 8)
A
 
Notes 8)
A is correct because BigQuery ML is designed for fast and rapid experimentation and it is possible to use federated queries to read data directly from Cloud Storage. Moreover, ARIMA is considered one of the best in class for time series forecasting.
 
Question 9) You need to build an object detection model for a small startup company to identify if and where the company’s logo appears in an image. You were given a large repository of images, some with logos and some without. These images are not yet labelled. You need to label these pictures, and then train and deploy the model. What should you do?

 

A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.
B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.
 
C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.
D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.
 
Answer 9)
A
 
Notes 9)
A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.
 

Question 10) You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer’s stated intention for contacting customer service. About 70% of customer inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer and more complicated requests. Which intents should you automate first?

A. Automate a blend of the shortest and longest intents to be representative of all intents.
B. Automate the more complicated requests first because those require more of the agents’ time.
C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.
 
D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.
Answer 10)
C
 
Notes 10)

[appbox appstore 1611045854-iphone screenshots]

[appbox microsoftstore  9n8rl80hvm4t-mobile screenshots]

Machine Learning Q&A Part I:

Google.

Azure and AWS are second class citizens in this area.

Sure, AWS has 70% of the market.

Sure, Azure is the easiest turn key and super user friendly.

But, the king of machine learning in the cloud is GCP.

GCP = Google Cloud Platform

Google has the largest data science team in the world, not mention they have Hinton.

Let’s forgot for a minute they created TensorFlow and give it away.

Let’s just talk about building a real world model with data that doesn’t fit into a excel spreadsheet.

The vast majority of applied machine learning is supervised and that means we need data.

Not just normal data, we need very clean highly structured data.

Where’s the easiest place in the world to upload and model a Petabyte of structured dataBigQuery of course.

Why BigQuery? I don’t have to do anything but upload my data. No spinning up RedShit clusters or whatever I have to do in Azure, just upload and massage data with my familiar SQL. If I do have to wrangle my data it won’t take my six months to update 5 rows here, minutes usually.

Then, you’ll need a front end. Cloud datalab is a Jupyter notebook, which is good because I don’t want nor do I need anything else.

Then, with a single line of code I connect by datalab (Jupyter) notebook to my data in BigQuery and build away.

I’ve worked in all three and the only thing I care about is getting to my job the fastest and right now that means I build my models in GCP.

If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.

The course below is free to the first 20.

The Complete Python Course for Machine Learning Engineers

Here, I want to share the best research paper on Machine Learning classification methods, titled ‘Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?’, published in the ‘Journal of Machine Learning Research’.

This paper nicely explained 179 classification techniques and applied them on 121 data sets thus sharing small summary of the paper:

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

 
 
 

The paper evaluated 179 classifiers arising from 17 ML families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R ( with and without the caret package), C and Matlab, including all the relevant classifiers available today.

Experiments used total 121 data sets , which represent the whole UCI data base (excluding the large-scale problems) and other own real problems, in order to achieve significant conclusions about the classifier behaviour, not dependent on the data set collection.

The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz

The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets. However, the difference is not statistically significant with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package).

The random forest is clearly the best family of classifiers (3 out of 5 bests classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).

You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt

I hope it will be helpful for Statistic and Machine Leaning aspirants!

Thank you!

 
 
 

At a high level, these skills are a combination of software and data engineering.

The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

  • Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
  • Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
  • Model versioning: add a hash key to your different models. You will thank me later.
  • Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
  • Monitor performances: execution time and statistical scores of your models.
  • Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..

Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:

  1. Not understanding the structure of the dataset
  2. Not giving proper care during features selection
  3. Leaving out categorical features and considering just numerical variables
  4. Falling into dummy variable trap
  5. Selection of inefficient machine learning algorithm
  6. Not trying out various ML algorithms for building the model based on structure of data.
  7. Improper tuning of model parameters
  8. Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
  9. Read more here…

[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]

Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.

However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….

The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.

Thus, the data science life-cycle can include the following steps:

  1. Business requirement understanding.
  2. Data collection.
  3. Data cleaning.
  4. Data analysis.
  5. Modeling.
  6. Performance evaluation.
  7. Communicating with stakeholders.
  8. Deployment.
  9. Real-world testing.
  10. Business buy-in.
  11. Support and maintenance.

Looks neat, but here is the scheme to visualize how it is happening in reality:

Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.

Read more here….

 

Top

[appbox appstore 1611045854-iphone screenshots]

[appbox microsoftstore  9n8rl80hvm4t-mobile screenshots]

Machine Learning Q&A -Part II:

 
 
 

At a high level, these skills are a combination of software and data engineering.

The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

  • Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
  • Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
  • Model versioning: add a hash key to your different models. You will thank me later.
  • Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
  • Monitor performances: execution time and statistical scores of your models.
  • Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..

Some of the mistakes that can occur while building a machine learning model (the ones I can think of) are listed here:

  1. Not understanding the structure of the dataset
  2. Not giving proper care during features selection
  3. Leaving out categorical features and considering just numerical variables
  4. Falling into dummy variable trap
  5. Selection of inefficient machine learning algorithm
  6. Not trying out various ML algorithms for building the model based on the structure of the data.
  7. Improper tuning of model parameters
  8. Most importantly: building a naively imperfect model. Suppose we have a classification problem where 99% of the samples fall into class 1 and the remaining 1% into class 2. The model may learn a mapping that predicts class 1 for every input, and one might then claim the model has 99% accuracy. In reality, the 1% of class 2 cases is never identified, so this must be taken into consideration (a minimal sketch of this accuracy trap follows the list).
  9. Read more here…
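
As a hedged illustration of point 8 (not from the source article), the snippet below uses a scikit-learn DummyClassifier on a synthetic 99/1 class split: plain accuracy looks excellent while recall on the minority class is zero.

```python
# Minimal sketch of the accuracy trap: on a ~99/1 class split, a model that
# always predicts the majority class scores ~99% accuracy but never detects
# the minority class. Recall (or F1) on class 1 exposes the problem.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.01).astype(int)   # roughly 1% positive class

majority = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = majority.predict(X)
print("accuracy:", accuracy_score(y, pred))          # ~0.99
print("recall on class 1:", recall_score(y, pred))   # 0.0
```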

Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.


 


Top 10 Machine Learning Algorithms

Source: Top 10 Machine Learning Algorithms for Data Scientist

In machine learning, there’s something called the “No Free Lunch” theorem. In a nutshell, it states that no one algorithm works best for every problem. It’s especially relevant for supervised learning. For example, you can’t say that neural networks are always better than decision trees or vice-versa. Furthermore, there are many factors at play, such as the size and structure of your dataset. As a result, you should try many different algorithms for your problem!

Top ML Algorithms

1. Linear Regression

Regression is a technique for numerical prediction. It is a statistical measure that attempts to determine the strength of the relationship between a dependent variable and a series of other changing variables, the independent variables. Just as classification is used for predicting categorical labels, regression is used for predicting a continuous value. For example, we may wish to predict the salary of university graduates with 5 years of work experience. We use regression to determine how much specific factors or sectors influence the dependent variable.

Linear regression attempts to model the relationship between a scalar variable and explanatory variables by fitting a linear equation. For example, one might want to relate the weights of individuals to their heights using a linear regression model.

Some implementations also use the Akaike information criterion (AIC) for model selection; the AIC is a measure of the relative goodness of fit of a statistical model.
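
As a minimal sketch (the salary example uses synthetic numbers, not data from the source), here is a one-variable linear regression fitted with scikit-learn:

```python
# Minimal sketch: fit salary as a linear function of years of experience
# on synthetic data, then read off the fitted slope and intercept.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
years = rng.uniform(0, 10, size=(200, 1))                           # predictor
salary = 40_000 + 6_000 * years[:, 0] + rng.normal(0, 5_000, 200)   # response

model = LinearRegression().fit(years, salary)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted salary at 5 years:", model.predict([[5.0]])[0])
```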

2. Logistic Regression

Logistic regression is a classification model. It uses input variables to predict a categorical outcome variable. The variable can take on one of a limited set of class values. A binomial logistic regression relates to two binary output categories. A multinomial logistic regression allows for more than two classes. Examples of logistic regression include classifying a binary condition as “healthy” / “not healthy”. Logistic regression applies the logistic sigmoid function to weighted input values to generate a prediction of the data class.

A logistic regression model estimates the probability of a dependent variable as a function of independent variables. The dependent variable is the output that we are trying to predict. The independent variables or explanatory variables are the factors that we feel could influence the output. Multiple regression refers to regression analysis with two or more independent variables. Multivariate regression, on the other hand, refers to regression analysis with two or more dependent variables.
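
A minimal sketch (synthetic data; the "healthy" / "not healthy" labels are hypothetical) of a binomial logistic regression; predict_proba exposes the sigmoid output:

```python
# Minimal sketch: binary logistic regression on synthetic data; the model's
# predict_proba applies the logistic sigmoid to the weighted inputs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 300) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
print("P(class = 1):", clf.predict_proba(X[:3])[:, 1])
print("predicted classes:", clf.predict(X[:3]))
```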

3. Linear Discriminant Analysis

Logistic Regression is a classification algorithm traditionally for two-class classification problems. If you have more than two classes then the Linear Discriminant Analysis algorithm is the preferred linear classification technique.

The representation of LDA is pretty straightforward. It consists of statistical properties of your data, calculated for each class. For a single input variable this includes:

  1. The mean value for each class.
  2. The variance calculated across all classes.

We make predictions by calculating a discriminant value for each class and predicting the class with the largest value. The technique assumes that the data has a Gaussian distribution, so it is a good idea to remove outliers from your data beforehand. LDA is a simple and powerful method for classification predictive modelling problems.
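
A minimal sketch (synthetic three-class data, not from the source) using scikit-learn's LinearDiscriminantAnalysis, which stores the per-class means described above:

```python
# Minimal sketch: LDA on a synthetic three-class problem; means_ holds one
# mean vector per class, which is what the discriminant functions use.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
lda = LinearDiscriminantAnalysis().fit(X, y)
print("class means shape:", lda.means_.shape)   # (3 classes, 6 features)
print("training accuracy:", lda.score(X, y))
```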

4. Classification and Regression Trees

Prediction trees predict a response or class Y from inputs X1, X2, …, Xn. If the response is continuous it is a regression tree; if it is categorical, it is a classification tree. At each node of the tree, we check the value of one of the inputs Xi; depending on the (binary) answer we continue down the left or the right subbranch. When we reach a leaf, we find the prediction.

Contrary to linear or polynomial regression, which are global models, trees try to partition the data space into parts small enough that a simple, separate model can be applied to each one. The non-leaf part of the tree is just the procedure that determines, for each data point x, which model will be used to classify it.
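
A minimal sketch (Iris data, shallow tree) of a classification tree; export_text prints the learned splits so you can see the partitioning described above:

```python
# Minimal sketch: a depth-3 classification tree on the Iris dataset; each
# internal node tests one feature and each leaf stores a class prediction.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["sepal_len", "sepal_wid",
                                       "petal_len", "petal_wid"]))
print("prediction for the first sample:", tree.predict(X[:1]))
```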

5. Naive Bayes

A Naive Bayes Classifier is a supervised machine-learning algorithm that applies Bayes’ Theorem under the assumption that features are statistically independent. The classifier relies on the naive assumption that input variables are independent of each other, i.e. knowing the value of one variable tells us nothing about the others. Despite this assumption, it has proven itself to be a classifier with good results.

Naive Bayes Classifiers rely on Bayes’ Theorem, which is based on conditional probability or, in simple terms, the likelihood that an event (A) will happen given that another event (B) has already happened. Essentially, the theorem allows a hypothesis to be updated each time new evidence is introduced. In the language of probability, Bayes’ Theorem is:

P(A | B) = P(B | A) × P(A) / P(B)

Let’s explain what each of these terms means.

  • “P” is the symbol used to denote probability.
  • P(A | B) = The probability of event A (hypothesis) occurring given that B (evidence) has occurred.
  • P(B | A) = The probability of event B (evidence) occurring given that A (hypothesis) has occurred.
  • P(A) = The probability of event A (hypothesis) occurring.
  • P(B) = The probability of event B (evidence) occurring.
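
A minimal sketch (the probabilities and data below are made up, not from the article): Bayes' theorem applied directly to illustrative numbers, followed by a Gaussian Naive Bayes classifier on synthetic data:

```python
# Minimal sketch: Bayes' theorem with illustrative numbers, then Gaussian
# Naive Bayes on synthetic data (features assumed conditionally independent).
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

p_a, p_b_given_a, p_b = 0.01, 0.90, 0.05        # made-up probabilities
print("P(A|B) =", p_b_given_a * p_a / p_b)      # 0.18

X, y = make_classification(n_samples=200, n_features=4, random_state=1)
nb = GaussianNB().fit(X, y)
print("training accuracy:", nb.score(X, y))
```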

6. K-Nearest Neighbors

k-nearest neighbours (or k-NN for short) is a simple machine learning algorithm that categorizes an input by using its k nearest neighbours.

For example, suppose a k-NN algorithm is given data points describing specific men’s and women’s weights and heights. To determine the gender of an unknown input (the green point in the accompanying plot), k-NN looks at its k nearest neighbours and, by majority vote, determines that the input’s gender is male. This is a very simple and logical way of labelling unknown inputs, with a high rate of success.

Also, we can use k-NN in a variety of machine learning tasks; for example, in computer vision, k-NN can help identify handwritten letters, and in gene expression analysis, the algorithm can determine which genes contribute to a certain characteristic. Overall, k-nearest neighbours provides a combination of simplicity and effectiveness that makes it an attractive algorithm for many machine learning tasks.
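
A minimal sketch of the height/weight example (the numbers are invented): classify a new point by majority vote among its three nearest neighbours:

```python
# Minimal sketch: k-NN with k=3 on a tiny, made-up height/weight dataset;
# the new point is labelled by majority vote among its nearest neighbours.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

height_weight = np.array([[180, 85], [175, 80], [170, 72],
                          [160, 55], [165, 58], [158, 50]])
gender = np.array(["male", "male", "male", "female", "female", "female"])

knn = KNeighborsClassifier(n_neighbors=3).fit(height_weight, gender)
print(knn.predict([[172, 70]]))   # vote among the 3 closest points
```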

7. Learning Vector Quantization

A downside of K-Nearest Neighbors is that you need to hang on to your entire training dataset. The Learning Vector Quantization algorithm (or LVQ for short) is an artificial neural network algorithm that allows you to choose how many training instances to hang onto and learns exactly what those instances should look like.

Additionally, the representation for LVQ is a collection of codebook vectors. They are selected randomly at the start and adapted to best summarize the training dataset over a number of iterations of the learning algorithm. Once learned, the codebook vectors can make predictions just like K-Nearest Neighbors: we find the most similar neighbour (the best matching codebook vector) by calculating the distance between each codebook vector and the new data instance. The class value (or real value in the case of regression) of the best matching unit is then returned as the prediction. Moreover, you tend to get the best results if you rescale your data to have the same range, such as between 0 and 1.

If you discover that KNN gives good results on your dataset try using LVQ to reduce the memory requirements of storing the entire training dataset.
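
Scikit-learn has no LVQ estimator, so the following is a rough from-scratch sketch of the LVQ1 update rule described above (synthetic data, arbitrary learning-rate schedule), not a production implementation:

```python
# Rough LVQ1 sketch: pull the best-matching codebook vector toward same-class
# samples and push it away from different-class samples, with a decaying rate.
import numpy as np

def train_lvq1(X, y, n_codebooks_per_class=2, lr=0.3, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    cb_X, cb_y = [], []
    for c in np.unique(y):                       # init codebooks from random samples
        idx = rng.choice(np.where(y == c)[0], n_codebooks_per_class, replace=False)
        cb_X.append(X[idx]); cb_y.append(y[idx])
    cb_X, cb_y = np.vstack(cb_X).astype(float), np.concatenate(cb_y)

    for epoch in range(epochs):
        rate = lr * (1.0 - epoch / epochs)       # linearly decaying learning rate
        for xi, yi in zip(X, y):
            best = np.argmin(np.linalg.norm(cb_X - xi, axis=1))
            sign = 1.0 if cb_y[best] == yi else -1.0
            cb_X[best] += sign * rate * (xi - cb_X[best])
    return cb_X, cb_y

def predict_lvq(cb_X, cb_y, X):
    dists = np.linalg.norm(cb_X[None, :, :] - X[:, None, :], axis=2)
    return cb_y[np.argmin(dists, axis=1)]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
cb_X, cb_y = train_lvq1(X, y)
print("training accuracy:", (predict_lvq(cb_X, cb_y, X) == y).mean())
```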

8. Bagging and Random Forest

A Random Forest consists of a collection or ensemble of simple tree predictors, each capable of producing a response when presented with a set of predictor values. For classification problems, this response takes the form of a class membership, which associates, or classifies, a set of independent predictor values with one of the categories present in the dependent variable. Alternatively, for regression problems, the tree response is an estimate of the dependent variable given the predictors.

A Random Forest consists of an arbitrary number of simple trees, which determine the final outcome. For classification problems, the ensemble of simple trees votes for the most popular class. In the regression problem, we average responses to obtain an estimate of the dependent variable. Using tree ensembles can lead to significant improvement in prediction accuracy (i.e., better ability to predict new data cases).
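
A minimal sketch on synthetic data (not from the source) comparing a single tree with a random forest, to show the accuracy gain from the ensemble:

```python
# Minimal sketch: a random forest of 200 trees vs. a single decision tree on
# a held-out split of a synthetic classification dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("single tree accuracy:", tree.score(X_te, y_te))
print("random forest accuracy:", forest.score(X_te, y_te))
```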

9. SVM

A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes. SVMs are more commonly used in classification problems, and as such, that is what we will focus on in this post.

SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes, as shown in the image below.

Also, you can think of a hyperplane as a line that linearly separates and classifies a set of data.

Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We, therefore, want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it.

So when we add a new testing data point, whichever side of the hyperplane it lands on decides the class that we assign to it.

The distance between the hyperplane and the nearest data point from either set is the margin. Furthermore, the goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of correct classification of data.

But the data is rarely ever as clean as our simple example above. A dataset will often look more like a jumbled mix of overlapping points, i.e. a linearly non-separable dataset.
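
A minimal sketch (the two-moons toy dataset stands in for the "jumbled", non-separable case): a linear-kernel SVM next to an RBF-kernel SVM, the usual way non-separable data is handled; the kernel choice here is an illustrative assumption, not something prescribed by the article:

```python
# Minimal sketch: linear vs. RBF-kernel SVM on the non-linearly-separable
# "two moons" toy dataset; the kernel lets the SVM separate the classes.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("linear kernel accuracy:", linear_svm.score(X, y))
print("RBF kernel accuracy:", rbf_svm.score(X, y))
```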

10. Boosting and AdaBoost

Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers. We do this by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. We can add models until the training set is predicted perfectly or a maximum number of models are added.

AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting. Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.

AdaBoost is used with short decision trees. After the first tree is created, the performance of the tree on each training instance is used to weight how much attention the next tree should pay to each training instance. Training data that is hard to predict is given more weight, whereas easy-to-predict instances are given less weight. Models are created sequentially one after the other, each updating the weights on the training instances that affect the learning performed by the next tree in the sequence. After all the trees are built, predictions are made for new data, and the performance of each tree is weighted by how accurate it was on the training data.

Because so much attention is put on correcting mistakes by the algorithm it is important that you have clean data with outliers removed.
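
A minimal sketch on synthetic data; scikit-learn's AdaBoostClassifier uses depth-1 decision trees ("stumps") as its default weak learner, which matches the "short decision trees" described above:

```python
# Minimal sketch: AdaBoost with its default depth-1 tree ("stump") base
# learner; each new stump focuses on instances the previous ones got wrong.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
boost = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
print("training accuracy:", boost.score(X, y))
```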

Summary

A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is “which algorithm should I use?” The answer to the question varies depending on many factors, including: (1) The size, quality, and nature of data; (2) The available computational time; (3) The urgency of the task; and (4) What you want to do with the data.

Even an experienced data scientist cannot tell which algorithm will perform the best before trying different algorithms. Although there are many other Machine Learning algorithms, these are the most popular ones. If you’re a newbie to Machine Learning, these would be a good starting point to learn.



The foundations of most algorithms lie in linear algebra, multivariable calculus, and optimization methods. Most algorithms use a sequence of combinations to estimate an objective function given a set of data, and the sequence order and included methods distinguish one algorithm from another. It’s helpful to learn enough math to read the development papers associated with key algorithms in the field, as many other methods (or one’s own innovations) include pieces of those algorithms. It’s like learning the language of machine learning. Once you are fluent in it, it’s pretty easy to modify algorithms as needed and create new ones likely to improve on a problem in a short period of time.

Matrix factorization: a simple, beautiful way to do dimensionality reduction —and dimensionality reduction is the essence of cognition. Recommender systems would be a big application of matrix factorization. Another application I’ve been using over the years (starting in 2010 with video data) is factorizing a matrix of pairwise mutual information (or pointwise mutual information, which is more common) between features, which can be used for feature extraction, computing word embeddings, computing label embeddings (that was the topic of a recent paper of mine [1]), etc.

Used in a convolutional setting, this acts as an excellent unsupervised feature extractor for images and videos. There is one big issue though: it is fundamentally a shallow algorithm. Deep neural networks will quickly outperform it if any kind of supervision labels are available.

[1] [1607.05691] Information-theoretical label embeddings for large-scale image classification
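
A hedged, minimal sketch of matrix factorization for dimensionality reduction (the sparse "ratings" are random filler, and truncated SVD is just one of several factorization choices, not the method from the cited paper):

```python
# Minimal sketch: factorize a sparse user-by-item matrix with truncated SVD;
# the low-dimensional user and item factors can back a simple recommender.
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

ratings = sparse_random(1000, 500, density=0.02, random_state=0)  # users x items
svd = TruncatedSVD(n_components=20, random_state=0)
user_factors = svd.fit_transform(ratings)   # shape (1000, 20)
item_factors = svd.components_.T            # shape (500, 20)
# approximate affinity of user 0 for item 10:
print(user_factors[0] @ item_factors[10])
```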

Machine Learning Demos:

1- TensorFlow Demos

LipSync by YouTube

See how well you synchronize to the lyrics of the popular hit “Dance Monkey.” This in-browser experience uses the Facemesh model to estimate key points around the lips and score lip-syncing accuracy.

Emoji Scavenger Hunt

Use your phone’s camera to identify emojis in the real world. Can you find all the emojis before time expires?

Webcam Controller

Play Pac-Man using images trained in your browser.

Teachable Machine

No coding required! Teach a machine to recognize images and play sounds.

Move Mirror

Explore pictures in a fun new way, just by moving around.

Performance RNN

Enjoy a real-time piano performance by a neural network.

Node.js Pitch Prediction

Train a server-side model to classify baseball pitch types using Node.js.

Visualize Model Training

See how to visualize in-browser training and model behaviour using tfjs-vis.

Community demos

Get started with official templates and explore top picks from the community for inspiration.

  • Glitch: check out community Glitches and make your own TensorFlow.js-powered projects.
  • CodePen: fork boilerplate templates and check out working examples from the community.
  • GitHub Community Projects: see what the community has created and submitted to the TensorFlow.js gallery page.

https://cdpn.io/jasonmayes/fullcpgrid/QWbNeJd

Real time body segmentation using TensorFlow.js

Load in a pre-trained Body-Pix model from the TensorFlow.js team so that you can locate all pixels in an image that are part of a body, and what part of the body they belong to. Clone this to make your own TensorFlow.js powered projects to recognize body parts in images from your webcam and more!

https://cdpn.io/jasonmayes/fullcpgrid/qBEJxgg

Multiple object detection using a pre-trained model in TensorFlow.js

This demo shows how we can use a pre-made machine learning solution to recognize objects (yes, more than one at a time!) on any image you wish to present to it. Even better, not only do we know that the image contains an object, but we can also get the co-ordinates of the bounding box for each object it finds, which allows you to highlight the found object in the image.

For this demo we are loading a model using the ImageNet-SSD architecture, to recognize 90 common objects it has already been taught to find from the COCO dataset.

If what you want to recognize is in that list of things it knows about (for example a cat, dog, etc), this may be useful to you as is in your own projects, or just to experiment with Machine Learning in the browser and get familiar with the possibilities of machine learning.

If you are feeling particularly confident you can check out our GitHub documentation (https://github.com/tensorflow/tfjs-models/tree/master/coco-ssd) which goes into much more detail for customizing various parameters to tailor performance to your needs.

https://cdpn.io/jasonmayes/fullcpgrid/Jjompww

Classifying images using a pre-trained model in TensorFlow.js

This demo shows how we can use a pre-made machine learning solution to classify images (i.e., an image classifier). It should be noted that this model works best when a single item is in the image at a time; busy images may not work so well. You may want to try our demo for Multiple Object Detection (https://codepen.io/jasonmayes/pen/qBEJxgg) for that.

For this demo we are loading a model using the MobileNet architecture, to recognize 1000 common objects it has already been taught to find from the ImageNet data set (http://image-net.org/).

If what you want to recognize is in that list of things it knows about (for example a cat, dog, etc), this may be useful to you as is in your own projects, or just to experiment with Machine Learning in the browser and get familiar with the possibilities of machine learning.

Please note: This demo loads an easy-to-use JavaScript class made by the TensorFlow.js team to do the hard work for you, so no machine learning knowledge is needed to use it.

If you were looking to learn how to load in a TensorFlow.js saved model directly yourself then please see our tutorial on loading TensorFlow.js models directly.

If you want to train a system to recognize your own objects, using your own data, then check out our tutorials on “transfer learning”.


Tensorflow.js Boilerplate

The hello world for TensorFlow.js 🙂 Absolute minimum needed to import into your website and simply prints the loaded TensorFlow.js version. From here we can do great things. Clone this to make your own TensorFlow.js powered projects or if you are following a tutorial that needs TensorFlow.js to work.


Examples

tfjs-examples provides small code examples that implement various ML tasks using TensorFlow.js.

MNIST Digit Recognizer

Train a model to recognize handwritten digits from the MNIST database.

Addition RNN

Train a model to learn addition from text examples.

TensorFlow.js Layers: Iris Demo

More TensorFlow examples

Top-paying Cloud certifications:


  1. Google Certified Professional Cloud Architect — $175,761/year
  2. AWS Certified Solutions Architect – Associate — $149,446/year
  3. Azure/Microsoft Cloud Solution Architect – $141,748/yr
  4. Google Cloud Associate Engineer – $145,769/yr
  5. AWS Certified Cloud Practitioner — $131,465/year
  6. Microsoft Certified: Azure Fundamentals — $126,653/year
  7. Microsoft Certified: Azure Administrator Associate — $125,993/year

Complete overview of machine learning concepts seen in 27 data science and machine learning interviews:

Supervised Learning

Linear Regression

Logistic Regression

Naive Bayes

Support Vector Machines

Decision Trees

K-Nearest Neighbors

Test your knowledge

Machine Learning in Practice

Bias-Variance Tradeoff

How to Select a Model

How to Select Features

Regularizing Your Model

Ensembling: How to Combine Your Models

Evaluation Metrics

Unsupervised Learning

Market Basket Analysis

K-Means Clustering

Principal Components Analysis

Deep Learning

Feedforward Neural Networks

Grab Bag of Neural Network Practices

Convolutional Neural Networks

Recurrent Neural Networks

Test Your Knowledge

Feature Extraction

Best Subset Features

Feature Selection Examples

Adding Features Example

Activation Practice I

Activation Practice II

Activation Practice III

Weight Initialization

Batch vs. Stochastic

Recurrent Network Advantages

Alternative Recurrent Units

Convolutional Application

Convolutional Layer Advantages

Are you interested in becoming an AWS Certified Machine Learning Specialist? If so, then this exam preparation blog is for you! The blog contains over 100 quiz and practice exam questions, as well as detailed answers. The questions are very similar to those you will encounter on the actual exam, so this is a great way to prepare. In addition, the blog also includes cheat sheets and illustrations to help you understand the concepts better.

Bring your own algorithm to an MLOps Pipeline: Architecture

AWS Certified Machine Learning Specialty Exam Prep MLS-C01: AWS architecture diagram showing all services used and how they are connected

Code and Serve Your ML Model with AWS CodeBuild

What are some ways we can use machine learning and artificial intelligence for algorithmic trading in the stock market?

How do we know that the top 3 voice recognition devices, like Siri, Alexa, and OK Google, are not spying on us?

What are some good datasets for Data Science and Machine Learning?

Machine Learning Engineer Interview Questions and Answers

  • [D] A slide which makes you feel old
    by /u/xiikjuy (Machine Learning) on April 20, 2024 at 8:20 am

    submitted by /u/xiikjuy [link] [comments]

  • [D] How can I have successful career as ML researcher? Please give me your advice
    by /u/Equivalent_Future207 (Machine Learning) on April 20, 2024 at 8:06 am

    Hi, I am a postdoctoral researcher. Two years ago, I changed my research area from wireless communications to ML research. The reason for the change is for my interests and future career. It was very hard decision, because my previous research papers may no longer be helpful for me. Recent two years, I have written two top ML conference papers, focusing on privacy theory, and I was very proud of this. However, the number of papers being published is growing exponentially and appears to be overheating. It is also common to see undergraduate students having multiple first-author papers from top conferences for their MS program. Every time I write a paper, social standards increase accordingly, and the gap between me and the standard seems to remain parallel. Would it be better to do major research (LLM or computer vision) and write more papers rather than research in a relatively minor field (privacy and security)? I also have questions about generative models and LLM, which seem to be the most promising. Now, I am working with a research work using pixart-alpha. The VRAM usage is almost at its limitation. The 24GB GPUx2 I currently have doesn't seem to be enough to handle LLM/Generative AI research. What kind of equipment are you doing your research with? Please share your thought and give me your advice. submitted by /u/Equivalent_Future207 [link] [comments]

  • [R] Backpropagation through space, time, and the brain
    by /u/SeawaterFlows (Machine Learning) on April 20, 2024 at 3:02 am

    Paper: https://arxiv.org/abs/2403.16933 Abstract: Effective learning in neuronal networks requires the adaptation of individual synapses given their relative contribution to solving a task. However, physical neuronal systems -- whether biological or artificial -- are constrained by spatio-temporal locality. How such networks can perform efficient credit assignment, remains, to a large extent, an open question. In Machine Learning, the answer is almost universally given by the error backpropagation algorithm, through both space (BP) and time (BPTT). However, BP(TT) is well-known to rely on biologically implausible assumptions, in particular with respect to spatiotemporal (non-)locality, while forward-propagation models such as real-time recurrent learning (RTRL) suffer from prohibitive memory constraints. We introduce Generalized Latent Equilibrium (GLE), a computational framework for fully local spatio-temporal credit assignment in physical, dynamical networks of neurons. We start by defining an energy based on neuron-local mismatches, from which we derive both neuronal dynamics via stationarity and parameter dynamics via gradient descent. The resulting dynamics can be interpreted as a real-time, biologically plausible approximation of BPTT in deep cortical networks with continuous-time neuronal dynamics and continuously active, local synaptic plasticity. In particular, GLE exploits the ability of biological neurons to phase-shift their output rate with respect to their membrane potential, which is essential in both directions of information propagation. For the forward computation, it enables the mapping of time-continuous inputs to neuronal space, performing an effective spatiotemporal convolution. For the backward computation, it permits the temporal inversion of feedback signals, which consequently approximate the adjoint states necessary for useful parameter updates. submitted by /u/SeawaterFlows [link] [comments]

  • [D] How can I become a machine learning engineer?
    by /u/Nwaiy (Machine Learning) on April 20, 2024 at 2:41 am

    I'm a 20-year-old Brazilian student who has always admired the field of machine learning engineering. I'm more interested in the field of computer vision, but the more I research, the more lost I become with so much content that comes my way. So I decided to go to dear reddit and ask you... can you give me tips, roadmaps, or anything that will help me start my journey? submitted by /u/Nwaiy [link] [comments]

  • [N] Kaiming He's lecture on DL architecture for Representation Learning
    by /u/lkhphuc (Machine Learning) on April 20, 2024 at 12:57 am

    https://youtu.be/D_jt-xO_RmI Extremely good lecture, highest signal to noise of historical architecture advances of DL. submitted by /u/lkhphuc [link] [comments]

  • Do you think Reinforcement Learning still got it? [D]
    by /u/cyb0rg14_ (Machine Learning) on April 19, 2024 at 8:40 pm

    Recently I've heard many people saying reinforcement learning itself hasn't shown any improvement in many years (maybe alphago was the last big thing). Whereas other field of AI has seen many SOTA architectures like 'Transformers' for Sequence based tasks and 'ResNet', 'Diffusers' & 'VAE' like architectures for Computer vision tasks. Thought I do believe, directly or indirectly, reinforcement learning still playing a crucial role behind LLMs like ChatGPT and Claude using 'RLHF' techniques. And in many other recent technologies including self driving cars and robots. I think this is just a cold winter going in this field, which will soon find a state of the art architecture in upcoming years (or this is what I hope) What's your thoughts? 🤔 submitted by /u/cyb0rg14_ [link] [comments]

  • Resources to improve code design and software design
    by /u/RepairFar7806 (Data Science) on April 19, 2024 at 7:15 pm

    Hi all, I have been a data scientist for the past 5 years. My bachelors is in information systems and my masters is in statistics. I don’t come from compsci and I had minimal coding other than SQL and R in my education. I have been using python for the past 4 years self taught and I am adequate with it. I would like to improve my python coding skills, more around how to build out and organize it, and best practices for structuring the files and packages. additionally use of classes and methods. I think this can be summed up as software design. The other members of my team have more extensive and formal teachings in these subjects and it is becoming apparent to my manager that I lack skills in this compared to them. We are expected to be machine learning engineers as well as data scientists at this company because we are a smaller start up. Can anyone recommend any resources to help me level up my knowledge in this area? submitted by /u/RepairFar7806 [link] [comments]

  • Need help with project ideas for software development skills and writing production level code.
    by /u/medylan (Data Science) on April 19, 2024 at 6:34 pm

    Hello, I am a stats MS struggling to find work. I believe my math/stats background is holding me back because I am not PhD level but lack the engineering skills to work in applied roles in industry. When I do self learning projects I can only ever think of ideas implementing models I am interested in, but am lost as what to do to start writing production quality code and challenge myself as a software developer. Any ideas and advice is greatly appreciated! Thank you submitted by /u/medylan [link] [comments]

  • [P] TorchFix - a linter for PyTorch-using code with autofix support
    by /u/kit1980 (Machine Learning) on April 19, 2024 at 6:13 pm

    TorchFix is a Python code static analysis tool - a linter with autofix capabilities - for users of PyTorch. It can be used to find and fix issues like usage of deprecated PyTorch functions and non-public symbols, and to adopt PyTorch best practices in general: https://github.com/pytorch-labs/torchfix submitted by /u/kit1980 [link] [comments]

  • [D] Is Google Set to Dominate the RAG Scene with Its Massive Data Resources?
    by /u/Few-Pomegranate4369 (Machine Learning) on April 19, 2024 at 5:06 pm

    Hey everyone! It looks like in a few years, the basic large language models (LLMs) we use will get commoditised, and it won't really matter which one you pick. The next big thing could be LLMs that use Retrieval-Augmented Generation (RAG), which means they need a ton of data to work well. Given that Google has access to loads of data through its search engine, do you think they're in a better position to lead in this new phase compared to other companies? What do you all think? submitted by /u/Few-Pomegranate4369 [link] [comments]

  • [P] AI-based Language Teacher that can run locally on a 12GB graphics card (RTX 4070)
    by /u/HichamEB (Machine Learning) on April 19, 2024 at 2:43 pm

    I've been playing around with various open-source models lately. One fun application that I figured I could try was a <Language Teacher> 🌍 The result are not half bad, you can give it a try here: https://github.com/helboukkouri/virtual-teacher submitted by /u/HichamEB [link] [comments]

  • Imputation methods satisfying constraints
    by /u/Antoinefdu (Data Science) on April 19, 2024 at 2:08 pm

    Hey everyone, I have here a dataset of KPI metrics from various social media posts. For those of you lucky enough to not be working in digital marketing, the metrics in question are things like: "impressions" (number of times a post has been seen) "reach" (number of unique accounts who have seen a post) "clicks", "comments", "likes", "shares", etc (self-explanatory) The dataset in question is incomplete, the missing values are distributed across pretty much every dimension, and my job is to develop a model to fill in those missing values. So far I've tested a KNN imputer with some success, as well as an Iterative imputer (MICE) with much better results. But there's 1 problem that persists: some values need to be constrained by others in the same entry. Imagine for instance that a given post had 55 "Impressions", meaning that it has been seen 55 times, and we try to fill the missing "Reach" (number of unique accounts that have seen that post). Obviously that amount cannot be higher than 55. A post cannot be viewed 55 times by 60 different accounts. There are a bunch of such constraints that I somehow need to pass in to my model, I've tried looking into the MICE algorithm to find an answer there but without success. Does anyone know of a way I can enforce these types of constraints? Or is there another data imputation method that's better suited for this type of task? submitted by /u/Antoinefdu [link] [comments]

  • [P] End-to-end locally-running language teacher
    by /u/HichamEB (Machine Learning) on April 19, 2024 at 1:44 pm

    Hey! Given how easy it is to just grab an open-source model and run it locally these days, I figured I'd try to make some kind of Language Teacher with whom I'd casually have discussions and learn new phrases on the go. This is a quick test in English/Spanish: https://www.loom.com/share/f0dbed21254f445b9d5b0a8e11270982 I used an LLM for the underlying chatbot, a TTS model for speaking out the answers and an ASR model for transforming my speech into an input for the LLM. Let me know if you have any comments 🙂 submitted by /u/HichamEB [link] [comments]

  • [D] Embeddings search "drowning" in a sea of noise! Can you solve this riddle?
    by /u/grudev (Machine Learning) on April 19, 2024 at 1:21 pm

    I'm writing a proof of concept for a RAG application for hundreds of thousands of textual records stored in a Postgres DB, using pgvector to store embeddings ( and using an HNSW index). Vector dimensions are specified correctly. Currently running experiments using varied chunk sizes for the text and comparing two different embedding models. (actual chunk size can vary a little because I am not breaking words to force a size). nomic-embed-text snowflake-arctic-embed-m-long Here's the gist experiment: 1- Create embeddings for "n" documents 2- Create a list of queries/prompts for information that is assuredly contained in SOME of those documents. Examples: What were the events that happened at "location x"? What is the John Doe's nickname? Who were the patients that checked into "hospital name"? Tell me about a requisition made by the director of sales. ... 3- For each query/prompt, I run a cosine distance query and get the the nearest 5 matching chunks. 4- After calculating the average distance for all queries/chunks, the lowest value is, in theory, the best combination of model/chunk_size. This worked SUPER well with a small sample of documents (say ≃ 200), but once I added more documents I started noticing an issue. Some of the NEW documents contain lists of literally 30k+ names. Whenever I ran a query that contains names, chunks from the lists above are returned, EVEN IF THEY DON'T CONTAIN THE NAMES, or any of the other information presented in the prompt (this happens regardless of the chose chunk size or strategy). My theory is that when a chunk containing names is embedded, the resulting embedding contain a strong vector for the semantic meaning of "name", but the vectors that differentiate that name from others can be relatively weak. A chunk containing almost nothing but references to the vector for "name" is then considered very similar to the prompt's embeddings, despite the names themselves not matching. For those of you with more experience/understanding, am I wrong in these assumptions? Would you have any suggestions/workarounds? I have some ideas but would like to see if anyone faced the same issues. submitted by /u/grudev [link] [comments]

  • [R] The roles of value, key, and query in the diffusion model.
    by /u/Candid_Finish444 (Machine Learning) on April 19, 2024 at 1:10 pm

    I am trying to replace the key, query, and value in different prompts of the diffusion model for video editing. I want to understand why key, query, and value are effective and what they represent in the diffusion model. https://preview.redd.it/uoce1dh4rfvc1.png?width=1086&format=png&auto=webp&s=24d6504ca9c50d9f5924dd935204db6c15484a16 submitted by /u/Candid_Finish444 [link] [comments]

  • [P] How to obtain the mean and std from the rms to obtain the first prediction time for a time series case study ?
    by /u/Papytho (Machine Learning) on April 19, 2024 at 8:59 am

    Hello I am trying to implement this from a paper: First, select the first l sampling points in the sampling points of bearing faults and calculate the mean μ_rms and standard deviation σ_rms of their root mean square values, and establish a 3σ criterion- based judgment interval [μ_rms − 3σ_rms, μ_rms +3σ_rms] accordingly. 2) Second, calculate the RMS index for the l + 1 th point FPTl+1 and compare it with the decision interval in step 1. If its value is not in this range, then recalculate the judgment interval after making l =l + 1. If its value is within this range, a judgment is triggered once. 3) Finally, in order to avoid false triggers, three consecutive triggers are used as the identification basis for the final FPT, and make this time FPTl = FPT The paper title: Physics guided neural network: Remaining useful life prediction of rolling bearings using long short-term memory network through dynamic weighting of degradation process My question is: how do I get the μ_rms and σ_rms from the RMS? What I did in this case was first sample the data and then calculate the RMS on the samples. But then I recreate sequences from these RMS values (which doesn't seem logical to me) and then calculate the μ_rms and σ_rms. I do use this value I obtain to do the interval and compare it with the RMS value. But the problem is that by doing this, it triggers way too early. This is the code I have made: def find_fpt(rms_sample, sample): fpt_index = 0 trigger = 0 for i in range(len(rms_sample)): upper = np.mean(rms_sample[i] + 3 * np.std(rms_sample[i])) lower = np.mean(rms_sample[i] - 3 * np.std(rms_sample[i])) rms = np.mean(np.square(sample[i + 1]) ** 2) if upper > rms > lower: if trigger == 3: fpt_index = i break trigger += 1 else: trigger = 0 print(trigger) return fpt_index def sliding_window(data, window_size): return np.lib.stride_tricks.sliding_window_view(data, window_size) window_size = 20 list_bearing, list_rul = load_dataset_and_rul() sampling = sliding_window(list_bearing[0][::100], window_size) rms_values = np.sqrt(np.mean(np.square(sampling) ** 2, axis=1)) rms_sample = sliding_window(rms_values, window_size) fpt = find_fpt(rms_sample,sampling) submitted by /u/Papytho [link] [comments]

  • Any ways to improve TabNet..??? [D]
    by /u/Shoddy_Battle_5397 (Machine Learning) on April 19, 2024 at 8:06 am

    so i was experimenting with tabnet architecture by google https://arxiv.org/pdf/1908.07442.pdf and found that if the data has a lot of randomness and noice then only it can outperform based on my dataset, but traditional machine learning algo like xgboost, random forest do a better job at those dataset where the features are robust enough but they fail the zero shot test and the transformer show some accuracy in that, so i just wanted to check if its possible to merge both of the traditional techniques and the transformer architecture so that it can perform better at traditional ml algo datasets and also give a good zero shot accuracy. while trying to merge it i found that in the tabnet paper they assume that each feature is independent and do not provide any place for any relationship with the features itself but the Tabtransformer architecture takes it into account https://arxiv.org/pdf/2012.06678.pdf as well but doesnt have any feature selection as proposed in tabnet.... i tried to merge them but was stuck where i have to do feature selection on the basis of the dimension assigned to each feature, while this work i s done by sparsemax in the tabnet paper i cant find a way to do that... any help would be appreciated submitted by /u/Shoddy_Battle_5397 [link] [comments]

  • [R] Machine learning from 3D meshes and physical fields
    by /u/SatieGonzales (Machine Learning) on April 19, 2024 at 7:38 am

    Ansys has released an AutoML product for physical simulation called Ansys Sim AI (https://www.ansys.com/fr-fr/news-center/press-releases/1-9-24-ansys-launches-simai). As a machine learning engineer, I wonder what types of models can be used to train on 3D mesh data in STL format with physical fields. How can the varying dimensions of input and output data be managed for different geometric objects? Does anyone have any ideas on this topic? submitted by /u/SatieGonzales [link] [comments]

  • [D] Auto Scripting
    by /u/starcrashing (Machine Learning) on April 19, 2024 at 7:33 am

    I have been working on a project for the past couple of months, and I wanted to know if anyone had feedback or thoughts to fuel its completion. I built a lexer and parser using python and C tokens to create a language that reads a python script or file and utilizes hooks to amend or write new lines. It will be able to take even a blank Python file to write, test, and deliver a working program based on a single prompt provided initially by the user. The way it works is it uses GPTs API to call automated prompts that are built into the program. It creates a program by itself by only using 1 initial prompt by the user on the program. It is a python program with the language I named autoscripter built into it. I hope to finish it by the end of the year if not into next year. This is a very challenging project, but I believe it is the future of scripting, and I have no doubts Microsoft will release something on this sooner than later. Any thoughts? I created this first by designing a debugger that error corrected python code and realized that not only error correction could be automated, but also the entire scripting process could be left to a lot of automation. submitted by /u/starcrashing [link] [comments]

  • [Discussion] Are there specific technical/scientific breakthroughs that have allowed the significant jump in maximum context length across multiple large language models recently?
    by /u/analyticalmonk (Machine Learning) on April 19, 2024 at 6:28 am

    Latest releases of models such as GPT-4 and Claude have a significant jump in the maximum context length (4/8k -> 128k+). The progress in terms of number of tokens that can be processed by these models sound remarkable in % terms. What has led to this? Is this something that's happened purely because of increased compute becoming available during training? Are there algorithmic advances that have led to this? submitted by /u/analyticalmonk [link] [comments]

What are the corresponding Azure and Google Cloud services for each of the AWS services?

Azure Administrator AZ-104 Exam Questions and Answers Dumps

AI Dashboard is available on the Web, Apple, Google, and Microsoft, PRO version

What are the corresponding Azure and Google Cloud services for each of the AWS services?

What are unique distinctions and similarities between AWS, Azure and Google Cloud services? For each AWS service, what is the equivalent Azure and Google Cloud service? For each Azure service, what is the corresponding Google Service? AWS Services vs Azure vs Google Services? Side by side comparison between AWS, Google Cloud and Azure Service?

For a better experience, use the mobile app here.

AWS vs Azure vs Google Mobile App
Cloud Practitioner Exam Prep: AWS vs Azure vs Google

1

Category: Marketplace
Easy-to-deploy and automatically configured third-party applications, including single virtual machine or multiple virtual machine solutions.
References:
[AWS]:AWS Marketplace
[Azure]:Azure Marketplace
[Google]:Google Cloud Marketplace
Tags: #AWSMarketplace, #AzureMarketPlace, #GoogleMarketplace
Differences: They are both digital catalog with thousands of software listings from independent software vendors that make it easy to find, test, buy, and deploy software that runs on their respective cloud platform.


3



Category: AI and machine learning
Build and connect intelligent bots that interact with your users using text/SMS, Skype, Teams, Slack, Office 365 mail, Twitter, and other popular services.
References:
[AWS]:Alexa Skills Kit (enables a developer to build skills, also called conversational applications, on the Amazon Alexa artificial intelligence assistant.)
[Azure]:Microsoft Bot Framework (building enterprise-grade conversational AI experiences.)
[Google]:Google Assistant Actions ( developer platform that lets you create software to extend the functionality of the Google Assistant, Google’s virtual personal assistant,)

Tags: #AlexaSkillsKit, #MicrosoftBotFramework, #GoogleAssistant
Differences: One major advantage Google gets over Alexa is that Google Assistant is available to almost all Android devices.

4


Category: AI and machine learning
Description:API capable of converting speech to text, understanding intent, and converting text back to speech for natural responsiveness.
References:
[AWS]:Amazon Lex (building conversational interfaces into any application using voice and text.)
[Azure]:Azure Speech Services(unification of speech-to-text, text-to-speech, and speech translation into a single Azure subscription)
[Google]:Google APi.ai, AI Hub (Hosted repo of plug-and-play AI component), AI building blocks(for developers to add sight, language, conversation, and structured data to their applications.), AI Platform(code-based data science development environment, lets ML developers and data scientists quickly take projects from ideation to deployment.), DialogFlow (Google-owned developer of human–computer interaction technologies based on natural language conversations. ), TensorFlow(Open Source Machine Learning platform)

Tags: #AmazonLex, #CogintiveServices, #AzureSpeech, #Api.ai, #DialogFlow, #Tensorflow
Differences: api.ai provides a platform that is easy to learn and comprehensive enough for developing conversational actions. It is a good example of a simple approach to solving complex human-to-machine communication problems using natural language processing together with machine learning. Api.ai now supports context-based conversations, which reduces the overhead of handling user context in session parameters; in Lex, this has to be handled in the session. Also, api.ai can be used for both voice- and text-based conversations (assistant actions can be easily created using api.ai).

5

Category: AI and machine learning
Description:Computer Vision: Extract information from images to categorize and process visual data.
References:
[AWS]:Amazon Rekognition (based on the same proven, highly scalable, deep learning technology developed by Amazon’s computer vision scientists to analyze billions of images and videos daily. It requires no machine learning expertise to use.)
[Azure]:Cognitive Services(bring AI within reach of every developer—without requiring machine-learning expertise.)
[Google]:Google Vision (offers powerful pre-trained machine learning models through REST and RPC APIs.)
Tags: AmazonRekognition, #GoogleVision, #AzureSpeech
Differences: For now, only Google Cloud Vision supports batch processing. Videos are not natively supported by Google Cloud Vision or Amazon Rekognition. The Object Detection functionality of Google Cloud Vision and Amazon Rekognition is almost identical, both syntactically and semantically.
Differences:
Google Cloud Vision and Amazon Rekognition offer a broad spectrum of solutions, some of which are comparable in terms of functional details, quality, performance, and costs.

6

Category: Big data and analytics: Data warehouse
Description:Cloud-based Enterprise Data Warehouse (EDW) that uses Massively Parallel Processing (MPP) to quickly run complex queries across petabytes of data.
References:
[AWS]:AWS Redshift (scalable data warehouse that makes it simple and cost-effective to analyze all your data across your data warehouse and data lake.), Amazon Redshift Data Lake Export (Save query results in an open format),Amazon Redshift Federated Query(Run queries n line transactional data), Amazon Redshift RA3(Optimize costs with up to 3x better performance), AQUA: AQUA: Advanced Query Accelerator for Amazon Redshift (Power analytics with a new hardware-accelerated cache), UltraWarm for Amazon Elasticsearch Service(Store logs at ~1/10th the cost of existing storage tiers )
[Azure]:Azure Synapse formerly SQL Data Warehouse (limitless analytics service that brings together enterprise data warehousing and Big Data analytics.)
[Google]:BigQuery (RESTful web service that enables interactive analysis of massive datasets working in conjunction with Google Storage. )
Tags:#AWSRedshift, #GoogleBigQuery, #AzureSynapseAnalytics
Differences: Loading data, managing resources (and hence pricing), and ecosystem. Ecosystem is where Redshift is clearly ahead of BigQuery. While BigQuery is an affordable, performant alternative to Redshift, it is generally considered the more up-and-coming option.

7

Category: Big data and analytics: Data warehouse
Description: Apache Spark-based analytics platform. Managed Hadoop service. Data orchestration, ETL, Analytics and visualization
References:
[AWS]:EMR, Data Pipeline, Kinesis Stream, Kinesis Firehose, Glue, QuickSight, Athena, CloudSearch
[Azure]:Azure Databricks, Data Catalog Cortana Intelligence, HDInsight, Power BI, Azure Datafactory, Azure Search, Azure Data Lake Anlytics, Stream Analytics, Azure Machine Learning
[Google]:Cloud DataProc, Machine Learning, Cloud Datalab
Tags:#EMR, #DataPipeline, #Kinesis, #Cortana, AzureDatafactory, #AzureDataAnlytics, #CloudDataProc, #MachineLearning, #CloudDatalab
Differences: All three providers offer similar building blocks; data processing, data orchestration, streaming analytics, machine learning and visualisations. AWS certainly has all the bases covered with a solid set of products that will meet most needs. Azure offers a comprehensive and impressive suite of managed analytical products. They support open source big data solutions alongside new serverless analytical products such as Data Lake. Google provide their own twist to cloud analytics with their range of services. With Dataproc and Dataflow, Google have a strong core to their proposition. Tensorflow has been getting a lot of attention recently and there will be many who will be keen to see Machine Learning come out of preview.

8

Category: Virtual servers
Description:Virtual servers allow users to deploy, manage, and maintain OS and server software. Instance types provide combinations of CPU/RAM. Users pay for what they use with the flexibility to change sizes.
Batch: Run large-scale parallel and high-performance computing applications efficiently in the cloud.
References:
[AWS]:Elastic Compute Cloud (EC2), Amazon Braket (Explore and experiment with quantum computing), Amazon EC2 M6g Instances (Achieve up to 40% better price performance), Amazon EC2 Inf1 Instances (Deliver cost-effective ML inference), AWS Graviton2 Processors (Optimize price performance for cloud workloads), AWS Batch, AWS Auto Scaling, VMware Cloud on AWS, AWS Local Zones (Run low latency applications at the edge), AWS Wavelength (Deliver ultra-low latency applications for 5G devices), AWS Nitro Enclaves (Further protect highly sensitive data), AWS Outposts (Run AWS infrastructure and services on-premises)
[Azure]:Azure Virtual Machines, Azure Batch, Virtual Machine Scale Sets, Azure VMware by CloudSimple
[Google]:Compute Engine, Preemptible Virtual Machines, Managed instance groups (MIGs), Google Cloud VMware Solution by CloudSimple
Tags: #AWSEC2, #AWSBatch, #AWSAutoscaling, #AzureVirtualMachine, #AzureBatch, #VirtualMachineScaleSets, #AzureVMWare, #ComputeEngine, #MIGS, #VMWare
Differences: There is very little to choose between the three providers when it comes to virtual servers. Amazon has some impressive high-end kit; on the face of it, this sounds like it would make AWS a clear winner. However, if your only option is to choose the biggest box available, you will need very deep pockets, and perhaps your money is better spent re-architecting your apps for horizontal scale. Azure remains very strong in the PaaS space and now has an IaaS offering that can genuinely compete with AWS.
Google offers a simple and very capable set of services that are easy to understand. However, with availability in only 5 regions it does not have the coverage of the other players.


9

Category: Containers and container orchestrators
Description: A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.
Container orchestration is all about managing the lifecycles of containers, especially in large, dynamic environments.
References:
[AWS]:EC2 Container Service (ECS), Fargate (Run containers without managing servers or clusters), EC2 Container Registry (managed AWS Docker registry service that is secure, scalable, and reliable.), Elastic Container Service for Kubernetes (EKS: runs the Kubernetes management infrastructure across multiple AWS Availability Zones), App Mesh (application-level networking to make it easy for your services to communicate with each other across multiple types of compute infrastructure)
[Azure]:Azure Container Instances, Azure Container Registry, Azure Kubernetes Service (AKS), Service Fabric Mesh
[Google]:Google Container Engine, Container Registry, Kubernetes Engine
Tags:#ECS, #Fargate, #EKS, #AppMesh, #ContainerEngine, #ContainerRegistry, #AKS
Differences: Google Container Engine, AWS Container Services, and Azure Container Instances can all be used to run Docker containers. Google offers a simple and very capable set of services that are easy to understand. However, with availability in only 5 regions it does not have the coverage of the other players.

10

Category: Serverless
Description: Integrate systems and run backend processes in response to events or schedules without provisioning or managing servers.
References:
[AWS]:AWS Lambda
[Azure]:Azure Functions
[Google]:Google Cloud Functions
Tags:#AWSLAmbda, #AzureFunctions, #GoogleCloudFunctions
Differences: AWS Lambda, Microsoft Azure Functions, and Google Cloud Functions all offer dynamic, configurable triggers that you can use to invoke your functions on their platforms, and all three support Node.js, Python, and C#. The beauty of serverless development is that, with minor changes, the code you write for one service should be portable to another with little effort – simply modify some interfaces and handle any input/output transforms, and an AWS Lambda Node.js function is nearly indistinguishable from a Microsoft Azure Node.js Function. AWS Lambda additionally supports Java, while Azure Functions additionally supports F# and PHP. AWS Lambda functions run on Amazon Linux (built from an Amazon Machine Image), while Microsoft Azure Functions run in a Windows environment. AWS Lambda's execution model reduces the scope of containerization, letting you spin up and tear down individual pieces of functionality in your application at will.
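
To make the portability point concrete, here is a minimal sketch of an AWS Lambda handler in Python; the event shape and function name are illustrative assumptions, and porting it to Azure Functions or Google Cloud Functions mostly means adapting the entry-point signature and request/response objects.

```python
import json

# Minimal AWS Lambda handler sketch (Python runtime).
# The "name" field in the event is an assumed example input, not a fixed contract.
def handler(event, context):
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }

# Porting note (an assumption, not an official mapping): Azure Functions would expose
# this logic as `def main(req)` returning an HttpResponse, and Google Cloud Functions
# as `def handler(request)` returning a string or (body, status) tuple.
```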

11

Category: Relational databases
Description: Managed relational database service where resiliency, scale, and maintenance are primarily handled by the platform.
References:
[AWS]:AWS RDS (managed relational database service supporting MySQL, PostgreSQL, MariaDB, Oracle, and SQL Server engines), Aurora (MySQL and PostgreSQL-compatible relational database built for the cloud)
[Azure]:SQL Database, Azure Database for MySQL, Azure Database for PostgreSQL
[Google]:Cloud SQL
Tags: #AWSRDS, #AWSAurora, #AzureSQLDatabase, #AzureDatabaseforMySQL, #GoogleCloudSQL
Differences: All three providers boast impressive relational database offerings. RDS supports an impressive range of managed relational stores, while Azure SQL Database is probably the most advanced managed relational database available today. Azure also has the best out-of-the-box support for cross-region geo-replication across its database offerings.

12

Category: NoSQL, Document Databases
Description:A globally distributed, multi-model database that natively supports multiple data models: key-value, documents, graphs, and columnar.
References:
[AWS]:DynamoDB (key-value and document database that delivers single-digit millisecond performance at any scale.), SimpleDB (a simple web services interface to create and store multiple data sets, query your data easily, and return the results.), Managed Cassandra Service (MCS)
[Azure]:Table Storage, DocumentDB, Azure Cosmos DB
[Google]:Cloud Datastore (handles sharding and replication in order to provide you with a highly available and consistent database.)
Tags:#AWSDynamoDB, #SimpleDB, #TableStorage, #DocumentDB, #AzureCosmosDB, #GoogleCloudDatastore
Differences:DynamoDB and Cloud Datastore are based on the document store database model and are therefore similar in nature to open-source solutions MongoDB and CouchDB. In other words, each database is fundamentally a key-value store. With more workloads moving to the cloud the need for NoSQL databases will become ever more important, and again all providers have a good range of options to satisfy most performance/cost requirements. Of all the NoSQL products on offer it’s hard not to be impressed by DocumentDB; Azure also has the best out-of-the-box support for cross-region geo-replication across its database offerings.
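
For a concrete feel of the key-value model described above, here is a minimal boto3 sketch against DynamoDB; the table name "Users" and its key schema are hypothetical and assume a table already exists with a partition key named user_id.

```python
import boto3

# Minimal key-value sketch against DynamoDB.
# Assumes a pre-existing table named "Users" with partition key "user_id".
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Users")

# Write an item; attributes beyond the key are free-form (schemaless).
table.put_item(Item={"user_id": "u-123", "name": "Ada", "plan": "pro"})

# Read it back by key; single-digit millisecond latency is the service's design goal.
response = table.get_item(Key={"user_id": "u-123"})
print(response.get("Item"))
```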

13

Category:Caching
Description:An in-memory–based, distributed caching service that provides a high-performance store typically used to offload non-transactional work from a database.
References:
[AWS]:AWS ElastiCache (works as an in-memory data store and cache to support the most demanding applications requiring sub-millisecond response times.)
[Azure]:Azure Cache for Redis (based on the popular software Redis. It is typically used as a cache to improve the performance and scalability of systems that rely heavily on backend data-stores.)
[Google]:Memcache (In-memory key-value store, originally intended for caching)
Tags:#Redis, #Memcached
Differences: They all support horizontal scaling via sharding, and they all improve the performance of web applications by letting you retrieve information from fast, in-memory caches instead of relying on slower disk-based databases. ElastiCache supports both Memcached and Redis. Redis offers persistence to disk; Memcached does not. This can be very helpful if you cache lots of data, since persistence removes the slowness of warming up a fully cold cache. Redis also offers several extra data structures that Memcached doesn’t (Lists, Sets, Sorted Sets, etc.), whereas Memcached only has key/value pairs. Memcached is multi-threaded, while Redis is single-threaded and event-driven. Redis is very fast, but it will never be multi-threaded; at high scale you can squeeze more connections and transactions out of Memcached. Memcached also tends to be more memory efficient, which can make a big difference at the magnitude of tens or hundreds of millions of keys.
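
As an illustration of the cache-aside pattern these services enable, here is a minimal Python sketch using the redis-py client against a Redis endpoint (an ElastiCache for Redis or Azure Cache for Redis host would be addressed the same way); the host name and the load_user_from_database function are hypothetical stand-ins.

```python
import json
import redis

# Hypothetical Redis endpoint (e.g. an ElastiCache for Redis primary endpoint).
cache = redis.Redis(host="my-cache.example.internal", port=6379, decode_responses=True)

def load_user_from_database(user_id):
    # Placeholder for the slow, disk-based lookup the cache is offloading.
    return {"user_id": user_id, "name": "Ada"}

def get_user(user_id):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)             # cache hit: served from memory
    user = load_user_from_database(user_id)   # cache miss: fall back to the database
    cache.setex(key, 300, json.dumps(user))   # cache the result for 5 minutes
    return user
```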


14

Category: Security, identity, and access
Description:Authentication and authorization: Allows users to securely control access to services and resources while offering data security and protection. Create and manage users and groups, and use permissions to allow and deny access to resources.
References:
[AWS]:Identity and Access Management (IAM), AWS Organizations, Multi-Factor Authentication, AWS Directory Service, Cognito (provides solutions to control access to backend resources from your app), Amazon Detective (Investigate potential security issues), AWS IAM Access Analyzer (Easily analyze resource accessibility)
[Azure]:Azure Active Directory, Azure Subscription Management + Azure RBAC, Multi-Factor Authentication, Azure Active Directory Domain Services, Azure Active Directory B2C, Azure Policy, Management Groups
[Google]:Cloud Identity, Identity Platform, Cloud IAM, Policy Intelligence, Cloud Resource Manager, Cloud Identity-Aware Proxy, Context-aware access, Managed Service for Microsoft Active Directory, Security key enforcement, Titan Security Key
Tags: #IAM, #AWSIAM, #AzureIAM, #GoogleIAM, #Multi-factorAuthentication
Differences: One unique thing about AWS IAM is that accounts created in the organization (not through federation) can only be used within that organization. This contrasts with Google and Microsoft. On the good side, every organization is self-contained. On the bad side, users can end up with multiple sets of credentials they need to manage to access different organizations. The second unique element is that every user can have a non-interactive account by creating and using access keys, an interactive account by enabling console access, or both. (Side note: To use the CLI, you need to have access keys generated.)
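
To illustrate the access-key point above, here is a minimal boto3 sketch that creates an IAM user and generates access keys for non-interactive (CLI/SDK) use; the user name is a hypothetical example and the caller is assumed to have IAM administrative permissions.

```python
import boto3

# Minimal sketch: create an IAM user and a set of access keys for programmatic access.
# "ci-deploy-bot" is a hypothetical user name used only for illustration.
iam = boto3.client("iam")

iam.create_user(UserName="ci-deploy-bot")

# The access key ID / secret pair is what the AWS CLI and SDKs use for
# non-interactive authentication; console access would be enabled separately.
key = iam.create_access_key(UserName="ci-deploy-bot")["AccessKey"]
print("AccessKeyId:", key["AccessKeyId"])
# The secret is only returned once, so store it securely (e.g. in a secrets manager).
```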

15

Category: Object Storage and Content delivery
Description:Object storage service, for use cases including cloud applications, content distribution, backup, archiving, disaster recovery, and big data analytics.
References:
[AWS]:Simple Storage Services (S3), Import/Export (used to move large amounts of data into and out of the Amazon Web Services public cloud using portable storage devices for transport.), Snowball (petabyte-scale data transport solution that uses devices designed to be secure to transfer large amounts of data into and out of the AWS Cloud), CloudFront (content delivery network (CDN) that is massively scaled and globally distributed), Elastic Block Store (EBS: high performance block storage service), Elastic File System (shared, elastic file storage system that grows and shrinks as you add and remove files.), S3 Infrequent Access (IA: for data that is accessed less frequently, but requires rapid access when needed.), S3 Glacier (long-term storage of data that is infrequently accessed and for which retrieval latency times of 3 to 5 hours are acceptable.), AWS Backup (makes it easy to centralize and automate the back up of data across AWS services in the cloud as well as on-premises using the AWS Storage Gateway.), Storage Gateway (hybrid cloud storage service that gives you on-premises access to virtually unlimited cloud storage), AWS Import/Export Disk (accelerates moving large amounts of data into and out of AWS using portable storage devices for transport)
[Azure]:Azure Blob storage, File Storage, Data Lake Store, Azure Backup, Azure managed disks, Azure Files, Azure Storage cool tier, Azure Storage archive access tier, StorSimple, Import/Export
[Google]:Cloud Storage, GlusterFS, Cloud CDN
Tags:#S3, #AzureBlobStorage, #CloudStorage
Differences: All providers have good object storage options, so storage alone is unlikely to be a deciding factor when choosing a cloud provider. The exception perhaps is for hybrid scenarios, where Azure and AWS clearly win. AWS and Google’s support for automatic versioning is a great feature that is currently missing from Azure; however, Microsoft’s fully managed Data Lake Store offers an additional option that will appeal to organisations looking to run large scale analytical workloads. If you are prepared to wait 4 hours for your data and you have considerable amounts of it, then AWS Glacier storage might be a good option. If you use the common programming patterns for atomic updates and consistency, such as etags and the if-match family of headers, then you should be aware that AWS does not support them, though Google and Azure do. Azure also supports blob leasing, which can be used to provide a distributed lock.
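
To show the versioning feature mentioned above in practice, here is a minimal boto3 sketch that enables versioning on an S3 bucket and lists the versions of an object; the bucket and key names are hypothetical.

```python
import boto3

# Minimal sketch: turn on object versioning for a bucket and inspect an object's versions.
# "my-example-bucket" and "reports/latest.csv" are hypothetical names.
s3 = boto3.client("s3")

s3.put_bucket_versioning(
    Bucket="my-example-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Each overwrite of the same key now creates a new, retrievable version.
s3.put_object(Bucket="my-example-bucket", Key="reports/latest.csv", Body=b"v1\n")
s3.put_object(Bucket="my-example-bucket", Key="reports/latest.csv", Body=b"v2\n")

versions = s3.list_object_versions(Bucket="my-example-bucket", Prefix="reports/latest.csv")
for v in versions.get("Versions", []):
    print(v["VersionId"], v["IsLatest"])
```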

17

Category:Web Applications
Description:Managed hosting platform providing easy to use services for deploying and scaling web applications and services. API Gateway is a turnkey solution for publishing APIs to external and internal consumers. CloudFront is a global content delivery network that delivers audio, video, applications, images, and other files.
References:
[AWS]:Elastic Beanstalk (for deploying and scaling web applications and services developed with Java, .NET, PHP, Node.js, Python, Ruby, Go, and Docker on familiar servers such as Apache, Nginx, Passenger, and IIS), AWS Wavelength (for delivering ultra-low latency applications for 5G), API Gateway (makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale.), CloudFront (web service that speeds up distribution of your static and dynamic web content, such as .html, .css, .js, and image files, to your users. CloudFront delivers your content through a worldwide network of data centers called edge locations.), Global Accelerator (improves the availability and performance of your applications with local or global users. It provides static IP addresses that act as a fixed entry point to your application endpoints in a single or multiple AWS Regions, such as your Application Load Balancers, Network Load Balancers or Amazon EC2 instances.), AWS AppSync (simplifies application development by letting you create a flexible API to securely access, manipulate, and combine data from one or more data sources: GraphQL service with real-time data synchronization and offline programming features.)
[Azure]:App Service, API Management, Azure Content Delivery Network
[Google]:App Engine, Cloud API, Cloud Endpoints, Apigee
Tags: #AWSElasticBeanstalk, #AzureAppService, #GoogleAppEngine, #CloudEndpoints, #CloudFront, #Apigee
Differences: With AWS Elastic Beanstalk, developers retain full control over the AWS resources powering their application. If developers decide they want to manage some (or all) of the elements of their infrastructure, they can do so seamlessly by using Elastic Beanstalk’s management capabilities. AWS Elastic Beanstalk integrates with more apps than Google App Engine (Datadog, Jenkins, Docker, Slack, GitHub, Eclipse, etc.), while Google App Engine has more built-in features than AWS Elastic Beanstalk (App Identity, Java runtime, Datastore, Blobstore, Images, Go runtime, etc.). Developers describe Amazon API Gateway as “Create, publish, maintain, monitor, and secure APIs at any scale”. Amazon API Gateway handles all the tasks involved in accepting and processing up to hundreds of thousands of concurrent API calls, including traffic management, authorization and access control, monitoring, and API version management. On the other hand, Google Cloud Endpoints is described as “Develop, deploy and manage APIs on any Google Cloud backend”. An NGINX-based proxy and distributed architecture give strong performance and scalability. Using an OpenAPI Specification or one of its API frameworks, Cloud Endpoints gives you the tools you need for every phase of API development and provides insight with Google Cloud Monitoring, Google Cloud Logging, and Cloud Trace.

18

Category:Encryption
Description:Helps you protect and safeguard your data and meet your organizational security and compliance commitments.
References:
[AWS]:Key Management Service AWS KMS, CloudHSM
[Azure]:Key Vault
[Google]:Encryption By Default at Rest, Cloud KMS
Tags:#AWSKMS, #Encryption, #CloudHSM, #EncryptionAtRest, #CloudKMS
Differences: AWS KMS is an ideal solution for organizations that want to manage encryption keys in conjunction with other AWS services. In contrast to AWS CloudHSM, AWS KMS provides a complete set of tools to manage encryption keys, develop applications and integrate with other AWS services. Google and Azure offer 4096-bit RSA keys, while AWS and Google offer 256-bit AES keys. With AWS, you can bring your own key.
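
As a small illustration of KMS in use, here is a minimal boto3 sketch that encrypts and decrypts a short payload with a KMS key; the key alias is a hypothetical placeholder (envelope encryption with data keys would be the usual pattern for larger payloads).

```python
import boto3

# Minimal sketch: encrypt and decrypt a small payload directly with a KMS key.
# "alias/app-secrets" is a hypothetical key alias.
kms = boto3.client("kms")

ciphertext = kms.encrypt(
    KeyId="alias/app-secrets",
    Plaintext=b"database-password-123",
)["CiphertextBlob"]

# Decrypt; for symmetric keys KMS infers the key from the ciphertext metadata.
plaintext = kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]
print(plaintext.decode())  # -> database-password-123
```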

19

Category:Internet of things (IoT)
Description:A cloud gateway for managing bidirectional communication with billions of IoT devices, securely and at scale. Deploy cloud intelligence directly on IoT devices to run in on-premises scenarios.
References:
[AWS]:AWS IoT, AWS Greengrass, Kinesis Firehose ( captures and loads streaming data in storage and business intelligence (BI) tools to enable near real-time analytics in the AWS cloud), Kinesis Streams (for rapid and continuous data intake and aggregation.), AWS IoT Things Graph (makes it easy to visually connect different devices and web services to build IoT applications.)
[Azure]:Azure IoT Hub, Azure IoT Edge, Event Hubs, Azure Digital Twins, Azure Sphere
[Google]:Google Cloud IoT Core, Firebase, Brillo, Weave, Cloud Pub/Sub, Stream Analysis, BigQuery, BigQuery Streaming API
Tags:#IoT, #InternetOfThings, #Firebase
Differences:AWS and Azure have a more coherent message with their products clearly integrated into their respective platforms, whereas Google Firebase feels like a distinctly separate product.

20

Category:Object Storage and Content delivery
Description: Object storage service, for use cases including cloud applications, content distribution, backup, archiving, disaster recovery, and big data analytics.
References:
[AWS]:Simple Storage Services (S3), Import/Export Snowball, CloudFront, Elastic Block Store (EBS), Elastic File System, S3 Infrequent Access (IA), S3 Glacier, AWS Backup, Storage Gateway, AWS Import/Export Disk, Amazon S3 Access Points(Easily manage access for shared data)
[Azure]:Azure Blob storage, File Storage, Data Lake Store, Azure Backup, Azure managed disks, Azure Files, Azure Storage cool tier, Azure Storage archive access tier, Azure Backup, StorSimple, Import/Export
[Google]:Cloud Storage, GlusterFS, CloudCDN
Tags:#S3, #AzureBlobStorage, #CloudStorage
Differences:All providers have good object storage options and so storage alone is unlikely to be a deciding factor when choosing a cloud provider. The exception perhaps is for hybrid scenarios, in this case Azure and AWS clearly win. AWS and Google’s support for automatic versioning is a great feature that is currently missing from Azure; however Microsoft’s fully managed Data Lake Store offers an additional option that will appeal to organisations who are looking to run large scale analytical workloads. If you are prepared to wait 4 hours for your data and you have considerable amounts of the stuff then AWS Glacier storage might be a good option. If you use the common programming patterns for atomic updates and consistency, such as etags and the if-match family of headers, then you should be aware that AWS does not support them, though Google and Azure do. Azure also supports blob leasing, which can be used to provide a distributed lock.

21

Category: Backend process logic
Description: Cloud technology to build distributed applications using out-of-the-box connectors to reduce integration challenges. Connect apps, data and devices on-premises or in the cloud.
References:
[AWS]:AWS Step Functions ( lets you build visual workflows that enable fast translation of business requirements into technical requirements. You can build applications in a matter of minutes, and when needs change, you can swap or reorganize components without customizing any code.)
[Azure]:Logic Apps (cloud service that helps you schedule, automate, and orchestrate tasks, business processes, and workflows when you need to integrate apps, data, systems, and services across enterprises or organizations.)
[Google]:Dataflow ( fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem.)
Tags:#AWSStepFunctions, #LogicApps, #Dataflow
Differences: AWS Step Functions makes it easy to coordinate the components of distributed applications and microservices using visual workflows. Building applications from individual components that each perform a discrete function lets you scale and change applications quickly. AWS Step Functions belongs to the “Cloud Task Management” category of the tech stack, while Google Cloud Dataflow can be primarily classified under “Real-time Data Processing”. According to the StackShare community, Google Cloud Dataflow has a broader approval, being mentioned in 32 company stacks and 8 developer stacks, compared to AWS Step Functions, which is listed in 19 company stacks and 7 developer stacks.
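
To make the visual-workflow idea more concrete, here is a minimal boto3 sketch that registers a two-step Step Functions state machine written in Amazon States Language; the Lambda ARNs and IAM role ARN are hypothetical placeholders.

```python
import json
import boto3

# Minimal sketch: a two-step workflow in Amazon States Language.
# The Lambda ARNs and IAM role ARN below are hypothetical placeholders.
definition = {
    "Comment": "Validate an order, then charge the customer.",
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-order",
            "Next": "ChargeCustomer",
        },
        "ChargeCustomer": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-customer",
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="order-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsExecutionRole",
)
```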

22

Category: Enterprise application services
Description:Fully integrated Cloud service providing communications, email, document management in the cloud and available on a wide variety of devices.
References:
[AWS]:Amazon WorkMail, Amazon WorkDocs, Amazon Kendra (Sync and Index)
[Azure]:Office 365
[Google]:G Suite
Tags: #AmazonWorkDocs, #Office365, #GoogleGSuite
Differences: G Suite document processing applications like Google Docs are far behind Office 365’s popular Word and Excel software, but the G Suite user interface is intuitive, simple and easy to navigate, whereas Office 365 can feel clunky.

23

Category: Networking
Description: Provides an isolated, private environment in the cloud. Users have control over their virtual networking environment, including selection of their own IP address range, creation of subnets, and configuration of route tables and network gateways.
References:
[AWS]:Virtual Private Cloud (VPC), Cloud virtual networking, Subnets, Elastic Network Interface (ENI), Route Tables, Network ACL, Security Groups, Internet Gateway, NAT Gateway, AWS VPN Gateway, AWS Route 53, AWS Direct Connect, AWS Network Load Balancer, VPN CloudHub, AWS Local Zones, AWS Transit Gateway network manager (Centrally manage global networks)
[Azure]:Virtual Network (provides services for building networks within Azure), Subnets (network resources can be grouped by subnet for organisation and security), Network Interface (each virtual machine can be assigned one or more network interfaces (NICs)), Network Security Groups (NSG: contains a set of prioritised ACL rules that explicitly grant or deny access), Azure VPN Gateway (allows connectivity to on-premises networks), Azure DNS, Traffic Manager (DNS-based traffic routing solution), ExpressRoute (provides connections up to 10 Gbps to Azure services over a dedicated fibre connection), Azure Load Balancer, Network Peering, Azure Stack (allows organisations to use Azure services running in private data centers), Azure Log Analytics
[Google]:Cloud Virtual Network, Subnets, Network Interface, Protocol forwarding, Cloud VPN, Cloud DNS, Virtual Private Network, Cloud Interconnect, CDN Interconnect, Stackdriver, Google Cloud Load Balancing
Tags:#VPC, #Subnets, #ACL, #VPNGateway, #CloudVPN, #NetworkInterface, #ENI, #RouteTables, #NSG, #NetworkACL, #InternetGateway, #NatGateway, #ExpressRoute, #CloudInterConnect, #StackDriver
Differences: Subnets group related resources; however, unlike AWS and Azure, Google does not constrain the private IP address ranges of subnets to the address space of the parent network. Like Azure, Google has a built-in internet gateway that can be referenced from routing rules.
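
As a brief illustration of these networking building blocks on the AWS side, here is a minimal boto3 sketch that creates a VPC with one subnet and an internet gateway; the CIDR ranges are arbitrary example values.

```python
import boto3

# Minimal sketch: a VPC, one subnet, and an internet gateway.
# The CIDR blocks are arbitrary example ranges.
ec2 = boto3.client("ec2")

vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]

subnet_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")["Subnet"]["SubnetId"]

igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

# A route table entry pointing 0.0.0.0/0 at the gateway would make the subnet public.
print(vpc_id, subnet_id, igw_id)
```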

24

Category: Management
Description: A unified management console that simplifies building, deploying, and operating your cloud resources.
References:
[AWS]: AWS Management Console, Trusted Advisor, AWS Usage and Billing Report, AWS Application Discovery Service, Amazon EC2 Systems Manager, AWS Personal Health Dashboard, AWS Compute Optimizer (Identify optimal AWS Compute resources)
[Azure]:Azure portal, Azure Advisor, Azure Billing API, Azure Migrate, Azure Monitor, Azure Resource Health
[Google]:Google Cloud Console, Cost Management, Security Command Center, Stackdriver
Tags: #AWSConsole, #AzurePortal, #GoogleCloudConsole, #TrustedAdvisor, #AzureMonitor, #SecurityCommandCenter
Differences: AWS Console categorizes its Infrastructure as a Service offerings into Compute, Storage and Content Delivery Network (CDN), Database, and Networking to help businesses and individuals grow. Azure excels in the Hybrid Cloud space allowing companies to integrate onsite servers with cloud offerings. Google has a strong offering in containers, since Google developed the Kubernetes standard that AWS and Azure now offer. GCP specializes in high compute offerings like Big Data, analytics and machine learning. It also offers considerable scale and load balancing – Google knows data centers and fast response time.

25

Category: DevOps and application monitoring
Description: Comprehensive solution for collecting, analyzing, and acting on telemetry from your cloud and on-premises environments; Cloud services for collaborating on code development; Collection of tools for building, debugging, deploying, diagnosing, and managing multiplatform scalable apps and services; Fully managed build service that supports continuous integration and deployment.
References:
[AWS]:AWS CodePipeline (orchestrates workflow for continuous integration, continuous delivery, and continuous deployment), AWS CloudWatch (monitor your AWS resources and the applications you run on AWS in real time), AWS X-Ray (application performance management service that enables a developer to analyze and debug applications on AWS), AWS CodeDeploy (automates code deployments to Elastic Compute Cloud (EC2) and on-premises servers), AWS CodeCommit (source code storage and version-control service), AWS Developer Tools, AWS CodeBuild (continuous integration service that compiles source code, runs tests, and produces software packages that are ready to deploy), AWS Command Line Interface (unified tool to manage your AWS services), AWS OpsWorks (Chef-based), AWS CloudFormation (provides a common language for you to describe and provision all the infrastructure resources in your cloud environment), Amazon CodeGuru (for automated code reviews and application performance recommendations)
[Azure]:Azure Monitor, Azure DevOps, Azure Developer Tools, Azure CLI, Azure PowerShell, Azure Automation, Azure Resource Manager, VM extensions
[Google]:DevOps Solutions (infrastructure as code, configuration management, secrets management, serverless computing, continuous delivery, continuous integration), Stackdriver (combines metrics, logs, and metadata from all of your cloud accounts and projects into a single comprehensive view of your environment)
Tags: #CloudWatch, #StackDriver, #AzureMonitor, #AWSXray, #AWSCodeDeploy, #AzureDevOps, #GoogleDevopsSolutions
Differences: CodeCommit eliminates the need to operate your own source control system or worry about scaling its infrastructure. Azure DevOps provides unlimited private Git hosting, cloud build for continuous integration, agile planning, and release management for continuous delivery to the cloud and on-premises. Includes broad IDE support.

SageMaker | Azure Machine Learning Studio

A collaborative, drag-and-drop tool to build, test, and deploy predictive analytics solutions on your data.

Alexa Skills Kit | Microsoft Bot Framework

Build and connect intelligent bots that interact with your users using text/SMS, Skype, Teams, Slack, Office 365 mail, Twitter, and other popular services.

Amazon Lex | Speech Services

API capable of converting speech to text, understanding intent, and converting text back to speech for natural responsiveness.

Amazon Lex | Language Understanding (LUIS)

Allows your applications to understand user commands contextually.

Amazon Polly, Amazon Transcribe | Azure Speech Services

Enables both Speech to Text, and Text into Speech capabilities.
The Speech Services are the unification of speech-to-text, text-to-speech, and speech-translation into a single Azure subscription. It’s easy to speech enable your applications, tools, and devices with the Speech SDK, Speech Devices SDK, or REST APIs.
Amazon Polly is a Text-to-Speech (TTS) service that uses advanced deep learning technologies to synthesize speech that sounds like a human voice. With dozens of lifelike voices across a variety of languages, you can select the ideal voice and build speech-enabled applications that work in many different countries.
Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for developers to add speech-to-text capability to their applications. Using the Amazon Transcribe API, you can analyze audio files stored in Amazon S3 and have the service return a text file of the transcribed speech.
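
For a quick taste of these APIs, here is a minimal boto3 sketch that synthesizes speech with Amazon Polly and writes the audio to a local MP3 file; the voice choice and output path are arbitrary examples.

```python
import boto3

# Minimal sketch: text-to-speech with Amazon Polly.
# "Joanna" is one of Polly's standard English voices; the output path is arbitrary.
polly = boto3.client("polly")

response = polly.synthesize_speech(
    Text="Hello from Amazon Polly.",
    OutputFormat="mp3",
    VoiceId="Joanna",
)

with open("hello.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```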

Amazon Rekognition | Cognitive Services

Computer Vision: Extract information from images to categorize and process visual data.
Amazon Rekognition is a simple and easy to use API that can quickly analyze any image or video file stored in Amazon S3. Amazon Rekognition is always learning from new data, and we are continually adding new labels and facial recognition features to the service.

Face: Detect, identify, and analyze faces in photos.

Emotions: Recognize emotions in images.
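
Here is a minimal boto3 sketch of the kind of call Rekognition exposes, detecting labels and faces in an image stored in S3; the bucket and object names are hypothetical.

```python
import boto3

# Minimal sketch: label and face detection on an image stored in S3.
# "my-example-bucket" and "photos/team.jpg" are hypothetical names.
rekognition = boto3.client("rekognition")
image = {"S3Object": {"Bucket": "my-example-bucket", "Name": "photos/team.jpg"}}

labels = rekognition.detect_labels(Image=image, MaxLabels=5)
for label in labels["Labels"]:
    print(label["Name"], round(label["Confidence"], 1))

faces = rekognition.detect_faces(Image=image, Attributes=["ALL"])
for face in faces["FaceDetails"]:
    # Each face detail includes bounding box, landmarks, and predicted emotions.
    top_emotion = max(face["Emotions"], key=lambda e: e["Confidence"])
    print(top_emotion["Type"], round(top_emotion["Confidence"], 1))
```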

Alexa Skills Kit | Azure Virtual Assistant

The Virtual Assistant Template brings together a number of best practices we’ve identified through the building of conversational experiences and automates integration of components that we’ve found to be highly beneficial to Bot Framework developers.

Big data and analytics

Data warehouse

AWS Redshift | SQL Data Warehouse

Cloud-based Enterprise Data Warehouse (EDW) that uses Massively Parallel Processing (MPP) to quickly run complex queries across petabytes of data.

Big data processing

EMR | Azure Databricks

Apache Spark-based analytics platform.

EMR | HDInsight

Managed Hadoop service. Deploy and manage Hadoop clusters in Azure.

Data orchestration / ETL

AWS Data Pipeline, AWS Glue | Data Factory

Processes and moves data between different compute and storage services, as well as on-premises data sources at specified intervals. Create, schedule, orchestrate, and manage data pipelines.

AWS Glue | Data Catalog

A fully managed service that serves as a system of registration and system of discovery for enterprise data sources.

Analytics and visualization

AWS Kinesis Analytics | Stream Analytics

Data Lake Analytics | Data Lake Store

Storage and analysis platforms that create insights from large quantities of data, or data that originates from many sources.

QuickSight | Power BI

Business intelligence tools that build visualizations, perform ad hoc analysis, and develop business insights from data.

CloudSearch | Azure Search

Delivers full-text search and related search analytics and capabilities.

Amazon Athena | Azure Data Lake Analytics

Provides a serverless interactive query service that uses standard SQL for analyzing databases.

Compute

Virtual servers

Elastic Compute Cloud (EC2) | Azure Virtual Machines

Virtual servers allow users to deploy, manage, and maintain OS and server software. Instance types provide combinations of CPU/RAM. Users pay for what they use with the flexibility to change sizes.

AWS Batch | Azure Batch

Run large-scale parallel and high-performance computing applications efficiently in the cloud.

AWS Auto Scaling | Virtual Machine Scale Sets

Allows you to automatically change the number of VM instances. You define metrics and thresholds that determine whether the platform adds or removes instances.

VMware Cloud on AWS | Azure VMware by CloudSimple

Redeploy and extend your VMware-based enterprise workloads to Azure with Azure VMware Solution by CloudSimple. Keep using the VMware tools you already know to manage workloads on Azure without disrupting network, security, or data protection policies.

Containers and container orchestrators

EC2 Container Service (ECS), Fargate | Azure Container Instances

Azure Container Instances is the fastest and simplest way to run a container in Azure, without having to provision any virtual machines or adopt a higher-level orchestration service.

EC2 Container Registry | Azure Container Registry

Allows customers to store Docker formatted images. Used to create all types of container deployments on Azure.

Elastic Container Service for Kubernetes (EKS) | Azure Kubernetes Service (AKS)

Deploy orchestrated containerized applications with Kubernetes. Simplify monitoring and cluster management through auto upgrades and a built-in operations console.

App Mesh | Service Fabric Mesh

Fully managed service that enables developers to deploy microservices applications without managing virtual machines, storage, or networking.
AWS App Mesh is a service mesh that provides application-level networking to make it easy for your services to communicate with each other across multiple types of compute infrastructure. App Mesh standardizes how your services communicate, giving you end-to-end visibility and ensuring high-availability for your applications.

Serverless

AWS Lambda | Azure Functions

Integrate systems and run backend processes in response to events or schedules without provisioning or managing servers.
AWS Lambda is an event-driven, serverless computing platform provided by Amazon as a part of the Amazon Web Services. It is a computing service that runs code in response to events and automatically manages the computing resources required by that code

Database

Relational database

AWS RDS | SQL Database, Azure Database for MySQL, Azure Database for PostgreSQL

Managed relational database service where resiliency, scale, and maintenance are primarily handled by the platform.
Amazon Relational Database Service is a distributed relational database service by Amazon Web Services. It is a web service running “in the cloud” designed to simplify the setup, operation, and scaling of a relational database for use in applications. Administration processes like patching the database software, backing up databases and enabling point-in-time recovery are managed automatically. Scaling storage and compute resources can be performed by a single API call, as AWS does not offer an SSH connection to RDS instances.

NoSQL / Document

DynamoDB and SimpleDB | Azure Cosmos DB

A globally distributed, multi-model database that natively supports multiple data models: key-value, documents, graphs, and columnar.

Caching

AWS ElastiCache | Azure Cache for Redis

An in-memory–based, distributed caching service that provides a high-performance store typically used to offload non-transactional work from a database.
Amazon ElastiCache is a fully managed in-memory data store and cache service by Amazon Web Services. The service improves the performance of web applications by retrieving information from managed in-memory caches, instead of relying entirely on slower disk-based databases. ElastiCache supports two open-source in-memory caching engines: Memcached and Redis.

Database migration

AWS Database Migration Service | Azure Database Migration Service

Migration of database schema and data from one database format to a specific database technology in the cloud.
AWS Database Migration Service helps you migrate databases to AWS quickly and securely. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database. The AWS Database Migration Service can migrate your data to and from most widely used commercial and open-source databases.

DevOps and application monitoring

AWS CloudWatch, AWS X-Ray | Azure Monitor

Comprehensive solution for collecting, analyzing, and acting on telemetry from your cloud and on-premises environments.
Amazon CloudWatch is a monitoring and observability service built for DevOps engineers, developers, site reliability engineers (SREs), and IT managers. CloudWatch provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health. CloudWatch collects monitoring and operational data in the form of logs, metrics, and events, providing you with a unified view of AWS resources, applications, and services that run on AWS and on-premises servers.
AWS X-Ray is an application performance management service that enables a developer to analyze and debug applications in the Amazon Web Services (AWS) public cloud. A developer can use AWS X-Ray to visualize how a distributed application is performing during development or production, and across multiple AWS regions and accounts.
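
As a tiny illustration of pushing telemetry into CloudWatch, here is a minimal boto3 sketch publishing a custom metric; the namespace, metric name, and dimension are arbitrary examples.

```python
import boto3

# Minimal sketch: publish a custom metric data point to CloudWatch.
# "MyApp" and "OrdersProcessed" are arbitrary example names.
cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_data(
    Namespace="MyApp",
    MetricData=[
        {
            "MetricName": "OrdersProcessed",
            "Value": 42,
            "Unit": "Count",
            "Dimensions": [{"Name": "Environment", "Value": "production"}],
        }
    ],
)
# Alarms and dashboards can then be built on top of this metric in the console or via the API.
```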

AWS CodeDeploy, AWS CodeCommit, AWS CodePipeline | Azure DevOps

A cloud service for collaborating on code development.
AWS CodeDeploy is a fully managed deployment service that automates software deployments to a variety of compute services such as Amazon EC2, AWS Fargate, AWS Lambda, and your on-premises servers. AWS CodeDeploy makes it easier for you to rapidly release new features, helps you avoid downtime during application deployment, and handles the complexity of updating your applications.
AWS CodePipeline is a fully managed continuous delivery service that helps you automate your release pipelines for fast and reliable application and infrastructure updates. CodePipeline automates the build, test, and deploy phases of your release process every time there is a code change, based on the release model you define.
AWS CodeCommit is a source code storage and version-control service for Amazon Web Services’ public cloud customers. CodeCommit was designed to help IT teams collaborate on software development, including continuous integration and application delivery.

AWS Developer Tools | Azure Developer Tools

Collection of tools for building, debugging, deploying, diagnosing, and managing multiplatform scalable apps and services.
The AWS Developer Tools are designed to help you build software like Amazon. They facilitate practices such as continuous delivery and infrastructure as code for serverless, containers, and Amazon EC2.

AWS CodeBuild | Azure DevOps

Fully managed build service that supports continuous integration and deployment.

AWS Command Line Interface | Azure CLI, Azure PowerShell

Built on top of the native REST API across all cloud services, various programming language-specific wrappers provide easier ways to create solutions.
The AWS Command Line Interface (CLI) is a unified tool to manage your AWS services. With just one tool to download and configure, you can control multiple AWS services from the command line and automate them through scripts.

AWS OpsWorks (Chef-based) | Azure Automation

Configures and operates applications of all shapes and sizes, and provides templates to create and manage a collection of resources.
AWS OpsWorks is a configuration management service that provides managed instances of Chef and Puppet. Chef and Puppet are automation platforms that allow you to use code to automate the configurations of your servers.

AWS CloudFormation | Azure Resource Manager, VM extensions, Azure Automation

Provides a way for users to automate the manual, long-running, error-prone, and frequently repeated IT tasks.
AWS CloudFormation provides a common language for you to describe and provision all the infrastructure resources in your cloud environment. CloudFormation allows you to use a simple text file to model and provision, in an automated and secure manner, all the resources needed for your applications across all regions and accounts.
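
To ground the infrastructure-as-code idea, here is a minimal boto3 sketch that provisions a single S3 bucket through CloudFormation; the stack name is a hypothetical example and the template is deliberately tiny.

```python
import json
import boto3

# Minimal sketch: a one-resource CloudFormation template deployed via the API.
# "demo-stack" is a hypothetical stack name.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "DemoBucket": {"Type": "AWS::S3::Bucket"}
    },
}

cfn = boto3.client("cloudformation")
cfn.create_stack(StackName="demo-stack", TemplateBody=json.dumps(template))

# cfn.get_waiter("stack_create_complete").wait(StackName="demo-stack") would block
# until the stack (and therefore the bucket) has been provisioned.
```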

Networking

Area

Cloud virtual networking

Virtual Private Cloud (VPC) | Virtual Network

Provides an isolated, private environment in the cloud. Users have control over their virtual networking environment, including selection of their own IP address range, creation of subnets, and configuration of route tables and network gateways.

Cross-premises connectivity

AWS VPN Gateway | Azure VPN Gateway

Connects Azure virtual networks to other Azure virtual networks, or customer on-premises networks (Site To Site). Allows end users to connect to Azure services through VPN tunneling (Point To Site).

DNS management

AWS Route 53 | Azure DNS

Manage your DNS records using the same credentials and billing and support contract as your other Azure services

Route 53 | Traffic Manager

A service that hosts domain names, plus routes users to Internet applications, connects user requests to datacenters, manages traffic to apps, and improves app availability with automatic failover.

Dedicated network

AWS Direct Connect | ExpressRoute

Establishes a dedicated, private network connection from a location to the cloud provider (not over the Internet).

Load balancing

AWS Network Load Balancer | Azure Load Balancer

Azure Load Balancer load-balances traffic at layer 4 (TCP or UDP).

Application Load Balancer | Application Gateway

Application Gateway is a layer 7 load balancer. It supports SSL termination, cookie-based session affinity, and round robin for load-balancing traffic.

Internet of things (IoT)

AWS IoT | Azure IoT Hub

A cloud gateway for managing bidirectional communication with billions of IoT devices, securely and at scale.

AWS Greengrass | Azure IoT Edge

Deploy cloud intelligence directly on IoT devices to run in on-premises scenarios.

Kinesis Firehose, Kinesis Streams | Event Hubs

Services that allow the mass ingestion of small data inputs, typically from devices and sensors, to process and route the data.

AWS IoT Things Graph | Azure Digital Twins

Azure Digital Twins is an IoT service that helps you create comprehensive models of physical environments. Create spatial intelligence graphs to model the relationships and interactions between people, places, and devices. Query data from a physical space rather than disparate sensors.

Management

Trusted Advisor | Azure Advisor

Provides analysis of cloud resource configuration and security so subscribers can ensure they’re making use of best practices and optimum configurations.

AWS Usage and Billing Report | Azure Billing API

Services to help generate, monitor, forecast, and share billing data for resource usage by time, organization, or product resources.

AWS Management Console | Azure portal

A unified management console that simplifies building, deploying, and operating your cloud resources.

AWS Application Discovery Service | Azure Migrate

Assesses on-premises workloads for migration to Azure, performs performance-based sizing, and provides cost estimations.

Amazon EC2 Systems Manager | Azure Monitor

Comprehensive solution for collecting, analyzing, and acting on telemetry from your cloud and on-premises environments.

AWS Personal Health Dashboard | Azure Resource Health

Provides detailed information about the health of resources as well as recommended actions for maintaining resource health.

Security, identity, and access

Authentication and authorization

Identity and Access Management (IAM) | Azure Active Directory

Allows users to securely control access to services and resources while offering data security and protection. Create and manage users and groups, and use permissions to allow and deny access to resources.

Identity and Access Management (IAM) | Azure Role Based Access Control

Role-based access control (RBAC) helps you manage who has access to Azure resources, what they can do with those resources, and what areas they have access to.

AWS Organizations | Azure Subscription Management + Azure RBAC

Security policy and role management for working with multiple accounts.

Multi-Factor Authentication | Multi-Factor Authentication

Safeguard access to data and applications while meeting user demand for a simple sign-in process.

AWS Directory Service | Azure Active Directory Domain Services

Provides managed domain services such as domain join, group policy, LDAP, and Kerberos/NTLM authentication that are fully compatible with Windows Server Active Directory.

Cognito | Azure Active Directory B2C

A highly available, global, identity management service for consumer-facing applications that scales to hundreds of millions of identities.

AWS Organizations | Azure Policy

Azure Policy is a service in Azure that you use to create, assign, and manage policies. These policies enforce different rules and effects over your resources, so those resources stay compliant with your corporate standards and service level agreements.

AWS Organizations | Management Groups

Azure management groups provide a level of scope above subscriptions. You organize subscriptions into containers called “management groups” and apply your governance conditions to the management groups. All subscriptions within a management group automatically inherit the conditions applied to the management group. Management groups give you enterprise-grade management at a large scale, no matter what type of subscriptions you have.

Encryption

Server-side encryption with Amazon S3 Key Management Service | Azure Storage Service Encryption

Helps you protect and safeguard your data and meet your organizational security and compliance commitments.

Key Management Service AWS KMS, CloudHSM | Key Vault

Provides security solution and works with other services by providing a way to manage, create, and control encryption keys stored in hardware security modules (HSM).

Firewall

Web Application Firewall | Application Gateway – Web Application Firewall

A firewall that protects web applications from common web exploits.

Web Application Firewall | Azure Firewall

Provides inbound protection for non-HTTP/S protocols, outbound network-level protection for all ports and protocols, and application-level protection for outbound HTTP/S.

Security

Inspector | Security Center

An automated security assessment service that improves the security and compliance of applications. Automatically assess applications for vulnerabilities or deviations from best practices.

Certificate Manager | App Service Certificates available on the Portal

Service that allows customers to create, manage, and consume certificates seamlessly in the cloud.

GuardDuty | Azure Advanced Threat Protection

Detect and investigate advanced attacks on-premises and in the cloud.

AWS Artifact | Service Trust Portal

Provides access to audit reports, compliance guides, and trust documents from across cloud services.

AWS Shield | Azure DDoS Protection Service

Provides cloud services with protection from distributed denial of services (DDoS) attacks.

Storage

Object storage

Simple Storage Services (S3) | Azure Blob storage

Object storage service, for use cases including cloud applications, content distribution, backup, archiving, disaster recovery, and big data analytics.

Virtual server disks

Elastic Block Store (EBS) | Azure managed disks

SSD storage optimized for I/O intensive read/write operations. For use as high-performance Azure virtual machine storage.

Shared files

Elastic File System | Azure Files

Provides a simple interface to create and configure file systems quickly, and share common files. Can be used with traditional protocols that access files over a network.

Archiving and backup

S3 Infrequent Access (IA) | Azure Storage cool tier

Cool storage is a lower-cost tier for storing data that is infrequently accessed and long-lived.

S3 Glacier | Azure Storage archive access tier

Archive storage has the lowest storage cost and higher data retrieval costs compared to hot and cool storage.

AWS Backup | Azure Backup

Back up and recover files and folders from the cloud, and provide offsite protection against data loss.

Hybrid storage

Storage Gateway | StorSimple

Integrates on-premises IT environments with cloud storage. Automates data management and storage, plus supports disaster recovery.

Bulk data transfer

AWS Import/Export Disk | Import/Export

A data transport solution that uses secure disks and appliances to transfer large amounts of data. Also offers data protection during transit.

AWS Import/Export Snowball, Snowball Edge, Snowmobile | Azure Data Box

Petabyte- to exabyte-scale data transport solution that uses secure data storage devices to transfer large amounts of data to and from Azure.

Web applications

Elastic Beanstalk | App Service

Managed hosting platform providing easy to use services for deploying and scaling web applications and services.

API Gateway | API Management

A turnkey solution for publishing APIs to external and internal consumers.

CloudFront | Azure Content Delivery Network

A global content delivery network that delivers audio, video, applications, images, and other files.

Global Accelerator | Azure Front Door

Easily join your distributed microservice architectures into a single global application using HTTP load balancing and path-based routing rules. Automate turning up new regions and scale-out with API-driven global actions, and independent fault-tolerance to your back end microservices in Azure—or anywhere.

Miscellaneous

Backend process logic

AWS Step Functions | Logic Apps

Cloud technology to build distributed applications using out-of-the-box connectors to reduce integration challenges. Connect apps, data and devices on-premises or in the cloud.

Enterprise application services

Amazon WorkMail, Amazon WorkDocs | Office 365

Fully integrated Cloud service providing communications, email, document management in the cloud and available on a wide variety of devices.

Gaming

GameLift, GameSparks | PlayFab

Managed services for hosting dedicated game servers.

Media transcoding

Elastic Transcoder | Media Services

Services that offer broadcast-quality video streaming services, including various transcoding technologies.

Workflow

Simple Workflow Service (SWF) | Logic Apps

Serverless technology for connecting apps, data and devices anywhere, whether on-premises or in the cloud for large ecosystems of SaaS and cloud-based connectors.

Hybrid

Outposts | Azure Stack

Azure Stack is a hybrid cloud platform that enables you to run Azure services in your company’s or service provider’s datacenter. As a developer, you can build apps on Azure Stack. You can then deploy them to either Azure Stack or Azure, or you can build truly hybrid apps that take advantage of connectivity between an Azure Stack cloud and Azure.

How does a business decide between Microsoft Azure or AWS?

Basically, it all comes down to what your organizational needs are and if there’s a particular area that’s especially important to your business (ex. serverless, or integration with Microsoft applications).

Some of the main things it comes down to is compute options, pricing, and purchasing options.

Here’s a brief comparison of the compute option features across cloud providers:

Here’s an example of a few instances’ costs (all are Linux OS):

Each provider offers a variety of options to lower costs from the listed On-Demand prices. These can fall under reservations, spot and preemptible instances and contracts.

Both AWS and Azure offer a way for customers to purchase compute capacity in advance in exchange for a discount: AWS Reserved Instances and Azure Reserved Virtual Machine Instances. There are a few interesting variations between the instances across the cloud providers which could affect which is more appealing to a business.

Another discounting mechanism is the idea of spot instances in AWS and low-priority VMs in Azure. These options allow users to purchase unused capacity for a steep discount.

With AWS and Azure, enterprise contracts are available. These are typically aimed at enterprise customers, and encourage large companies to commit to specific levels of usage and spend in exchange for an across-the-board discount – for example, AWS EDPs and Azure Enterprise Agreements.

You can read more about the differences between AWS and Azure to help decide which your business should use in this blog post

Source: AWS to Azure services comparison – Azure Architecture







