Machine Learning 101 – Top 100 AWS and Google Certified Machine Learning Specialty Questions and Answers Dumps

AWS machine Learning Specialty Exam Prep MLS-C01

You can translate the content of this page by selecting a language in the select box.

What are the Top 100 AWS and Google Certified Machine Learning Specialty Questions and Answers Dumps?

This blog is the best way  is the best way to prepare for your upcoming  AWS Certified Machine Learning Specialty and Google Certified Professional Machine Learning Engineer exam. With over 100 questions and answers, this blog provides quizzes similar  that are very similar to the real exam. It also includes  the option to show and hide answers. Additionally, there are machine learning interview questions and detailed answers, as well as cheat sheets and illustrations. This blog is the best way to make sure you are well-prepared for your AWS Certified Machine Learning Specialty Exam.

The typical Google Machine Learning Engineer salary is $147,218. Machine Learning Engineer salaries at Google can range from $110,000 – $152,183.

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

  • By the end of 2020, 85% of customer interactions will be handled without a human (Call Center, Chatbot, etc…)
  • 61% of marketers say artificial intelligence is the most important aspect of their data strategy.
  • 80% of business and tech leaders say AI already boosts productivity (Robotic Process Automation, Power Automate, etc..)
  • Current AI technology can boost business productivity by up to 40%

AWS Machine Learning Certification Specialty Exam Prep for iOs Android Windows10/11

AWS machine Learning Specialty Exam Prep MLS-C01
AWS machine Learning Specialty Exam Prep MLS-C01

GCP Professional Machine Learning Engineer for iOs, Android, Windows 10/11

Quizzes, Practice Exams: Framing, Architecting, Designing, Developing ML Problems & Solutions, ML Jobs Interview Q&A

GCP Professional Machine Learning Engineer
GCP Professional Machine Learning Engineer

 

Azure AI Fundamentals AI-900 Exam Prep App for iOS, Android, Windows10/11

Basics and Advanced Machine Learning Quizzes on Azure, Azure Machine Learning Job Interviews Questions and Answer, ML Cheat Sheets

Azure AI Fundamentals AI-900 Exam Prep
Azure AI Fundamentals AI-900 Exam Prep

Machine Learning For Dummies App for iOs, Android, Windows10/11

Use this App to learn about Machine Learning and Elevate your Brain with Machine Learning Quizzes, Cheat Sheets, Ml Jobs Interview Questions and Answers updated daily.

Machine Learning For Dummies
Machine Learning For Dummies

What does a Professional Machine Learning Engineer do?

Professional Machine Learning Engineer designs, builds, and productionizes ML models to solve business challenges using Google Cloud technologies and knowledge of proven ML models and techniques. The ML Engineer collaborates closely with other job roles to ensure long-term success of models. The ML Engineer should be proficient in all aspects of model architecture, data pipeline interaction, and metrics interpretation. The ML Engineer needs familiarity with application development, infrastructure management, data engineering, and security. Through an understanding of training, retraining, deploying, scheduling, monitoring, and improving models, they design and create scalable solutions for optimal performance.


The AWS Certified Machine Learning – Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate’s ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.

This blog covers Machine Learning 101, Top 20 AWS Certified Machine Learning Specialty Questions and Answers, Top 20 Google Professional Machine Learning Engineer Sample Questions, Machine Learning Quizzes, Machine Learning Q&A, Top 10 Machine Learning Algorithms, Machine Learning Latest Hot News, Machine Learning Demos (Ex: Tensorflow Demos)

Below are the Top 100 AWS Certified Machine Learning Specialty Questions and Answers Dumps.

Top

 

If you are looking for an all-in-one solution to help you prepare for the AWS Cloud Practitioner Certification Exam, look no further than this AWS Cloud Practitioner CCP CLFC01 book below.


Question1: A machine learning team has several large CSV datasets in Amazon S3. Historically, models built with the Amazon SageMaker Linear Learner algorithm have taken hours to train on similar-sized datasets. The team’s leaders need to accelerate the training process. What can a machine learning specialist do to address this concern?

A) Use Amazon SageMaker Pipe mode.
B) Use Amazon Machine Learning to train the models.
C) Use Amazon Kinesis to stream the data to Amazon SageMaker.
D) Use AWS Glue to transform the CSV dataset to the JSON format.
ANSWER1:

A

Invest in your future today by enrolling in this Azure Fundamentals - Microsoft Azure Certification and Training ebook below. This Azure Fundamentals Exam Prep Book will prepare you for the Azure Fundamentals AZ900 Certification Exam.


Notes/Hint1:


Amazon SageMaker Pipe mode streams the data directly to the container, which improves the performance of training jobs. (Refer to this link for supporting information.) In Pipe mode, your training job streams data directly from Amazon S3. Streaming can provide faster start times for training jobs and better throughput. With Pipe mode, you also reduce the size of the Amazon EBS volumes for your training instances. B would not apply in this scenario. C is a streaming ingestion solution, but is not applicable in this scenario. D transforms the data structure.

Reference1: Amazon SageMaker

Question 2) A local university wants to track cars in a parking lot to determine which students are parking in the lot. The university is wanting to ingest videos of the cars parking in near-real time, use machine learning to identify license plates, and store that data in an AWS data store. Which solution meets these requirements with the LEAST amount of development effort?

A) Use Amazon Kinesis Data Streams to ingest the video in near-real time, use the Kinesis Data Streams consumer integrated with Amazon Rekognition Video to process the license plate information, and then store results in DynamoDB.

B) Use Amazon Kinesis Video Streams to ingest the videos in near-real time, use the Kinesis Video Streams integration with Amazon Rekognition Video to identify the license plate information, and then store the results in DynamoDB.

C) Use Amazon Kinesis Data Streams to ingest videos in near-real time, call Amazon Rekognition to identify license plate information, and then store results in DynamoDB.

D) Use Amazon Kinesis Firehose to ingest the video in near-real time and outputs results onto S3. Set up a Lambda function that triggers when a new video is PUT onto S3 to send results to Amazon Rekognition to identify license plate information, and then store results in DynamoDB.

Answer 2)

B

Notes/Hint2)

With average increases in salary of over 25% for certified individuals, you’re going to be in a much better position to secure your dream job or promotion if you earn your AWS Certified Solutions Architect Associate our Cloud Practitioner certification. Get the books below to for real practice exams:

Use the promo codes: W6XM9XP4TWN9 or T6K9P4J9JPPR or 9LWMYKJ7TWPN or TN4NTERJYHY4 for AWS CCP eBook at Apple iBook store.


Use Promo Codes XKPHAATA6LRL 4XJRP9XLT9XL or LTFFY6JA33EL or HKRMTMTHFMAM or 4XHAFTWT4FN6 for AWS SAA-C03 eBook at Apple iBook store



Use Promo Codes EF46PT44LXPN or L6L9R9LKEFFR or TWELPA4JFJWM for Azure Fundamentals eBook at Apple iBook store.

Kinesis Video Streams is used to stream videos in near-real time. Amazon Rekognition Video uses Amazon Kinesis Video Streams to receive and process a video stream. After the videos have been processed by Rekognition we can output the results in DynamoDB.

Reference: Kinesis Video Streams

Question 3) A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences:

1. Please call the number below.
2. Please do not call us. What are the dimensions of the tf–idf matrix?
A) (2, 16)
B) (2, 8)
C) (2, 10)
D) (8, 10)

ANSWER3:

A


We know you like your hobbies and especially coding, We do too, but you should find time to build the skills that’ll drive your career into Six Figures. Cloud skills and certifications can be just the thing you need to make the move into cloud or to level up and advance your career. 85% of hiring managers say cloud certifications make a candidate more attractive. Start your cloud journey with these excellent books below:

Notes/Hint3:

There are 2 sentences, 8 unique unigrams, and 8 unique bigrams, so the result would be (2,16). The phrases are “Please call the number below” and “Please do not call us.” Each word individually (unigram) is “Please,” “call,” ”the,” ”number,” “below,” “do,” “not,” and “us.” The unique bigrams are “Please call,” “call the,” ”the number,” “number below,” “Please do,” “do not,” “not call,” and “call us.” The tf–idf vectorizer is described at this link.

Reference3:  tf-idf vertorizer

Question 4: A company is setting up a system to manage all of the datasets it stores in Amazon S3. The company would like to automate running transformation jobs on the data and maintaining a catalog of the metadata concerning the datasets. The solution should require the least amount of setup and maintenance. Which solution will allow the company to achieve its goals? 

A) Create an Amazon EMR cluster with Apache Hive installed. Then, create a Hive metastore and a script to run transformation jobs on a schedule.
B) Create an AWS Glue crawler to populate the AWS Glue Data Catalog. Then, author an AWS Glue ETL job, and set up a schedule for data transformation jobs.
C) Create an Amazon EMR cluster with Apache Spark installed. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule. D) Create an AWS Data Pipeline that transforms the data. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
 

ANSWER4:

B

Notes/Hint4:

AWS Glue is the correct answer because this option requires the least amount of setup and maintenance since it is serverless, and it does not require management of the infrastructure. Refer to this link for supporting information. A, C, and D are all solutions that can solve the problem, but require more steps for configuration, and require higher operational overhead to run and maintain.
Reference4:  Glue

Question 5) Which service in the Kinesis family allows you to easily load streaming data into data stores and analytics tools?

A) Kinesis Firehose
B) Kinesis Streams
C) Kinesis Data Analytics
D) Kinesis Video Streams
 

ANSWER5:

A

Notes/Hint5:

Kinesis Firehose is perfect for streaming data into AWS and sending it directly to its final destination – places like S3, Redshift, Elastisearch, and Splunk Instances.

Reference 5): Kinesis Firehose

Question 6) A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the data scientist do to improve the training process? 
A) Increase the learning rate. Keep the batch size the same.
B) Reduce the batch size. Decrease the learning rate.
C) Keep the batch size the same. Decrease the learning rate.
D) Do not change the learning rate. Increase the batch size.
 
Answer  6)
B
 

Notes 6)

It is most likely that the loss function is very curvy and has multiple local minima where the training is getting stuck. Decreasing the batch size would help the data scientist stochastically get out of the local minima saddles. Decreasing the learning rate would prevent overshooting the global loss function minimum. Refer to the paper at this link for an explanation.
Reference 6) : Here

Question 7) Your organization has a standalone Javascript (Node.js) application that streams data into AWS using Kinesis Data Streams. You notice that they are using the Kinesis API (AWS SDK) over the Kinesis Producer Library (KPL). What might be the reasoning behind this?
A) The Kinesis API (AWS SDK) provides greater functionality over the Kinesis Producer Library.
B) The Kinesis API (AWS SDK) runs faster in Javascript applications over the Kinesis Producer Library.
C) The Kinesis Producer Library must be installed as a Java application to use with Kinesis Data Streams.
D) The Kinesis Producer Library cannot be integrated with a Javascript application because of its asynchronous architecture.
Answer 7)
C
Notes/Hint7:
The KPL must be installed as a Java application before it can be used with your Kinesis Data Streams. There are ways to process KPL serialized data within AWS Lambda, in Java, Node.js, and Python, but not if these answers mentions Lambda.
Reference 7) KPL
 
 
Question 8) A data scientist is evaluating different binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result. The models should be evaluated based on the following criteria: 
1) Must have a recall rate of at least 80%
2) Must have a false positive rate of 10% or less
3) Must minimize business costs After creating each binary classification model, the data scientist generates the corresponding confusion matrix. Which confusion matrix represents the model that satisfies the requirements?
A) TN = 91, FP = 9 FN = 22, TP = 78
 B) TN = 99, FP = 1 FN = 21, TP = 79
C) TN = 96, FP = 4 FN = 10, TP = 90
D) TN = 98, FP = 2 FN = 18, TP = 82
 
Answer 8): 
D
 

Notes/Hint 8)


The following calculations are required: TP = True Positive FP = False Positive FN = False Negative TN = True Negative FN = False Negative Recall = TP / (TP + FN) False Positive Rate (FPR) = FP / (FP + TN) Cost = 5 * FP + FN A B C D Recall 78 / (78 + 22) = 0.78 79 / (79 + 21) = 0.79 90 / (90 + 10) = 0.9 82 / (82 + 18) = 0.82 False Positive Rate 9 / (9 + 91) = 0.09 1 / (1 + 99) = 0.01 4 / (4 + 96) = 0.04 2 / (2 + 98) = 0.02 Costs 5 * 9 + 22 = 67 5 * 1 + 21 = 26 5 * 4 + 10 = 30 5 * 2 + 18 = 28 Options C and D have a recall greater than 80% and an FPR less than 10%, but D is the most cost effective. For supporting information, refer to this link.
Reference 8: Here

 
 
Question 9) A data scientist uses logistic regression to build a fraud detection model. While the model accuracy is 99%, 90% of the fraud cases are not detected by the model. What action will definitely help the model detect more than 10% of fraud cases? 
A) Using undersampling to balance the dataset
B) Decreasing the class probability threshold
C) Using regularization to reduce overfitting
D) Using oversampling to balance the dataset
 

Answer  9)

B

 

Notes 9)


Decreasing the class probability threshold makes the model more sensitive and, therefore, marks more cases as the positive class, which is fraud in this case. This will increase the likelihood of fraud detection. However, it comes at the price of lowering precision. This is covered in the Discussion section of the paper at this link
Reference 9: Here

 

 
Question 10) A company is interested in building a fraud detection model. Currently, the data scientist does not have a sufficient amount of information due to the low number of fraud cases. Which method is MOST likely to detect the GREATEST number of valid fraud cases?
A) Oversampling using bootstrapping
B) Undersampling
C) Oversampling using SMOTE
D) Class weight adjustment
 

Answer  10)

C

 
Notes 10)

With datasets that are not fully populated, the Synthetic Minority Over-sampling Technique (SMOTE) adds new information by adding synthetic data points to the minority class. This technique would be the most effective in this scenario. Refer to Section 4.2 at this link for supporting information.
Reference 10) : Here
 
Question 11) A machine learning engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm. The ML engineer notices the target label classes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%. What should the ML engineer do to minimize bias due to missing values? 
 
A) Replace each missing value by the mean or median across non-missing values in same row.
B) Delete observations that contain missing values because these represent less than 5% of the data.
C) Replace each missing value by the mean or median across non-missing values in the same column.
D) For each feature, approximate the missing values using supervised learning based on other features.
 

Answer  11)

D

 

Notes 11)

Use supervised learning to predict missing values based on the values of other features. Different supervised learning approaches might have different performances, but any properly implemented supervised learning approach should provide the same or better approximation than mean or median approximation, as proposed in responses A and C. Supervised learning applied to the imputation of missing values is an active field of research. Refer to this link for an example.
Reference 11): Here

 
Question 12) A company has collected customer comments on its products, rating them as safe or unsafe, using decision trees. The training dataset has the following features: id, date, full review, full review summary, and a binary safe/unsafe tag. During training, any data sample with missing features was dropped. In a few instances, the test set was found to be missing the full review text field. For this use case, which is the most effective course of action to address test data samples with missing features? 
A) Drop the test samples with missing full review text fields, and then run through the test set.
B) Copy the summary text fields and use them to fill in the missing full review text fields, and then run through the test set.
C) Use an algorithm that handles missing data better than decision trees.
D) Generate synthetic data to fill in the fields that are missing data, and then run through the test set.
 
Answer  12)
B

 

 

Notes 12) 

In this case, a full review summary usually contains the most descriptive phrases of the entire review and is a valid stand-in for the missing full review text field. For supporting information, refer to page 1627 at this link, and this link and this link.

Reference 12) Here

 

 
Question 13) An insurance company needs to automate claim compliance reviews because human reviews are expensive and error-prone. The company has a large set of claims and a compliance label for each. Each claim consists of a few sentences in English, many of which contain complex related information. Management would like to use Amazon SageMaker built-in algorithms to design a machine learning supervised model that can be trained to read each claim and predict if the claim is compliant or not. Which approach should be used to extract features from the claims to be used as inputs for the downstream supervised task? 
A) Derive a dictionary of tokens from claims in the entire dataset. Apply one-hot encoding to tokens found in each claim of the training set. Send the derived features space as inputs to an Amazon SageMaker builtin supervised learning algorithm.
B) Apply Amazon SageMaker BlazingText in Word2Vec mode to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
C) Apply Amazon SageMaker BlazingText in classification mode to labeled claims in the training set to derive features for the claims that correspond to the compliant and non-compliant labels, respectively.
D) Apply Amazon SageMaker Object2Vec to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
 

Answer  13)

D

 

Notes 13)

Amazon SageMaker Object2Vec generalizes the Word2Vec embedding technique for words to more complex objects, such as sentences and paragraphs. Since the supervised learning task is at the level of whole claims, for which there are labels, and no labels are available at the word level, Object2Vec needs be used instead of Word2Vec.

Reference 13)  Amazon SageMaker
Object2Vec 

Question 14) You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to immediately be processed before operations can continue. The second event type includes data of less importance, but operations can continue without immediately processing. What is the most appropriate solution to record these different types of events?

A) Capture both events with the PutRecords API call.
B) Capture both event types using the Kinesis Producer Library (KPL).
C) Capture the mission critical events with the PutRecords API call and the second event type with the Kinesis Producer Library (KPL).
D) Capture the mission critical events with the Kinesis Producer Library (KPL) and the second event type with the Putrecords API call.
 

Answer  14)

C

 

Notes 14)

The question is about sending data to Kinesis synchronously vs. asynchronously. PutRecords is a synchronous send function, so it must be used for the first event type (critical events). The Kinesis Producer Library (KPL) implements an asynchronous send function, so it can be used for the second event type. In this scenario, the reason to use the KPL over the PutRecords API call is because: KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable). Larger values of RecordMaxBufferedTime results in higher packing efficiencies and better performance. Applications that cannot tolerate this additional delay may need to use the AWS SDK directly. For more information about using the AWS SDK with Kinesis Data Streams, see Developing Producers Using the Amazon Kinesis Data Streams API with the AWS SDK for Java. For more information about RecordMaxBufferedTime and other user-configurable properties of the KPL, see Configuring the Kinesis Producer Library.

Reference 14: KCL vs PutRecords

 

Question 15) You are collecting clickstream data from an e-commerce website to make near-real time product suggestions for users actively using the site. Which combination of tools can be used to achieve the quickest recommendations and meets all of the requirements?

A) Use Kinesis Data Streams to ingest clickstream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
B) Use Kinesis Data Firehose to ingest click stream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions, then use Lambda to load these results into S3.
C) Use Kinesis Data Streams to ingest clickstream data, then use Lambda to process that data and write it to S3. Once the data is on S3, use Athena to query based on conditions that data and make real time recommendations to users.
D) Use the Kinesis Data Analytics to ingest the clickstream data directly and run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
 

Answer  15)

A

 

Notes 15)

Kinesis Data Analytics gets its input streaming data from Kinesis Data Streams or Kinesis Data Firehose. You can use Kinesis Data Analytics to run real-time SQL queries on your data. Once certain conditions are met you can trigger Lambda functions to make real time product suggestions to users. It is not important that we store or persist the clickstream data.

Reference 15: Kinesis Data Analytics

Question 16) Which service built by AWS makes it easy to set up a retry mechanism, aggregate records to improve throughput, and automatically submits CloudWatch metrics?

A) Kinesis API (AWS SDK)
B) Kinesis Producer Library (KPL)
C) Kinesis Consumer Library
D) Kinesis Client Library (KCL)

Answer  16)

B

 

Notes 16)

Although the Kinesis API built into the AWS SDK can be used for all of this, the Kinesis Producer Library (KPL) makes it easy to integrate all of this into your applications.

Reference 16:  Kinesis Producer Library (KPL) 

[appbox appstore 1611045854-iphone screenshots]

[appbox microsoftstore  9n8rl80hvm4t-mobile screenshots]

Question 17) You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data that you are ingesting is players controller inputs every 1 second (up to 10 players in a game) that is in JSON format. The data needs to be ingested through Kinesis Data Streams and the JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data?

A) 10 shards
B) Greater than 500 shards, so you’ll need to request more shards from AWS
C) 1 shard
D) 100 shards

Answer  17)

C

 

Notes 17)

In this scenario, there will be a maximum of 10 records per second with a max payload size of 1000 KB (10 records x 100 KB = 1000KB) written to the shard. A single shard can ingest up to 1 MB of data per second, which is enough to ingest the 1000 KB from the streaming game play. Therefor 1 shard is enough to handle the streaming data.

Reference 17: shards

Question 18) Which services in the Kinesis family allows you to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time?

A) Kinesis Streams
B) Kinesis Firehose
C) Kinesis Video Streams
D) Kinesis Data Analytics

Answer  18)

D

 

Notes 18)

Kinesis Data Analytics allows you to run real-time SQL queries on your data to gain insights and respond to events in real time.

Reference 18: Kinesis Data Analytics

 

Question 19) You are a ML specialist needing to collect data from Twitter tweets. Your goal is to collect tweets that include only the name of your company and the tweet body, and store it off into a data store in AWS. What set of tools can you use to stream, transform, and load the data into AWS with the LEAST amount of effort?

A) Setup a Kinesis Data Firehose for data ingestion and immediately write that data to S3. Next, setup a Lambda function to trigger when data lands in S3 to transform it and finally write it to DynamoDB.
B) Setup A Kinesis Data Stream for data ingestion, setup EC2 instances as data consumers to poll and transform the data from the stream. Once the data is transformed, make an API call to write the data to DynamoDB.
C) Setup Kinesis Data Streams for data ingestion. Next, setup Kinesis Data Firehouse to load that data into RedShift. Next, setup a Lambda function to query data using RedShift spectrum and store the results onto DynamoDB.
D) Create a Kinesis Data Stream to ingest the data. Next, setup a Kinesis Data Firehose and use Lambda to transform the data from the Kinesis Data Stream, then use Lambda to write the data to DynamoDB. Finally, use S3 as the data destination for Kinesis Data Firehose.
 

Answer 19)

A

Notes 19)

All of these could be used to stream, transform, and load the data into an AWS data store. The setup that requires the LEAST amount of effort and moving parts involves setting up a Kinesis Data Firehose to stream the data into S3, have it transformed by Lambda with an S3 trigger, and then written to DynamoDB.

Reference 19: Kinesis Data Firehose to stream the data into S3

Question 20) Which service in the Kinesis family allows you to build custom applications that process or analyze streaming data for specialized needs?

A) Kinesis Firehose
B) Kinesis Streams
C) Kinesis Video Streams
D) Kinesis Data Analytics

Answer 20)

B

Notes 20)

Kinesis Streams allows you to stream data into AWS and build custom applications around that streaming data.

Reference 20: Kinesis Streams

Question21:

Answer21:

What are the Top 100 AWS and Google Certified Machine Learning Specialty Questions and Answers Dumps?

This blog is the best way  is the best way to prepare for your upcoming  AWS Certified Machine Learning Specialty and Google Certified Professional Machine Learning Engineer exam. With over 100 questions and answers, this blog provides quizzes similar  that are very similar to the real exam. It also includes  the option to show and hide answers. Additionally, there are machine learning interview questions and detailed answers, as well as cheat sheets and illustrations. This blog is the best way to make sure you are well-prepared for your AWS Certified Machine Learning Specialty Exam.

The typical Google Machine Learning Engineer salary is $147,218. Machine Learning Engineer salaries at Google can range from $110,000 – $152,183.

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

  • By the end of 2020, 85% of customer interactions will be handled without a human (Call Center, Chatbot, etc…)
  • 61% of marketers say artificial intelligence is the most important aspect of their data strategy.
  • 80% of business and tech leaders say AI already boosts productivity (Robotic Process Automation, Power Automate, etc..)
  • Current AI technology can boost business productivity by up to 40%

AWS Machine Learning Certification Specialty Exam Prep for iOs Android Windows10/11

AWS machine Learning Specialty Exam Prep MLS-C01
AWS machine Learning Specialty Exam Prep MLS-C01

GCP Professional Machine Learning Engineer for iOs, Android, Windows 10/11

Quizzes, Practice Exams: Framing, Architecting, Designing, Developing ML Problems & Solutions, ML Jobs Interview Q&A

GCP Professional Machine Learning Engineer
GCP Professional Machine Learning Engineer

 

Azure AI Fundamentals AI-900 Exam Prep App for iOS, Android, Windows10/11

Basics and Advanced Machine Learning Quizzes on Azure, Azure Machine Learning Job Interviews Questions and Answer, ML Cheat Sheets

Azure AI Fundamentals AI-900 Exam Prep
Azure AI Fundamentals AI-900 Exam Prep

Machine Learning For Dummies App for iOs, Android, Windows10/11

Use this App to learn about Machine Learning and Elevate your Brain with Machine Learning Quizzes, Cheat Sheets, Ml Jobs Interview Questions and Answers updated daily.

Machine Learning For Dummies
Machine Learning For Dummies

What does a Professional Machine Learning Engineer do?

Professional Machine Learning Engineer designs, builds, and productionizes ML models to solve business challenges using Google Cloud technologies and knowledge of proven ML models and techniques. The ML Engineer collaborates closely with other job roles to ensure long-term success of models. The ML Engineer should be proficient in all aspects of model architecture, data pipeline interaction, and metrics interpretation. The ML Engineer needs familiarity with application development, infrastructure management, data engineering, and security. Through an understanding of training, retraining, deploying, scheduling, monitoring, and improving models, they design and create scalable solutions for optimal performance.

The AWS Certified Machine Learning – Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate’s ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.

This blog covers Machine Learning 101, Top 20 AWS Certified Machine Learning Specialty Questions and Answers, Top 20 Google Professional Machine Learning Engineer Sample Questions, Machine Learning Quizzes, Machine Learning Q&A, Top 10 Machine Learning Algorithms, Machine Learning Latest Hot News, Machine Learning Demos (Ex: Tensorflow Demos)

Below are the Top 100 AWS Certified Machine Learning Specialty Questions and Answers Dumps.

Top

 

Question1: A machine learning team has several large CSV datasets in Amazon S3. Historically, models built with the Amazon SageMaker Linear Learner algorithm have taken hours to train on similar-sized datasets. The team’s leaders need to accelerate the training process. What can a machine learning specialist do to address this concern?

A) Use Amazon SageMaker Pipe mode.
B) Use Amazon Machine Learning to train the models.
C) Use Amazon Kinesis to stream the data to Amazon SageMaker.
D) Use AWS Glue to transform the CSV dataset to the JSON format.
ANSWER1:

A

Notes/Hint1:


Amazon SageMaker Pipe mode streams the data directly to the container, which improves the performance of training jobs. (Refer to this link for supporting information.) In Pipe mode, your training job streams data directly from Amazon S3. Streaming can provide faster start times for training jobs and better throughput. With Pipe mode, you also reduce the size of the Amazon EBS volumes for your training instances. B would not apply in this scenario. C is a streaming ingestion solution, but is not applicable in this scenario. D transforms the data structure.

Reference1: Amazon SageMaker

Question 2) A local university wants to track cars in a parking lot to determine which students are parking in the lot. The university is wanting to ingest videos of the cars parking in near-real time, use machine learning to identify license plates, and store that data in an AWS data store. Which solution meets these requirements with the LEAST amount of development effort?

A) Use Amazon Kinesis Data Streams to ingest the video in near-real time, use the Kinesis Data Streams consumer integrated with Amazon Rekognition Video to process the license plate information, and then store results in DynamoDB.

B) Use Amazon Kinesis Video Streams to ingest the videos in near-real time, use the Kinesis Video Streams integration with Amazon Rekognition Video to identify the license plate information, and then store the results in DynamoDB.

C) Use Amazon Kinesis Data Streams to ingest videos in near-real time, call Amazon Rekognition to identify license plate information, and then store results in DynamoDB.

D) Use Amazon Kinesis Firehose to ingest the video in near-real time and outputs results onto S3. Set up a Lambda function that triggers when a new video is PUT onto S3 to send results to Amazon Rekognition to identify license plate information, and then store results in DynamoDB.

Answer 2)

B

Notes/Hint2)

Kinesis Video Streams is used to stream videos in near-real time. Amazon Rekognition Video uses Amazon Kinesis Video Streams to receive and process a video stream. After the videos have been processed by Rekognition we can output the results in DynamoDB.

Reference: Kinesis Video Streams

Question 3) A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences:

1. Please call the number below.
2. Please do not call us. What are the dimensions of the tf–idf matrix?
A) (2, 16)
B) (2, 8)
C) (2, 10)
D) (8, 10)

ANSWER3:

A

Notes/Hint3:

There are 2 sentences, 8 unique unigrams, and 8 unique bigrams, so the result would be (2,16). The phrases are “Please call the number below” and “Please do not call us.” Each word individually (unigram) is “Please,” “call,” ”the,” ”number,” “below,” “do,” “not,” and “us.” The unique bigrams are “Please call,” “call the,” ”the number,” “number below,” “Please do,” “do not,” “not call,” and “call us.” The tf–idf vectorizer is described at this link.

Reference3:  tf-idf vertorizer

Question 4: A company is setting up a system to manage all of the datasets it stores in Amazon S3. The company would like to automate running transformation jobs on the data and maintaining a catalog of the metadata concerning the datasets. The solution should require the least amount of setup and maintenance. Which solution will allow the company to achieve its goals? 

A) Create an Amazon EMR cluster with Apache Hive installed. Then, create a Hive metastore and a script to run transformation jobs on a schedule.
B) Create an AWS Glue crawler to populate the AWS Glue Data Catalog. Then, author an AWS Glue ETL job, and set up a schedule for data transformation jobs.
C) Create an Amazon EMR cluster with Apache Spark installed. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule. D) Create an AWS Data Pipeline that transforms the data. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
 

ANSWER4:

B

Notes/Hint4:

AWS Glue is the correct answer because this option requires the least amount of setup and maintenance since it is serverless, and it does not require management of the infrastructure. Refer to this link for supporting information. A, C, and D are all solutions that can solve the problem, but require more steps for configuration, and require higher operational overhead to run and maintain.
Reference4:  Glue

Question 5) Which service in the Kinesis family allows you to easily load streaming data into data stores and analytics tools?

A) Kinesis Firehose
B) Kinesis Streams
C) Kinesis Data Analytics
D) Kinesis Video Streams
 

ANSWER5:

A

Notes/Hint5:

Kinesis Firehose is perfect for streaming data into AWS and sending it directly to its final destination – places like S3, Redshift, Elastisearch, and Splunk Instances.

Reference 5): Kinesis Firehose

Question 6) A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the data scientist do to improve the training process? 
A) Increase the learning rate. Keep the batch size the same.
B) Reduce the batch size. Decrease the learning rate.
C) Keep the batch size the same. Decrease the learning rate.
D) Do not change the learning rate. Increase the batch size.
 
Answer  6)
B
 

Notes 6)

It is most likely that the loss function is very curvy and has multiple local minima where the training is getting stuck. Decreasing the batch size would help the data scientist stochastically get out of the local minima saddles. Decreasing the learning rate would prevent overshooting the global loss function minimum. Refer to the paper at this link for an explanation.
Reference 6) : Here

Question 7) Your organization has a standalone Javascript (Node.js) application that streams data into AWS using Kinesis Data Streams. You notice that they are using the Kinesis API (AWS SDK) over the Kinesis Producer Library (KPL). What might be the reasoning behind this?
A) The Kinesis API (AWS SDK) provides greater functionality over the Kinesis Producer Library.
B) The Kinesis API (AWS SDK) runs faster in Javascript applications over the Kinesis Producer Library.
C) The Kinesis Producer Library must be installed as a Java application to use with Kinesis Data Streams.
D) The Kinesis Producer Library cannot be integrated with a Javascript application because of its asynchronous architecture.
Answer 7)
C
Notes/Hint7:
The KPL must be installed as a Java application before it can be used with your Kinesis Data Streams. There are ways to process KPL serialized data within AWS Lambda, in Java, Node.js, and Python, but not if these answers mentions Lambda.
Reference 7) KPL
 
 
Question 8) A data scientist is evaluating different binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result. The models should be evaluated based on the following criteria: 
1) Must have a recall rate of at least 80%
2) Must have a false positive rate of 10% or less
3) Must minimize business costs After creating each binary classification model, the data scientist generates the corresponding confusion matrix. Which confusion matrix represents the model that satisfies the requirements?
A) TN = 91, FP = 9 FN = 22, TP = 78
 B) TN = 99, FP = 1 FN = 21, TP = 79
C) TN = 96, FP = 4 FN = 10, TP = 90
D) TN = 98, FP = 2 FN = 18, TP = 82
 
Answer 8): 
D
 

Notes/Hint 8)


The following calculations are required: TP = True Positive FP = False Positive FN = False Negative TN = True Negative FN = False Negative Recall = TP / (TP + FN) False Positive Rate (FPR) = FP / (FP + TN) Cost = 5 * FP + FN A B C D Recall 78 / (78 + 22) = 0.78 79 / (79 + 21) = 0.79 90 / (90 + 10) = 0.9 82 / (82 + 18) = 0.82 False Positive Rate 9 / (9 + 91) = 0.09 1 / (1 + 99) = 0.01 4 / (4 + 96) = 0.04 2 / (2 + 98) = 0.02 Costs 5 * 9 + 22 = 67 5 * 1 + 21 = 26 5 * 4 + 10 = 30 5 * 2 + 18 = 28 Options C and D have a recall greater than 80% and an FPR less than 10%, but D is the most cost effective. For supporting information, refer to this link.
Reference 8: Here

 
 
Question 9) A data scientist uses logistic regression to build a fraud detection model. While the model accuracy is 99%, 90% of the fraud cases are not detected by the model. What action will definitely help the model detect more than 10% of fraud cases? 
A) Using undersampling to balance the dataset
B) Decreasing the class probability threshold
C) Using regularization to reduce overfitting
D) Using oversampling to balance the dataset
 

Answer  9)

B

 

Notes 9)


Decreasing the class probability threshold makes the model more sensitive and, therefore, marks more cases as the positive class, which is fraud in this case. This will increase the likelihood of fraud detection. However, it comes at the price of lowering precision. This is covered in the Discussion section of the paper at this link
Reference 9: Here

 

 
Question 10) A company is interested in building a fraud detection model. Currently, the data scientist does not have a sufficient amount of information due to the low number of fraud cases. Which method is MOST likely to detect the GREATEST number of valid fraud cases?
A) Oversampling using bootstrapping
B) Undersampling
C) Oversampling using SMOTE
D) Class weight adjustment
 

Answer  10)

C

 
Notes 10)

With datasets that are not fully populated, the Synthetic Minority Over-sampling Technique (SMOTE) adds new information by adding synthetic data points to the minority class. This technique would be the most effective in this scenario. Refer to Section 4.2 at this link for supporting information.
Reference 10) : Here
 
Question 11) A machine learning engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm. The ML engineer notices the target label classes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%. What should the ML engineer do to minimize bias due to missing values? 
 
A) Replace each missing value by the mean or median across non-missing values in same row.
B) Delete observations that contain missing values because these represent less than 5% of the data.
C) Replace each missing value by the mean or median across non-missing values in the same column.
D) For each feature, approximate the missing values using supervised learning based on other features.
 

Answer  11)

D

 

Notes 11)

Use supervised learning to predict missing values based on the values of other features. Different supervised learning approaches might have different performances, but any properly implemented supervised learning approach should provide the same or better approximation than mean or median approximation, as proposed in responses A and C. Supervised learning applied to the imputation of missing values is an active field of research. Refer to this link for an example.
Reference 11): Here

 
Question 12) A company has collected customer comments on its products, rating them as safe or unsafe, using decision trees. The training dataset has the following features: id, date, full review, full review summary, and a binary safe/unsafe tag. During training, any data sample with missing features was dropped. In a few instances, the test set was found to be missing the full review text field. For this use case, which is the most effective course of action to address test data samples with missing features? 
A) Drop the test samples with missing full review text fields, and then run through the test set.
B) Copy the summary text fields and use them to fill in the missing full review text fields, and then run through the test set.
C) Use an algorithm that handles missing data better than decision trees.
D) Generate synthetic data to fill in the fields that are missing data, and then run through the test set.
 
Answer  12)
B

 

 

Notes 12) 

In this case, a full review summary usually contains the most descriptive phrases of the entire review and is a valid stand-in for the missing full review text field. For supporting information, refer to page 1627 at this link, and this link and this link.

Reference 12) Here

 

 
Question 13) An insurance company needs to automate claim compliance reviews because human reviews are expensive and error-prone. The company has a large set of claims and a compliance label for each. Each claim consists of a few sentences in English, many of which contain complex related information. Management would like to use Amazon SageMaker built-in algorithms to design a machine learning supervised model that can be trained to read each claim and predict if the claim is compliant or not. Which approach should be used to extract features from the claims to be used as inputs for the downstream supervised task? 
A) Derive a dictionary of tokens from claims in the entire dataset. Apply one-hot encoding to tokens found in each claim of the training set. Send the derived features space as inputs to an Amazon SageMaker builtin supervised learning algorithm.
B) Apply Amazon SageMaker BlazingText in Word2Vec mode to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
C) Apply Amazon SageMaker BlazingText in classification mode to labeled claims in the training set to derive features for the claims that correspond to the compliant and non-compliant labels, respectively.
D) Apply Amazon SageMaker Object2Vec to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
 

Answer  13)

D

 

Notes 13)

Amazon SageMaker Object2Vec generalizes the Word2Vec embedding technique for words to more complex objects, such as sentences and paragraphs. Since the supervised learning task is at the level of whole claims, for which there are labels, and no labels are available at the word level, Object2Vec needs be used instead of Word2Vec.

Reference 13)  Amazon SageMaker
Object2Vec 

Question 14) You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to immediately be processed before operations can continue. The second event type includes data of less importance, but operations can continue without immediately processing. What is the most appropriate solution to record these different types of events?

A) Capture both events with the PutRecords API call.
B) Capture both event types using the Kinesis Producer Library (KPL).
C) Capture the mission critical events with the PutRecords API call and the second event type with the Kinesis Producer Library (KPL).
D) Capture the mission critical events with the Kinesis Producer Library (KPL) and the second event type with the Putrecords API call.
 

Answer  14)

C

 

Notes 14)

The question is about sending data to Kinesis synchronously vs. asynchronously. PutRecords is a synchronous send function, so it must be used for the first event type (critical events). The Kinesis Producer Library (KPL) implements an asynchronous send function, so it can be used for the second event type. In this scenario, the reason to use the KPL over the PutRecords API call is because: KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable). Larger values of RecordMaxBufferedTime results in higher packing efficiencies and better performance. Applications that cannot tolerate this additional delay may need to use the AWS SDK directly. For more information about using the AWS SDK with Kinesis Data Streams, see Developing Producers Using the Amazon Kinesis Data Streams API with the AWS SDK for Java. For more information about RecordMaxBufferedTime and other user-configurable properties of the KPL, see Configuring the Kinesis Producer Library.

Reference 14: KCL vs PutRecords

 

Question 15) You are collecting clickstream data from an e-commerce website to make near-real time product suggestions for users actively using the site. Which combination of tools can be used to achieve the quickest recommendations and meets all of the requirements?

A) Use Kinesis Data Streams to ingest clickstream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
B) Use Kinesis Data Firehose to ingest click stream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions, then use Lambda to load these results into S3.
C) Use Kinesis Data Streams to ingest clickstream data, then use Lambda to process that data and write it to S3. Once the data is on S3, use Athena to query based on conditions that data and make real time recommendations to users.
D) Use the Kinesis Data Analytics to ingest the clickstream data directly and run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
 

Answer  15)

A

 

Notes 15)

Kinesis Data Analytics gets its input streaming data from Kinesis Data Streams or Kinesis Data Firehose. You can use Kinesis Data Analytics to run real-time SQL queries on your data. Once certain conditions are met you can trigger Lambda functions to make real time product suggestions to users. It is not important that we store or persist the clickstream data.

Reference 15: Kinesis Data Analytics

Question 16) Which service built by AWS makes it easy to set up a retry mechanism, aggregate records to improve throughput, and automatically submits CloudWatch metrics?

A) Kinesis API (AWS SDK)
B) Kinesis Producer Library (KPL)
C) Kinesis Consumer Library
D) Kinesis Client Library (KCL)

Answer  16)

B

 

Notes 16)

Although the Kinesis API built into the AWS SDK can be used for all of this, the Kinesis Producer Library (KPL) makes it easy to integrate all of this into your applications.

Reference 16:  Kinesis Producer Library (KPL) 

[appbox appstore 1611045854-iphone screenshots]

[appbox microsoftstore  9n8rl80hvm4t-mobile screenshots]

Question 17) You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data that you are ingesting is players controller inputs every 1 second (up to 10 players in a game) that is in JSON format. The data needs to be ingested through Kinesis Data Streams and the JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data?

A) 10 shards
B) Greater than 500 shards, so you’ll need to request more shards from AWS
C) 1 shard
D) 100 shards

Answer  17)

C

 

Notes 17)

In this scenario, there will be a maximum of 10 records per second with a max payload size of 1000 KB (10 records x 100 KB = 1000KB) written to the shard. A single shard can ingest up to 1 MB of data per second, which is enough to ingest the 1000 KB from the streaming game play. Therefor 1 shard is enough to handle the streaming data.

Reference 17: shards

Question 18) Which services in the Kinesis family allows you to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time?

A) Kinesis Streams
B) Kinesis Firehose
C) Kinesis Video Streams
D) Kinesis Data Analytics

Answer  18)

D

 

Notes 18)

Kinesis Data Analytics allows you to run real-time SQL queries on your data to gain insights and respond to events in real time.

Reference 18: Kinesis Data Analytics

 

Question 19) You are a ML specialist needing to collect data from Twitter tweets. Your goal is to collect tweets that include only the name of your company and the tweet body, and store it off into a data store in AWS. What set of tools can you use to stream, transform, and load the data into AWS with the LEAST amount of effort?

A) Setup a Kinesis Data Firehose for data ingestion and immediately write that data to S3. Next, setup a Lambda function to trigger when data lands in S3 to transform it and finally write it to DynamoDB.
B) Setup A Kinesis Data Stream for data ingestion, setup EC2 instances as data consumers to poll and transform the data from the stream. Once the data is transformed, make an API call to write the data to DynamoDB.
C) Setup Kinesis Data Streams for data ingestion. Next, setup Kinesis Data Firehouse to load that data into RedShift. Next, setup a Lambda function to query data using RedShift spectrum and store the results onto DynamoDB.
D) Create a Kinesis Data Stream to ingest the data. Next, setup a Kinesis Data Firehose and use Lambda to transform the data from the Kinesis Data Stream, then use Lambda to write the data to DynamoDB. Finally, use S3 as the data destination for Kinesis Data Firehose.
 

Answer 19)

A

Notes 19)

All of these could be used to stream, transform, and load the data into an AWS data store. The setup that requires the LEAST amount of effort and moving parts involves setting up a Kinesis Data Firehose to stream the data into S3, have it transformed by Lambda with an S3 trigger, and then written to DynamoDB.

Reference 19: Kinesis Data Firehose to stream the data into S3

Question 20) Which service in the Kinesis family allows you to build custom applications that process or analyze streaming data for specialized needs?

A) Kinesis Firehose
B) Kinesis Streams
C) Kinesis Video Streams
D) Kinesis Data Analytics

Answer 20)

B

Notes 20)

Kinesis Streams allows you to stream data into AWS and build custom applications around that streaming data.

Reference 20: Kinesis Streams

Question21

Answer21:

 

Notes 21: 

Question22

Answer22:

 

Notes 22: 

Question23

Answer23:

 

Notes 23: 

Question24

Answer24:

 

Notes 24: 

What are the Top 100 AWS and Google Certified Machine Learning Specialty Questions and Answers Dumps?

This blog is the best way  is the best way to prepare for your upcoming  AWS Certified Machine Learning Specialty and Google Certified Professional Machine Learning Engineer exam. With over 100 questions and answers, this blog provides quizzes similar  that are very similar to the real exam. It also includes  the option to show and hide answers. Additionally, there are machine learning interview questions and detailed answers, as well as cheat sheets and illustrations. This blog is the best way to make sure you are well-prepared for your AWS Certified Machine Learning Specialty Exam.

The typical Google Machine Learning Engineer salary is $147,218. Machine Learning Engineer salaries at Google can range from $110,000 – $152,183.

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

  • By the end of 2020, 85% of customer interactions will be handled without a human (Call Center, Chatbot, etc…)
  • 61% of marketers say artificial intelligence is the most important aspect of their data strategy.
  • 80% of business and tech leaders say AI already boosts productivity (Robotic Process Automation, Power Automate, etc..)
  • Current AI technology can boost business productivity by up to 40%

AWS Machine Learning Certification Specialty Exam Prep for iOs Android Windows10/11

AWS machine Learning Specialty Exam Prep MLS-C01
AWS machine Learning Specialty Exam Prep MLS-C01

GCP Professional Machine Learning Engineer for iOs, Android, Windows 10/11

Quizzes, Practice Exams: Framing, Architecting, Designing, Developing ML Problems & Solutions, ML Jobs Interview Q&A

GCP Professional Machine Learning Engineer
GCP Professional Machine Learning Engineer

 

Azure AI Fundamentals AI-900 Exam Prep App for iOS, Android, Windows10/11

Basics and Advanced Machine Learning Quizzes on Azure, Azure Machine Learning Job Interviews Questions and Answer, ML Cheat Sheets

Azure AI Fundamentals AI-900 Exam Prep
Azure AI Fundamentals AI-900 Exam Prep

Machine Learning For Dummies App for iOs, Android, Windows10/11

Use this App to learn about Machine Learning and Elevate your Brain with Machine Learning Quizzes, Cheat Sheets, Ml Jobs Interview Questions and Answers updated daily.

Machine Learning For Dummies
Machine Learning For Dummies

What does a Professional Machine Learning Engineer do?

Professional Machine Learning Engineer designs, builds, and productionizes ML models to solve business challenges using Google Cloud technologies and knowledge of proven ML models and techniques. The ML Engineer collaborates closely with other job roles to ensure long-term success of models. The ML Engineer should be proficient in all aspects of model architecture, data pipeline interaction, and metrics interpretation. The ML Engineer needs familiarity with application development, infrastructure management, data engineering, and security. Through an understanding of training, retraining, deploying, scheduling, monitoring, and improving models, they design and create scalable solutions for optimal performance.

The AWS Certified Machine Learning – Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate’s ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.

This blog covers Machine Learning 101, Top 20 AWS Certified Machine Learning Specialty Questions and Answers, Top 20 Google Professional Machine Learning Engineer Sample Questions, Machine Learning Quizzes, Machine Learning Q&A, Top 10 Machine Learning Algorithms, Machine Learning Latest Hot News, Machine Learning Demos (Ex: Tensorflow Demos)

Below are the Top 100 AWS Certified Machine Learning Specialty Questions and Answers Dumps.

Top

 

Question1: A machine learning team has several large CSV datasets in Amazon S3. Historically, models built with the Amazon SageMaker Linear Learner algorithm have taken hours to train on similar-sized datasets. The team’s leaders need to accelerate the training process. What can a machine learning specialist do to address this concern?

A) Use Amazon SageMaker Pipe mode.
B) Use Amazon Machine Learning to train the models.
C) Use Amazon Kinesis to stream the data to Amazon SageMaker.
D) Use AWS Glue to transform the CSV dataset to the JSON format.
ANSWER1:

A

Notes/Hint1:


Amazon SageMaker Pipe mode streams the data directly to the container, which improves the performance of training jobs. (Refer to this link for supporting information.) In Pipe mode, your training job streams data directly from Amazon S3. Streaming can provide faster start times for training jobs and better throughput. With Pipe mode, you also reduce the size of the Amazon EBS volumes for your training instances. B would not apply in this scenario. C is a streaming ingestion solution, but is not applicable in this scenario. D transforms the data structure.

Reference1: Amazon SageMaker

Question 2) A local university wants to track cars in a parking lot to determine which students are parking in the lot. The university is wanting to ingest videos of the cars parking in near-real time, use machine learning to identify license plates, and store that data in an AWS data store. Which solution meets these requirements with the LEAST amount of development effort?

A) Use Amazon Kinesis Data Streams to ingest the video in near-real time, use the Kinesis Data Streams consumer integrated with Amazon Rekognition Video to process the license plate information, and then store results in DynamoDB.

B) Use Amazon Kinesis Video Streams to ingest the videos in near-real time, use the Kinesis Video Streams integration with Amazon Rekognition Video to identify the license plate information, and then store the results in DynamoDB.

C) Use Amazon Kinesis Data Streams to ingest videos in near-real time, call Amazon Rekognition to identify license plate information, and then store results in DynamoDB.

D) Use Amazon Kinesis Firehose to ingest the video in near-real time and outputs results onto S3. Set up a Lambda function that triggers when a new video is PUT onto S3 to send results to Amazon Rekognition to identify license plate information, and then store results in DynamoDB.

Answer 2)

B

Notes/Hint2)

Kinesis Video Streams is used to stream videos in near-real time. Amazon Rekognition Video uses Amazon Kinesis Video Streams to receive and process a video stream. After the videos have been processed by Rekognition we can output the results in DynamoDB.

Reference: Kinesis Video Streams

Question 3) A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences:

1. Please call the number below.
2. Please do not call us. What are the dimensions of the tf–idf matrix?
A) (2, 16)
B) (2, 8)
C) (2, 10)
D) (8, 10)

ANSWER3:

A

Notes/Hint3:

There are 2 sentences, 8 unique unigrams, and 8 unique bigrams, so the result would be (2,16). The phrases are “Please call the number below” and “Please do not call us.” Each word individually (unigram) is “Please,” “call,” ”the,” ”number,” “below,” “do,” “not,” and “us.” The unique bigrams are “Please call,” “call the,” ”the number,” “number below,” “Please do,” “do not,” “not call,” and “call us.” The tf–idf vectorizer is described at this link.

Reference3:  tf-idf vertorizer

Question 4: A company is setting up a system to manage all of the datasets it stores in Amazon S3. The company would like to automate running transformation jobs on the data and maintaining a catalog of the metadata concerning the datasets. The solution should require the least amount of setup and maintenance. Which solution will allow the company to achieve its goals? 

A) Create an Amazon EMR cluster with Apache Hive installed. Then, create a Hive metastore and a script to run transformation jobs on a schedule.
B) Create an AWS Glue crawler to populate the AWS Glue Data Catalog. Then, author an AWS Glue ETL job, and set up a schedule for data transformation jobs.
C) Create an Amazon EMR cluster with Apache Spark installed. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule. D) Create an AWS Data Pipeline that transforms the data. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
 

ANSWER4:

B

Notes/Hint4:

AWS Glue is the correct answer because this option requires the least amount of setup and maintenance since it is serverless, and it does not require management of the infrastructure. Refer to this link for supporting information. A, C, and D are all solutions that can solve the problem, but require more steps for configuration, and require higher operational overhead to run and maintain.
Reference4:  Glue

Question 5) Which service in the Kinesis family allows you to easily load streaming data into data stores and analytics tools?

A) Kinesis Firehose
B) Kinesis Streams
C) Kinesis Data Analytics
D) Kinesis Video Streams
 

ANSWER5:

A

Notes/Hint5:

Kinesis Firehose is perfect for streaming data into AWS and sending it directly to its final destination – places like S3, Redshift, Elastisearch, and Splunk Instances.

Reference 5): Kinesis Firehose

Question 6) A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the data scientist do to improve the training process? 
A) Increase the learning rate. Keep the batch size the same.
B) Reduce the batch size. Decrease the learning rate.
C) Keep the batch size the same. Decrease the learning rate.
D) Do not change the learning rate. Increase the batch size.
 
Answer  6)
B
 

Notes 6)

It is most likely that the loss function is very curvy and has multiple local minima where the training is getting stuck. Decreasing the batch size would help the data scientist stochastically get out of the local minima saddles. Decreasing the learning rate would prevent overshooting the global loss function minimum. Refer to the paper at this link for an explanation.
Reference 6) : Here

Question 7) Your organization has a standalone Javascript (Node.js) application that streams data into AWS using Kinesis Data Streams. You notice that they are using the Kinesis API (AWS SDK) over the Kinesis Producer Library (KPL). What might be the reasoning behind this?
A) The Kinesis API (AWS SDK) provides greater functionality over the Kinesis Producer Library.
B) The Kinesis API (AWS SDK) runs faster in Javascript applications over the Kinesis Producer Library.
C) The Kinesis Producer Library must be installed as a Java application to use with Kinesis Data Streams.
D) The Kinesis Producer Library cannot be integrated with a Javascript application because of its asynchronous architecture.
Answer 7)
C
Notes/Hint7:
The KPL must be installed as a Java application before it can be used with your Kinesis Data Streams. There are ways to process KPL serialized data within AWS Lambda, in Java, Node.js, and Python, but not if these answers mentions Lambda.
Reference 7) KPL
 
 
Question 8) A data scientist is evaluating different binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result. The models should be evaluated based on the following criteria: 
1) Must have a recall rate of at least 80%
2) Must have a false positive rate of 10% or less
3) Must minimize business costs After creating each binary classification model, the data scientist generates the corresponding confusion matrix. Which confusion matrix represents the model that satisfies the requirements?
A) TN = 91, FP = 9 FN = 22, TP = 78
 B) TN = 99, FP = 1 FN = 21, TP = 79
C) TN = 96, FP = 4 FN = 10, TP = 90
D) TN = 98, FP = 2 FN = 18, TP = 82
 
Answer 8): 
D
 

Notes/Hint 8)


The following calculations are required: TP = True Positive FP = False Positive FN = False Negative TN = True Negative FN = False Negative Recall = TP / (TP + FN) False Positive Rate (FPR) = FP / (FP + TN) Cost = 5 * FP + FN A B C D Recall 78 / (78 + 22) = 0.78 79 / (79 + 21) = 0.79 90 / (90 + 10) = 0.9 82 / (82 + 18) = 0.82 False Positive Rate 9 / (9 + 91) = 0.09 1 / (1 + 99) = 0.01 4 / (4 + 96) = 0.04 2 / (2 + 98) = 0.02 Costs 5 * 9 + 22 = 67 5 * 1 + 21 = 26 5 * 4 + 10 = 30 5 * 2 + 18 = 28 Options C and D have a recall greater than 80% and an FPR less than 10%, but D is the most cost effective. For supporting information, refer to this link.
Reference 8: Here

 
 
Question 9) A data scientist uses logistic regression to build a fraud detection model. While the model accuracy is 99%, 90% of the fraud cases are not detected by the model. What action will definitely help the model detect more than 10% of fraud cases? 
A) Using undersampling to balance the dataset
B) Decreasing the class probability threshold
C) Using regularization to reduce overfitting
D) Using oversampling to balance the dataset
 

Answer  9)

B

 

Notes 9)


Decreasing the class probability threshold makes the model more sensitive and, therefore, marks more cases as the positive class, which is fraud in this case. This will increase the likelihood of fraud detection. However, it comes at the price of lowering precision. This is covered in the Discussion section of the paper at this link
Reference 9: Here

 

 
Question 10) A company is interested in building a fraud detection model. Currently, the data scientist does not have a sufficient amount of information due to the low number of fraud cases. Which method is MOST likely to detect the GREATEST number of valid fraud cases?
A) Oversampling using bootstrapping
B) Undersampling
C) Oversampling using SMOTE
D) Class weight adjustment
 

Answer  10)

C

 
Notes 10)

With datasets that are not fully populated, the Synthetic Minority Over-sampling Technique (SMOTE) adds new information by adding synthetic data points to the minority class. This technique would be the most effective in this scenario. Refer to Section 4.2 at this link for supporting information.
Reference 10) : Here
 
Question 11) A machine learning engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm. The ML engineer notices the target label classes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%. What should the ML engineer do to minimize bias due to missing values? 
 
A) Replace each missing value by the mean or median across non-missing values in same row.
B) Delete observations that contain missing values because these represent less than 5% of the data.
C) Replace each missing value by the mean or median across non-missing values in the same column.
D) For each feature, approximate the missing values using supervised learning based on other features.
 

Answer  11)

D

 

Notes 11)

Use supervised learning to predict missing values based on the values of other features. Different supervised learning approaches might have different performances, but any properly implemented supervised learning approach should provide the same or better approximation than mean or median approximation, as proposed in responses A and C. Supervised learning applied to the imputation of missing values is an active field of research. Refer to this link for an example.
Reference 11): Here

 
Question 12) A company has collected customer comments on its products, rating them as safe or unsafe, using decision trees. The training dataset has the following features: id, date, full review, full review summary, and a binary safe/unsafe tag. During training, any data sample with missing features was dropped. In a few instances, the test set was found to be missing the full review text field. For this use case, which is the most effective course of action to address test data samples with missing features? 
A) Drop the test samples with missing full review text fields, and then run through the test set.
B) Copy the summary text fields and use them to fill in the missing full review text fields, and then run through the test set.
C) Use an algorithm that handles missing data better than decision trees.
D) Generate synthetic data to fill in the fields that are missing data, and then run through the test set.
 
Answer  12)
B

 

 

Notes 12) 

In this case, a full review summary usually contains the most descriptive phrases of the entire review and is a valid stand-in for the missing full review text field. For supporting information, refer to page 1627 at this link, and this link and this link.

Reference 12) Here

 

 
Question 13) An insurance company needs to automate claim compliance reviews because human reviews are expensive and error-prone. The company has a large set of claims and a compliance label for each. Each claim consists of a few sentences in English, many of which contain complex related information. Management would like to use Amazon SageMaker built-in algorithms to design a machine learning supervised model that can be trained to read each claim and predict if the claim is compliant or not. Which approach should be used to extract features from the claims to be used as inputs for the downstream supervised task? 
A) Derive a dictionary of tokens from claims in the entire dataset. Apply one-hot encoding to tokens found in each claim of the training set. Send the derived features space as inputs to an Amazon SageMaker builtin supervised learning algorithm.
B) Apply Amazon SageMaker BlazingText in Word2Vec mode to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
C) Apply Amazon SageMaker BlazingText in classification mode to labeled claims in the training set to derive features for the claims that correspond to the compliant and non-compliant labels, respectively.
D) Apply Amazon SageMaker Object2Vec to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
 

Answer  13)

D

 

Notes 13)

Amazon SageMaker Object2Vec generalizes the Word2Vec embedding technique for words to more complex objects, such as sentences and paragraphs. Since the supervised learning task is at the level of whole claims, for which there are labels, and no labels are available at the word level, Object2Vec needs be used instead of Word2Vec.

Reference 13)  Amazon SageMaker
Object2Vec 

Question 14) You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to immediately be processed before operations can continue. The second event type includes data of less importance, but operations can continue without immediately processing. What is the most appropriate solution to record these different types of events?

A) Capture both events with the PutRecords API call.
B) Capture both event types using the Kinesis Producer Library (KPL).
C) Capture the mission critical events with the PutRecords API call and the second event type with the Kinesis Producer Library (KPL).
D) Capture the mission critical events with the Kinesis Producer Library (KPL) and the second event type with the Putrecords API call.
 

Answer  14)

C

 

Notes 14)

The question is about sending data to Kinesis synchronously vs. asynchronously. PutRecords is a synchronous send function, so it must be used for the first event type (critical events). The Kinesis Producer Library (KPL) implements an asynchronous send function, so it can be used for the second event type. In this scenario, the reason to use the KPL over the PutRecords API call is because: KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable). Larger values of RecordMaxBufferedTime results in higher packing efficiencies and better performance. Applications that cannot tolerate this additional delay may need to use the AWS SDK directly. For more information about using the AWS SDK with Kinesis Data Streams, see Developing Producers Using the Amazon Kinesis Data Streams API with the AWS SDK for Java. For more information about RecordMaxBufferedTime and other user-configurable properties of the KPL, see Configuring the Kinesis Producer Library.

Reference 14: KCL vs PutRecords

 

Question 15) You are collecting clickstream data from an e-commerce website to make near-real time product suggestions for users actively using the site. Which combination of tools can be used to achieve the quickest recommendations and meets all of the requirements?

A) Use Kinesis Data Streams to ingest clickstream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
B) Use Kinesis Data Firehose to ingest click stream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions, then use Lambda to load these results into S3.
C) Use Kinesis Data Streams to ingest clickstream data, then use Lambda to process that data and write it to S3. Once the data is on S3, use Athena to query based on conditions that data and make real time recommendations to users.
D) Use the Kinesis Data Analytics to ingest the clickstream data directly and run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
 

Answer  15)

A

 

Notes 15)

Kinesis Data Analytics gets its input streaming data from Kinesis Data Streams or Kinesis Data Firehose. You can use Kinesis Data Analytics to run real-time SQL queries on your data. Once certain conditions are met you can trigger Lambda functions to make real time product suggestions to users. It is not important that we store or persist the clickstream data.

Reference 15: Kinesis Data Analytics

Question 16) Which service built by AWS makes it easy to set up a retry mechanism, aggregate records to improve throughput, and automatically submits CloudWatch metrics?

A) Kinesis API (AWS SDK)
B) Kinesis Producer Library (KPL)
C) Kinesis Consumer Library
D) Kinesis Client Library (KCL)

Answer  16)

B

 

Notes 16)

Although the Kinesis API built into the AWS SDK can be used for all of this, the Kinesis Producer Library (KPL) makes it easy to integrate all of this into your applications.

Reference 16:  Kinesis Producer Library (KPL) 

[appbox appstore 1611045854-iphone screenshots]

[appbox microsoftstore  9n8rl80hvm4t-mobile screenshots]

Question 17) You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data that you are ingesting is players controller inputs every 1 second (up to 10 players in a game) that is in JSON format. The data needs to be ingested through Kinesis Data Streams and the JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data?

A) 10 shards
B) Greater than 500 shards, so you’ll need to request more shards from AWS
C) 1 shard
D) 100 shards

Answer  17)

C

 

Notes 17)

In this scenario, there will be a maximum of 10 records per second with a max payload size of 1000 KB (10 records x 100 KB = 1000KB) written to the shard. A single shard can ingest up to 1 MB of data per second, which is enough to ingest the 1000 KB from the streaming game play. Therefor 1 shard is enough to handle the streaming data.

Reference 17: shards

Question 18) Which services in the Kinesis family allows you to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time?

A) Kinesis Streams
B) Kinesis Firehose
C) Kinesis Video Streams
D) Kinesis Data Analytics

Answer  18)

D

 

Notes 18)

Kinesis Data Analytics allows you to run real-time SQL queries on your data to gain insights and respond to events in real time.

Reference 18: Kinesis Data Analytics

 

Question 19) You are a ML specialist needing to collect data from Twitter tweets. Your goal is to collect tweets that include only the name of your company and the tweet body, and store it off into a data store in AWS. What set of tools can you use to stream, transform, and load the data into AWS with the LEAST amount of effort?

A) Setup a Kinesis Data Firehose for data ingestion and immediately write that data to S3. Next, setup a Lambda function to trigger when data lands in S3 to transform it and finally write it to DynamoDB.
B) Setup A Kinesis Data Stream for data ingestion, setup EC2 instances as data consumers to poll and transform the data from the stream. Once the data is transformed, make an API call to write the data to DynamoDB.
C) Setup Kinesis Data Streams for data ingestion. Next, setup Kinesis Data Firehouse to load that data into RedShift. Next, setup a Lambda function to query data using RedShift spectrum and store the results onto DynamoDB.
D) Create a Kinesis Data Stream to ingest the data. Next, setup a Kinesis Data Firehose and use Lambda to transform the data from the Kinesis Data Stream, then use Lambda to write the data to DynamoDB. Finally, use S3 as the data destination for Kinesis Data Firehose.
 

Answer 19)

A

Notes 19)

All of these could be used to stream, transform, and load the data into an AWS data store. The setup that requires the LEAST amount of effort and moving parts involves setting up a Kinesis Data Firehose to stream the data into S3, have it transformed by Lambda with an S3 trigger, and then written to DynamoDB.

Reference 19: Kinesis Data Firehose to stream the data into S3

Question 20) Which service in the Kinesis family allows you to build custom applications that process or analyze streaming data for specialized needs?

A) Kinesis Firehose
B) Kinesis Streams
C) Kinesis Video Streams
D) Kinesis Data Analytics

Answer 20)

B

Notes 20)

Kinesis Streams allows you to stream data into AWS and build custom applications around that streaming data.

Reference 20: Kinesis Streams

Question21:

Answer21:

What are the Top 100 AWS and Google Certified Machine Learning Specialty Questions and Answers Dumps?

This blog is the best way  is the best way to prepare for your upcoming  AWS Certified Machine Learning Specialty and Google Certified Professional Machine Learning Engineer exam. With over 100 questions and answers, this blog provides quizzes similar  that are very similar to the real exam. It also includes  the option to show and hide answers. Additionally, there are machine learning interview questions and detailed answers, as well as cheat sheets and illustrations. This blog is the best way to make sure you are well-prepared for your AWS Certified Machine Learning Specialty Exam.

The typical Google Machine Learning Engineer salary is $147,218. Machine Learning Engineer salaries at Google can range from $110,000 – $152,183.

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

  • By the end of 2020, 85% of customer interactions will be handled without a human (Call Center, Chatbot, etc…)
  • 61% of marketers say artificial intelligence is the most important aspect of their data strategy.
  • 80% of business and tech leaders say AI already boosts productivity (Robotic Process Automation, Power Automate, etc..)
  • Current AI technology can boost business productivity by up to 40%

AWS Machine Learning Certification Specialty Exam Prep for iOs Android Windows10/11

AWS machine Learning Specialty Exam Prep MLS-C01
AWS machine Learning Specialty Exam Prep MLS-C01

GCP Professional Machine Learning Engineer for iOs, Android, Windows 10/11

Quizzes, Practice Exams: Framing, Architecting, Designing, Developing ML Problems & Solutions, ML Jobs Interview Q&A

GCP Professional Machine Learning Engineer
GCP Professional Machine Learning Engineer

 

Azure AI Fundamentals AI-900 Exam Prep App for iOS, Android, Windows10/11

Basics and Advanced Machine Learning Quizzes on Azure, Azure Machine Learning Job Interviews Questions and Answer, ML Cheat Sheets

Azure AI Fundamentals AI-900 Exam Prep
Azure AI Fundamentals AI-900 Exam Prep

Machine Learning For Dummies App for iOs, Android, Windows10/11

Use this App to learn about Machine Learning and Elevate your Brain with Machine Learning Quizzes, Cheat Sheets, Ml Jobs Interview Questions and Answers updated daily.

Machine Learning For Dummies
Machine Learning For Dummies

What does a Professional Machine Learning Engineer do?

Professional Machine Learning Engineer designs, builds, and productionizes ML models to solve business challenges using Google Cloud technologies and knowledge of proven ML models and techniques. The ML Engineer collaborates closely with other job roles to ensure long-term success of models. The ML Engineer should be proficient in all aspects of model architecture, data pipeline interaction, and metrics interpretation. The ML Engineer needs familiarity with application development, infrastructure management, data engineering, and security. Through an understanding of training, retraining, deploying, scheduling, monitoring, and improving models, they design and create scalable solutions for optimal performance.

The AWS Certified Machine Learning – Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate’s ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.

This blog covers Machine Learning 101, Top 20 AWS Certified Machine Learning Specialty Questions and Answers, Top 20 Google Professional Machine Learning Engineer Sample Questions, Machine Learning Quizzes, Machine Learning Q&A, Top 10 Machine Learning Algorithms, Machine Learning Latest Hot News, Machine Learning Demos (Ex: Tensorflow Demos)

Below are the Top 100 AWS Certified Machine Learning Specialty Questions and Answers Dumps.

Top

 

Question1: A machine learning team has several large CSV datasets in Amazon S3. Historically, models built with the Amazon SageMaker Linear Learner algorithm have taken hours to train on similar-sized datasets. The team’s leaders need to accelerate the training process. What can a machine learning specialist do to address this concern?

A) Use Amazon SageMaker Pipe mode.
B) Use Amazon Machine Learning to train the models.
C) Use Amazon Kinesis to stream the data to Amazon SageMaker.
D) Use AWS Glue to transform the CSV dataset to the JSON format.
ANSWER1:

A

Notes/Hint1:


Amazon SageMaker Pipe mode streams the data directly to the container, which improves the performance of training jobs. (Refer to this link for supporting information.) In Pipe mode, your training job streams data directly from Amazon S3. Streaming can provide faster start times for training jobs and better throughput. With Pipe mode, you also reduce the size of the Amazon EBS volumes for your training instances. B would not apply in this scenario. C is a streaming ingestion solution, but is not applicable in this scenario. D transforms the data structure.

Reference1: Amazon SageMaker

Question 2) A local university wants to track cars in a parking lot to determine which students are parking in the lot. The university is wanting to ingest videos of the cars parking in near-real time, use machine learning to identify license plates, and store that data in an AWS data store. Which solution meets these requirements with the LEAST amount of development effort?

A) Use Amazon Kinesis Data Streams to ingest the video in near-real time, use the Kinesis Data Streams consumer integrated with Amazon Rekognition Video to process the license plate information, and then store results in DynamoDB.

B) Use Amazon Kinesis Video Streams to ingest the videos in near-real time, use the Kinesis Video Streams integration with Amazon Rekognition Video to identify the license plate information, and then store the results in DynamoDB.

C) Use Amazon Kinesis Data Streams to ingest videos in near-real time, call Amazon Rekognition to identify license plate information, and then store results in DynamoDB.

D) Use Amazon Kinesis Firehose to ingest the video in near-real time and outputs results onto S3. Set up a Lambda function that triggers when a new video is PUT onto S3 to send results to Amazon Rekognition to identify license plate information, and then store results in DynamoDB.

Answer 2)

B

Notes/Hint2)

Kinesis Video Streams is used to stream videos in near-real time. Amazon Rekognition Video uses Amazon Kinesis Video Streams to receive and process a video stream. After the videos have been processed by Rekognition we can output the results in DynamoDB.

Reference: Kinesis Video Streams

Question 3) A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences:

1. Please call the number below.
2. Please do not call us. What are the dimensions of the tf–idf matrix?
A) (2, 16)
B) (2, 8)
C) (2, 10)
D) (8, 10)

ANSWER3:

A

Notes/Hint3:

There are 2 sentences, 8 unique unigrams, and 8 unique bigrams, so the result would be (2,16). The phrases are “Please call the number below” and “Please do not call us.” Each word individually (unigram) is “Please,” “call,” ”the,” ”number,” “below,” “do,” “not,” and “us.” The unique bigrams are “Please call,” “call the,” ”the number,” “number below,” “Please do,” “do not,” “not call,” and “call us.” The tf–idf vectorizer is described at this link.

Reference3:  tf-idf vertorizer

Question 4: A company is setting up a system to manage all of the datasets it stores in Amazon S3. The company would like to automate running transformation jobs on the data and maintaining a catalog of the metadata concerning the datasets. The solution should require the least amount of setup and maintenance. Which solution will allow the company to achieve its goals? 

A) Create an Amazon EMR cluster with Apache Hive installed. Then, create a Hive metastore and a script to run transformation jobs on a schedule.
B) Create an AWS Glue crawler to populate the AWS Glue Data Catalog. Then, author an AWS Glue ETL job, and set up a schedule for data transformation jobs.
C) Create an Amazon EMR cluster with Apache Spark installed. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule. D) Create an AWS Data Pipeline that transforms the data. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
 

ANSWER4:

B

Notes/Hint4:

AWS Glue is the correct answer because this option requires the least amount of setup and maintenance since it is serverless, and it does not require management of the infrastructure. Refer to this link for supporting information. A, C, and D are all solutions that can solve the problem, but require more steps for configuration, and require higher operational overhead to run and maintain.
Reference4:  Glue

Question 5) Which service in the Kinesis family allows you to easily load streaming data into data stores and analytics tools?

A) Kinesis Firehose
B) Kinesis Streams
C) Kinesis Data Analytics
D) Kinesis Video Streams
 

ANSWER5:

A

Notes/Hint5:

Kinesis Firehose is perfect for streaming data into AWS and sending it directly to its final destination – places like S3, Redshift, Elastisearch, and Splunk Instances.

Reference 5): Kinesis Firehose

Question 6) A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the data scientist do to improve the training process? 

A) Increase the learning rate. Keep the batch size the same.
B) Reduce the batch size. Decrease the learning rate.
C) Keep the batch size the same. Decrease the learning rate.
D) Do not change the learning rate. Increase the batch size.
 
Answer  6)
B
 

Notes 6)

It is most likely that the loss function is very curvy and has multiple local minima where the training is getting stuck. Decreasing the batch size would help the data scientist stochastically get out of the local minima saddles. Decreasing the learning rate would prevent overshooting the global loss function minimum. Refer to the paper at this link for an explanation.
Reference 6) : Here

Question 7) Your organization has a standalone Javascript (Node.js) application that streams data into AWS using Kinesis Data Streams. You notice that they are using the Kinesis API (AWS SDK) over the Kinesis Producer Library (KPL). What might be the reasoning behind this?

A) The Kinesis API (AWS SDK) provides greater functionality over the Kinesis Producer Library.
B) The Kinesis API (AWS SDK) runs faster in Javascript applications over the Kinesis Producer Library.
C) The Kinesis Producer Library must be installed as a Java application to use with Kinesis Data Streams.
D) The Kinesis Producer Library cannot be integrated with a Javascript application because of its asynchronous architecture.
Answer 7)
C
Notes/Hint7:
The KPL must be installed as a Java application before it can be used with your Kinesis Data Streams. There are ways to process KPL serialized data within AWS Lambda, in Java, Node.js, and Python, but not if these answers mentions Lambda.
Reference 7) KPL
 
 

Question 8) A data scientist is evaluating different binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result. The models should be evaluated based on the following criteria: 

1) Must have a recall rate of at least 80%
2) Must have a false positive rate of 10% or less
3) Must minimize business costs After creating each binary classification model, the data scientist generates the corresponding confusion matrix. Which confusion matrix represents the model that satisfies the requirements?
A) TN = 91, FP = 9 FN = 22, TP = 78
 B) TN = 99, FP = 1 FN = 21, TP = 79
C) TN = 96, FP = 4 FN = 10, TP = 90
D) TN = 98, FP = 2 FN = 18, TP = 82
 
Answer 8): 
D
 

Notes/Hint 8)


The following calculations are required: TP = True Positive FP = False Positive FN = False Negative TN = True Negative FN = False Negative Recall = TP / (TP + FN) False Positive Rate (FPR) = FP / (FP + TN) Cost = 5 * FP + FN A B C D Recall 78 / (78 + 22) = 0.78 79 / (79 + 21) = 0.79 90 / (90 + 10) = 0.9 82 / (82 + 18) = 0.82 False Positive Rate 9 / (9 + 91) = 0.09 1 / (1 + 99) = 0.01 4 / (4 + 96) = 0.04 2 / (2 + 98) = 0.02 Costs 5 * 9 + 22 = 67 5 * 1 + 21 = 26 5 * 4 + 10 = 30 5 * 2 + 18 = 28 Options C and D have a recall greater than 80% and an FPR less than 10%, but D is the most cost effective. For supporting information, refer to this link.
Reference 8: Here

 
 

Question 9) A data scientist uses logistic regression to build a fraud detection model. While the model accuracy is 99%, 90% of the fraud cases are not detected by the model. What action will definitely help the model detect more than 10% of fraud cases? 

A) Using undersampling to balance the dataset
B) Decreasing the class probability threshold
C) Using regularization to reduce overfitting
D) Using oversampling to balance the dataset
 

Answer  9)

B

 

Notes 9)


Decreasing the class probability threshold makes the model more sensitive and, therefore, marks more cases as the positive class, which is fraud in this case. This will increase the likelihood of fraud detection. However, it comes at the price of lowering precision. This is covered in the Discussion section of the paper at this link
Reference 9: Here

 

 

Question 10) A company is interested in building a fraud detection model. Currently, the data scientist does not have a sufficient amount of information due to the low number of fraud cases. Which method is MOST likely to detect the GREATEST number of valid fraud cases?

A) Oversampling using bootstrapping
B) Undersampling
C) Oversampling using SMOTE
D) Class weight adjustment
 

Answer  10)

C

 
Notes 10)

With datasets that are not fully populated, the Synthetic Minority Over-sampling Technique (SMOTE) adds new information by adding synthetic data points to the minority class. This technique would be the most effective in this scenario. Refer to Section 4.2 at this link for supporting information.
Reference 10) : Here
 

Question 11) A machine learning engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm. The ML engineer notices the target label classes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%. What should the ML engineer do to minimize bias due to missing values? 

 
A) Replace each missing value by the mean or median across non-missing values in same row.
B) Delete observations that contain missing values because these represent less than 5% of the data.
C) Replace each missing value by the mean or median across non-missing values in the same column.
D) For each feature, approximate the missing values using supervised learning based on other features.
 

Answer  11)

D

 

Notes 11)

Use supervised learning to predict missing values based on the values of other features. Different supervised learning approaches might have different performances, but any properly implemented supervised learning approach should provide the same or better approximation than mean or median approximation, as proposed in responses A and C. Supervised learning applied to the imputation of missing values is an active field of research. Refer to this link for an example.
Reference 11): Here

 

Question 12) A company has collected customer comments on its products, rating them as safe or unsafe, using decision trees. The training dataset has the following features: id, date, full review, full review summary, and a binary safe/unsafe tag. During training, any data sample with missing features was dropped. In a few instances, the test set was found to be missing the full review text field. For this use case, which is the most effective course of action to address test data samples with missing features? 

A) Drop the test samples with missing full review text fields, and then run through the test set.
B) Copy the summary text fields and use them to fill in the missing full review text fields, and then run through the test set.
C) Use an algorithm that handles missing data better than decision trees.
D) Generate synthetic data to fill in the fields that are missing data, and then run through the test set.
 
Answer  12)
B

 

 

Notes 12) 

In this case, a full review summary usually contains the most descriptive phrases of the entire review and is a valid stand-in for the missing full review text field. For supporting information, refer to page 1627 at this link, and this link and this link.

Reference 12) Here

 

 

Question 13) An insurance company needs to automate claim compliance reviews because human reviews are expensive and error-prone. The company has a large set of claims and a compliance label for each. Each claim consists of a few sentences in English, many of which contain complex related information. Management would like to use Amazon SageMaker built-in algorithms to design a machine learning supervised model that can be trained to read each claim and predict if the claim is compliant or not. Which approach should be used to extract features from the claims to be used as inputs for the downstream supervised task? 

 
A) Derive a dictionary of tokens from claims in the entire dataset. Apply one-hot encoding to tokens found in each claim of the training set. Send the derived features space as inputs to an Amazon SageMaker builtin supervised learning algorithm.
 
B) Apply Amazon SageMaker BlazingText in Word2Vec mode to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
 
C) Apply Amazon SageMaker BlazingText in classification mode to labeled claims in the training set to derive features for the claims that correspond to the compliant and non-compliant labels, respectively.
 
D) Apply Amazon SageMaker Object2Vec to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
 

Answer  13)

D

 

Notes 13)

Amazon SageMaker Object2Vec generalizes the Word2Vec embedding technique for words to more complex objects, such as sentences and paragraphs. Since the supervised learning task is at the level of whole claims, for which there are labels, and no labels are available at the word level, Object2Vec needs be used instead of Word2Vec.

Reference 13)  Amazon SageMaker
Object2Vec 

Question 14) You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to immediately be processed before operations can continue. The second event type includes data of less importance, but operations can continue without immediately processing. What is the most appropriate solution to record these different types of events?

A) Capture both events with the PutRecords API call.
B) Capture both event types using the Kinesis Producer Library (KPL).
C) Capture the mission critical events with the PutRecords API call and the second event type with the Kinesis Producer Library (KPL).
D) Capture the mission critical events with the Kinesis Producer Library (KPL) and the second event type with the Putrecords API call.
 

Answer  14)

C

 

Notes 14)

The question is about sending data to Kinesis synchronously vs. asynchronously. PutRecords is a synchronous send function, so it must be used for the first event type (critical events). The Kinesis Producer Library (KPL) implements an asynchronous send function, so it can be used for the second event type. In this scenario, the reason to use the KPL over the PutRecords API call is because: KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable). Larger values of RecordMaxBufferedTime results in higher packing efficiencies and better performance. Applications that cannot tolerate this additional delay may need to use the AWS SDK directly. For more information about using the AWS SDK with Kinesis Data Streams, see Developing Producers Using the Amazon Kinesis Data Streams API with the AWS SDK for Java. For more information about RecordMaxBufferedTime and other user-configurable properties of the KPL, see Configuring the Kinesis Producer Library.

Reference 14: KCL vs PutRecords

 

Question 15) You are collecting clickstream data from an e-commerce website to make near-real time product suggestions for users actively using the site. Which combination of tools can be used to achieve the quickest recommendations and meets all of the requirements?

A) Use Kinesis Data Streams to ingest clickstream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
 
B) Use Kinesis Data Firehose to ingest click stream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions, then use Lambda to load these results into S3.
 
C) Use Kinesis Data Streams to ingest clickstream data, then use Lambda to process that data and write it to S3. Once the data is on S3, use Athena to query based on conditions that data and make real time recommendations to users.
 
D) Use the Kinesis Data Analytics to ingest the clickstream data directly and run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
 

Answer  15)

A

 

Notes 15)

Kinesis Data Analytics gets its input streaming data from Kinesis Data Streams or Kinesis Data Firehose. You can use Kinesis Data Analytics to run real-time SQL queries on your data. Once certain conditions are met you can trigger Lambda functions to make real time product suggestions to users. It is not important that we store or persist the clickstream data.

Reference 15: Kinesis Data Analytics

Question 16) Which service built by AWS makes it easy to set up a retry mechanism, aggregate records to improve throughput, and automatically submits CloudWatch metrics?

A) Kinesis API (AWS SDK)
B) Kinesis Producer Library (KPL)
C) Kinesis Consumer Library
D) Kinesis Client Library (KCL)

Answer  16)

B

 

Notes 16)

Although the Kinesis API built into the AWS SDK can be used for all of this, the Kinesis Producer Library (KPL) makes it easy to integrate all of this into your applications.

Reference 16:  Kinesis Producer Library (KPL) 

[appbox appstore 1611045854-iphone screenshots]

[appbox microsoftstore  9n8rl80hvm4t-mobile screenshots]

Question 17) You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data that you are ingesting is players controller inputs every 1 second (up to 10 players in a game) that is in JSON format. The data needs to be ingested through Kinesis Data Streams and the JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data?

A) 10 shards
B) Greater than 500 shards, so you’ll need to request more shards from AWS
C) 1 shard
D) 100 shards

Answer  17)

C

 

Notes 17)

In this scenario, there will be a maximum of 10 records per second with a max payload size of 1000 KB (10 records x 100 KB = 1000KB) written to the shard. A single shard can ingest up to 1 MB of data per second, which is enough to ingest the 1000 KB from the streaming game play. Therefor 1 shard is enough to handle the streaming data.

Reference 17: shards

Question 18) Which services in the Kinesis family allows you to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time?

A) Kinesis Streams
B) Kinesis Firehose
C) Kinesis Video Streams
D) Kinesis Data Analytics

Answer  18)

D

 

Notes 18)

Kinesis Data Analytics allows you to run real-time SQL queries on your data to gain insights and respond to events in real time.

Reference 18: Kinesis Data Analytics

 

Question 19) You are a ML specialist needing to collect data from Twitter tweets. Your goal is to collect tweets that include only the name of your company and the tweet body, and store it off into a data store in AWS. What set of tools can you use to stream, transform, and load the data into AWS with the LEAST amount of effort?

A) Setup a Kinesis Data Firehose for data ingestion and immediately write that data to S3. Next, setup a Lambda function to trigger when data lands in S3 to transform it and finally write it to DynamoDB.
B) Setup A Kinesis Data Stream for data ingestion, setup EC2 instances as data consumers to poll and transform the data from the stream. Once the data is transformed, make an API call to write the data to DynamoDB.
C) Setup Kinesis Data Streams for data ingestion. Next, setup Kinesis Data Firehouse to load that data into RedShift. Next, setup a Lambda function to query data using RedShift spectrum and store the results onto DynamoDB.
D) Create a Kinesis Data Stream to ingest the data. Next, setup a Kinesis Data Firehose and use Lambda to transform the data from the Kinesis Data Stream, then use Lambda to write the data to DynamoDB. Finally, use S3 as the data destination for Kinesis Data Firehose.
 

Answer 19)

A

Notes 19)

All of these could be used to stream, transform, and load the data into an AWS data store. The setup that requires the LEAST amount of effort and moving parts involves setting up a Kinesis Data Firehose to stream the data into S3, have it transformed by Lambda with an S3 trigger, and then written to DynamoDB.

Reference 19: Kinesis Data Firehose to stream the data into S3

Question 20) Which service in the Kinesis family allows you to build custom applications that process or analyze streaming data for specialized needs?

A) Kinesis Firehose
B) Kinesis Streams
C) Kinesis Video Streams
D) Kinesis Data Analytics

Answer 20)

B

Notes 20)

Kinesis Streams allows you to stream data into AWS and build custom applications around that streaming data.

Reference 20: Kinesis Streams

Question21: Of the following, which is an example of machine learning? (Select TWO.)

A) Calculating the shortest route from current location to the destination

B) Optimizing product pricing based on real-time sales data

C) Sentiment analysis of text on product reviews

D) A loan approval system that classifies applicants entirely based on credit score

Answer21:

B and C

Notes 21: 

Optimizing product pricing based on real-time sales data and Sentiment analysis of text on product reviews.
 

Question22:Which of the following is an appropriate use case for unsupervised learning?

A) Partitioning an image of a street scene into multiple segments

B) Finding an optimal path out of a maze

C) Identifying clusters of housing sales based on related data points

D) Analyzing sentiment of social media posts

Answer22:

C

Notes 22: 

Identifying clusters of housing sales based on related data points

Question23

Answer23:

 

Notes 23: 

Question24: A Djamgatech retail company wants to deploy a machine learning model to predict the demand for a product using sales data from the past 5 years. What is the MOST efficient solution that the company should implement first?

A) Regression

B) Multi-class classification

C) Binary class classification

D) N/A

Answer24:

A

Notes 24: 

Question25: In which phase of the ML pipeline do you analyze the business requirements and re-frame that information into a machine learning context.

A) Problem formulation

B) Model training

C) Deployment

D)

Data preprocessing

 

Answer25:

A

Notes 25:

AWS machine Learning Specialty Exam Prep MLS-C01

iOs: https://apps.apple.com/ca/app/aws-machine-learning-prep-pro/id1611045854

Windows: https://www.microsoft.com/en-ca/p/aws-machine-learning-mls-c01-specialty-certification-exam-prep/9n8rl80hvm4t

Android/Amazon: https://www.amazon.com/gp/product/B09TZ4H8V6

AWS MLS-C01 Machine Learning Exam Prep

Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A

Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.

Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.

The App provides hundreds of quizzes and practice exam about:

– Machine Learning Operation on AWS

– Modelling

– Data Engineering

– Computer Vision,

– Exploratory Data Analysis,

– ML implementation & Operations

– Machine Learning Basics Questions and Answers

– Machine Learning Advanced Questions and Answers

– Scorecard

– Countdown timer

– Machine Learning Cheat Sheets

– Machine Learning Interview Questions and Answers

– Machine Learning Latest News

The App covers Machine Learning Basics and Advanced topics including: NLP, Computer Vision, Python, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.

Domain 1: Data Engineering

Create data repositories for machine learning.

Identify data sources (e.g., content and location, primary sources such as user data)

Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)

Identify and implement a data ingestion solution.

Data job styles/types (batch load, streaming)

Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.

Domain 2: Exploratory Data Analysis

Sanitize and prepare data for modeling.

Perform feature engineering.

Analyze and visualize data for machine learning.

Domain 3: Modeling

Frame business problems as machine learning problems.

Select the appropriate model(s) for a given machine learning problem.

Train machine learning models.

Perform hyperparameter optimization.

Evaluate machine learning models.

Domain 4: Machine Learning Implementation and Operations

Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.

Recommend and implement the appropriate machine learning services and features for a given problem.

Apply basic AWS security practices to machine learning solutions.

Deploy and operationalize machine learning solutions.

Machine Learning Services covered:

Amazon Comprehend

AWS Deep Learning AMIs (DLAMI)

AWS DeepLens

Amazon Forecast

Amazon Fraud Detector

Amazon Lex

Amazon Polly

Amazon Rekognition

Amazon SageMaker

Amazon Textract

Amazon Transcribe

Amazon Translate

Other Services and topics covered are:

Ingestion/Collection

Processing/ETL

Data analysis/visualization

Model training

Model deployment/inference

Operational

AWS ML application services

Language relevant to ML (for example, Python, Java, Scala, R, SQL)

Notebooks and integrated development environments (IDEs),

S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue, SageMaker, CSV, JSON, IMG, parquet or databases, Amazon Athena

Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service , Amazon Redshift

Sagemaker API Explained:

SageMaker API

AWS Certified Machine Learning Engineer Specialty Questions and Answers:

Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.

Answer1: Use the Build Your Own Container (BYOC) Amazon Sagemaker option.
Create a new docker container with the existing code. Register the container in Amazon Elastic Container registry. with the existing code. Register the container in Amazon Elastic Container Registry. Finally run the training and inference jobs using this container.

Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?

 

Answer2: Amazon Sagemaker Notebook instances

Amazon SageMaker enables developers and data scientists to build, train, tune, and deploy machine learning (ML) models at scale. You can deploy trained ML models for real-time or batch predictions on unseen data, a process known as inference. However, in most cases, the raw input data must be preprocessed and can’t be used directly for making predictions. This is because most ML models expect the data in a predefined format, so the raw data needs to be first cleaned and formatted in order for the ML model to process the data.  You can use the Amazon SageMaker built-in Scikit-learn library for preprocessing input data and then use the Amazon SageMaker built-in Linear Learner algorithm for predictions.

Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?

Answer3: LifeCycle Configuration

Question4: How to Choose the right Sagemaker built-in algorithm?

How to chose the right built in algorithm in SageMaker?
How to chose the right built in algorithm in SageMaker?

 

Guide to choosing the right unsupervised learning algorithm
Guide to choosing the right unsupervised learning algorithm
Choosing the right  ML algorithm based on Data Type
Choosing the right ML algorithm based on Data Type
Choosing the right ML algo based on data type
Choosing the right ML algo based on data type

This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have. 

 

Top

Top 10 Google Professional Machine Learning Engineer Sample Questions

Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?

A. Use K-fold cross validation to understand how the model performs on different test datasets.

B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.

C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.

D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.

Answer 1)

B

Notes 1)

B is correct because it identifies the pixel of the input image that leads to the classification of the image itself.

Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?

A. Train the model for a few iterations, and check for NaN values.
B. Train the model for a few iterations, and verify that the loss is constant.
C. Train a simple linear model, and determine if the DNN model outperforms it.
D. Train the model with no regularization, and verify that the loss function is close to zero.
 

Answer 2)

D

Notes 2)

D is correct because the test can check that the model has enough parameters to memorize the task.

[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]

Question 3: Your team is using a TensorFlow Inception-v3 CNN model pretrained on ImageNet for an image classification prediction challenge on 10,000 images. You will use AI Platform to perform the model training. What TensorFlow distribution strategy and AI Platform training job configuration should you use to train the model and optimize for wall-clock time?

 

A. Default Strategy; Custom tier with a single master node and four v100 GPUs.
B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.
C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.
D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.
 

Answer 3)

D

Notes 3)

D is correct because this is the only strategy that can perform distributed training; albeit there is only a single copy of the variables on the CPU host.

Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?

 
A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version
 
B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version
 
C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment
D. Create a new AI Platform model version – > Deploy model in test environment -> Validate model
 
Answer 4)
A
 
Notes 4)
A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.
 
Question 5: You work for a maintenance company and have built and trained a deep learning model that identifies defects based on thermal images of underground electric cables. Your dataset contains 10,000 images, 100 of which contain visible defects. How should you evaluate the performance of the model on a test dataset?
 
A. Calculate the Area Under the Curve (AUC) value.
 
B. Calculate the number of true positive results predicted by the model.
C. Calculate the fraction of images predicted by the model to have a visible defect.
D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.
 
Answer 5)
A
 
Notes 5)
A is correct because it is scale-invariant. AUC measures how well predictions are ranked, rather than their absolute values. AUC is also classification-threshold invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.
 
Question 6: You work for a manufacturing company that owns a high-value machine which has several machine settings and multiple sensors. A history of the machine’s hourly sensor readings and known failure event data are stored in BigQuery. You need to predict if the machine will fail within the next 3 days in order to schedule maintenance before the machine fails. Which data preparation and model training steps should you take?

 

A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML
 
B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False
D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
Answer 6)
D
 
Notes 6)
D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
 
 
Question 7: You are an ML engineer at a media company. You need to build an ML model to analyze video content frame-by-frame, identify objects, and alert users if there is inappropriate content. Which Google Cloud products should you use to build this project?

 

A. Pub/Sub, Cloud Function, Cloud Vision API
 
B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging
C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging
D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging
 
Answer 7)
C
 
Notes 7)
C is correct as Video Intelligence API can find inappropriate components and other components satisfy the requirements of real-time processing and notification.
 
Question 8: You work for a large retailer. You want to use ML to forecast future sales leveraging 10 years of historical sales data. The historical data is stored in Cloud Storage in Avro format. You want to rapidly experiment with all the available data. How should you build and train your model for the sales forecast?
 
A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.
B. Convert the data into CSV format and create a regression model on AutoML Tables.
C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.
D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.
 
Answer 8)
A
 
Notes 8)
A is correct because BigQuery ML is designed for fast and rapid experimentation and it is possible to use federated queries to read data directly from Cloud Storage. Moreover, ARIMA is considered one of the best in class for time series forecasting.
 
Question 9) You need to build an object detection model for a small startup company to identify if and where the company’s logo appears in an image. You were given a large repository of images, some with logos and some without. These images are not yet labelled. You need to label these pictures, and then train and deploy the model. What should you do?

 

A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.
B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.
 
C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.
D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.
 
Answer 9)
A
 
Notes 9)
A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.
 

Question 10) You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer’s stated intention for contacting customer service. About 70% of customer inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer and more complicated requests. Which intents should you automate first?

A. Automate a blend of the shortest and longest intents to be representative of all intents.
B. Automate the more complicated requests first because those require more of the agents’ time.
C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.
 
D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.
Answer 10)
C
 
Notes 10)

[appbox appstore 1611045854-iphone screenshots]

[appbox microsoftstore  9n8rl80hvm4t-mobile screenshots]

Machine Learning Q&A Part I:

Google.

Azure and AWS are second class citizens in this area.

Sure, AWS has 70% of the market.

Sure, Azure is the easiest turn key and super user friendly.

But, the king of machine learning in the cloud is GCP.

GCP = Google Cloud Platform

Google has the largest data science team in the world, not mention they have Hinton.

Let’s forgot for a minute they created TensorFlow and give it away.

Let’s just talk about building a real world model with data that doesn’t fit into a excel spreadsheet.

The vast majority of applied machine learning is supervised and that means we need data.

Not just normal data, we need very clean highly structured data.

Where’s the easiest place in the world to upload and model a Petabyte of structured dataBigQuery of course.

Why BigQuery? I don’t have to do anything but upload my data. No spinning up RedShit clusters or whatever I have to do in Azure, just upload and massage data with my familiar SQL. If I do have to wrangle my data it won’t take my six months to update 5 rows here, minutes usually.

Then, you’ll need a front end. Cloud datalab is a Jupyter notebook, which is good because I don’t want nor do I need anything else.

Then, with a single line of code I connect by datalab (Jupyter) notebook to my data in BigQuery and build away.

I’ve worked in all three and the only thing I care about is getting to my job the fastest and right now that means I build my models in GCP.

If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.

The course below is free to the first 20.

The Complete Python Course for Machine Learning Engineers

Here, I want to share the best research paper on Machine Learning classification methods, titled ‘Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?’, published in the ‘Journal of Machine Learning Research’.

This paper nicely explained 179 classification techniques and applied them on 121 data sets thus sharing small summary of the paper:

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

 
 
 

The paper evaluated 179 classifiers arising from 17 ML families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R ( with and without the caret package), C and Matlab, including all the relevant classifiers available today.

Experiments used total 121 data sets , which represent the whole UCI data base (excluding the large-scale problems) and other own real problems, in order to achieve significant conclusions about the classifier behaviour, not dependent on the data set collection.

The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz

The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets. However, the difference is not statistically significant with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package).

The random forest is clearly the best family of classifiers (3 out of 5 bests classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).

You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt

I hope it will be helpful for Statistic and Machine Leaning aspirants!

Thank you!

 
 
 

At a high level, these skills are a combination of software and data engineering.

The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

  • Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
  • Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
  • Model versioning: add a hash key to your different models. You will thank me later.
  • Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
  • Monitor performances: execution time and statistical scores of your models.
  • Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..

Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:

  1. Not understanding the structure of the dataset
  2. Not giving proper care during features selection
  3. Leaving out categorical features and considering just numerical variables
  4. Falling into dummy variable trap
  5. Selection of inefficient machine learning algorithm
  6. Not trying out various ML algorithms for building the model based on structure of data.
  7. Improper tuning of model parameters
  8. Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
  9. Read more here…

[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]

Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.

However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….

The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.

Thus, the data science life-cycle can include the following steps:

  1. Business requirement understanding.
  2. Data collection.
  3. Data cleaning.
  4. Data analysis.
  5. Modeling.
  6. Performance evaluation.
  7. Communicating with stakeholders.
  8. Deployment.
  9. Real-world testing.
  10. Business buy-in.
  11. Support and maintenance.

Looks neat, but here is the scheme to visualize how it is happening in reality:

Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.

Read more here….

 

Top

[appbox appstore 1611045854-iphone screenshots]

[appbox microsoftstore  9n8rl80hvm4t-mobile screenshots]

Machine Learning Q&A -Part II:

 
 
 

At a high level, these skills are a combination of software and data engineering.

The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

  • Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
  • Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
  • Model versioning: add a hash key to your different models. You will thank me later.
  • Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
  • Monitor performances: execution time and statistical scores of your models.
  • Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..

Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:

  1. Not understanding the structure of the dataset
  2. Not giving proper care during features selection
  3. Leaving out categorical features and considering just numerical variables
  4. Falling into dummy variable trap
  5. Selection of inefficient machine learning algorithm
  6. Not trying out various ML algorithms for building the model based on structure of data.
  7. Improper tuning of model parameters
  8. Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
  9. Read more here…

Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.

However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….

The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.

Thus, the data science life-cycle can include the following steps:

  1. Business requirement understanding.
  2. Data collection.
  3. Data cleaning.
  4. Data analysis.
  5. Modeling.
  6. Performance evaluation.
  7. Communicating with stakeholders.
  8. Deployment.
  9. Real-world testing.
  10. Business buy-in.
  11. Support and maintenance.

Looks neat, but here is the scheme to visualize how it is happening in reality:

Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.

Read more here….

 

Top

 

AWS machine Learning Specialty Exam Prep MLS-C01

iOs: https://apps.apple.com/ca/app/aws-machine-learning-prep-pro/id1611045854

Windows: https://www.microsoft.com/en-ca/p/aws-machine-learning-mls-c01-specialty-certification-exam-prep/9n8rl80hvm4t

Android/Amazon: https://www.amazon.com/gp/product/B09TZ4H8V6

AWS MLS-C01 Machine Learning Exam Prep

Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A

Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.

Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.

The App provides hundreds of quizzes and practice exam about:

– Machine Learning Operation on AWS

– Modelling

– Data Engineering

– Computer Vision,

– Exploratory Data Analysis,

– ML implementation & Operations

– Machine Learning Basics Questions and Answers

– Machine Learning Advanced Questions and Answers

– Scorecard

– Countdown timer

– Machine Learning Cheat Sheets

– Machine Learning Interview Questions and Answers

– Machine Learning Latest News

The App covers Machine Learning Basics and Advanced topics including: NLP, Computer Vision, Python, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.

Domain 1: Data Engineering

Create data repositories for machine learning.

Identify data sources (e.g., content and location, primary sources such as user data)

Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)

Identify and implement a data ingestion solution.

Data job styles/types (batch load, streaming)

Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.

Domain 2: Exploratory Data Analysis

Sanitize and prepare data for modeling.

Perform feature engineering.

Analyze and visualize data for machine learning.

Domain 3: Modeling

Frame business problems as machine learning problems.

Select the appropriate model(s) for a given machine learning problem.

Train machine learning models.

Perform hyperparameter optimization.

Evaluate machine learning models.

Domain 4: Machine Learning Implementation and Operations

Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.

Recommend and implement the appropriate machine learning services and features for a given problem.

Apply basic AWS security practices to machine learning solutions.

Deploy and operationalize machine learning solutions.

Machine Learning Services covered:

Amazon Comprehend

AWS Deep Learning AMIs (DLAMI)

AWS DeepLens

Amazon Forecast

Amazon Fraud Detector

Amazon Lex

Amazon Polly

Amazon Rekognition

Amazon SageMaker

Amazon Textract

Amazon Transcribe

Amazon Translate

Other Services and topics covered are:

Ingestion/Collection

Processing/ETL

Data analysis/visualization

Model training

Model deployment/inference

Operational

AWS ML application services

Language relevant to ML (for example, Python, Java, Scala, R, SQL)

Notebooks and integrated development environments (IDEs),

S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue, SageMaker, CSV, JSON, IMG, parquet or databases, Amazon Athena

Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service , Amazon Redshift

Sagemaker API Explained:

SageMaker API

AWS Certified Machine Learning Engineer Specialty Questions and Answers:

Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.

Answer1: Use the Build Your Own Container (BYOC) Amazon Sagemaker option.
Create a new docker container with the existing code. Register the container in Amazon Elastic Container registry. with the existing code. Register the container in Amazon Elastic Container Registry. Finally run the training and inference jobs using this container.

Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?

 

Answer2: Amazon Sagemaker Notebook instances

Amazon SageMaker enables developers and data scientists to build, train, tune, and deploy machine learning (ML) models at scale. You can deploy trained ML models for real-time or batch predictions on unseen data, a process known as inference. However, in most cases, the raw input data must be preprocessed and can’t be used directly for making predictions. This is because most ML models expect the data in a predefined format, so the raw data needs to be first cleaned and formatted in order for the ML model to process the data.  You can use the Amazon SageMaker built-in Scikit-learn library for preprocessing input data and then use the Amazon SageMaker built-in Linear Learner algorithm for predictions.

Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?

Answer3: LifeCycle Configuration

Question4: How to Choose the right Sagemaker built-in algorithm?

How to chose the right built in algorithm in SageMaker?
How to chose the right built in algorithm in SageMaker?

 

Guide to choosing the right unsupervised learning algorithm
Guide to choosing the right unsupervised learning algorithm
Choosing the right  ML algorithm based on Data Type
Choosing the right ML algorithm based on Data Type
Choosing the right ML algo based on data type
Choosing the right ML algo based on data type

This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have. 

 

Top

Top 10 Google Professional Machine Learning Engineer Sample Questions

Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?

A. Use K-fold cross validation to understand how the model performs on different test datasets.

B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.

C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.

D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.

Answer 1)

B

Notes 1)

B is correct because it identifies the pixel of the input image that leads to the classification of the image itself.

Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?

A. Train the model for a few iterations, and check for NaN values.
B. Train the model for a few iterations, and verify that the loss is constant.
C. Train a simple linear model, and determine if the DNN model outperforms it.
D. Train the model with no regularization, and verify that the loss function is close to zero.
 

Answer 2)

D

Notes 2)

D is correct because the test can check that the model has enough parameters to memorize the task.

[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]

Question 3: Your team is using a TensorFlow Inception-v3 CNN model pretrained on ImageNet for an image classification prediction challenge on 10,000 images. You will use AI Platform to perform the model training. What TensorFlow distribution strategy and AI Platform training job configuration should you use to train the model and optimize for wall-clock time?

 

A. Default Strategy; Custom tier with a single master node and four v100 GPUs.
B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.
C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.
D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.
 

Answer 3)

D

Notes 3)

D is correct because this is the only strategy that can perform distributed training; albeit there is only a single copy of the variables on the CPU host.

Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?

 
A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version
 
B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version
 
C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment
D. Create a new AI Platform model version – > Deploy model in test environment -> Validate model
 
Answer 4)
A
 
Notes 4)
A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.
 
Question 5: You work for a maintenance company and have built and trained a deep learning model that identifies defects based on thermal images of underground electric cables. Your dataset contains 10,000 images, 100 of which contain visible defects. How should you evaluate the performance of the model on a test dataset?
 
A. Calculate the Area Under the Curve (AUC) value.
 
B. Calculate the number of true positive results predicted by the model.
C. Calculate the fraction of images predicted by the model to have a visible defect.
D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.
 
Answer 5)
A
 
Notes 5)
A is correct because it is scale-invariant. AUC measures how well predictions are ranked, rather than their absolute values. AUC is also classification-threshold invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.
 
Question 6: You work for a manufacturing company that owns a high-value machine which has several machine settings and multiple sensors. A history of the machine’s hourly sensor readings and known failure event data are stored in BigQuery. You need to predict if the machine will fail within the next 3 days in order to schedule maintenance before the machine fails. Which data preparation and model training steps should you take?

 

A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML
 
B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False
D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
Answer 6)
D
 
Notes 6)
D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
 
 
Question 7: You are an ML engineer at a media company. You need to build an ML model to analyze video content frame-by-frame, identify objects, and alert users if there is inappropriate content. Which Google Cloud products should you use to build this project?

 

A. Pub/Sub, Cloud Function, Cloud Vision API
 
B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging
C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging
D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging
 
Answer 7)
C
 
Notes 7)
C is correct as Video Intelligence API can find inappropriate components and other components satisfy the requirements of real-time processing and notification.
 
Question 8: You work for a large retailer. You want to use ML to forecast future sales leveraging 10 years of historical sales data. The historical data is stored in Cloud Storage in Avro format. You want to rapidly experiment with all the available data. How should you build and train your model for the sales forecast?
 
A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.
B. Convert the data into CSV format and create a regression model on AutoML Tables.
C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.
D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.
 
Answer 8)
A
 
Notes 8)
A is correct because BigQuery ML is designed for fast and rapid experimentation and it is possible to use federated queries to read data directly from Cloud Storage. Moreover, ARIMA is considered one of the best in class for time series forecasting.
 
Question 9) You need to build an object detection model for a small startup company to identify if and where the company’s logo appears in an image. You were given a large repository of images, some with logos and some without. These images are not yet labelled. You need to label these pictures, and then train and deploy the model. What should you do?

 

A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.
B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.
 
C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.
D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.
 
Answer 9)
A
 
Notes 9)
A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.
 

Question 10) You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer’s stated intention for contacting customer service. About 70% of customer inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer and more complicated requests. Which intents should you automate first?

A. Automate a blend of the shortest and longest intents to be representative of all intents.
B. Automate the more complicated requests first because those require more of the agents’ time.
C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.
 
D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.
Answer 10)
C
 
Notes 10)

[appbox appstore 1611045854-iphone screenshots]

[appbox microsoftstore  9n8rl80hvm4t-mobile screenshots]

Machine Learning Q&A Part I:

Google.

Azure and AWS are second class citizens in this area.

Sure, AWS has 70% of the market.

Sure, Azure is the easiest turn key and super user friendly.

But, the king of machine learning in the cloud is GCP.

GCP = Google Cloud Platform

Google has the largest data science team in the world, not mention they have Hinton.

Let’s forgot for a minute they created TensorFlow and give it away.

Let’s just talk about building a real world model with data that doesn’t fit into a excel spreadsheet.

The vast majority of applied machine learning is supervised and that means we need data.

Not just normal data, we need very clean highly structured data.

Where’s the easiest place in the world to upload and model a Petabyte of structured dataBigQuery of course.

Why BigQuery? I don’t have to do anything but upload my data. No spinning up RedShit clusters or whatever I have to do in Azure, just upload and massage data with my familiar SQL. If I do have to wrangle my data it won’t take my six months to update 5 rows here, minutes usually.

Then, you’ll need a front end. Cloud datalab is a Jupyter notebook, which is good because I don’t want nor do I need anything else.

Then, with a single line of code I connect by datalab (Jupyter) notebook to my data in BigQuery and build away.

I’ve worked in all three and the only thing I care about is getting to my job the fastest and right now that means I build my models in GCP.

If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.

The course below is free to the first 20.

The Complete Python Course for Machine Learning Engineers

Here, I want to share the best research paper on Machine Learning classification methods, titled ‘Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?’, published in the ‘Journal of Machine Learning Research’.

This paper nicely explained 179 classification techniques and applied them on 121 data sets thus sharing small summary of the paper:

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

 
 
 

The paper evaluated 179 classifiers arising from 17 ML families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R ( with and without the caret package), C and Matlab, including all the relevant classifiers available today.

Experiments used total 121 data sets , which represent the whole UCI data base (excluding the large-scale problems) and other own real problems, in order to achieve significant conclusions about the classifier behaviour, not dependent on the data set collection.

The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz

The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets. However, the difference is not statistically significant with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package).

The random forest is clearly the best family of classifiers (3 out of 5 bests classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).

You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt

I hope it will be helpful for Statistic and Machine Leaning aspirants!

Thank you!

 
 
 

At a high level, these skills are a combination of software and data engineering.

The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

  • Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
  • Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
  • Model versioning: add a hash key to your different models. You will thank me later.
  • Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
  • Monitor performances: execution time and statistical scores of your models.
  • Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..

Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:

  1. Not understanding the structure of the dataset
  2. Not giving proper care during features selection
  3. Leaving out categorical features and considering just numerical variables
  4. Falling into dummy variable trap
  5. Selection of inefficient machine learning algorithm
  6. Not trying out various ML algorithms for building the model based on structure of data.
  7. Improper tuning of model parameters
  8. Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
  9. Read more here…

[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]

Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.

However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….

The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.

Thus, the data science life-cycle can include the following steps:

  1. Business requirement understanding.
  2. Data collection.
  3. Data cleaning.
  4. Data analysis.
  5. Modeling.
  6. Performance evaluation.
  7. Communicating with stakeholders.
  8. Deployment.
  9. Real-world testing.
  10. Business buy-in.
  11. Support and maintenance.

Looks neat, but here is the scheme to visualize how it is happening in reality:

Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.

Read more here….

 

Top

[appbox appstore 1611045854-iphone screenshots]

[appbox microsoftstore  9n8rl80hvm4t-mobile screenshots]

Machine Learning Q&A -Part II:

 
 
 

At a high level, these skills are a combination of software and data engineering.

The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

  • Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
  • Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
  • Model versioning: add a hash key to your different models. You will thank me later.
  • Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
  • Monitor performances: execution time and statistical scores of your models.
  • Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..

Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:

  1. Not understanding the structure of the dataset
  2. Not giving proper care during features selection
  3. Leaving out categorical features and considering just numerical variables
  4. Falling into dummy variable trap
  5. Selection of inefficient machine learning algorithm
  6. Not trying out various ML algorithms for building the model based on structure of data.
  7. Improper tuning of model parameters
  8. Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
  9. Read more here…

Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.

However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….

The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.

Thus, the data science life-cycle can include the following steps:

  1. Business requirement understanding.
  2. Data collection.
  3. Data cleaning.
  4. Data analysis.
  5. Modeling.
  6. Performance evaluation.
  7. Communicating with stakeholders.
  8. Deployment.
  9. Real-world testing.
  10. Business buy-in.
  11. Support and maintenance.

Looks neat, but here is the scheme to visualize how it is happening in reality:

Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.

Read more here….

 

Top

AWS machine Learning Specialty Exam Prep MLS-C01

iOs: https://apps.apple.com/ca/app/aws-machine-learning-prep-pro/id1611045854

Windows: https://www.microsoft.com/en-ca/p/aws-machine-learning-mls-c01-specialty-certification-exam-prep/9n8rl80hvm4t

Android/Amazon: https://www.amazon.com/gp/product/B09TZ4H8V6

AWS MLS-C01 Machine Learning Exam Prep

Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A

Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.

Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.

The App provides hundreds of quizzes and practice exam about:

– Machine Learning Operation on AWS

– Modelling

– Data Engineering

– Computer Vision,

– Exploratory Data Analysis,

– ML implementation & Operations

– Machine Learning Basics Questions and Answers

– Machine Learning Advanced Questions and Answers

– Scorecard

– Countdown timer

– Machine Learning Cheat Sheets

– Machine Learning Interview Questions and Answers

– Machine Learning Latest News

The App covers Machine Learning Basics and Advanced topics including: NLP, Computer Vision, Python, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.

Domain 1: Data Engineering

Create data repositories for machine learning.

Identify data sources (e.g., content and location, primary sources such as user data)

Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)

Identify and implement a data ingestion solution.

Data job styles/types (batch load, streaming)

Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.

Domain 2: Exploratory Data Analysis

Sanitize and prepare data for modeling.

Perform feature engineering.

Analyze and visualize data for machine learning.

Domain 3: Modeling

Frame business problems as machine learning problems.

Select the appropriate model(s) for a given machine learning problem.

Train machine learning models.

Perform hyperparameter optimization.

Evaluate machine learning models.

Domain 4: Machine Learning Implementation and Operations

Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.

Recommend and implement the appropriate machine learning services and features for a given problem.

Apply basic AWS security practices to machine learning solutions.

Deploy and operationalize machine learning solutions.

Machine Learning Services covered:

Amazon Comprehend

AWS Deep Learning AMIs (DLAMI)

AWS DeepLens

Amazon Forecast

Amazon Fraud Detector

Amazon Lex

Amazon Polly

Amazon Rekognition

Amazon SageMaker

Amazon Textract

Amazon Transcribe

Amazon Translate

Other Services and topics covered are:

Ingestion/Collection

Processing/ETL

Data analysis/visualization

Model training

Model deployment/inference

Operational

AWS ML application services

Language relevant to ML (for example, Python, Java, Scala, R, SQL)

Notebooks and integrated development environments (IDEs),

S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue, SageMaker, CSV, JSON, IMG, parquet or databases, Amazon Athena

Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service , Amazon Redshift

Sagemaker API Explained:

SageMaker API

AWS Certified Machine Learning Engineer Specialty Questions and Answers:

Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.

Answer1: Use the Build Your Own Container (BYOC) Amazon Sagemaker option.
Create a new docker container with the existing code. Register the container in Amazon Elastic Container registry. with the existing code. Register the container in Amazon Elastic Container Registry. Finally run the training and inference jobs using this container.

Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?

 

Answer2: Amazon Sagemaker Notebook instances

Amazon SageMaker enables developers and data scientists to build, train, tune, and deploy machine learning (ML) models at scale. You can deploy trained ML models for real-time or batch predictions on unseen data, a process known as inference. However, in most cases, the raw input data must be preprocessed and can’t be used directly for making predictions. This is because most ML models expect the data in a predefined format, so the raw data needs to be first cleaned and formatted in order for the ML model to process the data.  You can use the Amazon SageMaker built-in Scikit-learn library for preprocessing input data and then use the Amazon SageMaker built-in Linear Learner algorithm for predictions.

Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?

Answer3: LifeCycle Configuration

Question4: How to Choose the right Sagemaker built-in algorithm?

How to chose the right built in algorithm in SageMaker?
How to chose the right built in algorithm in SageMaker?

 

Guide to choosing the right unsupervised learning algorithm
Guide to choosing the right unsupervised learning algorithm
Choosing the right  ML algorithm based on Data Type
Choosing the right ML algorithm based on Data Type
Choosing the right ML algo based on data type
Choosing the right ML algo based on data type

This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have. 

 

Top

Top 10 Google Professional Machine Learning Engineer Sample Questions

Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?

A. Use K-fold cross validation to understand how the model performs on different test datasets.

B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.

C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.

D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.

Answer 1)

B

Notes 1)

B is correct because it identifies the pixel of the input image that leads to the classification of the image itself.

Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?

A. Train the model for a few iterations, and check for NaN values.
B. Train the model for a few iterations, and verify that the loss is constant.
C. Train a simple linear model, and determine if the DNN model outperforms it.
D. Train the model with no regularization, and verify that the loss function is close to zero.
 

Answer 2)

D

Notes 2)

D is correct because the test can check that the model has enough parameters to memorize the task.

[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]

Question 3: Your team is using a TensorFlow Inception-v3 CNN model pretrained on ImageNet for an image classification prediction challenge on 10,000 images. You will use AI Platform to perform the model training. What TensorFlow distribution strategy and AI Platform training job configuration should you use to train the model and optimize for wall-clock time?

 

A. Default Strategy; Custom tier with a single master node and four v100 GPUs.
B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.
C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.
D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.
 

Answer 3)

D

Notes 3)

D is correct because this is the only strategy that can perform distributed training; albeit there is only a single copy of the variables on the CPU host.

Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?

 
A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version
 
B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version
 
C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment
D. Create a new AI Platform model version – > Deploy model in test environment -> Validate model
 
Answer 4)
A
 
Notes 4)
A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.
 
Question 5: You work for a maintenance company and have built and trained a deep learning model that identifies defects based on thermal images of underground electric cables. Your dataset contains 10,000 images, 100 of which contain visible defects. How should you evaluate the performance of the model on a test dataset?
 
A. Calculate the Area Under the Curve (AUC) value.
 
B. Calculate the number of true positive results predicted by the model.
C. Calculate the fraction of images predicted by the model to have a visible defect.
D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.
 
Answer 5)
A
 
Notes 5)
A is correct because it is scale-invariant. AUC measures how well predictions are ranked, rather than their absolute values. AUC is also classification-threshold invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.
 
Question 6: You work for a manufacturing company that owns a high-value machine which has several machine settings and multiple sensors. A history of the machine’s hourly sensor readings and known failure event data are stored in BigQuery. You need to predict if the machine will fail within the next 3 days in order to schedule maintenance before the machine fails. Which data preparation and model training steps should you take?

 

A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML
 
B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False
D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
Answer 6)
D
 
Notes 6)
D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
 
 
Question 7: You are an ML engineer at a media company. You need to build an ML model to analyze video content frame-by-frame, identify objects, and alert users if there is inappropriate content. Which Google Cloud products should you use to build this project?

 

A. Pub/Sub, Cloud Function, Cloud Vision API
 
B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging
C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging
D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging
 
Answer 7)
C
 
Notes 7)
C is correct as Video Intelligence API can find inappropriate components and other components satisfy the requirements of real-time processing and notification.
 
Question 8: You work for a large retailer. You want to use ML to forecast future sales leveraging 10 years of historical sales data. The historical data is stored in Cloud Storage in Avro format. You want to rapidly experiment with all the available data. How should you build and train your model for the sales forecast?
 
A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.
B. Convert the data into CSV format and create a regression model on AutoML Tables.
C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.
D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.
 
Answer 8)
A
 
Notes 8)
A is correct because BigQuery ML is designed for fast and rapid experimentation and it is possible to use federated queries to read data directly from Cloud Storage. Moreover, ARIMA is considered one of the best in class for time series forecasting.
 
Question 9) You need to build an object detection model for a small startup company to identify if and where the company’s logo appears in an image. You were given a large repository of images, some with logos and some without. These images are not yet labelled. You need to label these pictures, and then train and deploy the model. What should you do?

 

A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.
B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.
 
C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.
D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.
 
Answer 9)
A
 
Notes 9)
A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.
 

Question 10) You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer’s stated intention for contacting customer service. About 70% of customer inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer and more complicated requests. Which intents should you automate first?

A. Automate a blend of the shortest and longest intents to be representative of all intents.
B. Automate the more complicated requests first because those require more of the agents’ time.
C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.
 
D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.
Answer 10)
C
 
Notes 10)

[appbox appstore 1611045854-iphone screenshots]

[appbox microsoftstore  9n8rl80hvm4t-mobile screenshots]

Machine Learning Q&A Part I:

Google.

Azure and AWS are second class citizens in this area.

Sure, AWS has 70% of the market.

Sure, Azure is the easiest turn key and super user friendly.

But, the king of machine learning in the cloud is GCP.

GCP = Google Cloud Platform

Google has the largest data science team in the world, not mention they have Hinton.

Let’s forgot for a minute they created TensorFlow and give it away.

Let’s just talk about building a real world model with data that doesn’t fit into a excel spreadsheet.

The vast majority of applied machine learning is supervised and that means we need data.

Not just normal data, we need very clean highly structured data.

Where’s the easiest place in the world to upload and model a Petabyte of structured dataBigQuery of course.

Why BigQuery? I don’t have to do anything but upload my data. No spinning up RedShit clusters or whatever I have to do in Azure, just upload and massage data with my familiar SQL. If I do have to wrangle my data it won’t take my six months to update 5 rows here, minutes usually.

Then, you’ll need a front end. Cloud datalab is a Jupyter notebook, which is good because I don’t want nor do I need anything else.

Then, with a single line of code I connect by datalab (Jupyter) notebook to my data in BigQuery and build away.

I’ve worked in all three and the only thing I care about is getting to my job the fastest and right now that means I build my models in GCP.

If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.

The course below is free to the first 20.

The Complete Python Course for Machine Learning Engineers

Here, I want to share the best research paper on Machine Learning classification methods, titled ‘Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?’, published in the ‘Journal of Machine Learning Research’.

This paper nicely explained 179 classification techniques and applied them on 121 data sets thus sharing small summary of the paper:

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

 
 
 

The paper evaluated 179 classifiers arising from 17 ML families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R ( with and without the caret package), C and Matlab, including all the relevant classifiers available today.

Experiments used total 121 data sets , which represent the whole UCI data base (excluding the large-scale problems) and other own real problems, in order to achieve significant conclusions about the classifier behaviour, not dependent on the data set collection.

The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz

The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets. However, the difference is not statistically significant with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package).

The random forest is clearly the best family of classifiers (3 out of 5 bests classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).

You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt

I hope it will be helpful for Statistic and Machine Leaning aspirants!

Thank you!

 
 
 

At a high level, these skills are a combination of software and data engineering.

The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

  • Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
  • Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
  • Model versioning: add a hash key to your different models. You will thank me later.
  • Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
  • Monitor performances: execution time and statistical scores of your models.
  • Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..

Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:

  1. Not understanding the structure of the dataset
  2. Not giving proper care during features selection
  3. Leaving out categorical features and considering just numerical variables
  4. Falling into dummy variable trap
  5. Selection of inefficient machine learning algorithm
  6. Not trying out various ML algorithms for building the model based on structure of data.
  7. Improper tuning of model parameters
  8. Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
  9. Read more here…

[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]

Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.

However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….

The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.

Thus, the data science life-cycle can include the following steps:

  1. Business requirement understanding.
  2. Data collection.
  3. Data cleaning.
  4. Data analysis.
  5. Modeling.
  6. Performance evaluation.
  7. Communicating with stakeholders.
  8. Deployment.
  9. Real-world testing.
  10. Business buy-in.
  11. Support and maintenance.

Looks neat, but here is the scheme to visualize how it is happening in reality:

Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.

Read more here….

 

Top

[appbox appstore 1611045854-iphone screenshots]

[appbox microsoftstore  9n8rl80hvm4t-mobile screenshots]

Machine Learning Q&A -Part II:

 
 
 

At a high level, these skills are a combination of software and data engineering.

The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

  • Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
  • Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
  • Model versioning: add a hash key to your different models. You will thank me later.
  • Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
  • Monitor performances: execution time and statistical scores of your models.
  • Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..

Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:

  1. Not understanding the structure of the dataset
  2. Not giving proper care during features selection
  3. Leaving out categorical features and considering just numerical variables
  4. Falling into dummy variable trap
  5. Selection of inefficient machine learning algorithm
  6. Not trying out various ML algorithms for building the model based on structure of data.
  7. Improper tuning of model parameters
  8. Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
  9. Read more here…

Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.

However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….

The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.

Thus, the data science life-cycle can include the following steps:

  1. Business requirement understanding.
  2. Data collection.
  3. Data cleaning.
  4. Data analysis.
  5. Modeling.
  6. Performance evaluation.
  7. Communicating with stakeholders.
  8. Deployment.
  9. Real-world testing.
  10. Business buy-in.
  11. Support and maintenance.

Looks neat, but here is the scheme to visualize how it is happening in reality:

Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.

Read more here….

 

Top

 

AWS machine Learning Specialty Exam Prep MLS-C01

iOs: https://apps.apple.com/ca/app/aws-machine-learning-prep-pro/id1611045854

Windows: https://www.microsoft.com/en-ca/p/aws-machine-learning-mls-c01-specialty-certification-exam-prep/9n8rl80hvm4t

Android/Amazon: https://www.amazon.com/gp/product/B09TZ4H8V6

AWS MLS-C01 Machine Learning Exam Prep

Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A

Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.

Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.

The App provides hundreds of quizzes and practice exam about:

– Machine Learning Operation on AWS

– Modelling

– Data Engineering

– Computer Vision,

– Exploratory Data Analysis,

– ML implementation & Operations

– Machine Learning Basics Questions and Answers

– Machine Learning Advanced Questions and Answers

– Scorecard

– Countdown timer

– Machine Learning Cheat Sheets

– Machine Learning Interview Questions and Answers

– Machine Learning Latest News

The App covers Machine Learning Basics and Advanced topics including: NLP, Computer Vision, Python, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.

Domain 1: Data Engineering

Create data repositories for machine learning.

Identify data sources (e.g., content and location, primary sources such as user data)

Determine storage mediums (e.g., DB, Data Lake, S3, EFS, EBS)

Identify and implement a data ingestion solution.

Data job styles/types (batch load, streaming)

Data ingestion pipelines (Batch-based ML workloads and streaming-based ML workloads), etc.

Domain 2: Exploratory Data Analysis

Sanitize and prepare data for modeling.

Perform feature engineering.

Analyze and visualize data for machine learning.

Domain 3: Modeling

Frame business problems as machine learning problems.

Select the appropriate model(s) for a given machine learning problem.

Train machine learning models.

Perform hyperparameter optimization.

Evaluate machine learning models.

Domain 4: Machine Learning Implementation and Operations

Build machine learning solutions for performance, availability, scalability, resiliency, and fault tolerance.

Recommend and implement the appropriate machine learning services and features for a given problem.

Apply basic AWS security practices to machine learning solutions.

Deploy and operationalize machine learning solutions.

Machine Learning Services covered:

Amazon Comprehend

AWS Deep Learning AMIs (DLAMI)

AWS DeepLens

Amazon Forecast

Amazon Fraud Detector

Amazon Lex

Amazon Polly

Amazon Rekognition

Amazon SageMaker

Amazon Textract

Amazon Transcribe

Amazon Translate

Other Services and topics covered are:

Ingestion/Collection

Processing/ETL

Data analysis/visualization

Model training

Model deployment/inference

Operational

AWS ML application services

Language relevant to ML (for example, Python, Java, Scala, R, SQL)

Notebooks and integrated development environments (IDEs),

S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue, SageMaker, CSV, JSON, IMG, parquet or databases, Amazon Athena

Amazon EC2, Amazon Elastic Container Registry (Amazon ECR), Amazon Elastic Container Service, Amazon Elastic Kubernetes Service , Amazon Redshift

Sagemaker API Explained:

SageMaker API

AWS Certified Machine Learning Engineer Specialty Questions and Answers:

Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.

Answer1: Use the Build Your Own Container (BYOC) Amazon Sagemaker option.
Create a new docker container with the existing code. Register the container in Amazon Elastic Container registry. with the existing code. Register the container in Amazon Elastic Container Registry. Finally run the training and inference jobs using this container.

Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?

 

Answer2: Amazon Sagemaker Notebook instances

Amazon SageMaker enables developers and data scientists to build, train, tune, and deploy machine learning (ML) models at scale. You can deploy trained ML models for real-time or batch predictions on unseen data, a process known as inference. However, in most cases, the raw input data must be preprocessed and can’t be used directly for making predictions. This is because most ML models expect the data in a predefined format, so the raw data needs to be first cleaned and formatted in order for the ML model to process the data.  You can use the Amazon SageMaker built-in Scikit-learn library for preprocessing input data and then use the Amazon SageMaker built-in Linear Learner algorithm for predictions.

Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?

Answer3: LifeCycle Configuration

Question4: How to Choose the right Sagemaker built-in algorithm?

How to chose the right built in algorithm in SageMaker?
How to chose the right built in algorithm in SageMaker?

 

Guide to choosing the right unsupervised learning algorithm
Guide to choosing the right unsupervised learning algorithm
Choosing the right  ML algorithm based on Data Type
Choosing the right ML algorithm based on Data Type
Choosing the right ML algo based on data type
Choosing the right ML algo based on data type

This is a general guide for choosing which algorithm to use depending on what business problem you have and what data you have. 

 

Top

Top 10 Google Professional Machine Learning Engineer Sample Questions

Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?

A. Use K-fold cross validation to understand how the model performs on different test datasets.

B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.

C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.

D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.

Answer 1)

B

Notes 1)

B is correct because it identifies the pixel of the input image that leads to the classification of the image itself.

Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?

A. Train the model for a few iterations, and check for NaN values.
B. Train the model for a few iterations, and verify that the loss is constant.
C. Train a simple linear model, and determine if the DNN model outperforms it.
D. Train the model with no regularization, and verify that the loss function is close to zero.
 

Answer 2)

D

Notes 2)

D is correct because the test can check that the model has enough parameters to memorize the task.

[appbox appstore 1560083470-iphone screenshots]
[appbox googleplay com.awssolutionarchitectassociateexampreppro.app]

Question 3: Your team is using a TensorFlow Inception-v3 CNN model pretrained on ImageNet for an image classification prediction challenge on 10,000 images. You will use AI Platform to perform the model training. What TensorFlow distribution strategy and AI Platform training job configuration should you use to train the model and optimize for wall-clock time?

 

A. Default Strategy; Custom tier with a single master node and four v100 GPUs.
B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.
C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.
D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.
 

Answer 3)

D

Notes 3)

D is correct because this is the only strategy that can perform distributed training; albeit there is only a single copy of the variables on the CPU host.

Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?

 
A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version
 
B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version
 
C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment
D. Create a new AI Platform model version – > Deploy model in test environment -> Validate model
 
Answer 4)
A
 
Notes 4)
A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.
 
Question 5: You work for a maintenance company and have built and trained a deep learning model that identifies defects based on thermal images of underground electric cables. Your dataset contains 10,000 images, 100 of which contain visible defects. How should you evaluate the performance of the model on a test dataset?
 
A. Calculate the Area Under the Curve (AUC) value.
 
B. Calculate the number of true positive results predicted by the model.
C. Calculate the fraction of images predicted by the model to have a visible defect.
D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.
 
Answer 5)
A
 
Notes 5)
A is correct because it is scale-invariant. AUC measures how well predictions are ranked, rather than their absolute values. AUC is also classification-threshold invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.
 
Question 6: You work for a manufacturing company that owns a high-value machine which has several machine settings and multiple sensors. A history of the machine’s hourly sensor readings and known failure event data are stored in BigQuery. You need to predict if the machine will fail within the next 3 days in order to schedule maintenance before the machine fails. Which data preparation and model training steps should you take?

 

A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML
 
B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False
D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
Answer 6)
D
 
Notes 6)
D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
 
 
Question 7: You are an ML engineer at a media company. You need to build an ML model to analyze video content frame-by-frame, identify objects, and alert user