The typical Google Machine Learning Engineer salary is $147,218. Machine Learning Engineer salaries at Google can range from $110,000 – $152,183.
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
By the end of 2020, 85% of customer interactions will be handled without a human (Call Center, Chatbot, etc…)
61% of marketers say artificial intelligence is the most important aspect of their data strategy.
80% of business and tech leaders say AI already boosts productivity (Robotic Process Automation, Power Automate, etc..)
Current AI technology can boost business productivity by up to 40%
A Professional Machine Learning Engineer designs, builds, and productionizes ML models to solve business challenges using Google Cloud technologies and knowledge of proven ML models and techniques. The ML Engineer collaborates closely with other job roles to ensure long-term success of models. The ML Engineer should be proficient in all aspects of model architecture, data pipeline interaction, and metrics interpretation. The ML Engineer needs familiarity with application development, infrastructure management, data engineering, and security. Through an understanding of training, retraining, deploying, scheduling, monitoring, and improving models, they design and create scalable solutions for optimal performance.
The AWS Certified Machine Learning – Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate’s ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.
Question1: A machine learning team has several large CSV datasets in Amazon S3. Historically, models built with the Amazon SageMaker Linear Learner algorithm have taken hours to train on similar-sized datasets. The team’s leaders need to accelerate the training process. What can a machine learning specialist do to address this concern?
A) Use Amazon SageMaker Pipe mode. B) Use Amazon Machine Learning to train the models. C) Use Amazon Kinesis to stream the data to Amazon SageMaker. D) Use AWS Glue to transform the CSV dataset to the JSON format. ANSWER1:
Amazon SageMaker Pipe mode streams the data directly to the container, which improves the performance of training jobs. (Refer to this link for supporting information.) In Pipe mode, your training job streams data directly from Amazon S3. Streaming can provide faster start times for training jobs and better throughput. With Pipe mode, you also reduce the size of the Amazon EBS volumes for your training instances. B would not apply in this scenario. C is a streaming ingestion solution, but is not applicable in this scenario. D transforms the data structure.
Question 2) A local university wants to track cars in a parking lot to determine which students are parking in the lot. The university is wanting to ingest videos of the cars parking in near-real time, use machine learning to identify license plates, and store that data in an AWS data store. Which solution meets these requirements with the LEAST amount of development effort?
A) Use Amazon Kinesis Data Streams to ingest the video in near-real time, use the Kinesis Data Streams consumer integrated with Amazon Rekognition Video to process the license plate information, and then store results in DynamoDB.
B) Use Amazon Kinesis Video Streams to ingest the videos in near-real time, use the Kinesis Video Streams integration with Amazon Rekognition Video to identify the license plate information, and then store the results in DynamoDB.
C) Use Amazon Kinesis Data Streams to ingest videos in near-real time, call Amazon Rekognition to identify license plate information, and then store results in DynamoDB.
D) Use Amazon Kinesis Firehose to ingest the video in near-real time and outputs results onto S3. Set up a Lambda function that triggers when a new video is PUT onto S3 to send results to Amazon Rekognition to identify license plate information, and then store results in DynamoDB.
Kinesis Video Streams is used to stream videos in near-real time. Amazon Rekognition Video uses Amazon Kinesis Video Streams to receive and process a video stream. After the videos have been processed by Rekognition we can output the results in DynamoDB.
Question 3) A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences:
1. Please call the number below.
2. Please do not call us. What are the dimensions of the tf–idf matrix?
A) (2, 16)
B) (2, 8)
C) (2, 10)
D) (8, 10)
There are 2 sentences, 8 unique unigrams, and 8 unique bigrams, so the result would be (2,16). The phrases are “Please call the number below” and “Please do not call us.” Each word individually (unigram) is “Please,” “call,” ”the,” ”number,” “below,” “do,” “not,” and “us.” The unique bigrams are “Please call,” “call the,” ”the number,” “number below,” “Please do,” “do not,” “not call,” and “call us.” The tf–idf vectorizer is described at this link.
Question 4: A company is setting up a system to manage all of the datasets it stores in Amazon S3. The company would like to automate running transformation jobs on the data and maintaining a catalog of the metadata concerning the datasets. The solution should require the least amount of setup and maintenance. Which solution will allow the company to achieve its goals?
A) Create an Amazon EMR cluster with Apache Hive installed. Then, create a Hive metastore and a script to run transformation jobs on a schedule.
B) Create an AWS Glue crawler to populate the AWS Glue Data Catalog. Then, author an AWS Glue ETL job, and set up a schedule for data transformation jobs.
C) Create an Amazon EMR cluster with Apache Spark installed. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule. D) Create an AWS Data Pipeline that transforms the data. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
AWS Glue is the correct answer because this option requires the least amount of setup and maintenance since it is serverless, and it does not require management of the infrastructure. Refer to this link for supporting information. A, C, and D are all solutions that can solve the problem, but require more steps for configuration, and require higher operational overhead to run and maintain.
Question 6) A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the data scientist do to improve the training process?
A) Increase the learning rate. Keep the batch size the same.
B) Reduce the batch size. Decrease the learning rate.
C) Keep the batch size the same. Decrease the learning rate.
D) Do not change the learning rate. Increase the batch size.
It is most likely that the loss function is very curvy and has multiple local minima where the training is getting stuck. Decreasing the batch size would help the data scientist stochastically get out of the local minima saddles. Decreasing the learning rate would prevent overshooting the global loss function minimum. Refer to the paper at this link for an explanation. Reference 6) : Here
A) The Kinesis API (AWS SDK) provides greater functionality over the Kinesis Producer Library.
C) The Kinesis Producer Library must be installed as a Java application to use with Kinesis Data Streams.
The KPL must be installed as a Java application before it can be used with your Kinesis Data Streams. There are ways to process KPL serialized data within AWS Lambda, in Java, Node.js, and Python, but not if these answers mentions Lambda.
Question 8) A data scientist is evaluating different binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result. The models should be evaluated based on the following criteria:
1) Must have a recall rate of at least 80%
2) Must have a false positive rate of 10% or less
3) Must minimize business costs After creating each binary classification model, the data scientist generates the corresponding confusion matrix. Which confusion matrix represents the model that satisfies the requirements?
A) TN = 91, FP = 9 FN = 22, TP = 78
B) TN = 99, FP = 1 FN = 21, TP = 79
C) TN = 96, FP = 4 FN = 10, TP = 90
D) TN = 98, FP = 2 FN = 18, TP = 82
The following calculations are required: TP = True Positive FP = False Positive FN = False Negative TN = True Negative FN = False Negative Recall = TP / (TP + FN) False Positive Rate (FPR) = FP / (FP + TN) Cost = 5 * FP + FN A B C D Recall 78 / (78 + 22) = 0.78 79 / (79 + 21) = 0.79 90 / (90 + 10) = 0.9 82 / (82 + 18) = 0.82 False Positive Rate 9 / (9 + 91) = 0.09 1 / (1 + 99) = 0.01 4 / (4 + 96) = 0.04 2 / (2 + 98) = 0.02 Costs 5 * 9 + 22 = 67 5 * 1 + 21 = 26 5 * 4 + 10 = 30 5 * 2 + 18 = 28 Options C and D have a recall greater than 80% and an FPR less than 10%, but D is the most cost effective. For supporting information, refer to this link. Reference 8:Here
Question 9) A data scientist uses logistic regression to build a fraud detection model. While the model accuracy is 99%, 90% of the fraud cases are not detected by the model. What action will definitely help the model detect more than 10% of fraud cases?
A) Using undersampling to balance the dataset
B) Decreasing the class probability threshold
C) Using regularization to reduce overfitting
D) Using oversampling to balance the dataset
Decreasing the class probability threshold makes the model more sensitive and, therefore, marks more cases as the positive class, which is fraud in this case. This will increase the likelihood of fraud detection. However, it comes at the price of lowering precision. This is covered in the Discussion section of the paper at this link Reference 9:Here
Question 10) A company is interested in building a fraud detection model. Currently, the data scientist does not have a sufficient amount of information due to the low number of fraud cases. Which method is MOST likely to detect the GREATEST number of valid fraud cases?
A) Oversampling using bootstrapping
C) Oversampling using SMOTE
D) Class weight adjustment
With datasets that are not fully populated, the Synthetic Minority Over-sampling Technique (SMOTE) adds new information by adding synthetic data points to the minority class. This technique would be the most effective in this scenario. Refer to Section 4.2 at this link for supporting information. Reference 10) : Here
Question 11) A machine learning engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm. The ML engineer notices the target label classes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%. What should the ML engineer do to minimize bias due to missing values?
A) Replace each missing value by the mean or median across non-missing values in same row.
B) Delete observations that contain missing values because these represent less than 5% of the data.
C) Replace each missing value by the mean or median across non-missing values in the same column.
D) For each feature, approximate the missing values using supervised learning based on other features.
Use supervised learning to predict missing values based on the values of other features. Different supervised learning approaches might have different performances, but any properly implemented supervised learning approach should provide the same or better approximation than mean or median approximation, as proposed in responses A and C. Supervised learning applied to the imputation of missing values is an active field of research. Refer to this link for an example. Reference 11): Here
Question 12) A company has collected customer comments on its products, rating them as safe or unsafe, using decision trees. The training dataset has the following features: id, date, full review, full review summary, and a binary safe/unsafe tag. During training, any data sample with missing features was dropped. In a few instances, the test set was found to be missing the full review text field. For this use case, which is the most effective course of action to address test data samples with missing features?
A) Drop the test samples with missing full review text fields, and then run through the test set.
B) Copy the summary text fields and use them to fill in the missing full review text fields, and then run through the test set.
C) Use an algorithm that handles missing data better than decision trees.
D) Generate synthetic data to fill in the fields that are missing data, and then run through the test set.
In this case, a full review summary usually contains the most descriptive phrases of the entire review and is a valid stand-in for the missing full review text field. For supporting information, refer to page 1627 at this link, and this link and this link.
Question 13) An insurance company needs to automate claim compliance reviews because human reviews are expensive and error-prone. The company has a large set of claims and a compliance label for each. Each claim consists of a few sentences in English, many of which contain complex related information. Management would like to use Amazon SageMaker built-in algorithms to design a machine learning supervised model that can be trained to read each claim and predict if the claim is compliant or not. Which approach should be used to extract features from the claims to be used as inputs for the downstream supervised task?
A) Derive a dictionary of tokens from claims in the entire dataset. Apply one-hot encoding to tokens found in each claim of the training set. Send the derived features space as inputs to an Amazon SageMaker builtin supervised learning algorithm.
B) Apply Amazon SageMaker BlazingText in Word2Vec mode to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
C) Apply Amazon SageMaker BlazingText in classification mode to labeled claims in the training set to derive features for the claims that correspond to the compliant and non-compliant labels, respectively.
D) Apply Amazon SageMaker Object2Vec to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
Amazon SageMaker Object2Vec generalizes the Word2Vec embedding technique for words to more complex objects, such as sentences and paragraphs. Since the supervised learning task is at the level of whole claims, for which there are labels, and no labels are available at the word level, Object2Vec needs be used instead of Word2Vec.
Question 14) You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to immediately be processed before operations can continue. The second event type includes data of less importance, but operations can continue without immediately processing. What is the most appropriate solution to record these different types of events?
A) Capture both events with the PutRecords API call.
B) Capture both event types using the Kinesis Producer Library (KPL).
C) Capture the mission critical events with the PutRecords API call and the second event type with the Kinesis Producer Library (KPL).
D) Capture the mission critical events with the Kinesis Producer Library (KPL) and the second event type with the Putrecords API call.
The question is about sending data to Kinesis synchronously vs. asynchronously. PutRecords is a synchronous send function, so it must be used for the first event type (critical events). The Kinesis Producer Library (KPL) implements an asynchronous send function, so it can be used for the second event type. In this scenario, the reason to use the KPL over the PutRecords API call is because: KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable). Larger values of RecordMaxBufferedTime results in higher packing efficiencies and better performance. Applications that cannot tolerate this additional delay may need to use the AWS SDK directly. For more information about using the AWS SDK with Kinesis Data Streams, see Developing Producers Using the Amazon Kinesis Data Streams API with the AWS SDK for Java. For more information about RecordMaxBufferedTime and other user-configurable properties of the KPL, see Configuring the Kinesis Producer Library.
Question 15) You are collecting clickstream data from an e-commerce website to make near-real time product suggestions for users actively using the site. Which combination of tools can be used to achieve the quickest recommendations and meets all of the requirements?
A) Use Kinesis Data Streams to ingest clickstream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
B) Use Kinesis Data Firehose to ingest click stream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions, then use Lambda to load these results into S3.
C) Use Kinesis Data Streams to ingest clickstream data, then use Lambda to process that data and write it to S3. Once the data is on S3, use Athena to query based on conditions that data and make real time recommendations to users.
D) Use the Kinesis Data Analytics to ingest the clickstream data directly and run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
Kinesis Data Analytics gets its input streaming data from Kinesis Data Streams or Kinesis Data Firehose. You can use Kinesis Data Analytics to run real-time SQL queries on your data. Once certain conditions are met you can trigger Lambda functions to make real time product suggestions to users. It is not important that we store or persist the clickstream data.
Question 17) You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data that you are ingesting is players controller inputs every 1 second (up to 10 players in a game) that is in JSON format. The data needs to be ingested through Kinesis Data Streams and the JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data?
A) 10 shards
B) Greater than 500 shards, so you’ll need to request more shards from AWS
C) 1 shard
D) 100 shards
In this scenario, there will be a maximum of 10 records per second with a max payload size of 1000 KB (10 records x 100 KB = 1000KB) written to the shard. A single shard can ingest up to 1 MB of data per second, which is enough to ingest the 1000 KB from the streaming game play. Therefor 1 shard is enough to handle the streaming data.
Question 19) You are a ML specialist needing to collect data from Twitter tweets. Your goal is to collect tweets that include only the name of your company and the tweet body, and store it off into a data store in AWS. What set of tools can you use to stream, transform, and load the data into AWS with the LEAST amount of effort?
A) Setup a Kinesis Data Firehose for data ingestion and immediately write that data to S3. Next, setup a Lambda function to trigger when data lands in S3 to transform it and finally write it to DynamoDB.
B) Setup A Kinesis Data Stream for data ingestion, setup EC2 instances as data consumers to poll and transform the data from the stream. Once the data is transformed, make an API call to write the data to DynamoDB.
C) Setup Kinesis Data Streams for data ingestion. Next, setup Kinesis Data Firehouse to load that data into RedShift. Next, setup a Lambda function to query data using RedShift spectrum and store the results onto DynamoDB.
D) Create a Kinesis Data Stream to ingest the data. Next, setup a Kinesis Data Firehose and use Lambda to transform the data from the Kinesis Data Stream, then use Lambda to write the data to DynamoDB. Finally, use S3 as the data destination for Kinesis Data Firehose.
All of these could be used to stream, transform, and load the data into an AWS data store. The setup that requires the LEAST amount of effort and moving parts involves setting up a Kinesis Data Firehose to stream the data into S3, have it transformed by Lambda with an S3 trigger, and then written to DynamoDB.