Multilingual and Platform Independent Cloud Certification and Education App for AWS, Azure, Google Cloud

Cloud Edu Cert

Data Center Proxies - Data Collectors - Data Unblockers

The Cloud Education Certification App is an EduFlix App for AWS, Azure, Google Cloud Certification Prep.

Technology is changing and is moving towards the cloud. The cloud will power most businesses in the coming years and is not taught in schools. How do we ensure that our kids and youth and ourselves are best prepared for this challenge?

Building mobile educational apps that work offline and on any device can help greatly in that sense.

The ability to tab on a button and learn the cloud fundamentals and take quizzes is a great opportunity to help our children and youth to boost their job prospects and be more productive at work.

The App covers the following certifications :
AWS Cloud Practitioner Exam Prep CCP CLF-C01, Azure Fundamentals AZ 900 Exam Prep, AWS Certified Solution Architect Associate SAA-C02 Exam Prep, AWS Certified Developer Associate DVA-C01 Exam Prep, Azure Administrator AZ 104 Exam Prep, Google Associate Cloud Engineer Exam Prep, Data Analytics for AWS DAS-C01, Machine Learning for AWS and Google, AWS Certified Security – Specialty (SCS-C01), AWS Certified Machine Learning – Specialty (MLS-C01), Google Cloud Professional Machine Learning Engineer and more…

The App covers the following cloud categories:

AWS Technology, AWS Security and Compliance, AWS Cloud Concepts, AWS Billing and Pricing , AWS Design High Performing Architectures, AWS Design Cost Optimized Architectures, AWS Specify Secure Applications And Architectures, AWS Design Resilient Architecture, Development With AWS, AWS Deployment, AWS Security, AWS Monitoring, AWS Troubleshooting, AWS Refactoring, Azure Pricing and Support, Azure Cloud Concepts , Azure Identity, governance, and compliance, Azure Services , Implement and Manage Azure Storage, Deploy and Manage Azure Compute Resources, Configure and Manage Azure Networking Services, Monitor and Backup Azure Resources, GCP Plan and configure a cloud solution, GCP Deploy and implement a cloud solution, GCP Ensure successful operation of a cloud solution, GCP Configure access and security, GCP Setting up a cloud solution environment, AWS Incident Response, AWS Logging and Monitoring, AWS Infrastructure Security, AWS Identity and Access Management, AWS Data Protection, AWS Data Engineering, AWS Exploratory Data Analysis, AWS Modeling, AWS Machine Learning Implementation and Operations, GCP Frame ML problems, GCP Architect ML solutions, GCP Prepare and process data, GCP Develop ML models, GCP Automate & orchestrate ML pipelines, GCP Monitor, optimize, and maintain ML solutions, etc..

Cloud Education and Certification

The App covers the following Cloud Services, Framework and technologies:

AWS: VPC, S3, DynamoDB, EC2, ECS, Lambda, API Gateway, CloudWatch, CloudTrail, Code Pipeline, Code Deploy, TCO Calculator, SES, EBS, ELB, AWS Autoscaling , RDS, Aurora, Route 53, Amazon CodeGuru, Amazon Bracket, AWS Billing and Pricing, Simply Monthly Calculator, cost calculator, Ec2 pricing on-demand, IAM, AWS Pricing, Pay As You Go, No Upfront Cost, Cost Explorer, AWS Organizations, Consolidated billing, Instance Scheduler, on-demand instances, Reserved instances, Spot Instances, CloudFront, Workspace, S3 storage classes, Regions, Availability Zones, Placement Groups, Amazon lightsail, Redshift, EC2 G4ad instances, DAAS, PAAS, IAAS, SAAS, NAAS, Machine Learning, Key Pairs, AWS CloudFormation, Amazon Macie, Amazon Textract, Glacier Deep Archive, 99.999999999% durability, AWS Codestar, Amazon Neptune, S3 Bucket, EMR, SNS, Desktop As A Service, Emazon EC2 for Mac, Aurora Postgres SQL, Kubernetes, Containers, Cluster.

Azure: Virtual Machines, Azure App Services, Azure Container Instances (ACI), Azure Kubernetes Service (AKS), and Windows Virtual Desktop, Virtual Networks, VPN Gateway, Virtual Network peering, and ExpressRoute, Container (Blob) Storage, Disk Storage, File Storage, and storage tiers, Cosmos DB, Azure SQL Database, Azure Database for MySQL, Azure Database for PostgreSQL, and SQL Managed Instance, Azure Marketplace, Azure consumption-based mode, management groups, resources and RG, Geographic distribution concepts such as Azure regions, region pairs, and AZ Internet of Things (IoT) Hub, IoT Central, and Azure Sphere, Azure Synapse Analytics, HDInsight, and Azure Databricks, Azure Machine Learning, Cognitive Services and Azure Bot Service, Serverless computing solutions that include Azure Functions and Logic Apps, Azure DevOps, GitHub, GitHub Actions, and Azure DevTest Labs, Azure Mobile, Azure Advisor, Azure Resource Manager (ARM) templates, Azure Security, Privacy and Workloads, General security and network security, Azure security features, Azure Security Centre, policy compliance, security alerts, secure score, and resource hygiene, Key Vault, Azure Sentinel, Azure Dedicated Hosts, Concept of defense in depth, NSG, Azure Firewall, Azure DDoS protection, Identity, governance, Conditional Access, Multi-Factor Authentication (MFA), and Single Sign-On (SSO),Azure Services, Core Azure architectural components, Management Groups, Azure Resource Manager,

Google Cloud Platform: Compute Engine, App Engine, BigQuery, Bigtable, Pub/Sub, flow logs, CORS, CLI, pod, Firebase, Cloud Run, Cloud Firestore, Cloud CDN, Cloud Storage, Persistent Disk, Kubernetes engine, Container registry, Cloud Load Balancing, Cloud Dataflow, gsutils, Cloud SQL,

Cloud Education Certification: Eduflix App for Cloud Education and Certification (AWS, Azure, Google Cloud)

Features:
– Practice exams
– 1000+ Q&A updated frequently.
– 3+ Practice exams per Certification
– Scorecard / Scoreboard to track your progress
– Quizzes with score tracking, progress bar, countdown timer.
– Can only see scoreboard after completing the quiz.
– FAQs for most popular Cloud services
– Cheat Sheets
– Flashcards
– works offline

Note and disclaimer: We are not affiliated with AWS, Azure, Microsoft or Google. The questions are put together based on the certification study guide and materials available online. The questions in this app should help you pass the exam but it is not guaranteed. We are not responsible for any exam you did not pass.

Important: To succeed with the real exam, do not memorize the answers in this app. It is very important that you understand why a question is right or wrong and the concepts behind it by carefully reading the reference documents in the answers.

Top 50 Google Certified Cloud Professional Architect Exam Questions and Answers Dumps

Azure Administrator AZ-104 Exam Questions and Answers Dumps

Data Center Proxies - Data Collectors - Data Unblockers

Google Certified Cloud Professional Architect is the top high paying certification in the world: Google Certified Professional Cloud Architect Average Salary – $175,761

The Google Certified Cloud Professional Architect Exam assesses your ability to:

  • Design and plan a cloud solution architecture
  • Manage and provision the cloud solution infrastructure
  • Design for security and compliance
  • Analyze and optimize technical and business processes
  • Manage implementations of cloud architecture
  • Ensure solution and operations reliability
  • Designing and planning a cloud solution architecture

The Google Certified Cloud Professional Architect covers the following topics:

Designing and planning a cloud solution architecture: 36%

This domain tests your ability to design a solution infrastructure that meets business and technical requirements and considers network, storage and compute resources. It will test your ability to create a migration plan, and that you can envision future solution improvements.

Managing and provisioning a solution Infrastructure: 20%

This domain will test your ability to configure network topologies, individual storage systems and design solutions using Google Cloud networking, storage and compute services.

Designing for security and compliance: 12%

This domain assesses your ability to design for security and compliance by considering IAM policies, separation of duties, encryption of data and that you can design your solutions while considering any compliance requirements such as those for healthcare and financial information.

Managing implementation: 10%

This domain tests your ability to advise development/operation team(s) to make sure you have successful deployment of your solution. It also tests yours ability to interact with Google Cloud using GCP SDK (gcloud, gsutil, and bq).

Ensuring solution and operations reliability: 6%

This domain tests your ability to run your solutions reliably in Google Cloud by building monitoring and logging solutions, quality control measures and by creating release management processes.

Analyzing and optimizing technical and business processes: 16%

This domain will test how you analyze and define technical processes, business processes and develop procedures to ensure resilience of your solutions in production.

Below are the Top 50 Google Certified Cloud Professional Architect Exam Questions and Answers Dumps: You will need to have the three case studies referred to in the exam open in separate tabs in order to complete the exam: Company A , Company B, Company C

Question 1:  Because you do not know every possible future use for the data Company A collects, you have decided to build a system that captures and stores all raw data in case you need it later. How can you most cost-effectively accomplish this goal?

 A. Have the vehicles in the field stream the data directly into BigQuery.

B. Have the vehicles in the field pass the data to Cloud Pub/Sub and dump it into a Cloud Dataproc cluster that stores data in Apache Hadoop Distributed File System (HDFS) on persistent disks.

C. Have the vehicles in the field continue to dump data via FTP, adjust the existing Linux machines, and use a collector to upload them into Cloud Dataproc HDFS for storage.

D. Have the vehicles in the field continue to dump data via FTP, and adjust the existing Linux machines to immediately upload it to Cloud Storage with gsutil.

ANSWER1:

D

Notes/References1:

D is correct because several load-balanced Compute Engine VMs would suffice to ingest 9 TB per day, and Cloud Storage is the cheapest per-byte storage offered by Google. Depending on the format, the data could be available via BigQuery immediately, or shortly after running through an ETL job. Thus, this solution meets business and technical requirements while optimizing for cost.

Reference: Streaming insertsApache Hadoop and Spark10 tips for building long running cluster using cloud dataproc

‎Cloud Education Certification
‎Cloud Education Certification
  • ‎Cloud Education Certification Screenshot
  • ‎Cloud Education Certification Screenshot
  • ‎Cloud Education Certification Screenshot
  • ‎Cloud Education Certification Screenshot
  • ‎Cloud Education Certification Screenshot
  • ‎Cloud Education Certification Screenshot
  • ‎Cloud Education Certification Screenshot

Question 2: Today, Company A maintenance workers receive interactive performance graphs for the last 24 hours (86,400 events) by plugging their maintenance tablets into the vehicle. The support group wants support technicians to view this data remotely to help troubleshoot problems. You want to minimize the latency of graph loads. How should you provide this functionality?

A. Execute queries against data stored in a Cloud SQL.

B. Execute queries against data indexed by vehicle_id.timestamp in Cloud Bigtable.

C. Execute queries against data stored on daily partitioned BigQuery tables.

D. Execute queries against BigQuery with data stored in Cloud Storage via BigQuery federation.

ANSWER2:

B

Notes/References2:

B is correct because Cloud Bigtable is optimized for time-series data. It is cost-efficient, highly available, and low-latency. It scales well. Best of all, it is a managed service that does not require significant operations work to keep running.

Reference: BigTables time series clusterBigQuery

Question 3: Your agricultural division is experimenting with fully autonomous vehicles. You want your architecture to promote strong security during vehicle operation. Which two architecture characteristics should you consider?

A. Use multiple connectivity subsystems for redundancy. 

B. Require IPv6 for connectivity to ensure a secure address space. 

C. Enclose the vehicle’s drive electronics in a Faraday cage to isolate chips.

D. Use a functional programming language to isolate code execution cycles.

E. Treat every microservice call between modules on the vehicle as untrusted.

F. Use a Trusted Platform Module (TPM) and verify firmware and binaries on boot.

ANSWER3:

E and F

Notes/References3:

E is correct because this improves system security by making it more resistant to hacking, especially through man-in-the-middle attacks between modules.

F is correct because this improves system security by making it more resistant to hacking, especially rootkits or other kinds of corruption by malicious actors.

Reference 3: Trusted Platform Module

Question 4: For this question, refer to the Company A case study.

Which of Company A’s legacy enterprise processes will experience significant change as a result of increased Google Cloud Platform adoption?

A. OpEx/CapEx allocation, LAN change management, capacity planning

B. Capacity planning, TCO calculations, OpEx/CapEx allocation 

C. Capacity planning, utilization measurement, data center expansion

D. Data center expansion, TCO calculations, utilization measurement

ANSWER4:

B

Notes/References4:

B is correct because all of these tasks are big changes when moving to the cloud. Capacity planning for cloud is different than for on-premises data centers; TCO calculations are adjusted because Company A is using services, not leasing/buying servers; OpEx/CapEx allocation is adjusted as services are consumed vs. using capital expenditures.

Reference: Cloud Economics

Question 5: For this question, refer to the Company A case study.

You analyzed Company A’s business requirement to reduce downtime and found that they can achieve a majority of time saving by reducing customers’ wait time for parts. You decided to focus on reduction of the 3 weeks’ aggregate reporting time. Which modifications to the company’s processes should you recommend?

A. Migrate from CSV to binary format, migrate from FTP to SFTP transport, and develop machine learning analysis of metrics.

B. Migrate from FTP to streaming transport, migrate from CSV to binary format, and develop machine learning analysis of metrics.

C. Increase fleet cellular connectivity to 80%, migrate from FTP to streaming transport, and develop machine learning analysis of metrics.

D. Migrate from FTP to SFTP transport, develop machine learning analysis of metrics, and increase dealer local inventory by a fixed factor.

ANSWER5:

C

Notes/References5:

C is correct because using cellular connectivity will greatly improve the freshness of data used for analysis from where it is now, collected when the machines are in for maintenance. Streaming transport instead of periodic FTP will tighten the feedback loop even more. Machine learning is ideal for predictive maintenance workloads.

Question 6: Your company wants to deploy several microservices to help their system handle elastic loads. Each microservice uses a different version of software libraries. You want to enable their developers to keep their development environment in sync with the various production services. Which technology should you choose?

A. RPM/DEB

B. Containers 

C. Chef/Puppet

D. Virtual machines

ANSWER6:

B

Notes/References6:

B is correct because using containers for development, test, and production deployments abstracts away system OS environments, so that a single host OS image can be used for all environments. Changes that are made during development are captured using a copy-on-write filesystem, and teams can easily publish new versions of the microservices in a repository.

Question 7: Your company wants to track whether someone is present in a meeting room reserved for a scheduled meeting. There are 1000 meeting rooms across 5 offices on 3 continents. Each room is equipped with a motion sensor that reports its status every second. You want to support the data upload and collection needs of this sensor network. The receiving infrastructure needs to account for the possibility that the devices may have inconsistent connectivity. Which solution should you design?

A. Have each device create a persistent connection to a Compute Engine instance and write messages to a custom application.

B. Have devices poll for connectivity to Cloud SQL and insert the latest messages on a regular interval to a device specific table. 

C. Have devices poll for connectivity to Cloud Pub/Sub and publish the latest messages on a regular interval to a shared topic for all devices.

D. Have devices create a persistent connection to an App Engine application fronted by Cloud Endpoints, which ingest messages and write them to Cloud Datastore.

ANSWER7:

C

Notes/References7:

C is correct because Cloud Pub/Sub can handle the frequency of this data, and consumers of the data can pull from the shared topic for further processing.

Question 8: Your company wants to try out the cloud with low risk. They want to archive approximately 100 TB of their log data to the cloud and test the analytics features available to them there, while also retaining that data as a long-term disaster recovery backup. Which two steps should they take?

A. Load logs into BigQuery. 

B. Load logs into Cloud SQL.

C. Import logs into Stackdriver. 

D. Insert logs into Cloud Bigtable.

E. Upload log files into Cloud Storage.

ANSWER8:

A and E

Notes/References8:

A is correct because BigQuery is the fully managed cloud data warehouse for analytics and supports the analytics requirement.

E is correct because Cloud Storage provides the Coldline storage class to support long-term storage with infrequent access, which would support the long-term disaster recovery backup requirement.

References: BigQueryStackDriverBigTableStorage Class: ColdLine

Question 9: You set up an autoscaling instance group to serve web traffic for an upcoming launch. After configuring the instance group as a backend service to an HTTP(S) load balancer, you notice that virtual machine (VM) instances are being terminated and re-launched every minute. The instances do not have a public IP address. You have verified that the appropriate web response is coming from each instance using the curl command. You want to ensure that the backend is configured correctly. What should you do?

A. Ensure that a firewall rule exists to allow source traffic on HTTP/HTTPS to reach the load balancer. 

B. Assign a public IP to each instance, and configure a firewall rule to allow the load balancer to reach the instance public IP.

C. Ensure that a firewall rule exists to allow load balancer health checks to reach the instances in the instance group.

D. Create a tag on each instance with the name of the load balancer. Configure a firewall rule with the name of the load balancer as the source and the instance tag as the destination.

ANSWER9:

C

Notes/References9:

C is correct because health check failures lead to a VM being marked unhealthy and can result in termination if the health check continues to fail. Because you have already verified that the instances are functioning properly, the next step would be to determine why the health check is continuously failing.

Reference: Load balancingLoad Balancing Health Checking

Question 10: Your organization has a 3-tier web application deployed in the same network on Google Cloud Platform. Each tier (web, API, and database) scales independently of the others. Network traffic should flow through the web to the API tier, and then on to the database tier. Traffic should not flow between the web and the database tier. How should you configure the network?

A. Add each tier to a different subnetwork.

B. Set up software-based firewalls on individual VMs. 

C. Add tags to each tier and set up routes to allow the desired traffic flow.

D. Add tags to each tier and set up firewall rules to allow the desired traffic flow.

ANSWER10:

D

Notes/References10:

D is correct because as instances scale, they will all have the same tag to identify the tier. These tags can then be leveraged in firewall rules to allow and restrict traffic as required, because tags can be used for both the target and source.

Reference: Using VPCRoutesAdd Remove Network

Question 11: Your organization has 5 TB of private data on premises. You need to migrate the data to Cloud Storage. You want to maximize the data transfer speed. How should you migrate the data?

A. Use gsutil.

B. Use gcloud.

C. Use GCS REST API. 

D. Use Storage Transfer Service.

ANSWER11:

A

Notes/References11:

A is correct because gsutil gives you access to write data to Cloud Storage.

Reference: gsutilsgcloud sdkcloud storage json apiuploading objectsstorage transfer

Question 12: You are designing a mobile chat application. You want to ensure that people cannot spoof chat messages by proving that a message was sent by a specific user. What should you do?

A. Encrypt the message client-side using block-based encryption with a shared key.

B. Tag messages client-side with the originating user identifier and the destination user.

C. Use a trusted certificate authority to enable SSL connectivity between the client application and the server. 

D. Use public key infrastructure (PKI) to encrypt the message client-side using the originating user’s private key.

ANSWER12:

D

Notes/References12:

D is correct because PKI requires that both the server and the client have signed certificates, validating both the client and the server.

Question 13: You are designing a large distributed application with 30 microservices. Each of your distributed microservices needs to connect to a database backend. You want to store the credentials securely. Where should you store the credentials?

A. In the source code

B. In an environment variable 

C. In a key management system

D. In a config file that has restricted access through ACLs

ANSWER13:

C

Notes/References13:

Question 14: For this question, refer to the Company B case study.

Company B wants to set up a real-time analytics platform for their new game. The new platform must meet their technical requirements. Which combination of Google technologies will meet all of their requirements?

A. Kubernetes Engine, Cloud Pub/Sub, and Cloud SQL

B. Cloud Dataflow, Cloud Storage, Cloud Pub/Sub, and BigQuery 

C. Cloud SQL, Cloud Storage, Cloud Pub/Sub, and Cloud Dataflow

D. Cloud Pub/Sub, Compute Engine, Cloud Storage, and Cloud Dataproc

ANSWER14:

B

Notes/References14:

B is correct because:
Cloud Dataflow dynamically scales up or down, can process data in real time, and is ideal for processing data that arrives late using Beam windows and triggers.
Cloud Storage can be the landing space for files that are regularly uploaded by users’ mobile devices.
Cloud Pub/Sub can ingest the streaming data from the mobile users.
BigQuery can query more than 10 TB of historical data.

References: GCP QuotasBeam Apache WindowingBeam Apache TriggersBigQuery External Data SolutionsApache Hive on Cloud Dataproc

Question 15: For this question, refer to the Company B case study.

Company B has deployed their new backend on Google Cloud Platform (GCP). You want to create a thorough testing process for new versions of the backend before they are released to the public. You want the testing environment to scale in an economical way. How should you design the process?A. Create a scalable environment in GCP for simulating production load.B. Use the existing infrastructure to test the GCP-based backend at scale. C. Build stress tests into each component of your application and use resources from the already deployed production backend to simulate load.D. Create a set of static environments in GCP to test different levels of load—for example, high, medium, and low.

ANSWER15:

A

Notes/References15:

A is correct because simulating production load in GCP can scale in an economical way.

Reference: Load Testing iot using gcp and locustDistributed Load Testing Using Kubernetes

Question 16: For this question, refer to the Company B case study.

Company B wants to set up a continuous delivery pipeline. Their architecture includes many small services that they want to be able to update and roll back quickly. Company B has the following requirements:

  • Services are deployed redundantly across multiple regions in the US and Europe
  • Only frontend services are exposed on the public internet.
  • They can reserve a single frontend IP for their fleet of services.
  • Deployment artifacts are immutable

Which set of products should they use?

A. Cloud Storage, Cloud Dataflow, Compute Engine

B. Cloud Storage, App Engine, Cloud Load Balancing

C. Container Registry, Google Kubernetes Engine, Cloud Load Balancing

D. Cloud Functions, Cloud Pub/Sub, Cloud Deployment Manager

ANSWER16:

C

Notes/References16:

C is correct because:
Google Kubernetes Engine is ideal for deploying small services that can be updated and rolled back quickly. It is a best practice to manage services using immutable containers.
Cloud Load Balancing supports globally distributed services across multiple regions. It provides a single global IP address that can be used in DNS records. Using URL Maps, the requests can be routed to only the services that Company B wants to expose.
Container Registry is a single place for a team to manage Docker images for the services.

References: Load Balancing https – load balancing overview GCP lb global forwarding rulesreserve static external ip addressbest practice for operating containerscontainer registrydataflowcalling https

Question 17: Your customer is moving their corporate applications to Google Cloud Platform. The security team wants detailed visibility of all resources in the organization. You use Resource Manager to set yourself up as the org admin. What Cloud Identity and Access Management (Cloud IAM) roles should you give to the security team?

A. Org viewer, Project owner

B. Org viewer, Project viewer 

C. Org admin, Project browser

D. Project owner, Network admin

ANSWER17:

B

Notes/References17:

B is correct because:
Org viewer grants the security team permissions to view the organization's display name.
Project viewer grants the security team permissions to see the resources within projects.

Reference: GCP Resource Manager – User Roles

Question 18: To reduce costs, the Director of Engineering has required all developers to move their development infrastructure resources from on-premises virtual machines (VMs) to Google Cloud Platform. These resources go through multiple start/stop events during the day and require state to persist. You have been asked to design the process of running a development environment in Google Cloud while providing cost visibility to the finance department. Which two steps should you take?

A. Use persistent disks to store the state. Start and stop the VM as needed. 

B. Use the –auto-delete flag on all persistent disks before stopping the VM. 

C. Apply VM CPU utilization label and include it in the BigQuery billing export.

D. Use BigQuery billing export and labels to relate cost to groups. 

E. Store all state in local SSD, snapshot the persistent disks, and terminate the VM.F. Store all state in Cloud Storage, snapshot the persistent disks, and terminate the VM.

ANSWER18:

A and D

Notes/References18:

A is correct because persistent disks will not be deleted when an instance is stopped.

D is correct because exporting daily usage and cost estimates automatically throughout the day to a BigQuery dataset is a good way of providing visibility to the finance department. Labels can then be used to group the costs based on team or cost center.

References: GCP instances life cycleGCP instances set disk auto deleteGCP Local Data PersistanceGCP export data BigQueryGCP Creating Managing Labels

Question 19: Your company has decided to make a major revision of their API in order to create better experiences for their developers. They need to keep the old version of the API available and deployable, while allowing new customers and testers to try out the new API. They want to keep the same SSL and DNS records in place to serve both APIs. What should they do?

A. Configure a new load balancer for the new version of the API.

B. Reconfigure old clients to use a new endpoint for the new API. 

C. Have the old API forward traffic to the new API based on the path.

D. Use separate backend services for each API path behind the load balancer.

ANSWER19:

D

Notes/References19:

D is correct because an HTTP(S) load balancer can direct traffic reaching a single IP to different backends based on the incoming URL.

References: load balancing httpsload balancing backendGCP lb global forwarding rules

Question 20: The database administration team has asked you to help them improve the performance of their new database server running on Compute Engine. The database is used for importing and normalizing the company’s performance statistics. It is built with MySQL running on Debian Linux. They have an n1-standard-8 virtual machine with 80 GB of SSD zonal persistent disk. What should they change to get better performance from this system in a cost-effective manner?

A. Increase the virtual machine’s memory to 64 GB.

B. Create a new virtual machine running PostgreSQL. 

C. Dynamically resize the SSD persistent disk to 500 GB.

D. Migrate their performance metrics warehouse to BigQuery.

ANSWER20:

C

Notes/References20:

C is correct because persistent disk performance is based on the total persistent disk capacity attached to an instance and the number of vCPUs that the instance has. Incrementing the persistent disk capacity will increment its throughput and IOPS, which in turn improve the performance of MySQL.

References: GCP compute disks pdsspecsGCP Compute Disks Performances

Question 21: You need to ensure low-latency global access to data stored in a regional GCS bucket. Data access is uniform across many objects and relatively high. What should you do to address the latency concerns?

A. Use Google’s Cloud CDN.

B. Use Premium Tier routing and Cloud Functions to accelerate access at the edges.

C. Do nothing.

D. Use global BigTable storage.

E. Use a global Cloud Spanner instance.

F. Migrate the data to a new multi-regional GCS bucket.

G. Change the storage class to multi-regional.

ANSWER21:

A

Notes/References21:

Cloud Functions cannot be used to affect GCS data access, so that option is simply wrong. BigTable does not have any “global” mode, so that option is wrong, too. Cloud Spanner is not a good replacement for GCS data: the data use cases are different enough that we can assume it would probably not be a good fit. You cannot change a bucket’s location after it has been created–not via the storage class nor any other way; you would have to migrate the data to a new bucket. Google’s Cloud CDN is very easy to turn on, but it does only work for data that comes from within GCP and only if the objects are being accessed frequently enough. 

Reference: Google Cloud Storage : What bucket class for the best performance?

Question 22: You are building a sign-up app for your local neighbourhood barbeque party and you would like to quickly throw together a low-cost application that tracks who will bring what. Which of the following options should you choose?

A. Python, Flask, App Engine Standard

B. Ruby, Nginx, GKE

C. HTML, CSS, Cloud Storage

D. Node.js, Express, Cloud Functions

E. Rust, Rocket, App Engine Flex

F. Perl, CGI, GCE

ANSWER22:

A

Notes/References22:

The Cloud Storage option doesn’t offer any way to coordinate the guest data. App Engine Flex would cost much more to run when no one is on the sign-up site. Cloud Functions could handle processing some API calls, but it would be more work to set up and that option doesn’t mention anything about storage. GKE is way overkill for such a small and simple application. Running Perl CGI scripts on GCE would also cost more than it needs (and probably make you very sad). App Engine Standard makes it super-easy to stand up a Python Flask app and includes easy data storage options, too. 

Reference: Building a Python 3.7 App on App Engine

Question 23: Your company has decided to migrate your AWS DynamoDB database to a multi-regional Cloud Spanner instance and you are designing the system to transfer and load all the data to synchronize the DBs and eventually allow for a quick cut-over. A member of your team has some previous experience working with Apache Hadoop. Which of the following options will you choose for the streamed updates that follow the initial import?

A. The DynamoDB table change is captured by Cloud Pub/Sub and written to Cloud Dataproc for processing into a Spanner-compatible format.

B. The DynamoDB table change is captured by Cloud Pub/Sub and written to Cloud Dataflow for processing into a Spanner-compatible format.

C. Changes to the DynamoDB table are captured by DynamoDB Streams. A Lambda function triggered by the stream writes the change to Cloud Pub/Sub. Cloud Dataflow processes the data from Cloud Pub/Sub and writes it to Cloud Spanner.

D. The DynamoDB table is rescanned by a GCE instance and written to a Cloud Storage bucket. Cloud Dataproc processes the data from Cloud Storage and writes it to Cloud Spanner.

E. The DynamoDB table is rescanned by an EC2 instance and written to an S3 bucket. Storage Transfer Service moves the data from S3 to a Cloud Storage bucket. Cloud Dataflow processes the data from Cloud Storage and writes it to Cloud Spanner.

ANSWER23:

C

Notes/References23:

Rescanning the DynamoDB table is not an appropriate approach to tracking data changes to keep the GCP-side of this in synch. The fact that someone on your team has previous Hadoop experience is not a good enough reason to choose Cloud Dataproc; that’s a red herring. The options purporting to connect Cloud Pub/Sub directly to the DynamoDB table won’t work because there is no such functionality. 

References: Cloud Solutions Architecture Reference

Question 24: Your client is a manufacturing company and they have informed you that they will be pausing all normal business activities during a five-week summer holiday period. They normally employ thousands of workers who constantly connect to their internal systems for day-to-day manufacturing data such as blueprints and machine imaging, but during this period the few on-site staff will primarily be re-tooling the factory for the next year’s production runs and will not be performing any manufacturing tasks that need to access these cloud-based systems. When the bulk of the staff return, they will primarily work on the new models but may spend about 20% of their time working with models from previous years. The company has asked you to reduce their GCP costs during this time, so which of the following options will you suggest?

A. Pause all Cloud Functions via the UI and unpause them when work starts back up.

B. Disable all Cloud Functions via the command line and re-enable them when work starts back up.

C. Delete all Cloud Functions and recreate them when work starts back up.

D. Convert all Cloud Functions to run as App Engine Standard applications during the break.

E. None of these options is a good suggestion.

ANSWER24:

E

Notes/References24:

Cloud Functions scale themselves down to zero when they’re not being used. There is no need to do anything with them.

Question 25: You need a place to store images before updating them by file-based render farm software running on a cluster of machines. Which of the following options will you choose?

A. Container Registry

B. Cloud Storage

C. Cloud Filestore

D. Persistent Disk

ANSWER25:

C

Notes/References25:

There are several different kinds of “images” that you might need to consider–maybe they are normal picture-image files, maybe they are Docker container images, maybe VM or disk images, or maybe something else. In this question, “images” refers to visual images, thus eliminating CI/CD products like Container Registry. Compute Engine is not a storage product and should be eliminated. The term “file-based” software means that it is unlikely to work well with object-based storage like Cloud Storage (or any of its storage classes). Persistent Disk cannot offer shared access across a cluster of machines when writes are involved; it only handles multiple readers. However, Cloud Filestore is made to provide shared, file-based storage for a cluster of machines as described in the question. 

Reference: Cloud Filestore | Google Cloud

Question 26: Your company has decided to migrate your AWS DynamoDB database to a multi-regional Cloud Spanner instance and you are designing the system to transfer and load all the data to synchronize the DBs and eventually allow for a quick cut-over. A member of your team has some previous experience working with Apache Hadoop. Which of the following options will you choose for the initial data import?

A. The DynamoDB table is scanned by an EC2 instance and written to an S3 bucket. Storage Transfer Service moves the data from S3 to a Cloud Storage bucket. Cloud Dataflow processes the data from Cloud Storage and writes it to Cloud Spanner.

B. The DynamoDB table data is captured by DynamoDB Streams. A Lambda function triggered by the stream writes the data to Cloud Pub/Sub. Cloud Dataflow processes the data from Cloud Pub/Sub and writes it to Cloud Spanner.

C. The DynamoDB table data is captured by Cloud Pub/Sub and written to Cloud Dataproc for processing into a Spanner-compatible format.

D. The DynamoDB table is scanned by a GCE instance and written to a Cloud Storage bucket. Cloud Dataproc processes the data from Cloud Storage and writes it to Cloud Spanner.

ANSWER26:

A

Notes/References26:

The same data processing will have to happen for both the initial (batch) data load and the incremental (streamed) data changes that follow it. So if the solution built to handle the initial batch doesn't also work for the stream that follows it, then the processing code would have to be written twice. A Professional Cloud Architect should recognize this project-level issue and not over-focus on the (batch) portion called out in this particular question. This is why you don’t want to choose Cloud Dataproc. Instead, Cloud Dataflow will handle both the initial batch load and also the subsequent streamed data. The fact that someone on your team has previous Hadoop experience is not a good enough reason to choose Cloud Dataproc; that’s a red herring. The DynamoDB streams option would be great for the db synchronization that follows, but it can’t handle the initial data load because DynamoDB Streams only fire for data changes. The option purporting to connect Cloud Pub/Sub directly to the DynamoDB table won’t work because there is no such functionality. 

Reference: Cloud Solutions Architecture Reference

Question 27: You need a managed service to handle logging data coming from applications running in GKE and App Engine Standard. Which option should you choose?

A. Cloud Storage

B. Logstash

C. Cloud Monitoring

D. Cloud Logging

E. BigQuery

F. BigTable

ANSWER27:

D

Notes/References27:

Cloud Monitoring is made to handle metrics, not logs. Logstash is not a managed service. And while you could store application logs in almost any storage service, the Cloud Logging service–aka Stackdriver Logging–is purpose-built to accept and process application logs from many different sources. Oh, and you should also be comfortable dealing with products and services by names other than their current official ones. For example, “GKE” used to be called “Container Engine”, “Cloud Build” used to be “Container Builder”, the “GCP Marketplace” used to be called “Cloud Launcher”, and so on. 

Reference: Cloud Logging | Google Cloud

Question 28: You need a place to store images before serving them from AppEngine Standard. Which of the following options will you choose?

A. Compute Engine

B. Cloud Filestore

C. Cloud Storage

D. Persistent Disk

E. Container Registry

F. Cloud Source Repositories

G. Cloud Build

H. Nearline

ANSWER28:

C

Notes/References28:

There are several different kinds of “images” that you might need to consider–maybe they are normal picture-image files, maybe they are Docker container images, maybe VM or disk images, or maybe something else. In this question, “images” refers to picture files, because that’s something that you would serve from a web server product like AppEngine Standard, so we eliminate Cloud Build (which isn’t actually for storage, at all) and the other two CI/CD products: Cloud Source Repositories and Container Registry. You definitely could store image files on Cloud Filestore or Persistent Disk, but you can’t hook those up to AppEngine Standard, so those options need to be eliminated, too. The only options left are both types of Cloud Storage, but since “Cloud Storage” sits next to “Coldline” as an option, we can confidently infer that the former refers to the “Standard” storage class. Since the question implies that these images will be served by AppEngine Standard, we would prefer to use the Standard storage class over the Coldline one–so there’s our answer. 

Reference: The App Engine Standard Environment Cloud Storage: Object Storage | Google Cloud Storage classes | Cloud Storage | Google Cloud

Question 29: You need to ensure low-latency global access to data stored in a multi-regional GCS bucket. Data access is uniform across many objects and relatively low. What should you do to address the latency concerns?

A. Use a global Cloud Spanner instance.

B. Change the storage class to multi-regional.

C. Use Google’s Cloud CDN.

D. Migrate the data to a new regional GCS bucket.

E. Do nothing.

F. Use global BigTable storage.

ANSWER29:

E

Notes/References29:

Cloud Functions cannot be used to affect GCS data access, so that option is simply wrong. BigTable does not have any “global” mode, so that option is wrong, too. Cloud Spanner is not a good replacement for GCS data: the data use cases are different enough that we can assume it would probably not be a good fit. You cannot change a bucket’s location after it has been created–not via the storage class nor any other way; you would have to migrate the data to a new bucket. But migrating the data to a regional bucket only helps when the data access will primarily be from that region. Google’s Cloud CDN is very easy to turn on, but it does only work for data that comes from within GCP and only if the objects are being accessed frequently enough to get cached based on previous requests. Because the access per object is so low, Cloud CDN won’t really help. This then brings us back to the question. Now, it may seem implied, but the question does not specifically state that there is currently a problem with latency, only that you need to ensure low latency–and we are already using what would be the best fit for this situation: a multi-regional CS bucket. 

Reference: Google Cloud Storage : What bucket class for the best performance?

Question 30: You need to ensure low-latency GCP access to a volume of historical data that is currently stored in an S3 bucket. Data access is uniform across many objects and relatively high. What should you do to address the latency concerns?

A. Use Premium Tier routing and Cloud Functions to accelerate access at the edges.

B. Use Google’s Cloud CDN.

C. Use global BigTable storage.

D. Do nothing.

E. Migrate the data to a new multi-regional GCS bucket.

F. Use a global Cloud Spanner instance.

ANSWER30:

E

Notes/References30:

Cloud Functions cannot be used to affect GCS data access, so that option is simply wrong. BigTable does not have any “global” mode, so that option is wrong, too. Cloud Spanner is not a good replacement for GCS data: the data use cases are different enough that we can assume it would probably not be a good fit–and it would likely be unnecessarily expensive. You cannot change a bucket’s location after it has been created–not via the storage class nor any other way; you would have to migrate the data to a new bucket. Google’s Cloud CDN is very easy to turn on, but it does only work for data that comes from within GCP and only if the objects are being accessed frequently enough. So even if you would want to use Cloud CDN, you have to migrate the data into a GCS bucket first, so that’s a better option. 

Reference: Google Cloud Storage : What bucket class for the best performance?

Question 31: You are lifting and shifting into GCP a system that uses a subnet-based security model. It has frontend and backend tiers and will be deployed in three regions. How many subnets will you need?

A. Six

B. One

C. Three

D. Four

E. Two

F. Nine

ANSWER31:

A

Notes/References31:

A single subnet spans and can be used across all zones in a single region, but you will need different subnets in different regions. Also, to implement subnet-level network security, you need to separate each tier into its own subnet. In this case, you have two tiers which will each need their own subnet in each of the three regions in which you will deploy this system. 

Reference: VPC network overview | Google Cloud Best practices and reference architectures for VPC design | Solutions

Question 32: You need a place to produce images before deploying them to AppEngine Flex. Which of the following options will you choose?

A. Container Registry

B. Cloud Storage

C. Persistent Disk

D. Nearline

E. Cloud Source Repositories

F. Cloud Build

G. Cloud Filestore

H. Compute Engine

ANSWER32:

F

Notes/References32:

There are several different kinds of “images” that you might need to consider–maybe they are normal picture-image files, maybe they are Docker container images, maybe VM or disk images, or maybe something else. In this question, “deploying [these images] to AppEngine Flex” lets us know that we are dealing with Docker container images, and thus although they would likely be stored in the Container Registry, after being built, this question asks us where that building might happen, which is Cloud Build. Cloud Build, which used to be called Container Builder, is ideal for building container images–though it can also be used to build almost any artifacts, really. You could also do this on Compute Engine, but that option requires much more work to manage and is therefore worse. 

Reference: Google App Engine flexible environment docs | Google Cloud Container Registry | Google Cloud

Question 33: You are lifting and shifting into GCP a system that uses a subnet-based security model. It has frontend, app, and data tiers and will be deployed in three regions. How many subnets will you need?

A. Two

B. One

C. Three

D. Nine

E. Four

F. Six

ANSWER33:

D

Notes/References33:

A single subnet spans and can be used across all zones in a single region, but you will need different subnets in different regions. Also, to implement subnet-level network security, you need to separate each tier into its own subnet. In this case, you have three tiers which will each need their own subnet in each of the three regions in which you will deploy this system. 

Reference: VPC network overview | Google Cloud Best practices and reference architectures for VPC design | Solutions

Question 34: You need a place to store images in case any of them are needed as evidence for a tax audit over the next seven years. Which of the following options will you choose?

A. Cloud Filestore

B. Coldline

C. Nearline

D. Persistent Disk

E. Cloud Source Repositories

F. Cloud Storage

G. Container Registry

ANSWER34:

B

Notes/References34:

There are several different kinds of “images” that you might need to consider–maybe they are normal picture-image files, maybe they are Docker container images, maybe VM or disk images, or maybe something else. In this question, “images” probably refers to picture files, and so Cloud Storage seems like an interesting option. But even still, when “Cloud Storage” is used without any qualifier, it generally refers to the “Standard” storage class, and this question also offers other storage classes as response options. Because the images in this scenario are unlikely to be used more than once a year (we can assume that taxes are filed annually and there’s less than 100% chance of being audited), the right storage class is Coldline. 

Reference: Cloud Storage: Object Storage | Google Cloud Storage classes | Cloud Storage | Google Cloud

Question 35: You need a place to store images before deploying them to AppEngine Flex. Which of the following options will you choose?

A. Container Registry

B. Cloud Filestore

C. Cloud Source Repositories

D. Persistent Disk

E. Cloud Storage

F. Code Build

G. Nearline

ANSWER35:

A

Notes/References35:

Compute Engine is not a storage product and should be eliminated. There are several different kinds of “images” that you might need to consider–maybe they are normal picture-image files, maybe they are Docker container images, maybe VM or disk images, or maybe something else. In this question, “deploying [these images] to AppEngine Flex” lets us know that we are dealing with Docker container images, and thus they would likely have been stored in the Container Registry. 

Reference: Google App Engine flexible environment docs | Google Cloud Container Registry | Google Cloud

Question 36: You are configuring a SaaS security application that updates your network’s allowed traffic configuration to adhere to internal policies. How should you set this up?

A. Install the application on a new appropriately-sized GCE instance running in your host VPC, and apply a read-only service account to it.

B. Create a new service account for the app to use and grant it the compute.networkViewer role on the production VPC.

C. Create a new service account for the app to use and grant it the compute.securityAdmin role on the production VPC.

D. Run the application as a container in your system’s staging GKE cluster and grant it access to a read-only service account.

E. Install the application on a new appropriately-sized GCE instance running in your host VPC, and let it use the default service account.

ANSWER36:

C

Notes/References36:

You do not install a Software-as-a-Service application yourself; instead, it runs on the vendor's own hardware and you configure it for external access. Service accounts are great for this, as they can be used externally and you maintain full control over them (disabling them, rotating their keys, etc.). The principle of least privilege dictates that you should not give any application more ability than it needs, but this app does need to make changes, so you'll need to grant securityAdmin, not networkViewer. 

Reference: VPC network overview | Google Cloud Best practices and reference architectures for VPC design | Solutions Understanding roles | Cloud IAM Documentation | Google Cloud

Question 37: You are lifting and shifting into GCP a system that uses a subnet-based security model. It has frontend and backend tiers and will be deployed across three zones. How many subnets will you need?

A. One

B. Six

C. Four

D. Three

E. Nine

ANSWER37:

F

Notes/References37:

A single subnet spans and can be used across all zones in a given region. But to implement subnet-level network security, you need to separate each tier into its own subnet. In this case, you have two tiers, so you only need two subnets. 

Reference: VPC network overview | Google Cloud Best practices and reference architectures for VPC design | Solutions

Question 38: You have been tasked with setting up a system to comply with corporate standards for container image approvals. Which of the following is your best choice for this project?

A. Binary Authorization

B. Cloud IAM

C. Security Key Enforcement

D. Cloud SCC

E. Cloud KMS

ANSWER38:

A

Notes/References38:

Cloud KMS is Google's product for managing encryption keys. Security Key Enforcement is about making sure that people's accounts do not get taken over by attackers, not about managing encryption keys. Cloud IAM is about managing what identities (both humans and services) can access in GCP. Cloud DLP–or Data Loss Prevention–is for preventing data loss by scanning for and redacting sensitive information. Cloud SCC–the Security Command Center–centralizes security information so you can manage it all in one place. Binary Authorization is about making sure that only properly-validated containers can run in your environments. 

Reference: Cloud Key Management Service | Google Cloud Cloud IAM | Google Cloud Cloud Data Loss Prevention | Google Cloud Security Command Center | Google Cloud Binary Authorization | Google Cloud Security Key Enforcement – 2FA

Question 39: For this question, refer to the Company B‘s case study. Which of the following are most likely to impact the operations of Company B’s game backend and analytics systems?

A. PCI

B. PII

C. SOX

D. GDPR

E. HIPAA

ANSWER39:

B and D

Notes/References39:

There is no patient/health information, so HIPAA does not apply. It would be a very bad idea to put payment card information directly into these systems, so we should assume they’ve not done that–therefore the Payment Card Industry (PCI) standards/regulations should not affect normal operation of these systems. Besides, it’s entirely likely that they never deal with payments directly, anyway–choosing to offload that to the relevant app stores for each mobile platform. Sarbanes-Oxley (SOX) is about proper management of financial records for publicly traded companies and should therefore not apply to these systems. However, these systems are likely to contain some Personally-Identifying Information (PII) about the users who may reside in the European Union and therefore the EU’s General Data Protection Regulations (GDPR) will apply and may require ongoing operations to comply with the “Right to be Forgotten/Erased”. 

Reference: Sarbanes–Oxley Act – Wikipedia Payment Card Industry Data Security Standard – Wikipedia Personal data – Wikipedia Personal data – Wikipedia

Question 40: Your new client has advised you that their organization falls within the scope of HIPAA. What can you infer about their information systems?

A. Their customers located in the EU may require them to delete their user data and provide evidence of such.

B. They will also need to pass a SOX audit.

C. They handle money-linked information.

D. Their system deals with medical information.

ANSWER40:

D

Notes/References40:

SOX stands for Sarbanes Oxley and is US regulation governing financial reporting for publicly-traded companies. HIPAA–the Health Insurance Portability and Accountability Act of 1996–is US regulation aimed at safeguarding individuals' (i.e. patients’) health information. PCI is the Payment Card Industry, and they have Data Security Standards (DSS) that must be adhered to by systems handling payment information of any of their member brands (which include Visa, Mastercard, and several others). 

Reference: Cloud Compliance & Regulations Resources | Google Cloud

Question 41: Your new client has advised you that their organization needs to pass audits by ISO and PCI. What can you infer about their information systems?

A. They handle money-linked information.

B. Their customers located in the EU may require them to delete their user data and provide evidence of such.

C. Their system deals with medical information.

D. They will also need to pass a SOX audit.

ANSWER42:

A

Notes/References42:

SOX stands for Sarbanes Oxley and is US regulation governing financial reporting for publicly-traded companies. HIPAA–the Health Insurance Portability and Accountability Act of 1996–is US regulation aimed at safeguarding individuals' (i.e. patients’) health information. PCI is the Payment Card Industry, and they have Data Security Standards (DSS) that must be adhered to by systems handling payment information of any of their member brands (which include Visa, Mastercard, and several others). ISO is the International Standards Organization, and since they have so many completely different certifications, this does not tell you much. 

Reference: Cloud Compliance & Regulations Resources | Google Cloud

Question 43: Your new client has advised you that their organization deals with GDPR. What can you infer about their information systems?

A. Their system deals with medical information.

B. Their customers located in the EU may require them to delete their user data and provide evidence of such.

C. They will also need to pass a SOX audit.

D. They handle money-linked information.

ANSWER43:

B

Notes/References43:

SOX stands for Sarbanes Oxley and is US regulation governing financial reporting for publicly-traded companies. HIPAA–the Health Insurance Portability and Accountability Act of 1996–is US regulation aimed at safeguarding individuals' (i.e. patients’) health information. PCI is the Payment Card Industry, and they have Data Security Standards (DSS) that must be adhered to by systems handling payment information of any of their member brands (which include Visa, Mastercard, and several others). 

Reference: Cloud Compliance & Regulations Resources | Google Cloud

Question 44: For this question, refer to the Company C case study. Once Company C has completed their initial cloud migration as described in the case study, which option would represent the quickest way to migrate their production environment to GCP?

A. Apply the strangler pattern to their applications and reimplement one piece at a time in the cloud

B. Lift and shift all servers at one time

C. Lift and shift one application at a time

D. Lift and shift one server at a time

E. Set up cloud-based load balancing then divert traffic from the DC to the cloud system

F. Enact their disaster recovery plan and fail over

ANSWER44:

F

Notes/References44:

The proposed Lift and Shift options are all talking about different situations than Dress4Win would find themselves in, at that time: they’d then have automation to build a complete prod system in the cloud, but they’d just need to migrate to it. “Just”, right? 🙂 The strangler pattern approach is similarly problematic (in this case), in that it proposes a completely different cloud migration strategy than the one they’ve almost completed. Now, if we purely consider the kicker’s key word “quickest”, using the DR plan to fail over definitely seems like it wins. Setting up an additional load balancer and migrating slowly/carefully would take more time. 

Reference: Strangler pattern – Cloud Design Patterns | Microsoft Docs StranglerFigApplication Monolith to Microservices Using the Strangler Pattern – DZone Microservices Understanding Lift and Shift and If It’s Right For You

Question 45: Which of the following commands is most likely to appear in an environment setup script?

A. gsutil mb -l asia gs://${project_id}-logs

B. gcloud compute instances create –zone–machine-type=n1-highmem-16 newvm

C. gcloud compute instances create –zone–machine-type=f1-micro newvm

D. gcloud compute ssh ${instance_id}

E. gsutil cp -r gs://${project_id}-setup ./install

F. gsutil cp -r logs/* gs://${project_id}-logs/${instance_id}/

ANSWER45:

A

Notes/References45:

The context here indicates that “environment” is an infrastructure environment like “staging” or “prod”, not just a particular command shell. In that sort of a situation, it is likely that you might create some core per-environment buckets that will store different kinds of data like configuration, communication, logging, etc. You're not likely to be creating, deleting, or connecting (sshing) to instances, nor copying files to or from any instances. 

Reference: mb – Make buckets | Cloud Storage | Google Cloud cp – Copy files and objects | Cloud Storage | Google Cloud gcloud compute instances | Cloud SDK Documentation | Google Cloud

Question 46: Your developers are working to expose a RESTful API for your company’s physical dealer locations. Which of the following endpoints would you advise them to include in their design?

A. /dealerLocations/get

B. /dealerLocations

C. /dealerLocations/list

D. Source and destination

E. /getDealerLocations

ANSWER46:

B

Notes/References46:

It might not feel like it, but this is in scope and a fair question. Google expects Professional Cloud Architects to be able to advise on designing APIs according to best practices (check the exam guide!). In this case, it's important to know that RESTful interfaces (when properly designed) use nouns for the resources identified by a given endpoint. That, by itself, eliminates most of the listed options. In HTTP, verbs like GET, PUT, and POST are then used to interact with those endpoints to retrieve and act upon those resources. To choose between the two noun-named options, it helps to know that plural resources are generally already understood to be lists, so there should be no need to add another “/list” to the endpoint. 

Reference: RESTful API Design — Step By Step Guide – By

Question 47: Which of the following commands is most likely to appear in an instance shutdown script?

A. gsutil cp -r gs://${project_id}-setup ./install

B. gcloud compute instances create –zone–machine-type=n1-highmem-16 newvm

C. gcloud compute ssh ${instance_id}

D. gsutil mb -l asia gs://${project_id}-logs

E. gcloud compute instances delete ${instance_id}

F. gsutil cp -r logs/* gs://${project_id}-logs/${instance_id}/

G. gcloud compute instances create –zone–machine-type=f1-micro newvm

ANSWER47:

F

Notes/References47:

The startup and shutdown scripts run on an instance at the time when that instance is starting up or shutting down. Those situations do not generally call for any other instances to be created, deleted, or connected (sshed) to. Also, those would be a very unusual time to make a Cloud Storage bucket, since buckets are the overall and highly-scalable containers that would likely hold the data for all (or at least many) instances in a given project. That said, instance shutdown time may be a time when you'd want to copy some final logs from the instance into some project-wide bucket. (In general, though, you really want to be doing that kind of thing continuously and not just at shutdown time, in case the instance shuts down unexpectedly and not in an orderly fashion that runs your shutdown script.)

Reference:  Running startup scripts | Compute Engine Documentation | Google Cloud Running shutdown scripts | Compute Engine Documentation | Google Cloud cp – Copy files and objects | Cloud Storage | Google Cloud gcloud compute instances | Cloud SDK Documentation | Google Cloud

Question 48: It is Saturday morning and you have been alerted to a serious issue in production that is both reducing availability to 95% and corrupting some data. Your monitoring tools noticed the issue 5 minutes ago and it was just escalated to you because the on-call tech in line before you did not respond to the page. Your system has an RPO of 10 minutes and an RTO of 120 minutes, with an SLA of 90% uptime. What should you do first?

A. Escalate the decision to the business manager responsible for the SLA

B. Take the system offline

C. Revert the system to the state it was in on Friday morning

D. Investigate the cause of the issue

ANSWER48:

B

Notes/References48:

The data corruption is your primary concern, as your Recovery Point Objective allows only 10 minutes of data loss and you may already have lost 5. (The data corruption means that you may well need to roll back the data to before that started happening.) It might seem crazy, but you should as quickly as possible stop the system so that you do not lose any more data. It would almost certainly take more time than you have left in your RPO to properly investigate and address the issue, but you should then do that next, during the disaster response clock set by your Recovery Time Objective. Escalating the issue to a business manager doesn't make any sense. And neither does it make sense to knee-jerk revert the system to an earlier state unless you have some good indication that doing so will address the issue. Plus, we'd better assume that “revert the system” refers only to the deployment and not the data, because rolling the data back that far would definitely violate the RPO. 

Reference: Disaster recovery – Wikipedia

Question 49: Which of the following are not processes or practices that you would associate with DevOps?

A. Raven-test the candidate

B. Obfuscate the code

C. Only one of the other options is made up

D. Run the code in your cardinal environment

E. Do a canary deploy

ANSWER49:

A and D

Notes/References49:

Testing your understanding of development and operations in DevOps. In particular, you need to know that a canary deploy is a real thing and it can be very useful to identify problems with a new change you're making before it is fully rolled out to and therefore impacts everyone. You should also understand that “obfuscating” code is a real part of a release process that seeks to protect an organization's source code from theft (by making it unreadable by humans) and usually happens in combination with “minification” (which improves the speed of downloading and interpreting/running the code). On the other hand, “raven-testing” isn't a thing, and neither is a “cardinal environment”. Those bird references are just homages to canary deployments.

Reference: Intro to deployment strategies: blue-green, canary, and more – DEV Community ‍‍

Question 50: Your CTO is going into budget meetings with the board, next month, and has asked you to draw up plans to optimize your GCP-based systems for capex. Which of the following options will you prioritize in your proposal?

A. Object lifecycle management

B. BigQuery Slots

C. Committed use discounts

D. Sustained use discounts

E. Managed instance group autoscaling

F. Pub/Sub topic centralization

ANSWER50:

B and C

Notes/References50:

Pub/Sub usage is based on how much data you send through it, not any sort of “topic centralization” (which isn't really a thing). Sustained use discounts can reduce costs, but that's not really something you structure your system around. Now, most organizations prefer to turn Capital Expenditures into Operational Expenses, but since this question is instead asking you to prioritize CapEx, we need to consider the remaining options from the perspective of “spending” (or maybe reserving) defined amounts of money up-front for longer-term use. (Fair warning, though: You may still have some trouble classifying some cloud expenses as “capital” expenditures). With that in mind, GCE's Committed Use Discounts do fit: you “buy” (reserve/prepay) some instances ahead of time and then not have to pay (again) for them as you use them (or don't use them; you've already paid). BigQuery Slots are a similar flat-rate pricing model: you pre-purchase a certain amount of BigQuery processing capacity and your queries use that instead of the on-demand capacity. That means you won't pay more than you planned/purchased, but your queries may finish rather more slowly, too. Managed instance group autoscaling and object lifecycle management can help to reduce costs, but they are not really about capex. 

Reference: CapEx vs OpEx: Capital Expenses and Operating Expenses Explained – BMC Blogs Sustained use discounts | Compute Engine Documentation | Google Cloud Committed use discounts | Compute Engine Documentation | Google Cloud Slots | BigQuery | Google Cloud Autoscaling groups of instances | Compute Engine Documentation Object Lifecycle Management | Cloud Storage | Google Cloud

Question 51: In your last retrospective, there was significant disagreement voiced by the members of your team about what part of your system should be built next. Your scrum master is currently away, but how should you proceed when she returns, on Monday?

A. The scrum master is the one who decides

B. The lead architect should get the final say

C. The product owner should get the final say

D. You should put it to a vote of key stakeholders

E. You should put it to a vote of all stakeholders

ANSWER51:

C

Notes/References51:

In Scrum, it is the Product Owner's role to define and prioritize (i.e. set order for) the product backlog items that the dev team will work on. If you haven't ever read it, the Scrum Guide is not too long and quite valuable to have read at least once, for context. 

Reference: Scrum Guide | Scrum Guides

Question 52: Your development team needs to evaluate the behavior of a new version of your application for approximately two hours before committing to making it available to all users. Which of the following strategies will you suggest?

A. Split testing

B. Red-Black

C. A/B

D. Canary

E. Rolling

F. Blue-Green

G. Flex downtime

ANSWER52:

D and E

Notes/References52:

A Blue-Green deployment, also known as a Red-Black deployment, entails having two complete systems set up and cutting over from one of them to the other with the ability to cut back to the known-good old one if there’s any problem with the experimental new one. A canary deployment is where a new version of an app is deployed to only one (or a very small number) of the servers, to see whether it experiences or causes trouble before that version is rolled out to the rest of the servers. When the canary looks good, a Rolling deployment can be used to update the rest of the servers, in-place, one after another to keep the overall system running. “Flex downtime” is something I just made up, but it sounds bad, right? A/B testing–also known as Split testing–is not generally used for deployments but rather to evaluate two different application behaviours by showing both of them to different sets of users. Its purpose is to gather higher-level information about how users interact with the application. 

Reference: BlueGreenDeployment design patterns – What's the difference between Red/Black deployment and Blue/Green Deployment? – Stack Overflow design patterns – What's the difference between Red/Black deployment and Blue/Green Deployment? – Stack Overflow What is rolling deployment? – Definition from WhatIs.com A/B testing – Wikipedia

Question 53: You are mentoring a Junior Cloud Architect on software projects. Which of the following “words of wisdom” will you pass along?

A. Identifying and fixing one issue late in the product cycle could cost the same as handling a hundred such issues earlier on

B. Hiring and retaining 10X developers is critical to project success

C. A key goal of a proper post-mortem is to identify what processes need to be changed

D. Adding 100% is a safe buffer for estimates made by skilled estimators at the beginning of a project

E. A key goal of a proper post-mortem is to determine who needs additional training

ANSWER53:

A and C

Notes/References53:

There really can be 10X (and even larger!) differences in productivity between individual contributors, but projects do not only succeed or fail because of their contributions. Bugs are crazily more expensive to find and fix once a system has gone into production, compared to identifying and addressing that issue right up front–yes, even 100x. A post-mortem should not focus on blaming an individual but rather on understanding the many underlying causes that led to a particular event, with an eye toward how such classes of problems can be systematically prevented in the future. 

Reference: 403 Forbidden 403 Forbidden Google – Site Reliability Engineering The Cone of Uncertainty

Question 54: Your team runs a service with an SLA to achieve p99 latency of 200ms. This month, your service achieved p95 latency of 250ms. What will happen now?

A. The next month’s SLA will be increased.

B. The next month’s SLO will be reduced.

C. Your client(s) will have to pay you extra.

D. You will have to pay your client(s).

E. There is no impact on payments.

F. There is not enough information to make a determination.

ANSWER54:

D

Notes/References54:

It would be highly unusual for clients to have to pay extra, even if the service performs better than agreed by the SLA. SLAs generally set out penalties (i.e. you pay the client) for below-standard performance. While SLAs are external-facing, SLOs are internal-facing and do not generally relate to performance penalties. Neither SLAs nor SLOs are adaptively changed just because of one month’s performance; such changes would have to happen through rather different processes. A p99 metric is a tougher measure than p95, and p95 is tougher than p90–so meeting the tougher measure would surpass a required SLA, but meeting a weaker measure would not give enough information to say. 

Reference: What's the Difference Between DevOps and SRE? (class SRE implements DevOps) – YouTube Percentile rank – Wikipedia

Question 55: Your team runs a service with an SLO to achieve p90 latency of 200ms. This month, your service achieved p95 latency of 250ms. What will happen now?

A. The next month’s SLA will be increased.

B. There is no impact on payments.

C. There is not enough information to make a determination.

D. Your client(s) will have to pay you extra.

E. The next month’s SLO will be reduced.

F. You will have to pay your client(s).

ANSWER55:

B

Notes/References55:

It would be highly unusual for clients to have to pay extra, even if the service performs better than agreed by the SLA. SLAs generally set out penalties (i.e. you pay the client) for below-standard performance. While SLAs are external-facing, SLOs are internal-facing and do not generally relate to performance penalties. Neither SLAs nor SLOs are adaptively changed just because of one month’s performance; such changes would have to happen through rather different processes. A p99 metric is a tougher measure than p95, and p95 is tougher than p90–so meeting the tougher measure would surpass a required SLA, but meeting a weaker measure would not give enough information to say. 

Reference: What's the Difference Between DevOps and SRE? (class SRE implements DevOps) – YouTube Percentile rank – Wikipedia

Question 56: For this question, refer to the Company C case study. How would you recommend Company C address their capacity and utilization concerns?

A. Configure the autoscaling thresholds to follow changing load

B. Provision enough servers to handle trough load and offload to Cloud Functions for higher demand

C. Run cron jobs on their application servers to scale down at night and up in the morning

D. Use Cloud Load Balancing to balance the traffic highs and lows

D. Run automated jobs in Cloud Scheduler to scale down at night and up in the morning

E. Provision enough servers to handle peak load and sell back excess on-demand capacity to the marketplace

ANSWER56:

A

Notes/References56:

The case study notes, “Our traffic patterns are highest in the mornings and weekend evenings; during other times, 80% of our capacity is sitting idle.” Cloud Load Balancing could definitely scale itself to handle this type of load fluctuation, but it would not do anything to address the issue of having enough application server capacity. Provisioning servers to handle peak load is generally inefficient, but selling back excess on-demand capacity to the marketplace just isn’t a thing, so that option must be eliminated, too. Using Cloud Functions would require a different architectural approach for their application servers and it is generally not worth the extra work it would take to coordinate workloads across Cloud Functions and GCE–in practice, you’d just use one or the other. It is possible to manually effect scaling via automated jobs like in Cloud Scheduler or cron running somewhere (though cron running everywhere could create a coordination nightmare), but manual scaling based on predefined expected load levels is far from ideal, as capacity would only very crudely match demand. Rather, it is much better to configure the managed instance group’s autoscaling to follow demand curves–both expected and unexpected. A properly-architected system should rise to the occasion of unexpectedly going viral, and not fall over. 

Reference: Load Balancing | Google Cloud Google Cloud Platform Marketplace Solutions Cloud Functions | Google Cloud Cloud Scheduler | Google Cloud

Google Cloud Latest News, Questions and Answers online:

Cloud Run vs App Engine: In a nutshell, you give Google’s Cloud Run a Docker container containing a webserver. Google will run this container and create an HTTP endpoint. All the scaling is automatically done for you by Google. Cloud Run depends on the fact that your application should be stateless. This is because Google will spin up multiple instances of your app to scale it dynamically. If you want to host a traditional web application this means that you should divide it up into a stateless API and a frontend app.

With Google’s App Engine you tell Google how your app should be run. The App Engine will create and run a container from these instructions. Deploying with App Engine is super easy. You simply fill out an app.yml file and Google handles everything for you.

With Cloud Run, you have more control. You can go crazy and build a ridiculous custom Docker image, no problem! Cloud Run is made for Devops engineers, App Engine is made for developers. Read more here…

Cloud Run VS Cloud Functions: What to consider?

The best choice depends on what you want to optimize, your use-cases and your specific needs.

If your objective is the lowest latency, choose Cloud Run.

Indeed, Cloud Run use always 1 vCPU (at least 2.4Ghz) and you can choose the memory size from 128Mb to 2Gb.

With Cloud Functions, if you want the best processing performance (2.4Ghz of CPU), you have to pay 2Gb of memory. If your memory footprint is low, a Cloud Functions with 2Gb of memory is overkill and cost expensive for nothing.

Cutting cost is not always the best strategy for customer satisfaction, but business reality may require it. Anyway, it highly depends of your use-case

Both Cloud Run and Cloud Function round up to the nearest 100ms. As you could play with the GSheet, the Cloud Functions are cheaper when the processing time of 1 request is below the first 100ms. Indeed, you can slow the Cloud Functions vCPU, with has for consequence to increase the duration of the processing but while staying under 100ms if you tune it well. Thus less Ghz/s are used and thereby you pay less.

the cost comparison between Cloud Functions and Cloud Run goes further than simply comparing a pricing list. Moreover, on your projects, you often will have to use the 2 solutions for taking advantage of their strengths and capabilities.

My first choice for development is Cloud Run. Its portability, its testability, its openess on the libraries, the languages and the binaries confer it too much advantages for, at least, a similar pricing, and often with a real advantage in cost but also in performance, in particular for concurrent requests. Even if you need the same level of isolation of Cloud functions (1 instance per request), simply set the concurrent param to 1!

In addition, the GA of Cloud Run is applied on all containers, whatever the languages and the binaries used. Read more here…

What does the launch of Google’s App Maker mean for professional app developers?

Should I go with AWS Elastic Beanstalk or Google App Engine (Managed VMs) for deploying my Parse-Server backend?

Why can a company such as Google sell me a cloud gaming service where I can “rent” GPU power over miles of internet, but when I seek out information on how to create a version of this everyone says that it is not possible or has too much latency?

AWS wins hearts of developers while Azure those of C-levels. Google is a black horse with special expertise like K8s and ML. The cloud world is evolving. Who is the winner in the next 5 years?

What is GCP (Google Cloud Platform) and how does it work?

What is the maximum amount of storage that you could have in your Google drive?

How do I deploy Spring Boot application (Web MVC) on Google App Engine(GAE) or HEROKU using Eclipse IDE?

What are some downsides of building softwares on top of Google App Engine?

Why is Google losing the cloud computing race?

How did new products like Google Drive, Microsoft SkyDrive, Yandex.Disk and other cloud storage solutions affect Dropbox’s growth and revenues?

What is the capacity of Google servers?

What is the Hybrid Cloud platform?

What is the difference between Docker and Google App engines?

How do I get to cloud storage?

How does Google App Engine compare to Heroku?

What is equivalent of Google Cloud BigTable in Microsoft Azure?

How big is the storage capacity of Google organization and who comes second?

It seems strange that Google Cloud Platform offer “everything” except cloud search/inverted index?

Where are the files on Google Drive stored?

Is Google app engine similar to lambda?

Was Diane Greene a failure as the CEO of Google Cloud considering her replacement’s strategy and philosophy is the polar opposite?

How is Google Cloud for heavy real-time traffic? Is there any optimization needed for handling more than 100k RT?

When it comes to iCloud, why does Apple rely on Google Cloud instead of using their own data centers?

Google Cloud Storage : What bucket class for the best performance?: Multiregional buckets perform significantly better for cross-the-ocean fetches, however the details are a bit more nuanced than that. The performance is dominated by the latency of physical distance between the client and the cloud storage bucket.

  • If caching is on, and your access volume is high enough to take advantage of caching, there’s not a huge difference between the two offerings (that I can see with the tests). This shows off the power of Google’s Awesome CDN environment.
  • If caching is off, or the access volume is low enough that you can’t take advantage of caching, then the performance overhead is dominated directly by physics. You should be trying to get the assets as close to the clients as possible, while also considering cost, and the types of redundancy and consistency you’ll need for your data needs.

Top- high paying certifications:

  1. Google Certified Professional Cloud Architect – $139,529
  2. PMP® – Project Management Professional – $135,798
  3. Certified ScrumMaster® – $135,441
  4. AWS Certified Solutions Architect – Associate – $132,840
  5. AWS Certified Developer – Associate – $130,369
  6. Microsoft Certified Solutions Expert (MCSE): Server Infrastructure – $121,288
  7. ITIL® Foundation – $120,566
  8. CISM – Certified Information Security Manager – $118,412
  9. CRISC – Certified in Risk and Information Systems Control – $117,395
  10. CISSP – Certified Information Systems Security Professional – $116,900
  11. CEH – Certified Ethical Hacker – $116,306
  12. Citrix Certified Associate – Virtualization (CCA-V) – $113,442
  13. CompTIA Security+ – $110,321
  14. CompTIA Network+ – $107,143
  15. Cisco Certified Networking Professional (CCNP) Routing and Switching – $106,957

According to the 2020 Global Knowledge report, the top-paying cloud certifications for the year are (drumroll, please):

1- Google Certified Professional Cloud Architect — $175,761

2- AWS Certified Solutions Architect – Associate — $149,446

3- AWS Certified Cloud Practitioner — $131,465

4- Microsoft Certified: Azure Fundamentals — $126,653

5- Microsoft Certified: Azure Administrator Associate — $125,993

Sources:

1- Google Cloud

2- Linux Academy

3- WhizLabs

4- GCP Space on Quora

5- Udemy

6- Acloud Guru

7. Question and Answers are sent to us by good people all over the world.

Machine Learning 101 – Top 20 AWS and Google Certified Machine Learning Specialty Questions and Answers Dumps

Azure Administrator AZ-104 Exam Questions and Answers Dumps

Data Center Proxies - Data Collectors - Data Unblockers

The typical Google Machine Learning Engineer salary is $147,218. Machine Learning Engineer salaries at Google can range from $110,000 – $152,183.

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

  • By the end of 2020, 85% of customer interactions will be handled without a human (Call Center, Chatbot, etc…)
  • 61% of marketers say artificial intelligence is the most important aspect of their data strategy.
  • 80% of business and tech leaders say AI already boosts productivity (Robotic Process Automation, Power Automate, etc..)
  • Current AI technology can boost business productivity by up to 40%

What does a Professional Machine Learning Engineer do?

Professional Machine Learning Engineer designs, builds, and productionizes ML models to solve business challenges using Google Cloud technologies and knowledge of proven ML models and techniques. The ML Engineer collaborates closely with other job roles to ensure long-term success of models. The ML Engineer should be proficient in all aspects of model architecture, data pipeline interaction, and metrics interpretation. The ML Engineer needs familiarity with application development, infrastructure management, data engineering, and security. Through an understanding of training, retraining, deploying, scheduling, monitoring, and improving models, they design and create scalable solutions for optimal performance.

The AWS Certified Machine Learning – Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate’s ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.

This blog covers Machine Learning 101, Top 20 AWS Certified Machine Learning Specialty Questions and Answers, Top 20 Google Professional Machine Learning Engineer Sample Questions, Machine Learning Quizzes, Machine Learning Q&A, Top 10 Machine Learning Algorithms, Machine Learning Latest Hot News, Machine Learning Demos (Ex: Tensorflow Demos)

Below are the Top 20 AWS Certified Machine Learning Specialty Questions and Answers Dumps.

Top

Question1: A machine learning team has several large CSV datasets in Amazon S3. Historically, models built with the Amazon SageMaker Linear Learner algorithm have taken hours to train on similar-sized datasets. The team’s leaders need to accelerate the training process. What can a machine learning specialist do to address this concern?

A) Use Amazon SageMaker Pipe mode.
B) Use Amazon Machine Learning to train the models.
C) Use Amazon Kinesis to stream the data to Amazon SageMaker.
D) Use AWS Glue to transform the CSV dataset to the JSON format.
ANSWER1:

A

Notes/Hint1:

Amazon SageMaker Pipe mode streams the data directly to the container, which improves the performance of training jobs. (Refer to this link for supporting information.) In Pipe mode, your training job streams data directly from Amazon S3. Streaming can provide faster start times for training jobs and better throughput. With Pipe mode, you also reduce the size of the Amazon EBS volumes for your training instances. B would not apply in this scenario. C is a streaming ingestion solution, but is not applicable in this scenario. D transforms the data structure.

Reference1: Amazon SageMaker

Question 2) A local university wants to track cars in a parking lot to determine which students are parking in the lot. The university is wanting to ingest videos of the cars parking in near-real time, use machine learning to identify license plates, and store that data in an AWS data store. Which solution meets these requirements with the LEAST amount of development effort?

A) Use Amazon Kinesis Data Streams to ingest the video in near-real time, use the Kinesis Data Streams consumer integrated with Amazon Rekognition Video to process the license plate information, and then store results in DynamoDB.

B) Use Amazon Kinesis Video Streams to ingest the videos in near-real time, use the Kinesis Video Streams integration with Amazon Rekognition Video to identify the license plate information, and then store the results in DynamoDB.

C) Use Amazon Kinesis Data Streams to ingest videos in near-real time, call Amazon Rekognition to identify license plate information, and then store results in DynamoDB.

D) Use Amazon Kinesis Firehose to ingest the video in near-real time and outputs results onto S3. Set up a Lambda function that triggers when a new video is PUT onto S3 to send results to Amazon Rekognition to identify license plate information, and then store results in DynamoDB.

Answer 2)

B

Notes/Hint2)

Kinesis Video Streams is used to stream videos in near-real time. Amazon Rekognition Video uses Amazon Kinesis Video Streams to receive and process a video stream. After the videos have been processed by Rekognition we can output the results in DynamoDB.

Reference: Kinesis Video Streams

Question 3) A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences:

1. Please call the number below.
2. Please do not call us. What are the dimensions of the tf–idf matrix?
A) (2, 16)
B) (2, 8)
C) (2, 10)
D) (8, 10)

ANSWER3:

A

Notes/Hint3:

There are 2 sentences, 8 unique unigrams, and 8 unique bigrams, so the result would be (2,16). The phrases are “Please call the number below” and “Please do not call us.” Each word individually (unigram) is “Please,” “call,” ”the,” ”number,” “below,” “do,” “not,” and “us.” The unique bigrams are “Please call,” “call the,” ”the number,” “number below,” “Please do,” “do not,” “not call,” and “call us.” The tf–idf vectorizer is described at this link.

Reference3:  tf-idf vertorizer

Question 4: A company is setting up a system to manage all of the datasets it stores in Amazon S3. The company would like to automate running transformation jobs on the data and maintaining a catalog of the metadata concerning the datasets. The solution should require the least amount of setup and maintenance. Which solution will allow the company to achieve its goals? 

A) Create an Amazon EMR cluster with Apache Hive installed. Then, create a Hive metastore and a script to run transformation jobs on a schedule.
B) Create an AWS Glue crawler to populate the AWS Glue Data Catalog. Then, author an AWS Glue ETL job, and set up a schedule for data transformation jobs.
C) Create an Amazon EMR cluster with Apache Spark installed. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule. D) Create an AWS Data Pipeline that transforms the data. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
 

ANSWER4:

B

Notes/Hint4:

AWS Glue is the correct answer because this option requires the least amount of setup and maintenance since it is serverless, and it does not require management of the infrastructure. Refer to this link for supporting information. A, C, and D are all solutions that can solve the problem, but require more steps for configuration, and require higher operational overhead to run and maintain.
Reference4:  Glue

Question 5) Which service in the Kinesis family allows you to easily load streaming data into data stores and analytics tools?

A) Kinesis Firehose
B) Kinesis Streams
C) Kinesis Data Analytics
D) Kinesis Video Streams
 

ANSWER5:

A

Notes/Hint5:

Kinesis Firehose is perfect for streaming data into AWS and sending it directly to its final destination – places like S3, Redshift, Elastisearch, and Splunk Instances.

Reference 5): Kinesis Firehose

Question 6) A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the data scientist do to improve the training process? 
A) Increase the learning rate. Keep the batch size the same.
B) Reduce the batch size. Decrease the learning rate.
C) Keep the batch size the same. Decrease the learning rate.
D) Do not change the learning rate. Increase the batch size.
 
Answer  6)

B
 

Notes 6)

It is most likely that the loss function is very curvy and has multiple local minima where the training is getting stuck. Decreasing the batch size would help the data scientist stochastically get out of the local minima saddles. Decreasing the learning rate would prevent overshooting the global loss function minimum. Refer to the paper at this link for an explanation.
Reference 6) : Here

Question 7) Your organization has a standalone Javascript (Node.js) application that streams data into AWS using Kinesis Data Streams. You notice that they are using the Kinesis API (AWS SDK) over the Kinesis Producer Library (KPL). What might be the reasoning behind this?
A) The Kinesis API (AWS SDK) provides greater functionality over the Kinesis Producer Library.
B) The Kinesis API (AWS SDK) runs faster in Javascript applications over the Kinesis Producer Library.
C) The Kinesis Producer Library must be installed as a Java application to use with Kinesis Data Streams.
D) The Kinesis Producer Library cannot be integrated with a Javascript application because of its asynchronous architecture.
Answer 7)

C
Notes/Hint7:

The KPL must be installed as a Java application before it can be used with your Kinesis Data Streams. There are ways to process KPL serialized data within AWS Lambda, in Java, Node.js, and Python, but not if these answers mentions Lambda.
Reference 7) KPL
 
 
Question 8) A data scientist is evaluating different binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result. The models should be evaluated based on the following criteria: 
1) Must have a recall rate of at least 80%
2) Must have a false positive rate of 10% or less
3) Must minimize business costs After creating each binary classification model, the data scientist generates the corresponding confusion matrix. Which confusion matrix represents the model that satisfies the requirements?
A) TN = 91, FP = 9 FN = 22, TP = 78
 B) TN = 99, FP = 1 FN = 21, TP = 79
C) TN = 96, FP = 4 FN = 10, TP = 90
D) TN = 98, FP = 2 FN = 18, TP = 82
 
Answer 8): 

D
 

Notes/Hint 8)


The following calculations are required: TP = True Positive FP = False Positive FN = False Negative TN = True Negative FN = False Negative Recall = TP / (TP + FN) False Positive Rate (FPR) = FP / (FP + TN) Cost = 5 * FP + FN A B C D Recall 78 / (78 + 22) = 0.78 79 / (79 + 21) = 0.79 90 / (90 + 10) = 0.9 82 / (82 + 18) = 0.82 False Positive Rate 9 / (9 + 91) = 0.09 1 / (1 + 99) = 0.01 4 / (4 + 96) = 0.04 2 / (2 + 98) = 0.02 Costs 5 * 9 + 22 = 67 5 * 1 + 21 = 26 5 * 4 + 10 = 30 5 * 2 + 18 = 28 Options C and D have a recall greater than 80% and an FPR less than 10%, but D is the most cost effective. For supporting information, refer to this link.
Reference 8: Here

 
 
Question 9) A data scientist uses logistic regression to build a fraud detection model. While the model accuracy is 99%, 90% of the fraud cases are not detected by the model. What action will definitely help the model detect more than 10% of fraud cases? 
A) Using undersampling to balance the dataset
B) Decreasing the class probability threshold
C) Using regularization to reduce overfitting
D) Using oversampling to balance the dataset
 

Answer  9)

B

 

Notes 9)

Decreasing the class probability threshold makes the model more sensitive and, therefore, marks more cases as the positive class, which is fraud in this case. This will increase the likelihood of fraud detection. However, it comes at the price of lowering precision. This is covered in the Discussion section of the paper at this link
Reference 9: Here

 
Question 10) A company is interested in building a fraud detection model. Currently, the data scientist does not have a sufficient amount of information due to the low number of fraud cases. Which method is MOST likely to detect the GREATEST number of valid fraud cases?
A) Oversampling using bootstrapping
B) Undersampling
C) Oversampling using SMOTE
D) Class weight adjustment
 

Answer  10)

C

 
Notes 10)

With datasets that are not fully populated, the Synthetic Minority Over-sampling Technique (SMOTE) adds new information by adding synthetic data points to the minority class. This technique would be the most effective in this scenario. Refer to Section 4.2 at this link for supporting information.
Reference 10) : Here

 
Question 11) A machine learning engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm. The ML engineer notices the target label classes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%. What should the ML engineer do to minimize bias due to missing values? 
 
A) Replace each missing value by the mean or median across non-missing values in same row.
B) Delete observations that contain missing values because these represent less than 5% of the data.
C) Replace each missing value by the mean or median across non-missing values in the same column.
D) For each feature, approximate the missing values using supervised learning based on other features.
 

Answer  11)

D

 

Notes 11)

Use supervised learning to predict missing values based on the values of other features. Different supervised learning approaches might have different performances, but any properly implemented supervised learning approach should provide the same or better approximation than mean or median approximation, as proposed in responses A and C. Supervised learning applied to the imputation of missing values is an active field of research. Refer to this link for an example.
Reference 11): Here

 
Question 12) A company has collected customer comments on its products, rating them as safe or unsafe, using decision trees. The training dataset has the following features: id, date, full review, full review summary, and a binary safe/unsafe tag. During training, any data sample with missing features was dropped. In a few instances, the test set was found to be missing the full review text field. For this use case, which is the most effective course of action to address test data samples with missing features? 
A) Drop the test samples with missing full review text fields, and then run through the test set.
B) Copy the summary text fields and use them to fill in the missing full review text fields, and then run through the test set.
C) Use an algorithm that handles missing data better than decision trees.
D) Generate synthetic data to fill in the fields that are missing data, and then run through the test set.
 
Answer  12)

B

 

 

Notes 12) 

In this case, a full review summary usually contains the most descriptive phrases of the entire review and is a valid stand-in for the missing full review text field. For supporting information, refer to page 1627 at this link, and this link and this link.

Reference 12) Here


 
Question 13) An insurance company needs to automate claim compliance reviews because human reviews are expensive and error-prone. The company has a large set of claims and a compliance label for each. Each claim consists of a few sentences in English, many of which contain complex related information. Management would like to use Amazon SageMaker built-in algorithms to design a machine learning supervised model that can be trained to read each claim and predict if the claim is compliant or not. Which approach should be used to extract features from the claims to be used as inputs for the downstream supervised task? 
A) Derive a dictionary of tokens from claims in the entire dataset. Apply one-hot encoding to tokens found in each claim of the training set. Send the derived features space as inputs to an Amazon SageMaker builtin supervised learning algorithm.
B) Apply Amazon SageMaker BlazingText in Word2Vec mode to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
C) Apply Amazon SageMaker BlazingText in classification mode to labeled claims in the training set to derive features for the claims that correspond to the compliant and non-compliant labels, respectively.
D) Apply Amazon SageMaker Object2Vec to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
 

Answer  13)

D

 

Notes 13)

Amazon SageMaker Object2Vec generalizes the Word2Vec embedding technique for words to more complex objects, such as sentences and paragraphs. Since the supervised learning task is at the level of whole claims, for which there are labels, and no labels are available at the word level, Object2Vec needs be used instead of Word2Vec.

Reference 13)  Amazon SageMaker
Object2Vec 

Question 14) You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to immediately be processed before operations can continue. The second event type includes data of less importance, but operations can continue without immediately processing. What is the most appropriate solution to record these different types of events?

A) Capture both events with the PutRecords API call.
B) Capture both event types using the Kinesis Producer Library (KPL).
C) Capture the mission critical events with the PutRecords API call and the second event type with the Kinesis Producer Library (KPL).
D) Capture the mission critical events with the Kinesis Producer Library (KPL) and the second event type with the Putrecords API call.
 

Answer  14)

C

 

Notes 14)

The question is about sending data to Kinesis synchronously vs. asynchronously. PutRecords is a synchronous send function, so it must be used for the first event type (critical events). The Kinesis Producer Library (KPL) implements an asynchronous send function, so it can be used for the second event type. In this scenario, the reason to use the KPL over the PutRecords API call is because: KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable). Larger values of RecordMaxBufferedTime results in higher packing efficiencies and better performance. Applications that cannot tolerate this additional delay may need to use the AWS SDK directly. For more information about using the AWS SDK with Kinesis Data Streams, see Developing Producers Using the Amazon Kinesis Data Streams API with the AWS SDK for Java. For more information about RecordMaxBufferedTime and other user-configurable properties of the KPL, see Configuring the Kinesis Producer Library.

Reference 14: KCL vs PutRecords


Question 15) You are collecting clickstream data from an e-commerce website to make near-real time product suggestions for users actively using the site. Which combination of tools can be used to achieve the quickest recommendations and meets all of the requirements?

A) Use Kinesis Data Streams to ingest clickstream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
B) Use Kinesis Data Firehose to ingest click stream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions, then use Lambda to load these results into S3.
C) Use Kinesis Data Streams to ingest clickstream data, then use Lambda to process that data and write it to S3. Once the data is on S3, use Athena to query based on conditions that data and make real time recommendations to users.
D) Use the Kinesis Data Analytics to ingest the clickstream data directly and run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
 

Answer  15)

A

 

Notes 15)

Kinesis Data Analytics gets its input streaming data from Kinesis Data Streams or Kinesis Data Firehose. You can use Kinesis Data Analytics to run real-time SQL queries on your data. Once certain conditions are met you can trigger Lambda functions to make real time product suggestions to users. It is not important that we store or persist the clickstream data.

Reference 15: Kinesis Data Analytics

Question 16) Which service built by AWS makes it easy to set up a retry mechanism, aggregate records to improve throughput, and automatically submits CloudWatch metrics?

A) Kinesis API (AWS SDK)
B) Kinesis Producer Library (KPL)
C) Kinesis Consumer Library
D) Kinesis Client Library (KCL)

Answer  16)

B

 

Notes 16)

Although the Kinesis API built into the AWS SDK can be used for all of this, the Kinesis Producer Library (KPL) makes it easy to integrate all of this into your applications.

Reference 16:  Kinesis Producer Library (KPL) 


Question 17) You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data that you are ingesting is players controller inputs every 1 second (up to 10 players in a game) that is in JSON format. The data needs to be ingested through Kinesis Data Streams and the JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data?

A) 10 shards
B) Greater than 500 shards, so you’ll need to request more shards from AWS
C) 1 shard
D) 100 shards

Answer  17)

C

 

Notes 17)

In this scenario, there will be a maximum of 10 records per second with a max payload size of 1000 KB (10 records x 100 KB = 1000KB) written to the shard. A single shard can ingest up to 1 MB of data per second, which is enough to ingest the 1000 KB from the streaming game play. Therefor 1 shard is enough to handle the streaming data.

Reference 17: shards

Question 18) Which services in the Kinesis family allows you to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time?

A) Kinesis Streams
B) Kinesis Firehose
C) Kinesis Video Streams
D) Kinesis Data Analytics

Answer  18)

D

 

Notes 18)

Kinesis Data Analytics allows you to run real-time SQL queries on your data to gain insights and respond to events in real time.

Reference 18: Kinesis Data Analytics


Question 19) You are a ML specialist needing to collect data from Twitter tweets. Your goal is to collect tweets that include only the name of your company and the tweet body, and store it off into a data store in AWS. What set of tools can you use to stream, transform, and load the data into AWS with the LEAST amount of effort?

A) Setup a Kinesis Data Firehose for data ingestion and immediately write that data to S3. Next, setup a Lambda function to trigger when data lands in S3 to transform it and finally write it to DynamoDB.
B) Setup A Kinesis Data Stream for data ingestion, setup EC2 instances as data consumers to poll and transform the data from the stream. Once the data is transformed, make an API call to write the data to DynamoDB.
C) Setup Kinesis Data Streams for data ingestion. Next, setup Kinesis Data Firehouse to load that data into RedShift. Next, setup a Lambda function to query data using RedShift spectrum and store the results onto DynamoDB.
D) Create a Kinesis Data Stream to ingest the data. Next, setup a Kinesis Data Firehose and use Lambda to transform the data from the Kinesis Data Stream, then use Lambda to write the data to DynamoDB. Finally, use S3 as the data destination for Kinesis Data Firehose.
 

Answer 19)

A

Notes 19)

All of these could be used to stream, transform, and load the data into an AWS data store. The setup that requires the LEAST amount of effort and moving parts involves setting up a Kinesis Data Firehose to stream the data into S3, have it transformed by Lambda with an S3 trigger, and then written to DynamoDB.

Reference 19: Kinesis Data Firehose to stream the data into S3

Question 20) Which service in the Kinesis family allows you to build custom applications that process or analyze streaming data for specialized needs?

A) Kinesis Firehose
B) Kinesis Streams
C) Kinesis Video Streams
D) Kinesis Data Analytics

Answer 20)

B

Notes 20)

Kinesis Streams allows you to stream data into AWS and build custom applications around that streaming data.

Reference 20: Kinesis Streams


 

Top

Top 10 Google Professional Machine Learning Engineer Sample Questions

Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?

A. Use K-fold cross validation to understand how the model performs on different test datasets.

B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.

C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.

D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.

Answer 1)

B

Notes 1)

B is correct because it identifies the pixel of the input image that leads to the classification of the image itself.

Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?

A. Train the model for a few iterations, and check for NaN values.
B. Train the model for a few iterations, and verify that the loss is constant.
C. Train a simple linear model, and determine if the DNN model outperforms it.
D. Train the model with no regularization, and verify that the loss function is close to zero.
 

Answer 2)

D

Notes 2)

D is correct because the test can check that the model has enough parameters to memorize the task.


Question 3: Your team is using a TensorFlow Inception-v3 CNN model pretrained on ImageNet for an image classification prediction challenge on 10,000 images. You will use AI Platform to perform the model training. What TensorFlow distribution strategy and AI Platform training job configuration should you use to train the model and optimize for wall-clock time?

 

A. Default Strategy; Custom tier with a single master node and four v100 GPUs.
B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.
C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.
D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.
 

Answer 3)

D

Notes 3)

D is correct because this is the only strategy that can perform distributed training; albeit there is only a single copy of the variables on the CPU host.

Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?

 
A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version
 
B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version
 
C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment
D. Create a new AI Platform model version – > Deploy model in test environment -> Validate model
 
Answer 4)

A
 
Notes 4)

A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.
 
Question 5: You work for a maintenance company and have built and trained a deep learning model that identifies defects based on thermal images of underground electric cables. Your dataset contains 10,000 images, 100 of which contain visible defects. How should you evaluate the performance of the model on a test dataset?
 
A. Calculate the Area Under the Curve (AUC) value.
 
B. Calculate the number of true positive results predicted by the model.
C. Calculate the fraction of images predicted by the model to have a visible defect.
D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.
 
Answer 5)

A
 
Notes 5)

A is correct because it is scale-invariant. AUC measures how well predictions are ranked, rather than their absolute values. AUC is also classification-threshold invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.
 
Question 6: You work for a manufacturing company that owns a high-value machine which has several machine settings and multiple sensors. A history of the machine’s hourly sensor readings and known failure event data are stored in BigQuery. You need to predict if the machine will fail within the next 3 days in order to schedule maintenance before the machine fails. Which data preparation and model training steps should you take?

 

A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML
 
B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False
D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
Answer 6)

D
 
Notes 6)

D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
 
 
Question 7: You are an ML engineer at a media company. You need to build an ML model to analyze video content frame-by-frame, identify objects, and alert users if there is inappropriate content. Which Google Cloud products should you use to build this project?

 

A. Pub/Sub, Cloud Function, Cloud Vision API
 
B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging
C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging
D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging
 
Answer 7)

C
 
Notes 7)

C is correct as Video Intelligence API can find inappropriate components and other components satisfy the requirements of real-time processing and notification.
 
Question 8: You work for a large retailer. You want to use ML to forecast future sales leveraging 10 years of historical sales data. The historical data is stored in Cloud Storage in Avro format. You want to rapidly experiment with all the available data. How should you build and train your model for the sales forecast?
 
A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.
B. Convert the data into CSV format and create a regression model on AutoML Tables.
C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.
D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.
 
Answer 8)

A
 
Notes 8)

A is correct because BigQuery ML is designed for fast and rapid experimentation and it is possible to use federated queries to read data directly from Cloud Storage. Moreover, ARIMA is considered one of the best in class for time series forecasting.
 
Question 9) You need to build an object detection model for a small startup company to identify if and where the company’s logo appears in an image. You were given a large repository of images, some with logos and some without. These images are not yet labelled. You need to label these pictures, and then train and deploy the model. What should you do?

 

A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.
B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.
 
C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.
D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.
 
Answer 9)

A
 
Notes 9)

A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.
 

Question 10) You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer’s stated intention for contacting customer service. About 70% of customer inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer and more complicated requests. Which intents should you automate first?

A. Automate a blend of the shortest and longest intents to be representative of all intents.
B. Automate the more complicated requests first because those require more of the agents’ time.
C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.
 
D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.
Answer 10)

C
 
Notes 10)


 

Machine Learning Q&A Part I:

Google.

Azure and AWS are second class citizens in this area.

Sure, AWS has 70% of the market.

Sure, Azure is the easiest turn key and super user friendly.

But, the king of machine learning in the cloud is GCP.

GCP = Google Cloud Platform

Google has the largest data science team in the world, not mention they have Hinton.

Let’s forgot for a minute they created TensorFlow and give it away.

Let’s just talk about building a real world model with data that doesn’t fit into a excel spreadsheet.

The vast majority of applied machine learning is supervised and that means we need data.

Not just normal data, we need very clean highly structured data.

Where’s the easiest place in the world to upload and model a Petabyte of structured dataBigQuery of course.

Why BigQuery? I don’t have to do anything but upload my data. No spinning up RedShit clusters or whatever I have to do in Azure, just upload and massage data with my familiar SQL. If I do have to wrangle my data it won’t take my six months to update 5 rows here, minutes usually.

Then, you’ll need a front end. Cloud datalab is a Jupyter notebook, which is good because I don’t want nor do I need anything else.

Then, with a single line of code I connect by datalab (Jupyter) notebook to my data in BigQuery and build away.

I’ve worked in all three and the only thing I care about is getting to my job the fastest and right now that means I build my models in GCP.

If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.

The course below is free to the first 20.

The Complete Python Course for Machine Learning Engineers

Here, I want to share the best research paper on Machine Learning classification methods, titled ‘Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?’, published in the ‘Journal of Machine Learning Research’.

This paper nicely explained 179 classification techniques and applied them on 121 data sets thus sharing small summary of the paper:

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

 
 
 

The paper evaluated 179 classifiers arising from 17 ML families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R ( with and without the caret package), C and Matlab, including all the relevant classifiers available today.

Experiments used total 121 data sets , which represent the whole UCI data base (excluding the large-scale problems) and other own real problems, in order to achieve significant conclusions about the classifier behaviour, not dependent on the data set collection.

The whole data set and partitions are available from: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/data.tar.gz

The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets. However, the difference is not statistically significant with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package).

The random forest is clearly the best family of classifiers (3 out of 5 bests classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).

You can see the table with the complete results: http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/results.txt

I hope it will be helpful for Statistic and Machine Leaning aspirants!

Thank you!

 
 
 

At a high level, these skills are a combination of software and data engineering.

The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

  • Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
  • Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
  • Model versioning: add a hash key to your different models. You will thank me later.
  • Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
  • Monitor performances: execution time and statistical scores of your models.
  • Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..

Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:

  1. Not understanding the structure of the dataset
  2. Not giving proper care during features selection
  3. Leaving out categorical features and considering just numerical variables
  4. Falling into dummy variable trap
  5. Selection of inefficient machine learning algorithm
  6. Not trying out various ML algorithms for building the model based on structure of data.
  7. Improper tuning of model parameters
  8. Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
  9. Read more here…


Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.

However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….

The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.

Thus, the data science life-cycle can include the following steps:

  1. Business requirement understanding.
  2. Data collection.
  3. Data cleaning.
  4. Data analysis.
  5. Modeling.
  6. Performance evaluation.
  7. Communicating with stakeholders.
  8. Deployment.
  9. Real-world testing.
  10. Business buy-in.
  11. Support and maintenance.

Looks neat, but here is the scheme to visualize how it is happening in reality:

Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.

Read more here….

 

Top

 

Machine Learning Q&A -Part II:

 
 
 

At a high level, these skills are a combination of software and data engineering.

The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.

That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:

  • Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
  • Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
  • Model versioning: add a hash key to your different models. You will thank me later.
  • Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
  • Monitor performances: execution time and statistical scores of your models.
  • Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..

Some of the mistakes that might involve during building a machine learning model (I can think of) are listed here:

  1. Not understanding the structure of the dataset
  2. Not giving proper care during features selection
  3. Leaving out categorical features and considering just numerical variables
  4. Falling into dummy variable trap
  5. Selection of inefficient machine learning algorithm
  6. Not trying out various ML algorithms for building the model based on structure of data.
  7. Improper tuning of model parameters
  8. Most importantly: Building an idiotstic imperfect model i.e. suppose we have a classification problem with 99% chances of falling into class1 and remaining to class2. The built model may develop a mapping function which all the time for all data inputs, may predict the result to be class1. Well, one might say his/her model has 99% accuracy. But in reality the 1% class2 case hasn’t been included in the model. So this must be taken into consideration.
  9. Read more here…

Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.

That’s just the surface-level comparison though. The image above gives an overview of how the two differ.

One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output are patterns or trends, which doesn’t require coming up with a statement or fact to test.

However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….

The data science life cycle is not something well-defined like the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent science tasks using a variety of tools, techniques, programming, etc.

Thus, the data science life-cycle can include the following steps:

  1. Business requirement understanding.
  2. Data collection.
  3. Data cleaning.
  4. Data analysis.
  5. Modeling.
  6. Performance evaluation.
  7. Communicating with stakeholders.
  8. Deployment.
  9. Real-world testing.
  10. Business buy-in.
  11. Support and maintenance.

Looks neat, but here is the scheme to visualize how it is happening in reality:

Agile development processes, especially continuous delivery lends itself well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build it.

Read more here….

 

Top

Machine Learning Latest News

Top

Top 10 Machine Learning Algorithms

Source: Top 10 Machine Learning Algorithms for Data Scientist

In machine learning, there’s something called the “No Free Lunch” theorem. In a nutshell, it states that no one algorithm works best for every problem. It’s especially relevant for supervised learning. For example, you can’t say that neural networks are always better than decision trees or vice-versa. Furthermore, there are many factors at play, such as the size and structure of your dataset. As a result, you should try many different algorithms for your problem!

Top ML Algorithms

1. Linear Regression

Regression is a technique for numerical prediction. Additionally, regression is a statistical measure that attempts to determine the strength of the relationship between two variables. One is a dependent variable. Other is from a series of other changing variables which are our independent variables. Moreover, just like Classification is for predicting categorical labels, Regression is for predicting a continuous value. For example, we may wish to predict the salary of university graduates with 5 years of work experience. We use regression to determine how much specific factors or sectors influence the dependent variable.

Linear regression attempts to model the relationship between a scalar variable and explanatory variables by fitting a linear equation. For example, one might want to relate the weights of individuals to their heights using a linear regression model.

Additionally, this operator calculates a linear regression model. It uses the Akaike criterion for model selection. Furthermore, the Akaike information criterion is a measure of the relative goodness of a fit of a statistical model.

2. Logistic Regression

Logistic regression is a classification model. It uses input variables to predict a categorical outcome variable. The variable can take on one of a limited set of class values. A binomial logistic regression relates to two binary output categories. A multinomial logistic regression allows for more than two classes. Examples of logistic regression include classifying a binary condition as “healthy” / “not healthy”. Logistic regression applies the logistic sigmoid function to weighted input values to generate a prediction of the data class.

A logistic regression model estimates the probability of a dependent variable as a function of independent variables. The dependent variable is the output that we are trying to predict. The independent variables or explanatory variables are the factors that we feel could influence the output. Multiple regression refers to regression analysis with two or more independent variables. Multivariate regression, on the other hand, refers to regression analysis with two or more dependent variables.

3. Linear Discriminant Analysis

Logistic Regression is a classification algorithm traditionally for two-class classification problems. If you have more than two classes then the Linear Discriminant Analysis algorithm is the preferred linear classification technique.

The representation of LDA is pretty straight forward. It consists of statistical properties of your data, calculated for each class. For a single input variable this includes:

  1. The mean value for each class.
  2. The variance calculated across all classes.

We make predictions by calculating a discriminate value for each class. After that we make a prediction for the class with the largest value. The technique assumes that the data has a Gaussian distribution. Hence, it is a good idea to remove outliers from your data beforehand. It’s a simple and powerful method for classification predictive modelling problems.

4. Classification and Regression Trees

Prediction Trees are for predicting response or class YY from input X1, X2,…,XnX1,X2,…,Xn. If it is a continuous response it is a regression tree, if it is categorical, it is a classification tree. At each node of the tree, we check the value of one the input XiXi. Depending on the (binary) answer we continue to the left or to the right subbranch. When we reach a leaf we will find the prediction.

Contrary to linear or polynomial regression which are global models, trees try to partition the data space into small enough parts where we can apply a simple different model on each part. The non-leaf part of the tree is just the procedure to determine for each data xx what is the model we will use to classify it.

5. Naive Bayes

A Naive Bayes Classifier is a supervised machine-learning algorithm that uses the Bayes’ Theorem, which assumes that features are statistically independent. The theorem relies on the naive assumption that input variables are independent of each other, i.e. there is no way to know anything about other variables when given an additional variable. Regardless of this assumption, it has proven itself to be a classifier with good results.

Naive Bayes Classifiers rely on the Bayes’ Theorem, which is based on conditional probability or in simple terms, the likelihood that an event (A) will happen given that another event (B) has already happened. Essentially, the theorem allows a hypothesis to be updated each time new evidence is introduced. The equation below expresses Bayes’ Theorem in the language of probability:

Let’s explain what each of these terms means.

  • “P” is the symbol to denote probability.
  • P(A | B) = The probability of event A (hypothesis) occurring given that B (evidence) has occurred.
  • P(B | A) = The probability of the event B (evidence) occurring given that A (hypothesis) has occurred.
  • P(A) = The probability of event B (hypothesis) occurring.
  • P(B) = The probability of event A (evidence) occurring.

6. K-Nearest Neighbors

k-nearest neighbours (or k-NN for short) is a simple machine learning algorithm that categorizes an input by using its k nearest neighbours.

For example, suppose a k-NN algorithm has an input of data points of specific men and women’s weight and height, as plotted below. To determine the gender of an unknown input (green point), k-NN can look at the nearest k neighbours (suppose ) and will determine that the input’s gender is male. This method is a very simple and logical way of marking unknown inputs, with a high rate of success.

Also, we can k-NN in a variety of machine learning tasks; for example, in computer vision, k-NN can help identify handwritten letters and in gene expression analysis, the algorithm can determine which genes contribute to a certain characteristic. Overall, k-nearest neighbours provide a combination of simplicity and effectiveness that makes it an attractive algorithm to use for many machine learning tasks.

7. Learning Vector Quantization

A downside of K-Nearest Neighbors is that you need to hang on to your entire training dataset. The Learning Vector Quantization algorithm (or LVQ for short) is an artificial neural network algorithm that allows you to choose how many training instances to hang onto and learns exactly what those instances should look like.

Additionally, the representation for LVQ is a collection of codebook vectors. We select them randomly in the beginning and adapted to best summarize the training dataset over a number of iterations of the learning algorithm. After learned, the codebook vectors can make predictions just like K-Nearest Neighbors. Also, we find the most similar neighbour (best matching codebook vector) by calculating the distance between each codebook vector and the new data instance. The class value or (real value in the case of regression) for the best matching unit is then returned as the prediction. Moreover, you can get the best results if you rescale your data to have the same range, such as between 0 and 1.

If you discover that KNN gives good results on your dataset try using LVQ to reduce the memory requirements of storing the entire training dataset.

8. Bagging and Random Forest

A Random Forest consists of a collection or ensemble of simple tree predictors, each capable of producing a response when presented with a set of predictor values. For classification problems, this response takes the form of a class membership, which associates, or classifies, a set of independent predictor values with one of the categories present in the dependent variable. Alternatively, for regression problems, the tree response is an estimate of the dependent variable given the predictors.e

A Random Forest consists of an arbitrary number of simple trees, which determine the final outcome. For classification problems, the ensemble of simple trees votes for the most popular class. In the regression problem, we average responses to obtain an estimate of the dependent variable. Using tree ensembles can lead to significant improvement in prediction accuracy (i.e., better ability to predict new data cases).

9. SVM

A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes. Also, SVMs have more common usage in classification problems and as such, this is what we will focus on in this post.

SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes, as shown in the image below.

Also, you can think of a hyperplane as a line that linearly separates and classifies a set of data.

Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We, therefore, want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it.

So when we add a new testing data , whatever side of the hyperplane it lands will decide the class that we assign to it.

The distance between the hyperplane and the nearest data point from either set is the margin. Furthermore, the goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of correct classification of data.

But the data is rarely ever as clean as our simple example above. A dataset will often look more like the jumbled balls below which represent a linearly non-separable dataset.

10. Boosting and AdaBoost

Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers. We do this by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. We can add models until the training set is predicted perfectly or a maximum number of models are added.

AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting. Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.

AdaBoost is used with short decision trees. After the first tree is created, the performance of the tree on each training instance is used to weight how much attention the next tree that is created should pay attention to each training instance. Training data that is hard to predict is given more weight, whereas easy to predict instances are given less weight. Models are created sequentially one after the other, each updating the weights on the training instances that affect the learning performed by the next tree in the sequence. After all the trees are built, predictions are made for new data, and the performance of each tree is weighted by how accurate it was on training data.

Because so much attention is put on correcting mistakes by the algorithm it is important that you have clean data with outliers removed.

Summary

A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is “which algorithm should I use?” The answer to the question varies depending on many factors, including: (1) The size, quality, and nature of data; (2) The available computational time; (3) The urgency of the task; and (4) What you want to do with the data.

Even an experienced data scientist cannot tell which algorithm will perform the best before trying different algorithms. Although there are many other Machine Learning algorithms, these are the most popular ones. If you’re a newbie to Machine Learning, these would be a good starting point to learn.

Follow this link, if you are looking to learn Data Science Course Online!

Additionally, if you are having an interest in learning Data Science, Learn online Data Science Course to boost your career in Data Science.

Also, learn AWS Big Data Course click here, AWS Online Course

Furthermore, if you want to read more about data science, read this Data Science blogs

Top

The foundations of most algorithms lie in linear algebra, multivariable calculus, and optimization methods. Most algorithms use a sequence of combinations to estimate an objective function given a set of data, and the sequence order and included methods distinguish one algorithm from another. It’s helpful to learn enough math to read the development papers associated with key algorithms in the field, as many other methods (or one’s own innovations) include pieces of those algorithms. It’s like learning the language of machine learning. Once you are fluent in it, it’s pretty easy to modify algorithms as needed and create new ones likely to improve on a problem in a short period of time.

Matrix factorization: a simple, beautiful way to do dimensionality reduction —and dimensionality reduction is the essence of cognition. Recommender systems would be a big application of matrix factorization. Another application I’ve been using over the years (starting in 2010 with video data) is factorizing a matrix of pairwise mutual information (or pointwise mutual information, which is more common) between features, which can be used for feature extraction, computing word embeddings, computing label embeddings (that was the topic of a recent paper of mine [1]), etc.

Used in a convolutional settings, this acts as an excellent unsupervised feature extractor for images and videos. There’s one big issue though: it is fundamentally a shallow algorithm. Deep neural networks will quickly outperform it if any kind of supervision labels are available.

[1] [1607.05691] Information-theoretical label embeddings for large-scale image classification

Machine Learning Demos:

1- TensorFlow Demos

LipSync by YouTube

See how well you synchronize to the lyrics of the popular hit “Dance Monkey.” This in-browser experience uses the Facemesh model for estimating key points around the lips to score lip-syncing accuracy.Explore demo  View code  

Emoji Scavenger Hunt

Use your phone’s camera to identify emojis in the real world. Can you find all the emojis before time expires?Explore demo  View code  

Webcam Controller

Play Pac-Man using images trained in your browser.Explore demo  View code  

Teachable Machine

No coding required! Teach a machine to recognize images and play sounds.Explore demo  View code  

Move Mirror

Explore pictures in a fun new way, just by moving around.Explore demo  View code  

Performance RNN

Enjoy a real-time piano performance by a neural network.Explore demo  View code  

Node.js Pitch Prediction

Train a server-side model to classify baseball pitch types using Node.js.View code  

Visualize Model Training

See how to visualize in-browser training and model behaviour and training using tfjs-vis.Explore demo  View code  

Community demos

Get started with official templates and explore top picks from the community for inspiration.Glitch 

Check out community Glitches and make your own TensorFlow.js-powered projects.Explore Glitch  Codepen 

Fork boilerplate templates and check out working examples from the community.Explore CodePen  GitHub Community Projects 

See what the community has created and submitted to the TensorFlow.js gallery page.Explore GitHub  

https://cdpn.io/jasonmayes/fullcpgrid/QWbNeJdOpen in Editor

Real time body segmentation using TensorFlow.js

Load in a pre-trained Body-Pix model from the TensorFlow.js team so that you can locate all pixels in an image that are part of a body, and what part of the body they belong to. Clone this to make your own TensorFlow.js powered projects to recognize body parts in images from your webcam and more!

New Pen from Templatehttps://cdpn.io/jasonmayes/fullcpgrid/qBEJxggOpen in Editor

Multiple object detection using pre trained model in TensorFlow.js

This demo shows how we can use a pre made machine learning solution to recognize objects (yes, more than one at a time!) on any image you wish to present to it. Even better, not only do we know that the image contains an object, but we can also get the co-ordinates of the bounding box for each object it finds, which allows you to highlight the found object in the image.

For this demo we are loading a model using the ImageNet-SSD architecture, to recognize 90 common objects it has already been taught to find from the COCO dataset.

If what you want to recognize is in that list of things it knows about (for example a cat, dog, etc), this may be useful to you as is in your own projects, or just to experiment with Machine Learning in the browser and get familiar with the possibilities of machine learning.

If you are feeling particularly confident you can check out our GitHub documentation (https://github.com/tensorflow/tfjs-models/tree/master/coco-ssd) which goes into much more detail for customizing various parameters to tailor performance to your needs.

New Pen from Templatehttps://cdpn.io/jasonmayes/fullcpgrid/JjompwwOpen in Editor

Classifying images using a pre trained model in TensorFlow.js

This demo shows how we can use a pre made machine learning solution to classify images (aka a binary image classifier). It should be noted that this model works best when a single item is in the image at a time. Busy images may not work so well. You may want to try our demo for Multiple Object Detection (https://codepen.io/jasonmayes/pen/qBEJxgg) for that.

For this demo we are loading a model using the MobileNet architecture, to recognize 1000 common objects it has already been taught to find from the ImageNet data set (http://image-net.org/).

If what you want to recognize is in that list of things it knows about (for example a cat, dog, etc), this may be useful to you as is in your own projects, or just to experiment with Machine Learning in the browser and get familiar with the possibilities of machine learning.

Please note: This demo loads an easy to use JavaScript class made by the TensorFlow.js team to do the hardwork for you so no machine learning knowledge is needed to use it.

If you were looking to learn how to load in a TensorFlow.js saved model directly yourself then please see our tutorial on loading TensorFlow.js models directly.

If you want to train a system to recognize your own objects, using your own data, then check out our tutorials on “transfer learning”.

New Pen from TemplateOpen in Editor

Tensorflow.js Boilerplate

The hello world for TensorFlow.js 🙂 Absolute minimum needed to import into your website and simply prints the loaded TensorFlow.js version. From here we can do great things. Clone this to make your own TensorFlow.js powered projects or if you are following a tutorial that needs TensorFlow.js to work.

New Pen from Template

Examples

tfjs-examples provides small code examples that implement various ML tasks using TensorFlow.js.MNIST Digit Recognizer

Train a model to recognize handwritten digits from the MNIST database.Explore example  View code