Djamgatech – Multilingual and Platform Independent Cloud Certification and Education App for AWS Azure Google Cloud
Djamgatech is the ultimate Cloud Education Certification App. It is an EduFlix App for AWS, Azure, Google Cloud Certification Prep, School Subjects, Python, Math, SAT, etc.[Android, iOS]
Technology is changing and is moving towards the cloud. The cloud will power most businesses in the coming years and is not taught in schools. How do we ensure that our kids and youth and ourselves are best prepared for this challenge?
Building mobile educational apps that work offline and on any device can help greatly in that sense.
The ability to tap a button, learn cloud fundamentals, and take quizzes is a great opportunity to help our children and youth boost their job prospects and be more productive at work.
Get 20% off Google Workspace (Google Meet) Standard Plan with the following code: 96DRHDRA9J7GTN6. Get 20% off Google Workspace (Google Meet) Business Plan (AMERICAS): M9HNXHX3WC9H7YE (Email us for more codes)
Features:
– Practice exams
– 1000+ Q&A updated frequently
– 3+ practice exams per certification
– Scorecard / scoreboard to track your progress
– Quizzes with score tracking, progress bar, and countdown timer
– Scoreboard visible only after completing the quiz
– FAQs for the most popular cloud services
– Cheat sheets
– Flashcards
– Works offline
Note and disclaimer: We are not affiliated with AWS, Azure, Microsoft or Google. The questions are put together based on the certification study guide and materials available online. The questions in this app should help you pass the exam but it is not guaranteed. We are not responsible for any exam you did not pass.
Important: To succeed with the real exam, do not memorize the answers in this app. It is very important that you understand why a question is right or wrong and the concepts behind it by carefully reading the reference documents in the answers.
Top 50 Google Certified Cloud Professional Architect Exam Questions and Answers Dumps
GCP, Google Cloud Platform, has been a game changer in the tech industry. It allows organizations to build and run applications on Google’s infrastructure. The GCP platform is trusted by many companies because it is reliable, secure and scalable. In order to become a GCP certified professional, one must pass the GCP Professional Architect exam. The GCP Professional Architect exam is not easy, but with the right practice questions and answers dumps, you can pass the GCP PA exam with flying colors.
Google Certified Cloud Professional Architect is among the highest-paying certifications in the world: Google Certified Professional Cloud Architect average salary – $175,761
The Google Certified Cloud Professional Architect Exam assesses your ability to:
Designing and planning a cloud solution architecture: 36%
This domain tests your ability to design a solution infrastructure that meets business and technical requirements and considers network, storage and compute resources. It also tests your ability to create a migration plan and to envision future solution improvements.
Managing and provisioning a solution Infrastructure: 20%
This domain will test your ability to configure network topologies, individual storage systems and design solutions using Google Cloud networking, storage and compute services.
This domain assesses your ability to design for security and compliance by considering IAM policies, separation of duties, and encryption of data, and to account for compliance requirements such as those for healthcare and financial information.
This domain tests your ability to advise development/operation team(s) to make sure your solution is deployed successfully. It also tests your ability to interact with Google Cloud using the GCP SDK (gcloud, gsutil, and bq).
This domain tests your ability to run your solutions reliably in Google Cloud by building monitoring and logging solutions, quality control measures and by creating release management processes.
Analyzing and optimizing technical and business processes: 16%
This domain will test how you analyze and define technical processes, business processes and develop procedures to ensure resilience of your solutions in production.
Below are the Top 50 Google Certified Cloud Professional Architect Exam Questions and Answers Dumps that will help you ace the GCP Professional Architect exam:
You will need to have the three case studies referred to in the exam open in separate tabs in order to complete the exam: Company A, Company B, and Company C
Question 1: Because you do not know every possible future use for the data Company A collects, you have decided to build a system that captures and stores all raw data in case you need it later. How can you most cost-effectively accomplish this goal?
A. Have the vehicles in the field stream the data directly into BigQuery.
B. Have the vehicles in the field pass the data to Cloud Pub/Sub and dump it into a Cloud Dataproc cluster that stores data in Apache Hadoop Distributed File System (HDFS) on persistent disks.
C. Have the vehicles in the field continue to dump data via FTP, adjust the existing Linux machines, and use a collector to upload them into Cloud Dataproc HDFS for storage.
D. Have the vehicles in the field continue to dump data via FTP, and adjust the existing Linux machines to immediately upload it to Cloud Storage with gsutil.
ANSWER1:
D
Notes/References1:
D is correct because several load-balanced Compute Engine VMs would suffice to ingest 9 TB per day, and Cloud Storage is the cheapest per-byte storage offered by Google. Depending on the format, the data could be available via BigQuery immediately, or shortly after running through an ETL job. Thus, this solution meets business and technical requirements while optimizing for cost.
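As a rough sanity check on the 9 TB per day figure in the note above, the sustained ingest rate can be worked out directly. This is a back-of-the-envelope sketch that assumes decimal terabytes and a perfectly even upload rate, neither of which the case study states; the function name is illustrative only.

```python
# Back-of-the-envelope ingest rate for 9 TB/day.
# Assumptions: decimal units (1 TB = 10**12 bytes) and evenly spread uploads.
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_mb_per_s(tb_per_day: float) -> float:
    """Average rate in MB/s needed to move tb_per_day terabytes in one day."""
    bytes_per_day = tb_per_day * 10**12
    return bytes_per_day / SECONDS_PER_DAY / 10**6


rate = sustained_rate_mb_per_s(9)
print(f"{rate:.1f} MB/s")  # about 104.2 MB/s
```

Real traffic would be burstier than this average, so the number is only a lower bound on the capacity the ingest tier needs.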
Question 2: Today, Company A maintenance workers receive interactive performance graphs for the last 24 hours (86,400 events) by plugging their maintenance tablets into the vehicle. The support group wants support technicians to view this data remotely to help troubleshoot problems. You want to minimize the latency of graph loads. How should you provide this functionality?
A. Execute queries against data stored in a Cloud SQL.
B. Execute queries against data indexed by vehicle_id.timestamp in Cloud Bigtable.
C. Execute queries against data stored on daily partitioned BigQuery tables.
D. Execute queries against BigQuery with data stored in Cloud Storage via BigQuery federation.
ANSWER2:
B
Notes/References2:
B is correct because Cloud Bigtable is optimized for time-series data. It is cost-efficient, highly available, and low-latency. It scales well. Best of all, it is a managed service that does not require significant operations work to keep running.
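The row key in option B puts the vehicle ID first and the timestamp second, so all readings for one vehicle sort next to each other and "the last 24 hours" becomes a single contiguous range scan. The sketch below models that idea with a sorted in-memory list; it is not the Bigtable client API, and the key format (a `#` separator with zero-padded epoch seconds) and the `SortedTable` class are hypothetical.

```python
import bisect


def row_key(vehicle_id: str, epoch_seconds: int) -> str:
    # Zero-padding keeps lexicographic order equal to numeric order.
    return f"{vehicle_id}#{epoch_seconds:010d}"


class SortedTable:
    """Toy stand-in for a store that keeps rows sorted by key."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start_key, end_key):
        """Return (key, value) pairs with start_key <= key < end_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, end_key)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


def last_24_hours(table, vehicle_id, now):
    """Range scan over one vehicle's keys for the previous 86,400 seconds."""
    return table.scan(row_key(vehicle_id, now - 86_400), row_key(vehicle_id, now))


table = SortedTable()
now = 1_700_000_000
table.put(row_key("truck-7", now - 10), {"rpm": 1800})
table.put(row_key("truck-7", now - 90_000), {"rpm": 1500})  # older than 24 h
table.put(row_key("truck-8", now - 10), {"rpm": 2100})      # other vehicle
recent = last_24_hours(table, "truck-7", now)
```

Only the first row lands in `recent`: the older reading falls before the start key and the other vehicle's rows sort after the end key.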
Question 3: Your agricultural division is experimenting with fully autonomous vehicles. You want your architecture to promote strong security during vehicle operation. Which two architecture characteristics should you consider?
A. Use multiple connectivity subsystems for redundancy.
B. Require IPv6 for connectivity to ensure a secure address space.
C. Enclose the vehicle’s drive electronics in a Faraday cage to isolate chips.
D. Use a functional programming language to isolate code execution cycles.
E. Treat every microservice call between modules on the vehicle as untrusted.
F. Use a Trusted Platform Module (TPM) and verify firmware and binaries on boot.
ANSWER3:
E and F
Notes/References3:
E is correct because this improves system security by making it more resistant to hacking, especially through man-in-the-middle attacks between modules.
F is correct because this improves system security by making it more resistant to hacking, especially rootkits or other kinds of corruption by malicious actors.
Question 4: For this question, refer to the Company A case study.
Which of Company A’s legacy enterprise processes will experience significant change as a result of increased Google Cloud Platform adoption?
A. OpEx/CapEx allocation, LAN change management, capacity planning
B. Capacity planning, TCO calculations, OpEx/CapEx allocation
C. Capacity planning, utilization measurement, data center expansion
D. Data center expansion, TCO calculations, utilization measurement
ANSWER4:
B
Notes/References4:
B is correct because all of these tasks are big changes when moving to the cloud. Capacity planning for cloud is different than for on-premises data centers; TCO calculations are adjusted because Company A is using services, not leasing/buying servers; OpEx/CapEx allocation is adjusted as services are consumed vs. using capital expenditures.
Question 5: For this question, refer to the Company A case study.
You analyzed Company A’s business requirement to reduce downtime and found that they can achieve a majority of time saving by reducing customers’ wait time for parts. You decided to focus on reduction of the 3 weeks’ aggregate reporting time. Which modifications to the company’s processes should you recommend?
A. Migrate from CSV to binary format, migrate from FTP to SFTP transport, and develop machine learning analysis of metrics.
B. Migrate from FTP to streaming transport, migrate from CSV to binary format, and develop machine learning analysis of metrics.
C. Increase fleet cellular connectivity to 80%, migrate from FTP to streaming transport, and develop machine learning analysis of metrics.
D. Migrate from FTP to SFTP transport, develop machine learning analysis of metrics, and increase dealer local inventory by a fixed factor.
ANSWER5:
C
Notes/References5:
C is correct because using cellular connectivity will greatly improve the freshness of data used for analysis from where it is now, collected when the machines are in for maintenance. Streaming transport instead of periodic FTP will tighten the feedback loop even more. Machine learning is ideal for predictive maintenance workloads.
Question 6: Your company wants to deploy several microservices to help their system handle elastic loads. Each microservice uses a different version of software libraries. You want to enable their developers to keep their development environment in sync with the various production services. Which technology should you choose?
A. RPM/DEB
B. Containers
C. Chef/Puppet
D. Virtual machines
ANSWER6:
B
Notes/References6:
B is correct because using containers for development, test, and production deployments abstracts away system OS environments, so that a single host OS image can be used for all environments. Changes that are made during development are captured using a copy-on-write filesystem, and teams can easily publish new versions of the microservices in a repository.
Question 7: Your company wants to track whether someone is present in a meeting room reserved for a scheduled meeting. There are 1000 meeting rooms across 5 offices on 3 continents. Each room is equipped with a motion sensor that reports its status every second. You want to support the data upload and collection needs of this sensor network. The receiving infrastructure needs to account for the possibility that the devices may have inconsistent connectivity. Which solution should you design?
A. Have each device create a persistent connection to a Compute Engine instance and write messages to a custom application.
B. Have devices poll for connectivity to Cloud SQL and insert the latest messages on a regular interval to a device specific table.
C. Have devices poll for connectivity to Cloud Pub/Sub and publish the latest messages on a regular interval to a shared topic for all devices.
D. Have devices create a persistent connection to an App Engine application fronted by Cloud Endpoints, which ingest messages and write them to Cloud Datastore.
ANSWER7:
C
Notes/References7:
C is correct because Cloud Pub/Sub can handle the frequency of this data, and consumers of the data can pull from the shared topic for further processing.
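The "poll for connectivity and publish on an interval" pattern in option C can be sketched as a small device-side buffer. This is a minimal illustration, not the Pub/Sub client library: `BufferedSensor`, `FakeTopic`, and the buffer size are hypothetical, and the publish callable is assumed to raise `ConnectionError` while the device is offline.

```python
from collections import deque


class BufferedSensor:
    """Keeps readings locally and publishes them once a connection exists."""

    def __init__(self, publish, max_buffer=3600):
        self._publish = publish                  # callable(message) -> None
        self._buffer = deque(maxlen=max_buffer)  # oldest reading dropped when full

    def record(self, reading):
        self._buffer.append(reading)

    def flush(self):
        """Try to publish buffered readings in order; return how many were sent."""
        sent = 0
        while self._buffer:
            try:
                self._publish(self._buffer[0])
            except ConnectionError:
                break  # still offline; keep the rest for the next interval
            self._buffer.popleft()
            sent += 1
        return sent


class FakeTopic:
    """Stand-in for a shared topic; raises while 'offline'."""

    def __init__(self):
        self.online = False
        self.messages = []

    def publish(self, message):
        if not self.online:
            raise ConnectionError("no connectivity")
        self.messages.append(message)


topic = FakeTopic()
sensor = BufferedSensor(topic.publish)
sensor.record({"room": "r1", "occupied": True})
sensor.record({"room": "r1", "occupied": False})
sent_offline = sensor.flush()  # device is offline, nothing leaves the buffer
topic.online = True
sent_online = sensor.flush()   # both buffered readings are delivered in order
```

A message is removed from the buffer only after the publish call returns, so a dropped connection does not lose readings.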
Question 8: Your company wants to try out the cloud with low risk. They want to archive approximately 100 TB of their log data to the cloud and test the analytics features available to them there, while also retaining that data as a long-term disaster recovery backup. Which two steps should they take?
A. Load logs into BigQuery.
B. Load logs into Cloud SQL.
C. Import logs into Stackdriver.
D. Insert logs into Cloud Bigtable.
E. Upload log files into Cloud Storage.
ANSWER8:
A and E
Notes/References8:
A is correct because BigQuery is the fully managed cloud data warehouse for analytics and supports the analytics requirement.
E is correct because Cloud Storage provides the Coldline storage class to support long-term storage with infrequent access, which would support the long-term disaster recovery backup requirement.
Question 9: You set up an autoscaling instance group to serve web traffic for an upcoming launch. After configuring the instance group as a backend service to an HTTP(S) load balancer, you notice that virtual machine (VM) instances are being terminated and re-launched every minute. The instances do not have a public IP address. You have verified that the appropriate web response is coming from each instance using the curl command. You want to ensure that the backend is configured correctly. What should you do?
A. Ensure that a firewall rule exists to allow source traffic on HTTP/HTTPS to reach the load balancer.
B. Assign a public IP to each instance, and configure a firewall rule to allow the load balancer to reach the instance public IP.
C. Ensure that a firewall rule exists to allow load balancer health checks to reach the instances in the instance group.
D. Create a tag on each instance with the name of the load balancer. Configure a firewall rule with the name of the load balancer as the source and the instance tag as the destination.
ANSWER9:
C
Notes/References9:
C is correct because health check failures lead to a VM being marked unhealthy and can result in termination if the health check continues to fail. Because you have already verified that the instances are functioning properly, the next step would be to determine why the health check is continuously failing.
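One way to reason about option C is to check whether a firewall rule's allowed source ranges cover the addresses health check probes come from. The sketch below assumes the probe ranges 130.211.0.0/22 and 35.191.0.0/16, which Google documents for HTTP(S) load balancer health checks at the time of writing; verify them against current documentation before relying on them. The helper name is hypothetical.

```python
import ipaddress

# Assumed health check probe ranges (confirm against current documentation).
HEALTH_CHECK_RANGES = [
    ipaddress.ip_network("130.211.0.0/22"),
    ipaddress.ip_network("35.191.0.0/16"),
]


def uncovered_ranges(allowed_sources):
    """Return the probe ranges not covered by any allowed source CIDR."""
    allowed = [ipaddress.ip_network(source) for source in allowed_sources]
    return [
        str(probe)
        for probe in HEALTH_CHECK_RANGES
        if not any(probe.subnet_of(cidr) for cidr in allowed)
    ]


# A rule that only allows internal traffic leaves both probe ranges blocked.
print(uncovered_ranges(["10.0.0.0/8"]))
```

An empty result means the rule lets the probes through, which is the condition the instances in the question were failing.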
Question 10: Your organization has a 3-tier web application deployed in the same network on Google Cloud Platform. Each tier (web, API, and database) scales independently of the others. Network traffic should flow through the web to the API tier, and then on to the database tier. Traffic should not flow between the web and the database tier. How should you configure the network?
A. Add each tier to a different subnetwork.
B. Set up software-based firewalls on individual VMs.
C. Add tags to each tier and set up routes to allow the desired traffic flow.
D. Add tags to each tier and set up firewall rules to allow the desired traffic flow.
ANSWER10:
D
Notes/References10:
D is correct because as instances scale, they will all have the same tag to identify the tier. These tags can then be leveraged in firewall rules to allow and restrict traffic as required, because tags can be used for both the target and source.
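The tag-based rules in option D can be modelled as a small allow list that is evaluated with default deny. This is a simplified sketch rather than the real firewall engine: the ports, the `AllowRule` type, and the assumption that no broader allow rule exists in the network are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllowRule:
    source_tag: str
    target_tag: str
    port: int


# Hypothetical rules: web may call api, api may call db, nothing else.
TIER_RULES = [
    AllowRule("web", "api", 8080),
    AllowRule("api", "db", 3306),
]


def is_allowed(source_tags, target_tags, port, rules=TIER_RULES):
    """Default deny: traffic passes only if some allow rule matches."""
    return any(
        rule.source_tag in source_tags
        and rule.target_tag in target_tags
        and rule.port == port
        for rule in rules
    )


print(is_allowed({"web"}, {"api"}, 8080))  # True
print(is_allowed({"web"}, {"db"}, 3306))   # False: no rule links web to db
```

Because new instances in a tier inherit the tier's tag, the same two rules keep applying as each tier scales.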
Question 11: Your organization has 5 TB of private data on premises. You need to migrate the data to Cloud Storage. You want to maximize the data transfer speed. How should you migrate the data?
A. Use gsutil.
B. Use gcloud.
C. Use GCS REST API.
D. Use Storage Transfer Service.
ANSWER11:
A
Notes/References11:
A is correct because gsutil gives you access to write data to Cloud Storage.
Question 12: You are designing a mobile chat application. You want to ensure that people cannot spoof chat messages by proving that a message was sent by a specific user. What should you do?
A. Encrypt the message client-side using block-based encryption with a shared key.
B. Tag messages client-side with the originating user identifier and the destination user.
C. Use a trusted certificate authority to enable SSL connectivity between the client application and the server.
D. Use public key infrastructure (PKI) to encrypt the message client-side using the originating user’s private key.
ANSWER12:
D
Notes/References12:
D is correct because PKI requires that both the server and the client have signed certificates, validating both the client and the server.
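Option D amounts to a digital signature: the sender transforms a hash of the message with a private key, and anyone holding the matching public key can check that the message came from that key holder and was not altered. The sketch below is a toy textbook-RSA illustration using only the standard library. The primes are far too small and there is no padding, so it must not be used for real security; a production app would rely on a vetted cryptography library.

```python
import hashlib

# Toy key pair built from two known Mersenne primes (NOT secure in practice).
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                              # public modulus
E = 65537                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent (needs Python 3.8+)


def _digest(message: bytes) -> int:
    """Hash the message and reduce it into the range of the modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Only the holder of D can produce this value."""
    return pow(_digest(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Anyone with the public values (N, E) can check the signature."""
    return pow(signature, E, N) == _digest(message)


signature = sign(b"meet at noon")
print(verify(b"meet at noon", signature))
print(verify(b"meet at midnight", signature))
```

The first check succeeds and the second fails, because a signature made for one message does not match the hash of a different one.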
Question 13: You are designing a large distributed application with 30 microservices. Each of your distributed microservices needs to connect to a database backend. You want to store the credentials securely. Where should you store the credentials?
A. In the source code
B. In an environment variable
C. In a key management system
D. In a config file that has restricted access through ACLs
ANSWER13:
C
Notes/References13:
C is correct because a key management system is built to store secrets, control who can read them, and rotate them, whereas source code, environment variables, and config files leave credentials readable by anyone who gains access to them.
Question 14: For this question, refer to the Company B case study.
Company B wants to set up a real-time analytics platform for their new game. The new platform must meet their technical requirements. Which combination of Google technologies will meet all of their requirements?
A. Kubernetes Engine, Cloud Pub/Sub, and Cloud SQL
B. Cloud Dataflow, Cloud Storage, Cloud Pub/Sub, and BigQuery
C. Cloud SQL, Cloud Storage, Cloud Pub/Sub, and Cloud Dataflow
D. Cloud Pub/Sub, Compute Engine, Cloud Storage, and Cloud Dataproc
ANSWER14:
B
Notes/References14:
B is correct because: – Cloud Dataflow dynamically scales up or down, can process data in real time, and is ideal for processing data that arrives late using Beam windows and triggers. – Cloud Storage can be the landing space for files that are regularly uploaded by users’ mobile devices. – Cloud Pub/Sub can ingest the streaming data from the mobile users. BigQuery can query more than 10 TB of historical data.
Question 15: For this question, refer to the Company B case study.
Company B has deployed their new backend on Google Cloud Platform (GCP). You want to create a thorough testing process for new versions of the backend before they are released to the public. You want the testing environment to scale in an economical way. How should you design the process?
A. Create a scalable environment in GCP for simulating production load.
B. Use the existing infrastructure to test the GCP-based backend at scale.
C. Build stress tests into each component of your application and use resources from the already deployed production backend to simulate load.
D. Create a set of static environments in GCP to test different levels of load (for example, high, medium, and low).
ANSWER15:
A
Notes/References15:
A is correct because simulating production load in GCP can scale in an economical way.
Question 16: For this question, refer to the Company B case study.
Company B wants to set up a continuous delivery pipeline. Their architecture includes many small services that they want to be able to update and roll back quickly. Company B has the following requirements:
Services are deployed redundantly across multiple regions in the US and Europe
Only frontend services are exposed on the public internet.
They can reserve a single frontend IP for their fleet of services.
Deployment artifacts are immutable
Which set of products should they use?
A. Cloud Storage, Cloud Dataflow, Compute Engine
B. Cloud Storage, App Engine, Cloud Load Balancing
C. Container Registry, Google Kubernetes Engine, Cloud Load Balancing
D. Cloud Functions, Cloud Pub/Sub, Cloud Deployment Manager
ANSWER16:
C
Notes/References16:
C is correct because: –Google Kubernetes Engine is ideal for deploying small services that can be updated and rolled back quickly. It is a best practice to manage services using immutable containers. –Cloud Load Balancing supports globally distributed services across multiple regions. It provides a single global IP address that can be used in DNS records. Using URL Maps, the requests can be routed to only the services that Company B wants to expose. –Container Registry is a single place for a team to manage Docker images for the services.
Question 17: Your customer is moving their corporate applications to Google Cloud Platform. The security team wants detailed visibility of all resources in the organization. You use Resource Manager to set yourself up as the org admin. What Cloud Identity and Access Management (Cloud IAM) roles should you give to the security team?
A. Org viewer, Project owner
B. Org viewer, Project viewer
C. Org admin, Project browser
D. Project owner, Network admin
ANSWER17:
B
Notes/References17:
B is correct because: –Org viewer grants the security team permissions to view the organization’s display name. –Project viewer grants the security team permissions to see the resources within projects.
Question 18: To reduce costs, the Director of Engineering has required all developers to move their development infrastructure resources from on-premises virtual machines (VMs) to Google Cloud Platform. These resources go through multiple start/stop events during the day and require state to persist. You have been asked to design the process of running a development environment in Google Cloud while providing cost visibility to the finance department. Which two steps should you take?
A. Use persistent disks to store the state. Start and stop the VM as needed.
B. Use the --auto-delete flag on all persistent disks before stopping the VM.
C. Apply VM CPU utilization label and include it in the BigQuery billing export.
D. Use BigQuery billing export and labels to relate cost to groups.
E. Store all state in local SSD, snapshot the persistent disks, and terminate the VM.
F. Store all state in Cloud Storage, snapshot the persistent disks, and terminate the VM.
ANSWER18:
A and D
Notes/References18:
A is correct because persistent disks will not be deleted when an instance is stopped.
D is correct because exporting daily usage and cost estimates automatically throughout the day to a BigQuery dataset is a good way of providing visibility to the finance department. Labels can then be used to group the costs based on team or cost center.
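The grouping step described in the note above can be sketched in a few lines. The rows here are a simplified, hypothetical stand-in for billing export records (the real export schema is richer), and the `team` label key is an assumption.

```python
from collections import defaultdict


def cost_by_label(rows, label_key):
    """Sum cost per value of label_key; rows without it go to 'unlabeled'."""
    totals = defaultdict(float)
    for row in rows:
        group = row.get("labels", {}).get(label_key, "unlabeled")
        totals[group] += row["cost"]
    return dict(totals)


# Hypothetical export rows for one day.
rows = [
    {"service": "Compute Engine", "cost": 12.5, "labels": {"team": "payments"}},
    {"service": "Compute Engine", "cost": 7.5, "labels": {"team": "search"}},
    {"service": "Cloud Storage", "cost": 2.0, "labels": {"team": "payments"}},
    {"service": "Compute Engine", "cost": 1.0, "labels": {}},
]

print(cost_by_label(rows, "team"))
# {'payments': 14.5, 'search': 7.5, 'unlabeled': 1.0}
```

The "unlabeled" bucket is useful in practice because it shows the finance team how much spend still lacks an owner.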
Question 19: Your company has decided to make a major revision of their API in order to create better experiences for their developers. They need to keep the old version of the API available and deployable, while allowing new customers and testers to try out the new API. They want to keep the same SSL and DNS records in place to serve both APIs. What should they do?
A. Configure a new load balancer for the new version of the API.
B. Reconfigure old clients to use a new endpoint for the new API.
C. Have the old API forward traffic to the new API based on the path.
D. Use separate backend services for each API path behind the load balancer.
ANSWER19:
D
Notes/References19:
D is correct because an HTTP(S) load balancer can direct traffic reaching a single IP to different backends based on the incoming URL.
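The path-based routing in option D can be pictured as a lookup from URL prefix to backend service. The sketch below is a simplified model, not the load balancer's actual URL map syntax; the prefixes `/v1` and `/v2` and the backend names are hypothetical.

```python
def pick_backend(path, path_rules, default_backend):
    """Return the backend whose prefix is the longest match for path."""
    best = None
    for prefix, backend in path_rules.items():
        matches = path == prefix or path.startswith(prefix.rstrip("/") + "/")
        if matches and (best is None or len(prefix) > len(best[0])):
            best = (prefix, backend)
    return best[1] if best else default_backend


# One hostname and certificate, two backend services chosen by path.
PATH_RULES = {"/v1": "api-v1-backend", "/v2": "api-v2-backend"}

print(pick_backend("/v1/users", PATH_RULES, "api-v1-backend"))  # api-v1-backend
print(pick_backend("/v2/users", PATH_RULES, "api-v1-backend"))  # api-v2-backend
```

Requests that match neither prefix fall through to the default, which here keeps existing clients on the old API.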
Question 20: The database administration team has asked you to help them improve the performance of their new database server running on Compute Engine. The database is used for importing and normalizing the company’s performance statistics. It is built with MySQL running on Debian Linux. They have an n1-standard-8 virtual machine with 80 GB of SSD zonal persistent disk. What should they change to get better performance from this system in a cost-effective manner?
A. Increase the virtual machine’s memory to 64 GB.
B. Create a new virtual machine running PostgreSQL.
C. Dynamically resize the SSD persistent disk to 500 GB.
D. Migrate their performance metrics warehouse to BigQuery.
ANSWER20:
C
Notes/References20:
C is correct because persistent disk performance is based on the total persistent disk capacity attached to an instance and the number of vCPUs that the instance has. Increasing the persistent disk capacity will increase its throughput and IOPS, which in turn improves the performance of MySQL.
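To make the scaling concrete, here is a rough worked calculation. It assumes a baseline of 30 IOPS per GB for SSD persistent disks, a figure taken from public documentation at the time of writing that may change, and it ignores the per-instance caps tied to vCPU count.

```python
# Assumed baseline for SSD persistent disk (check current documentation).
SSD_PD_IOPS_PER_GB = 30


def baseline_iops(size_gb: int) -> int:
    """Baseline IOPS for an SSD persistent disk of the given size."""
    return size_gb * SSD_PD_IOPS_PER_GB


print(baseline_iops(80))   # 2400
print(baseline_iops(500))  # 15000
```

Under that assumption, resizing from 80 GB to 500 GB raises the baseline by more than six times without changing the VM or the database engine.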
Question 21: You need to ensure low-latency global access to data stored in a regional GCS bucket. Data access is uniform across many objects and relatively high. What should you do to address the latency concerns?
A. Use Google’s Cloud CDN.
B. Use Premium Tier routing and Cloud Functions to accelerate access at the edges.
C. Do nothing.
D. Use global BigTable storage.
E. Use a global Cloud Spanner instance.
F. Migrate the data to a new multi-regional GCS bucket.
G. Change the storage class to multi-regional.
ANSWER21:
A
Notes/References21:
Cloud Functions cannot be used to affect GCS data access, so that option is simply wrong. BigTable does not have any “global” mode, so that option is wrong, too. Cloud Spanner is not a good replacement for GCS data: the data use cases are different enough that we can assume it would probably not be a good fit. You cannot change a bucket’s location after it has been created–not via the storage class nor any other way; you would have to migrate the data to a new bucket. Google’s Cloud CDN is very easy to turn on, but it does only work for data that comes from within GCP and only if the objects are being accessed frequently enough.
Question 22: You are building a sign-up app for your local neighbourhood barbeque party and you would like to quickly throw together a low-cost application that tracks who will bring what. Which of the following options should you choose?
A. Python, Flask, App Engine Standard
B. Ruby, Nginx, GKE
C. HTML, CSS, Cloud Storage
D. Node.js, Express, Cloud Functions
E. Rust, Rocket, App Engine Flex
F. Perl, CGI, GCE
ANSWER22:
A
Notes/References22:
The Cloud Storage option doesn’t offer any way to coordinate the guest data. App Engine Flex would cost much more to run when no one is on the sign-up site. Cloud Functions could handle processing some API calls, but it would be more work to set up and that option doesn’t mention anything about storage. GKE is way overkill for such a small and simple application. Running Perl CGI scripts on GCE would also cost more than it needs (and probably make you very sad). App Engine Standard makes it super-easy to stand up a Python Flask app and includes easy data storage options, too.
Question 23: Your company has decided to migrate your AWS DynamoDB database to a multi-regional Cloud Spanner instance and you are designing the system to transfer and load all the data to synchronize the DBs and eventually allow for a quick cut-over. A member of your team has some previous experience working with Apache Hadoop. Which of the following options will you choose for the streamed updates that follow the initial import?
A. The DynamoDB table change is captured by Cloud Pub/Sub and written to Cloud Dataproc for processing into a Spanner-compatible format.
B. The DynamoDB table change is captured by Cloud Pub/Sub and written to Cloud Dataflow for processing into a Spanner-compatible format.
C. Changes to the DynamoDB table are captured by DynamoDB Streams. A Lambda function triggered by the stream writes the change to Cloud Pub/Sub. Cloud Dataflow processes the data from Cloud Pub/Sub and writes it to Cloud Spanner.
D. The DynamoDB table is rescanned by a GCE instance and written to a Cloud Storage bucket. Cloud Dataproc processes the data from Cloud Storage and writes it to Cloud Spanner.
E. The DynamoDB table is rescanned by an EC2 instance and written to an S3 bucket. Storage Transfer Service moves the data from S3 to a Cloud Storage bucket. Cloud Dataflow processes the data from Cloud Storage and writes it to Cloud Spanner.
ANSWER23:
C
Notes/References23:
Rescanning the DynamoDB table is not an appropriate approach to tracking data changes to keep the GCP side of this in sync. The fact that someone on your team has previous Hadoop experience is not a good enough reason to choose Cloud Dataproc; that’s a red herring. The options purporting to connect Cloud Pub/Sub directly to the DynamoDB table won’t work because there is no such functionality.
Question 24: Your client is a manufacturing company and they have informed you that they will be pausing all normal business activities during a five-week summer holiday period. They normally employ thousands of workers who constantly connect to their internal systems for day-to-day manufacturing data such as blueprints and machine imaging, but during this period the few on-site staff will primarily be re-tooling the factory for the next year’s production runs and will not be performing any manufacturing tasks that need to access these cloud-based systems. When the bulk of the staff return, they will primarily work on the new models but may spend about 20% of their time working with models from previous years. The company has asked you to reduce their GCP costs during this time, so which of the following options will you suggest?
A. Pause all Cloud Functions via the UI and unpause them when work starts back up.
B. Disable all Cloud Functions via the command line and re-enable them when work starts back up.
C. Delete all Cloud Functions and recreate them when work starts back up.
D. Convert all Cloud Functions to run as App Engine Standard applications during the break.
E. None of these options is a good suggestion.
ANSWER24:
E
Notes/References24:
Cloud Functions scale themselves down to zero when they’re not being used. There is no need to do anything with them.
Question 25: You need a place to store images before updating them by file-based render farm software running on a cluster of machines. Which of the following options will you choose?
A. Container Registry
B. Cloud Storage
C. Cloud Filestore
D. Persistent Disk
ANSWER25:
C
Notes/References25:
There are several different kinds of “images” that you might need to consider–maybe they are normal picture-image files, maybe they are Docker container images, maybe VM or disk images, or maybe something else. In this question, “images” refers to visual images, thus eliminating CI/CD products like Container Registry. The term “file-based” software means that it is unlikely to work well with object-based storage like Cloud Storage (or any of its storage classes). Persistent Disk cannot offer shared access across a cluster of machines when writes are involved; it only handles multiple readers. However, Cloud Filestore is made to provide shared, file-based storage for a cluster of machines as described in the question.
Question 26: Your company has decided to migrate your AWS DynamoDB database to a multi-regional Cloud Spanner instance and you are designing the system to transfer and load all the data to synchronize the DBs and eventually allow for a quick cut-over. A member of your team has some previous experience working with Apache Hadoop. Which of the following options will you choose for the initial data import?
A. The DynamoDB table is scanned by an EC2 instance and written to an S3 bucket. Storage Transfer Service moves the data from S3 to a Cloud Storage bucket. Cloud Dataflow processes the data from Cloud Storage and writes it to Cloud Spanner.
B. The DynamoDB table data is captured by DynamoDB Streams. A Lambda function triggered by the stream writes the data to Cloud Pub/Sub. Cloud Dataflow processes the data from Cloud Pub/Sub and writes it to Cloud Spanner.
C. The DynamoDB table data is captured by Cloud Pub/Sub and written to Cloud Dataproc for processing into a Spanner-compatible format.
D. The DynamoDB table is scanned by a GCE instance and written to a Cloud Storage bucket. Cloud Dataproc processes the data from Cloud Storage and writes it to Cloud Spanner.
ANSWER26:
A
Notes/References26:
The same data processing will have to happen for both the initial (batch) data load and the incremental (streamed) data changes that follow it. So if the solution built to handle the initial batch doesn’t also work for the stream that follows it, then the processing code would have to be written twice. A Professional Cloud Architect should recognize this project-level issue and not over-focus on the (batch) portion called out in this particular question. This is why you don’t want to choose Cloud Dataproc. Instead, Cloud Dataflow will handle both the initial batch load and also the subsequent streamed data. The fact that someone on your team has previous Hadoop experience is not a good enough reason to choose Cloud Dataproc; that’s a red herring. The DynamoDB streams option would be great for the db synchronization that follows, but it can’t handle the initial data load because DynamoDB Streams only fire for data changes. The option purporting to connect Cloud Pub/Sub directly to the DynamoDB table won’t work because there is no such functionality.
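The point about writing the processing code once can be illustrated without any pipeline framework: a single transform is applied to whatever source feeds it, whether that is a finite export or an open-ended change feed. This is a plain-Python sketch, not Dataflow or Beam code; `to_spanner_row`, `run`, and the sample items are hypothetical, and the input uses the DynamoDB-style typed attribute layout.

```python
def to_spanner_row(item):
    """Flatten a DynamoDB-style item such as {"id": {"S": "42"}} into a flat row."""
    return {name: next(iter(typed.values())) for name, typed in item.items()}


def run(source, sink):
    """Apply the same transform to any iterable source, batch or stream."""
    for item in source:
        sink.append(to_spanner_row(item))


# Initial load: a finite list standing in for exported files.
batch = [
    {"id": {"S": "1"}, "qty": {"N": "3"}},
    {"id": {"S": "2"}, "qty": {"N": "5"}},
]


# Ongoing changes: a generator standing in for a message subscription.
def change_stream():
    yield {"id": {"S": "2"}, "qty": {"N": "6"}}


rows = []
run(batch, rows)
run(change_stream(), rows)
```

Both calls go through the same `to_spanner_row` function, which is the property the note above argues for.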
Question 27: You need a managed service to handle logging data coming from applications running in GKE and App Engine Standard. Which option should you choose?
A. Cloud Storage
B. Logstash
C. Cloud Monitoring
D. Cloud Logging
E. BigQuery
F. BigTable
ANSWER27:
D
Notes/References27:
Cloud Monitoring is made to handle metrics, not logs. Logstash is not a managed service. And while you could store application logs in almost any storage service, the Cloud Logging service–aka Stackdriver Logging–is purpose-built to accept and process application logs from many different sources. Oh, and you should also be comfortable dealing with products and services by names other than their current official ones. For example, “GKE” used to be called “Container Engine”, “Cloud Build” used to be “Container Builder”, the “GCP Marketplace” used to be called “Cloud Launcher”, and so on.
Question 28: You need a place to store images before serving them from AppEngine Standard. Which of the following options will you choose?
A. Compute Engine
B. Cloud Filestore
C. Cloud Storage
D. Persistent Disk
E. Container Registry
F. Cloud Source Repositories
G. Cloud Build
H. Nearline
ANSWER28:
C
Notes/References28:
There are several different kinds of “images” that you might need to consider–maybe they are normal picture-image files, maybe they are Docker container images, maybe VM or disk images, or maybe something else. In this question, “images” refers to picture files, because that’s something that you would serve from a web server product like AppEngine Standard, so we eliminate Cloud Build (which isn’t actually for storage, at all) and the other two CI/CD products: Cloud Source Repositories and Container Registry. You definitely could store image files on Cloud Filestore or Persistent Disk, but you can’t hook those up to AppEngine Standard, so those options need to be eliminated, too. The only options left are both types of Cloud Storage, but since “Cloud Storage” sits next to “Nearline” as an option, we can confidently infer that the former refers to the “Standard” storage class. Since the question implies that these images will be served by AppEngine Standard, we would prefer to use the Standard storage class over the Nearline one–so there’s our answer.
Question 29: You need to ensure low-latency global access to data stored in a multi-regional GCS bucket. Data access is uniform across many objects and relatively low. What should you do to address the latency concerns?
A. Use a global Cloud Spanner instance.
B. Change the storage class to multi-regional.
C. Use Google’s Cloud CDN.
D. Migrate the data to a new regional GCS bucket.
E. Do nothing.
F. Use global BigTable storage.
ANSWER29:
E
Notes/References29:
BigTable does not have any “global” mode, so that option is wrong. Cloud Spanner is not a good replacement for GCS data: the data use cases are different enough that we can assume it would probably not be a good fit. You cannot change a bucket’s location after it has been created–not via the storage class nor any other way; you would have to migrate the data to a new bucket. But migrating the data to a regional bucket only helps when the data access will primarily be from that region. Google’s Cloud CDN is very easy to turn on, but it only works for data that comes from within GCP and only if the objects are being accessed frequently enough to get cached based on previous requests. Because the access per object is so low, Cloud CDN won’t really help. This then brings us back to the question. Now, it may seem implied, but the question does not specifically state that there is currently a problem with latency, only that you need to ensure low latency–and we are already using what would be the best fit for this situation: a multi-regional GCS bucket.
Question 30: You need to ensure low-latency GCP access to a volume of historical data that is currently stored in an S3 bucket. Data access is uniform across many objects and relatively high. What should you do to address the latency concerns?
A. Use Premium Tier routing and Cloud Functions to accelerate access at the edges.
B. Use Google’s Cloud CDN.
C. Use global BigTable storage.
D. Do nothing.
E. Migrate the data to a new multi-regional GCS bucket.
F. Use a global Cloud Spanner instance.
ANSWER30:
E
Notes/References30:
Cloud Functions cannot be used to affect GCS data access, so that option is simply wrong. BigTable does not have any “global” mode, so that option is wrong, too. Cloud Spanner is not a good replacement for GCS data: the data use cases are different enough that we can assume it would probably not be a good fit–and it would likely be unnecessarily expensive. You cannot change a bucket’s location after it has been created–not via the storage class nor any other way; you would have to migrate the data to a new bucket. Google’s Cloud CDN is very easy to turn on, but it only works for data that comes from within GCP and only if the objects are being accessed frequently enough. So even if you wanted to use Cloud CDN, you would have to migrate the data into a GCS bucket first, so that’s a better option.
Question 31: You are lifting and shifting into GCP a system that uses a subnet-based security model. It has frontend and backend tiers and will be deployed in three regions. How many subnets will you need?
A. Six
B. One
C. Three
D. Four
E. Two
F. Nine
ANSWER31:
A
Notes/References31:
A single subnet spans and can be used across all zones in a single region, but you will need different subnets in different regions. Also, to implement subnet-level network security, you need to separate each tier into its own subnet. In this case, you have two tiers which will each need their own subnet in each of the three regions in which you will deploy this system.
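The counting rule in these notes reduces to a one-line calculation. Here is a minimal sketch, assuming every tier needs its own subnet in every region; the function name `subnets_needed` is made up for illustration.

```python
def subnets_needed(tiers: int, regions: int) -> int:
    """Return the subnet count when each tier gets its own subnet per region."""
    if tiers < 1 or regions < 1:
        raise ValueError("tiers and regions must both be at least 1")
    # A subnet is regional: it covers every zone in its region, so zones
    # never add to the count, but each extra tier or region does.
    return tiers * regions


print(subnets_needed(tiers=2, regions=3))  # 6
```

The same function covers the other variants of this question: three tiers in three regions gives nine, and two tiers spread across zones of a single region gives two.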
Question 32: You need a place to produce images before deploying them to AppEngine Flex. Which of the following options will you choose?
A. Container Registry
B. Cloud Storage
C. Persistent Disk
D. Nearline
E. Cloud Source Repositories
F. Cloud Build
G. Cloud Filestore
H. Compute Engine
ANSWER32:
F
Notes/References32:
There are several different kinds of “images” that you might need to consider–maybe they are normal picture-image files, maybe they are Docker container images, maybe VM or disk images, or maybe something else. In this question, “deploying [these images] to AppEngine Flex” lets us know that we are dealing with Docker container images, and thus although they would likely be stored in the Container Registry, after being built, this question asks us where that building might happen, which is Cloud Build. Cloud Build, which used to be called Container Builder, is ideal for building container images–though it can also be used to build almost any artifacts, really. You could also do this on Compute Engine, but that option requires much more work to manage and is therefore worse.
Question 33: You are lifting and shifting into GCP a system that uses a subnet-based security model. It has frontend, app, and data tiers and will be deployed in three regions. How many subnets will you need?
A. Two
B. One
C. Three
D. Nine
E. Four
F. Six
ANSWER33:
D
Notes/References33:
A single subnet spans and can be used across all zones in a single region, but you will need different subnets in different regions. Also, to implement subnet-level network security, you need to separate each tier into its own subnet. In this case, you have three tiers which will each need their own subnet in each of the three regions in which you will deploy this system.
Question 34: You need a place to store images in case any of them are needed as evidence for a tax audit over the next seven years. Which of the following options will you choose?
A. Cloud Filestore
B. Coldline
C. Nearline
D. Persistent Disk
E. Cloud Source Repositories
F. Cloud Storage
G. Container Registry
ANSWER34:
B
Notes/References34:
There are several different kinds of “images” that you might need to consider–maybe they are normal picture-image files, maybe they are Docker container images, maybe VM or disk images, or maybe something else. In this question, “images” probably refers to picture files, and so Cloud Storage seems like an interesting option. But even still, when “Cloud Storage” is used without any qualifier, it generally refers to the “Standard” storage class, and this question also offers other storage classes as response options. Because the images in this scenario are unlikely to be used more than once a year (we can assume that taxes are filed annually and there’s less than 100% chance of being audited), the right storage class is Coldline.
Question 35: You need a place to store images before deploying them to AppEngine Flex. Which of the following options will you choose?
A. Container Registry
B. Cloud Filestore
C. Cloud Source Repositories
D. Persistent Disk
E. Cloud Storage
F. Cloud Build
G. Nearline
ANSWER35:
A
Notes/References35:
Compute Engine is not a storage product and should be eliminated. There are several different kinds of “images” that you might need to consider–maybe they are normal picture-image files, maybe they are Docker container images, maybe VM or disk images, or maybe something else. In this question, “deploying [these images] to AppEngine Flex” lets us know that we are dealing with Docker container images, and thus they would likely have been stored in the Container Registry.
Question 36: You are configuring a SaaS security application that updates your network’s allowed traffic configuration to adhere to internal policies. How should you set this up?
A. Install the application on a new appropriately-sized GCE instance running in your host VPC, and apply a read-only service account to it.
B. Create a new service account for the app to use and grant it the compute.networkViewer role on the production VPC.
C. Create a new service account for the app to use and grant it the compute.securityAdmin role on the production VPC.
D. Run the application as a container in your system’s staging GKE cluster and grant it access to a read-only service account.
E. Install the application on a new appropriately-sized GCE instance running in your host VPC, and let it use the default service account.
ANSWER36:
C
Notes/References36:
You do not install a Software-as-a-Service application yourself; instead, it runs on the vendor’s own hardware and you configure it for external access. Service accounts are great for this, as they can be used externally and you maintain full control over them (disabling them, rotating their keys, etc.). The principle of least privilege dictates that you should not give any application more ability than it needs, but this app does need to make changes, so you’ll need to grant securityAdmin, not networkViewer.
Question 37: You are lifting and shifting into GCP a system that uses a subnet-based security model. It has frontend and backend tiers and will be deployed across three zones. How many subnets will you need?
A. One
B. Six
C. Four
D. Three
E. Nine
F. Two
ANSWER37:
F
Notes/References37:
A single subnet spans and can be used across all zones in a given region. But to implement subnet-level network security, you need to separate each tier into its own subnet. In this case, you have two tiers, so you only need two subnets.
Question 38: You have been tasked with setting up a system to comply with corporate standards for container image approvals. Which of the following is your best choice for this project?
A. Binary Authorization
B. Cloud IAM
C. Security Key Enforcement
D. Cloud SCC
E. Cloud KMS
ANSWER38:
A
Notes/References38:
Cloud KMS is Google’s product for managing encryption keys. Security Key Enforcement is about making sure that people’s accounts do not get taken over by attackers, not about managing encryption keys. Cloud IAM is about managing what identities (both humans and services) can access in GCP. Cloud DLP–or Data Loss Prevention–is for preventing data loss by scanning for and redacting sensitive information. Cloud SCC–the Security Command Center–centralizes security information so you can manage it all in one place. Binary Authorization is about making sure that only properly-validated containers can run in your environments.
Question 39: For this question, refer to Company B’s case study. Which of the following are most likely to impact the operations of Company B’s game backend and analytics systems?
A. PCI
B. PII
C. SOX
D. GDPR
E. HIPAA
ANSWER39:
B and D
Notes/References39:
There is no patient/health information, so HIPAA does not apply. It would be a very bad idea to put payment card information directly into these systems, so we should assume they’ve not done that–therefore the Payment Card Industry (PCI) standards/regulations should not affect normal operation of these systems. Besides, it’s entirely likely that they never deal with payments directly, anyway–choosing to offload that to the relevant app stores for each mobile platform. Sarbanes-Oxley (SOX) is about proper management of financial records for publicly traded companies and should therefore not apply to these systems. However, these systems are likely to contain some Personally-Identifying Information (PII) about the users who may reside in the European Union and therefore the EU’s General Data Protection Regulations (GDPR) will apply and may require ongoing operations to comply with the “Right to be Forgotten/Erased”.
Question 40: Your new client has advised you that their organization falls within the scope of HIPAA. What can you infer about their information systems?
A. Their customers located in the EU may require them to delete their user data and provide evidence of such.
B. They will also need to pass a SOX audit.
C. They handle money-linked information.
D. Their system deals with medical information.
ANSWER40:
D
Notes/References40:
SOX stands for Sarbanes Oxley and is US regulation governing financial reporting for publicly-traded companies. HIPAA–the Health Insurance Portability and Accountability Act of 1996–is US regulation aimed at safeguarding individuals’ (i.e. patients’) health information. PCI is the Payment Card Industry, and they have Data Security Standards (DSS) that must be adhered to by systems handling payment information of any of their member brands (which include Visa, Mastercard, and several others).
Question 41: Your new client has advised you that their organization needs to pass audits by ISO and PCI. What can you infer about their information systems?
A. They handle money-linked information.
B. Their customers located in the EU may require them to delete their user data and provide evidence of such.
C. Their system deals with medical information.
D. They will also need to pass a SOX audit.
ANSWER41:
A
Notes/References41:
SOX stands for Sarbanes-Oxley and is a US regulation governing financial reporting for publicly-traded companies. HIPAA–the Health Insurance Portability and Accountability Act of 1996–is a US regulation aimed at safeguarding individuals’ (i.e. patients’) health information. PCI is the Payment Card Industry, and they have Data Security Standards (DSS) that must be adhered to by systems handling payment information of any of their member brands (which include Visa, Mastercard, and several others). ISO is the International Organization for Standardization, and since they have so many completely different certifications, this does not tell you much.
Question 43: Your new client has advised you that their organization deals with GDPR. What can you infer about their information systems?
A. Their system deals with medical information.
B. Their customers located in the EU may require them to delete their user data and provide evidence of such.
C. They will also need to pass a SOX audit.
D. They handle money-linked information.
ANSWER43:
B
Notes/References43:
SOX stands for Sarbanes Oxley and is US regulation governing financial reporting for publicly-traded companies. HIPAA–the Health Insurance Portability and Accountability Act of 1996–is US regulation aimed at safeguarding individuals’ (i.e. patients’) health information. PCI is the Payment Card Industry, and they have Data Security Standards (DSS) that must be adhered to by systems handling payment information of any of their member brands (which include Visa, Mastercard, and several others).
Question 44: For this question, refer to the Company C case study. Once Company C has completed their initial cloud migration as described in the case study, which option would represent the quickest way to migrate their production environment to GCP?
A. Apply the strangler pattern to their applications and reimplement one piece at a time in the cloud
B. Lift and shift all servers at one time
C. Lift and shift one application at a time
D. Lift and shift one server at a time
E. Set up cloud-based load balancing then divert traffic from the DC to the cloud system
F. Enact their disaster recovery plan and fail over
ANSWER44:
F
Notes/References44:
The proposed Lift and Shift options are all talking about different situations than Company C would find themselves in, at that time: they’d then have automation to build a complete prod system in the cloud, but they’d just need to migrate to it. “Just”, right? 🙂 The strangler pattern approach is similarly problematic (in this case), in that it proposes a completely different cloud migration strategy than the one they’ve almost completed. Now, if we purely consider the kicker’s key word “quickest”, using the DR plan to fail over definitely seems like it wins. Setting up an additional load balancer and migrating slowly/carefully would take more time.
Question 45: Which of the following commands is most likely to appear in an environment setup script?
A. gsutil mb -l asia gs://${project_id}-logs
B. gcloud compute instances create --zone--machine-type=n1-highmem-16 newvm
C. gcloud compute instances create --zone--machine-type=f1-micro newvm
D. gcloud compute ssh ${instance_id}
E. gsutil cp -r gs://${project_id}-setup ./install
F. gsutil cp -r logs/* gs://${project_id}-logs/${instance_id}/
ANSWER45:
A
Notes/References45:
The context here indicates that “environment” is an infrastructure environment like “staging” or “prod”, not just a particular command shell. In that sort of a situation, it is likely that you might create some core per-environment buckets that will store different kinds of data like configuration, communication, logging, etc. You’re not likely to be creating, deleting, or connecting (sshing) to instances, nor copying files to or from any instances.
Question 46: Your developers are working to expose a RESTful API for your company’s physical dealer locations. Which of the following endpoints would you advise them to include in their design?
A. /dealerLocations/get
B. /dealerLocations
C. /dealerLocations/list
D. Source and destination
E. /getDealerLocations
ANSWER46:
B
Notes/References46:
It might not feel like it, but this is in scope and a fair question. Google expects Professional Cloud Architects to be able to advise on designing APIs according to best practices (check the exam guide!). In this case, it’s important to know that RESTful interfaces (when properly designed) use nouns for the resources identified by a given endpoint. That, by itself, eliminates most of the listed options. In HTTP, verbs like GET, PUT, and POST are then used to interact with those endpoints to retrieve and act upon those resources. To choose between the two noun-named options, it helps to know that plural resources are generally already understood to be lists, so there should be no need to add another “/list” to the endpoint.
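The noun-plus-verb convention described above can be illustrated with a tiny routing table. This is a sketch using only the standard library, not any particular web framework; `DEALERS`, `ROUTES`, `dispatch`, and the handler names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory data standing in for a real datastore.
DEALERS: Dict[str, dict] = {"1": {"id": "1", "city": "Austin"}}


def list_dealer_locations() -> List[dict]:
    """Handle GET: the plural noun already implies a list."""
    return list(DEALERS.values())


def create_dealer_location(body: dict) -> dict:
    """Handle POST: add a new resource to the collection."""
    DEALERS[body["id"]] = body
    return body


# One noun-named endpoint; the HTTP verb selects the action.
ROUTES: Dict[Tuple[str, str], Callable] = {
    ("GET", "/dealerLocations"): list_dealer_locations,
    ("POST", "/dealerLocations"): create_dealer_location,
}


def dispatch(method: str, path: str, *args):
    """Return (status, result) for a request, or (404, None) if unrouted."""
    handler = ROUTES.get((method, path))
    if handler is None:
        return 404, None
    return 200, handler(*args)


print(dispatch("GET", "/dealerLocations"))
print(dispatch("GET", "/getDealerLocations"))  # verb in the path: not routed
```

Keeping the verb out of the path means new actions on the same resource reuse the same endpoint instead of multiplying URLs.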
Question 47: Which of the following commands is most likely to appear in an instance shutdown script?
A. gsutil cp -r gs://${project_id}-setup ./install
B. gcloud compute instances create --zone--machine-type=n1-highmem-16 newvm
C. gcloud compute ssh ${instance_id}
D. gsutil mb -l asia gs://${project_id}-logs
E. gcloud compute instances delete ${instance_id}
F. gsutil cp -r logs/* gs://${project_id}-logs/${instance_id}/
G. gcloud compute instances create --zone--machine-type=f1-micro newvm
ANSWER47:
F
Notes/References47:
The startup and shutdown scripts run on an instance at the time when that instance is starting up or shutting down. Those situations do not generally call for any other instances to be created, deleted, or connected (sshed) to. Also, those would be a very unusual time to make a Cloud Storage bucket, since buckets are the overall and highly-scalable containers that would likely hold the data for all (or at least many) instances in a given project. That said, instance shutdown time may be a time when you’d want to copy some final logs from the instance into some project-wide bucket. (In general, though, you really want to be doing that kind of thing continuously and not just at shutdown time, in case the instance shuts down unexpectedly and not in an orderly fashion that runs your shutdown script.)
Question 48: It is Saturday morning and you have been alerted to a serious issue in production that is both reducing availability to 95% and corrupting some data. Your monitoring tools noticed the issue 5 minutes ago and it was just escalated to you because the on-call tech in line before you did not respond to the page. Your system has an RPO of 10 minutes and an RTO of 120 minutes, with an SLA of 90% uptime. What should you do first?
A. Escalate the decision to the business manager responsible for the SLA
B. Take the system offline
C. Revert the system to the state it was in on Friday morning
D. Investigate the cause of the issue
ANSWER48:
B
Notes/References48:
The data corruption is your primary concern, as your Recovery Point Objective allows only 10 minutes of data loss and you may already have lost 5. (The data corruption means that you may well need to roll back the data to before that started happening.) It might seem crazy, but you should as quickly as possible stop the system so that you do not lose any more data. It would almost certainly take more time than you have left in your RPO to properly investigate and address the issue, but you should then do that next, during the disaster response clock set by your Recovery Time Objective. Escalating the issue to a business manager doesn’t make any sense. And neither does it make sense to knee-jerk revert the system to an earlier state unless you have some good indication that doing so will address the issue. Plus, we’d better assume that “revert the system” refers only to the deployment and not the data, because rolling the data back that far would definitely violate the RPO.
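The time pressure in this scenario is simple arithmetic. As a small sketch (the function name is hypothetical, and it assumes the corruption began no earlier than when monitoring noticed it, which is the optimistic case):

```python
def rpo_minutes_remaining(rpo_minutes: float, minutes_since_issue_began: float) -> float:
    """Minutes of further data loss tolerable before the RPO is breached."""
    return max(0.0, rpo_minutes - minutes_since_issue_began)


# Figures from the question: RPO of 10 minutes, issue noticed 5 minutes ago.
print(rpo_minutes_remaining(10, 5))  # 5.0
```

If the issue actually started before it was detected, the real margin is smaller, which is one more reason to stop the damage first and investigate afterwards.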
Question 49: Which of the following are not processes or practices that you would associate with DevOps?
A. Raven-test the candidate
B. Obfuscate the code
C. Only one of the other options is made up
D. Run the code in your cardinal environment
E. Do a canary deploy
ANSWER49:
A and D
Notes/References49:
Testing your understanding of development and operations in DevOps. In particular, you need to know that a canary deploy is a real thing and it can be very useful to identify problems with a new change you’re making before it is fully rolled out to and therefore impacts everyone. You should also understand that “obfuscating” code is a real part of a release process that seeks to protect an organization’s source code from theft (by making it unreadable by humans) and usually happens in combination with “minification” (which improves the speed of downloading and interpreting/running the code). On the other hand, “raven-testing” isn’t a thing, and neither is a “cardinal environment”. Those bird references are just homages to canary deployments.
Question 50: Your CTO is going into budget meetings with the board, next month, and has asked you to draw up plans to optimize your GCP-based systems for capex. Which of the following options will you prioritize in your proposal?
A. Object lifecycle management
B. BigQuery Slots
C. Committed use discounts
D. Sustained use discounts
E. Managed instance group autoscaling
F. Pub/Sub topic centralization
ANSWER50:
B and C
Notes/References50:
Pub/Sub usage is based on how much data you send through it, not any sort of “topic centralization” (which isn’t really a thing). Sustained use discounts can reduce costs, but that’s not really something you structure your system around. Now, most organizations prefer to turn Capital Expenditures into Operational Expenses, but since this question is instead asking you to prioritize CapEx, we need to consider the remaining options from the perspective of “spending” (or maybe reserving) defined amounts of money up-front for longer-term use. (Fair warning, though: You may still have some trouble classifying some cloud expenses as “capital” expenditures). With that in mind, GCE’s Committed Use Discounts do fit: you “buy” (reserve/prepay) some instances ahead of time and then not have to pay (again) for them as you use them (or don’t use them; you’ve already paid). BigQuery Slots are a similar flat-rate pricing model: you pre-purchase a certain amount of BigQuery processing capacity and your queries use that instead of the on-demand capacity. That means you won’t pay more than you planned/purchased, but your queries may finish rather more slowly, too. Managed instance group autoscaling and object lifecycle management can help to reduce costs, but they are not really about capex.
Question 51: In your last retrospective, there was significant disagreement voiced by the members of your team about what part of your system should be built next. Your scrum master is currently away, but how should you proceed when she returns, on Monday?
A. The scrum master is the one who decides
B. The lead architect should get the final say
C. The product owner should get the final say
D. You should put it to a vote of key stakeholders
E. You should put it to a vote of all stakeholders
ANSWER51:
C
Notes/References51:
In Scrum, it is the Product Owner’s role to define and prioritize (i.e. set order for) the product backlog items that the dev team will work on. If you haven’t ever read it, the Scrum Guide is not too long and quite valuable to have read at least once, for context.
Question 52: Your development team needs to evaluate the behavior of a new version of your application for approximately two hours before committing to making it available to all users. Which of the following strategies will you suggest?
A. Split testing
B. Red-Black
C. A/B
D. Canary
E. Rolling
F. Blue-Green
G. Flex downtime
ANSWER52:
D and E
Notes/References52:
A Blue-Green deployment, also known as a Red-Black deployment, entails having two complete systems set up and cutting over from one of them to the other with the ability to cut back to the known-good old one if there’s any problem with the experimental new one. A canary deployment is where a new version of an app is deployed to only one (or a very small number) of the servers, to see whether it experiences or causes trouble before that version is rolled out to the rest of the servers. When the canary looks good, a Rolling deployment can be used to update the rest of the servers, in-place, one after another to keep the overall system running. “Flex downtime” is something I just made up, but it sounds bad, right? A/B testing–also known as Split testing–is not generally used for deployments but rather to evaluate two different application behaviours by showing both of them to different sets of users. Its purpose is to gather higher-level information about how users interact with the application.
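The canary-then-rolling sequence described above can be sketched as a simple plan generator. The server names, the function `rollout_stages`, and the default sizes are all hypothetical; a real rollout would also pause and check health between stages.

```python
from typing import List


def rollout_stages(servers: List[str], canary_count: int = 1,
                   batch_size: int = 2) -> List[List[str]]:
    """Return update stages: the canary group first, then rolling batches."""
    if canary_count < 1 or batch_size < 1:
        raise ValueError("canary_count and batch_size must be at least 1")
    # Stage 0 is the canary: only these servers get the new version at first.
    stages = [servers[:canary_count]]
    rest = servers[canary_count:]
    # Remaining servers are updated in place, one small batch at a time.
    for start in range(0, len(rest), batch_size):
        stages.append(rest[start:start + batch_size])
    return stages


stages = rollout_stages(["web-1", "web-2", "web-3", "web-4", "web-5"])
print(stages)
```

In the scenario from the question, the two-hour evaluation window would sit between the first stage and the rest.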
Question 53: You are mentoring a Junior Cloud Architect on software projects. Which of the following “words of wisdom” will you pass along?
A. Identifying and fixing one issue late in the product cycle could cost the same as handling a hundred such issues earlier on
B. Hiring and retaining 10X developers is critical to project success
C. A key goal of a proper post-mortem is to identify what processes need to be changed
D. Adding 100% is a safe buffer for estimates made by skilled estimators at the beginning of a project
E. A key goal of a proper post-mortem is to determine who needs additional training
ANSWER53:
A and C
Notes/References53:
There really can be 10X (and even larger!) differences in productivity between individual contributors, but projects do not only succeed or fail because of their contributions. Bugs are crazily more expensive to find and fix once a system has gone into production, compared to identifying and addressing that issue right up front–yes, even 100x. A post-mortem should not focus on blaming an individual but rather on understanding the many underlying causes that led to a particular event, with an eye toward how such classes of problems can be systematically prevented in the future.
Question 54: Your team runs a service with an SLA to achieve p99 latency of 200ms. This month, your service achieved p95 latency of 250ms. What will happen now?
A. The next month’s SLA will be increased.
B. The next month’s SLO will be reduced.
C. Your client(s) will have to pay you extra.
D. You will have to pay your client(s).
E. There is no impact on payments.
F. There is not enough information to make a determination.
ANSWER54:
D
Notes/References54:
It would be highly unusual for clients to have to pay extra, even if the service performs better than agreed by the SLA. SLAs generally set out penalties (i.e. you pay the client) for below-standard performance. While SLAs are external-facing, SLOs are internal-facing and do not generally relate to performance penalties. Neither SLAs nor SLOs are adaptively changed just because of one month’s performance; such changes would have to happen through rather different processes. A p99 metric is a tougher measure than p95, and p95 is tougher than p90–so meeting the tougher measure would surpass a required SLA, but meeting a weaker measure would not give enough information to say.
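To see why p99 is a tougher measure than p95 or p90, it helps to compute all three from the same data. The sketch below uses the nearest-rank definition of a percentile (one of several in common use) on made-up latency samples; `percentile` and `latencies` are illustrative names.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples or not 0 < p <= 100:
        raise ValueError("need samples and 0 < p <= 100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies in ms: 100 requests, mostly fast with a slow tail.
latencies = [120.0] * 90 + [180.0] * 5 + [240.0] * 4 + [900.0]

p90, p95, p99 = (percentile(latencies, p) for p in (90, 95, 99))
print(p90, p95, p99)
```

With these samples the service meets a 200ms target at p90 and p95 but misses it at p99, so knowing only the weaker percentile says nothing about the tougher one. The reverse does hold: since p95 can never exceed p99, a p95 above the target implies the p99 is above it too.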
Question 55: Your team runs a service with an SLO to achieve p90 latency of 200ms. This month, your service achieved p95 latency of 250ms. What will happen now?
A. The next month’s SLA will be increased.
B. There is no impact on payments.
C. There is not enough information to make a determination.
D. Your client(s) will have to pay you extra.
E. The next month’s SLO will be reduced.
F. You will have to pay your client(s).
ANSWER55:
B
Notes/References55:
It would be highly unusual for clients to have to pay extra, even if the service performs better than agreed by the SLA. SLAs generally set out penalties (i.e. you pay the client) for below-standard performance. While SLAs are external-facing, SLOs are internal-facing and do not generally relate to performance penalties. Neither SLAs nor SLOs are adaptively changed just because of one month’s performance; such changes would have to happen through rather different processes. A p99 metric is a tougher measure than p95, and p95 is tougher than p90–so meeting the tougher measure would surpass a required SLA, but meeting a weaker measure would not give enough information to say.
Question 56: For this question, refer to the Company C case study. How would you recommend Company C address their capacity and utilization concerns?
A. Configure the autoscaling thresholds to follow changing load
B. Provision enough servers to handle trough load and offload to Cloud Functions for higher demand
C. Run cron jobs on their application servers to scale down at night and up in the morning
D. Use Cloud Load Balancing to balance the traffic highs and lows
E. Run automated jobs in Cloud Scheduler to scale down at night and up in the morning
F. Provision enough servers to handle peak load and sell back excess on-demand capacity to the marketplace
ANSWER56:
A
Notes/References56:
The case study notes, “Our traffic patterns are highest in the mornings and weekend evenings; during other times, 80% of our capacity is sitting idle.” Cloud Load Balancing could definitely scale itself to handle this type of load fluctuation, but it would not do anything to address the issue of having enough application server capacity. Provisioning servers to handle peak load is generally inefficient, but selling back excess on-demand capacity to the marketplace just isn’t a thing, so that option must be eliminated, too. Using Cloud Functions would require a different architectural approach for their application servers and it is generally not worth the extra work it would take to coordinate workloads across Cloud Functions and GCE–in practice, you’d just use one or the other. It is possible to manually effect scaling via automated jobs like in Cloud Scheduler or cron running somewhere (though cron running everywhere could create a coordination nightmare), but manual scaling based on predefined expected load levels is far from ideal, as capacity would only very crudely match demand. Rather, it is much better to configure the managed instance group’s autoscaling to follow demand curves–both expected and unexpected. A properly-architected system should rise to the occasion of unexpectedly going viral, and not fall over.
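Autoscaling that follows demand typically works by comparing observed utilization with a target. The following is a generic proportional-scaling sketch, not the exact algorithm used by managed instance groups; the function name, the percentage inputs, and the group bounds are assumptions for illustration.

```python
import math


def desired_instances(current: int, observed_percent: int, target_percent: int,
                      min_size: int, max_size: int) -> int:
    """Proportional scaling: size the group so average utilization lands near the target."""
    if current < 1 or target_percent <= 0:
        raise ValueError("current must be >= 1 and target_percent > 0")
    needed = math.ceil(current * observed_percent / target_percent)
    # Clamp to the configured bounds of the group.
    return max(min_size, min(max_size, needed))


# Hypothetical numbers: 10 instances, 60% CPU target, group bounded to 2..50.
print(desired_instances(10, 90, 60, 2, 50))  # morning peak
print(desired_instances(10, 12, 60, 2, 50))  # mostly idle period
```

Unlike a cron-style schedule, this reacts to whatever load actually arrives, including an unexpected spike, up to the configured maximum.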
Google Cloud Latest News, Questions and Answers online:
Cloud Run vs App Engine: In a nutshell, you give Google’s Cloud Run a Docker container containing a webserver. Google will run this container and create an HTTP endpoint. All the scaling is automatically done for you by Google. Cloud Run depends on the fact that your application should be stateless. This is because Google will spin up multiple instances of your app to scale it dynamically. If you want to host a traditional web application this means that you should divide it up into a stateless API and a frontend app.
With Google’s App Engine you tell Google how your app should be run. App Engine will create and run a container from these instructions. Deploying with App Engine is super easy. You simply fill out an app.yaml file and Google handles everything for you.
With Cloud Run, you have more control. You can go crazy and build a ridiculous custom Docker image, no problem! Cloud Run is made for DevOps engineers; App Engine is made for developers. Read more here…
The best choice depends on what you want to optimize, your use-cases and your specific needs.
If your objective is the lowest latency, choose Cloud Run.
Indeed, Cloud Run always uses 1 vCPU (at least 2.4 GHz), and you can choose the memory size from 128 MB to 2 GB.
With Cloud Functions, if you want the best processing performance (a 2.4 GHz CPU), you have to pay for 2 GB of memory. If your memory footprint is low, a Cloud Function with 2 GB of memory is overkill and costs more for nothing.
Cutting cost is not always the best strategy for customer satisfaction, but business reality may require it. In any case, it depends heavily on your use case.
Both Cloud Run and Cloud Functions round up to the nearest 100 ms. As you can see by playing with the GSheet, Cloud Functions are cheaper when the processing time of one request is below the first 100 ms. Indeed, you can slow down the Cloud Functions vCPU, which increases the processing duration but, if you tune it well, keeps it under 100 ms. Thus fewer GHz-seconds are used, and you pay less.
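The rounding rule described above can be sketched in a few lines of Python. This is a minimal illustration of rounding billed time up to a 100 ms increment; the function name `billed_ms` and the sample durations are assumptions, and no real price list is used.

```python
def billed_ms(duration_ms: int, increment_ms: int = 100) -> int:
    """Round a request duration up to the next billing increment."""
    if duration_ms <= 0:
        return 0
    # Ceiling division on integers, then scale back to milliseconds.
    return -(-duration_ms // increment_ms) * increment_ms


# A 40 ms request and a 95 ms request are billed the same 100 ms,
# so a slower (cheaper) CPU costs nothing extra in billed time here.
print(billed_ms(40))   # 100
print(billed_ms(95))   # 100
# Crossing the boundary by 1 ms doubles the billed time.
print(billed_ms(101))  # 200
```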
The cost comparison between Cloud Functions and Cloud Run goes further than simply comparing price lists. Moreover, on your projects, you will often have to use both solutions to take advantage of their respective strengths and capabilities.
My first choice for development is Cloud Run. Its portability, its testability, and its openness to any library, language, or binary give it many advantages for, at least, similar pricing, and often a real advantage in cost and in performance, in particular for concurrent requests. Even if you need the same level of isolation as Cloud Functions (1 instance per request), simply set the concurrency parameter to 1!
In addition, Cloud Run’s GA status applies to all containers, whatever the languages and binaries used. Read more here…
Google Cloud Storage : What bucket class for the best performance?: Multiregional buckets perform significantly better for cross-the-ocean fetches, however the details are a bit more nuanced than that. The performance is dominated by the latency of physical distance between the client and the cloud storage bucket.
If caching is on, and your access volume is high enough to take advantage of caching, there’s not a huge difference between the two offerings (that I can see with the tests). This shows off the power of Google’s Awesome CDN environment.
If caching is off, or the access volume is low enough that you can’t take advantage of caching, then the performance overhead is dominated directly by physics. You should be trying to get the assets as close to the clients as possible, while also considering cost, and the types of redundancy and consistency you’ll need for your data needs.
Conclusion:
GCP, or the Google Cloud Platform, is a cloud-computing platform that provides users with access to a variety of GCP services. The GCP Professional Cloud Architect exam is designed to test a candidate’s ability to design, implement, and manage GCP solutions. The GCP questions cover a wide range of topics, from basic GCP concepts to advanced GCP features. To become a GCP Certified Professional Cloud Architect, you must pass the GCP Professional Cloud Architect exam. Below are some basic GCP questions to answer to familiarize yourself with the Google Cloud Platform:
1) What is GCP? 2) What are the benefits of using GCP? 3) How can GCP help my business? 4) What are some of the features of GCP? 5) How is GCP different from other clouds? 6) Why should I use GCP? 7) What are some of GCP’s strengths? 8) How is GCP priced? 9) Is GCP easy to use? 10) Can I use GCP for my personal projects? 11) What services does GCP offer? 12) What can I do with GCP? 13) What languages does GCP support? 14) What platforms does GCP support? 15) Does GCP support hybrid deployments? 16) Does GCP support on-premises deployments? 17) Is there a free tier on GCP? 18) How do I get started with using GCP?
Top high-paying certifications:
Google Certified Professional Cloud Architect – $139,529
First of all, I would like to start with the fact that I already have around 1 year of experience with GCP in depth, where I was working on GKE, IAM, storage and so on. I also obtained GCP Associate Cloud Engineer certification back in June as well, which helps with the preparation.
I started with Dan Sullivan’s Udemy course for Professional Cloud Architect and did some refresher on the topics I was not familiar with such as BigTable, BigQuery, DataFlow and all that. His videos on the case studies help a lot to understand what each case study scenario requires for designing the best cost-effective architecture.
In order to understand the services in depth, I also went through the GCP documentation for each service at least once. It’s quite useful for knowing the syntax of the GCP commands and some miscellaneous information.
As for practice exam, I definitely recommend Whizlabs. It helped me prepare for the areas I was weak at and helped me grasp the topics a lot faster than reading through the documentation. It will also help you understand what kind of questions will appear for the exam.
I used TutorialsDojo (Jon Bonso) for preparation for Associate Cloud Engineer before and I can attest that Whizlabs is not that good. However, Whizlabs still helps a lot in tackling the tough questions that you will come across during the examination.
One thing to note is that, there wasn’t even a single question that was similar to the ones from Whizlabs practice tests. I am saying this from the perspective of the content of the questions. I got totally different scenarios for both case study and non case study questions. Many questions focused on App Engine, Data analytics and networking. There were some Kubernetes questions based on Anthos, and cluster networking. I got a tough question regarding storage as well.
I initially thought I would fail, but I pushed on and started tackling the multiple-choices based on process of elimination using the keywords in the questions. 50 questions in 2 hours is a tough one, especially due to the lengthy questions and multiple choices. I do not know how this compares to AWS Solutions Architect Professional exam in toughness. But some people do say GCP professional is tougher than AWS.
All in all, I still recommend this certification to people who are working with GCP. It’s a tough one to crack and could be useful for future prospects. It’s a bummer that it’s only valid for 2 years.
Google Associate Cloud Engineer Exam Preparation: Questions and Answers Dumps
GCP, or the Google Cloud Platform, is a cloud-computing platform that provides users with access to a variety of GCP services. The GCP ACE exam is designed to test a candidate’s ability to design, implement, and manage GCP solutions. The GCP ACE questions cover a wide range of topics, from basic GCP concepts to advanced GCP features. To become a GCP Certified Associate Cloud Engineer, you must pass the GCP ACE exam. Before you take the exam, work through the GCP ACE quizzes below. They are designed to help you prepare for the GCP ACE exam by testing your knowledge of GCP concepts. After you complete them, you should be better prepared for the GCP practice exam.
GCP, Google Cloud Platform, has been a game changer in the tech industry. It allows organizations to build and run applications on Google’s infrastructure. The GCP platform is trusted by many companies because it is reliable, secure and scalable. In order to become a GCP.
Average Google Associate Cloud Engineer salary: $145,769/yr
An Associate Cloud Engineer deploys applications, monitors operations, and manages enterprise solutions.
The Associate Cloud Engineer exam assesses your ability to: Set up a cloud solution environment, Plan and configure a cloud solution, Deploy and implement a cloud solution, Ensure successful operation of a cloud solution, Configure access and security.
Question 1: You are a project owner and need your co-worker to deploy a new version of your application to App Engine. You want to follow Google’s recommended practices. Which IAM roles should you grant your co-worker?
Question 2: Your company has reserved a monthly budget for your project. You want to be informed automatically of your project spend so that you can take action when you approach the limit. What should you do?
Question 3: You have a project using BigQuery. You want to list all BigQuery jobs for that project. You want to set this project as the default for the bq command-line tool. What should you do?
A. Use “gcloud config set project” to set the default project.
B. Use “bq config set project” to set the default project.
C. Use “gcloud generate config-url” to generate a URL to the Google Cloud Platform Console to set the default project.
D. Use “bq generate config-url” to generate a URL to the Google Cloud Platform Console to set the default project.
Question 4: Your project has all its Compute Engine resources in the europe-west1 region. You want to set europe-west1 as the default region for gcloud commands. What should you do?
A. Use Cloud Shell instead of the command line interface of your device. Launch Cloud Shell after you navigate to a resource in the europe-west1 region. The europe-west1 region will automatically become the default region.
B. Use “gcloud config set compute/region europe-west1” to set the default region for future gcloud commands.
C. Use “gcloud config set compute/zone europe-west1” to set the default region for future gcloud commands.
Question 5: You developed a new application for App Engine and are ready to deploy it to production. You need to estimate the costs of running your application on Google Cloud Platform as accurately as possible. What should you do?
A. Create a YAML file with the expected usage. Pass this file to the “gcloud app estimate” command to get an accurate estimation.
B. Multiply the costs of your application when it was in development by the number of expected users to get an accurate estimation.
C. Use the pricing calculator for App Engine to get an accurate estimation of the expected charges.
D. Create a ticket with Google Cloud Billing Support to get an accurate estimation.
Question 6: Your company processes high volumes of IoT data that are time-stamped. The total data volume can be several petabytes. The data needs to be written and changed at a high speed. You want to use the most performant storage option for your data. Which product should you use?
A. Cloud Datastore
B. Cloud Storage
C. Cloud Bigtable
D. BigQuery
ANSWER 6:
C
Notes/Hint 6:
Cloud Bigtable is the most performant storage option to work with IoT and time series data.
Question 7: Your application has a large international audience and runs stateless virtual machines within a managed instance group across multiple locations. One feature of the application lets users upload files and share them with other users. Files must be available for 30 days; after that, they are removed from the system entirely. Which storage solution should you choose?
A. A Cloud Datastore database.
B. A multi-regional Cloud Storage bucket.
C. Persistent SSD on virtual machine instances.
D. A managed instance group of Filestore servers.
ANSWER 7:
B
Notes/Hint 7:
Buckets can be multi-regional and have lifecycle management.
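As a sketch of what lifecycle management looks like for the 30-day requirement, the Python below builds a Cloud Storage lifecycle configuration that deletes objects once they are 30 days old. The helper name `delete_after_days` is an assumption for illustration; the JSON shape (a `rule` list with `action` and `condition`) is the lifecycle configuration format that Cloud Storage accepts.

```python
import json


def delete_after_days(days: int) -> dict:
    """Build a lifecycle configuration that deletes objects older than `days`."""
    if days < 0:
        raise ValueError("days must be non-negative")
    return {
        "rule": [
            {
                "action": {"type": "Delete"},
                "condition": {"age": days},  # age is measured in days
            }
        ]
    }


config = delete_after_days(30)
print(json.dumps(config, indent=2))
```

Saved to a file such as lifecycle.json (a hypothetical name), a configuration like this can be applied with “gsutil lifecycle set lifecycle.json gs://BUCKET_NAME”, where BUCKET_NAME is a placeholder for your bucket.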
Question 8: You have a definition for an instance template that contains a web application. You are asked to deploy the application so that it can scale based on the HTTP traffic it receives. What should you do?
A. Create a VM from the instance template. Create a custom image from the VM’s disk. Export the image to Cloud Storage. Create an HTTP load balancer and add the Cloud Storage bucket as its backend service.
B. Create a VM from the instance template. Create an App Engine application in Automatic Scaling mode that forwards all traffic to the VM.
C. Create a managed instance group based on the instance template. Configure autoscaling based on HTTP traffic and configure the instance group as the backend service of an HTTP load balancer.
D. Create the necessary amount of instances required for peak user traffic based on the instance template. Create an unmanaged instance group and add the instances to that instance group. Configure the instance group as the Backend Service of an HTTP load balancer.
Question 9: You are creating a Kubernetes Engine cluster to deploy multiple pods inside the cluster. All container logs must be stored in BigQuery for later analysis. You want to follow Google-recommended practices. Which two approaches can you take?
A. Turn on Stackdriver Logging during the Kubernetes Engine cluster creation.
B. Turn on Stackdriver Monitoring during the Kubernetes Engine cluster creation.
C. Develop a custom add-on that uses Cloud Logging API and BigQuery API. Deploy the add-on to your Kubernetes Engine cluster.
D. Use the Stackdriver Logging export feature to create a sink to Cloud Storage. Create a Cloud Dataflow job that imports log files from Cloud Storage to BigQuery.
E. Use the Stackdriver Logging export feature to create a sink to BigQuery. Specify a filter expression to export log records related to your Kubernetes Engine cluster only.
Answer 9:
A and E
Notes/Hint 9:
Creating a cluster with Stackdriver Logging option will enable all the container logs to be stored in Stackdriver Logging.
Question 10: You need to create a new Kubernetes Cluster on Google Cloud Platform that can autoscale the number of worker nodes. What should you do?
A. Create a cluster on Kubernetes Engine and enable autoscaling on Kubernetes Engine.
B. Create a cluster on Kubernetes Engine and enable autoscaling on the instance group of the cluster.
C. Configure a Compute Engine instance as a worker and add it to an unmanaged instance group. Add a load balancer to the instance group and rely on the load balancer to create additional Compute Engine instances when needed.
D. Create Compute Engine instances for the workers and the master, and install Kubernetes. Rely on Kubernetes to create additional Compute Engine instances when needed.
Question 11: You have an application server running on Compute Engine in the europe-west1-d zone. You need to ensure high availability and replicate the server to the europe-west2-c zone using the fewest steps possible. What should you do?
A. Create a snapshot from the disk. Create a disk from the snapshot in the europe-west2-c zone. Create a new VM with that disk.
B. Create a snapshot from the disk. Create a disk from the snapshot in the europe-west1-d zone and then move the disk to europe-west2-c. Create a new VM with that disk.
C. Use “gcloud” to copy the disk to the europe-west2-c zone. Create a new VM with that disk.
D. Use “gcloud compute instances move” with parameter “--destination-zone europe-west2-c” to move the instance to the new zone.
Answer 11:
A
Notes/Hint 11:
This makes sure the VM gets replicated in the new zone.
Question 12: Your company has a mission-critical application that serves users globally. You need to select a transactional, relational data storage system for this application. Which two products should you consider?
A. BigQuery
B. Cloud SQL
C. Cloud Spanner
D. Cloud Bigtable
E. Cloud Datastore
Answer 12:
B and C
Notes/Hint 12:
Cloud SQL is a relational and transactional database in the list.
Spanner is a relational and transactional database in the list.
Question 13: You have a Kubernetes cluster with 1 node-pool. The cluster receives a lot of traffic and needs to grow. You decide to add a node. What should you do?
A. Use “gcloud container clusters resize” with the desired number of nodes.
B. Use “kubectl container clusters resize” with the desired number of nodes.
C. Edit the managed instance group of the cluster and increase the number of VMs by 1.
D. Edit the managed instance group of the cluster and enable autoscaling.
Answer 13:
A
Notes/Hint 13:
This resizes the cluster to the desired number of nodes.
Question 14: You created an update for your application on App Engine. You want to deploy the update without impacting your users. You want to be able to roll back as quickly as possible if it fails. What should you do?
A. Delete the current version of your application. Deploy the update using the same version identifier as the deleted version.
B. Notify your users of an upcoming maintenance window. Deploy the update in that maintenance window.
C. Deploy the update as the same version that is currently running.
D. Deploy the update as a new version. Migrate traffic from the current version to the new version.
Question 15: You have created a Kubernetes deployment, called Deployment-A, with 3 replicas on your cluster. Another deployment, called Deployment-B, needs access to Deployment-A. You cannot expose Deployment-A outside of the cluster. What should you do?
A. Create a Service of type NodePort for Deployment A and an Ingress Resource for that Service. Have Deployment B use the Ingress IP address.
B. Create a Service of type LoadBalancer for Deployment A. Have Deployment B use the Service IP address.
C. Create a Service of type LoadBalancer for Deployment A and an Ingress Resource for that Service. Have Deployment B use the Ingress IP address.
D. Create a Service of type ClusterIP for Deployment A. Have Deployment B use the Service IP address.
Question 16: You need to estimate the annual cost of running a Bigquery query that is scheduled to run nightly. What should you do?
A. Use “gcloud query --dry_run” to determine the number of bytes read by the query. Use this number in the Pricing Calculator.
B. Use “bq query --dry_run” to determine the number of bytes read by the query. Use this number in the Pricing Calculator.
C. Use “gcloud estimate” to determine the amount billed for a single query. Multiply this amount by 365.
D. Use “bq estimate” to determine the amount billed for a single query. Multiply this amount by 365.
Answer 16:
B
Notes/Hint 16:
This is the correct way to estimate the yearly BigQuery querying costs.
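The arithmetic behind this estimate is simple enough to sketch in Python. The function name `annual_query_cost` and the price of 5.0 per TiB are illustrative assumptions only; take the real on-demand rate from the Pricing Calculator, and the byte count from the dry run.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def annual_query_cost(bytes_processed: int,
                      price_per_tib: float,
                      runs_per_year: int = 365) -> float:
    """Estimate the yearly cost of a scheduled query from its dry-run byte count."""
    tib_per_run = bytes_processed / TIB
    return tib_per_run * price_per_tib * runs_per_year


# Hypothetical inputs: the dry run reports exactly 1 TiB scanned,
# and the price per TiB is a placeholder, not a quoted rate.
print(annual_query_cost(bytes_processed=TIB, price_per_tib=5.0))  # 1825.0
```

A nightly query is assumed to run 365 times a year; pass a different `runs_per_year` for other schedules.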
Question 17: You want to find out who in your organization has Owner access to a project called “my-project”. What should you do?
A. In the Google Cloud Platform Console, go to the IAM page for your organization and apply the filter “Role:Owner”.
B. In the Google Cloud Platform Console, go to the IAM page for your project and apply the filter “Role:Owner”.
C. Use “gcloud iam list-grantable-role --project my-project” from your Terminal.
D. Use “gcloud iam list-grantable-role” from Cloud Shell on the project page.
Answer 17:
B
Notes/Hint 17:
B is correct because this shows you the Owners of the project.
Question 18: You want to create a new role for your colleagues that will apply to all current and future projects created in your organization. The role should have the permissions of the BigQuery Job User and Cloud Bigtable User roles. You want to follow Google’s recommended practices. How should you create the new role?
A. Use “gcloud iam combine-roles --global” to combine the 2 roles into a new custom role.
B. For one of your projects, in the Google Cloud Platform Console under Roles, select both roles and combine them into a new custom role. Use “gcloud iam promote-role” to promote the role from a project role to an organization role.
C. For all projects, in the Google Cloud Platform Console under Roles, select both roles and combine them into a new custom role.
D. For your organization, in the Google Cloud Platform Console under Roles, select both roles and combine them into a new custom role.
Answer 18:
D
Notes/Hint 18:
D is correct because this creates a new role with the combined permissions on the organization level.
Question 19: You work in a small company where everyone should be able to view all resources of a specific project. You want to grant them access following Google’s recommended practices. What should you do?
A. Create a script that uses “gcloud projects add-iam-policy-binding” for all users’ email addresses and the Project Viewer role.
B. Create a script that uses “gcloud iam roles create” for all users’ email addresses and the Project Viewer role.
C. Create a new Google Group and add all users to the group. Use “gcloud projects add-iam-policy-binding” with the Project Viewer role and Group email address.
D. Create a new Google Group and add all members to the group. Use “gcloud iam roles create” with the Project Viewer role and Group email address.
Question 20: You need to verify the assigned permissions in a custom IAM role. What should you do?
A. Use the GCP Console, IAM section to view the information.
B. Use the “gcloud init” command to view the information.
C. Use the GCP Console, Security section to view the information.
D. Use the GCP Console, API section to view the information.
Answer 20:
A
Notes/Hint 20:
A is correct because this is the correct console area to view permission assigned to a custom role in a particular project.
Question 21: Your coworker created a deployment for your application container. You can see the deployment under Workloads in the console. They’re out for the rest of the week, and your boss needs you to complete the setup by exposing the workload. What’s the easiest way to do that?
A. Create a new Service that points to the existing deployment.
B. Create a new DaemonSet.
C. Create a Global Load Balancer that points to the pod in the deployment.
D. Create a Static IP Address Resource for the Deployment.
Question 22: Your team is working on designing an IoT solution. There are thousands of devices that need to send periodic time series data for processing. Which services should be used to ingest and store the data?
A. Pub/Sub, Datastore
B. Pub/Sub, Dataproc
C. Dataproc, Bigtable
D. Pub/Sub, Bigtable
Answer 22:
D
Notes/Hint 22:
Pub/Sub is able to handle the ingestion, and Bigtable is a great solution for time series data.
Question 23: You have an App Engine application running in us-east1. You’ve noticed 90% of your traffic comes from the West Coast. You’d like to change the region. What’s the best way to change the App Engine region?
A. Use the gcloud app region set command and supply the name of the new region.
B. Contact Google Cloud Support and request the change.
C. From the console, under the App Engine page, click edit, and change the region drop-down.
D. Create a new project and create an App Engine instance in us-west2.
Question 24: You’ve uploaded some static web assets to a public storage bucket for the developers. However, they’re not able to see them in the browser due to what they called “CORS errors”. What’s the easiest way to resolve the errors for the developers?
A. Advise the developers to adjust the CORS configuration inside their code.
B. Use the gsutil cors set command to set the CORS configuration on the bucket.
C. Use the gsutil set cors command to set the CORS configuration on the bucket.
D. Use the gsutil set cors command to set the CORS configuration on the object.
Answer 24:
B
Notes/Hint 24:
CORS settings are applied to a bucket, not an object. You can set the CORS configuration on the bucket, allowing the objects to be viewable from the required domains.
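For reference, the file passed to the gsutil cors set command is a JSON list of CORS rules. The Python sketch below generates one; the helper name `cors_config`, the origin https://dev.example.com, and the one-hour max age are assumptions for illustration.

```python
import json


def cors_config(origins,
                methods=("GET",),
                response_headers=("Content-Type",),
                max_age_seconds=3600):
    """Build a bucket CORS configuration with a single rule."""
    return [
        {
            "origin": list(origins),                     # sites allowed to fetch objects
            "method": list(methods),                     # HTTP methods allowed
            "responseHeader": list(response_headers),    # headers exposed to the browser
            "maxAgeSeconds": max_age_seconds,            # preflight cache lifetime
        }
    ]


# Hypothetical developer origin; replace with the real site.
config = cors_config(["https://dev.example.com"])
print(json.dumps(config, indent=2))
```

Saved as cors.json (a hypothetical file name), this would be applied with “gsutil cors set cors.json gs://BUCKET_NAME”, where BUCKET_NAME is a placeholder.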
Question 25: You’ve uploaded some PDFs to a public bucket. When users browse to the documents, they’re downloaded rather than viewed in the browser. How can we ensure that the PDFs are viewed in the browser?
A. This is a browser setting and not something that can be changed.
B. Use the gsutil set file-type pdf command.
C. Set the Content metadata for the object to “application/pdf”.
D. Set the Content-Type metadata for the object to “application/pdf”.
Question 26: You’ve been tasked with getting all of your team’s public SSH keys onto all of the instances of a particular project. You’ve collected them all. With the fewest steps possible, what is the simplest way to get the keys deployed?
A. Use the gcloud compute ssh command to upload all the keys
B. Format all of the keys as needed and then, using the user interface, upload each key one at a time.
C. Add all of the keys into a file that’s formatted according to the requirements. Use the gcloud compute project-info add-metadata command to upload the keys.
D. Add all of the keys into a file that’s formatted according to the requirements. Use the gcloud compute instances add-metadata command to upload the keys to each instance
Answer 26:
C
Notes/Hint 26:
This will upload the keys as project metadata, which allows SSH access for the users whose keys were uploaded.
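The "file formatted according to the requirements" holds one USERNAME:PUBLIC_KEY entry per line. A minimal Python sketch for building it follows; the function name `format_ssh_keys`, the usernames, and the key strings are placeholders, not real keys.

```python
from typing import Dict


def format_ssh_keys(keys: Dict[str, str]) -> str:
    """Return the ssh-keys metadata text: one USERNAME:PUBLIC_KEY entry per line."""
    lines = [f"{user}:{key.strip()}" for user, key in sorted(keys.items())]
    return "\n".join(lines) + "\n"


# Placeholder keys for two hypothetical team members.
team_keys = {
    "bob": "ssh-ed25519 AAAAEXAMPLEKEY2 bob",
    "alice": "ssh-ed25519 AAAAEXAMPLEKEY1 alice",
}
content = format_ssh_keys(team_keys)
print(content, end="")
```

The resulting file can then be passed to “gcloud compute project-info add-metadata --metadata-from-file=ssh-keys=KEY_FILE”, where KEY_FILE is a placeholder path. Existing project-wide keys should be included in the file as well, since setting the ssh-keys value is expected to overwrite what was there before.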
Question 27: What must you do before you create an instance with a GPU? (Pick at least 2)
A. You must only select the GPU driver type. The correct base image is selected automatically.
B. You must select which boot disk image you want to use for the instance.
C. Nothing. GPU drivers are automatically included with the boot disk images.
D. You must make sure the selected image has the appropriate GPU driver installed.
Question 30: Your security team has been reluctant to move to the cloud because they don’t have the level of network visibility they’re used to. Which feature might help them to gain insights into your Google Cloud network?
A. Routes
B. Subnets
C. Flow Logs
D. Firewall rules
Answer 30:
C
Notes/Hint 30:
Flow logs are great for gaining insights into what’s happening on a network. They provide a sample of the flows to and from instances.
Question 31: You’re in charge of setting up a Stackdriver account to monitor 3 separate projects. Which of the following is a Google best practice?
A. Use the existing project with the least resources as the host project for the Stackdriver account.
B. Use the existing project with the most resources as the host project for the Stackdriver account.
C. Create a new, empty project to use as the host project for the Stackdriver account.
D. Use one of the existing projects as the host project for the Stackdriver account.
Question 32: You’re attempting to set up a File based Billing Export. Which of the following components are required?
A. A Cloud Storage bucket.
B. A BigQuery dataset.
C. A report prefix.
D. A Budget and at least one alert.
Answer 32:
A and C
Notes/Hint 32:
A Cloud Storage bucket is required in order to have a location for the files to be exported to. A report prefix is the portion of the file name that’s prepended to each file.
Question 33: You’ve installed the Google Cloud SDK natively on your Mac. You’d like to install the kubectl component via the Google Cloud SDK. Which command would accomplish this?
A. sudo apt-get install kubectl
B. gcloud components install kubectl
C. pip install kubectl
D. brew install kubectl
Answer 33:
B
Notes/Hint 33:
For Windows and Mac, you can use the built-in component manager.
Question 34: You’re attempting to set the default Compute Engine zone with the Cloud SDK. Which of the following commands would work?
A. gcloud config set compute/zone us-east1-c
B. gcloud set compute\zone us-east1
C. gcloud set compute/zone us-east1
D. gcloud config set compute\zone us-east1
Answer 34:
A
Notes/Hint 34:
gcloud config set compute/zone us-east1-c works perfectly
Question 35: You’ve been hired as a Cloud Engineer for a 2-year-old startup company. Recently they’ve had a bit of turnover, and several engineers have left the company to pursue different projects. Shortly after one of them leaves, it is found that a core project seems to have been deleted. What is the most likely cause of the project’s deletion?
A. You’ve been the victim of the latest malware that deletes one project per hour until you pay them to stop.
B. One of the engineers intentionally deleted the project out of spite.
C. The project was created by one of the engineers and not attached to the organization.
D. A failed attempt to pay the bill resulted in Google deleting the project.
Question 36: You’re using Stackdriver to set up some alerts. You want to reuse your existing REST-based notification tools that your ops team has created. You want the setup to be as simple as possible to configure and maintain. Which notification option would be the best option?
A. Use a Slack bot to listen for messages posted by Google.
B. Send it to an email account that is being polled by a custom process that can handle the notification.
C. Send notifications via SMS and use a custom app to forward them to the REST API.
D. Webhooks
Answer 36:
D
Notes/Hint 36:
Webhooks would allow you to easily send the notification to an HTTP(S) endpoint. Given the above scenario, this is the best option for something custom.
Question 37: A member of the finance team informed you that one of the projects is using the old billing account. What steps should you take to resolve the problem?
A. Submit a support ticket requesting the change.
B. Go to the Billing page, locate the list of projects, find the project in question and select Change billing account. Then select the correct billing account and save.
C. Go to the Project page; expand the Billing tile; select the Billing Account option; select the correct billing account and save.
D. Delete the project and recreate it with the correct billing account.
Answer 37:
B
Notes/Hint 37:
Go to the Billing page, locate the list of projects, find the project in question and select Change billing account. Then select the correct billing account and save.
Question 38: You’re using a self-serve Billing Account to pay for your 2 projects. Your billing threshold is set to $1000.00 and between the two projects you’re spending roughly 50 dollars per day. It has been 18 days since you were last charged. Given the above data, when will you likely be charged next?
A. On the first day of the next month.
B. In 2 days when you’ll hit your billing threshold.
C. On the thirtieth day of the month.
D. In 12 days, making it 30 days since the previous payment.
Answer 38:
B
Notes/Hint 38:
With Self-serve, you pay when you hit the billing threshold or every 30 days; whichever happens first. Given the scenario assumes $50 per day, you’ll hit the spending threshold in 2 more days.
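The "threshold or 30 days, whichever comes first" rule can be checked with a short Python sketch. The function name `days_until_next_charge` is an assumption for illustration, and amounts are whole dollars so the arithmetic stays in integers.

```python
def days_until_next_charge(threshold: int,
                           daily_spend: int,
                           days_since_last_charge: int,
                           cycle_days: int = 30) -> int:
    """Days until the next charge: threshold hit or cycle end, whichever is first."""
    days_to_cycle_end = max(cycle_days - days_since_last_charge, 0)
    if daily_spend <= 0:
        return days_to_cycle_end  # the threshold is never reached
    spent = daily_spend * days_since_last_charge
    remaining = max(threshold - spent, 0)
    days_to_threshold = -(-remaining // daily_spend)  # ceiling division
    return min(days_to_threshold, days_to_cycle_end)


# Scenario from the question: $1000 threshold, $50/day, 18 days since last charge.
# $900 has accrued, so the threshold is hit in 2 days, before the 12 days
# left in the 30-day cycle.
print(days_until_next_charge(1000, 50, 18))  # 2
```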
Question 39: You have 3 Cloud Storage buckets that all store sensitive data. Which grantees should you audit to ensure that these buckets are not public?
A. allUsers
B. allAuthenticatedUsers
C. publicUsers
D. allUsers and allAuthenticatedUsers
Answer 39:
D
Notes/Hint 39:
Either of these tokens represents public users. allAuthenticatedUsers represents a user with a Google account. They don’t need to be part of your organization. Neither token should be used to grant permissions unless the bucket is truly public.
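An audit like this can be automated by scanning each bucket’s IAM policy for the two public tokens. The Python sketch below works on a policy already loaded as a dictionary with the standard `bindings` / `role` / `members` layout; the function name `public_bindings` and the sample policy are assumptions for illustration.

```python
PUBLIC_MEMBERS = {"allUsers", "allAuthenticatedUsers"}


def public_bindings(policy: dict) -> list:
    """Return (role, member) pairs that expose the resource publicly."""
    findings = []
    for binding in policy.get("bindings", []):
        for member in binding.get("members", []):
            if member in PUBLIC_MEMBERS:
                findings.append((binding.get("role"), member))
    return findings


# Hypothetical policy for one bucket.
sample_policy = {
    "bindings": [
        {"role": "roles/storage.objectViewer", "members": ["allUsers"]},
        {"role": "roles/storage.admin", "members": ["user:alice@example.com"]},
    ]
}
print(public_bindings(sample_policy))
# [('roles/storage.objectViewer', 'allUsers')]
```

A policy in this shape can be obtained for a bucket with “gsutil iam get gs://BUCKET_NAME” (BUCKET_NAME is a placeholder); an empty result means neither public token is granted a role.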
Question 40: You’ve been asked to help onboard a new member of the big-data team. They need full access to BigQuery. Which type of role would be the most efficient to set up while following the principle of least privilege?
A. Primitive Role
B. Custom Role
C. Managed Role
D. Predefined Role
Answer 40:
D
Notes/Hint 40:
Predefined roles would work great for this use case because they’re specific to resources. BigQuery has several predefined roles including a “BigQuery Admin” role.
Question 41: Your organization is a financial company that needs to store audit log files for 3 years. Your organization has hundreds of Google Cloud projects. You need to implement a cost-effective approach for log file retention. What should you do?
A. Create an export to the sink that saves logs from Cloud Audit to BigQuery.
B. Create an export to the sink that saves logs from Cloud Audit to a Coldline Storage bucket.
C. Write a custom script that uses logging API to copy the logs from Stackdriver logs to BigQuery.
D. Export these logs to Cloud Pub/Sub and write a Cloud Dataflow pipeline to store logs to Cloud SQL.
Question 42: You want to run a single caching HTTP reverse proxy on GCP for a latency-sensitive website. This specific reverse proxy consumes almost no CPU. You want to have a 30-GB in-memory cache, and need an additional 2 GB of memory for the rest of the processes. You want to minimize cost. How should you run this reverse proxy?
A. Create a Cloud Memorystore for Redis instance with 32-GB capacity.
B. Run it on Compute Engine, and choose a custom instance type with 6 vCPUs and 32 GB of memory.
C. Package it in a container image, and run it on Kubernetes Engine, using n1-standard-32 instances as nodes.
D. Run it on Compute Engine, choose the instance type n1-standard-1, and add an SSD persistent disk of 32 GB.
Answer 42: B
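Why 6 vCPUs for a proxy that uses almost no CPU? A minimal sketch of the reasoning, assuming that N1 custom machine types allow at most 6.5 GB of memory per vCPU (without extended memory) and require either 1 vCPU or an even number of vCPUs. Both limits are assumptions here and should be checked against the current Compute Engine documentation.

```python
import math

MAX_GB_PER_VCPU = 6.5  # assumed N1 custom machine type limit

def min_custom_vcpus(memory_gb, max_gb_per_vcpu=MAX_GB_PER_VCPU):
    """Smallest vCPU count that can carry the requested memory."""
    needed = math.ceil(memory_gb / max_gb_per_vcpu)
    # Assumed rule: vCPU count must be 1 or an even number.
    if needed > 1 and needed % 2:
        needed += 1
    return needed

# 30 GB cache + 2 GB for other processes.
print(min_custom_vcpus(30 + 2))  # 6
```

Under these assumptions, 32 GB needs at least 5 vCPUs, which rounds up to 6, so option B is the cheapest Compute Engine shape that fits the memory requirement.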
Question 43: You are hosting an application on bare-metal servers in your own data center. The application needs access to Cloud Storage. However, security policies prevent the servers hosting the application from having public IP addresses or access to the internet. You want to follow Google-recommended practices to provide the application with access to Cloud Storage. What should you do?
A. 1. Use nslookup to get the IP address for storage.googleapis.com. 2. Negotiate with the security team to be able to give a public IP address to the servers. 3. Only allow egress traffic from those servers to the IP addresses for storage.googleapis.com.
B. 1. Using Cloud VPN, create a VPN tunnel to a Virtual Private Cloud (VPC) in Google Cloud. 2. In this VPC, create a Compute Engine instance and install the Squid proxy server on this instance. 3. Configure your servers to use that instance as a proxy to access Cloud Storage.
C. 1. Use Migrate for Compute Engine (formerly known as Velostrata) to migrate those servers to Compute Engine. 2. Create an internal load balancer (ILB) that uses storage.googleapis.com as backend. 3. Configure your new instances to use this ILB as proxy.
D. 1. Using Cloud VPN or Interconnect, create a tunnel to a VPC in Google Cloud. 2. Use Cloud Router to create a custom route advertisement for 199.36.153.4/30. Announce that network to your on-premises network through the VPN tunnel. 3. In your on-premises network, configure your DNS server to resolve *.googleapis.com as a CNAME to restricted.googleapis.com.
Answer 43: D
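Option D advertises the 199.36.153.4/30 range to the on-premises network. The short sketch below, using only the Python standard library, lists the addresses in that range and checks whether a resolved address falls inside it. The helper names are illustrative.

```python
import ipaddress

# Range named in option D for restricted.googleapis.com.
RESTRICTED_RANGE = ipaddress.ip_network("199.36.153.4/30")

def restricted_vips():
    """All addresses covered by the advertised /30 route."""
    return [str(ip) for ip in RESTRICTED_RANGE]

def is_restricted_vip(address):
    """True if the address falls inside the advertised range."""
    return ipaddress.ip_address(address) in RESTRICTED_RANGE

print(restricted_vips())
print(is_restricted_vip("199.36.153.6"))
```

A check like this can help confirm that on-premises DNS is really returning addresses reachable through the tunnel rather than public ones.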
Question 44: You want to deploy an application on Cloud Run that processes messages from a Cloud Pub/Sub topic. You want to follow Google-recommended practices. What should you do?
A. 1. Create a Cloud Function that uses a Cloud Pub/Sub trigger on that topic. 2. Call your application on Cloud Run from the Cloud Function for every message.
B. 1. Grant the Pub/Sub Subscriber role to the service account used by Cloud Run. 2. Create a Cloud Pub/Sub subscription for that topic. 3. Make your application pull messages from that subscription.
C. 1. Create a service account. 2. Give the Cloud Run Invoker role to that service account for your Cloud Run application. 3. Create a Cloud Pub/Sub subscription that uses that service account and uses your Cloud Run application as the push endpoint.
D. 1. Deploy your application on Cloud Run on GKE with the connectivity set to Internal. 2. Create a Cloud Pub/Sub subscription for that topic. 3. In the same Google Kubernetes Engine cluster as your application, deploy a container that takes the messages and sends them to your application.
Answer 44: C
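With a push subscription as described in option C, Pub/Sub sends each message to the Cloud Run endpoint as an HTTP POST whose JSON body wraps the base64-encoded payload. The sketch below shows one way to unpack that body using only the standard library; the function name and the sample values are hypothetical, and a real service would also need an HTTP framework and should return a success status to acknowledge the message.

```python
import base64
import json

def decode_push_envelope(body):
    """Extract id, decoded data and attributes from a Pub/Sub push body."""
    envelope = json.loads(body)
    message = envelope["message"]
    # The payload arrives base64-encoded in the "data" field.
    data = base64.b64decode(message.get("data", "")).decode("utf-8")
    return {
        "id": message.get("messageId"),
        "data": data,
        "attributes": message.get("attributes", {}),
    }

# Hypothetical request body, built here only for illustration.
sample_body = json.dumps({
    "message": {
        "data": base64.b64encode(b"hello").decode("ascii"),
        "messageId": "123",
        "attributes": {"source": "demo"},
    },
    "subscription": "projects/my-project/subscriptions/my-sub",
})

print(decode_push_envelope(sample_body))
```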
Question 45: You need to deploy an application, which is packaged in a container image, in a new project. The application exposes an HTTP endpoint and receives very few requests per day. You want to minimize costs. What should you do?
A. Deploy the container on Cloud Run.
B. Deploy the container on Cloud Run on GKE.
C. Deploy the container on App Engine Flexible.
D. Deploy the container on GKE with cluster autoscaling and horizontal pod autoscaling enabled.
Answer 45: A
Question 46: Your company has an existing GCP organization with hundreds of projects and a billing account. Your company recently acquired another company that also has hundreds of projects and its own billing account. You would like to consolidate all GCP costs of both GCP organizations onto a single invoice. You would like to consolidate all costs as of tomorrow. What should you do?
A. Link the acquired company’s projects to your company’s billing account.
B. Configure the acquired company’s billing account and your company’s billing account to export the billing data into the same BigQuery dataset.
C. Migrate the acquired company’s projects into your company’s GCP organization. Link the migrated projects to your company’s billing account.
D. Create a new GCP organization and a new billing account. Migrate the acquired company’s projects and your company’s projects into the new GCP organization and link the projects to the new billing account.
Question 47: You built an application on Google Cloud that uses Cloud Spanner. Your support team needs to monitor the environment but should not have access to table data. You need a streamlined solution to grant the correct permissions to your support team, and you want to follow Google-recommended practices. What should you do?
A. Add the support team group to the roles/monitoring.viewer role
B. Add the support team group to the roles/spanner.databaseUser role.
C. Add the support team group to the roles/spanner.databaseReader role.
D. Add the support team group to the roles/stackdriver.accounts.viewer role.
Answer 47: A
Question 48: For analysis purposes, you need to send all the logs from all of your Compute Engine instances to a BigQuery dataset called platform-logs. You have already installed the Cloud Logging agent on all the instances. You want to minimize cost. What should you do?
A. 1. Give the BigQuery Data Editor role on the platform-logs dataset to the service accounts used by your instances. 2. Update your instances’ metadata to add the following value: logs-destination: bq://platform-logs.
B. 1. In Cloud Logging, create a logs export with a Cloud Pub/Sub topic called logs as a sink. 2. Create a Cloud Function that is triggered by messages in the logs topic. 3. Configure that Cloud Function to drop logs that are not from Compute Engine and to insert Compute Engine logs in the platform-logs dataset.
C. 1. In Cloud Logging, create a filter to view only Compute Engine logs. 2. Click Create Export. 3. Choose BigQuery as Sink Service, and the platform-logs dataset as Sink Destination.
D. 1. Create a Cloud Function that has the BigQuery User role on the platform-logs dataset. 2. Configure this Cloud Function to create a BigQuery Job that executes this query: INSERT INTO dataset.platform-logs (timestamp, log) SELECT timestamp, log FROM compute.logs WHERE timestamp > DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) 3. Use Cloud Scheduler to trigger this Cloud Function once a day.
Answer 48: C
Question 49: You are using Deployment Manager to create a Google Kubernetes Engine cluster. Using the same Deployment Manager deployment, you also want to create a DaemonSet in the kube-system namespace of the cluster. You want a solution that uses the fewest possible services. What should you do?
A. Add the cluster’s API as a new Type Provider in Deployment Manager, and use the new type to create the DaemonSet.
B. Use the Deployment Manager Runtime Configurator to create a new Config resource that contains the DaemonSet definition.
C. With Deployment Manager, create a Compute Engine instance with a startup script that uses kubectl to create the DaemonSet.
D. In the cluster’s definition in Deployment Manager, add a metadata that has kube-system as key and the DaemonSet manifest as value.
Question 50: You are building an application that will run in your data center. The application will use Google Cloud Platform (GCP) services like AutoML. You created a service account that has appropriate access to AutoML. You need to enable authentication to the APIs from your on-premises environment. What should you do?
A. Use service account credentials in your on-premises application.
B. Use gcloud to create a key file for the service account that has appropriate permissions.
C. Set up direct interconnect between your data center and Google Cloud Platform to enable authentication for your on-premises applications.
D. Go to the IAM & admin console, grant a user account permissions similar to the service account permissions, and use this user account for authentication from your data center.
Yes, Google App Engine (GAE), a fully managed PaaS, is well worth it if you:
want a ready-made platform to quickly build web applications and mobile backends at cloud scale with a very low starting cost
want to get rid of the burden of managing and provisioning infrastructure, application security, and scaling
are fine with almost no control over the web server and application software such as the database, file storage, and messaging mechanism. You have to live with what GAE offers and choose from the options available. Forget about customization!
can live with a fixed set of language runtimes such as Node.js, Java, Ruby, C#, Go, Python, …
Google App Engine is a PaaS (platform as a service) used to deploy large-scale web and mobile apps. Sites and companies that use it include:
Disney, Snapchat, YouTube, Accenture, Practo, Samba Tech, Buddy, Kam Bam, Coca-Cola, The New York Times, Stack
It is one of the most trusted cloud platforms and is used by top companies. We will likely see many more sites deployed on Google App Engine for web and app hosting.
Well, I believe it because I met and discussed it with some of the Google engineers responsible for that area. And I am not special in that respect: it’s not a secret. Here’s the missing link: Google runs KVM in a container. To be crystal clear, a container is not an actual Linux construct. There is no Linux system call you can make to create a container. Instead, it is the term we give to the usage of Linux primitives like namespaces and cgroups to partition applications into their own Linux-level virtual compute space. Except we don’t call it that, we call it a container. So, at the lowest level, Google’s infrastructure schedules containers. To create a virtual machine, Google runs KVM in one of those containers. So the document you link to is absolutely valid *and* KVM runs in a containe…
No, but to be honest, I think that’s what their gaming system is for. Reverse marketing. They don’t expect it to be a hit, but if they’re almost good enough for gaming, then they’re certainly good enough for me. They’re not aiming for gamers, but everyone else. There is definitely a market for public VDI. I was working on that concept ten years ago, but I didn’t have the resources to pull it off. Back then, watching YouTube videos on the client was not feasible. These days, you could probably kill the whole PC industry if you had the resources. If Google develops something like JackPC that is able to connect to their Stadia and provide a VM, I would recommend it to my father, but I wouldn’t use it, because I still have a long life to live and I’m not giving it to Google. But if they made i…
Google runs Linux on its hardware (AKA “Linux on bare metal”). As part of that Linux image, it has its own Linux container implementation based on cgroups and namespaces. In Google Cloud Platform, it then runs KVM inside a Linux container, and the VMs run on top of KVM. So the hierarchy is VM->KVM->Linux->Bare metal
I suggest reading this document thoroughly, so that you can see that logging in to Compute Engine instances is not that tedious 🙂 Connecting to instances using advanced methods | Compute Engine Documentation | Google Cloud
Let’s take two variables (there could be more): ease of administration and constraints of use. App Engine: on your side there is almost no administration. You write code (with somewhat limited possibilities), upload it, and basically have no other major concerns (well, maybe how to lower your bills if your app gets popular); App Engine handles all the rest (storage, scaling, installing programs, etc.). Compute Engine is a virtual machine with a preinstalled OS, and you can do whatever you want with it. That means you have to install all programs yourself, but you are not limited in what you can do with it. Container Engine is another level above Compute Engine, i.e. it’s a cluster of several Compute Engine instances that can be centrally managed. There is also one level between GAE and GCE:…
Both of them have almost the same price, but they offer different types of discounts. For instance, AWS has a “Reserved Instance” discount model for 1- or 3-year purchases. You have to prepay almost 1/3 of the period, and you’ll get a 30–60% discount depending on the period you choose and the EC2 instance type you have. Google Cloud has a monthly discount model that applies automatically if you use a Compute Engine instance for more than 10 days in a month. If you run the instance for the whole month, you may get a 30% discount without prepaying anything. So both of them have discounts, but with different payment models. As an alternative, you can check out DigitalOcean for affordable prices.
They’re three different approaches to running services on virtual machines. App Engine is designed around automatic scaling of services. There are actually two entirely different flavors of App Engine: the “standard environment,” which is a sandbox, and the “flexible environment,” which is a more traditional (though still not traditional!) VM running in a Docker container. Both versions are designed to automatically spawn more instances of your service in response to increases in load, and isolate you from a lot of hard SRE problems. Compute Engine is just plain old virtual machines. If you want to run an instance of a VM with a certain amount of memory and hard drive space, running under a given version of Linux, and not have to worry about physical equipment, Compute Engine is for you. (Mor…
I do not understand why the question asks about both EC2/Compute Engine and Cloud Storage/S3. Cloud Storage/S3 is used to serve static websites. EC2/Compute Engine is typically used to serve dynamic content (however, it can serve static websites too). I would try to figure out which of these suits your use case better. In both cases, however, GCP is cheaper (you also get credits to use it free for one year). They even have a page where you can calculate how much you save by moving from AWS to GCP → Google Cloud Platform Pricing Calculator | Google Cloud Platform | Google Cloud (The only case where I have seen GCP be more expensive is when it comes to hosting proprietary licensed DBs like MS SQL).
We started offering our Hadoop service on GCE. We ran Hadoop workloads with a root persistent disk (storage over the network) and an additional 500 GB persistent disk. We consistently observed better performance than on other leading cloud providers, where we used the instances’ local disks. A few months back, GCE was offering scratch disks. They decided to replace scratch disks with persistent disks when they went GA. This clearly shows there was enough confidence that persistent disks performed well compared to scratch disks. (If that were not the case, Google would not have made this bold move and would have continued offering scratch disks as well, like AWS.) This performance must partly be attributed to their networking stack. It’s considered the best out there in the…
Google has been building and using its own private cloud since the start of the company. They have always been known for setting the standards in many industries, and public cloud is where things are happening. For years, people have wanted to use their cloud technology (Colossus, BigTable, GAE, etc.). Strategically, Google knows that if they focus more on providing and marketing their public cloud based on what they currently use, people who look up to them would see it as the standard, and that’s all good for business. Another reason is that, with recent acquisitions (for instance, Nest), Google realized that the successful startups they acquire use AWS more than GCP. Telling the existing development teams to migrate to GCP will disrupt the team (just like Microsoft’s acquisition of Minecraft…
I strongly suggest moving your installation to Google App Engine instead. It’s easy, it will reduce your maintenance costs, and it will autoscale when needed. As for a CDN, you can host static files on Google Cloud Storage, which is already served through Google’s CDN behind the scenes. To run WordPress on Google App Engine there are simple tutorials like this: GoogleCloudPlatform/php-docs-samples I did this setup many times with great success. I also wrote a small tutorial to speed up your WP installation with memcache (which comes as a free service in Google App Engine). giona69/wordpress-made-extremely-fast Good work!
I just want to explain this in a way that a person who doesn’t have any prior knowledge of containers and clusters can understand what Kubernetes is and what it does. First, let’s understand why we need containers. * Let’s say you want to gift a cycle to your kid on his birthday. Now suppose the cycle is delivered to you with the parts separated and a manual that describes how to attach them. Well, you may end up messing things up. * Instead, what if the cycle itself is ready-made, packed in a container, and delivered to your home address, with no manual intervention required? Ain’t that awesome? * * The individual parts of the cycle are the dependencies of the project, which may work in one place and not in another. * * The cycle company is the developers’ hub, and the client here is the one using our product. * * To solve thi…
Indeed, Kubernetes and Docker are two different things that are related to each other. Let’s have a look. After getting used to Docker, you realize that there should be ‘docker run’ commands or something like that to run many containers across heterogeneous hosts. This is where Kubernetes, or k8s, comes in. It solved many problems that Docker had. Kubernetes is based on Borg, Google’s container management system, and is written in Go. It is a COE (Container Orchestration Environment) for Docker containers. The function of a COE is to make sure that the application is launched and running properly. If a container fails, Kubernetes will spin up another container. It provides a complete system for running many containers across multiple hosts. It has a load balancer integrated and uses etc…
Kubernetes is a vendor-agnostic cluster and container management tool, open-sourced by Google in 2014. It provides a “platform for automating deployment, scaling, and operations of application containers across clusters of hosts”. Above all, this lowers the cost of cloud computing and simplifies operations and architecture. Kubernetes and the Need for Containers Before we explain what Kubernetes does, we need to explain what containers are and why people are using them. A container is like a mini virtual machine, although it shares the host’s kernel rather than running its own. It is small, as it does not have device drivers and all the other components of a regular virtual machine. Docker is by far the most popular container platform, and it was built for Linux. Microsoft has also added containers to Windows, because they have become so popular. The bes…
Despite the short time that Kubernetes has been on the market, this tool has become a reference for the management and allocation of service packages (containers) within a cluster. Initially developed by Google, Kubernetes emerged as an open-source alternative to the Borg and Omega systems, being officially launched in 2015. What is Kubernetes? Kubernetes is an open-source tool, also called an orchestrator, which is used to distribute and organize workloads in the form of containers. Its purpose is to maintain the availability and accessibility of existing resources to customers, as well as stability when running multiple services simultaneously. Through this scheme, Kubernetes makes it possible for numerous servers of different typ…
There are countless debates, discussions and social chatter about Kubernetes and Docker. Nevertheless, Kubernetes and Docker Swarm are not rivals! Both have their own pros and cons and can be used depending on your application requirements. Benefits & drawbacks of Kubernetes Benefits of Kubernetes: * Kubernetes is backed by the Cloud Native Computing Foundation (CNCF). * Kubernetes has an impressively large community among container orchestration tools: over 50,000 commits and 1200 contributors. * Kubernetes is an open source and modular tool that works with any OS. * Kubernetes provides easy service organization with pods (Start your Kubernetes journey to resilient and highly available deployments – Free consultation on Kubernetes) Drawbacks of Kubernetes * When doing it yourself, K…
If you already ‘know’ Docker containers, then spin up a Kubernetes system (not as hard as you think: check out installing Minikube), read through the docs for Kubernetes, and start trying out some of the capabilities for yourself. The (free) Katacoda browser-based learning platform has a number of ‘scenarios’ that run on a pre-deployed Kubernetes system. Follow this link to Katacoda and then search for “Kubernetes.” Note that you can copy-paste your way through most of the exercises in a minute or two; it is on you to read and understand what you are pasting. Online resources such as the “Awesome Kubernetes” or “Awesome Docker” lists (you do need to have some understanding of Docker to work with Kubernetes) will give you a pile of options, free and paid, to get into greater…
When Linux containers appeared at the time of LXC, a lot of people in the IT world saw them as something marvelous: they offered a way of packaging software with all its dependencies and running it on any other Linux machine. Much like virtual machines, but without the performance losses. But the truth was that they weren’t widely used; they required some plumbing to make them work, and there was no standard way to distribute the images. Then Docker appeared, adding to existing container technologies a workflow for building and sharing images and a common interface to start containers. This popularized these technologies, but they still weren’t widely used for production systems, mainly because it was not so advantageous to have just another packaging system for production. And t…
There is no one way to compare because they are mostly different things. That said, I’ll first try to define the need for each one of these and link them together. Let’s start with the bottom of the stack. You need infrastructure to run your servers. What could you go with? You can use a VPS provider like DigitalOcean, or use AWS. What if, for some non-technical reason, you can’t use AWS? For instance, a legal compliance requirement states that the data I store and the servers I run must be in the same geography as the customers I serve, and AWS does not have a region there. This is where OpenStack comes in. It is a platform to manage your infrastructure. Think of it as an open source implementation of AWS which you can run on bare metal data centers. Next, we move up the stack. We want an…
Kubernetes (also known as K8s) is a production-grade container orchestration system. It is an open source cluster management system initially developed by three Google employees during the summer of 2014; it grew exponentially and became the first project donated to the Cloud Native Computing Foundation (CNCF). It is basically an open source toolkit for building a fault-tolerant, scalable platform designed to automate and centrally manage containerized applications. With Kubernetes you can manage your containerized application more efficiently. Kubernetes is a HUGE project with a lot of code and functionality. The primary responsibility of Kubernetes is container orchestration. That means making sure that all the containers that execute various workloads are sc…
The basic idea of Kubernetes is to further abstract machines, storage, and networks away from their physical implementation. So it is a single interface to deploy containers to all kinds of clouds, virtual machines, and physical machines. Container Orchestration & Kubernetes Containers behave like lightweight virtual machines. They are scalable and isolated. The containers are linked together for setting security policies, limiting resource utilization, etc. If your application infrastructure is similar to the image shared below, then container orchestration is necessary. It might be an Nginx/Apache + PHP/Python/Ruby/Node.js app running on a few containers, communicating with the replicated database. Container orchestration wi…
As seen in the following diagram, Kubernetes follows a client-server architecture, in which the master is installed on one machine and the nodes run on separate Linux machines. The key components of master and node are defined in the following section. Kubernetes – Master Machine Components Following are the components of the Kubernetes master machine. etcd: It stores the configuration information that can be used by each of the nodes in the cluster. It is a highly available, distributed key-value store that can be spread across multiple nodes. Because it may hold sensitive information, it should be accessible only by the Kubernetes API server. API Server: The Kubernetes API server provides all the operations on the cluster usi…
Kubernetes service discovery finds services through two approaches: 1. Using environment variables that follow the same conventions as those created by Docker links. 2. Using DNS to resolve the service names to the service’s IP address. Environment Variables Kubernetes injects environment variables for each service and each port exposed by the service. This makes it easy to deploy containers that use Docker links to find their dependencies. For example, if we are exposing a RabbitMQ service, we can locate it using the RABBITMQ_SERVICE_SERVICE_HOST and RABBITMQ_SERVICE_SERVICE_PORT variables. Other environment variables are also exposed to support this. The easiest way to find out what environment variables are exposed are…
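The environment-variable convention can be sketched as follows: the service name is upper-cased, dashes become underscores, and `_SERVICE_HOST` / `_SERVICE_PORT` are appended. The service name `rabbitmq-service` is an assumption chosen so that the resulting names match the variables mentioned above; the helper functions are illustrative and not part of any Kubernetes client library.

```python
import os

def service_env_names(service_name):
    """Derive the host/port variable names for a service."""
    prefix = service_name.upper().replace("-", "_")
    return f"{prefix}_SERVICE_HOST", f"{prefix}_SERVICE_PORT"

def locate_service(service_name, environ=os.environ):
    """Return (host, port) from the environment, or None if not set."""
    host_var, port_var = service_env_names(service_name)
    host = environ.get(host_var)
    port = environ.get(port_var)
    if host is None or port is None:
        return None
    return host, int(port)

# Hypothetical environment, as a pod might see it.
fake_env = {
    "RABBITMQ_SERVICE_SERVICE_HOST": "10.0.0.11",
    "RABBITMQ_SERVICE_SERVICE_PORT": "5672",
}
print(service_env_names("rabbitmq-service"))
print(locate_service("rabbitmq-service", fake_env))
```

Because these variables are only injected for services that exist when the pod starts, DNS-based discovery is usually the more robust of the two approaches.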
Docker is an open source tool designed to package and run applications as small containers on any machine. Docker makes development and deployment much easier for developers. Containers are very lightweight: they include a minimal OS and your application. In a way, Docker is a bit like a virtual machine. But unlike a virtual machine, rather than creating a whole virtual operating system, Docker allows applications to use the same Linux kernel as the system that they’re running on and only requires applications be shipped with things not already running on the host computer. This gives a significant performance boost and reduces the size of the application. Kubernetes: Kubernetes is a powerful system, developed by Google, for managing containerized applications in a clustered e…
Kubernetes is a container cluster management system. After getting used to Docker, you realize that there should be ‘docker run’ commands or something like that to run many containers across heterogeneous hosts. This is where Kubernetes comes in. It provides a complete system for running different containers across multiple hosts. Kubernetes is based on Borg, Google’s container management system, and is written in Go. Basically, Google uses three languages: 1. C/C++ 2. Java 3. Python. C and C++ might be a little tough for new users. Java is less attractive than Go for Kubernetes because of its heavy runtime download. Python is great, but its dynamic typing is challenging for system software. Go is the best choice as it has a great set of system libraries. It has fast testing and building too…
Hi there, I believe container orchestration is one of the best features of Kubernetes. Let me tell you why. I am sharing a section of my recently posted article on Level Up. For the complete article, please visit: The Kubernetes Bible for Beginners & Developers – Level Up So here is my answer: How Kubernetes Solves the Problem? After discussing the deployment part of Kubernetes, it is necessary to understand the importance of Kubernetes. Container Orchestration & Kubernetes Containers behave like lightweight virtual machines. They are scalable and isolated. The containers are linked together for setting security policies, limiting resource utilization, etc. If your application infrastructure is similar to the image shared below, then container orchestration is necessary. It might be Nginx/Apache + PHP/…
Hi, I found this cheat sheet on Kubernetes: Kubernetes kubectl CLI Cheat Sheet. This cheat sheet covers first-aid commands to configure the CLI, manage a cluster, and gather information from it. On downloading the cheat sheet, you will find out how to: create, group, update, and delete cluster resources; debug Kubernetes pods (a pod is a group of one or more containers with shared storage/network and a specification for running the containers); and manage config maps, a primitive to store a pod’s configuration, and secrets, a primitive to store sensitive data such as passwords, keys, and certificates. You will also learn how to use Helm, a package manager to define, install, and upgrade complex Kubernetes apps. Moreover, here you can find the Kubernetes training courses – Custom Hands-On IT Training Courses… Plus -…
Both Kubernetes and Docker are DevOps tools. Docker was started in 2013 and is developed by Docker, Inc. Kubernetes was introduced as a project at Google in 2014, as a successor to Google Borg. Kubernetes can run without Docker, and Docker can run without Kubernetes, but Kubernetes brings great benefits when run along with Docker. What is Kubernetes Kubernetes is a container management system developed by Google. It is an open-source, portable system for automatic container deployment and management. It eliminates many of the manual processes involved in deploying and scaling containerized applications. In practice, Kubernetes is most commonly used alongside Docker for better control and implementation of containerized applications. Features of Kubernetes * Automates various manual proces…
Yes and no. Especially for Kubernetes (which is not THAT hard, but has a steep learning curve in the beginning), I doubt that there is any certification that can tell you stuff you cannot learn for free. You can set up a Kubernetes cluster on DO for $20/month or even on your laptop to actually try things out. Create a few Helm charts for your pet applications and you have a good working knowledge of Kubernetes. BUT: How can an employer judge your level of knowledge? And this is where certifications get interesting. So basically, you are trading money for an increased chance of employment, all other things equal. Furthermore, at a certain size of projects, customers require their suppliers to have a certain number of people certified in the relevant technologies, so that they can rest assure…
This is a good question. I would like to say that Borg and Kubernetes both handle the same kind of tasks. But Google is promoting Kubernetes for now. As such, it offers good features as well. Most important of all, Kubernetes has an active online community. The members of this community meet up online as well as in person, in major cities of the world. An international conference, “KubeCon”, has proved to be a huge success. There is also an official Slack group for Kubernetes. Major cloud providers like Google Cloud Platform, AWS, Azure, DigitalOcean, etc. also offer their support channels. For more details on Kubernetes, please visit my articles: https://www.level-up.one/kubernetes-bible-beginners/ How Does The Kubernetes Networking Work? : Part 1 – Level Up How Does The Kubernetes Ne(more)
Kubernetes is an infrastructure abstraction for container manipulation. In Kubernetes there are many terms that conceptualize the execution environment. A pod is the smallest deployable unit in Kubernetes. You can see it as an application that runs one container or multiple containers that work together. Pods have volumes, memory and networking requirements. Pods have a unique ID and can die at any minute, so Kubernetes provides a higher-level abstraction called a Service. A Service is a logical set of pods that is permanent in the cluster and offers functionality. Pods are accessible through the service names in the network of the cluster. When a pod dies, Kubernetes automatically runs a new pod of the service (depending on replica configuration) to keep the service offering functionality. There are man(more)
Kubernetes’ increased adoption is showcased by a number of influential companies which have integrated the technology into their services. Let us take a look at how some of the most successful companies of our time are successfully using Kubernetes. Tinder’s move to Kubernetes Due to high traffic volume, Tinder’s engineering team faced challenges of scale and stability. What did they do? Kubernetes – Yes, the answer is Kubernetes. Tinder’s engineering team solved interesting challenges to migrate 200 services and run a Kubernetes cluster at scale totaling 1,000 nodes, 15,000 pods, and 48,000 running containers. Reddit’s Kubernetes story Reddit is one of the top busiest sites in the world. Kubernetes forms the core of Reddit’s internal Infrastructure. From many years, the Reddit infrastructure tea(more)
Here is a way you could convince him. Docker is dead. It’s not technically dead, but in reality, it’s a walking zombie. I’ll explain why. AWS is one of the best platforms for infrastructure and there are GCE and Azure, but AWS is the standard, the most capable platform of all the cloud architectures. AWS is integrating Kubernetes into its system and you might ask what the benefits are and why it would do that. Kubernetes is basically a competitor to AWS. It allows you to write infrastructure using YAML files and deploy them on a cluster. The only drawback right now is that you cannot provision servers using Kubernetes because it sits at a higher level in the abstraction stack. The servers are below it. However, with EKS (Elastic Kubernetes Service), AWS has integrated all sorts of primativ(more)
If the developer put together a working solution then keep using it, thank them for the effort, and provide some private coaching on how to get buy-in so things go more smoothly in the future. Startups spawn serious problems that don’t end up on the roadmap as they should, and you’re better off with people taking initiative then fixing them. Otherwise the stake holders need to decide on a containerization solution, preferably coming to that conclusion by themselves or at least believing they did. That’s probably Kubernetes (from Google which knows how to build and run things) and docker where you already have one enthusiastic engineer willing to own the project, although they should be able to provide reasonable arguments on why that’s the best option for containerization and deployment. Peo(more)
Kubernetes is meant to simplify things and this article is meant to simplify Kubernetes for you! Kubernetes is a powerful open-source system that was developed by Google. It was developed for managing containerized applications in a clustered environment. Kubernetes has gained popularity and is becoming the new standard for deploying software in the cloud. Learning Kubernetes is not difficult (if the tutor is good) and it offers great power. The learning curve is a little steep. So let us learn Kubernetes in a simplified way. The article covers Kubernetes’ basic concepts, architecture, how it solves the problems, etc. What Is Kubernetes? Kubernetes offers or in fact, it itself is a system that is used for running and coordinating applications across numerous machines. The system manages the(more)
Kubernetes and Docker are two different tools used for DevOps. Let me explain each in brief. Kubernetes is an open-source platform used for maintaining and deploying a group of containers. In practice, Kubernetes is most commonly used alongside Docker for better control and implementation of containerized applications. Docker is a tool that is used to automate the deployment of applications in lightweight containers so that applications can work efficiently in different environments. Features of docker – Multiple containers run on the same hardware High productivity Maintains isolated applications Quick and easy configuration Differences between Kubernetes and Docker 1. In Kubernetes, applications are deployed as a combination of pods, deployments, and services. In Docker, applications are deployed i(more)
Kubernetes is built in three layers with each higher layer hiding the complexity found in a lower layer -Application Layer(Pool and Services), Kubernetes Layer and Infrastructure Layer. Pods are a part of Kubernetes layer. A pod is one or more containers controlled as a single application It encapsulates application containers, storage resources, a unique network ID and other configuration on how to run the containers A Pod represents a group of one or more application containers bundled up together and are highly scalable If a pod fails, Kubernetes automatically deploys new replicas of the pod to the cluster Pods provide two different types of shared resources -networking and storage You can also get a good understanding of content quality by watching Simplilearn’s youtube videos. Here are some(more)
Kubernetes, also sometimes called K8S (K – eight characters – S), is an open source orchestration framework for containerized applications that was born from the Google data centers.(more)
Docker, absolutely learn that first. Docker Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and deploy it as one package. And here comes the race between choosing an orchestration tool : Overview of Kubernetes Kubernetes is based on years of Google’s experience of running workloads at a huge scale in production. As per Kubernetes website, “Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.” Overview of Docker Swarm Docker swarm is Docker’s own container’s orchestration. It uses the standard Docker API and networking, making it easy to drop into(more)
A node is the smallest unit of hardware in Kubernetes, also known as a minion. It is a representation of a single machine in the cluster. It is a physical machine in a data center or virtual machine hosted on a cloud provider like Google Cloud Platform. Each node has the services required to run a pod and is managed by the master components in Kubernetes architecture. The services given by a Kubernetes Node include the container runtime(Docker), Kubelet, and Kube-proxy. To know more about Node in Kubernetes, watch this video on Kubernetes Architecture: Hope this helps!(more)
They’re both good technologies with huge opportunities and potential ahead. Docker is overhyped for its relative youth, and is really a moderate set of wrapper capabilities around the Linux kernel. Operational understanding is scarce and conflicting. It requires a lot of deep street knowledge to use effectively in production. There are lots of subtle performance and reliability challenges with e.g. networking and storage, and often subtle breaking changes between releases. Installing and operating Kubernetes is not for the faint of heart. It assumes you can “bring your own cluster”. The pace of change and improvement on core k8s is astounding (good and bad). Using Kubernetes is relatively white box, i.e. you really need to know what’s going on under the covers to a degree, especially if you’re not using GKE.(more)
Used on GCP and physical servers. A Kubernetes cluster is a group of ‘machines’ that are either on the same network segment or set up to communicate with each other over the network with low latency, and run Kubernetes software. Kubernetes software runs as a ‘service’ or ‘daemon’ on each machine in the cluster and this causes the host machine to act as either a ‘master’ or a ‘slave’ node within the cluster. During the Kubernetes cluster setup process, the master is created first and toward the end of the install process a connection command is displayed or logged to the system. This should then be run on each additional node once the base Kubernetes software has been installed. Some ‘magic’ then takes place and the new node links up with the master node to form a logical cluster. Commands can then be run on the master node t(more)
I think containers are the delivery model with the most potential right now. They make packaging an application with its required infrastructure much easier. Tools like Docker provide containers, but software is also needed to handle items such as replication, failures and APIs for automating deployment on multiple machines. At the beginning of 2015, the status of clustering platforms such as Kubernetes and Docker Swarm was highly unstable. We tried to use them and began with Docker Swarm. Why is this a really important topic now? Amid the news in recent weeks, several businesses have purchased container or micro-service firms to boost their portfolio for what lies ahead.(more)
Let’s forget all about technical stuff and discuss this in a way that a non-technical guy understands. * You are the owner of a building and you have 5 spots where people can enter your building and you want 5 security guards guarding the spots. All good till now. * Now consider that one of the guards is out of service for 2 hours due to some personal reasons. Now as a building owner it’s your responsibility to guard the spot or employ another guard to replace the existing one. Would you like to be manually interrupted from your task to look after who is out and whom to replace? * No, no one likes to be. Now the solution could be to go to a third-party vendor who provides 24*7 availability of the guards. It’s the responsibility of the vendor to ensure 24*7 availability based on the configuration set (in this case guards guarding(more)
While researching for a project, I looked into all of the available books on Kubernetes. Here’s a quick roundup. (Feel free to suggest more!) * Golden Guide to Kubernetes Application Development This book’s for web app developers who just want a short, sharp guide to grok Kubernetes. It’s also really great for people trying to get their CKAD certification. (Disclaimer: I wrote this. Yeah, this is one of those Quora answers… but I hope it’s still useful.) * The Kubernetes Book Probably the most popular and established book on Kubernetes. It’s great for new developers trying to learn Kubernetes. The author is known for his video courses as well. * Kubernetes: Up and Running Definitely written by the most authoritative authors of any book here. Kelsey Hightower is a Google dev advocate for Kubernetes(more)
It is indeed possible to use Kubernetes without Docker. The Kubernetes community has long recognized the problem with being tied to Docker’s quasi-proprietary (and somewhat arbitrarily developed) container runtime. Early on there was support for an alternative runtime called rkt (pronounced like rocket). However, going down the path of creating separate solutions for any and every new container runtime that might get developed would be a lot of work and a bit like reinventing the wheel for each runtime. To break free of the Docker runtime constraint, the community created the CRI (Container Runtime Interface), which allows you to use other container runtimes (e.g. containerd, CRI-O, etc.). The CRI plugin is a shim that sits between the Kubernetes kubelet and the container runtime and acts as a universal translator. Read more…
I’m not sure how to explain Kubernetes to a 10-year-old. Yet when I’m allowed to expand to older people who are not technology savvy I can come up with an example which might resonate. It will inside my company: I will use the analogy of our call center. My company services some 2 million people, we manage their pensions and the necessary administration. Every year we send out the latest status of the pensions to the participants, and sure enough people will follow up. Many follow up online – the pension fund websites – yet there is a significant number who call or send an e-mail. We measure the amount of outstanding messages, as well as the amount of unanswered calls (I recall the service level is at 80% answered within 10 seconds). These are displayed on monitors so those who work in the(more)
Assuming a basic understanding of Docker and containers, I’ll describe the Kubernetes specifics. This is from a general user point of view. Kubelet: A process which runs on each node in the cluster. Kubelet talks to the master server and gets a list of containers to run and then runs, manages, and reports container status back to the master server. Pod: The primary unit of Kubernetes scheduling and management. A Pod is a list of containers that are always run together on one node. The containers in a pod share an IP address and a network stack, but are otherwise isolated from each other. Container: A Docker container; it has an isolated process space, can expose ports, and can define environment variables and a run command. Read more ….
Kubernetes has a strong feature set for microservice architectures. Things like service discovery, automatic failover, rescheduling, and support for overlay networks make it the best choice in dynamic environments with many small, frequently changing applications tied together. If your application needs to start hundreds of containers quickly and will terminate them just as quickly, then Kubernetes is a good option. The converse of this is that it is not as well designed for more static, highly efficient workloads. Containerization is great for flexibility, but doesn’t come for free. There is a performance penalty for using it, somewhere between a few to high single digit percentage penalty, depending on the type of operations. Read more ….
DATA AND ANALYTICS BigQuery: Data warehouse/analytics BigQuery BI Engine: In-memory analytics engine BigQuery ML: BigQuery model training/serving Cloud Composer: Managed workflow orchestration service Cloud Data Fusion: Graphically manage data pipelines Cloud Dataflow: Stream/batch data processing Cloud Dataprep: Visual data wrangling Cloud Dataproc: Managed Spark and Hadoop
NETWORKING Carrier Peering: Peer through a carrier Direct Peering: Peer with GCP Dedicated Interconnect: Dedicated private network connection Partner Interconnect: Connect on-prem network to VPC Cloud Armor: DDoS protection and WAF Cloud CDN: Content delivery network Cloud DNS: Programmable DNS serving Cloud Load Balancing: Multi-region load distribution/balancing Cloud NAT: Network address translation service Cloud Router: VPC/on-prem network route exchange (BGP) Cloud VPN (HA): VPN (Virtual private network connection) Network Service Tiers: Price vs performance tiering Network Telemetry: Network telemetry service Traffic Director: Service mesh traffic management Google Cloud Service Mesh: Service-aware network management Virtual Private Cloud: Software defined networking VPC Service Controls: Security perimeters for API-based services Network Intelligence Center: Network monitoring and topology
GOOGLE MAPS PLATFORM Directions API: Get directions between locations Distance Matrix API: Multi-origin/destination travel times Geocoding API: Convert address to/from coordinates Geolocation API: Derive location without GPS Maps Embed API: Display iframe embedded maps Maps JavaScript API: Dynamic web maps Maps SDK for Android: Maps for Android apps Maps SDK for iOS: Maps for iOS apps Maps Static API: Display static map images Maps SDK for Unity: Unity SDK for games Maps URLs: URL scheme for maps Places API: REST-based Places features Places Library, Maps JS API: Places features for web Places SDK for Android: Places features for Android Places SDK for iOS: Places features for iOS Roads API: Convert coordinates to roads Street View Static API: Static street view images Street View Service: Street view for JavaScript Time Zone API: Convert coordinates to timezone
G SUITE (WORKSPACE) PLATFORM Admin SDK: Manage G Suite resources AMP for Email: Dynamic interactive email Apps Script: Extend and automate everything Calendar API: Create and manage calendars Classroom API: Provision and manage classrooms Cloud Search: Unified search for enterprise Docs API: Create and edit documents Drive Activity API: Retrieve Google Drive activity Drive API: Read and write files Drive Picker: Drive file selection widget Email Markup: Interactive email using schema.org G Suite Add-ons: Extend G Suite apps G Suite Marketplace: Storefront for integrated applications Gmail API: Enhance Gmail Hangouts Chat Bots: Conversational bots in chat People API: Manage user’s Contacts Sheets API: Read and write spreadsheets Slides API: Create and edit presentations Tasks API: Search, read & update Tasks Vault API: Manage your organization’s eDiscovery
MIGRATION TO GCP BigQuery Data Transfer Service: Bulk import analytics data Cloud Data Transfer: Data migration tools/CLI Google Transfer Appliance: Rentable data transport box Migrate for Anthos: Migrate VMs to GKE containers Migrate for Compute Engine: Compute Engine migration tools Migrate from Amazon Redshift: Migrate from Redshift to BigQuery Migrate from Teradata: Migrate from Teradata to BigQuery Storage Transfer Service: Online/on-premises data transfer VM Migration: VM migration tools Cloud Foundation Toolkit: Infrastructure as Code templates
Answer these questions to validate your basic knowledge of GCP:
As a prerequisite, the following questions will help you familiarize yourself with the Google Cloud Platform.
1) What is GCP?
2) What are the benefits of using GCP?
3) How can GCP help my business?
4) What are some of the features of GCP?
5) How is GCP different from other clouds?
6) Why should I use GCP?
7) What are some of GCP’s strengths?
8) How is GCP priced?
9) Is GCP easy to use?
10) Can I use GCP for my personal projects?
11) What services does GCP offer?
12) What can I do with GCP?
13) What languages does GCP support?
14) What platforms does GCP support?
15) Does GCP support hybrid deployments?
16) Does GCP support on-premises deployments?
17) Is there a free tier on GCP?
18) How do I get started with using GCP?
What are the Top 200 AWS and Google Certified Machine Learning Specialty Questions and Answers Dumps?
This blog is the best way to prepare for your upcoming AWS Certified Machine Learning Specialty and Google Certified Professional Machine Learning Engineer exams. With over 100 questions and answers, it provides quizzes that are very similar to the real exam. It also includes the option to show and hide answers. Additionally, there are machine learning interview questions and detailed answers, as well as cheat sheets and illustrations, to help you make sure you are well-prepared for your AWS Certified Machine Learning Specialty Exam.
The typical Google Machine Learning Engineer salary is $147,218. Machine Learning Engineer salaries at Google can range from $110,000 – $152,183.
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
By the end of 2020, 85% of customer interactions will be handled without a human (Call Center, Chatbot, etc…)
61% of marketers say artificial intelligence is the most important aspect of their data strategy.
80% of business and tech leaders say AI already boosts productivity (Robotic Process Automation, Power Automate, etc..)
Current AI technology can boost business productivity by up to 40%
Quizzes, Practice Exams: Framing, Architecting, Designing, Developing ML Problems & Solutions, ML Jobs Interview Q&A
Use this App to learn about Machine Learning and Elevate your Brain with Machine Learning Quizzes, Cheat Sheets, and ML Jobs Interview Questions and Answers updated daily.
What does a Professional Machine Learning Engineer do?
A Professional Machine Learning Engineer designs, builds, and productionizes ML models to solve business challenges using Google Cloud technologies and knowledge of proven ML models and techniques. The ML Engineer collaborates closely with other job roles to ensure long-term success of models. The ML Engineer should be proficient in all aspects of model architecture, data pipeline interaction, and metrics interpretation. The ML Engineer needs familiarity with application development, infrastructure management, data engineering, and security. Through an understanding of training, retraining, deploying, scheduling, monitoring, and improving models, they design and create scalable solutions for optimal performance.
The AWS Certified Machine Learning – Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate’s ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.
Question 1) A machine learning team has several large CSV datasets in Amazon S3. Historically, models built with the Amazon SageMaker Linear Learner algorithm have taken hours to train on similar-sized datasets. The team’s leaders need to accelerate the training process. What can a machine learning specialist do to address this concern?
A) Use Amazon SageMaker Pipe mode.
B) Use Amazon Machine Learning to train the models.
C) Use Amazon Kinesis to stream the data to Amazon SageMaker.
D) Use AWS Glue to transform the CSV dataset to the JSON format.
ANSWER1:
A
Notes/Hint1:
Amazon SageMaker Pipe mode streams the data directly to the container, which improves the performance of training jobs. (Refer to this link for supporting information.) In Pipe mode, your training job streams data directly from Amazon S3. Streaming can provide faster start times for training jobs and better throughput. With Pipe mode, you also reduce the size of the Amazon EBS volumes for your training instances. B would not apply in this scenario. C is a streaming ingestion solution, but is not applicable in this scenario. D transforms the data structure.
Question 2) A local university wants to track cars in a parking lot to determine which students are parking in the lot. The university wants to ingest videos of the cars parking in near-real time, use machine learning to identify license plates, and store that data in an AWS data store. Which solution meets these requirements with the LEAST amount of development effort?
A) Use Amazon Kinesis Data Streams to ingest the video in near-real time, use the Kinesis Data Streams consumer integrated with Amazon Rekognition Video to process the license plate information, and then store results in DynamoDB.
B) Use Amazon Kinesis Video Streams to ingest the videos in near-real time, use the Kinesis Video Streams integration with Amazon Rekognition Video to identify the license plate information, and then store the results in DynamoDB.
C) Use Amazon Kinesis Data Streams to ingest videos in near-real time, call Amazon Rekognition to identify license plate information, and then store results in DynamoDB.
D) Use Amazon Kinesis Firehose to ingest the video in near-real time and output results onto S3. Set up a Lambda function that triggers when a new video is PUT onto S3 to send results to Amazon Rekognition to identify license plate information, and then store results in DynamoDB.
Answer 2)
B
Notes/Hint2)
Kinesis Video Streams is used to stream videos in near-real time. Amazon Rekognition Video uses Amazon Kinesis Video Streams to receive and process a video stream. After the videos have been processed by Rekognition we can output the results in DynamoDB.
Question 3) A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences:
1. Please call the number below.
2. Please do not call us.
What are the dimensions of the tf–idf matrix?
A) (2, 16)
B) (2, 8)
C) (2, 10)
D) (8, 10)
ANSWER3:
A
Notes/Hint3:
There are 2 sentences, 8 unique unigrams, and 8 unique bigrams, so the result would be (2, 16). The phrases are “Please call the number below” and “Please do not call us.” Each word individually (unigram) is “Please,” “call,” “the,” “number,” “below,” “do,” “not,” and “us.” The unique bigrams are “Please call,” “call the,” “the number,” “number below,” “Please do,” “do not,” “not call,” and “call us.” The tf–idf vectorizer is described at this link.
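The count above can be checked with a minimal sketch in plain Python. The helper name `ngram_vocabulary` and the tokenization rule (lowercase, word characters only) are assumptions made for illustration. A real vectorizer such as scikit-learn's `TfidfVectorizer` with `ngram_range=(1, 2)` may tokenize differently, but the matrix shape is determined the same way: one row per sentence and one column per unique n-gram.

```python
import re

def ngram_vocabulary(sentences, n_values=(1, 2)):
    """Collect the unique word n-grams found across all sentences."""
    vocab = set()
    for sentence in sentences:
        # Lowercase and keep word characters only, so "below." becomes "below".
        tokens = re.findall(r"\w+", sentence.lower())
        for n in n_values:
            for i in range(len(tokens) - n + 1):
                vocab.add(tuple(tokens[i:i + n]))
    return vocab

corpus = ["Please call the number below.", "Please do not call us."]
vocab = ngram_vocabulary(corpus)
unigrams = {gram for gram in vocab if len(gram) == 1}
bigrams = {gram for gram in vocab if len(gram) == 2}

# Rows = documents, columns = unique unigrams + unique bigrams.
shape = (len(corpus), len(vocab))
print(len(unigrams), len(bigrams), shape)  # 8 8 (2, 16)
```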
Question 4: A company is setting up a system to manage all of the datasets it stores in Amazon S3. The company would like to automate running transformation jobs on the data and maintaining a catalog of the metadata concerning the datasets. The solution should require the least amount of setup and maintenance. Which solution will allow the company to achieve its goals?
A) Create an Amazon EMR cluster with Apache Hive installed. Then, create a Hive metastore and a script to run transformation jobs on a schedule.
B) Create an AWS Glue crawler to populate the AWS Glue Data Catalog. Then, author an AWS Glue ETL job, and set up a schedule for data transformation jobs.
C) Create an Amazon EMR cluster with Apache Spark installed. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
D) Create an AWS Data Pipeline that transforms the data. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
ANSWER4:
B
Notes/Hint4:
AWS Glue is the correct answer because this option requires the least amount of setup and maintenance since it is serverless, and it does not require management of the infrastructure. Refer to this link for supporting information. A, C, and D are all solutions that can solve the problem, but require more steps for configuration, and require higher operational overhead to run and maintain.
Question 5) Which service in the Kinesis family allows you to easily load streaming data into data stores and analytics tools?
A) Kinesis Firehose
B) Kinesis Streams
C) Kinesis Data Analytics
D) Kinesis Video Streams
ANSWER5:
A
Notes/Hint5:
Kinesis Firehose is perfect for streaming data into AWS and sending it directly to its final destination – places like S3, Redshift, Elasticsearch, and Splunk instances.
Question 6) A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the data scientist do to improve the training process?
A) Increase the learning rate. Keep the batch size the same.
B) Reduce the batch size. Decrease the learning rate.
C) Keep the batch size the same. Decrease the learning rate.
D) Do not change the learning rate. Increase the batch size.
Answer 6)
B
Notes 6)
It is most likely that the loss function is highly non-convex and has multiple local minima where the training is getting stuck. Decreasing the batch size adds noise to the gradient estimates, which would help the data scientist stochastically escape the local minima. Decreasing the learning rate would prevent overshooting the global loss function minimum. Refer to the paper at this link for an explanation. Reference 6): Here
Question 7) Your organization has a standalone Javascript (Node.js) application that streams data into AWS using Kinesis Data Streams. You notice that they are using the Kinesis API (AWS SDK) over the Kinesis Producer Library (KPL). What might be the reasoning behind this?
A) The Kinesis API (AWS SDK) provides greater functionality over the Kinesis Producer Library.
B) The Kinesis API (AWS SDK) runs faster in Javascript applications over the Kinesis Producer Library.
C) The Kinesis Producer Library must be installed as a Java application to use with Kinesis Data Streams.
D) The Kinesis Producer Library cannot be integrated with a Javascript application because of its asynchronous architecture.
Answer 7)
C
Notes/Hint7:
The KPL must be installed as a Java application before it can be used with your Kinesis Data Streams. There are ways to process KPL serialized data within AWS Lambda, in Java, Node.js, and Python, but none of these answers mentions Lambda.
Question 8) A data scientist is evaluating different binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result. The models should be evaluated based on the following criteria:
1) Must have a recall rate of at least 80%
2) Must have a false positive rate of 10% or less
3) Must minimize business costs
After creating each binary classification model, the data scientist generates the corresponding confusion matrix. Which confusion matrix represents the model that satisfies the requirements?
A) TN = 91, FP = 9, FN = 22, TP = 78
B) TN = 99, FP = 1, FN = 21, TP = 79
C) TN = 96, FP = 4, FN = 10, TP = 90
D) TN = 98, FP = 2, FN = 18, TP = 82
Answer 8):
D
Notes/Hint 8)
The following calculations are required, where TP = True Positive, FP = False Positive, FN = False Negative, and TN = True Negative:
Recall = TP / (TP + FN)
False Positive Rate (FPR) = FP / (FP + TN)
Cost = 5 * FP + FN
A) Recall = 78 / (78 + 22) = 0.78; FPR = 9 / (9 + 91) = 0.09; Cost = 5 * 9 + 22 = 67
B) Recall = 79 / (79 + 21) = 0.79; FPR = 1 / (1 + 99) = 0.01; Cost = 5 * 1 + 21 = 26
C) Recall = 90 / (90 + 10) = 0.90; FPR = 4 / (4 + 96) = 0.04; Cost = 5 * 4 + 10 = 30
D) Recall = 82 / (82 + 18) = 0.82; FPR = 2 / (2 + 98) = 0.02; Cost = 5 * 2 + 18 = 28
Options A and B fall short of the 80% recall requirement, even though B has the lowest cost. Options C and D have a recall greater than 80% and an FPR less than 10%, but D is the most cost effective. For supporting information, refer to this link. Reference 8: Here
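The selection logic for Question 8 can be sketched in a few lines of Python. The function names `evaluate` and `best_model` are hypothetical helpers, not part of any library. The cost weights (5 per false positive, 1 per false negative) and the thresholds come from the question itself.

```python
def evaluate(tn, fp, fn, tp, fp_cost=5, fn_cost=1):
    """Return (recall, false positive rate, business cost) for one confusion matrix."""
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    cost = fp_cost * fp + fn_cost * fn
    return recall, fpr, cost

def best_model(candidates, min_recall=0.80, max_fpr=0.10):
    """Keep models that meet both constraints, then pick the cheapest one."""
    eligible = {}
    for name, matrix in candidates.items():
        recall, fpr, cost = evaluate(**matrix)
        if recall >= min_recall and fpr <= max_fpr:
            eligible[name] = cost
    return min(eligible, key=eligible.get), eligible

candidates = {
    "A": dict(tn=91, fp=9, fn=22, tp=78),
    "B": dict(tn=99, fp=1, fn=21, tp=79),
    "C": dict(tn=96, fp=4, fn=10, tp=90),
    "D": dict(tn=98, fp=2, fn=18, tp=82),
}

winner, eligible = best_model(candidates)
print(winner, eligible)  # D {'C': 30, 'D': 28}
```

Filtering on the hard constraints first and only then minimizing cost is what rules out option B, which is cheapest overall but misses the recall requirement.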
Question 9) A data scientist uses logistic regression to build a fraud detection model. While the model accuracy is 99%, 90% of the fraud cases are not detected by the model. What action will definitely help the model detect more than 10% of fraud cases?
A) Using undersampling to balance the dataset
B) Decreasing the class probability threshold
C) Using regularization to reduce overfitting
D) Using oversampling to balance the dataset
Answer 9)
B
Notes 9)
Decreasing the class probability threshold makes the model more sensitive and, therefore, marks more cases as the positive class, which is fraud in this case. This will increase the likelihood of fraud detection. However, it comes at the price of lowering precision. This is covered in the Discussion section of the paper at this link. Reference 9: Here
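The trade-off described in this note can be illustrated with a small sketch. The scores and labels below are made-up values standing in for a logistic regression model's predicted fraud probabilities, and `recall_and_precision` is a hypothetical helper written for this example.

```python
def recall_and_precision(scores, labels, threshold):
    """Flag a case as fraud when its score reaches the threshold, then measure the result."""
    flagged = [score >= threshold for score in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    recall = tp / sum(labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Hypothetical predicted fraud probabilities; label 1 marks a real fraud case.
scores = [0.02, 0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.48, 0.70]
labels = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]

default = recall_and_precision(scores, labels, threshold=0.5)
lowered = recall_and_precision(scores, labels, threshold=0.3)
print(default)  # (0.25, 1.0)
print(lowered)
```

With the default threshold only one of the four fraud cases is flagged. Lowering the threshold to 0.3 catches all four, at the cost of two false alarms, so recall rises while precision falls.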
Question 10) A company is interested in building a fraud detection model. Currently, the data scientist does not have a sufficient amount of information due to the low number of fraud cases. Which method is MOST likely to detect the GREATEST number of valid fraud cases?
A) Oversampling using bootstrapping
B) Undersampling
C) Oversampling using SMOTE
D) Class weight adjustment
Answer 10)
C
Notes 10)
With datasets that are not fully populated, the Synthetic Minority Over-sampling Technique (SMOTE) adds new information by adding synthetic data points to the minority class. This technique would be the most effective in this scenario. Refer to Section 4.2 at this link for supporting information. Reference 10) : Here
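The core idea of SMOTE, interpolating between a minority-class point and one of its nearest minority-class neighbours, can be sketched in a few lines. This is a simplified illustration rather than a reference implementation: the function name smote_sketch, the parameter defaults, and the sample points are all assumptions made for the example.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of base within the minority class (excluding base itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Hypothetical two-feature fraud samples (the minority class).
fraud = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
new_points = smote_sketch(fraud, n_new=6)
print(len(new_points), "synthetic fraud samples created")
```

Unlike bootstrapping, which only repeats existing rows, every synthetic point here is a new location in feature space that lies on a segment between two real fraud samples.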
Question 11) A machine learning engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm. The ML engineer notices the target label classes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%. What should the ML engineer do to minimize bias due to missing values?
A) Replace each missing value by the mean or median across non-missing values in the same row.
B) Delete observations that contain missing values because these represent less than 5% of the data.
C) Replace each missing value by the mean or median across non-missing values in the same column.
D) For each feature, approximate the missing values using supervised learning based on other features.
Answer 11)
D
Notes 11)
Use supervised learning to predict missing values based on the values of other features. Different supervised learning approaches might have different performances, but any properly implemented supervised learning approach should provide the same or better approximation than mean or median approximation, as proposed in responses A and C. Supervised learning applied to the imputation of missing values is an active field of research. Refer to this link for an example. Reference 11): Here
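As a small illustration of the difference between model-based and mean imputation, the sketch below fits a one-variable least-squares line on the complete rows and uses it to fill the gap. The column names age and income, the data, and the helper names are hypothetical; a real pipeline would typically use several predictor features and a stronger model.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def impute_with_regression(rows, target, predictor):
    """Fill missing `target` values with predictions made from `predictor`."""
    complete = [r for r in rows if r[target] is not None]
    intercept, slope = fit_line(
        [r[predictor] for r in complete],
        [r[target] for r in complete],
    )
    filled = []
    for r in rows:
        row = dict(r)  # copy so the input rows are left untouched
        if row[target] is None:
            row[target] = intercept + slope * row[predictor]
        filled.append(row)
    return filled


# Hypothetical data frame rows; the last row is missing its income value.
rows = [
    {"age": 20, "income": 30.0},
    {"age": 30, "income": 50.0},
    {"age": 40, "income": 70.0},
    {"age": 50, "income": None},
]

known = [r["income"] for r in rows if r["income"] is not None]
column_mean = sum(known) / len(known)
imputed = impute_with_regression(rows, target="income", predictor="age")
print("mean imputation:", column_mean)
print("regression imputation:", imputed[-1]["income"])
```

In this toy data income rises with age, so the column mean (50.0) ignores the trend, while the regression estimate follows it.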
Question 12) A company has collected customer comments on its products, rating them as safe or unsafe, using decision trees. The training dataset has the following features: id, date, full review, full review summary, and a binary safe/unsafe tag. During training, any data sample with missing features was dropped. In a few instances, the test set was found to be missing the full review text field. For this use case, which is the most effective course of action to address test data samples with missing features?
A) Drop the test samples with missing full review text fields, and then run through the test set.
B) Copy the summary text fields and use them to fill in the missing full review text fields, and then run through the test set.
C) Use an algorithm that handles missing data better than decision trees.
D) Generate synthetic data to fill in the fields that are missing data, and then run through the test set.
Answer 12)
B
Notes 12)
In this case, a full review summary usually contains the most descriptive phrases of the entire review and is a valid stand-in for the missing full review text field. For supporting information, refer to page 1627 at this link, and this link and this link.
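The substitution described above amounts to a simple fallback at test time. The sketch below assumes each sample is a dictionary with hypothetical keys full_review and full_review_summary; these names are chosen for illustration and are not taken from the question.

```python
def fill_missing_review(sample):
    """Return a copy of the sample in which an empty or missing
    'full_review' is replaced by the sample's summary text."""
    filled = dict(sample)
    if not filled.get("full_review"):
        filled["full_review"] = filled["full_review_summary"]
    return filled


# Hypothetical test samples; the second one lacks the full review text.
test_set = [
    {"id": 1, "full_review": "The charger overheated after an hour.",
     "full_review_summary": "Charger overheats"},
    {"id": 2, "full_review": None,
     "full_review_summary": "Works fine, no issues"},
]

prepared = [fill_missing_review(s) for s in test_set]
print(prepared[1]["full_review"])
```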
Question 13) An insurance company needs to automate claim compliance reviews because human reviews are expensive and error-prone. The company has a large set of claims and a compliance label for each. Each claim consists of a few sentences in English, many of which contain complex related information. Management would like to use Amazon SageMaker built-in algorithms to design a machine learning supervised model that can be trained to read each claim and predict if the claim is compliant or not. Which approach should be used to extract features from the claims to be used as inputs for the downstream supervised task?
A) Derive a dictionary of tokens from claims in the entire dataset. Apply one-hot encoding to tokens found in each claim of the training set. Send the derived feature space as inputs to an Amazon SageMaker built-in supervised learning algorithm.
B) Apply Amazon SageMaker BlazingText in Word2Vec mode to claims in the training set. Send the derived feature space as inputs for the downstream supervised task.
C) Apply Amazon SageMaker BlazingText in classification mode to labeled claims in the training set to derive features for the claims that correspond to the compliant and non-compliant labels, respectively.
D) Apply Amazon SageMaker Object2Vec to claims in the training set. Send the derived feature space as inputs for the downstream supervised task.
Answer 13)
D
Notes 13)
Amazon SageMaker Object2Vec generalizes the Word2Vec embedding technique for words to more complex objects, such as sentences and paragraphs. Since the supervised learning task is at the level of whole claims, for which there are labels, and no labels are available at the word level, Object2Vec needs to be used instead of Word2Vec.
Question 14) You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to be processed immediately before operations can continue. The second event type includes data of less importance, and operations can continue without processing it immediately. What is the most appropriate solution to record these different types of events?
A) Capture both events with the PutRecords API call.
B) Capture both event types using the Kinesis Producer Library (KPL).
C) Capture the mission-critical events with the PutRecords API call and the second event type with the Kinesis Producer Library (KPL).
D) Capture the mission-critical events with the Kinesis Producer Library (KPL) and the second event type with the PutRecords API call.
Answer 14)
C
Notes 14)
The question is about sending data to Kinesis synchronously vs. asynchronously. PutRecords is a synchronous send function, so it must be used for the first event type (critical events). The Kinesis Producer Library (KPL) implements an asynchronous send function, so it can be used for the second event type. In this scenario, the reason to use the PutRecords API call rather than the KPL for the critical events is that the KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable). Larger values of RecordMaxBufferedTime result in higher packing efficiencies and better performance. Applications that cannot tolerate this additional delay may need to use the AWS SDK directly. For more information about using the AWS SDK with Kinesis Data Streams, see Developing Producers Using the Amazon Kinesis Data Streams API with the AWS SDK for Java. For more information about RecordMaxBufferedTime and other user-configurable properties of the KPL, see Configuring the Kinesis Producer Library.
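A synchronous send for the critical events could look roughly like the sketch below. It assumes a client that behaves like a boto3 Kinesis client, whose put_records call takes StreamName and Records and returns a FailedRecordCount. The function name, the stream name, the event fields, and the FakeKinesisClient stand-in are hypothetical and exist only so the example runs without AWS credentials.

```python
import json


def put_critical_events(client, stream_name, events):
    """Send events synchronously and return only after Kinesis has answered.

    `client` is expected to behave like boto3.client("kinesis").
    """
    response = client.put_records(
        StreamName=stream_name,
        Records=[
            {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": str(e["id"])}
            for e in events
        ],
    )
    # The call itself can succeed while individual records are rejected,
    # so the failure count in the response has to be checked.
    if response["FailedRecordCount"] > 0:
        raise RuntimeError(f"{response['FailedRecordCount']} record(s) were not accepted")
    return response


class FakeKinesisClient:
    """Hypothetical stand-in for a real client, used only for this demo."""

    def __init__(self, failed=0):
        self.failed = failed
        self.calls = []

    def put_records(self, StreamName, Records):
        self.calls.append((StreamName, Records))
        return {"FailedRecordCount": self.failed, "Records": [{} for _ in Records]}


client = FakeKinesisClient()
put_critical_events(client, "critical-events", [{"id": 1, "type": "halt"}])
print("records sent:", len(client.calls[0][1]))
```

Because the function returns only after the response arrives, the caller knows whether the critical records were accepted before continuing, which is the property the question asks for.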
Question 15) You are collecting clickstream data from an e-commerce website to make near-real-time product suggestions for users actively using the site. Which combination of tools can be used to achieve the quickest recommendations and meet all of the requirements?
A) Use Kinesis Data Streams to ingest clickstream data, then use Kinesis Data Analytics to run real-time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
B) Use Kinesis Data Firehose to ingest clickstream data, then use Kinesis Data Analytics to run real-time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions, then use Lambda to load these results into S3.
C) Use Kinesis Data Streams to ingest clickstream data, then use Lambda to process that data and write it to S3. Once the data is on S3, use Athena to query that data based on conditions and make real-time recommendations to users.
D) Use Kinesis Data Analytics to ingest the clickstream data directly and run real-time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
Answer 15)
A
Notes 15)
Kinesis Data Analytics gets its input streaming data from Kinesis Data Streams or Kinesis Data Firehose. You can use Kinesis Data Analytics to run real-time SQL queries on your data. Once certain conditions are met you can trigger Lambda functions to make real time product suggestions to users. It is not important that we store or persist the clickstream data.
Question 16) Which service built by AWS makes it easy to set up a retry mechanism, aggregate records to improve throughput, and automatically submit CloudWatch metrics?
A) Kinesis API (AWS SDK)
B) Kinesis Producer Library (KPL)
C) Kinesis Consumer Library
D) Kinesis Client Library (KCL)
Answer 16)
B
Notes 16)
Although the Kinesis API built into the AWS SDK can be used for all of this, the Kinesis Producer Library (KPL) makes it easy to integrate all of this into your applications.
Question 17) You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data that you are ingesting is the players' controller inputs, sent every 1 second (up to 10 players in a game) in JSON format. The data needs to be ingested through Kinesis Data Streams, and each JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data?
A) 10 shards
B) Greater than 500 shards, so you’ll need to request more shards from AWS
C) 1 shard
D) 100 shards
Answer 17)
C
Notes 17)
In this scenario, there will be a maximum of 10 records per second with a max payload size of 1000 KB (10 records x 100 KB = 1000 KB) written to the shard. A single shard can ingest up to 1 MB of data per second, which is enough to ingest the 1000 KB from the streaming game play. Therefore, 1 shard is enough to handle the streaming data.
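The shard arithmetic can be written down as a small helper. This is a sketch under two assumptions: 1 MB is treated as 1000 KB, as the note above does, and the per-shard write limit of 1,000 records per second (a Kinesis Data Streams limit that the note does not mention) is also applied. The function and constant names are made up for the example.

```python
SHARD_WRITE_KB_PER_SEC = 1000       # 1 MB/s per shard, taking 1 MB = 1000 KB
SHARD_WRITE_RECORDS_PER_SEC = 1000  # per-shard record limit for writes


def min_shards(records_per_sec, record_kb):
    """Smallest shard count that satisfies both per-shard write limits."""
    total_kb = records_per_sec * record_kb
    by_bytes = -(-total_kb // SHARD_WRITE_KB_PER_SEC)              # ceiling division
    by_count = -(-records_per_sec // SHARD_WRITE_RECORDS_PER_SEC)  # ceiling division
    return max(1, by_bytes, by_count)


# 10 players, one 100 KB record each per second, as in the question.
print(min_shards(records_per_sec=10, record_kb=100))
```

The same helper shows how quickly the answer changes: doubling the number of players to 20 would push the payload to 2000 KB per second and require a second shard.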
Question 18) Which service in the Kinesis family allows you to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time?
A) Kinesis Streams
B) Kinesis Firehose
C) Kinesis Video Streams
D) Kinesis Data Analytics
Answer 18)
D
Notes 18)
Kinesis Data Analytics allows you to run real-time SQL queries on your data to gain insights and respond to events in real time.
Question 19) You are an ML specialist who needs to collect data from Twitter tweets. Your goal is to collect tweets that include only the name of your company and the tweet body, and store them in a data store in AWS. What set of tools can you use to stream, transform, and load the data into AWS with the LEAST amount of effort?
A) Set up a Kinesis Data Firehose for data ingestion and immediately write that data to S3. Next, set up a Lambda function to trigger when data lands in S3 to transform it and finally write it to DynamoDB.
B) Set up a Kinesis Data Stream for data ingestion, and set up EC2 instances as data consumers to poll and transform the data from the stream. Once the data is transformed, make an API call to write the data to DynamoDB.
C) Set up Kinesis Data Streams for data ingestion. Next, set up Kinesis Data Firehose to load that data into Redshift. Next, set up a Lambda function to query data using Redshift Spectrum and store the results in DynamoDB.
D) Create a Kinesis Data Stream to ingest the data. Next, set up a Kinesis Data Firehose and use Lambda to transform the data from the Kinesis Data Stream, then use Lambda to write the data to DynamoDB. Finally, use S3 as the data destination for Kinesis Data Firehose.
Answer 19)
A
Notes 19)
All of these could be used to stream, transform, and load the data into an AWS data store. The setup that requires the LEAST amount of effort and the fewest moving parts involves setting up a Kinesis Data Firehose to stream the data into S3, transforming it with a Lambda function triggered by S3, and then writing it to DynamoDB.
What are the Top 100 AWS and Google Certified Machine Learning Specialty Questions and Answers Dumps?
This blog is the best way to prepare for your upcoming AWS Certified Machine Learning Specialty and Google Certified Professional Machine Learning Engineer exams. With over 100 questions and answers, it provides quizzes that are very similar to the real exam, with the option to show and hide answers. It also includes machine learning interview questions with detailed answers, as well as cheat sheets and illustrations.
The typical Google Machine Learning Engineer salary is $147,218. Machine Learning Engineer salaries at Google can range from $110,000 – $152,183.
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
By the end of 2020, 85% of customer interactions will be handled without a human (Call Center, Chatbot, etc…)
61% of marketers say artificial intelligence is the most important aspect of their data strategy.
80% of business and tech leaders say AI already boosts productivity (Robotic Process Automation, Power Automate, etc.)
Current AI technology can boost business productivity by up to 40%
Use this App to learn about Machine Learning and Elevate your Brain with Machine Learning Quizzes, Cheat Sheets, and ML Job Interview Questions and Answers updated daily.
What does a Professional Machine Learning Engineer do?
A Professional Machine Learning Engineer designs, builds, and productionizes ML models to solve business challenges using Google Cloud technologies and knowledge of proven ML models and techniques. The ML Engineer collaborates closely with other job roles to ensure long-term success of models. The ML Engineer should be proficient in all aspects of model architecture, data pipeline interaction, and metrics interpretation. The ML Engineer needs familiarity with application development, infrastructure management, data engineering, and security. Through an understanding of training, retraining, deploying, scheduling, monitoring, and improving models, they design and create scalable solutions for optimal performance.
The AWS Certified Machine Learning – Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate’s ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.
Question1: A machine learning team has several large CSV datasets in Amazon S3. Historically, models built with the Amazon SageMaker Linear Learner algorithm have taken hours to train on similar-sized datasets. The team’s leaders need to accelerate the training process. What can a machine learning specialist do to address this concern?
A) Use Amazon SageMaker Pipe mode.
B) Use Amazon Machine Learning to train the models.
C) Use Amazon Kinesis to stream the data to Amazon SageMaker.
D) Use AWS Glue to transform the CSV dataset to the JSON format.
ANSWER1:
A
Notes/Hint1:
Amazon SageMaker Pipe mode streams the data directly to the container, which improves the performance of training jobs. (Refer to this link for supporting information.) In Pipe mode, your training job streams data directly from Amazon S3. Streaming can provide faster start times for training jobs and better throughput. With Pipe mode, you also reduce the size of the Amazon EBS volumes for your training instances. B would not apply in this scenario. C is a streaming ingestion solution, but is not applicable in this scenario. D transforms the data structure.
Question 2) A local university wants to track cars in a parking lot to determine which students are parking in the lot. The university wants to ingest videos of the cars parking in near-real time, use machine learning to identify license plates, and store that data in an AWS data store. Which solution meets these requirements with the LEAST amount of development effort?
A) Use Amazon Kinesis Data Streams to ingest the video in near-real time, use the Kinesis Data Streams consumer integrated with Amazon Rekognition Video to process the license plate information, and then store results in DynamoDB.
B) Use Amazon Kinesis Video Streams to ingest the videos in near-real time, use the Kinesis Video Streams integration with Amazon Rekognition Video to identify the license plate information, and then store the results in DynamoDB.
C) Use Amazon Kinesis Data Streams to ingest videos in near-real time, call Amazon Rekognition to identify license plate information, and then store results in DynamoDB.
D) Use Amazon Kinesis Firehose to ingest the video in near-real time and output the results to S3. Set up a Lambda function that triggers when a new video is PUT onto S3 and sends it to Amazon Rekognition to identify license plate information, and then store results in DynamoDB.
Answer 2)
B
Notes/Hint2)
Kinesis Video Streams is used to stream videos in near-real time. Amazon Rekognition Video uses Amazon Kinesis Video Streams to receive and process a video stream. After the videos have been processed by Rekognition we can output the results in DynamoDB.
Question 3) A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences:
1. Please call the number below.
2. Please do not call us.
What are the dimensions of the tf–idf matrix?
A) (2, 16)
B) (2, 8)
C) (2, 10)
D) (8, 10)
ANSWER3:
A
Notes/Hint3:
There are 2 sentences, 8 unique unigrams, and 8 unique bigrams, so the result would be (2, 16). The phrases are “Please call the number below” and “Please do not call us.” Each word individually (unigram) is “Please,” “call,” “the,” “number,” “below,” “do,” “not,” and “us.” The unique bigrams are “Please call,” “call the,” “the number,” “number below,” “Please do,” “do not,” “not call,” and “call us.” The tf–idf vectorizer is described at this link.
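The count can be reproduced without any library by building the unigram and bigram vocabulary directly. This is a minimal sketch: the tokenization rule (lowercase alphabetic words) and the helper name ngrams are assumptions made for the example, and only the matrix shape is computed, not the tf–idf weights themselves.

```python
import re


def ngrams(sentence, n):
    """Lowercase the sentence, keep alphabetic tokens, and return its n-grams."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


corpus = ["Please call the number below.", "Please do not call us."]

# One column per unique unigram or bigram, one row per sentence.
vocabulary = set()
for sentence in corpus:
    vocabulary.update(ngrams(sentence, 1))
    vocabulary.update(ngrams(sentence, 2))

shape = (len(corpus), len(vocabulary))
print(shape)  # (2, 16)
```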
Question 4: A company is setting up a system to manage all of the datasets it stores in Amazon S3. The company would like to automate running transformation jobs on the data and maintaining a catalog of the metadata concerning the datasets. The solution should require the least amount of setup and maintenance. Which solution will allow the company to achieve its goals?
A) Create an Amazon EMR cluster with Apache Hive installed. Then, create a Hive metastore and a script to run transformation jobs on a schedule.
B) Create an AWS Glue crawler to populate the AWS Glue Data Catalog. Then, author an AWS Glue ETL job, and set up a schedule for data transformation jobs.
C) Create an Amazon EMR cluster with Apache Spark installed. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
D) Create an AWS Data Pipeline that transforms the data. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
ANSWER4:
B
Notes/Hint4:
AWS Glue is the correct answer because this option requires the least amount of setup and maintenance since it is serverless, and it does not require management of the infrastructure. Refer to this link for supporting information. A, C, and D are all solutions that can solve the problem, but require more steps for configuration, and require higher operational overhead to run and maintain.
Question 5) Which service in the Kinesis family allows you to easily load streaming data into data stores and analytics tools?
A) Kinesis Firehose
B) Kinesis Streams
C) Kinesis Data Analytics
D) Kinesis Video Streams
ANSWER5:
A
Notes/Hint5:
Kinesis Firehose is perfect for streaming data into AWS and sending it directly to its final destination, such as S3, Redshift, Elasticsearch, or Splunk instances.
Question 6) A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the data scientist do to improve the training process?
A) Increase the learning rate. Keep the batch size the same.
B) Reduce the batch size. Decrease the learning rate.
C) Keep the batch size the same. Decrease the learning rate.
D) Do not change the learning rate. Increase the batch size.
Answer 6)
B
Notes 6)
The loss function most likely has a rugged surface with multiple local minima in which the training is getting stuck. Decreasing the batch size adds noise to the gradient estimates, which helps training escape local minima and saddle points. Decreasing the learning rate would prevent overshooting the global loss function minimum. Refer to the paper at this link for an explanation. Reference 6: Here
Question 7) Your organization has a standalone JavaScript (Node.js) application that streams data into AWS using Kinesis Data Streams. You notice that it uses the Kinesis API (AWS SDK) rather than the Kinesis Producer Library (KPL). What might be the reasoning behind this?
A) The Kinesis API (AWS SDK) provides greater functionality than the Kinesis Producer Library.
B) The Kinesis API (AWS SDK) runs faster in JavaScript applications than the Kinesis Producer Library.
C) The Kinesis Producer Library must be installed as a Java application to use with Kinesis Data Streams.
D) The Kinesis Producer Library cannot be integrated with a JavaScript application because of its asynchronous architecture.
Answer 7)
C
Notes/Hint7:
The KPL must be installed as a Java application before it can be used with your Kinesis Data Streams. There are ways to process KPL-serialized data within AWS Lambda in Java, Node.js, and Python, but none of these answers mentions Lambda.
Question 8) A data scientist is evaluating different binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result. The models should be evaluated based on the following criteria:
1) Must have a recall rate of at least 80%
2) Must have a false positive rate of 10% or less
3) Must minimize business costs After creating each binary classification model, the data scientist generates the corresponding confusion matrix. Which confusion matrix represents the model that satisfies the requirements?
A) TN = 91, FP = 9 FN = 22, TP = 78
B) TN = 99, FP = 1 FN = 21, TP = 79
C) TN = 96, FP = 4 FN = 10, TP = 90
D) TN = 98, FP = 2 FN = 18, TP = 82
Answer 8):
D
Notes/Hint 8)
The following calculations are required: TP = True Positive FP = False Positive FN = False Negative TN = True Negative FN = False Negative Recall = TP / (TP + FN) False Positive Rate (FPR) = FP / (FP + TN) Cost = 5 * FP + FN A B C D Recall 78 / (78 + 22) = 0.78 79 / (79 + 21) = 0.79 90 / (90 + 10) = 0.9 82 / (82 + 18) = 0.82 False Positive Rate 9 / (9 + 91) = 0.09 1 / (1 + 99) = 0.01 4 / (4 + 96) = 0.04 2 / (2 + 98) = 0.02 Costs 5 * 9 + 22 = 67 5 * 1 + 21 = 26 5 * 4 + 10 = 30 5 * 2 + 18 = 28 Options C and D have a recall greater than 80% and an FPR less than 10%, but D is the most cost effective. For supporting information, refer to this link. Reference 8:Here
Question 9) A data scientist uses logistic regression to build a fraud detection model. While the model accuracy is 99%, 90% of the fraud cases are not detected by the model. What action will definitely help the model detect more than 10% of fraud cases?
A) Using undersampling to balance the dataset
B) Decreasing the class probability threshold
C) Using regularization to reduce overfitting
D) Using oversampling to balance the dataset
Answer 9)
B
Notes 9)
Decreasing the class probability threshold makes the model more sensitive and, therefore, marks more cases as the positive class, which is fraud in this case. This will increase the likelihood of fraud detection. However, it comes at the price of lowering precision. This is covered in the Discussion section of the paper at this link Reference 9:Here
Question 10) A company is interested in building a fraud detection model. Currently, the data scientist does not have a sufficient amount of information due to the low number of fraud cases. Which method is MOST likely to detect the GREATEST number of valid fraud cases?
A) Oversampling using bootstrapping
B) Undersampling
C) Oversampling using SMOTE
D) Class weight adjustment
Answer 10)
C
Notes 10)
With datasets that are not fully populated, the Synthetic Minority Over-sampling Technique (SMOTE) adds new information by adding synthetic data points to the minority class. This technique would be the most effective in this scenario. Refer to Section 4.2 at this link for supporting information. Reference 10) : Here
Question 11) A machine learning engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm. The ML engineer notices the target label classes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%. What should the ML engineer do to minimize bias due to missing values?
A) Replace each missing value by the mean or median across non-missing values in same row.
B) Delete observations that contain missing values because these represent less than 5% of the data.
C) Replace each missing value by the mean or median across non-missing values in the same column.
D) For each feature, approximate the missing values using supervised learning based on other features.
Answer 11)
D
Notes 11)
Use supervised learning to predict missing values based on the values of other features. Different supervised learning approaches might have different performances, but any properly implemented supervised learning approach should provide the same or better approximation than mean or median approximation, as proposed in responses A and C. Supervised learning applied to the imputation of missing values is an active field of research. Refer to this link for an example. Reference 11): Here
Question 12) A company has collected customer comments on its products, rating them as safe or unsafe, using decision trees. The training dataset has the following features: id, date, full review, full review summary, and a binary safe/unsafe tag. During training, any data sample with missing features was dropped. In a few instances, the test set was found to be missing the full review text field. For this use case, which is the most effective course of action to address test data samples with missing features?
A) Drop the test samples with missing full review text fields, and then run through the test set.
B) Copy the summary text fields and use them to fill in the missing full review text fields, and then run through the test set.
C) Use an algorithm that handles missing data better than decision trees.
D) Generate synthetic data to fill in the fields that are missing data, and then run through the test set.
Answer 12)
B
Notes 12)
In this case, a full review summary usually contains the most descriptive phrases of the entire review and is a valid stand-in for the missing full review text field. For supporting information, refer to page 1627 at this link, and this link and this link.
Question 13) An insurance company needs to automate claim compliance reviews because human reviews are expensive and error-prone. The company has a large set of claims and a compliance label for each. Each claim consists of a few sentences in English, many of which contain complex related information. Management would like to use Amazon SageMaker built-in algorithms to design a machine learning supervised model that can be trained to read each claim and predict if the claim is compliant or not. Which approach should be used to extract features from the claims to be used as inputs for the downstream supervised task?
A) Derive a dictionary of tokens from claims in the entire dataset. Apply one-hot encoding to tokens found in each claim of the training set. Send the derived features space as inputs to an Amazon SageMaker builtin supervised learning algorithm.
B) Apply Amazon SageMaker BlazingText in Word2Vec mode to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
C) Apply Amazon SageMaker BlazingText in classification mode to labeled claims in the training set to derive features for the claims that correspond to the compliant and non-compliant labels, respectively.
D) Apply Amazon SageMaker Object2Vec to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
Answer 13)
D
Notes 13)
Amazon SageMaker Object2Vec generalizes the Word2Vec embedding technique for words to more complex objects, such as sentences and paragraphs. Since the supervised learning task is at the level of whole claims, for which there are labels, and no labels are available at the word level, Object2Vec needs be used instead of Word2Vec.
Question 14) You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to immediately be processed before operations can continue. The second event type includes data of less importance, but operations can continue without immediately processing. What is the most appropriate solution to record these different types of events?
A) Capture both events with the PutRecords API call.
B) Capture both event types using the Kinesis Producer Library (KPL).
C) Capture the mission critical events with the PutRecords API call and the second event type with the Kinesis Producer Library (KPL).
D) Capture the mission critical events with the Kinesis Producer Library (KPL) and the second event type with the Putrecords API call.
Answer 14)
C
Notes 14)
The question is about sending data to Kinesis synchronously vs. asynchronously. PutRecords is a synchronous send function, so it must be used for the first event type (critical events). The Kinesis Producer Library (KPL) implements an asynchronous send function, so it can be used for the second event type. In this scenario, the reason to use the KPL over the PutRecords API call is because: KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable). Larger values of RecordMaxBufferedTime results in higher packing efficiencies and better performance. Applications that cannot tolerate this additional delay may need to use the AWS SDK directly. For more information about using the AWS SDK with Kinesis Data Streams, see Developing Producers Using the Amazon Kinesis Data Streams API with the AWS SDK for Java. For more information about RecordMaxBufferedTime and other user-configurable properties of the KPL, see Configuring the Kinesis Producer Library.
Question 15) You are collecting clickstream data from an e-commerce website to make near-real time product suggestions for users actively using the site. Which combination of tools can be used to achieve the quickest recommendations and meets all of the requirements?
A) Use Kinesis Data Streams to ingest clickstream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
B) Use Kinesis Data Firehose to ingest click stream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions, then use Lambda to load these results into S3.
C) Use Kinesis Data Streams to ingest clickstream data, then use Lambda to process that data and write it to S3. Once the data is on S3, use Athena to query based on conditions that data and make real time recommendations to users.
D) Use the Kinesis Data Analytics to ingest the clickstream data directly and run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
Answer 15)
A
Notes 15)
Kinesis Data Analytics gets its input streaming data from Kinesis Data Streams or Kinesis Data Firehose. You can use Kinesis Data Analytics to run real-time SQL queries on your data. Once certain conditions are met you can trigger Lambda functions to make real time product suggestions to users. It is not important that we store or persist the clickstream data.
Question 16) Which service built by AWS makes it easy to set up a retry mechanism, aggregate records to improve throughput, and automatically submit CloudWatch metrics?
A) Kinesis API (AWS SDK)
B) Kinesis Producer Library (KPL)
C) Kinesis Consumer Library
D) Kinesis Client Library (KCL)
Answer 16)
B
Notes 16)
Although the Kinesis API built into the AWS SDK can be used for all of this, the Kinesis Producer Library (KPL) makes it easy to integrate all of this into your applications.
Question 17) You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data that you are ingesting is the players' controller inputs, sent every 1 second (up to 10 players in a game) in JSON format. The data needs to be ingested through Kinesis Data Streams and the JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data?
A) 10 shards
B) Greater than 500 shards, so you’ll need to request more shards from AWS
C) 1 shard
D) 100 shards
Answer 17)
C
Notes 17)
In this scenario, there will be a maximum of 10 records per second with a max payload size of 1,000 KB (10 records x 100 KB = 1,000 KB) written to the shard. A single shard can ingest up to 1 MB of data per second, which is enough to ingest the 1,000 KB from the streaming game play. Therefore, 1 shard is enough to handle the streaming data.
Question 18) Which service in the Kinesis family allows you to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time?
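The arithmetic in Notes 17 can be sketched as a small helper. This is only an illustration: the function name `min_shards` is hypothetical, 1 MB is treated as 1,000 KB as in the note, and the 1,000 records per second per-shard write limit is included alongside the 1 MB per second limit.

```python
# Per-shard write limits for Kinesis Data Streams.
# Assumption: 1 MB is treated as 1,000,000 bytes (1,000 KB), matching the note above.
SHARD_MAX_BYTES_PER_SEC = 1_000_000
SHARD_MAX_RECORDS_PER_SEC = 1_000


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def min_shards(records_per_sec: int, record_size_bytes: int) -> int:
    """Smallest shard count that satisfies both per-shard write limits."""
    by_bytes = ceil_div(records_per_sec * record_size_bytes, SHARD_MAX_BYTES_PER_SEC)
    by_records = ceil_div(records_per_sec, SHARD_MAX_RECORDS_PER_SEC)
    return max(1, by_bytes, by_records)


# 10 players, one 100 KB record each per second.
print(min_shards(records_per_sec=10, record_size_bytes=100_000))  # 1
```

With 20 players sending the same payload, the byte limit would be exceeded and the helper would return 2.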
A) Kinesis Streams
B) Kinesis Firehose
C) Kinesis Video Streams
D) Kinesis Data Analytics
Answer 18)
D
Notes 18)
Kinesis Data Analytics allows you to run real-time SQL queries on your data to gain insights and respond to events in real time.
Question 19) You are an ML specialist needing to collect data from Twitter tweets. Your goal is to collect tweets that include only the name of your company and the tweet body, and store them in a data store in AWS. What set of tools can you use to stream, transform, and load the data into AWS with the LEAST amount of effort?
A) Set up a Kinesis Data Firehose for data ingestion and immediately write that data to S3. Next, set up a Lambda function to trigger when data lands in S3 to transform it and finally write it to DynamoDB.
B) Set up a Kinesis Data Stream for data ingestion, and set up EC2 instances as data consumers to poll and transform the data from the stream. Once the data is transformed, make an API call to write the data to DynamoDB.
C) Set up Kinesis Data Streams for data ingestion. Next, set up Kinesis Data Firehose to load that data into Redshift. Next, set up a Lambda function to query data using Redshift Spectrum and store the results in DynamoDB.
D) Create a Kinesis Data Stream to ingest the data. Next, set up a Kinesis Data Firehose and use Lambda to transform the data from the Kinesis Data Stream, then use Lambda to write the data to DynamoDB. Finally, use S3 as the data destination for Kinesis Data Firehose.
Answer 19)
A
Notes 19)
All of these could be used to stream, transform, and load the data into an AWS data store. The setup that requires the LEAST amount of effort and moving parts involves setting up a Kinesis Data Firehose to stream the data into S3, have it transformed by Lambda with an S3 trigger, and then written to DynamoDB.
What are the Top 100 AWS and Google Certified Machine Learning Specialty Questions and Answers Dumps?
This blog is the best way to prepare for your upcoming AWS Certified Machine Learning Specialty and Google Certified Professional Machine Learning Engineer exams. With over 100 questions and answers, this blog provides quizzes that are very similar to the real exam. It also includes the option to show and hide answers. Additionally, there are machine learning interview questions and detailed answers, as well as cheat sheets and illustrations. This blog is the best way to make sure you are well-prepared for your AWS Certified Machine Learning Specialty Exam.
The typical Google Machine Learning Engineer salary is $147,218. Machine Learning Engineer salaries at Google can range from $110,000 – $152,183.
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
By the end of 2020, 85% of customer interactions will be handled without a human (Call Center, Chatbot, etc…)
61% of marketers say artificial intelligence is the most important aspect of their data strategy.
80% of business and tech leaders say AI already boosts productivity (Robotic Process Automation, Power Automate, etc.)
Current AI technology can boost business productivity by up to 40%
Use this App to learn about Machine Learning and Elevate your Brain with Machine Learning Quizzes, Cheat Sheets, and ML Job Interview Questions and Answers updated daily.
What does a Professional Machine Learning Engineer do?
A Professional Machine Learning Engineer designs, builds, and productionizes ML models to solve business challenges using Google Cloud technologies and knowledge of proven ML models and techniques. The ML Engineer collaborates closely with other job roles to ensure long-term success of models. The ML Engineer should be proficient in all aspects of model architecture, data pipeline interaction, and metrics interpretation. The ML Engineer needs familiarity with application development, infrastructure management, data engineering, and security. Through an understanding of training, retraining, deploying, scheduling, monitoring, and improving models, they design and create scalable solutions for optimal performance.
The AWS Certified Machine Learning – Specialty certification is intended for individuals who perform a development or data science role. It validates a candidate’s ability to design, implement, deploy, and maintain machine learning (ML) solutions for given business problems.
Question 1) A machine learning team has several large CSV datasets in Amazon S3. Historically, models built with the Amazon SageMaker Linear Learner algorithm have taken hours to train on similar-sized datasets. The team’s leaders need to accelerate the training process. What can a machine learning specialist do to address this concern?
A) Use Amazon SageMaker Pipe mode.
B) Use Amazon Machine Learning to train the models.
C) Use Amazon Kinesis to stream the data to Amazon SageMaker.
D) Use AWS Glue to transform the CSV dataset to the JSON format.
ANSWER1:
A
Notes/Hint1:
Amazon SageMaker Pipe mode streams the data directly to the container, which improves the performance of training jobs. (Refer to this link for supporting information.) In Pipe mode, your training job streams data directly from Amazon S3. Streaming can provide faster start times for training jobs and better throughput. With Pipe mode, you also reduce the size of the Amazon EBS volumes for your training instances. B would not apply in this scenario. C is a streaming ingestion solution, but is not applicable in this scenario. D transforms the data structure.
Question 2) A local university wants to track cars in a parking lot to determine which students are parking in the lot. The university wants to ingest videos of the cars parking in near-real time, use machine learning to identify license plates, and store that data in an AWS data store. Which solution meets these requirements with the LEAST amount of development effort?
A) Use Amazon Kinesis Data Streams to ingest the video in near-real time, use the Kinesis Data Streams consumer integrated with Amazon Rekognition Video to process the license plate information, and then store results in DynamoDB.
B) Use Amazon Kinesis Video Streams to ingest the videos in near-real time, use the Kinesis Video Streams integration with Amazon Rekognition Video to identify the license plate information, and then store the results in DynamoDB.
C) Use Amazon Kinesis Data Streams to ingest videos in near-real time, call Amazon Rekognition to identify license plate information, and then store results in DynamoDB.
D) Use Amazon Kinesis Firehose to ingest the video in near-real time and output the results onto S3. Set up a Lambda function that triggers when a new video is PUT onto S3 to send results to Amazon Rekognition to identify license plate information, and then store results in DynamoDB.
Answer 2)
B
Notes/Hint2)
Kinesis Video Streams is used to stream videos in near-real time. Amazon Rekognition Video uses Amazon Kinesis Video Streams to receive and process a video stream. After the videos have been processed by Rekognition we can output the results in DynamoDB.
Question 3) A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences:
1. Please call the number below.
2. Please do not call us.
What are the dimensions of the tf–idf matrix?
A) (2, 16)
B) (2, 8)
C) (2, 10)
D) (8, 10)
ANSWER3:
A
Notes/Hint3:
There are 2 sentences, 8 unique unigrams, and 8 unique bigrams, so the result would be (2, 16). The sentences are “Please call the number below” and “Please do not call us.” The unique unigrams are “Please,” “call,” “the,” “number,” “below,” “do,” “not,” and “us.” The unique bigrams are “Please call,” “call the,” “the number,” “number below,” “Please do,” “do not,” “not call,” and “call us.” The tf–idf vectorizer is described at this link.
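The count in Notes/Hint3 can be checked with a short script. This is a minimal sketch in plain Python rather than any particular vectorizer library; the helper names `ngrams` and `vocabulary` are made up for illustration, and the tokenization (lowercased alphabetic words) is an assumption that a real vectorizer may not share.

```python
import re


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vocabulary(sentences, n_values=(1, 2)):
    """Collect the unique n-grams (unigrams and bigrams by default) over all sentences."""
    vocab = set()
    for sentence in sentences:
        # Assumed tokenization: lowercase, alphabetic words only.
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for n in n_values:
            vocab.update(ngrams(tokens, n))
    return vocab


corpus = ["Please call the number below.", "Please do not call us."]
vocab = vocabulary(corpus)

# One row per sentence, one column per unique unigram or bigram.
shape = (len(corpus), len(vocab))
print(shape)  # (2, 16)
```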
Question 4: A company is setting up a system to manage all of the datasets it stores in Amazon S3. The company would like to automate running transformation jobs on the data and maintaining a catalog of the metadata concerning the datasets. The solution should require the least amount of setup and maintenance. Which solution will allow the company to achieve its goals?
A) Create an Amazon EMR cluster with Apache Hive installed. Then, create a Hive metastore and a script to run transformation jobs on a schedule.
B) Create an AWS Glue crawler to populate the AWS Glue Data Catalog. Then, author an AWS Glue ETL job, and set up a schedule for data transformation jobs.
C) Create an Amazon EMR cluster with Apache Spark installed. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
D) Create an AWS Data Pipeline that transforms the data. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
ANSWER4:
B
Notes/Hint4:
AWS Glue is the correct answer because this option requires the least amount of setup and maintenance since it is serverless, and it does not require management of the infrastructure. Refer to this link for supporting information. A, C, and D are all solutions that can solve the problem, but require more steps for configuration, and require higher operational overhead to run and maintain.
Question 5) Which service in the Kinesis family allows you to easily load streaming data into data stores and analytics tools?
A) Kinesis Firehose
B) Kinesis Streams
C) Kinesis Data Analytics
D) Kinesis Video Streams
ANSWER5:
A
Notes/Hint5:
Kinesis Firehose is perfect for streaming data into AWS and sending it directly to its final destination, such as S3, Redshift, Elasticsearch, or Splunk.
Question 6) A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the data scientist do to improve the training process?
A) Increase the learning rate. Keep the batch size the same.
B) Reduce the batch size. Decrease the learning rate.
C) Keep the batch size the same. Decrease the learning rate.
D) Do not change the learning rate. Increase the batch size.
Answer 6)
B
Notes 6)
It is most likely that the loss function is very curvy and has multiple local minima where the training is getting stuck. Decreasing the batch size would help the data scientist stochastically get out of the local minima and saddle points. Decreasing the learning rate would prevent overshooting the global loss function minimum. Refer to the paper at this link for an explanation. Reference 6): Here
Question 7) Your organization has a standalone JavaScript (Node.js) application that streams data into AWS using Kinesis Data Streams. You notice that they are using the Kinesis API (AWS SDK) over the Kinesis Producer Library (KPL). What might be the reasoning behind this?
A) The Kinesis API (AWS SDK) provides greater functionality over the Kinesis Producer Library.
B) The Kinesis API (AWS SDK) runs faster in JavaScript applications over the Kinesis Producer Library.
C) The Kinesis Producer Library must be installed as a Java application to use with Kinesis Data Streams.
D) The Kinesis Producer Library cannot be integrated with a JavaScript application because of its asynchronous architecture.
Answer 7)
C
Notes/Hint7:
The KPL must be installed as a Java application before it can be used with your Kinesis Data Streams. There are ways to process KPL serialized data within AWS Lambda, in Java, Node.js, and Python, but none of these answers mentions Lambda.
Question 8) A data scientist is evaluating different binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result. The models should be evaluated based on the following criteria:
1) Must have a recall rate of at least 80%
2) Must have a false positive rate of 10% or less
3) Must minimize business costs
After creating each binary classification model, the data scientist generates the corresponding confusion matrix. Which confusion matrix represents the model that satisfies the requirements?
A) TN = 91, FP = 9, FN = 22, TP = 78
B) TN = 99, FP = 1, FN = 21, TP = 79
C) TN = 96, FP = 4, FN = 10, TP = 90
D) TN = 98, FP = 2, FN = 18, TP = 82
Answer 8):
D
Notes/Hint 8)
The following calculations are required:
TP = True Positive, FP = False Positive, FN = False Negative, TN = True Negative
Recall = TP / (TP + FN)
False Positive Rate (FPR) = FP / (FP + TN)
Cost = 5 * FP + FN
A) Recall = 78 / (78 + 22) = 0.78; FPR = 9 / (9 + 91) = 0.09; Cost = 5 * 9 + 22 = 67
B) Recall = 79 / (79 + 21) = 0.79; FPR = 1 / (1 + 99) = 0.01; Cost = 5 * 1 + 21 = 26
C) Recall = 90 / (90 + 10) = 0.90; FPR = 4 / (4 + 96) = 0.04; Cost = 5 * 4 + 10 = 30
D) Recall = 82 / (82 + 18) = 0.82; FPR = 2 / (2 + 98) = 0.02; Cost = 5 * 2 + 18 = 28
Only options C and D have a recall of at least 80% and an FPR of 10% or less, and of those two, D has the lower cost. (B is cheaper still, but it misses the recall requirement.) For supporting information, refer to this link. Reference 8: Here
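The same selection can be reproduced in a few lines of Python. This is a sketch of the calculation described in the note, using the four confusion matrices from the question; the function name `evaluate` and the dictionary layout are illustrative choices.

```python
# Confusion matrices from the four answer options.
models = {
    "A": {"TN": 91, "FP": 9, "FN": 22, "TP": 78},
    "B": {"TN": 99, "FP": 1, "FN": 21, "TP": 79},
    "C": {"TN": 96, "FP": 4, "FN": 10, "TP": 90},
    "D": {"TN": 98, "FP": 2, "FN": 18, "TP": 82},
}


def evaluate(cm, fp_cost=5, fn_cost=1):
    """Return (recall, false positive rate, business cost) for one confusion matrix."""
    recall = cm["TP"] / (cm["TP"] + cm["FN"])
    fpr = cm["FP"] / (cm["FP"] + cm["TN"])
    cost = fp_cost * cm["FP"] + fn_cost * cm["FN"]
    return recall, fpr, cost


results = {name: evaluate(cm) for name, cm in models.items()}

# Keep only models meeting both hard constraints, then pick the cheapest.
eligible = {
    name: metrics
    for name, metrics in results.items()
    if metrics[0] >= 0.80 and metrics[1] <= 0.10
}
best = min(eligible, key=lambda name: eligible[name][2])

print(sorted(eligible))  # ['C', 'D']
print(best)              # D
```

Filtering on the hard constraints before minimizing cost matters here: without the filter, option B would win on cost alone.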
Question 9) A data scientist uses logistic regression to build a fraud detection model. While the model accuracy is 99%, 90% of the fraud cases are not detected by the model. What action will definitely help the model detect more than 10% of fraud cases?
A) Using undersampling to balance the dataset
B) Decreasing the class probability threshold
C) Using regularization to reduce overfitting
D) Using oversampling to balance the dataset
Answer 9)
B
Notes 9)
Decreasing the class probability threshold makes the model more sensitive and, therefore, marks more cases as the positive class, which is fraud in this case. This will increase the likelihood of fraud detection. However, it comes at the price of lowering precision. This is covered in the Discussion section of the paper at this link. Reference 9: Here
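The effect of the threshold can be seen on a toy example. The labels and scores below are made up purely for illustration (they are not from any real fraud model), and the helper names are hypothetical.

```python
def predict(scores, threshold):
    """Label a case as positive (fraud = 1) when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def precision(y_true, y_pred):
    """Fraction of predicted positives that are actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(y_pred)
    return tp / predicted_pos if predicted_pos else 0.0


# Hypothetical data: 4 fraud cases (1) and 6 legitimate cases (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.90, 0.40, 0.30, 0.20, 0.35, 0.10, 0.05, 0.05, 0.02, 0.01]

high = predict(scores, threshold=0.50)
low = predict(scores, threshold=0.25)

print(recall(y_true, high), precision(y_true, high))  # 0.25 1.0
print(recall(y_true, low), precision(y_true, low))    # 0.75 0.75
```

Lowering the threshold from 0.50 to 0.25 raises recall from 0.25 to 0.75 in this toy data, while precision drops from 1.0 to 0.75, which is the trade-off the note describes.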
Question 10) A company is interested in building a fraud detection model. Currently, the data scientist does not have a sufficient amount of information due to the low number of fraud cases. Which method is MOST likely to detect the GREATEST number of valid fraud cases?
A) Oversampling using bootstrapping
B) Undersampling
C) Oversampling using SMOTE
D) Class weight adjustment
Answer 10)
C
Notes 10)
With datasets that are not fully populated, the Synthetic Minority Over-sampling Technique (SMOTE) adds new information by adding synthetic data points to the minority class. This technique would be the most effective in this scenario. Refer to Section 4.2 at this link for supporting information. Reference 10) : Here
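The idea of adding synthetic minority points can be sketched as follows. This is a simplified, SMOTE-style interpolation written for illustration only, not the reference algorithm or any library's implementation; the function name `smote_like`, the sample points, and the parameter defaults are all assumptions.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbors of the chosen sample (squared Euclidean distance).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


# Hypothetical minority-class (fraud) samples with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
print(len(new_points))  # 6
```

Each synthetic point lies on a line segment between two real minority samples, so the new data stays inside the region the minority class already occupies rather than duplicating existing rows, as bootstrapped oversampling would.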
Question 11) A machine learning engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm. The ML engineer notices the target label classes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%. What should the ML engineer do to minimize bias due to missing values?
A) Replace each missing value by the mean or median across non-missing values in the same row.
B) Delete observations that contain missing values because these represent less than 5% of the data.
C) Replace each missing value by the mean or median across non-missing values in the same column.
D) For each feature, approximate the missing values using supervised learning based on other features.
Answer 11)
D
Notes 11)
Use supervised learning to predict missing values based on the values of other features. Different supervised learning approaches might have different performances, but any properly implemented supervised learning approach should provide the same or better approximation than mean or median approximation, as proposed in responses A and C. Supervised learning applied to the imputation of missing values is an active field of research. Refer to this link for an example. Reference 11): Here
Question 12) A company has collected customer comments on its products, rating them as safe or unsafe, using decision trees. The training dataset has the following features: id, date, full review, full review summary, and a binary safe/unsafe tag. During training, any data sample with missing features was dropped. In a few instances, the test set was found to be missing the full review text field. For this use case, which is the most effective course of action to address test data samples with missing features?
A) Drop the test samples with missing full review text fields, and then run through the test set.
B) Copy the summary text fields and use them to fill in the missing full review text fields, and then run through the test set.
C) Use an algorithm that handles missing data better than decision trees.
D) Generate synthetic data to fill in the fields that are missing data, and then run through the test set.
Answer 12)
B
Notes 12)
In this case, a full review summary usually contains the most descriptive phrases of the entire review and is a valid stand-in for the missing full review text field. For supporting information, refer to page 1627 at this link, and this link and this link.
Question 13) An insurance company needs to automate claim compliance reviews because human reviews are expensive and error-prone. The company has a large set of claims and a compliance label for each. Each claim consists of a few sentences in English, many of which contain complex related information. Management would like to use Amazon SageMaker built-in algorithms to design a machine learning supervised model that can be trained to read each claim and predict if the claim is compliant or not. Which approach should be used to extract features from the claims to be used as inputs for the downstream supervised task?
A) Derive a dictionary of tokens from claims in the entire dataset. Apply one-hot encoding to tokens found in each claim of the training set. Send the derived features space as inputs to an Amazon SageMaker built-in supervised learning algorithm.
B) Apply Amazon SageMaker BlazingText in Word2Vec mode to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
C) Apply Amazon SageMaker BlazingText in classification mode to labeled claims in the training set to derive features for the claims that correspond to the compliant and non-compliant labels, respectively.
D) Apply Amazon SageMaker Object2Vec to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
Answer 13)
D
Notes 13)
Amazon SageMaker Object2Vec generalizes the Word2Vec embedding technique for words to more complex objects, such as sentences and paragraphs. Since the supervised learning task is at the level of whole claims, for which there are labels, and no labels are available at the word level, Object2Vec needs to be used instead of Word2Vec.
Question 14) You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to immediately be processed before operations can continue. The second event type includes data of less importance, and operations can continue without processing it immediately. What is the most appropriate solution to record these different types of events?
A) Capture both events with the PutRecords API call.
B) Capture both event types using the Kinesis Producer Library (KPL).
C) Capture the mission critical events with the PutRecords API call and the second event type with the Kinesis Producer Library (KPL).
D) Capture the mission critical events with the Kinesis Producer Library (KPL) and the second event type with the PutRecords API call.
Answer 14)
C
Notes 14)
The question is about sending data to Kinesis synchronously vs. asynchronously. PutRecords is a synchronous send function, so it should be used for the first event type (critical events). The Kinesis Producer Library (KPL) implements an asynchronous send function, so it can be used for the second event type. The reason to prefer the PutRecords API call over the KPL for the critical events is that the KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable). Larger values of RecordMaxBufferedTime result in higher packing efficiencies and better performance. Applications that cannot tolerate this additional delay may need to use the AWS SDK directly. For more information about using the AWS SDK with Kinesis Data Streams, see Developing Producers Using the Amazon Kinesis Data Streams API with the AWS SDK for Java. For more information about RecordMaxBufferedTime and other user-configurable properties of the KPL, see Configuring the Kinesis Producer Library.
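The buffering delay described in Notes 14 can be illustrated with a toy model. The class below is a hypothetical stand-in written for this example; it is not the KPL API and does not call AWS. Time is passed in explicitly (`now_ms`) so the behavior is easy to follow.

```python
class BufferingProducer:
    """Toy stand-in for a buffering producer (NOT the real KPL API).

    Records are held until the oldest one has waited max_buffered_ms,
    then the whole buffer is sent as one batch.
    """

    def __init__(self, max_buffered_ms, send_batch):
        self.max_buffered_ms = max_buffered_ms
        self.send_batch = send_batch  # callable that receives a list of records
        self.buffer = []
        self.oldest_ms = None

    def add(self, record, now_ms):
        if not self.buffer:
            self.oldest_ms = now_ms  # arrival time of the oldest buffered record
        self.buffer.append(record)
        self.flush_if_due(now_ms)

    def flush_if_due(self, now_ms):
        if self.buffer and now_ms - self.oldest_ms >= self.max_buffered_ms:
            self.send_batch(list(self.buffer))
            self.buffer.clear()


sent = []
producer = BufferingProducer(max_buffered_ms=100, send_batch=sent.append)
producer.add("a", now_ms=0)
producer.add("b", now_ms=40)
producer.add("c", now_ms=100)  # "a" has now waited 100 ms, so the batch is sent
producer.add("d", now_ms=120)  # starts a new buffer and waits

print(sent)             # [['a', 'b', 'c']]
print(producer.buffer)  # ['d']
```

In this sketch, record "a" is delayed by the full buffering window before it leaves the producer. That is acceptable for the less important events, but it is the kind of delay that mission-critical events sent with a synchronous call avoid.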
Question1: A machine learning team has several large CSV datasets in Amazon S3. Historically, models built with the Amazon SageMaker Linear Learner algorithm have taken hours to train on similar-sized datasets. The team’s leaders need to accelerate the training process. What can a machine learning specialist do to address this concern?
A) Use Amazon SageMaker Pipe mode. B) Use Amazon Machine Learning to train the models. C) Use Amazon Kinesis to stream the data to Amazon SageMaker. D) Use AWS Glue to transform the CSV dataset to the JSON format. ANSWER1:
A
Notes/Hint1:
Amazon SageMaker Pipe mode streams the data directly to the container, which improves the performance of training jobs. (Refer to this link for supporting information.) In Pipe mode, your training job streams data directly from Amazon S3. Streaming can provide faster start times for training jobs and better throughput. With Pipe mode, you also reduce the size of the Amazon EBS volumes for your training instances. B would not apply in this scenario. C is a streaming ingestion solution, but is not applicable in this scenario. D transforms the data structure.
Question 2) A local university wants to track cars in a parking lot to determine which students are parking in the lot. The university is wanting to ingest videos of the cars parking in near-real time, use machine learning to identify license plates, and store that data in an AWS data store. Which solution meets these requirements with the LEAST amount of development effort?
A) Use Amazon Kinesis Data Streams to ingest the video in near-real time, use the Kinesis Data Streams consumer integrated with Amazon Rekognition Video to process the license plate information, and then store results in DynamoDB.
B) Use Amazon Kinesis Video Streams to ingest the videos in near-real time, use the Kinesis Video Streams integration with Amazon Rekognition Video to identify the license plate information, and then store the results in DynamoDB.
C) Use Amazon Kinesis Data Streams to ingest videos in near-real time, call Amazon Rekognition to identify license plate information, and then store results in DynamoDB.
D) Use Amazon Kinesis Firehose to ingest the video in near-real time and output the results onto S3. Set up a Lambda function that triggers when a new video is PUT onto S3 to send results to Amazon Rekognition to identify license plate information, and then store results in DynamoDB.
Answer 2)
B
Notes/Hint2)
Kinesis Video Streams is used to stream videos in near-real time. Amazon Rekognition Video uses Amazon Kinesis Video Streams to receive and process a video stream. After the videos have been processed by Rekognition we can output the results in DynamoDB.
Question 3) A term frequency–inverse document frequency (tf–idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences:
1. Please call the number below.
2. Please do not call us.
What are the dimensions of the tf–idf matrix?
A) (2, 16)
B) (2, 8)
C) (2, 10)
D) (8, 10)
ANSWER3:
A
Notes/Hint3:
There are 2 sentences, 8 unique unigrams, and 8 unique bigrams, so the result would be (2, 16). The phrases are “Please call the number below” and “Please do not call us.” Each word individually (unigram) is “Please,” “call,” “the,” “number,” “below,” “do,” “not,” and “us.” The unique bigrams are “Please call,” “call the,” “the number,” “number below,” “Please do,” “do not,” “not call,” and “call us.” The tf–idf vectorizer is described at this link.
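The count above can be checked with a few lines of plain Python. This is only a sketch of the vocabulary-building step, not a full tf–idf implementation; the helper names `ngrams` and `vocabulary` are made up for illustration, and the tokenization (lowercasing, stripping the final period, splitting on spaces) is an assumption.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token phrases from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vocabulary(sentences, n_values=(1, 2)):
    """Collect the unique unigrams and bigrams across all sentences."""
    vocab = set()
    for sentence in sentences:
        # Assumed tokenization: lowercase, drop the final period, split on spaces.
        tokens = sentence.lower().rstrip(".").split()
        for n in n_values:
            vocab.update(ngrams(tokens, n))
    return sorted(vocab)


sentences = ["Please call the number below.", "Please do not call us."]
vocab = vocabulary(sentences)

# One row per sentence, one column per unique unigram or bigram.
shape = (len(sentences), len(vocab))
print(shape)
```

Each column of the tf–idf matrix corresponds to one entry of `vocab`, so the matrix shape follows directly from the number of sentences and the vocabulary size.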
Question 4: A company is setting up a system to manage all of the datasets it stores in Amazon S3. The company would like to automate running transformation jobs on the data and maintaining a catalog of the metadata concerning the datasets. The solution should require the least amount of setup and maintenance. Which solution will allow the company to achieve its goals?
A) Create an Amazon EMR cluster with Apache Hive installed. Then, create a Hive metastore and a script to run transformation jobs on a schedule.
B) Create an AWS Glue crawler to populate the AWS Glue Data Catalog. Then, author an AWS Glue ETL job, and set up a schedule for data transformation jobs.
C) Create an Amazon EMR cluster with Apache Spark installed. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
D) Create an AWS Data Pipeline that transforms the data. Then, create an Apache Hive metastore and a script to run transformation jobs on a schedule.
ANSWER4:
B
Notes/Hint4:
AWS Glue is the correct answer because this option requires the least amount of setup and maintenance since it is serverless, and it does not require management of the infrastructure. Refer to this link for supporting information. A, C, and D are all solutions that can solve the problem, but require more steps for configuration, and require higher operational overhead to run and maintain.
Question 5) Which service in the Kinesis family allows you to easily load streaming data into data stores and analytics tools?
A) Kinesis Firehose
B) Kinesis Streams
C) Kinesis Data Analytics
D) Kinesis Video Streams
ANSWER5:
A
Notes/Hint5:
Kinesis Firehose is perfect for streaming data into AWS and sending it directly to its final destination – places like S3, Redshift, Elasticsearch, and Splunk instances.
Question 6) A data scientist is working on optimizing a model during the training process by varying multiple parameters. The data scientist observes that, during multiple runs with identical parameters, the loss function converges to different, yet stable, values. What should the data scientist do to improve the training process?
A) Increase the learning rate. Keep the batch size the same.
B) Reduce the batch size. Decrease the learning rate.
C) Keep the batch size the same. Decrease the learning rate.
D) Do not change the learning rate. Increase the batch size.
Answer 6)
B
Notes 6)
It is most likely that the loss function is very curvy and has multiple local minima where the training is getting stuck. Decreasing the batch size would help the data scientist stochastically get out of the local minima saddles. Decreasing the learning rate would prevent overshooting the global loss function minimum. Refer to the paper at this link for an explanation. Reference 6) : Here
Question 7) Your organization has a standalone Javascript (Node.js) application that streams data into AWS using Kinesis Data Streams. You notice that they are using the Kinesis API (AWS SDK) over the Kinesis Producer Library (KPL). What might be the reasoning behind this?
A) The Kinesis API (AWS SDK) provides greater functionality over the Kinesis Producer Library.
B) The Kinesis API (AWS SDK) runs faster in Javascript applications over the Kinesis Producer Library.
C) The Kinesis Producer Library must be installed as a Java application to use with Kinesis Data Streams.
D) The Kinesis Producer Library cannot be integrated with a Javascript application because of its asynchronous architecture.
Answer 7)
C
Notes/Hint7:
The KPL must be installed as a Java application before it can be used with your Kinesis Data Streams. There are ways to process KPL serialized data within AWS Lambda, in Java, Node.js, and Python, but none of these answers mentions Lambda.
Question 8) A data scientist is evaluating different binary classification models. A false positive result is 5 times more expensive (from a business perspective) than a false negative result. The models should be evaluated based on the following criteria:
1) Must have a recall rate of at least 80%
2) Must have a false positive rate of 10% or less
3) Must minimize business costs
After creating each binary classification model, the data scientist generates the corresponding confusion matrix. Which confusion matrix represents the model that satisfies the requirements?
A) TN = 91, FP = 9 FN = 22, TP = 78
B) TN = 99, FP = 1 FN = 21, TP = 79
C) TN = 96, FP = 4 FN = 10, TP = 90
D) TN = 98, FP = 2 FN = 18, TP = 82
Answer 8):
D
Notes/Hint 8)
The following calculations are required, where TP = True Positive, FP = False Positive, FN = False Negative, and TN = True Negative:
Recall = TP / (TP + FN)
False Positive Rate (FPR) = FP / (FP + TN)
Cost = 5 * FP + FN
Model A: Recall = 78 / (78 + 22) = 0.78; FPR = 9 / (9 + 91) = 0.09; Cost = 5 * 9 + 22 = 67
Model B: Recall = 79 / (79 + 21) = 0.79; FPR = 1 / (1 + 99) = 0.01; Cost = 5 * 1 + 21 = 26
Model C: Recall = 90 / (90 + 10) = 0.90; FPR = 4 / (4 + 96) = 0.04; Cost = 5 * 4 + 10 = 30
Model D: Recall = 82 / (82 + 18) = 0.82; FPR = 2 / (2 + 98) = 0.02; Cost = 5 * 2 + 18 = 28
Options C and D have a recall greater than 80% and an FPR less than 10%, but D is the most cost effective. For supporting information, refer to this link. Reference 8: Here
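The same selection can be reproduced programmatically. The sketch below uses the four confusion matrices from the question; the function name `evaluate` and the dictionary layout are illustrative choices, not part of any AWS tooling.

```python
# Confusion matrices from the four answer options.
matrices = {
    "A": {"TN": 91, "FP": 9, "FN": 22, "TP": 78},
    "B": {"TN": 99, "FP": 1, "FN": 21, "TP": 79},
    "C": {"TN": 96, "FP": 4, "FN": 10, "TP": 90},
    "D": {"TN": 98, "FP": 2, "FN": 18, "TP": 82},
}


def evaluate(m, fp_cost=5, fn_cost=1):
    """Return (recall, false positive rate, business cost) for one matrix."""
    recall = m["TP"] / (m["TP"] + m["FN"])
    fpr = m["FP"] / (m["FP"] + m["TN"])
    cost = fp_cost * m["FP"] + fn_cost * m["FN"]
    return recall, fpr, cost


results = {name: evaluate(m) for name, m in matrices.items()}

# Keep only models with recall >= 80% and FPR <= 10%, then pick the cheapest.
eligible = {
    name: r for name, r in results.items() if r[0] >= 0.80 and r[1] <= 0.10
}
best = min(eligible, key=lambda name: eligible[name][2])
print(sorted(eligible), best)
```

Filtering on the hard constraints first and minimizing cost second mirrors the order of the three criteria in the question.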
Question 9) A data scientist uses logistic regression to build a fraud detection model. While the model accuracy is 99%, 90% of the fraud cases are not detected by the model. What action will definitely help the model detect more than 10% of fraud cases?
A) Using undersampling to balance the dataset
B) Decreasing the class probability threshold
C) Using regularization to reduce overfitting
D) Using oversampling to balance the dataset
Answer 9)
B
Notes 9)
Decreasing the class probability threshold makes the model more sensitive and, therefore, marks more cases as the positive class, which is fraud in this case. This will increase the likelihood of fraud detection. However, it comes at the price of lowering precision. This is covered in the Discussion section of the paper at this link Reference 9:Here
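The trade-off can be seen on a toy example. The scores and labels below are hypothetical values invented for illustration (they do not come from the question), and `recall_precision` is a made-up helper name.

```python
# Hypothetical predicted fraud probabilities and true labels (1 = fraud).
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]


def recall_precision(threshold):
    """Classify as fraud when score >= threshold; return (recall, precision)."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


recall_default, precision_default = recall_precision(0.50)
recall_lowered, precision_lowered = recall_precision(0.25)
print(recall_default, precision_default)
print(recall_lowered, precision_lowered)
```

On this toy data, lowering the threshold catches more of the fraud cases while flagging more legitimate ones, which is the recall-versus-precision trade-off the note describes.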
Question 10) A company is interested in building a fraud detection model. Currently, the data scientist does not have a sufficient amount of information due to the low number of fraud cases. Which method is MOST likely to detect the GREATEST number of valid fraud cases?
A) Oversampling using bootstrapping
B) Undersampling
C) Oversampling using SMOTE
D) Class weight adjustment
Answer 10)
C
Notes 10)
With datasets that are not fully populated, the Synthetic Minority Over-sampling Technique (SMOTE) adds new information by adding synthetic data points to the minority class. This technique would be the most effective in this scenario. Refer to Section 4.2 at this link for supporting information. Reference 10) : Here
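The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in plain Python. This is a simplified illustration under assumed parameters (`k`, `seed`, and the sample points are all hypothetical), not the reference algorithm or a library implementation.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself),
        # ranked by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature fraud samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
print(len(new_points))
```

Unlike bootstrapped oversampling, which only repeats existing rows, each synthetic point here is a new location between two real minority samples.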
Question 11) A machine learning engineer is preparing a data frame for a supervised learning task with the Amazon SageMaker Linear Learner algorithm. The ML engineer notices the target label classes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire data frame is less than 5%. What should the ML engineer do to minimize bias due to missing values?
A) Replace each missing value by the mean or median across non-missing values in same row.
B) Delete observations that contain missing values because these represent less than 5% of the data.
C) Replace each missing value by the mean or median across non-missing values in the same column.
D) For each feature, approximate the missing values using supervised learning based on other features.
Answer 11)
D
Notes 11)
Use supervised learning to predict missing values based on the values of other features. Different supervised learning approaches might have different performances, but any properly implemented supervised learning approach should provide the same or better approximation than mean or median approximation, as proposed in responses A and C. Supervised learning applied to the imputation of missing values is an active field of research. Refer to this link for an example. Reference 11): Here
Question 12) A company has collected customer comments on its products, rating them as safe or unsafe, using decision trees. The training dataset has the following features: id, date, full review, full review summary, and a binary safe/unsafe tag. During training, any data sample with missing features was dropped. In a few instances, the test set was found to be missing the full review text field. For this use case, which is the most effective course of action to address test data samples with missing features?
A) Drop the test samples with missing full review text fields, and then run through the test set.
B) Copy the summary text fields and use them to fill in the missing full review text fields, and then run through the test set.
C) Use an algorithm that handles missing data better than decision trees.
D) Generate synthetic data to fill in the fields that are missing data, and then run through the test set.
Answer 12)
B
Notes 12)
In this case, a full review summary usually contains the most descriptive phrases of the entire review and is a valid stand-in for the missing full review text field. For supporting information, refer to page 1627 at this link, and this link and this link.
Question 13) An insurance company needs to automate claim compliance reviews because human reviews are expensive and error-prone. The company has a large set of claims and a compliance label for each. Each claim consists of a few sentences in English, many of which contain complex related information. Management would like to use Amazon SageMaker built-in algorithms to design a machine learning supervised model that can be trained to read each claim and predict if the claim is compliant or not. Which approach should be used to extract features from the claims to be used as inputs for the downstream supervised task?
A) Derive a dictionary of tokens from claims in the entire dataset. Apply one-hot encoding to tokens found in each claim of the training set. Send the derived features space as inputs to an Amazon SageMaker builtin supervised learning algorithm.
B) Apply Amazon SageMaker BlazingText in Word2Vec mode to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
C) Apply Amazon SageMaker BlazingText in classification mode to labeled claims in the training set to derive features for the claims that correspond to the compliant and non-compliant labels, respectively.
D) Apply Amazon SageMaker Object2Vec to claims in the training set. Send the derived features space as inputs for the downstream supervised task.
Answer 13)
D
Notes 13)
Amazon SageMaker Object2Vec generalizes the Word2Vec embedding technique for words to more complex objects, such as sentences and paragraphs. Since the supervised learning task is at the level of whole claims, for which there are labels, and no labels are available at the word level, Object2Vec needs to be used instead of Word2Vec.
Question 14) You have been tasked with capturing two different types of streaming events. The first event type includes mission-critical data that needs to immediately be processed before operations can continue. The second event type includes data of less importance, but operations can continue without immediately processing. What is the most appropriate solution to record these different types of events?
A) Capture both events with the PutRecords API call.
B) Capture both event types using the Kinesis Producer Library (KPL).
C) Capture the mission critical events with the PutRecords API call and the second event type with the Kinesis Producer Library (KPL).
D) Capture the mission critical events with the Kinesis Producer Library (KPL) and the second event type with the PutRecords API call.
Answer 14)
C
Notes 14)
The question is about sending data to Kinesis synchronously vs. asynchronously. PutRecords is a synchronous send function, so it must be used for the first event type (critical events). The Kinesis Producer Library (KPL) implements an asynchronous send function, so it can be used for the second event type. In this scenario, the reason to use the PutRecords API call rather than the KPL for the critical events is that the KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable). Larger values of RecordMaxBufferedTime result in higher packing efficiencies and better performance. Applications that cannot tolerate this additional delay may need to use the AWS SDK directly. For more information about using the AWS SDK with Kinesis Data Streams, see Developing Producers Using the Amazon Kinesis Data Streams API with the AWS SDK for Java. For more information about RecordMaxBufferedTime and other user-configurable properties of the KPL, see Configuring the Kinesis Producer Library.
Question 15) You are collecting clickstream data from an e-commerce website to make near-real time product suggestions for users actively using the site. Which combination of tools can be used to achieve the quickest recommendations and meets all of the requirements?
A) Use Kinesis Data Streams to ingest clickstream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
B) Use Kinesis Data Firehose to ingest click stream data, then use Kinesis Data Analytics to run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions, then use Lambda to load these results into S3.
C) Use Kinesis Data Streams to ingest clickstream data, then use Lambda to process that data and write it to S3. Once the data is on S3, use Athena to query based on conditions that data and make real time recommendations to users.
D) Use the Kinesis Data Analytics to ingest the clickstream data directly and run real time SQL queries to gain actionable insights and trigger real-time recommendations with AWS Lambda functions based on conditions.
Answer 15)
A
Notes 15)
Kinesis Data Analytics gets its input streaming data from Kinesis Data Streams or Kinesis Data Firehose. You can use Kinesis Data Analytics to run real-time SQL queries on your data. Once certain conditions are met you can trigger Lambda functions to make real time product suggestions to users. It is not important that we store or persist the clickstream data.
Question 16) Which service built by AWS makes it easy to set up a retry mechanism, aggregate records to improve throughput, and automatically submits CloudWatch metrics?
A) Kinesis API (AWS SDK)
B) Kinesis Producer Library (KPL)
C) Kinesis Consumer Library
D) Kinesis Client Library (KCL)
Answer 16)
B
Notes 16)
Although the Kinesis API built into the AWS SDK can be used for all of this, the Kinesis Producer Library (KPL) makes it easy to integrate all of this into your applications.
Question 17) You have been tasked with capturing data from an online gaming platform to run analytics on and process through a machine learning pipeline. The data you are ingesting consists of each player's controller inputs, sent once per second (up to 10 players in a game) in JSON format. The data needs to be ingested through Kinesis Data Streams, and each JSON data blob is 100 KB in size. What is the minimum number of shards you can use to successfully ingest this data?
A) 10 shards
B) Greater than 500 shards, so you’ll need to request more shards from AWS
C) 1 shard
D) 100 shards
Answer 17)
C
Notes 17)
In this scenario, there will be a maximum of 10 records per second with a max payload size of 1000 KB (10 records x 100 KB = 1000 KB) written to the shard. A single shard can ingest up to 1 MB of data per second, which is enough to ingest the 1000 KB from the streaming game play. Therefore, 1 shard is enough to handle the streaming data.
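The arithmetic generalizes to a small sizing helper. This is a back-of-the-envelope sketch: `min_shards` is a hypothetical name, 1 MB is treated as 1000 KB as in the note above, and the 1,000 records per second per-shard write limit is included alongside the 1 MB per second limit.

```python
import math


def min_shards(records_per_second, record_kb,
               shard_kb_per_s=1000, shard_records_per_s=1000):
    """Smallest shard count that satisfies both per-shard write limits."""
    # Shards needed to stay under the data volume limit.
    by_bytes = math.ceil(records_per_second * record_kb / shard_kb_per_s)
    # Shards needed to stay under the record count limit.
    by_count = math.ceil(records_per_second / shard_records_per_s)
    return max(1, by_bytes, by_count)


# 10 players, one 100 KB record per player per second.
print(min_shards(records_per_second=10, record_kb=100))
```

With 10 records of 100 KB each per second the stream sits exactly at the assumed per-shard data limit, so any increase in players or payload size would require a second shard.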
Question 18) Which service in the Kinesis family allows you to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time?
A) Kinesis Streams
B) Kinesis Firehose
C) Kinesis Video Streams
D) Kinesis Data Analytics
Answer 18)
D
Notes 18)
Kinesis Data Analytics allows you to run real-time SQL queries on your data to gain insights and respond to events in real time.
Question 19) You are a ML specialist needing to collect data from Twitter tweets. Your goal is to collect tweets that include only the name of your company and the tweet body, and store it off into a data store in AWS. What set of tools can you use to stream, transform, and load the data into AWS with the LEAST amount of effort?
A) Set up a Kinesis Data Firehose for data ingestion and immediately write that data to S3. Next, set up a Lambda function to trigger when data lands in S3 to transform it and finally write it to DynamoDB.
B) Set up a Kinesis Data Stream for data ingestion, and set up EC2 instances as data consumers to poll and transform the data from the stream. Once the data is transformed, make an API call to write the data to DynamoDB.
C) Set up Kinesis Data Streams for data ingestion. Next, set up Kinesis Data Firehose to load that data into Redshift. Next, set up a Lambda function to query data using Redshift Spectrum and store the results onto DynamoDB.
D) Create a Kinesis Data Stream to ingest the data. Next, set up a Kinesis Data Firehose and use Lambda to transform the data from the Kinesis Data Stream, then use Lambda to write the data to DynamoDB. Finally, use S3 as the data destination for Kinesis Data Firehose.
Answer 19)
A
Notes 19)
All of these could be used to stream, transform, and load the data into an AWS data store. The setup that requires the LEAST amount of effort and moving parts involves setting up a Kinesis Data Firehose to stream the data into S3, have it transformed by Lambda with an S3 trigger, and then written to DynamoDB.
Question22: Which of the following is an appropriate use case for unsupervised learning?
A) Partitioning an image of a street scene into multiple segments
B) Finding an optimal path out of a maze
C) Identifying clusters of housing sales based on related data points
D) Analyzing sentiment of social media posts
Answer22:
C
Notes 22:
Identifying clusters of housing sales based on related data points
Question24: A Djamgatech retail company wants to deploy a machine learning model to predict the demand for a product using sales data from the past 5 years. What is the MOST efficient solution that the company should implement first?
Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A
Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.
Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.
The App provides hundreds of quizzes and practice exams on:
– Machine Learning Operations on AWS
– Modelling
– Data Engineering
– Computer Vision
– Exploratory Data Analysis
– ML implementation & Operations
– Machine Learning Basics Questions and Answers
– Machine Learning Advanced Questions and Answers
– Scorecard
– Countdown timer
– Machine Learning Cheat Sheets
– Machine Learning Interview Questions and Answers
– Machine Learning Latest News
The App covers Machine Learning Basics and Advanced topics including: NLP, Computer Vision, Python, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.
Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.
Answer1: Use the Build Your Own Container (BYOC) Amazon SageMaker option. Create a new Docker container with the existing code. Register the container in Amazon Elastic Container Registry. Finally, run the training and inference jobs using this container.
Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?
Answer2: Amazon SageMaker notebook instances
Amazon SageMaker enables developers and data scientists to build, train, tune, and deploy machine learning (ML) models at scale. You can deploy trained ML models for real-time or batch predictions on unseen data, a process known as inference. However, in most cases, the raw input data must be preprocessed and can’t be used directly for making predictions. This is because most ML models expect the data in a predefined format, so the raw data needs to be first cleaned and formatted in order for the ML model to process the data. You can use the Amazon SageMaker built-in Scikit-learn library for preprocessing input data and then use the Amazon SageMaker built-in Linear Learner algorithm for predictions.
Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?
Top 10 Google Professional Machine Learning Engineer Sample Questions
Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?
A. Use K-fold cross validation to understand how the model performs on different test datasets.
B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.
C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.
D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.
Answer 1)
B
Notes 1)
B is correct because it identifies the pixels of the input image that lead to the classification of the image itself.
Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?
A. Train the model for a few iterations, and check for NaN values.
B. Train the model for a few iterations, and verify that the loss is constant.
C. Train a simple linear model, and determine if the DNN model outperforms it.
D. Train the model with no regularization, and verify that the loss function is close to zero.
Answer 2)
D
Notes 2)
D is correct because the test can check that the model has enough parameters to memorize the task.
Question 3: Your team is using a TensorFlow Inception-v3 CNN model pretrained on ImageNet for an image classification prediction challenge on 10,000 images. You will use AI Platform to perform the model training. What TensorFlow distribution strategy and AI Platform training job configuration should you use to train the model and optimize for wall-clock time?
A. Default Strategy; Custom tier with a single master node and four v100 GPUs.
B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.
C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.
D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.
Answer 3)
D
Notes 3)
D is correct because this is the only strategy that can perform distributed training; albeit there is only a single copy of the variables on the CPU host.
Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?
A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version
B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version
C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment
D. Create a new AI Platform model version -> Deploy model in test environment -> Validate model
Answer 4)
A
Notes 4)
A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.
Question 5: You work for a maintenance company and have built and trained a deep learning model that identifies defects based on thermal images of underground electric cables. Your dataset contains 10,000 images, 100 of which contain visible defects. How should you evaluate the performance of the model on a test dataset?
A. Calculate the Area Under the Curve (AUC) value.
B. Calculate the number of true positive results predicted by the model.
C. Calculate the fraction of images predicted by the model to have a visible defect.
D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.
Answer 5)
A
Notes 5)
A is correct because it is scale-invariant. AUC measures how well predictions are ranked, rather than their absolute values. AUC is also classification-threshold invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.
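The ranking interpretation of AUC can be made concrete: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it that way in plain Python; the labels and scores are hypothetical, and this pairwise approach is chosen for clarity rather than efficiency.

```python
def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly
    (ties count as half a correct pair)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical test labels (1 = defect) and model scores.
labels = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.35, 0.6, 0.7, 0.9]

original = auc(labels, scores)
# Rescaling the scores keeps their order, so the AUC is unchanged.
rescaled = auc(labels, [s / 10 for s in scores])
print(original, rescaled)
```

Because only the ordering of scores matters, no classification threshold appears anywhere in the computation, which is why AUC suits a dataset where only 1% of images contain defects.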
Question 6: You work for a manufacturing company that owns a high-value machine which has several machine settings and multiple sensors. A history of the machine’s hourly sensor readings and known failure event data are stored in BigQuery. You need to predict if the machine will fail within the next 3 days in order to schedule maintenance before the machine fails. Which data preparation and model training steps should you take?
A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML
B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False
D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
Answer 6)
D
Notes 6)
D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
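To illustrate what a rolling average feature does to hourly sensor data, here is a plain Python sketch. It is not DataPrep or BigQuery ML code; the function name `rolling_mean`, the window size, and the readings are all hypothetical.

```python
def rolling_mean(values, window):
    """Trailing mean over the last `window` readings (shorter at the start)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# Hypothetical hourly sensor readings with one spike.
readings = [10, 12, 11, 30, 13, 12]
smoothed = rolling_mean(readings, window=3)
print(smoothed)
```

The smoothed series dampens single-reading spikes while still reflecting sustained changes, which a daily min or max feature would either miss or exaggerate.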
Question 7: You are an ML engineer at a media company. You need to build an ML model to analyze video content frame-by-frame, identify objects, and alert users if there is inappropriate content. Which Google Cloud products should you use to build this project?
A. Pub/Sub, Cloud Function, Cloud Vision API
B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging
C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging
D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging
Answer 7)
C
Notes 7)
C is correct as Video Intelligence API can find inappropriate components and other components satisfy the requirements of real-time processing and notification.
Question 8: You work for a large retailer. You want to use ML to forecast future sales leveraging 10 years of historical sales data. The historical data is stored in Cloud Storage in Avro format. You want to rapidly experiment with all the available data. How should you build and train your model for the sales forecast?
A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.
B. Convert the data into CSV format and create a regression model on AutoML Tables.
C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.
D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.
Answer 8)
A
Notes 8)
A is correct because BigQuery ML is designed for fast and rapid experimentation and it is possible to use federated queries to read data directly from Cloud Storage. Moreover, ARIMA is considered one of the best in class for time series forecasting.
Question 9) You need to build an object detection model for a small startup company to identify if and where the company’s logo appears in an image. You were given a large repository of images, some with logos and some without. These images are not yet labelled. You need to label these pictures, and then train and deploy the model. What should you do?
A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.
B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.
C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.
D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.
Answer 9)
A
Notes 9)
A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.
Question 10: You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer’s stated intention for contacting customer service. About 70% of customer inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer and more complicated requests. Which intents should you automate first?
A. Automate a blend of the shortest and longest intents to be representative of all intents.
B. Automate the more complicated requests first because those require more of the agents’ time.
C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.
D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.
Azure and AWS are second class citizens in this area.
Sure, AWS has the largest share of the market.
Sure, Azure is the easiest turnkey option and super user-friendly.
But, the king of machine learning in the cloud is GCP.
GCP = Google Cloud Platform
Google has the largest data science team in the world, not to mention they have Hinton.
Let’s forget for a minute that they created TensorFlow and gave it away.
Let’s just talk about building a real-world model with data that doesn’t fit into an Excel spreadsheet.
The vast majority of applied machine learning is supervised and that means we need data.
Not just normal data: we need very clean, highly structured data.
Where’s the easiest place in the world to upload and model a petabyte of structured data? BigQuery, of course.
Why BigQuery? I don’t have to do anything but upload my data. No spinning up Redshift clusters or whatever I have to do in Azure; I just upload and massage the data with my familiar SQL. If I do have to wrangle my data, it won’t take me six months to update 5 rows here; it usually takes minutes.
Then, you’ll need a front end. Cloud Datalab is a Jupyter notebook, which is good because I neither want nor need anything else.
Then, with a single line of code, I connect my Datalab (Jupyter) notebook to my data in BigQuery and build away.
I’ve worked in all three, and the only thing I care about is getting my job done the fastest. Right now that means I build my models in GCP.
If you’re new to machine learning, don’t start in GCP or with any cloud vendor for that matter. Start learning Python from the comfort of your laptop.
Here, I want to share what I consider the best research paper on machine learning classification methods, titled ‘Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?’, published in the ‘Journal of Machine Learning Research’.
The paper evaluates 179 classification techniques on 121 data sets. Here is a short summary of it:
Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
https://jmlr.org/papers/v15/delgado14a.html
The paper evaluated 179 classifiers arising from 17 ML families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R (with and without the caret package), C and Matlab, including all the relevant classifiers available at the time of publication.
The experiments used 121 data sets in total, which represent the whole UCI database (excluding the large-scale problems) plus some of the authors’ own real problems, in order to reach significant conclusions about classifier behaviour that do not depend on the data set collection.
The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy, exceeding 90% in 84.3% of the data sets. However, the difference from the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy, is not statistically significant. A few models are clearly better than the rest: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package).
The random forest is clearly the best family of classifiers (3 of the 5 best classifiers are RF), followed by SVM (4 classifiers in the top 10), neural networks and boosting ensembles (5 and 3 members in the top 20, respectively).
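To make the “percentage of the maximum accuracy” figure concrete, here is a small sketch of one way to read it: for each data set, divide a classifier’s accuracy by the best accuracy any classifier reached on that data set, then average over data sets. The accuracy numbers and classifier names below are made up for illustration and are not taken from the paper.

```python
from statistics import mean

# Hypothetical accuracies (%), one entry per data set and classifier.
# These numbers are invented for illustration, not results from the paper.
accuracies = {
    "dataset_a": {"rf": 90.0, "svm": 88.0, "knn": 80.0},
    "dataset_b": {"rf": 70.0, "svm": 75.0, "knn": 60.0},
    "dataset_c": {"rf": 95.0, "svm": 95.0, "knn": 76.0},
}


def percent_of_max_accuracy(results):
    """Average, over data sets, of accuracy divided by the best accuracy there."""
    names = next(iter(results.values())).keys()
    scores = {}
    for name in names:
        ratios = []
        for per_dataset in results.values():
            best = max(per_dataset.values())
            ratios.append(100.0 * per_dataset[name] / best)
        scores[name] = mean(ratios)
    return scores


scores = percent_of_max_accuracy(accuracies)
for name, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.1f}% of the maximum accuracy")
```

A classifier that is never far from the best on any data set scores close to 100%, even if it is rarely the single winner.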
1. Is the classification going to be supervised or unsupervised? Several well-defined techniques such as SVMs (Support Vector Machines), trained neural nets, etc. are applicable for supervised classification. For unsupervised classification, GMMs (Gaussian Mixture Models) or HMMs (Hidden Markov Models) with Bayesian techniques could be used. (Several other techniques could of course be used as well.)
2. How much training data do you have, if the problem is supervised? A small training set may yield discouraging classification accuracy even if the chosen classifier is the most suitable one for the problem. In such a case, try to obtain more samples. There’s also generally a relationship (for practical purposes at least) between the feature dimensionality and the number of samples a given technique needs. For example, with SVM, the linear kernel tends to yield better results than RBF or other kernels when the number of training samples is less than, equal to, or only slightly more than the number of feature dimensions.
3. If the feature vector dimensionality is small enough (1, 2 or 3 dimensions), it makes sense to plot the data and visually inspect whether techniques like clustering could be more useful. With a very high number of feature dimensions, methods like clustering are generally not advisable (refer to “The Curse of Dimensionality”).
4. Are you doing classification in real time? Some techniques, e.g. “Template Match” in image classification, may lead to a higher number of errors but are generally faster than most other techniques, provided the number of templates to be evaluated is not excessively high.
5. Depending upon the problem domain, you may be able to choose an underlying model that exploits temporal or spatial correlations inherent in the data. For example, HMMs use the temporal continuity of speech samples to improve classification results in speech recognition problems.
Another point, slightly off topic perhaps: classification performance depends as much on choosing the correct feature vectors and pre-processing them as on the classifier itself. It’s generally a good idea to reserve some initial part of the project for trying out various classifiers on the same data set. It may at least help you reject the ones that are highly inaccurate.
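The idea of trying several classifiers on the same data set can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is synthetic (two well-separated 2-D blobs), and the two classifiers (nearest centroid and 1-nearest-neighbour) are simple stand-ins chosen because they need no external library.

```python
import random


def make_blobs(rng, n_per_class):
    """Synthetic 2-D data: class 0 around (0, 0), class 1 around (5, 5)."""
    data = []
    for label, center in enumerate([(0.0, 0.0), (5.0, 5.0)]):
        for _ in range(n_per_class):
            point = (rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0))
            data.append((point, label))
    rng.shuffle(data)
    return data


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_nearest_centroid(train):
    """Predict the class whose mean point is closest."""
    sums = {}
    for point, label in train:
        total, count = sums.get(label, ((0.0, 0.0), 0))
        sums[label] = ((total[0] + point[0], total[1] + point[1]), count + 1)
    centroids = {lbl: (t[0] / c, t[1] / c) for lbl, (t, c) in sums.items()}
    return lambda p: min(centroids, key=lambda lbl: sq_dist(p, centroids[lbl]))


def fit_one_nn(train):
    """Predict the label of the single closest training point."""
    return lambda p: min(train, key=lambda item: sq_dist(p, item[0]))[1]


def accuracy(predict, test):
    return sum(predict(p) == y for p, y in test) / len(test)


rng = random.Random(0)  # fixed seed so every classifier sees the same split
data = make_blobs(rng, 100)
train, test = data[:150], data[150:]

results = {
    "nearest_centroid": accuracy(fit_nearest_centroid(train), test),
    "one_nn": accuracy(fit_one_nn(train), test),
}
for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.2f}")
```

The important part is that every candidate is scored on the same held-out split, so the comparison is fair and clearly weak candidates can be dropped early.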
At a high level, these skills are a combination of software and data engineering.
The people best suited to this job are data engineers and/or machine learning engineers.
That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to acquire:
Well-structured code: it doesn’t need to be perfect, but it should at least be understandable and updatable by other team members. Avoid spaghetti code[1] like the plague.
Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
Model versioning: add a hash key to your different models. You will thank me later.
Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
Monitor performance: track the execution time and statistical scores of your models.
Data and model management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read: often).
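A minimal sketch of the logging, versioning and metadata points above, using only the Python standard library. The function names (model_hash, train_and_describe), the stand-in “model” and the metadata fields are assumptions made for illustration; real training code would replace them.

```python
import hashlib
import json
import logging
import pickle
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("training")


def model_hash(model_bytes):
    """Short content hash used as a version key for a serialized model."""
    return hashlib.sha256(model_bytes).hexdigest()[:12]


def train_and_describe(hyperparameters, features):
    start = time.perf_counter()
    # Stand-in for real training: the "model" is just a dict of weights.
    model = {"weights": [0.1 * i for i in range(len(features))]}
    model_bytes = pickle.dumps(model)
    metadata = {
        "model_version": model_hash(model_bytes),
        "hyperparameters": hyperparameters,
        "features": features,
        "training_seconds": round(time.perf_counter() - start, 4),
    }
    # Use the logging module instead of print statements.
    logger.info("trained model %s", metadata["model_version"])
    return model_bytes, metadata


model_bytes, metadata = train_and_describe({"learning_rate": 0.1},
                                           ["age", "income"])
# Store this JSON next to the model artifact (for example in shared storage).
metadata_json = json.dumps(metadata, indent=2, sort_keys=True)
```

Because the version key is derived from the serialized model itself, two artifacts with the same key are byte-for-byte identical, which makes it easy to tell which model produced a given prediction.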
Some of the mistakes that can occur while building a machine learning model (that I can think of) are listed here:
Not understanding the structure of the dataset
Not taking proper care during feature selection
Leaving out categorical features and considering only numerical variables
Falling into the dummy variable trap
Selecting an inefficient machine learning algorithm
Not trying out various ML algorithms suited to the structure of the data
Improper tuning of model parameters
Most importantly: building a naive, imperfect model. Suppose we have a classification problem in which 99% of the examples fall into class1 and the remaining 1% into class2. The model may learn a mapping function that predicts class1 for every input. One might then say the model has 99% accuracy, but in reality it never identifies the 1% of class2 cases. This must be taken into consideration.
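The last point can be checked with a few lines of Python. This sketch builds a hypothetical 99-to-1 data set and a “model” that always predicts class1, then compares accuracy with the recall on class2.

```python
# 99 examples of class1 and 1 example of class2 (hypothetical data).
y_true = ["class1"] * 99 + ["class2"] * 1
# A degenerate model that predicts class1 for every input.
y_pred = ["class1"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on class2: of the true class2 examples, how many were found?
true_positives = sum(t == "class2" and p == "class2"
                     for t, p in zip(y_true, y_pred))
false_negatives = sum(t == "class2" and p != "class2"
                      for t, p in zip(y_true, y_pred))
recall_class2 = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.2f}, recall on class2 = {recall_class2:.2f}")
```

Accuracy alone hides the failure, so on imbalanced data it helps to also report per-class recall, precision, or a ranking metric such as AUC.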
Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.
That’s just the surface-level comparison though. The image above gives an overview of how the two differ.
One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output is a set of patterns or trends, which doesn’t require coming up with a statement or fact to test.
However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis.
The data science life cycle is not as well defined as the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent data science tasks using a variety of tools, techniques, programming, etc.
Thus, the data science life-cycle can include the following steps:
Business requirement understanding.
Data collection.
Data cleaning.
Data analysis.
Modeling.
Performance evaluation.
Communicating with stakeholders.
Deployment.
Real-world testing.
Business buy-in.
Support and maintenance.
Looks neat, but here is a scheme that visualizes how it happens in reality:
Agile development processes, especially continuous delivery, lend themselves well to the data science project life-cycle. Early comparison helps the data science team change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build them.
Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A
Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.
Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.
The App provides hundreds of quizzes and practice exams about:
– Machine Learning Operations on AWS
– Modelling
– Data Engineering
– Computer Vision
– Exploratory Data Analysis
– ML implementation & Operations
– Machine Learning Basics Questions and Answers
– Machine Learning Advanced Questions and Answers
– Scorecard
– Countdown timer
– Machine Learning Cheat Sheets
– Machine Learning Interview Questions and Answers
– Machine Learning Latest News
The App covers Machine Learning Basics and Advanced topics including: NLP, Computer Vision, Python, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.
Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.
Answer1: Use the Build Your Own Container (BYOC) Amazon SageMaker option. Create a new Docker container with the existing code. Register the container in Amazon Elastic Container Registry. Finally, run the training and inference jobs using this container.
Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?
Answer2: Amazon SageMaker Notebook instances
Amazon SageMaker enables developers and data scientists to build, train, tune, and deploy machine learning (ML) models at scale. You can deploy trained ML models for real-time or batch predictions on unseen data, a process known as inference. However, in most cases, the raw input data must be preprocessed and can’t be used directly for making predictions. This is because most ML models expect the data in a predefined format, so the raw data needs to be first cleaned and formatted in order for the ML model to process the data. You can use the Amazon SageMaker built-in Scikit-learn library for preprocessing input data and then use the Amazon SageMaker built-in Linear Learner algorithm for predictions.
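As a rough illustration of what “cleaning and formatting” raw input can mean, the sketch below parses string fields, fills missing values with the column mean and standardizes each column. It uses plain Python rather than the SageMaker Scikit-learn container, and the column names and sample rows are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical raw records as they might arrive: strings, with a missing value.
raw_rows = [
    {"age": "34", "income": "52000"},
    {"age": "", "income": "61000"},
    {"age": "45", "income": "48000"},
]


def to_float_or_none(text):
    text = text.strip()
    return float(text) if text else None


def preprocess(rows, columns):
    """Parse, impute missing values with the column mean, then standardize."""
    parsed = [[to_float_or_none(row[c]) for c in columns] for row in rows]
    scaled_columns = []
    for j in range(len(columns)):
        present = [r[j] for r in parsed if r[j] is not None]
        column_mean = mean(present)
        filled = [r[j] if r[j] is not None else column_mean for r in parsed]
        spread = pstdev(filled) or 1.0  # avoid dividing by zero
        scaled_columns.append([(v - column_mean) / spread for v in filled])
    # Back to one feature vector per input row.
    return [list(row) for row in zip(*scaled_columns)]


features = preprocess(raw_rows, ["age", "income"])
```

The same statistics (means and spreads) computed on the training data would normally be saved and reused at inference time, so that the model sees inputs in the format it was trained on.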
Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?
Top 10 Google Professional Machine Learning Engineer Sample Questions
Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?
A. Use K-fold cross validation to understand how the model performs on different test datasets.
B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.
C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.
D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.
Answer 1)
B
Notes 1)
B is correct because Integrated Gradients identifies the pixels of the input image that lead to the classification of the image itself.
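To show the idea behind Integrated Gradients on something small, here is a sketch for a three-feature logistic model instead of an image classifier. The weights, bias, input and all-zero baseline are hypothetical; the path integral is approximated with a midpoint Riemann sum.

```python
import math

# Hypothetical model parameters, standing in for a trained classifier.
WEIGHTS = [2.0, -1.0, 0.5]
BIAS = -0.5


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x):
    return sigmoid(sum(w * v for w, v in zip(WEIGHTS, x)) + BIAS)


def gradient(x):
    """Analytic gradient of the model output with respect to each input."""
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]


def integrated_gradients(x, baseline, steps=200):
    """Average the gradient along the straight path from baseline to x,
    then scale by (x - baseline) for each feature."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of each sub-interval
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            totals[i] += g
    return [(v - b) * t / steps for v, b, t in zip(x, baseline, totals)]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)
print(attributions)
```

A useful sanity check is that the attributions sum to approximately model(x) minus model(baseline). For an image model, each pixel plays the role of one feature and the baseline is often a black image.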
Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?
A. Train the model for a few iterations, and check for NaN values.
B. Train the model for a few iterations, and verify that the loss is constant.
C. Train a simple linear model, and determine if the DNN model outperforms it.
D. Train the model with no regularization, and verify that the loss function is close to zero.
Answer 2)
D
Notes 2)
D is correct because the test can check that the model has enough parameters to memorize the task.
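The memorization test can be sketched without a deep learning framework. In this illustration a small polynomial model stands in for the DNN: it is trained by plain gradient descent with no regularization on three hypothetical points, and the final loss is compared with that of a model that has too few parameters.

```python
def mse_after_training(features, targets, steps=2000, learning_rate=0.1):
    """Gradient descent on mean squared error, with no regularization."""
    n, d = len(features), len(features[0])
    weights = [0.0] * d

    def errors():
        return [sum(w * f for w, f in zip(weights, row)) - y
                for row, y in zip(features, targets)]

    for _ in range(steps):
        residuals = errors()
        for j in range(d):
            grad = 2.0 / n * sum(r * row[j]
                                 for r, row in zip(residuals, features))
            weights[j] -= learning_rate * grad
    return sum(r * r for r in errors()) / n


# Hypothetical training set: three points with arbitrary targets.
inputs = [-1.0, 0.0, 1.0]
targets = [1.0, -2.0, 0.5]

# Three parameters (bias, x, x^2): enough capacity to memorize three points.
large_model = [[1.0, x, x * x] for x in inputs]
# One parameter (bias only): not enough capacity.
small_model = [[1.0] for _ in inputs]

loss_large = mse_after_training(large_model, targets)
loss_small = mse_after_training(small_model, targets)
print(f"large model loss = {loss_large:.6f}, small model loss = {loss_small:.6f}")
```

A generic release test could assert that the unregularized training loss falls below a small threshold; a loss that stays high suggests the model cannot even memorize the task.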
Question 3: Your team is using a TensorFlow Inception-v3 CNN model pretrained on ImageNet for an image classification prediction challenge on 10,000 images. You will use AI Platform to perform the model training. What TensorFlow distribution strategy and AI Platform training job configuration should you use to train the model and optimize for wall-clock time?
A. Default Strategy; Custom tier with a single master node and four v100 GPUs.
B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.
C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.
D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.
Answer 3)
D
Notes 3)
D is correct because, among the options listed, this is the only strategy that can perform distributed training, albeit with only a single copy of the variables on the CPU host.
Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?
A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version
B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version
C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment
D. Create a new AI Platform model version -> Deploy model in test environment -> Validate model
Answer 4)
A
Notes 4)
A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.
Question 5: You work for a maintenance company and have built and trained a deep learning model that identifies defects based on thermal images of underground electric cables. Your dataset contains 10,000 images, 100 of which contain visible defects. How should you evaluate the performance of the model on a test dataset?
A. Calculate the Area Under the Curve (AUC) value.
B. Calculate the number of true positive results predicted by the model.
C. Calculate the fraction of images predicted by the model to have a visible defect.
D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.
Answer 5)
A
Notes 5)
A is correct because AUC is scale-invariant. AUC measures how well predictions are ranked, rather than their absolute values. AUC is also classification-threshold invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.
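A small sketch of why AUC depends only on ranking: it can be computed as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The labels and scores below are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical imbalanced test set: 8 negatives, 2 positives.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.15, 0.30, 0.02, 0.60, 0.08, 0.55, 0.90]

auc = roc_auc(labels, scores)
# Rescaling the scores keeps the ranking, so the AUC does not change.
auc_scaled = roc_auc(labels, [s / 10 for s in scores])
print(auc, auc_scaled)
```

Only one negative (score 0.60) outranks one positive (score 0.55), so 15 of the 16 pairs are ordered correctly. No classification threshold appears anywhere in the computation.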
Question 6: You work for a manufacturing company that owns a high-value machine which has several machine settings and multiple sensors. A history of the machine’s hourly sensor readings and known failure event data are stored in BigQuery. You need to predict if the machine will fail within the next 3 days in order to schedule maintenance before the machine fails. Which data preparation and model training steps should you take?
A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML
B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False
D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
Answer 6)
D
Notes 6)
D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
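The two ingredients of this answer can be illustrated in plain Python. The rolling mean smooths noisy hourly readings, and the class weights are set inversely proportional to class frequency. The weight formula used here is a common “balanced” convention and is an assumption; it is not claimed to be the exact computation BQML performs when AUTO_CLASS_WEIGHTS is set to True. The readings and labels are hypothetical.

```python
from collections import Counter, deque


def rolling_mean(values, window):
    """Mean of the current value and up to window - 1 previous values."""
    recent = deque(maxlen=window)
    means = []
    for value in values:
        recent.append(value)
        means.append(sum(recent) / len(recent))
    return means


def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency (assumed formula)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Hypothetical sensor readings with one spike, and rare failure labels.
readings = [10, 12, 11, 30, 13]
smoothed = rolling_mean(readings, window=3)

failure_labels = [0] * 9 + [1]
weights = balanced_class_weights(failure_labels)
print(smoothed, weights)
```

With these weights, the single failure example counts as much in the loss as all nine non-failure examples combined, which keeps the model from ignoring the rare class.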
Question 7: You are an ML engineer at a media company. You need to build an ML model to analyze video content frame-by-frame, identify objects, and alert users if there is inappropriate content. Which Google Cloud products should you use to build this project?
A. Pub/Sub, Cloud Function, Cloud Vision API
B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging
C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging
D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging
Answer 7)
C
Notes 7)
C is correct because the Video Intelligence API can detect inappropriate content, and the other components satisfy the requirements for real-time processing and notification.
Question 8: You work for a large retailer. You want to use ML to forecast future sales leveraging 10 years of historical sales data. The historical data is stored in Cloud Storage in Avro format. You want to rapidly experiment with all the available data. How should you build and train your model for the sales forecast?
A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.
B. Convert the data into CSV format and create a regression model on AutoML Tables.
C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.
D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.
Answer 8)
A
Notes 8)
A is correct because BigQuery ML is designed for rapid experimentation, and federated queries can read the data directly from Cloud Storage. Moreover, ARIMA is considered one of the best-in-class approaches for time series forecasting.
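For orientation, the sketch below builds the kind of SQL statement that trains a time series model in BigQuery ML. It only assembles a string and does not call any Google Cloud API. The project, dataset, table and column names are hypothetical. It also assumes the model type is spelled ARIMA_PLUS, which is the name used in current BigQuery ML documentation as far as I know, whereas the question refers to it simply as ARIMA.

```python
def arima_training_sql(model_name, source_table, timestamp_col, value_col):
    """Build a BigQuery ML CREATE MODEL statement for time series forecasting."""
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        "OPTIONS(\n"
        "  model_type = 'ARIMA_PLUS',\n"  # assumed current name of the ARIMA type
        f"  time_series_timestamp_col = '{timestamp_col}',\n"
        f"  time_series_data_col = '{value_col}'\n"
        ") AS\n"
        f"SELECT {timestamp_col}, {value_col}\n"
        f"FROM `{source_table}`"
    )


# All identifiers below are made up for illustration.
sql = arima_training_sql(
    model_name="my_project.sales.forecast_model",
    source_table="my_project.sales.history",
    timestamp_col="sale_date",
    value_col="total_sales",
)
print(sql)
```

Because training is a single SQL statement over data already reachable from BigQuery, trying a different date range or target column only means editing the SELECT clause.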
Question 9: You need to build an object detection model for a small startup company to identify if and where the company’s logo appears in an image. You were given a large repository of images, some with logos and some without. These images are not yet labelled. You need to label these pictures, and then train and deploy the model. What should you do?
A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.
B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.
C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.
D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.
Answer 9)
A
Notes 9)
A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.
Question 10: You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer’s stated intention for contacting customer service. About 70% of customer inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer and more complicated requests. Which intents should you automate first?
A. Automate a blend of the shortest and longest intents to be representative of all intents.
B. Automate the more complicated requests first because those require more of the agents’ time.
C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.
D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.
Azure and AWS are second class citizens in this area.
Sure, AWS has 70% of the market.
Sure, Azure is the easiest turn key and super user friendly.
But, the king of machine learning in the cloud is GCP.
GCP = Google Cloud Platform
Google has the largest data science team in the world, not mention they have Hinton.
Let’s forgot for a minute they created TensorFlow and give it away.
Let’s just talk about building a real world model with data that doesn’t fit into a excel spreadsheet.
The vast majority of applied machine learning is supervised and that means we need data.
Not just normal data, we need very clean highly structured data.
Where’s the easiest place in the world to upload and model a Petabyte of structured data? BigQuery of course.
Why BigQuery? I don’t have to do anything but upload my data. No spinning up RedShit clusters or whatever I have to do in Azure, just upload and massage data with my familiar SQL. If I do have to wrangle my data it won’t take my six months to update 5 rows here, minutes usually.
Then, you’ll need a front end. Cloud datalab is a Jupyter notebook, which is good because I don’t want nor do I need anything else.
Then, with a single line of code I connect by datalab (Jupyter) notebook to my data in BigQuery and build away.
I’ve worked in all three and the only thing I care about is getting to my job the fastest and right now that means I build my models in GCP.
If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.
Here, I want to share the best research paper on Machine Learning classification methods, titled ‘Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?’, published in the ‘Journal of Machine Learning Research’.
This paper nicely explained 179 classification techniques and applied them on 121 data sets thus sharing small summary of the paper:
Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
https://jmlr.org/papers/v15/delgado14a.html
The paper evaluated 179 classifiers from 17 ML families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R (with and without the caret package), C and Matlab, including all the relevant classifiers available at the time of publication (2014).
The experiments used 121 data sets in total, which represent the whole UCI database (excluding the large-scale problems) plus the authors’ own real-world problems, in order to reach significant conclusions about classifier behaviour that do not depend on the data set collection.
The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy and exceeds 90% on 84.3% of the data sets. However, the difference from the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy, is not statistically significant. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package).
The random forest is clearly the best family of classifiers (3 of the 5 best classifiers are RF), followed by SVM (4 classifiers in the top 10), neural networks and boosting ensembles (5 and 3 members in the top 20, respectively).
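To make the "percentage of the maximum accuracy" figure more concrete, here is a minimal sketch of how such a score can be computed. It assumes the score is the mean, over data sets, of a classifier's accuracy divided by the best accuracy any classifier reached on that data set. The accuracies and the names `dataset_a`, `rf`, `svm` and `knn` are invented for illustration and are not figures from the paper.

```python
from statistics import mean

# Hypothetical accuracies (as fractions) per data set; NOT results from the paper.
accuracies = {
    "dataset_a": {"rf": 0.90, "svm": 0.88, "knn": 0.80},
    "dataset_b": {"rf": 0.70, "svm": 0.75, "knn": 0.60},
    "dataset_c": {"rf": 0.95, "svm": 0.95, "knn": 0.76},
}

def percent_of_max_accuracy(accuracies):
    """Mean over data sets of 100 * accuracy / best accuracy on that data set."""
    classifiers = list(next(iter(accuracies.values())))
    scores = {}
    for clf in classifiers:
        ratios = [
            100.0 * per_dataset[clf] / max(per_dataset.values())
            for per_dataset in accuracies.values()
        ]
        scores[clf] = mean(ratios)
    return scores

pma = percent_of_max_accuracy(accuracies)
for clf, score in sorted(pma.items(), key=lambda item: -item[1]):
    print(f"{clf}: {score:.1f}% of the maximum accuracy")
```

A classifier that is never the single best on any data set can still rank first on this score if it stays close to the best everywhere, which is the case for the hypothetical `svm` here.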
1. Is the classification going to be supervised or unsupervised? Several well-defined techniques such as SVMs (Support Vector Machines), trained neural networks, etc. are applicable to supervised classification. For unsupervised classification, GMMs (Gaussian Mixture Models) or HMMs (Hidden Markov Models) with Bayesian techniques could be used. (Several other techniques could of course be used as well.)
2. How much training data do you have, if the problem is supervised? A small training set may yield discouraging classification accuracy even if the chosen classifier is the most suitable one for the problem. In that case, try to obtain more samples. There is also generally a relationship (for practical purposes at least) between the feature dimensionality and the number of samples needed for a given technique. For example, with an SVM, the linear kernel tends to yield better results than RBF or other kernels when the number of training samples is less than, equal to, or only slightly more than the number of feature dimensions.
3. If the feature vector dimensionality is small enough (1, 2 or 3 dimensions), it makes sense to plot the data and visually inspect whether techniques like clustering could be more useful. With a very high number of feature dimensions, methods like clustering are generally not advisable (see “the curse of dimensionality”).
4. Are you doing classification in real time? Some techniques, e.g. template matching in image classification, may lead to a higher number of errors but are generally faster than most other techniques, provided the number of templates to be evaluated is not excessively high.
5. Depending on the problem domain, you may be able to choose an underlying model that exploits temporal or spatial correlations inherent in the data. For example, HMMs use the temporal continuity of speech samples to improve classification results in speech recognition problems.
Another point, slightly off topic perhaps: classification performance depends as much on choosing the correct feature vectors and pre-processing them as on the classifier itself. It’s generally a good idea to reserve some initial part of the project for trying out various classifiers on the same data set. It may at least help you reject the ones that are highly inaccurate.
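As a rough illustration of trying several classifiers on the same data set, here is a minimal sketch of a k-fold cross-validation harness written with the standard library only. The two candidate classifiers (a majority-class baseline and 1-nearest-neighbour), the synthetic two-cluster data and all function names are assumptions made for the example; in practice you would plug in real models and your own data.

```python
import random
from collections import Counter

def majority_classifier(train_X, train_y):
    """Baseline: always predict the most common training label."""
    label = Counter(train_y).most_common(1)[0][0]
    return lambda x: label

def nearest_neighbour_classifier(train_X, train_y):
    """1-nearest-neighbour using squared Euclidean distance."""
    def predict(x):
        distances = (sum((a - b) ** 2 for a, b in zip(x, row)) for row in train_X)
        return min(zip(distances, train_y), key=lambda pair: pair[0])[1]
    return predict

def cross_val_accuracy(make_classifier, X, y, k=5):
    """Mean accuracy over k folds (fold i holds out every k-th sample from i)."""
    fold_scores = []
    for i in range(k):
        test_idx = set(range(i, len(X), k))
        train_X = [x for j, x in enumerate(X) if j not in test_idx]
        train_y = [t for j, t in enumerate(y) if j not in test_idx]
        predict = make_classifier(train_X, train_y)
        correct = sum(predict(X[j]) == y[j] for j in test_idx)
        fold_scores.append(correct / len(test_idx))
    return sum(fold_scores) / k

# Synthetic data: two well-separated 2-D clusters, 50 points each.
rng = random.Random(0)
X, y = [], []
for label, centre in ((0, 0.0), (1, 5.0)):
    for _ in range(50):
        X.append((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)))
        y.append(label)
order = list(range(len(X)))
rng.shuffle(order)
X = [X[i] for i in order]
y = [y[i] for i in order]

candidates = {
    "majority baseline": majority_classifier,
    "1-nearest neighbour": nearest_neighbour_classifier,
}
results = {name: cross_val_accuracy(fn, X, y) for name, fn in candidates.items()}
for name, accuracy in results.items():
    print(f"{name}: mean accuracy {accuracy:.2f}")
```

Keeping a trivial baseline in the comparison makes it easy to reject candidates that do no better than always guessing the most common class.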
At a high level, these skills are a combination of software and data engineering.
The people best suited to this job are a data engineer and/or a machine learning engineer.
That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to acquire:
Well-structured code: it doesn’t need to be perfect, but it should at least be understandable and maintainable by other team members. Avoid spaghetti code[1] like the plague.
Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at all costs.
Model versioning: add a hash key to your different models. You will thank me later.
Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
Monitor performance: track the execution time and statistical scores of your models.
Data and model management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4]. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..
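The logging, versioning and metadata points above can be sketched together in a few lines. This is only an illustration under stated assumptions: the "model" is a placeholder dict, and the names `model_version`, `train_and_record`, the logger name `training` and the metadata fields are hypothetical choices, not a prescribed layout.

```python
import hashlib
import json
import logging
import pickle
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("training")

def model_version(model_bytes: bytes) -> str:
    """Short content hash that identifies one serialized model."""
    return hashlib.sha256(model_bytes).hexdigest()[:12]

def train_and_record(hyperparameters: dict, features: list) -> dict:
    """Train a placeholder model and return metadata describing the run."""
    start = time.perf_counter()
    # Stand-in for real training: the "model" is just a dict of zero weights.
    model = {name: 0.0 for name in features}
    model_bytes = pickle.dumps(model)
    elapsed = time.perf_counter() - start
    metadata = {
        "version": model_version(model_bytes),
        "hyperparameters": hyperparameters,
        "features": features,
        "training_seconds": round(elapsed, 6),
    }
    # Use the logger instead of print so level, time and source are recorded.
    logger.info("trained model %s in %.4fs", metadata["version"], elapsed)
    return metadata

metadata = train_and_record({"learning_rate": 0.1, "max_depth": 3}, ["age", "income"])
print(json.dumps(metadata, indent=2))
```

Storing the JSON next to the serialized model (for example under a key that contains the hash) lets anyone on the team trace a prediction back to the exact model and settings that produced it.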
Some of the mistakes that can occur while building a machine learning model (that I can think of) are listed here:
Not understanding the structure of the dataset
Not taking proper care during feature selection
Leaving out categorical features and considering only numerical variables
Falling into the dummy variable trap
Selecting an inefficient machine learning algorithm
Not trying out various ML algorithms based on the structure of the data
Improper tuning of model parameters
Most importantly: building a naive, degenerate model. Suppose we have a classification problem in which 99% of cases fall into class1 and the remaining 1% into class2. The model may learn a mapping that predicts class1 for every input. One might then say the model has 99% accuracy, but in reality it never detects the 1% of class2 cases. Class imbalance must therefore be taken into account when evaluating a model.
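The last point in the list can be checked with a few lines of arithmetic. The sketch below uses an invented data set of 1,000 labels (990 of class1, 10 of class2) and a "model" that always predicts class1; the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive):
    """Fraction of true `positive` cases that the model actually found."""
    actual = sum(t == positive for t in y_true)
    found = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return found / actual if actual else 0.0

# Invented, heavily imbalanced labels: 99% class1, 1% class2.
y_true = ["class1"] * 990 + ["class2"] * 10
# A degenerate model that predicts class1 for every input.
y_pred = ["class1"] * len(y_true)

overall_accuracy = accuracy(y_true, y_pred)
class2_recall = recall(y_true, y_pred, positive="class2")
print(f"accuracy: {overall_accuracy:.2%}")
print(f"recall on class2: {class2_recall:.2%}")
```

Accuracy looks excellent while recall on the minority class is zero, which is why per-class metrics should be reported alongside accuracy on imbalanced data.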
Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.
That’s just the surface-level comparison though. The image above gives an overview of how the two differ.
One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need a hypothesis to test or prove. The expected outputs are patterns or trends, which don’t require a statement or fact to test.
However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis. Read more here ….
The data science life-cycle is not as well defined as the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent data science tasks using a variety of tools, techniques, programming, etc.
Thus, the data science life-cycle can include the following steps:
Business requirement understanding.
Data collection.
Data cleaning.
Data analysis.
Modeling.
Performance evaluation.
Communicating with stakeholders.
Deployment.
Real-world testing.
Business buy-in.
Support and maintenance.
Looks neat, but here is the scheme to visualize how it is happening in reality:
Agile development processes, especially continuous delivery, lend themselves well to the data science project life-cycle. The early comparison helps the data science team to change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build them.
Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A
Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.
Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.
The App provides hundreds of quizzes and practice exams on:
– Machine Learning Operation on AWS
– Modelling
– Data Engineering
– Computer Vision
– Exploratory Data Analysis
– ML implementation & Operations
– Machine Learning Basics Questions and Answers
– Machine Learning Advanced Questions and Answers
– Scorecard
– Countdown timer
– Machine Learning Cheat Sheets
– Machine Learning Interview Questions and Answers
– Machine Learning Latest News
The App covers Machine Learning Basics and Advanced topics including: NLP, Computer Vision, Python, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.
Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.
Answer1: Use the Build Your Own Container (BYOC) Amazon SageMaker option. Create a new Docker container with the existing code. Register the container in Amazon Elastic Container Registry. Finally, run the training and inference jobs using this container.
Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?
Answer2: Amazon SageMaker Notebook instances
Amazon SageMaker enables developers and data scientists to build, train, tune, and deploy machine learning (ML) models at scale. You can deploy trained ML models for real-time or batch predictions on unseen data, a process known as inference. However, in most cases, the raw input data must be preprocessed and can’t be used directly for making predictions. This is because most ML models expect the data in a predefined format, so the raw data needs to be first cleaned and formatted in order for the ML model to process the data. You can use the Amazon SageMaker built-in Scikit-learn library for preprocessing input data and then use the Amazon SageMaker built-in Linear Learner algorithm for predictions.
Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?
Top 10 Google Professional Machine Learning Engineer Sample Questions
Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?
A. Use K-fold cross validation to understand how the model performs on different test datasets.
B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.
C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.
D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.
Answer 1)
B
Notes 1)
B is correct because it identifies the pixels of the input image that lead to the classification of the image itself.
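For intuition, here is a minimal sketch of Integrated Gradients on a toy model rather than an image classifier. It assumes a three-input logistic model with hypothetical weights and an all-zeros baseline, and approximates the path integral with a midpoint Riemann sum. The attribution for input i is (x_i - baseline_i) times the average gradient along the straight line from the baseline to x.

```python
import math

WEIGHTS = [2.0, -1.0, 0.5]  # hypothetical model parameters
BIAS = -0.25

def model(x):
    """Toy differentiable classifier: logistic function of a linear score."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))

def gradient(x):
    """Analytic gradient of the model output with respect to each input."""
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]

def integrated_gradients(x, baseline, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]

x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)
for i, a in enumerate(attributions):
    print(f"input {i}: attribution {a:+.4f}")
print(f"sum of attributions: {sum(attributions):.4f}")
print(f"model(x) - model(baseline): {model(x) - model(baseline):.4f}")
```

The last two printed values should agree closely: attributions summing to the difference between the prediction and the baseline prediction is the completeness property of the method. For images, each pixel plays the role of one input here.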
Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?
A. Train the model for a few iterations, and check for NaN values.
B. Train the model for a few iterations, and verify that the loss is constant.
C. Train a simple linear model, and determine if the DNN model outperforms it.
D. Train the model with no regularization, and verify that the loss function is close to zero.
Answer 2)
D
Notes 2)
D is correct because the test can check that the model has enough parameters to memorize the task.
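The idea behind this check can be sketched on a much smaller scale. The example below is an assumption-laden stand-in, not a DNN test: it trains a one-feature logistic regression with plain gradient descent on four invented, separable points, once without and once with an L2 penalty, and compares the final data loss. The threshold of 0.01 is an arbitrary choice for the illustration.

```python
import math

# Tiny, linearly separable data set: one feature, binary label (invented).
X = [-2.0, -1.0, 1.0, 2.0]
Y = [0, 0, 1, 1]

def data_loss(w, b):
    """Mean logistic loss, without any penalty term."""
    total = 0.0
    for x, y in zip(X, Y):
        margin = (2 * y - 1) * (w * x + b)
        total += math.log1p(math.exp(-margin))
    return total / len(X)

def train(l2_penalty, steps=2000, learning_rate=1.0):
    """Gradient descent on logistic loss + 0.5 * l2_penalty * w**2."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w, grad_b = l2_penalty * w, 0.0
        for x, y in zip(X, Y):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x / len(X)
            grad_b += (p - y) / len(X)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return data_loss(w, b)

loss_without_regularization = train(l2_penalty=0.0)
loss_with_regularization = train(l2_penalty=0.1)
print(f"loss without regularization: {loss_without_regularization:.5f}")
print(f"loss with L2 regularization: {loss_with_regularization:.5f}")
print("can memorize the data:", loss_without_regularization < 0.01)
```

Without regularization the loss keeps shrinking towards zero because nothing stops the model from fitting the training points exactly. A model with too little capacity for its task would plateau at a clearly non-zero loss even in this setting.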
Question 3: Your team is using a TensorFlow Inception-v3 CNN model pretrained on ImageNet for an image classification prediction challenge on 10,000 images. You will use AI Platform to perform the model training. What TensorFlow distribution strategy and AI Platform training job configuration should you use to train the model and optimize for wall-clock time?
A. Default Strategy; Custom tier with a single master node and four v100 GPUs.
B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.
C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.
D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.
Answer 3)
D
Notes 3)
D is correct because this is the only strategy that can perform distributed training; albeit there is only a single copy of the variables on the CPU host.
Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?
A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version
B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version
C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment
D. Create a new AI Platform model version -> Deploy model in test environment -> Validate model
Answer 4)
A
Notes 4)
A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.
Question 5: You work for a maintenance company and have built and trained a deep learning model that identifies defects based on thermal images of underground electric cables. Your dataset contains 10,000 images, 100 of which contain visible defects. How should you evaluate the performance of the model on a test dataset?
A. Calculate the Area Under the Curve (AUC) value.
B. Calculate the number of true positive results predicted by the model.
C. Calculate the fraction of images predicted by the model to have a visible defect.
D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.
Answer 5)
A
Notes 5)
A is correct because it is scale-invariant. AUC measures how well predictions are ranked, rather than their absolute values. AUC is also classification-threshold invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.
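The scale invariance mentioned in the note can be demonstrated with a small sketch. It computes AUC directly from its ranking interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are invented.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Invented example: 3 defective images (label 1) among 10.
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]

auc = roc_auc(labels, scores)
# Any strictly increasing transformation of the scores keeps the ranking.
rescaled = [100.0 * s + 3.0 for s in scores]
auc_rescaled = roc_auc(labels, rescaled)
print(f"AUC on original scores: {auc:.4f}")
print(f"AUC on rescaled scores: {auc_rescaled:.4f}")
```

Both values are identical because only the ordering of the scores matters, and no classification threshold is involved in the computation.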
Question 6: You work for a manufacturing company that owns a high-value machine which has several machine settings and multiple sensors. A history of the machine’s hourly sensor readings and known failure event data are stored in BigQuery. You need to predict if the machine will fail within the next 3 days in order to schedule maintenance before the machine fails. Which data preparation and model training steps should you take?
A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML
B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False
D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
Answer 6)
D
Notes 6)
D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
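The two ideas in this answer, smoothing sensor readings with a rolling average and weighting classes to offset imbalance, can be sketched in plain Python. This is an illustration only: the readings and labels are invented, and the inverse-frequency formula n / (n_classes * count) is a common balancing scheme assumed here, not a statement of how BQML computes its weights internally.

```python
from collections import Counter, deque

def rolling_mean(values, window):
    """Mean of the last `window` readings at each step (shorter at the start)."""
    recent, total, out = deque(), 0.0, []
    for v in values:
        recent.append(v)
        total += v
        if len(recent) > window:
            total -= recent.popleft()
        out.append(total / len(recent))
    return out

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (n_classes * count)."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / (len(counts) * count) for cls, count in counts.items()}

# Invented hourly sensor readings with one spike.
readings = [10.0, 12.0, 11.0, 30.0, 12.0, 11.0]
smoothed = rolling_mean(readings, window=3)

# Invented labels: failures (1) are rare compared to normal operation (0).
weights = balanced_class_weights([0] * 95 + [1] * 5)

print("smoothed readings:", [round(v, 2) for v in smoothed])
print("class weights:", {cls: round(w, 3) for cls, w in weights.items()})
```

The rolling mean spreads a single spike over the window instead of letting one noisy reading dominate, and the rare failure class receives a much larger weight than the common class so that the model is not rewarded for ignoring it.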
Question 7: You are an ML engineer at a media company. You need to build an ML model to analyze video content frame-by-frame, identify objects, and alert users if there is inappropriate content. Which Google Cloud products should you use to build this project?
A. Pub/Sub, Cloud Function, Cloud Vision API
B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging
C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging
D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging
Answer 7)
C
Notes 7)
C is correct as Video Intelligence API can find inappropriate components and other components satisfy the requirements of real-time processing and notification.
Question 8: You work for a large retailer. You want to use ML to forecast future sales leveraging 10 years of historical sales data. The historical data is stored in Cloud Storage in Avro format. You want to rapidly experiment with all the available data. How should you build and train your model for the sales forecast?
A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.
B. Convert the data into CSV format and create a regression model on AutoML Tables.
C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.
D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.
Answer 8)
A
Notes 8)
A is correct because BigQuery ML is designed for fast and rapid experimentation and it is possible to use federated queries to read data directly from Cloud Storage. Moreover, ARIMA is considered one of the best in class for time series forecasting.
Question 9: You need to build an object detection model for a small startup company to identify if and where the company’s logo appears in an image. You were given a large repository of images, some with logos and some without. These images are not yet labelled. You need to label these pictures, and then train and deploy the model. What should you do?
A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.
B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.
C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.
D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.
Answer 9)
A
Notes 9)
A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.
Question 10: You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer’s stated intention for contacting customer service. About 70% of customer inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer and more complicated requests. Which intents should you automate first?
A. Automate a blend of the shortest and longest intents to be representative of all intents.
B. Automate the more complicated requests first because those require more of the agents’ time.
C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.
D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.
Azure and AWS are second class citizens in this area.
Sure, AWS has 70% of the market.
Sure, Azure is the easiest turn key and super user friendly.
But, the king of machine learning in the cloud is GCP.
GCP = Google Cloud Platform
Google has the largest data science team in the world, not to mention they have Hinton.
Let’s forget for a minute that they created TensorFlow and gave it away.
Let’s just talk about building a real-world model with data that doesn’t fit into an Excel spreadsheet.
The vast majority of applied machine learning is supervised and that means we need data.
Not just normal data, we need very clean highly structured data.
Where’s the easiest place in the world to upload and model a Petabyte of structured data? BigQuery of course.
Why BigQuery? I don’t have to do anything but upload my data. No spinning up Redshift clusters or whatever I have to do in Azure; I just upload and massage data with my familiar SQL. If I do have to wrangle my data, it won’t take me six months to update 5 rows here; it usually takes minutes.
Then, you’ll need a front end. Cloud datalab is a Jupyter notebook, which is good because I don’t want nor do I need anything else.
Then, with a single line of code, I connect my Datalab (Jupyter) notebook to my data in BigQuery and build away.
I’ve worked in all three, and the only thing I care about is getting my job done the fastest. Right now that means I build my models in GCP.
If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.
Here, I want to share the best research paper on Machine Learning classification methods, titled ‘Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?’, published in the ‘Journal of Machine Learning Research’.
This paper nicely explained 179 classification techniques and applied them on 121 data sets thus sharing small summary of the paper:
Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
https://jmlr.org/papers/v15/delgado14a.html
The paper evaluated 179 classifiers arising from 17 ML families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R ( with and without the caret package), C and Matlab, including all the relevant classifiers available today.
Experiments used total 121 data sets , which represent the whole UCI data base (excluding the large-scale problems) and other own real problems, in order to achieve significant conclusions about the classifier behaviour, not dependent on the data set collection.
The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets. However, the difference is not statistically significant with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package).
The random forest is clearly the best family of classifiers (3 out of 5 bests classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).
1. Is the classification going to be supervised or unsupervised? Several well defined techniques likes SVM (Support Vector Machines), trained neural net,etc. are applicable for supervised classification. For unsupervised classification, GMMs (Gaussian Mixture Models), HMMs (Hidden Markov models) with Baye’s techniques could be used. (Several other techniques could of course be used as well)
2.How much training data do you have in case it is supervised ? A small number of training data may yield discouraging classification accuracy even if the chosen classifier is the most suitable one for the problem. In such a case, try to obtain more number of samples. There’s also generally a correlation (for practical purposes at least) between the feature dimensionality and the number of samples for given technique. For example, while using SVM, the linear kernel tends to yield better results when the number of training samples are less than or equal to or only slightly more than the number of feature dimensions as compared to RBF or any other kernel.
3. If the feature vector dimensionality is small enough (1/2/3 -D) then it makes sense to plot and visually inspect if techniques like clustering could be more useful. With very high number of feature dimensions, methods like clustering are generally not advisable(Refer : “The Curse Of Dimensionality”).
4. Are you doing classification in real time ? Some techniques ,e.g. “Template Match” in image classification may lead to a higher number of errors but is generally faster than most other techniques if the number of templates to be evaluated are not excessively high.
5. Depending upon the problem domain, you can decide if you can choose the underlying model in such a way that it can use certain temporal/spatial correlations that may be inherent in the data. For example, HMMs use the temporal continuity of speech samples for enhancing classification results in speech recognition problems.
Another point, slightly off the topic perhaps, but the classification performance is as much a function of choosing the correct feature vectors, the pre-processing of the feature vectors as much as the classifier itself. It’s generally a good idea to give reserve some initial part of the project to try out various classifiers on the same data-set. It may at least help you reject the ones which are highly inaccurate.
At a high level, these skills are a combination of software and data engineering.
The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.
That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:
Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
Model versioning: add a hash key to your different models. You will thank me later.
Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
Monitor performances: execution time and statistical scores of your models.
Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4] system. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read often). Read more here …..
Some of the mistakes I can think of that can happen while building a machine learning model are listed here:
Not understanding the structure of the dataset
Not taking proper care during feature selection
Leaving out categorical features and considering just numerical variables
Falling into the dummy variable trap
Selecting an inefficient machine learning algorithm
Not trying out various ML algorithms for building the model based on the structure of the data
Improper tuning of model parameters
Most importantly: building a naive, flawed model. Suppose we have a classification problem where 99% of the samples fall into class1 and the remaining 1% into class2. The model may learn a mapping function that predicts class1 for every input. One might then say the model has 99% accuracy, but in reality it never identifies the 1% of class2 cases. Class imbalance like this must be taken into consideration.
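The last mistake in the list above is easy to reproduce. The sketch below uses a made-up data set of 1,000 labels (990 class1, 10 class2) and a "model" that always predicts class1; the helper functions are illustrative, not taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of the samples of class `positive` that the model actually found."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_positives = sum(1 for t in y_true if t == positive)
    return true_positives / actual_positives if actual_positives else 0.0


# Hypothetical imbalanced data set: 99% class1, 1% class2.
y_true = ["class1"] * 990 + ["class2"] * 10
# A degenerate model that predicts class1 for every input.
y_pred = ["class1"] * len(y_true)

overall_accuracy = accuracy(y_true, y_pred)
class2_recall = recall(y_true, y_pred, "class2")

print(f"accuracy: {overall_accuracy:.2f}")      # accuracy: 0.99
print(f"recall on class2: {class2_recall:.2f}")  # recall on class2: 0.00
```

Accuracy alone hides the problem; per-class recall (or precision, F1, or AUC) shows that the minority class is never detected.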
Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.
That’s just the surface-level comparison though. The image above gives an overview of how the two differ.
One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected output is a set of patterns or trends, which doesn’t require coming up with a statement or fact to test.
However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis.
The data science life-cycle is not as well-defined as the software development life-cycle, and there is no ‘one-size-fits-all’ solution for data science projects. Every step in the life-cycle of a data science project depends on various data scientist skills and data science tools. The typical life-cycle of a data science project involves jumping back and forth among various interdependent data science tasks using a variety of tools, techniques, programming, etc.
Thus, the data science life-cycle can include the following steps:
Business requirement understanding.
Data collection.
Data cleaning.
Data analysis.
Modeling.
Performance evaluation.
Communicating with stakeholders.
Deployment.
Real-world testing.
Business buy-in.
Support and maintenance.
Looks neat, but here is the scheme to visualize how it is happening in reality:
Agile development processes, especially continuous delivery, lend themselves well to the data science project life-cycle. Early comparison helps the data science team change approaches, refine hypotheses, and even discard the project if the business case is nonviable or the benefits of the predictive models are not worth the effort to build them.
Quizzes, Practice Exams: Modeling, Data Engineering, Vision, Exploratory Data Analysis, ML Ops, Cheat Sheets, ML Jobs Interview Q&A
Use this App to learn about Machine Learning on AWS and prepare for the AWS Machine Learning Specialty Certification MLS-C01.
Earning AWS Certified Machine Learning Specialty validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.
The App provides hundreds of quizzes and practice exams about:
– Machine Learning Operation on AWS
– Modeling
– Data Engineering
– Computer Vision
– Exploratory Data Analysis
– ML implementation & Operations
– Machine Learning Basics Questions and Answers
– Machine Learning Advanced Questions and Answers
– Scorecard
– Countdown timer
– Machine Learning Cheat Sheets
– Machine Learning Interview Questions and Answers
– Machine Learning Latest News
The App covers Machine Learning Basics and Advanced topics including: NLP, Computer Vision, Python, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.
Question1: An advertising and analytics company uses machine learning to predict user response to online advertisements using a custom XGBoost model. The company wants to improve its ML pipeline by porting its training and inference code, written in R, to Amazon SageMaker, and do so with minimal changes to the existing code.
Answer1: Use the Build Your Own Container (BYOC) Amazon SageMaker option. Create a new Docker container with the existing code. Register the container in Amazon Elastic Container Registry. Finally, run the training and inference jobs using this container.
Question2: Which feature of Amazon SageMaker can you use for preprocessing the data?
Answer2: Amazon SageMaker Notebook instances
Amazon SageMaker enables developers and data scientists to build, train, tune, and deploy machine learning (ML) models at scale. You can deploy trained ML models for real-time or batch predictions on unseen data, a process known as inference. However, in most cases, the raw input data must be preprocessed and can’t be used directly for making predictions. This is because most ML models expect the data in a predefined format, so the raw data needs to be first cleaned and formatted in order for the ML model to process the data. You can use the Amazon SageMaker built-in Scikit-learn library for preprocessing input data and then use the Amazon SageMaker built-in Linear Learner algorithm for predictions.
Question3: What setting, when creating an Amazon SageMaker notebook instance, can you use to install libraries and import data?
Top 10 Google Professional Machine Learning Engineer Sample Questions
Question 1: You work for a textile manufacturer and have been asked to build a model to detect and classify fabric defects. You trained a machine learning model with high recall based on high resolution images taken at the end of the production line. You want quality control inspectors to gain trust in your model. Which technique should you use to understand the rationale of your classifier?
A. Use K-fold cross validation to understand how the model performs on different test datasets.
B. Use the Integrated Gradients method to efficiently compute feature attributions for each predicted image.
C. Use PCA (Principal Component Analysis) to reduce the original feature set to a smaller set of easily understood features.
D. Use k-means clustering to group similar images together, and calculate the Davies-Bouldin index to evaluate the separation between clusters.
Answer 1)
B
Notes 1)
B is correct because it identifies the pixels of the input image that lead to the classification of the image itself.
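To make the idea behind Integrated Gradients concrete, here is a minimal sketch on a toy logistic model rather than an image classifier. It approximates the path integral of the gradient from a baseline input to the actual input with a midpoint Riemann sum. The weights, input, all-zero baseline, and function names are assumptions chosen for illustration; for real image models you would use a library implementation instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, weights, bias):
    """Toy model: logistic regression over the input features."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def gradient(x, weights, bias):
    """Analytic gradient of the prediction with respect to each input feature."""
    p = predict(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Average the gradient along the straight path baseline -> x, then scale by (x - baseline)."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point, weights, bias)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


# Hypothetical model and input; the third feature has zero weight.
weights, bias = [2.0, -1.0, 0.0], 0.0
x = [1.0, 0.5, 3.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, weights, bias)
# Completeness check: attributions should sum to f(x) - f(baseline).
gap = sum(attributions) - (predict(x, weights, bias) - predict(baseline, weights, bias))
```

The feature with zero weight receives zero attribution, and the attributions add up (to within numerical error) to the change in the prediction between the baseline and the input. For images, the same per-feature attributions become a per-pixel heat map that an inspector can look at.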
Question 2: You need to write a generic test to verify whether Dense Neural Network (DNN) models automatically released by your team have a sufficient number of parameters to learn the task for which they were built. What should you do?
A. Train the model for a few iterations, and check for NaN values.
B. Train the model for a few iterations, and verify that the loss is constant.
C. Train a simple linear model, and determine if the DNN model outperforms it.
D. Train the model with no regularization, and verify that the loss function is close to zero.
Answer 2)
D
Notes 2)
D is correct because the test can check that the model has enough parameters to memorize the task.
Question 3: Your team is using a TensorFlow Inception-v3 CNN model pretrained on ImageNet for an image classification prediction challenge on 10,000 images. You will use AI Platform to perform the model training. What TensorFlow distribution strategy and AI Platform training job configuration should you use to train the model and optimize for wall-clock time?
A. Default Strategy; Custom tier with a single master node and four v100 GPUs.
B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.
C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.
D. Central Storage Strategy; Custom tier with a single master node and four v100 GPUs.
Answer 3)
D
Notes 3)
D is correct because this is the only strategy listed that can perform distributed training, albeit with only a single copy of the variables on the CPU host.
Question 4: You work on a team where the process for deploying a model into production starts with data scientists training different versions of models in a Kubeflow pipeline. The workflow then stores the new model artifact into the corresponding Cloud Storage bucket. You need to build the next steps of the pipeline after the submitted model is ready to be tested and deployed in production on AI Platform. How should you configure the architecture before deploying the model to production?
A. Deploy model in test environment -> Validate model -> Create a new AI Platform model version
B. Validate model -> Deploy model in test environment -> Create a new AI Platform model version
C. Create a new AI Platform model version -> Validate model -> Deploy model in test environment
D. Create a new AI Platform model version -> Deploy model in test environment -> Validate model
Answer 4)
A
Notes 4)
A is correct because the model can be validated after it is deployed to the test environment, and the release version is established before the model is deployed in production.
Question 5: You work for a maintenance company and have built and trained a deep learning model that identifies defects based on thermal images of underground electric cables. Your dataset contains 10,000 images, 100 of which contain visible defects. How should you evaluate the performance of the model on a test dataset?
A. Calculate the Area Under the Curve (AUC) value.
B. Calculate the number of true positive results predicted by the model.
C. Calculate the fraction of images predicted by the model to have a visible defect.
D. Calculate the Cosine Similarity to compare the model’s performance on the test dataset to the model’s performance on the training dataset.
Answer 5)
A
Notes 5)
A is correct because it is scale-invariant. AUC measures how well predictions are ranked, rather than their absolute values. AUC is also classification-threshold invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.
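The invariance properties mentioned in the note can be checked with a small sketch. AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, so it can be computed directly from pairwise comparisons. The labels and scores below are made up for illustration.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count as half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical imbalanced test set: 8 images without a defect, 2 with one.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.15, 0.30, 0.25, 0.40, 0.60, 0.55, 0.90]

original_auc = auc(labels, scores)
# Rescaling the scores keeps their ranking, so the AUC does not change.
rescaled_auc = auc(labels, [s / 10 for s in scores])

print(original_auc, rescaled_auc)  # 0.9375 0.9375
```

One defective image (score 0.55) is ranked below one clean image (score 0.60), which costs 1 of the 16 positive-negative pairs, hence 15/16 = 0.9375. No classification threshold is involved in the computation.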
Question 6: You work for a manufacturing company that owns a high-value machine which has several machine settings and multiple sensors. A history of the machine’s hourly sensor readings and known failure event data are stored in BigQuery. You need to predict if the machine will fail within the next 3 days in order to schedule maintenance before the machine fails. Which data preparation and model training steps should you take?
A. Data preparation: Daily max value feature engineering with DataPrep; Model training: AutoML classification with BQML
B. Data preparation: Daily min value feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
C. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False
D. Data preparation: Rolling average feature engineering with DataPrep; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
Answer 6)
D
Notes 6)
D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
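The two ideas in this answer, smoothing sensor readings with a rolling average and weighting rare failure events more heavily, can be sketched in plain Python. This is only an illustration of the concepts: the readings are invented, and the weight formula is the common "balanced" heuristic (total samples divided by number of classes times class count), which is an assumption here and not necessarily the exact computation BQML performs for AUTO_CLASS_WEIGHTS.

```python
def rolling_average(values, window):
    """Trailing mean over the last `window` readings (shorter at the start of the series)."""
    averages = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * class_count)."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    total = len(labels)
    return {label: total / (len(counts) * count) for label, count in counts.items()}


# Hypothetical hourly sensor readings with one spike.
readings = [10, 12, 11, 30, 13, 12]
smoothed = rolling_average(readings, window=3)

# Hypothetical failure labels: 9 normal hours, 1 hour preceding a failure.
failure_labels = [0] * 9 + [1]
weights = balanced_class_weights(failure_labels)
```

The rolling average dampens the single spike at 30 while still reflecting it in the following windows, and the rare failure class receives a weight of 5.0 against roughly 0.56 for the normal class.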
Question 7: You are an ML engineer at a media company. You need to build an ML model to analyze video content frame-by-frame, identify objects, and alert users if there is inappropriate content. Which Google Cloud products should you use to build this project?
A. Pub/Sub, Cloud Function, Cloud Vision API
B. Pub/Sub, Cloud IoT, Dataflow, Cloud Vision API, Cloud Logging
C. Pub/Sub, Cloud Function, Video Intelligence API, Cloud Logging
D. Pub/Sub, Cloud Function, AutoML Video Intelligence, Cloud Logging
Answer 7)
C
Notes 7)
C is correct as Video Intelligence API can find inappropriate components and other components satisfy the requirements of real-time processing and notification.
Question 8: You work for a large retailer. You want to use ML to forecast future sales leveraging 10 years of historical sales data. The historical data is stored in Cloud Storage in Avro format. You want to rapidly experiment with all the available data. How should you build and train your model for the sales forecast?
A. Load data into BigQuery and use the ARIMA model type on BigQuery ML.
B. Convert the data into CSV format and create a regression model on AutoML Tables.
C. Convert the data into TFRecords and create an RNN model on TensorFlow on AI Platform Notebooks.
D. Convert and refactor the data into CSV format and use the built-in XGBoost algorithm on AI Platform Training.
Answer 8)
A
Notes 8)
A is correct because BigQuery ML is designed for fast and rapid experimentation and it is possible to use federated queries to read data directly from Cloud Storage. Moreover, ARIMA is considered one of the best in class for time series forecasting.
Question 9: You need to build an object detection model for a small startup company to identify if and where the company’s logo appears in an image. You were given a large repository of images, some with logos and some without. These images are not yet labelled. You need to label these pictures, and then train and deploy the model. What should you do?
A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.
B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.
C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.
D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real time object detection model.
Answer 9)
A
Notes 9)
A is correct as this will allow you to easily create a request for a labelling task and deploy a high-performance model.
Question 10: You work for a large financial institution that is planning to use Dialogflow to create a chatbot for the company’s mobile app. You have reviewed old chat logs and tagged each conversation for intent based on each customer’s stated intention for contacting customer service. About 70% of customer inquiries are simple requests that are solved within 10 intents. The remaining 30% of inquiries require much longer and more complicated requests. Which intents should you automate first?
A. Automate a blend of the shortest and longest intents to be representative of all intents.
B. Automate the more complicated requests first because those require more of the agents’ time.
C. Automate the 10 intents that cover 70% of the requests so that live agents can handle the more complicated requests.
D. Automate intents in places where common words such as “payment” only appear once to avoid confusing the software.
Azure and AWS are second class citizens in this area.
Sure, AWS has the largest share of the market.
Sure, Azure is the easiest turnkey option and super user friendly.
But, the king of machine learning in the cloud is GCP.
GCP = Google Cloud Platform
Google has the largest data science team in the world, not to mention they have Hinton.
Let’s forget for a minute that they created TensorFlow and gave it away.
Let’s just talk about building a real-world model with data that doesn’t fit into an Excel spreadsheet.
The vast majority of applied machine learning is supervised and that means we need data.
Not just normal data, we need very clean highly structured data.
Where’s the easiest place in the world to upload and model a Petabyte of structured data? BigQuery of course.
Why BigQuery? I don’t have to do anything but upload my data. No spinning up Redshift clusters or whatever I have to do in Azure, just upload and massage data with my familiar SQL. If I do have to wrangle my data, it won’t take me six months to update 5 rows here; it usually takes minutes.
Then, you’ll need a front end. Cloud Datalab is a Jupyter notebook, which is good because I neither want nor need anything else.
Then, with a single line of code, I connect my Datalab (Jupyter) notebook to my data in BigQuery and build away.
I’ve worked in all three, and the only thing I care about is getting my job done the fastest. Right now that means I build my models in GCP.
If you’re new to machine learning don’t start in GCP or any cloud vendor for that matter. Start learning Python from the comfort of your laptop.
Here, I want to share the best research paper on Machine Learning classification methods, titled ‘Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?’, published in the ‘Journal of Machine Learning Research’.
This paper evaluates 179 classification techniques on 121 data sets. Here is a short summary of the paper:
Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
https://jmlr.org/papers/v15/delgado14a.html
The paper evaluated 179 classifiers arising from 17 ML families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbours, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R (with and without the caret package), C and Matlab, including all the relevant classifiers available today.
The experiments used a total of 121 data sets, which represent the whole UCI database (excluding the large-scale problems) plus other real problems of the authors’ own, in order to reach significant conclusions about classifier behaviour that do not depend on the data set collection.
The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy, exceeding 90% in 84.3% of the data sets. However, the difference is not statistically significant with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package).
The random forest is clearly the best family of classifiers (3 of the 5 best classifiers are RF), followed by SVM (4 classifiers in the top 10), neural networks and boosting ensembles (5 and 3 members in the top 20, respectively).
1. Is the classification going to be supervised or unsupervised? Several well-defined techniques like SVMs (Support Vector Machines), trained neural nets, etc. are applicable for supervised classification. For unsupervised classification, GMMs (Gaussian Mixture Models) or HMMs (Hidden Markov Models) with Bayesian techniques could be used. (Several other techniques could of course be used as well.)
2. How much training data do you have, if the problem is supervised? A small amount of training data may yield discouraging classification accuracy even if the chosen classifier is the most suitable one for the problem. In such a case, try to obtain more samples. There’s also generally a correlation (for practical purposes at least) between the feature dimensionality and the number of samples needed for a given technique. For example, with SVMs, the linear kernel tends to yield better results than RBF or any other kernel when the number of training samples is less than, equal to, or only slightly more than the number of feature dimensions.
3. If the feature vector dimensionality is small enough (1-D, 2-D, or 3-D), it makes sense to plot the data and visually inspect whether techniques like clustering could be more useful. With a very high number of feature dimensions, methods like clustering are generally not advisable (refer to “The Curse of Dimensionality”).
4. Are you doing classification in real time? Some techniques, e.g. “Template Matching” in image classification, may lead to a higher number of errors but are generally faster than most other techniques if the number of templates to be evaluated is not excessively high.
5. Depending upon the problem domain, you can decide if you can choose the underlying model in such a way that it can use certain temporal/spatial correlations that may be inherent in the data. For example, HMMs use the temporal continuity of speech samples for enhancing classification results in speech recognition problems.
Another point, slightly off topic perhaps: classification performance depends as much on choosing the right feature vectors and pre-processing them as on the classifier itself. It’s generally a good idea to reserve some initial part of the project for trying out various classifiers on the same data set. It may at least help you reject the ones that are highly inaccurate.
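The advice to try several classifiers on the same data set can be sketched with a tiny harness. The three classifiers below (a majority-class baseline, nearest centroid, and 1-nearest-neighbour) and the two-dimensional toy data are illustrative assumptions written in plain Python; in practice you would plug in real models from your library of choice and use a proper validation scheme.

```python
import math
from collections import Counter


def fit_majority(train):
    """Baseline: always predict the most frequent training label."""
    label = Counter(lbl for _, lbl in train).most_common(1)[0][0]
    return lambda point: label


def fit_nearest_centroid(train):
    """Predict the label whose mean training point is closest."""
    sums = {}
    for (x, y), label in train:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    centroids = {lbl: (sx / n, sy / n) for lbl, (sx, sy, n) in sums.items()}
    return lambda point: min(centroids, key=lambda lbl: math.dist(point, centroids[lbl]))


def fit_one_nn(train):
    """Predict the label of the single closest training point."""
    return lambda point: min(train, key=lambda item: math.dist(point, item[0]))[1]


def accuracy(classifier, test):
    return sum(classifier(point) == label for point, label in test) / len(test)


# Hypothetical toy data: two well-separated clusters.
train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b"), ((6, 6), "b")]
test = [((1.5, 1.5), "a"), ((2, 2), "a"), ((5.5, 5.5), "b"), ((6, 7), "b")]

candidates = {
    "majority baseline": fit_majority,
    "nearest centroid": fit_nearest_centroid,
    "1-nearest neighbour": fit_one_nn,
}
# Train every candidate on the same split and compare on the same held-out points.
results = {name: accuracy(fit(train), test) for name, fit in candidates.items()}
```

On this toy split the baseline reaches 0.5 while the other two reach 1.0, which is exactly the kind of early signal that lets you discard clearly inaccurate options before investing in tuning.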
At a high level, these skills are a combination of software and data engineering.
The persons that are more appropriate to do this job are a data engineer and/or a machine learning engineer.
That being said, if you work at a startup or happen to be in a small company and need to put the models into production yourself, here are the top skills you need to get:
Well structured code: it doesn’t need to be perfect but at least can be understood and updated by other team members. Avoid spaghetti code[1] as the plague.
Add logs: if you are a Python user, the logging[2] module is your friend. Avoid print statements at any cost.
Model versioning: add a hash key to your different models. You will thank me later.
Metadata everywhere: save as much data about your models and ML experiments as you can (running time, hyperparameters, used features, CV scores, and so on). You will thank me later, again.
Monitor performance: execution time and statistical scores of your models.
Data and models management: store the necessary data and models somewhere that is available to everyone (S3[3] for example). Avoid uploading these to your VCS[4]. Don’t share them using Slack or Drive. I won’t judge you though, I do it sometimes (read: often).
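The logging, model versioning, and metadata points above can be sketched together in a few lines of Python. This is a minimal illustration, not a production setup: the toy model, the feature names, the scores, and the helper names model_hash and build_metadata are all hypothetical, and the model is serialized with JSON only to keep the example self-contained.

```python
import hashlib
import json
import logging
import time

# Use the logging module instead of print statements.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("training")


def model_hash(model_bytes: bytes) -> str:
    """Return a short identifier derived from the serialized model."""
    return hashlib.sha256(model_bytes).hexdigest()[:12]


def build_metadata(model_bytes, hyperparameters, features, cv_scores, started_at):
    """Collect the facts about a training run worth storing next to the model."""
    return {
        "model_hash": model_hash(model_bytes),
        "hyperparameters": hyperparameters,
        "features": features,
        "cv_scores": cv_scores,
        "running_time_s": round(time.time() - started_at, 3),
    }


started = time.time()
# Hypothetical stand-in for a fitted model.
toy_model = {"weights": [0.4, -1.2], "bias": 0.1}
blob = json.dumps(toy_model, sort_keys=True).encode("utf-8")

metadata = build_metadata(blob, {"learning_rate": 0.1},
                          ["age", "income"], [0.81, 0.79, 0.83], started)
logger.info("trained model %s", metadata["model_hash"])
metadata_json = json.dumps(metadata, indent=2)  # ready to store next to the model
```

Because the identifier is derived from the model's bytes, two different models get different keys, and the same model always gets the same one.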
Some of the mistakes that can occur while building a machine learning model (that I can think of) are listed here:
Not understanding the structure of the dataset
Not taking proper care during feature selection
Leaving out categorical features and considering just numerical variables
Falling into the dummy variable trap
Selecting an inefficient machine learning algorithm
Not trying out various ML algorithms for building the model based on the structure of the data
Improper tuning of model parameters
Most importantly: building a naive model that ignores class imbalance. Suppose we have a classification problem in which 99% of the examples fall into class1 and the remaining 1% into class2. The trained model may learn a mapping function that predicts class1 for every input. One might say the model has 99% accuracy, but in reality it never identifies the 1% of class2 cases. Accuracy alone is misleading here, so this must be taken into consideration.
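The class imbalance pitfall is easy to reproduce. The sketch below assumes a hypothetical dataset of 100 labels and a "model" that always predicts class1; it compares accuracy with the recall on class2.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of the true `positive` examples that were predicted as `positive`."""
    predictions_for_positive = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in predictions_for_positive) / len(predictions_for_positive)


# 99 examples of class1 and a single example of class2.
y_true = ["class1"] * 99 + ["class2"]
# A degenerate model that predicts class1 for every input.
y_pred = ["class1"] * 100

acc = accuracy(y_true, y_pred)                     # 0.99
class2_recall = recall(y_true, y_pred, "class2")   # 0.0
print(f"accuracy={acc:.2f}, recall on class2={class2_recall:.2f}")
```

The accuracy looks excellent while the recall on the rare class is zero, which is why per-class metrics are worth checking on imbalanced data.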
Basically, data mining is a key aspect of data analytics. Some even consider the former as essential to execute before the latter. While data analytics is the complete package and involves most components needed to examine a data set and extract valuable information, data mining focuses specifically on identifying hidden patterns.
That’s just the surface-level comparison though. The image above gives an overview of how the two differ.
One such difference is the presence of a hypothesis. Data analytics usually requires coming up with one, as it aims to find specific answers. Data mining, on the other hand, generally doesn’t need one to test or prove. The expected outputs are patterns or trends, which don’t require coming up with a statement or fact to test.
However, that doesn’t mean you mine data blindly. You still have a goal, whether it’s to come up with a recommender system or identify predictors of a certain dimension. Ultimately though, you strive to come up with data patterns or trends. For data analysis on the other hand, you’re expected to come up with valuable and actionable insights, usually in relation to a predetermined hypothesis.
In machine learning, there’s something called the “No Free Lunch” theorem. In a nutshell, it states that no one algorithm works best for every problem. It’s especially relevant for supervised learning. For example, you can’t say that neural networks are always better than decision trees or vice-versa. Furthermore, there are many factors at play, such as the size and structure of your dataset. As a result, you should try many different algorithms for your problem!
Top ML Algorithms
1. Linear Regression
Regression is a technique for numerical prediction. It is a statistical method that attempts to determine the strength of the relationship between one dependent variable and a series of other changing variables, the independent variables. Just as classification predicts categorical labels, regression predicts a continuous value. For example, we may wish to predict the salary of university graduates with 5 years of work experience. We use regression to determine how much specific factors influence the dependent variable.
Linear regression attempts to model the relationship between a scalar variable and explanatory variables by fitting a linear equation. For example, one might want to relate the weights of individuals to their heights using a linear regression model.
Some linear regression implementations use the Akaike information criterion (AIC) for model selection. The AIC is a measure of the relative goodness of fit of a statistical model.
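As a rough illustration of the height and weight example, the sketch below fits a line by ordinary least squares and compares it with an intercept-only model. The data points are made up, and the AIC is computed with the common least-squares form n·ln(RSS/n) + 2k, which drops constant terms and counts only the regression coefficients in k; treat both as assumptions of this sketch.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_sum_of_squares(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions))


def aic(n, rss, k):
    """AIC for a least-squares fit, up to an additive constant."""
    return n * math.log(rss / n) + 2 * k


heights = [150, 160, 170, 180, 190]   # cm, hypothetical
weights = [52, 58, 69, 77, 84]        # kg, hypothetical

slope, intercept = fit_line(heights, weights)
rss_line = residual_sum_of_squares(weights, [slope * h + intercept for h in heights])

mean_weight = sum(weights) / len(weights)
rss_mean = residual_sum_of_squares(weights, [mean_weight] * len(weights))

aic_line = aic(len(weights), rss_line, k=2)   # slope and intercept
aic_mean = aic(len(weights), rss_mean, k=1)   # intercept only
print(f"weight = {slope:.2f} * height + {intercept:.1f}")
print(f"AIC line={aic_line:.2f}, AIC mean-only={aic_mean:.2f}")
```

The model with the lower AIC is preferred; here the line wins comfortably because the extra parameter buys a large drop in residual error.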
2. Logistic Regression
Logistic regression is a classification model. It uses input variables to predict a categorical outcome variable, which can take on one of a limited set of class values. A binomial logistic regression handles two output categories. A multinomial logistic regression allows for more than two classes. An example of logistic regression is classifying a binary condition as “healthy” / “not healthy”. Logistic regression applies the logistic sigmoid function to weighted input values to generate a prediction of the data class.
A logistic regression model estimates the probability of a dependent variable as a function of independent variables. The dependent variable is the output that we are trying to predict. The independent variables or explanatory variables are the factors that we feel could influence the output. Multiple regression refers to regression analysis with two or more independent variables. Multivariate regression, on the other hand, refers to regression analysis with two or more dependent variables.
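The prediction step described above, a sigmoid applied to weighted inputs, can be sketched as follows. The weights, bias, feature values, and the 0.5 threshold are hypothetical; in practice the weights are learned from data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Estimated probability of the positive class for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_label(weights, bias, features, threshold=0.5):
    probability = predict_proba(weights, bias, features)
    return "healthy" if probability >= threshold else "not healthy"


# Hypothetical, already-fitted coefficients for two input variables.
weights, bias = [1.5, -2.0], 0.25

print(predict_proba(weights, bias, [2.0, 0.5]))   # z = 2.25, well above 0.5
print(predict_label(weights, bias, [0.0, 1.0]))   # z = -1.75 -> "not healthy"
```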
3. Linear Discriminant Analysis
Logistic Regression is a classification algorithm traditionally limited to two-class classification problems. If you have more than two classes, then the Linear Discriminant Analysis (LDA) algorithm is the preferred linear classification technique.
The representation of LDA is pretty straightforward. It consists of statistical properties of your data, calculated for each class. For a single input variable this includes:
The mean value for each class.
The variance calculated across all classes.
We make predictions by calculating a discriminant value for each class and then predicting the class with the largest value. The technique assumes that the data has a Gaussian distribution, so it is a good idea to remove outliers from your data beforehand. It’s a simple and powerful method for classification predictive modelling problems.
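For a single input variable, the class means and shared variance described above are enough to make predictions. The sketch below uses the usual one-dimensional discriminant, x·mean/variance − mean²/(2·variance) + ln(prior), on made-up data with three classes; the function names are illustrative.

```python
import math


def fit_lda(xs, labels):
    """Estimate per-class means and priors plus one variance pooled across classes."""
    classes = sorted(set(labels))
    n = len(xs)
    means, priors = {}, {}
    for c in classes:
        members = [x for x, label in zip(xs, labels) if label == c]
        means[c] = sum(members) / len(members)
        priors[c] = len(members) / n
    pooled_variance = sum((x - means[label]) ** 2
                          for x, label in zip(xs, labels)) / (n - len(classes))
    return means, priors, pooled_variance


def predict(x, means, priors, variance):
    """Return the class with the largest discriminant value."""
    scores = {
        c: x * means[c] / variance - means[c] ** 2 / (2 * variance) + math.log(priors[c])
        for c in means
    }
    return max(scores, key=scores.get)


xs = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8, 5.0, 5.2, 4.8]
labels = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
means, priors, variance = fit_lda(xs, labels)

print(predict(1.1, means, priors, variance))   # "a"
print(predict(3.1, means, priors, variance))   # "b"
print(predict(4.9, means, priors, variance))   # "c"
```

With equal priors and a shared variance, the largest discriminant simply belongs to the class whose mean is closest to the input.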
4. Classification and Regression Trees
Prediction trees predict a response or class Y from inputs X1, X2, …, Xn. If the response is continuous, it is a regression tree; if it is categorical, it is a classification tree. At each node of the tree, we check the value of one of the inputs Xi. Depending on the (binary) answer, we continue to the left or to the right sub-branch. When we reach a leaf, we find the prediction.
Contrary to linear or polynomial regression, which are global models, trees try to partition the data space into parts small enough that a simple, different model can be applied to each part. The non-leaf part of the tree is just the procedure that determines, for each data point x, which model we will use to classify it.
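The walk from the root to a leaf can be sketched with a small hand-built tree. The structure, thresholds, and labels below are hypothetical; a real tree would be learned from data by choosing the splits.

```python
# Each internal node tests one input Xi against a threshold;
# each leaf stores the prediction for its region of the data space.
tree = {
    "feature": 0, "threshold": 30.0,
    "left": {"prediction": "low"},
    "right": {
        "feature": 1, "threshold": 2.5,
        "left": {"prediction": "medium"},
        "right": {"prediction": "high"},
    },
}


def predict(node, x):
    """Follow binary tests from the root until a leaf is reached."""
    while "prediction" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["prediction"]


print(predict(tree, [25.0, 9.0]))   # first test sends it left -> "low"
print(predict(tree, [40.0, 1.0]))   # right, then left -> "medium"
print(predict(tree, [40.0, 3.0]))   # right, then right -> "high"
```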
5. Naive Bayes
A Naive Bayes Classifier is a supervised machine-learning algorithm that applies Bayes’ Theorem under the assumption that features are statistically independent. This naive assumption means that, given the class, knowing the value of one input variable tells you nothing about the others. Despite this assumption, it has proven itself to be a classifier with good results.
Naive Bayes Classifiers rely on Bayes’ Theorem, which is based on conditional probability or, in simple terms, the likelihood that an event (A) will happen given that another event (B) has already happened. Essentially, the theorem allows a hypothesis to be updated each time new evidence is introduced. The equation below expresses Bayes’ Theorem in the language of probability:
P(A | B) = P(B | A) × P(A) / P(B)
Let’s explain what each of these terms means.
“P” is the symbol to denote probability.
P(A | B) = The probability of event A (hypothesis) occurring given that B (evidence) has occurred.
P(B | A) = The probability of the event B (evidence) occurring given that A (hypothesis) has occurred.
P(A) = The probability of event A (hypothesis) occurring.
P(B) = The probability of event B (evidence) occurring.
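A short numeric example shows how the terms combine. The probabilities below are invented for illustration: a hypothesis A with a 1% prior, and evidence B that appears 90% of the time when A is true and 5% of the time when it is not.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Bayes' Theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    # P(B) via the law of total probability over A and not-A.
    p_b = likelihood_b_given_a * prior_a + likelihood_b_given_not_a * (1 - prior_a)
    return likelihood_b_given_a * prior_a / p_b


p_a_given_b = posterior(prior_a=0.01,
                        likelihood_b_given_a=0.9,
                        likelihood_b_given_not_a=0.05)
print(f"P(A | B) = {p_a_given_b:.3f}")   # about 0.154
```

Even with strong evidence, the posterior stays modest because the prior P(A) was small; each new piece of evidence can be folded in by using the posterior as the next prior.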
6. K-Nearest Neighbors
k-nearest neighbours (or k-NN for short) is a simple machine learning algorithm that categorizes an input by using its k nearest neighbours.
For example, suppose a k-NN algorithm is given data points for the weight and height of specific men and women. To determine the gender of an unknown input (the green point in the plot), k-NN looks at the k nearest neighbours and, if most of them are male, determines that the input’s gender is male. This method is a very simple and logical way of labelling unknown inputs, and it often works well.
We can also use k-NN in a variety of machine learning tasks; for example, in computer vision, k-NN can help identify handwritten letters, and in gene expression analysis, the algorithm can determine which genes contribute to a certain characteristic. Overall, k-nearest neighbours provides a combination of simplicity and effectiveness that makes it an attractive algorithm to use for many machine learning tasks.
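The height and weight example can be sketched in a few lines. The training points, the query, and k = 3 are made up for illustration; Euclidean distance is assumed, and in practice the features would usually be rescaled first.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# (height in cm, weight in kg) with a label, all hypothetical.
train = [
    ((180, 80), "male"), ((175, 78), "male"), ((185, 90), "male"),
    ((160, 55), "female"), ((165, 60), "female"), ((158, 52), "female"),
]

print(knn_predict(train, (178, 82), k=3))   # "male"
print(knn_predict(train, (161, 54), k=3))   # "female"
```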
7. Learning Vector Quantization
A downside of K-Nearest Neighbors is that you need to hang on to your entire training dataset. The Learning Vector Quantization algorithm (or LVQ for short) is an artificial neural network algorithm that allows you to choose how many training instances to hang onto and learns exactly what those instances should look like.
The representation for LVQ is a collection of codebook vectors. They are selected randomly in the beginning and adapted to best summarize the training dataset over a number of iterations of the learning algorithm. Once learned, the codebook vectors can be used to make predictions just like K-Nearest Neighbors. The most similar neighbour (best matching codebook vector) is found by calculating the distance between each codebook vector and the new data instance. The class value (or real value in the case of regression) for the best matching unit is then returned as the prediction. You get the best results if you rescale your data to have the same range, such as between 0 and 1.
If you discover that KNN gives good results on your dataset, try using LVQ to reduce the memory requirements of storing the entire training dataset.
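A minimal sketch of the basic LVQ update follows, under a few assumptions: the data is already scaled to the 0 to 1 range, the learning rate decays linearly, and the initial codebook vectors are fixed rather than random so the run is reproducible. The best matching vector is pulled toward a training instance of the same class and pushed away from one of a different class.

```python
import math


def best_match(codebook, x):
    """Index of the codebook vector closest to x."""
    return min(range(len(codebook)), key=lambda i: math.dist(codebook[i][0], x))


def train_lvq(data, initial_codebook, learning_rate=0.3, epochs=20):
    codebook = [(list(vector), label) for vector, label in initial_codebook]
    for epoch in range(epochs):
        rate = learning_rate * (1 - epoch / epochs)   # linearly decaying rate
        for x, label in data:
            vector, vector_label = codebook[best_match(codebook, x)]
            direction = 1 if vector_label == label else -1
            for d in range(len(vector)):
                vector[d] += direction * rate * (x[d] - vector[d])
    return codebook


def predict(codebook, x):
    return codebook[best_match(codebook, x)][1]


data = [
    ((0.10, 0.10), "a"), ((0.20, 0.15), "a"), ((0.15, 0.20), "a"),
    ((0.80, 0.90), "b"), ((0.90, 0.80), "b"), ((0.85, 0.85), "b"),
]
# Two codebook vectors stand in for six training instances.
codebook = train_lvq(data, [((0.4, 0.4), "a"), ((0.6, 0.6), "b")])

print(predict(codebook, (0.12, 0.18)))   # "a"
print(predict(codebook, (0.88, 0.82)))   # "b"
```

Only the two learned vectors need to be kept for prediction, which is the memory saving over k-NN mentioned above.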
8. Bagging and Random Forest
A Random Forest consists of a collection or ensemble of simple tree predictors, each capable of producing a response when presented with a set of predictor values. For classification problems, this response takes the form of a class membership, which associates, or classifies, a set of independent predictor values with one of the categories present in the dependent variable. Alternatively, for regression problems, the tree response is an estimate of the dependent variable given the predictors.
A Random Forest consists of an arbitrary number of simple trees, which determine the final outcome. For classification problems, the ensemble of simple trees votes for the most popular class. In the regression problem, we average responses to obtain an estimate of the dependent variable. Using tree ensembles can lead to significant improvement in prediction accuracy (i.e., better ability to predict new data cases).
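The way an ensemble combines its trees can be sketched without training anything. In the example below, each "tree" is a hypothetical hand-written rule standing in for a fitted tree; only the aggregation step (majority vote for classification, averaging for regression) is the point.

```python
from collections import Counter
from statistics import mean


def forest_classify(trees, x):
    """Each tree votes for a class; the most popular class wins."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


def forest_regress(trees, x):
    """Average the numeric responses of all trees."""
    return mean(tree(x) for tree in trees)


# Stand-ins for fitted classification trees (hypothetical rules).
classification_trees = [
    lambda x: "spam" if x["links"] > 3 else "ham",
    lambda x: "spam" if x["caps_ratio"] > 0.5 else "ham",
    lambda x: "spam" if x["length"] < 20 else "ham",
]
message = {"links": 5, "caps_ratio": 0.2, "length": 10}
print(forest_classify(classification_trees, message))   # two of three vote "spam"

# Stand-ins for fitted regression trees.
regression_trees = [lambda x: 10.0, lambda x: 12.0, lambda x: 14.0]
print(forest_regress(regression_trees, message))         # 12.0
```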
9. SVM
A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes. SVMs are more commonly used in classification problems, so that is what we will focus on in this post.
SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes, as shown in the image below.
You can think of a hyperplane as a line that linearly separates and classifies a set of data.
Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We, therefore, want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it.
So when we add new test data, whichever side of the hyperplane it lands on decides the class that we assign to it.
The distance between the hyperplane and the nearest data point from either set is the margin. The goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of correct classification of new data.
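The two ideas above, classifying by the side of the hyperplane and measuring the margin, can be sketched for a fixed hyperplane w·x + b = 0. The weights, bias, and points are hypothetical, and nothing is trained here: an actual SVM searches for the w and b that maximize this margin.

```python
import math


def decision_value(w, b, x):
    """Signed score w·x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    return 1 if decision_value(w, b, x) >= 0 else -1


def distance_to_hyperplane(w, b, x):
    return abs(decision_value(w, b, x)) / math.hypot(*w)


def margin(w, b, points):
    """Distance from the hyperplane to the nearest point."""
    return min(distance_to_hyperplane(w, b, x) for x in points)


w, b = (1.0, 1.0), -5.0                    # the line x1 + x2 = 5
points = [(1, 1), (2, 1), (4, 4), (5, 3)]

print([classify(w, b, p) for p in points])   # [-1, -1, 1, 1]
print(margin(w, b, points))                  # nearest point is (2, 1), distance sqrt(2)
```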
But the data is rarely ever as clean as our simple example above. A dataset will often look more like the jumbled balls below which represent a linearly non-separable dataset.
10. Boosting and AdaBoost
Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers. We do this by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. We can add models until the training set is predicted perfectly or a maximum number of models is reached.
AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting. Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.
AdaBoost is used with short decision trees. After the first tree is created, the performance of the tree on each training instance is used to weight how much attention the next tree should pay to each training instance. Training data that is hard to predict is given more weight, whereas easy-to-predict instances are given less weight. Models are created sequentially, one after the other, each updating the weights on the training instances that affect the learning performed by the next tree in the sequence. After all the trees are built, predictions are made for new data, and each tree’s vote is weighted by how accurate it was on the training data.
Because so much attention is put on correcting mistakes by the algorithm it is important that you have clean data with outliers removed.
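One round of the reweighting described above can be sketched as follows, using the standard AdaBoost formulas for labels in {-1, +1}. The labels and the weak learner's predictions are made up; fitting the weak learner itself is omitted.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """Return the learner's vote weight (alpha) and the updated instance weights."""
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified instances (t * p == -1) are scaled up, correct ones scaled down.
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]          # the weak learner misses the second instance
weights = [1 / len(y_true)] * len(y_true)

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 3))                        # 0.693
print([round(w, 3) for w in new_weights])     # [0.125, 0.5, 0.125, 0.125, 0.125]
```

The single misclassified instance ends up carrying half of the total weight, so the next tree is pushed to get it right. This also shows why outliers are a problem: a mislabeled point keeps gaining weight round after round.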
Summary
A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is “which algorithm should I use?” The answer to the question varies depending on many factors, including: (1) The size, quality, and nature of data; (2) The available computational time; (3) The urgency of the task; and (4) What you want to do with the data.
Even an experienced data scientist cannot tell which algorithm will perform the best before trying different algorithms. Although there are many other Machine Learning algorithms, these are the most popular ones. If you’re a newbie to Machine Learning, these would be a good starting point to learn.
The foundations of most algorithms lie in linear algebra, multivariable calculus, and optimization methods. Most algorithms use a sequence of combinations to estimate an objective function given a set of data, and the sequence order and included methods distinguish one algorithm from another. It’s helpful to learn enough math to read the development papers associated with key algorithms in the field, as many other methods (or one’s own innovations) include pieces of those algorithms. It’s like learning the language of machine learning. Once you are fluent in it, it’s pretty easy to modify algorithms as needed and create new ones likely to improve on a problem in a short period of time.
Matrix factorization: a simple, beautiful way to do dimensionality reduction —and dimensionality reduction is the essence of cognition. Recommender systems would be a big application of matrix factorization. Another application I’ve been using over the years (starting in 2010 with video data) is factorizing a matrix of pairwise mutual information (or pointwise mutual information, which is more common) between features, which can be used for feature extraction, computing word embeddings, computing label embeddings (that was the topic of a recent paper of mine [1]), etc.
Used in a convolutional setting, this acts as an excellent unsupervised feature extractor for images and videos. There’s one big issue though: it is fundamentally a shallow algorithm. Deep neural networks will quickly outperform it if any kind of supervision labels are available.
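To make the idea of matrix factorization concrete, here is a minimal sketch that approximates a matrix M by the outer product of two vectors u and v (a rank-1 factorization) using alternating least squares. The matrix is a made-up example that happens to be exactly rank 1; real recommender or embedding pipelines use more factors and regularization, which this sketch leaves out.

```python
def rank1_factorize(matrix, iterations=10):
    """Find u, v such that matrix[i][j] is approximately u[i] * v[j]."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0] * cols
    u = [0.0] * rows
    for _ in range(iterations):
        # With v fixed, the least-squares optimum for u is M v / (v·v).
        vv = sum(x * x for x in v)
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) / vv for i in range(rows)]
        # With u fixed, the least-squares optimum for v is M^T u / (u·u).
        uu = sum(x * x for x in u)
        v = [sum(matrix[i][j] * u[i] for i in range(rows)) / uu for j in range(cols)]
    return u, v


# A 3x2 matrix built as the outer product of [1, 2, 3] and [4, 5].
M = [[4.0, 5.0], [8.0, 10.0], [12.0, 15.0]]
u, v = rank1_factorize(M)

reconstruction = [[ui * vj for vj in v] for ui in u]
print(reconstruction)   # close to M
```

Six numbers are summarized by five here, which is a trivial saving; the same idea applied to a large, sparse user-by-item matrix is where the dimensionality reduction pays off.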
See how well you synchronize to the lyrics of the popular hit “Dance Monkey.” This in-browser experience uses the Facemesh model for estimating key points around the lips to score lip-syncing accuracy.
Load in a pre-trained Body-Pix model from the TensorFlow.js team so that you can locate all pixels in an image that are part of a body, and what part of the body they belong to. Clone this to make your own TensorFlow.js powered projects to recognize body parts in images from your webcam and more!
This demo shows how we can use a pre made machine learning solution to recognize objects (yes, more than one at a time!) on any image you wish to present to it. Even better, not only do we know that the image contains an object, but we can also get the co-ordinates of the bounding box for each object it finds, which allows you to highlight the found object in the image.
For this demo we are loading a model using the ImageNet-SSD architecture, to recognize 90 common objects it has already been taught to find from the COCO dataset.
If what you want to recognize is in that list of things it knows about (for example a cat, dog, etc), this may be useful to you as is in your own projects, or just to experiment with Machine Learning in the browser and get familiar with the possibilities of machine learning.
If you are feeling particularly confident you can check out our GitHub documentation (https://github.com/tensorflow/tfjs-models/tree/master/coco-ssd) which goes into much more detail for customizing various parameters to tailor performance to your needs.
This demo shows how we can use a pre-made machine learning solution to classify images (aka an image classifier). It should be noted that this model works best when a single item is in the image at a time. Busy images may not work so well. You may want to try our demo for Multiple Object Detection (https://codepen.io/jasonmayes/pen/qBEJxgg) for that.
For this demo we are loading a model using the MobileNet architecture, to recognize 1000 common objects it has already been taught to find from the ImageNet data set (http://image-net.org/).
If what you want to recognize is in that list of things it knows about (for example a cat, dog, etc), this may be useful to you as is in your own projects, or just to experiment with Machine Learning in the browser and get familiar with the possibilities of machine learning.
Please note: This demo loads an easy-to-use JavaScript class made by the TensorFlow.js team to do the hard work for you, so no machine learning knowledge is needed to use it.
If you were looking to learn how to load in a TensorFlow.js saved model directly yourself then please see our tutorial on loading TensorFlow.js models directly.
If you want to train a system to recognize your own objects, using your own data, then check out our tutorials on “transfer learning”.
The hello world for TensorFlow.js 🙂 This is the absolute minimum needed to import TensorFlow.js into your website; it simply prints the loaded TensorFlow.js version. From here we can do great things. Clone this to make your own TensorFlow.js powered projects or if you are following a tutorial that needs TensorFlow.js to work.
Are you interested in becoming an AWS Certified Machine Learning Specialist? If so, then this exam preparation blog is for you! The blog contains over 100 quiz and practice exam questions, as well as detailed answers. The questions are very similar to those you will encounter on the actual exam, so this is a great way to prepare. In addition, the blog also includes cheat sheets and illustrations to help you understand the concepts better.
Bring your own algorithm to an MLOps Pipeline: Architecture
I used DoWhy to create some synthetic data. The causal graph is shown below. Treatment is v0 and y is the outcome. True ATE is 10. I also used the DoWhy package to find ATE (propensity score matching) and I obtained ~10, which is great. For fun, I fitted an OLS model (y ~ W1 + W2 + v0 + Z1 + Z2) on the data and, surprisingly, the beta for the treatment v0 is 10. I was expecting something different from 10, because of the confounders. What am I missing here? https://preview.redd.it/ve6753p75yqc1.png?width=458&format=png&auto=webp&s=0935bbb15fba1dc63bdb3f8f445dca73fa2988e9 submitted by /u/Amazing_Alarm6130
Lowly BI person here -- just curious outside of maths, data modeling, and drinking scotch in the library, do data scientists make an effort to automate their work? Like are there tools or scripts you all are building to be more efficient or is it not really a part of the job? submitted by /u/Marion_Shepard
I work for a large chain grocer and I've been tasked with "Missed Opportunity." Missed Opportunity (MO) is defined as such: When a customer wants to buy an item, and the item IS stocked, but is not on the shelf. I.e. in most cases, this translates to the item is in the backrooms. But it could be the case that someone grabbed an item and did not return it to the right place. Now my goal is to look at what items (in the past couple of months) are experiencing the "most" MO, quantified by $ value or units. The limited amount of data I have is sales. I can tell you what time an item was sold, how many units, in what store it was sold, and the price. I do NOT have: anything related to inventory, even delivery dates. I also do NOT have a "true" dataset of actual MO being experienced. Thus, how in the hell do I figure out my goal with this little data??? The only thing that I have been trying is to cluster stores (K-means) based off sales of a particular item, and if the store is underperforming in its cluster, then it could be somewhat assumed that it may be experiencing MO. However, this runs into its own problems and assumptions. So what other statistical methods, techniques, manipulations, etc. could possibly help me here? I feel like I need to get pretty creative submitted by /u/bernful [link] [comments]
Robots intended to be used by the general public, with the ability to execute critical tasks, must be governed by a trustless, transparent, auditable authorisation system. There are 3 main points of vulnerability for a robot deployed into the real world:
1. Malicious intent from the robot
2. Malicious intent from the robot manufacturer
3. Malicious intent from hackers
A blockchain based authorisation system seems like the perfect solution. The blockchain authorisation control system will have 4 fundamental aspects:
1. Soul-bound NFTs
2. Multi-Sig
3. Roles
4. Smart contract events
Read the full proposed approach here: https://github.com/dev-diaries41/robo-auth What are your thoughts? submitted by /u/d41_fpflabs
Hey there, I am training a deep learning model using a 400 GB dataset stored on an external SSD, and I noticed that training is very slow. Any tricks to make data loading faster? PS: I have to use the external disk. submitted by /u/bkffadia
Curious to hear from those that are building and deploying products with AI copilots. How are you tracking the interactions? And are you feeding the interaction back into the model for retraining? Put together a how-to to do this with an OS Copilot (Vercel AI SDK) and Segment and would love any feedback to improve the spec: https://segment.com/blog/instrumenting-user-insights-for-your-ai-copilot/ submitted by /u/n2parko
Not only with job postings, but I know a few individuals who work as data scientists at reputable companies, and often they are tasked with the responsibilities of a data engineer. I believe the issue stems from a lack of data literacy among companies and data managers. In terms of job postings, most of them require extensive experience in SQL, data cleaning, ETL, Pipelines and data quality-related tasks, which I believe fall within the realm of data engineering. I would like to hear your thoughts on this. Have any of you experienced something similar or perhaps dealt with it firsthand? submitted by /u/trafalgar28 [link] [comments]
I have the following problem. Imagine I have a 'supervised' dataset of 1D curves with inputs and outputs, where the input is a modulated noisy signal and the output is the cleaned desired signal. Is there a consensus in the machine learning community on how to tackle this simple problem? Have you ever worked on anything similar? What algorithm did you end up using? Example: https://imgur.com/JYgkXEe submitted by /u/XmintMusic
State of the art TTS question
Hey! I'm currently working on a project and I'd like to implement speech using TTS. I tried many things and I can't seem to find something that fits my needs. I haven't worked on TTS for a while now, so I was wondering if maybe there were newer technologies I could use. Here is what I'm looking for:
I need it to be quite fast and without too many sound artifacts (I tried Bark and while the possibility of manipulating emotion is quite remarkable, the generated voice is full of artifacts and noise)
It'd be a bonus if I could stream the audio and pipe it through other things, I'd like to apply an RVC model on top of it (live)
Another 'nice to have' is to have some controls over the emotions or tone of the voice.
I tried these so far (either myself or through demos):
TortoiseTTS and EdgeTTS seem to have a nice voice quality but are relatively monotone.
Bark, as I said, is very good at emotions and controls but has lots of artifacts in the voice; if I have time I'd try to apply postprocessing but idk to what extent it can help
OpenAI models don't have much emotion IMO
Same as ElevenLabs
I used Uberduck in the past but it seems a lot of fun functionalities disappeared.
If you have any advice or suggestions, or if you think I should try some things further, feel free to reply! I also want to thank everyone in advance! Have a nice day! submitted by /u/Zireaone
Currently working on a classification model, which entails data cleaning. We've got 8000 images categorized into 3 classes. After removing duplicates and corrupted images, what else should we consider? submitted by /u/fardin__khan
Hi, I've finished Andrew Ng's course on Coursera. I think I've got the basics. I've started learning ML for my master's thesis. I want to develop a method to estimate scope 3 emissions. I studied business and I do not have any Python background except for a 6-month data analytics bootcamp. I've got the data needed for my thesis, but when I try to work on it, I'm not sure what I'm doing, and ofc there are a sh*t ton of bugs and errors. Do I need to just keep trying to push through and learn through the experience by working on my thesis, or do I need to study more? I've been considering buying the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron. Any guidance/recommendation would be much appreciated! submitted by /u/qheeeee
I'm currently pursuing my undergraduate degree in robotics engineering and have been immersing myself in concepts related to machine learning, deep learning, and computer vision, both modern and traditional. With strong programming skills and a habit of regularly reading research papers, I'm eager to understand the job landscape in my field and pursue a Phd. Are there ample opportunities available? What can I expect in terms of salaries and future prospects? Additionally, I'm curious about the comparative job market between natural language processing (NLP) and computer vision. Given my background and interests, what areas or skills should I focus on learning to enhance my career prospects? Thanks in advance for your time and advice. submitted by /u/MD24IB [link] [comments]
Hey again everyone! Checking back in with more updates on Zen because of how enthusiastic the community has been about it! We've done a lot of work the past two months or so since I last posted, but first I'll drop a couple of the most important things / highlights about the app here: Zen is still a candidate / seeker-first job board. This means we have no ads, we have no promoted jobs from companies who are paying us, we have no recruiters, etc. The whole point of Zen is to help you find jobs quickly at companies you're interested in without any headaches. On that point, we'll send you emails notifying you when companies you care about post new jobs that match your preferences, so you don't need to continuously check their job boards. In the past two months, we've made some major changes! Many of them are discussed in the changelog: We've continued adding postings and companies, so you can now explore over 170k open jobs at >6,200 companies We've continued to completely overhaul the UX of the app We've added some new preference filters to help you filter for relevant jobs better We've launched a premium tier. The reason for this was as we've grown (largely thanks to all of your support!) our costs have continued to go up significantly, and we want to be able to keep providing an ad-free, spam-free, promotion-free service to all of you without making any compromises. We're launching on ProductHunt today! You can check out our launch here I started building Zen when I was on the job hunt and realized it was harder than it should've been to just get notifications when a company I was interested in posted a job that was relevant to me. 
And we hope that this goal -- to cut out all the noise and make it easier for you to find great matches -- is valuable for everyone here 🙂 Here are the original posts: https://www.reddit.com/r/datascience/comments/1ad5lxa/update_2_i_built_an_app_to_make_my_job_search_a/ https://www.reddit.com/r/datascience/comments/183562x/update_i_built_an_app_to_make_my_job_search_a/ https://www.reddit.com/r/datascience/comments/17s5fyq/i_built_an_app_to_make_my_job_search_a_little/ And here's one more link to the app submitted by /u/eipi-10 [link] [comments]
https://x.com/vitaliychiley/status/1772958872891752868?s=20
Shill disclaimer: I was the pretraining lead for the project
DBRX deets:
16 Experts (12B params per single expert; top_k=4 routing)
36B active params (132B total params)
trained for 12T tokens
32k sequence length training
submitted by /u/artificial_intelect
Hello everyone, I'm a prospective graduate student who will be starting my studies in September this year, specializing in AIoT (Artificial Intelligence of Things) Systems. Recently, I've been reading papers from journals like INFOCOM and SIGCOMM, and I've noticed that they mostly focus on relatively low-level aspects of operating systems, including GPU/CPU scheduling, optimization of deep learning model inference, operator optimization, cross-platform migration, and deployment. I find it challenging to grasp the implementation details of these works at the code level. When I looked at the implementations of these works uploaded on GitHub, I found it relatively difficult to understand. My primary programming languages are Java and Python. During my undergraduate studies, I gained proficiency in implementing engineering projects and ideas using Python, especially in the fields of deep learning and machine learning. However, I lack experience and familiarity with C/C++ (many of the aforementioned works are based on C/C++). Therefore, I would like to ask for advice from senior professionals and friends on which areas of knowledge I should focus on. Do I need to learn CUDA programming, operating system programming, or other directions? Any recommended learning paths would be greatly appreciated. PS: Recently, I have started studying the MIT 6.S081 Operating System Engineering course. Thank you all sincerely for your advice. submitted by /u/MaTwickenham [link] [comments]
Hi all - I wanted to share an app I’ve been working on with a small team over the past year that I thought this community would be interested in. Odyssey is a completely native Mac app for creating remarkable art, getting work done, and automating repetitive tasks with the power of AI and machine learning models.
We just made a major feature update and added the ability to create your own Widgets. Odyssey Widgets are fully interactive mini applications that live in their own windows or panels and are driven by a workflow. This means you can take a workflow you create with Odyssey and add it directly to your desktop. So, as an example, you could generate an image, chat with a locally run chatbot, run bulk image processing, etc. straight from your desktop without even opening the Odyssey app. Widgets can be built with Odyssey and triggered from the Odyssey logo in your Mac’s menu.
https://i.redd.it/8s9s6i0clvqc1.gif
We're in public beta but here's a full list of everything Odyssey supports:
Image generation and processing
- Run Stable Diffusion 1.5, SDXL, SDXL Lightning, and SDXL Turbo locally or connect your Stable Diffusion API key
- Add custom models & LoRAs
- ControlNet support including canny edges, pose detection, depth estimation, and QR Code Monster
- Inpainting and outpainting
- Super resolution models (Best Buddy GAN, Ultrasharp 4x, Remacri, and ESRGAN)
- Multiple image segmentation models
- Erase objects
- Dozens of image processing nodes including aspect ratio, resizing, and extracting dominant colors
- Custom image transitions for powerful slideshows
Large language models and math equations
- Run Llama2 locally or connect your ChatGPT API key
- Supports both chatbot mode and instructions mode
- Solver node for word problems and math nodes for complex equations
- Lots of updates coming here in the next few weeks
Automation and batch workflows
- Batch image and text nodes support hundreds of images and lines of text at once
- Remove backgrounds, upscale, change aspect ratios, and run dozens of image processors in bulk
Private, customizable, and shareable
- No images, chats, or inputs are stored or accessible by the Odyssey team
- Completely private and secure. The only tracking is anonymized usage data to help us improve Odyssey
- Process your own data entirely locally
- No internet connection required to run local models
- Use your own API keys for ChatGPT and Stable Diffusion
- Easily save and share custom workflows
What’s coming soon:
- Custom LLMs & more text processing nodes - we are adding support for bringing in custom LLMs, document uploads, and more
- Batch text and workflow automation - we are building in document upload, batch text support, and an integration with Apple shortcuts
- Plug-in support - we are opening up Odyssey to third-party developers. If you’re interested, please reach out - would love to learn more from you as we work on building this out
Feel free to reach out to john@odysseyapp.io if you have any questions or feedback.
submitted by /u/creatorai
Project: https://github.com/DoMusic/Hybrid-Net A transformer-based hybrid multimodal model. Several transformer models each address a different problem in music information retrieval, and the information they generate is interdependent, so the models influence one another. It is an AI-powered multimodal project focused on music that generates chords, beats, lyrics, melody, and tabs for any song. submitted by /u/CheekProfessional146
Hey all, I've recently published a tutorial at Towards Data Science that explores a somewhat overlooked aspect of Retrieval-Augmented Generation (RAG) systems: the visualization of documents and questions in the embedding space: https://towardsdatascience.com/visualize-your-rag-data-evaluate-your-retrieval-augmented-generation-system-with-ragas-fc2486308557 While much of the focus in RAG discussions tends to be on the algorithms and data processing, I believe that visualization can help to explore the data and to gain insights into problematic subgroups within the data. This might be interesting for some of you, although I'm aware that not everyone is keen on this kind of visualization. I believe it can add a unique dimension to understanding RAG systems. submitted by /u/DocBrownMS
Today I Learned (TIL) You learn something new every day; what did you learn today? Submit interesting and specific facts about something that you just found out here.
Reddit Science This community is a place to share and discuss new scientific research. Read about the latest advances in astronomy, biology, medicine, physics, social science, and more. Find and submit new publications and popular science coverage of current research.