What’s the difference between a proxy and a VPN and why is one security stronger than the other? Which security feature is stronger and why?

Proxy vs VPN

You can translate the content of this page by selecting a language in the select box.

What’s the difference between a proxy and a VPN, and why is one security stronger than the other? Which security feature is stronger and why?

When it comes to online security, there are a number of different factors to consider. Two of the most popular methods for protecting your identity and data are proxy servers and VPNs. Both proxy servers and VPNs can help to mask your IP address and encrypt your traffic, but there are some key differences between the two. One major difference is that proxy servers only encrypt traffic going through the server, while VPNs encrypt all traffic from your device. This means that proxy servers are only effective if you’re using specific apps or visiting specific websites. VPNs, on the other hand, provide a more comprehensive solution as they can encrypt all traffic from your device, no matter where you’re accessing the internet from. Another key difference is that proxy servers tend to be less expensive than VPNs, but they also offer less privacy and security. When it comes to online security, proxy servers and VPNs both have their pros and cons. It’s important to weigh these factors carefully before decide which option is right for you.

VPN is virtual private network connects your incoming traffic and outgoing traffic to another network.

A proxy just relays your internet traffic. To websites you visit, your IP appears to be that of the proxy server.

A VPN is a type of proxy for which all the communication between your computer and the proxy server is encrypted. With a VPN, no one snooping your internet connection (e.g., your ISP) can see what websites you are visiting or what you are doing there. Security is much better.

VPN PROS:

What is a Proxy Server?

A proxy server is a computer system that performs as an intermediary in the request made by users. This type of server helps prevent an attacker from attacking the network and serves as a tool used to create a firewall.

The etymology of the word proxy means “a figure that can be used to represent the value of something”, this means that a proxy server represents or acts on behalf of the user. The fundamental purpose of proxy servers is to safeguard the direct connection of internet users and resources.


All requests made by the users from the internet go to the proxy server. The responses of the request return back to the proxy server for evaluation and then to the user. Proxy servers serve as an intermediary between the local network and the world wide web. Proxy servers are used for several reasons, such as to filter web content, to avert restrictions like parental blocks, to screen downloads and uploads, and to provide privacy when browsing the internet. The proxy server also prevents and protects the identity of the users.

There are different types of proxy servers used according to the different purposes of a request made by the clients and users. Proxies provide a valuable layer of security for your network and computers. It can be set up as web filters or firewalls which can protect computers from threats such as malware or ransomware. This extra security is also significant when linked with a secured gateway or attached security products. This way, network administrators can filter traffic according to its level of safety or traffic consumption of the network.

Are Proxies and VPNs the same?

Proxies are not the same as VPNs. The only similarity between Proxies and VPNs is that they both connect you to the internet via an intermediary server. An online proxy forwards your traffic to its destination, while a VPN, on the other hand, encrypts all traffic between the VPN server and your device. Here are some more differences between proxies and VPNs:

If you are looking for an all-in-one solution to help you prepare for the AWS Cloud Practitioner Certification Exam, look no further than this AWS Cloud Practitioner CCP CLFC01 book below.


  • VPNs help you encrypt your traffic while proxy servers don’t do that.
  • Proxies don’t protect you from government surveillance, ISP tracking, and hackers, which is why they are never used to handle sensitive information. VPN protects you from the same.
  • VPNs function on the operating system level while proxies work on the application level.
  • Proxies only reroute the traffic of a specific app or browser while VPNs reroute it through a VPN server.
  • Since VPNs need to encrypt your sensitive data, they can be slower than proxies.
  • Most proxy servers are free while most VPNs are paid. Don’t trust free VPN services as they can compromise your data.
  • A VPN connection is found to be more reliable than proxy server connections that can drop more frequently.

Why Is a VPN Considered to be More Secure Than a Proxy Server?

By now, you might have already noticed the reason since we have discussed it. The question is: Is a VPN better than a proxy? The simple answer is “Yes.”

Invest in your future today by enrolling in this Azure Fundamentals - Microsoft Azure Certification and Training ebook below. This Azure Fundamentals Exam Prep Book will prepare you for the Azure Fundamentals AZ900 Certification Exam.


How? A VPN provides privacy and security by routing your traffic through a secure VPN server and encrypting your traffic while a proxy, on the other hand, simply passes that traffic through a mediating server. It doesn’t necessarily offer any extra protection unless you use some extra features.

Proxy PROS:

However, when the motivation is to avoid geo-blocking, a proxy is more likely to be successful. Websites that need to do geo-blocking can normally tell that your IP is that of a VPN server. They don’t account for all the possible proxy servers.

But the problem here is they use datacenter IP (the server IP),

Also VPNs save logs and save EVERYTHING you do.

In the other hand, there are many types of proxy: datacenter proxy (worst one), Residential proxy, Mobile proxy 4G, and Mobile Proxy 5G.

If you use residential proxy or mobile proxy it might be much better and safer for many reasons:

  1. Residential IP means that the Proxy use a regular ISP like comcast, Charter, Sprint, etc.
  2. They don’t save logs.
  3. The connection is not even direct, it goes to their server first and then to a a real device in another place.
  4. Websites like facebook and shopping sites won’t block you, because you use residential or mobile proxy, so they won’t know that you use a proxy to hide your real IP, while VPN will be easily detected.

Now people would say that the problem with socks5 residential and mobile proxy is the cost, because most of websites sells it on very expensive price.

I use a good cheap and very high quality socks5 residential proxy costs only 3 USD a month per dedicated residential proxy, and the traffic is unlimited.

And it is very fast because it is dedicated and also virgin with fraud score 0.

The website name is Liber8Proxy.com

With average increases in salary of over 25% for certified individuals, you’re going to be in a much better position to secure your dream job or promotion if you earn your AWS Certified Solutions Architect Associate our Cloud Practitioner certification. Get the books below to for real practice exams:

Use the promo codes: W6XM9XP4TWN9 or T6K9P4J9JPPR or 9LWMYKJ7TWPN or TN4NTERJYHY4 for AWS CCP eBook at Apple iBook store.


Use Promo Codes XKPHAATA6LRL 4XJRP9XLT9XL or LTFFY6JA33EL or HKRMTMTHFMAM or 4XHAFTWT4FN6 for AWS SAA-C03 eBook at Apple iBook store



Use Promo Codes EF46PT44LXPN or L6L9R9LKEFFR or TWELPA4JFJWM for Azure Fundamentals eBook at Apple iBook store.

Moreover socks5 residential proxy uses socks5 connection port with promixitron so it would cover your entire PC traffic.

Also their customers support are nice and they always online.

Source: https://qr.ae/pvWauF

What are the Top 10 AWS jobs you can get with an AWS certification in 2022 plus AWS Interview Questions

AWS Certified Cloud Practitioner Exam Preparation

You can translate the content of this page by selecting a language in the select box.

What are the Top 10 AWS jobs you can get with an AWS certification in 2022 plus AWS Interview Questions

AWS certifications are becoming increasingly popular as the demand for AWS-skilled workers continues to grow. AWS certifications show that an individual has the necessary skills to work with AWS technologies, which can be beneficial for both job seekers and employers. AWS-certified individuals can often command higher salaries and are more likely to be hired for AWS-related positions. So, what are the top 10 AWS jobs that you can get with an AWS certification?

1. AWS Solutions Architect / Cloud Architect:

AWS solutions architects are responsible for designing, implementing, and managing AWS solutions. They work closely with other teams to ensure that AWS solutions are designed and implemented correctly.

AWS Architects, AWS Cloud Architects, and AWS solutions architects spend their time architecting, building, and maintaining highly available, cost-efficient, and scalable AWS cloud environments. They also make recommendations regarding AWS toolsets and keep up with the latest in cloud computing.

Professional AWS cloud architects deliver technical architectures and lead implementation efforts, ensuring new technologies are successfully integrated into customer environments. This role works directly with customers and engineers, providing both technical leadership and an interface with client-side stakeholders.

What are the Top 10 AWS jobs you can get with an AWS certification in 2022 plus AWS Interview Questions
AWS SAA-C02 SAA-C03 Exam Prep

Average yearly salary: $148,000-$158,000 USD

2. AWS SysOps Administrator / Cloud System Administrators:

AWS sysops administrators are responsible for managing and operating AWS systems. They work closely with AWS developers to ensure that systems are running smoothly and efficiently.

A Cloud Systems Administrator, or AWS SysOps administrator, is responsible for the effective provisioning, installation/configuration, operation, and maintenance of virtual systems, software, and related infrastructures. They also maintain analytics software and build dashboards for reporting.

Average yearly salary: $97,000-$107,000 USD


3. AWS DevOps Engineer:

AWS devops engineers are responsible for designing and implementing automated processes for Amazon Web Services. They work closely with other teams to ensure that processes are efficient and effective.

AWS DevOps engineers design AWS cloud solutions that impact and improve the business. They also perform server maintenance and implement any debugging or patching that may be necessary. Among other DevOps things!

Average yearly salary: $118,000-$138,000 USD

What are the Top 10 AWS jobs you can get with an AWS certification in 2022 plus AWS Interview Questions
AWS Developer Associate DVA-C01 Exam Prep

4. AWS Cloud Engineer:

AWS cloud engineers are responsible for designing, implementing, and managing cloud-based solutions using AWS technologies. They work closely with other teams to ensure that solutions are designed and implemented correctly.

If you are looking for an all-in-one solution to help you prepare for the AWS Cloud Practitioner Certification Exam, look no further than this AWS Cloud Practitioner CCP CLFC01 book below.


5. AWS Network Engineer:

AWS network engineers are responsible for designing, implementing, and managing networking solutions using AWS technologies. They work closely with other teams to ensure that networking solutions are designed and implemented correctly.

Cloud network specialists, engineers, and architects help organizations successfully design, build, and maintain cloud-native and hybrid networking infrastructures, including integrating existing networks with AWS cloud resources.

Invest in your future today by enrolling in this Azure Fundamentals - Microsoft Azure Certification and Training ebook below. This Azure Fundamentals Exam Prep Book will prepare you for the Azure Fundamentals AZ900 Certification Exam.


Average yearly salary: $107,000-$127,000 USD

6. AWS Security Engineer:

AWS security engineers are responsible for ensuring the security of Amazon Web Services environments. They work closely with other teams to identify security risks and implement controls to mitigate those risks.

Cloud security engineers provide security for AWS systems, protect sensitive and confidential data, and ensure regulatory compliance by designing and implementing security controls according to the latest security best practices.

Average yearly salary: $132,000-$152,000 USD

What are the Top 10 AWS jobs you can get with an AWS certification in 2022 plus AWS Interview Questions
AWS Certified Security Specialty

7. AWS Database administrator:

As a database administrator on Amazon Web Services (AWS), you’ll be responsible for setting up, maintaining, and securing databases hosted on the Amazon cloud platform. You’ll work closely with other teams to ensure that databases are properly configured and secured.

8. Cloud Support Engineer:

Support engineers are responsible for providing technical support to AWS customers. They work closely with customers to troubleshoot problems and provide resolution within agreed upon SLAs.

9. Sales Engineer:

Sales engineers are responsible for working with sales teams to generate new business opportunities through the use of AWS products and services .They must have a deep understanding of AWS products and how they can be used by potential customers to solve their business problems .

10. Cloud Developer

An AWS Developer builds software services and enterprise-level applications. Generally, previous experience working as a software developer and a working knowledge of the most common cloud orchestration tools is required to get and succeed at an AWS cloud developer job

Average yearly salary: $132,000 USD

11. Cloud Consultant

Cloud consultants provide organizations with technical expertise and strategy in designing and deploying AWS cloud solutions or in consulting on specific issues such as performance, security, or data migration.

With average increases in salary of over 25% for certified individuals, you’re going to be in a much better position to secure your dream job or promotion if you earn your AWS Certified Solutions Architect Associate our Cloud Practitioner certification. Get the books below to for real practice exams:

Use the promo codes: W6XM9XP4TWN9 or T6K9P4J9JPPR or 9LWMYKJ7TWPN or TN4NTERJYHY4 for AWS CCP eBook at Apple iBook store.


Use Promo Codes XKPHAATA6LRL 4XJRP9XLT9XL or LTFFY6JA33EL or HKRMTMTHFMAM or 4XHAFTWT4FN6 for AWS SAA-C03 eBook at Apple iBook store



Use Promo Codes EF46PT44LXPN or L6L9R9LKEFFR or TWELPA4JFJWM for Azure Fundamentals eBook at Apple iBook store.

Average yearly salary: $104,000-$124,000

12. Cloud Data Architect

Cloud data architects and data engineers may be cloud database administrators or data analytics professionals who know how to leverage AWS database resources, technologies, and services to unlock the value of enterprise data.

Average yearly salary: $130,000-$140,000 USD

What are the Top 10 AWS jobs you can get with an AWS certification in 2022 plus AWS Interview Questions
AWS Data analytics DAS-C01 Exam Prep

Getting a job after getting an AWS certification

The field of cloud computing will continue to grow and even more different types of jobs will surface in the future.

AWS certified professionals are in high demand across a variety of industries. AWS certs can open the door to a number of AWS jobs, including cloud engineer, solutions architect, and DevOps engineer.


We know you like your hobbies and especially coding, We do too, but you should find time to build the skills that’ll drive your career into Six Figures. Cloud skills and certifications can be just the thing you need to make the move into cloud or to level up and advance your career. 85% of hiring managers say cloud certifications make a candidate more attractive. Start your cloud journey with these excellent books below:

Through studying and practice, any of the listed jobs could becoming available to you if you pass your AWS certification exams. Educating yourself on AWS concepts plays a key role in furthering your career and receiving not only a higher salary, but a more engaging position.

Source: 8 AWS jobs you can get with an AWS certification

AWS Tech Jobs  Interview Questions in 2022

Graphs

1) Process Ordering – LeetCode link…

2) Number of Islands – LeetCode link…

3) k Jumps on Grid – Loading…)

Sort

1) Finding Prefix in Dictionary – LeetCode Link…

Tree

1) Binary Tree Top Down View – LeetCode link…

2) Traversing binary tree in an outward manner.

3) Diameter of a binary tree [Path is needed] – Diameter of a Binary Tree – GeeksforGeeks

Sliding window

1) Contains Duplicates III – LeetCode link…

2) Minimum Window Substring [Variation of this question] – LeetCode link..

Linked List

1) Reverse a Linked List II – LeetCode link…

2) Remove Loop From Linked List – Remove Loop in Linked List

3) Reverse a Linked List in k-groups – LeetCode link…

Binary Search

1) Search In rotate sorted Array – LeetCode link…

Solution:

def pivotedBinarySearch(arr, n, key):
 
    pivot = findPivot(arr, 0, n-1)
 
    # If we didn't find a pivot,
    # then array is not rotated at all
    if pivot == -1:
        return binarySearch(arr, 0, n-1, key)
 
    # If we found a pivot, then first
    # compare with pivot and then
    # search in two subarrays around pivot
    if arr[pivot] == key:
        return pivot
    if arr[0] <= key:
        return binarySearch(arr, 0, pivot-1, key)
    return binarySearch(arr, pivot + 1, n-1, key)
 
 
# Function to get pivot. For array
# 3, 4, 5, 6, 1, 2 it returns 3
# (index of 6)
def findPivot(arr, low, high):
 
    # base cases
    if high < low:
        return -1
    if high == low:
        return low
 
    # low + (high - low)/2;
    mid = int((low + high)/2)
 
    if mid < high and arr[mid] > arr[mid + 1]:
        return mid
    if mid > low and arr[mid] < arr[mid - 1]:
        return (mid-1)
    if arr[low] >= arr[mid]:
        return findPivot(arr, low, mid-1)
    return findPivot(arr, mid + 1, high)
 
# Standard Binary Search function
def binarySearch(arr, low, high, key):
 
    if high < low:
        return -1
 
    # low + (high - low)/2;
    mid = int((low + high)/2)
 
    if key == arr[mid]:
        return mid
    if key > arr[mid]:
        return binarySearch(arr, (mid + 1), high,
                            key)
    return binarySearch(arr, low, (mid - 1), key)
 
# Driver program to check above functions
# Let us search 3 in below array
if __name__ == '__main__':
    arr1 = [5, 6, 7, 8, 9, 10, 1, 2, 3]
    n = len(arr1)
    key = 3
    print("Index of the element is : ", \
          pivotedBinarySearch(arr1, n, key))
 
# This is contributed by Smitha Dinesh Semwal

Arrays

1) Max bandWidth [Priority Queue, Sorting] – Loading…

2) Next permutation – Loading…

3) Largest Rectangle in Histogram – Loading…

Content by – Sandeep Kumar

#AWS #interviews #leetcode #questions #array #sorting #queue #loop #tree #graphs #amazon #sde —-#interviewpreparation #coding #computerscience #softwareengineer

You can translate the content of this page by selecting a language in the select box.

What are the Top 5 things that can say a lot about a software engineer or programmer’s quality?

When it comes to the quality of a software engineer or programmer, there are a few key things that can give you a good indication. First, take a look at their code quality. A good software engineer will take pride in their work and produce clean, well-organized code. They will also be able to explain their code concisely and confidently. Another thing to look for is whether they are up-to-date on the latest coding technologies and trends. A good programmer will always be learning and keeping up with the latest industry developments. Finally, pay attention to how they handle difficult problems. A good software engineer will be able to think creatively and come up with innovative solutions to complex issues. If you see these qualities in a software engineer or programmer, chances are they are of high quality.

Below are the top 5 things can say a lot about a software engineer/ programmer’s quality?

  1. The number of possible paths through the code (branch points) is minimized. Top quality code tends to be much more straight line than poor code. As a result, the author can design, code and test very quickly and is often looked at as a programming guru. In addition this code is far more resilient in Production.
  2. The code clearly matches the underlying business requirements and can therefore be understood very quickly by new resources. As a result there is much less tendency for a maintenance programmer to break the basic design as opposed to spaghetti code where small changes can have catastrophic effects.
  3. There is an overall sense of pride in the source code itself. If the enterprise has clear written standards, these are followed to the letter. If not, the code is internally consistent in terms of procedure/object, function/method or variable/attribute naming. Also indentation and continuations are universally consistent throughout. Last but not least, the majority of code blocks are self-evident to the requirements and where not the case, adequate purpose focused documentation is provided.

    In general, I have seen two types of programs provided for initial Production deployment. One looks like it was just written moments ago and the other looks like it has had 20 years of maintenance performed on it. Unfortunately, the authors of the second type cannot generally see the difference so it is a lost cause and we just have to continue to deal with the problems.
  4. In today’s programming environment, a project may span many platforms, languages etc. A simple web page may invoke an API which in turn accesses a database. For this example lets say JavaScript – Rest API – C# – SQL – RDBMS. The programmer can basically embed logic anywhere in this chain, but needs to be aware of reuse, performance and maintenance issues. For instance, if a part of the process requires access to three database tables, it is both faster and clearer to allow the DBMS engine return a single query than compare the tables in the API code. Similarly every business rule coded in the client side reduces re-usability potential.
    Top quality developers understand these issues and can optimize their designs to take advantages of the strengths of the component technologies.
  5. The ability to stay current with new trends and technologies. Technology is constantly evolving, and a good software engineer or programmer should be able to stay up-to-date on the latest trends and technologies in order to be able to create the best possible products.

To conclude:

Below are other things to consider when hiring good software engineers or programmers:

  1. The ability to write clean, well-organized code. This is a key indicator of a good software engineer or programmer. The ability to write code that is easy to read and understand is essential for creating high-quality software.
  2. The ability to test and debug code. A good coder should be able to test their code thoroughly and identify and fix any errors that may exist.
  3. The ability to write efficient code. Software engineering is all about creating efficient solutions to problems. A good software engineer or programmer will be able to write code that is efficient and effective.
  4. The ability to work well with others. Software engineering is typically a team-based effort. A good software engineer or programmer should be able to work well with others in order to create the best possible product.
  5. The ability to stay current with new trends and technologies.

How to Protect Yourself from Man-in-the-Middle Attacks: Tips for Safer Communication

Man in the middle attacks

You can translate the content of this page by selecting a language in the select box.

How to Protect Yourself from Man-in-the-Middle Attacks: Tips for Safer Communication

Man-in-the-middle (MITM) attacks are a type of cyberattack where a malicious actor intercepts communications between two parties in order to secretly access sensitive data or inject false information. While MITM attacks can be difficult to detect, there are some steps you can take to protect yourself.

For example, always verifying the identity of the person you’re communicating with and using encrypted communication tools whenever possible. Additionally, it’s important to be aware of common signs that an attack may be happening, such as unexpected messages or requests for sensitive information.

Man-in-the-middle attacks are one of the most common types of cyberattacks. MITM attacks can allow the attacker to gain access to sensitive information, such as passwords or financial data. Man-in-the-middle attacks can be very difficult to detect, but there are some steps you can take to protect yourself. First, be aware of the warning signs of a man-in-the-middle attack. These include:

– unexpected changes in login pages,

– unexpected requests for personal information,

– and unusual account activity.

If you see any of these warning signs, do not enter any sensitive information and contact the company or individual involved immediately. Second, use strong security measures, such as two-factor authentication, to protect your accounts. This will make it more difficult for attackers to gain access to your information. Finally, keep your software and operating system up to date with the latest security patches. This will help to close any potential vulnerabilities that could be exploited by attackers.

Man-in-the-middle attacks can be devastating for individuals and businesses alike. By intercepting communications between two parties, attackers can gain access to sensitive information or even impersonate one of the parties involved. Fortunately, there are a number of steps you can take to protect yourself from man-in-the-middle attacks.


  • First, avoid using public Wi-Fi networks for sensitive transactions. Attackers can easily set up their own rogue networks, and it can be difficult to tell the difference between a legitimate network and a malicious one. If you must use public Wi-Fi, be sure to use a VPN to encrypt your traffic.
  • Second, be cautious about the links you click on. When in doubt, hover over a link to see where it will actually take you. And always be suspicious of links that come from untrustworthy sources.
  • Finally, keep your software and security tools up to date. Man-in-the-middle attacks are constantly evolving, so it’s important to have the latest defenses in place.

By following these simple tips, you can help keep yourself safe from man-in-the-middle attacks.

Read more here

Is MITM attack possible when on HTTPS?

HTTPS (or really, SSL) is specifically designed to thwart MITM attacks.

If you are looking for an all-in-one solution to help you prepare for the AWS Cloud Practitioner Certification Exam, look no further than this AWS Cloud Practitioner CCP CLFC01 book below.


Web browsers validate that both the certificate presented by the server is labeled correctly with the website’s domain name and that it has a chain of trust back to a well-known certificate authority. Under normal circumstances, this is enough to prevent anyone from impersonating the website.

As the question points out, you can thwart this by somehow acquiring the secret key for the existing website’s certificate.

Invest in your future today by enrolling in this Azure Fundamentals - Microsoft Azure Certification and Training ebook below. This Azure Fundamentals Exam Prep Book will prepare you for the Azure Fundamentals AZ900 Certification Exam.


You can also launch a MITM attack by getting one of the well-known certificate authorities to issue you a certificate with the domain name of the website you wish to impersonate. This can be (and has been) accomplished by social engineering and hacking into the registrars.

Outside of those two main methods, you would have to rely upon bugs in the SSL protocol or its implementations (of which a few have been discovered over the years).

What are the countermeasures of MITM?

1- Certificates.

For the web, we use a similar principle. A certificate is a specific document issued by a third party that validate the identity of a website. Your PC can ask the third party if the certificate is correct, and only if it is allow the traffic. This is what HTTPs does.

2- Simple…encryption!

Man In The Middle attacks are carried out because an attacker is in between both communicators (let’s say two clients or a client and a server). If he is able to see the communication in clear text, he can do a whole lot ranging from stealing login credentials to snooping on conversations. If encryption is implemented, the attacker would see gibberish and “un-understandable” text instead.

In terms of web communication, digital certificates would do a great job of encrypting communication stream (any website using HTTPS encrypts communication stream by default). For social media apps like whats app and Skype, it is the responsibility of the vendor to implement encryption.

MitM Attack Techniques and Types

  • ARP Cache Poisoning. Address Resolution Protocol (ARP) is a low-level process that translates the machine address (MAC) to the IP address on the local network. …
  • DNS Cache Poisoning. …
  • Wi-Fi Eavesdropping. …
  • Session Hijacking.
  • IP Spoofing
  • DNS Spoofing
  • HTTPS Spoofing
  • SSL Hijacking
  • Email Hijacking
  • Wifi Eavesdropping
  • Cookie Stealing and so on.

Can MITM attacks steal credit card information?

When you enter your sensitive information on an HTTP website and press that “Send” button, all your private details travel in plain text from your web browser to the destination server.

A cyber-attacker can employ a man-in-the-middle attack and intercept your information. Since it’s not encrypted, the hacker can see everything: your name, physical address, card numbers, and anything else you entered.

With average increases in salary of over 25% for certified individuals, you’re going to be in a much better position to secure your dream job or promotion if you earn your AWS Certified Solutions Architect Associate our Cloud Practitioner certification. Get the books below to for real practice exams:

Use the promo codes: W6XM9XP4TWN9 or T6K9P4J9JPPR or 9LWMYKJ7TWPN or TN4NTERJYHY4 for AWS CCP eBook at Apple iBook store.


Use Promo Codes XKPHAATA6LRL 4XJRP9XLT9XL or LTFFY6JA33EL or HKRMTMTHFMAM or 4XHAFTWT4FN6 for AWS SAA-C03 eBook at Apple iBook store



Use Promo Codes EF46PT44LXPN or L6L9R9LKEFFR or TWELPA4JFJWM for Azure Fundamentals eBook at Apple iBook store.

To avoid MITM attacks, don’t share your info on HTTP sites. More on SSL certificates and man-in-the-middle attacks in this detailed medium article

How common are MITM attacks in public places with free WIFI?

Not common by people, but common by malware and other software that are designed to do that.

How do you ensure your RDP is secure from MITM attacks?

  • Make sure all of your workstations and remote servers are patched.
  • On highly sensitive devices, use two-factor authentication.
  • Reduce the number of remote account users with elevated privileges on the server.
  • Make a safe password.
  • Your credentials should not be saved in your RDP register.
  • Remove the RDP file from your computer.

What is Problem Formulation in Machine Learning and Top 4 examples of Problem Formulation in Machine Learning?

Summary of Machine Learning and Artificial Intelligence Capabilities

You can translate the content of this page by selecting a language in the select box.

What is Problem Formulation in Machine Learning and Top 4 examples of Problem Formulation in Machine Learning?

Machine Learning (ML) is a field of Artificial Intelligence (AI) that enables computers to learn from data, without being explicitly programmed. Machine learning algorithms build models based on sample data, known as “training data”, in order to make predictions or decisions, rather than following rules written by humans. Machine learning is closely related to and often overlaps with computational statistics; a discipline that also focuses on prediction-making through the use of computers. Machine learning can be applied in a wide variety of domains, such as medical diagnosis, stock trading, robot control, manufacturing and more.

The process of machine learning consists of several steps: first, data is collected; then, a model is selected or created; finally, the model is trained on the collected data and then applied to new data. This process is often referred to as the “machine learning pipeline”. Problem formulation is the second step in this pipeline and it consists of selecting or creating a suitable model for the task at hand and determining how to represent the collected data so that it can be used by the selected model. In other words, problem formulation is the process of taking a real-world problem and translating it into a format that can be solved by a machine learning algorithm.

There are many different types of machine learning problems, such as classification, regression, prediction and so on. The choice of which type of problem to formulate depends on the nature of the task at hand and the type of data available. For example, if we want to build a system that can automatically detect fraudulent credit card transactions, we would formulate a classification problem. On the other hand, if our goal is to predict the sale price of houses given information about their size, location and age, we would formulate a regression problem. In general, it is best to start with a simple problem formulation and then move on to more complex ones if needed.

Some common examples of problem formulations in machine learning are:
Classification: given an input data point (e.g., an image), predict its category label (e.g., dog vs cat).
Regression: given an input data point (e.g., size and location of a house), predict a continuous output value (e.g., sale price).
Prediction: given an input sequence (e.g., a series of past stock prices), predict the next value in the sequence (e.g., future stock price).
Anomaly detection: given an input data point (e.g., transaction details), decide whether it is normal or anomalous (i.e., fraudulent).
Recommendation: given information about users (e.g., age and gender) and items (e.g., books and movies), recommend items to users (e.g., suggest books for someone who likes romance novels).
Optimization: given a set of constraints (e.g., budget) and objectives (e.g., maximize profit), find the best solution (e.g., product mix).

Machine Learning For Dummies
Machine Learning For Dummies

ML For Dummies on iOs

ML PRO without ADS on iOs [No Ads]

ML PRO without ADS on Windows [No Ads]


ML PRO For Web/Android on Amazon [No Ads]

Problem Formulation: What this pipeline phase entails and why it’s important

The problem formulation phase of the ML Pipeline is critical, and it’s where everything begins. Typically, this phase is kicked off with a question of some kind. Examples of these kinds of questions include: Could cars really drive themselves?  What additional product should we offer someone as they checkout? How much storage will clients need from a data center at a given time?

The problem formulation phase starts by seeing a problem and thinking “what question, if I could answer it, would provide the most value to my business?” If I knew the next product a customer was going to buy, is that most valuable? If I knew what was going to be popular over the holidays, is that most valuable? If I better understood who my customers are, is that most valuable?

However, some problems are not so obvious. When sales drop, new competitors emerge, or there’s a big change to a company/team/org, it can be easy to say, “I see the problem!” But sometimes the problem isn’t so clear. Consider self-driving cars. How many people think to themselves, “driving cars is a huge problem”? Probably not many. In fact, there isn’t a problem in the traditional sense of the word but there is an opportunity. Creating self-driving cars is a huge opportunity. That doesn’t mean there isn’t a problem or challenge connected to that opportunity. How do you design a self-driving system? What data would you look at to inform the decisions you make? Will people purchase self-driving cars?

If you are looking for an all-in-one solution to help you prepare for the AWS Cloud Practitioner Certification Exam, look no further than this AWS Cloud Practitioner CCP CLFC01 book below.


Part of the problem formulation phase includes seeing where there are opportunities to use machine learning.

In the following practice examples, you are presented with four different business scenarios. For each scenario, consider the following questions:

Invest in your future today by enrolling in this Azure Fundamentals - Microsoft Azure Certification and Training ebook below. This Azure Fundamentals Exam Prep Book will prepare you for the Azure Fundamentals AZ900 Certification Exam.


  1. Is machine learning appropriate for this problem, and why or why not?
  2. What is the ML problem if there is one, and what would a success metric look like?
  3. What kind of ML problem is this?
  4. Is the data appropriate?’

The solutions given in this article are one of the many ways you can formulate a business problem.

I)  Amazon recently began advertising to its customers when they visit the company website. The Director in charge of the initiative wants the advertisements to be as tailored to the customer as possible. You will have access to all the data from the retail webpage, as well as all the customer data.

  1. ML is appropriate because of the scale, variety and speed required. There are potentially thousands of ads and millions of customers that need to be served customized ads immediately as they arrive to the site.
  2. The problem is ads that are not useful to customers are a wasted opportunity and a nuisance to customers, yet not serving ads at all is a wasted opportunity. So how does Amazon serve the most relevant advertisements to its retail customers?
    1. Success would be the purchase of a product that was advertised.
  3. This is a supervised learning problem because we have a labeled data point, our success metric, which is the purchase of a product.
  4. This data is appropriate because it is both the retail webpage data as well as the customer data.

II) You’re a Senior Business Analyst at a social media company that focuses on streaming. Streamers use a combination of hashtags and predefined categories to be discoverable by your platform’s consumers. You ran an analysis on unique streamer counts by hashtags and categories over the last month and found that out of tens of thousands of streamers, almost all use only 40 hashtags and 10 categories despite innumerable hashtags and hundreds of categories. You presume the predefined categories don’t represent all the possibilities very well, and that streamers are simply picking the closest fit. You figure there are likely many categories and groupings of streamers that are not accounted for. So you collect a dataset that consists of all streamer profile descriptions (all text), all the historical chat information for each streamer, and all their videos that have been streamed.

  1. ML is appropriate because of the scale and variability.
  2. The problem is the content of streamers is not being represented by the existing categories. Success would be naturally grouping the streamers into categories based on content and seeing if those align with the hashtags and categories that are being commonly used.  If they do not, then the streamers are not being well represented and you can use these groupings to create new categories.
  3. There isn’t a specific outcome variable. There’s no target or label. So this is an unsupervised problem.
  4. The data is appropriate.

III) You’re a headphone manufacturer who sells directly to big and small electronic stores. As an attempt to increase competitive pricing, Store 1 and Store 2 decided to put together the pricing details for all headphone manufacturers and their products (about 350 products) and conduct daily releases of the data. You will have all the specs from each manufacturer and their product’s pricing. Your sales have recently been dropping so your first concern is whether there are competing products that are priced lower than your flagship product.

  1. ML is probably not necessary for this. You can just search the dataset to see which headphones are priced lower than the flagship, then compare their features and build quality.

IV) You’re a Senior Product Manager at a leading ridesharing company. You did some market research, collected customer feedback, and discovered that both customers and drivers are not happy with an app feature. This feature allows customers to place a pin exactly where they want to be picked up. The customers say drivers rarely stop at the pin location. Drivers say customers most often put the pin in a place they can’t stop. Your company has a relationship with the most used maps app for the driver’s navigation so you leverage this existing relationship to get direct, backend access to their data. This includes latitude and longitude, visual photos of each lat/long, traffic delay details, and regulation data if available (ie- No Parking zones, 3 minute parking zones, fire hydrants, etc.).

  1. ML is appropriate because of the scale and automation involved. It’s not feasible to drive everywhere and write down all the places that are ok for pickup. However, maybe we can predict whether a location is ok for pickup.
  2. The problem is drivers and customers are having poor experiences connecting for pickup, which is pushing customers away from the platform.
    1. Success would be properly identifying appropriate pickup locations so they can be integrated into the feature.
  3. This is a supervised learning problem even though there aren’t any labels, yet. Someone will have to go through a sample of the data to label where there are ok places to park and not park, giving the algorithms some target information.
  4. The data is appropriate once a sample of the dataset has been labeled. There may be some other data that could be included too. What about asking UPS for driver stop information? Where do they stop?

In conclusion, problem formulation is an important step in the machine learning pipeline that should not be overlooked or underestimated. It can make or break a machine learning project; therefore, it is important to take care when formulating machine learning problems.”

AWS machine Learning Specialty Exam Prep MLS-C01
AWS machine Learning Specialty Exam Prep MLS-C01

Step by Step Solution to a Machine Learning Problem – Feature Engineering

Feature Engineering is the act of reshaping and curating existing data to make patters more apparent. This process makes the data easier for an ML model to understand. Using knowledge of the data, features are engineered and  tuned to make ML algorithms work more efficiently.

 

For this problem, imagine a scenario where you are running a real estate brokerage and you want to predict the selling price of a house. Using a specific county dataset and simple information (like the location, total square footage, and number of bedrooms), let’s practice training a baseline model, conducting feature engineering, and tuning a model to make a prediction.

First, load the dataset and take a look at its basic properties.

# Load the dataset
import pandas as pd
import boto3

df = pd.read_csv(“xxxxx_data_2.csv”)
df.head()

housing dataset example
housing dataset example: xxxxx_data_2.csv

Output:

With average increases in salary of over 25% for certified individuals, you’re going to be in a much better position to secure your dream job or promotion if you earn your AWS Certified Solutions Architect Associate our Cloud Practitioner certification. Get the books below to for real practice exams:

Use the promo codes: W6XM9XP4TWN9 or T6K9P4J9JPPR or 9LWMYKJ7TWPN or TN4NTERJYHY4 for AWS CCP eBook at Apple iBook store.


Use Promo Codes XKPHAATA6LRL 4XJRP9XLT9XL or LTFFY6JA33EL or HKRMTMTHFMAM or 4XHAFTWT4FN6 for AWS SAA-C03 eBook at Apple iBook store



Use Promo Codes EF46PT44LXPN or L6L9R9LKEFFR or TWELPA4JFJWM for Azure Fundamentals eBook at Apple iBook store.

feature_engineering_dataset_example
feature_engineering_dataset_example

This dataset has 21 columns:

  • id – Unique id number
  • date – Date of the house sale
  • price – Price the house sold for
  • bedrooms – Number of bedrooms
  • bathrooms – Number of bathrooms
  • sqft_living – Number of square feet of the living space
  • sqft_lot – Number of square feet of the lot
  • floors – Number of floors in the house
  • waterfront – Whether the home is on the waterfront
  • view – Number of lot sides with a view
  • condition – Condition of the house
  • grade – Classification by construction quality
  • sqft_above – Number of square feet above ground
  • sqft_basement – Number of square feet below ground
  • yr_built – Year built
  • yr_renovated – Year renovated
  • zipcode – ZIP code
  • lat – Latitude
  • long – Longitude
  • sqft_living15 – Number of square feet of living space in 2015 (can differ from sqft_living in the case of recent renovations)
  • sqrt_lot15 – Nnumber of square feet of lot space in 2015 (can differ from sqft_lot in the case of recent renovations)

This dataset is rich and provides a fantastic playground for the exploration of feature engineering. This exercise will focus on a small number of columns. If you are interested, you could return to this dataset later to practice feature engineering on the remaining columns.

A baseline model

Now, let’s  train a baseline model.

People often look at square footage first when evaluating a home. We will do the same in the oflorur model and ask how well can the cost of the house be approximated based on this number alone. We will train a simple linear learner model (documentation). We will compare to this after finishing the feature engineering.

import sagemaker
import numpy as np
from sklearn.model_selection import train_test_split
import time


We know you like your hobbies and especially coding, We do too, but you should find time to build the skills that’ll drive your career into Six Figures. Cloud skills and certifications can be just the thing you need to make the move into cloud or to level up and advance your career. 85% of hiring managers say cloud certifications make a candidate more attractive. Start your cloud journey with these excellent books below:

t1 = time.time()

# Split training, validation, and test
ys = np.array(df[‘price’]).astype(“float32”)
xs = np.array(df[‘sqft_living’]).astype(“float32”).reshape(-1,1)

np.random.seed(8675309)
train_features, test_features, train_labels, test_labels = train_test_split(xs, ys, test_size=0.2)
val_features, test_features, val_labels, test_labels = train_test_split(test_features, test_labels, test_size=0.5)

# Train model
linear_model = sagemaker.LinearLearner(role=sagemaker.get_execution_role(),
instance_count=1,
instance_type=’ml.m4.xlarge’,
predictor_type=’regressor’)

train_records = linear_model.record_set(train_features, train_labels, channel=’train’)
val_records = linear_model.record_set(val_features, val_labels, channel=’validation’)
test_records = linear_model.record_set(test_features, test_labels, channel=’test’)

linear_model.fit([train_records, val_records, test_records], logs=False)

sagemaker.analytics.TrainingJobAnalytics(linear_model._current_job_name, metric_names = [‘test:mse’, ‘test:absolute_loss’]).dataframe()

 

If you examine the quality metrics, you will see that the absolute loss is about $175,000.00. This tells us that the model is able to predict within an average of $175k of the true price. For a model based upon a single variable, this is not bad. Let’s try to do some feature engineering to improve on it.

Throughout the following work, we will constantly be adding to a dataframe called encoded. You will start by populating encoded with just the square footage you used previously.

 

encoded = df[[‘sqft_living’]].copy()

Categorical variables

Let’s start by including some categorical variables, beginning with simple binary variables.

The dataset has the waterfront feature, which is a binary variable. We should change the encoding from 'Y' and 'N' to 1 and 0. This can be done using the map function (documentation) provided by Pandas. It expects either a function to apply to that column or a dictionary to look up the correct transformation.

Binary categorical

Let’s write code to transform the waterfront variable into binary values. The skeleton has been provided below.

encoded[‘waterfront’] = df[‘waterfront’].map({‘Y’:1, ‘N’:0})

You can also encode many class categorical variables. Look at column condition, which gives a score of the quality of the house. Looking into the data source shows that the condition can be thought of as an ordinal categorical variable, so it makes sense to encode it with the order.

Ordinal categorical

Using the same method as in question 1, encode the ordinal categorical variable condition into the numerical range of 1 through 5.

encoded[‘condition’] = df[‘condition’].map({‘Poor’:1, ‘Fair’:2, ‘Average’:3, ‘Good’:4, ‘Very Good’:5})

A slightly more complex categorical variable is ZIP code. If you have worked with geospatial data, you may know that the full ZIP code is often too fine-grained to use as a feature on its own. However, there are only 7070 unique ZIP codes in this dataset, so we may use them.

However, we do not want to use unencoded ZIP codes. There is no reason that a larger ZIP code should correspond to a higher or lower price, but it is likely that particular ZIP codes would. This is the perfect case to perform one-hot encoding. You can use the get_dummies function (documentation) from Pandas to do this.

Nominal categorical

Using the Pandas get_dummies function,  add columns to one-hot encode the ZIP code and add it to the dataset.

encoded = pd.concat([encoded, pd.get_dummies(df[‘zipcode’])], axis=1)

In this way, you may freely encode whatever categorical variables you wish. Be aware that for categorical variables with many categories, something will need to be done to reduce the number of columns created.

One additional technique, which is simple but can be highly successful, involves turning the ZIP code into a single numerical column by creating a single feature that is the average price of a home in that ZIP code. This is called target encoding.

To do this, use groupby (documentation) and mean (documentation) to first group the rows of the DataFrame by ZIP code and then take the mean of each group. The resulting object can be mapped over the ZIP code column to encode the feature.

Nominal categorical II

Complete the following code snippet to provide a target encoding for the ZIP code.

means = df.groupby(‘zipcode’)[‘price’].mean()
encoded[‘zip_mean’] = df[‘zipcode’].map(means)

Normally, you only either one-hot encode or target encode. For this exercise, leave both in. In practice, you should try both, see which one performs better on a validation set, and then use that method.

Scaling

Take a look at the dataset. Print a summary of the encoded dataset using describe (documentation).

encoded.describe()

Scaling  - summary of the encoded dataset using describe
Scaling – summary of the encoded dataset using describe

One column ranges from 290290 to 1354013540 (sqft_living), another column ranges from 11 to 55 (condition), 7171 columns are all either 00 or 11 (one-hot encoded ZIP code), and then the final column ranges from a few hundred thousand to a couple million (zip_mean).

In a linear model, these will not be on equal footing. The sqft_living column will be approximately 1300013000 times easier for the model to find a pattern in than the other columns. To solve this, you often want to scale features to a standardized range. In this case, you will scale sqft_living to lie within 00 and 11.

Feature scaling

Fill in the code skeleton below to scale the column of the DataFrame to be between 00 and 11.

sqft_min = encoded[‘sqft_living’].min()
sqft_max = encoded[‘sqft_living’].max()
encoded[‘sqft_living’] = encoded[‘sqft_living’].map(lambda x : (x-sqft_min)/(sqft_max – sqft_min))

cond_min = encoded[‘condition’].min()
cond_max = encoded[‘condition’].max()
encoded[‘condition’] = encoded[‘condition’].map(lambda x : (x-cond_min)/(cond_max – cond_min))]

Read more here….

Amazon Reviews Solution

Predicting Credit Card Fraud Solution

Predicting Airplane Delays Solution

Data Processing for Machine Learning Example

Model Training and Evaluation Examples

Targeting Direct Marketing Solution

What are the Greenest or Least Environmentally Friendly Programming Languages?

What are the Greenest or Least Environmentally Programming Languages?

You can translate the content of this page by selecting a language in the select box.

What are the Greenest or Least Environmentally Friendly Programming Languages?

Technology has revolutionized the way we live, work, and play. It has also had a profound impact on the world of programming languages. In recent years, there has been a growing trend towards green, energy-efficient languages such as C and C++.  C++ and Rust are two of the most popular languages in this category. Both are designed to be more efficient than traditional languages like Java and JavaScript. And both have been shown to be highly effective at reducing greenhouse gas emissions. So if you’re looking for a language that’s good for the environment, these two are definitely worth considering.

The study below runs 10 benchmark problems in 28 languages [1]. It measures the runtime, memory usage, and energy consumption of each language. The abstract of the paper is shown below.

“This paper presents a study of the runtime, memory usage and energy consumption of twenty seven well-known software languages. We monitor the performance of such languages using ten different programming problems, expressed in each of the languages. Our results show interesting findings, such as, slower/faster languages consuming less/more energy, and how memory usage influences energy consumption. We show how to use our results to provide software engineers support to decide which language to use when energy efficiency is a concern”. [2]

According to the “paper,” in this study, they monitored the performance of these languages using different programming problems for which they used different algorithms compiled by the “Computer Language Benchmarks Game” project, dedicated to implementing algorithms in different languages.

The team used Intel’s Running Average Power Limit (RAPL) tool to measure power consumption, which can provide very accurate power consumption estimates.

The research shows that several factors influence energy consumption, as expected. The speed at which they are executed in the energy consumption is usually decisive, but not always the one that runs the fastest is the one that consumes the least energy as other factors enter into the power consumption equation besides speed, as the memory usage.

Energy

From this table, it is worth noting that C, C++and Java are among the languages that consume the least energy. On the other hand, JavaScript consumes almost twice as much as Java and four times what C consumes. As an interpreted language, Python needs more time to execute and is, therefore, one of the least “green” languages, occupying the position of those that consume the most energy.


Time:

The results are similar to the energy expenditure; the faster a programming language is, the less energy it expends.

Memory

In terms of memory consumption, we see how Java has become one of the most memory-consuming languages along with JavaScript.

Memory ranking.

Ranking

In this ranking, we can see the “greenest” and most efficient languages are: C, C+, Rust, and Java, although this last one shoots the memory usage.

From the Paper: Normalized global results for Energy, Time, and Memory.

To conclude: 

Most Environmentally Friendly Languages: C, Rust, and C++
Least Environmentally Friendly Languages: Ruby, Python, Perl

Although this study may seem curious and without much practical application, it may help design better and more efficient programming languages. Also, we can use this new parameter in our equation when choosing a programing language.

If you are looking for an all-in-one solution to help you prepare for the AWS Cloud Practitioner Certification Exam, look no further than this AWS Cloud Practitioner CCP CLFC01 book below.


This parameter can no longer be ignored in the future or almost the present; besides, the fastest languages are generally also the most environmentally friendly.

If you’re interested in something that is both green and energy efficient, you might want to consider the Groeningen Programming Language (GPL). Developed by a team of researchers at the University of Groningen in the Netherlands, GPL is a relatively new language that is based on the C and C++ programming languages. Python and Rust are also used in its development. GPL is designed to be used for developing energy efficient applications. Its syntax is similar to other popular programming languages, so it should be relatively easy for experienced programmers to learn. And since it’s open source, you can download and use it for free. So why not give GPL a try? It just might be the perfect language for your next project.

Invest in your future today by enrolling in this Azure Fundamentals - Microsoft Azure Certification and Training ebook below. This Azure Fundamentals Exam Prep Book will prepare you for the Azure Fundamentals AZ900 Certification Exam.


Top 10 Caveats – Counter arguments:

#1 C++ will perform better than Python to solve some simple algorithmic problems. C++ is a fairly bare-bone language with a medium level of abstraction, while Python is a high-level languages that relies on many external components, some of which have actually been written in C++. And of course C++ will be efficient than C# to solve some basic problem. But let’s see what happens if you build a complete web application back-end in C++.

#2: This isn’t much useful. I can imagine that the fastest (performance-wise) programming languages are greenest, and vice versa. However, running time is not only the factor here. An engineer may spend 5 minutes writing a Python script that does the job pretty well, and spends hours on debugging C++ code that does the same thing. And the performance difference on the final code may not differ much!

#3:  Has anyone actually taken a look at the winning C and Rust solutions? Most of them are hand-written assembly code masked as SSE intrinsic. That is the kind of code that only a handful of people are able to maintain, not to mention come up with. On the other hand, the Python solutions are pure Python code without a trace of accelerated (read: written in Fortran, C, C++, and/or Rust) libraries like NumPy used in all sane Python projects.

#4:  I used C++ years ago and now use Python, for saving energy consumption, I turn off my laptop when I got off work, I don’t use extra monitors, my AC is always set to 28 Celsius degree, I plan to change my car to electrical one, and I use Python.

#5: I disagree. We should consider the energy saved by the products created in those languages. For example, a C# – based Microsoft Teams allows people to work remotely. How much CO2 do we save that way? 😉

Now, try to do the same in C.

#6 Also, some Python programs, such as anything using NumPy, spend a considerable fraction of their cycles outside the Python interpreter in a C or C++ library..

I would love to see a scatterplot of execution time vs. energy usage as well. Given that modern CPUs can turbo and then go to a low-power state, a modest increase of energy usage during execution can pay dividends in letting the processor go to sleep quicker.

An application that vectorized heavily may end up having very high peak power and moderately higher energy usage that’s repaid by going to sleep much sooner. In the cell phone application processor business, we called that “race to sleep.” By Joe Zbiciak

#7  By Tim Mensch : It’s almost complete garbage.

With average increases in salary of over 25% for certified individuals, you’re going to be in a much better position to secure your dream job or promotion if you earn your AWS Certified Solutions Architect Associate our Cloud Practitioner certification. Get the books below to for real practice exams:

Use the promo codes: W6XM9XP4TWN9 or T6K9P4J9JPPR or 9LWMYKJ7TWPN or TN4NTERJYHY4 for AWS CCP eBook at Apple iBook store.


Use Promo Codes XKPHAATA6LRL 4XJRP9XLT9XL or LTFFY6JA33EL or HKRMTMTHFMAM or 4XHAFTWT4FN6 for AWS SAA-C03 eBook at Apple iBook store



Use Promo Codes EF46PT44LXPN or L6L9R9LKEFFR or TWELPA4JFJWM for Azure Fundamentals eBook at Apple iBook store.

If you look at the TypeScript numbers, they are more than 5x worse than JavaScript.

This has to mean they were running the TypeScript compiler every time they ran their benchmark. That’s not how TypeScript works. TypeScript should be identical to JavaScript. It is JavaScript once it’s running, after all.

Given that glaring mistake, the rest of their numbers are suspect.

I suspect Python and Ruby really are pretty bad given better written benchmarks I’ve seen, but given their testing issues, not as bad as they imply. Python at least has a “compile” phase as well, so if they were running a benchmark repeatedly, they were measuring the startup energy usage along with the actual energy usage, which may have swamped the benchmark itself.

PHP similarly has a compile step, but PHP may actually run that compile step every time a script is run. So of all of the benchmarks, it might be the closest.


We know you like your hobbies and especially coding, We do too, but you should find time to build the skills that’ll drive your career into Six Figures. Cloud skills and certifications can be just the thing you need to make the move into cloud or to level up and advance your career. 85% of hiring managers say cloud certifications make a candidate more attractive. Start your cloud journey with these excellent books below:

I do wonder if they also compiled the C and C++ code as part of the benchmarks as well. C++ should be as optimized or more so than C, and as such should use the same or less power, unless you’re counting the compile phase. And if they’re also measuring the compile phase, then they are being intentionally deceptive. Or stupid. But I’ll go with deceptive to be polite. (You usually compile a program in C or C++ once and then you can run it millions or billions of times—or more. The energy cost of compiling is miniscule compared to the run time cost of almost any program.)

I’ve read that 80% of all studies are garbage. This is one of those garbage studies.

#8 By Chaim Solomon: This is nonsense

This is nonsense as it runs low-level benchmarks that benchmark basic algorithms in high-level languages. You don’t do that for anything more than theoretical work.

Do a comparison of real-world tasks and you should find less of a spread.

Do a comparison of web-server work or something like that – I guess you may find a factor of maybe 5 or 10 – if it’s done right.

Don’t do low-level algorithms in a high-level language for anything more than teaching. If you need such an algorithm – the way to do it is to implement it in a library as a native module. And then it’s compiled to machine code and runs as fast as any other implementation.

#9 By Tim Mensch

It’s worse than nonsense. TypeScript complies directly to JavaScript, but gets a crazy worse rating somehow?!

#10 By Tim Mensch

For NumPy and machine learning applications, most of the calculations are going to be in C.

The world I’ve found myself in is server code, though. Servers that run 24/7/365.

And in that case, a server written in C or C++ will be able to saturate its network interface at a much lower continuous CPU load than a Python or Ruby server can. So in that respect, the latter languages’ performance issues really do make a difference in ongoing energy usage.

But as you point out, in mobile there could be an even greater difference due to the CPU being put to sleep or into a low power mode if it finishes its work more quickly.

 

Why is there a lack of diversity among software engineers in tech companies in USA and Canada?

What are the top 10 biggest lessons you have learned from the corporate world?

You can translate the content of this page by selecting a language in the select box.

Why is there a lack of diversity among software engineers in tech companies in USA and Canada?

One of the most significant problems facing the tech industry is a lack of diversity among software engineers. In the United States and Canada, women and visible minorities are grossly underrepresented in the field. This problem is often attributed to the pipelines that feed into the industry. For example, women are less likely than men to study computer science in college. However, this does not fully explain the disparity. Hiring practices at tech companies are also to blame. Studies have shown that companies are biased against female and minority candidates. They are more likely to hire white men with similar educational backgrounds and work experience. This lack of diversity has negative consequences for both the individuals affected and the companies themselves. Immersive technologies, such as virtual reality, are being designed and developed primarily by white men. This creates a product that does not reflect the needs or perspectives of a diverse population. In addition, companies with diverse teams have been shown to perform better than those without diversity. They are more innovative and adaptable to change. The lack of diversity among software engineers is a problem that needs to be addressed by both educators and tech companies if the industry is to thrive in the future.

Why is there a lack of diversity among software engineers in tech companies in USA and Canada?
FAANGM Compensations

A prominent Engineer from Silicon Valley said  this about the topic: Women and minorities are systematically discouraged from taking STEM subjects starting around the third grade. Fewer of them excel in math and science in high school. Fewer go to college. Women and minorities don’t see women and minorities in tech roles in the media. Some hiring managers are racist, misogynist scumbags. Some co-workers are dismissive of them. They have trouble finding mentors. The women, at least, are encouraged continuously to drop out and make fat babies.

This all forms a pretty effective filter.  By Kurt Guntheroth.

Find Local and Remote Real Time Jobs at https://inRealtimeJobs.com

Software industry has mostly adopted and promoted conformist culture, thanks to its leaders who mostly have been power mongers and Machiavellian by approach, with few exceptions here and there in few organizations. Conforming culture will inherently repel diversity since one is more assured of conforming staff in a known culture rather than more diverse culture. Most of these so called pseudo leaders don’t really seek diverse ideas which inherently stem from diverse cultures. And so such organizations never end up becoming diverse.

There aren’t enough women in the industry — in individual contributor roles and in leadership roles. There also aren’t enough African-Americans and Latinos.

Why does this matter? Many reasons. One big one is that exclusion leads to more exclusion. When one gender/ethnic group is significantly underrepresented in a workforce, strong biases bake in -> the people in the affected group think they don’t belong at the compan(ies) and the people at the companies have an insular bias to pick people from their networks/people who share their backgrounds.


Silicon Valley is the hottest growth sector in the US and will continue to create the best career and wealth opportunities over the next 20–30 years. It’s really not good that the industry isn’t absorbing more women, African-Americans, and Latinos.

To conclude:

Technology is a rapidly growing industry with a huge demand for qualified software engineers. However, there is a lack of diversity among software engineers in tech companies in the USA and Canada. Women and minorities are greatly underrepresented in the field of software engineering. This is due to several factors, including the misogynistic and racist culture of many tech companies. The lack of diversity among software engineers has a negative impact on the quality of products and services offered by tech companies. It also limits the ability of these companies to innovate and serve a wide range of customers. In order to increase diversity among software engineers, tech companies need to change their hiring practices and create an inclusive environment that values all types of people.

Find local and Remote Real Time Jobs at https://inRealtimeJobs.com

If you are looking for an all-in-one solution to help you prepare for the AWS Cloud Practitioner Certification Exam, look no further than this AWS Cloud Practitioner CCP CLFC01 book below.


 

 

Invest in your future today by enrolling in this Azure Fundamentals - Microsoft Azure Certification and Training ebook below. This Azure Fundamentals Exam Prep Book will prepare you for the Azure Fundamentals AZ900 Certification Exam.


How do we know that the Top 3 Voice Recognition Devices like Siri Alexa and Ok Google are not spying on us?

Proxy vs VPN

You can translate the content of this page by selecting a language in the select box.

How do we know that the Top 3 Voice Recognition Devices like Siri Alexa and Ok Google not spying on us?

When you ask Siri a question, she gives you an answer. But have you ever stopped to wonder how she knows the answer? After all, she’s just a computer program, right? Well, actually, Siri is powered by artificial intelligence (AI) and Machine Learning (ML). This means that she constantly learning and getting better at understanding human speech. So when you ask her a question, she uses her ML algorithms to figure out what you’re saying and then provides you with an answer.

So, How do we know that the Top 3 Voice Recognition Devices like Siri Alexa and Ok Google are not spying on us?

The Amazon Echo is a voice-activated speaker powered by Amazon’s AI assistant, Alexa. Echo uses far-field voice recognition to hear you from across the room, even while music is playing. Once it hears the wake word “Alexa,” it streams audio to the cloud, where the Alexa Voice Service turns the speech into text. Machine learning algorithms then analyze this text to try to understand what you want.

But what does this have to do with spying? Well, it turns out that ML can also be used to eavesdrop on people’s conversations. This is why many people are concerned about their privacy when using voice-activated assistants like Siri, Alexa, and Ok Google. However, there are a few things that you can do to protect your privacy. For example, you can disable voice recognition on your devices or only use them when you’re in a private location. You can also be careful about what information you share with voice-activated assistants. So while they may not be perfect, there are ways that you can minimize the risk of them spying on you.

Some applications which have background components, such as Facebook, do send ambient sounds to their data centers for processing. In so doing, they collect information on what you are talking about, and use it to target advertising.

Siri, Google, and Alexa only do this to decide whether or not you’ve invoked the activation trigger. For Apple hardware, recognition of “Siri, …” happens in hardware locally, without sending out data for recognition. The same for “Alexa, …” for Alexa hardware, and “Hey, Google, …” for Google hardware.

Things get more complicated for these three things, when they are installed cross-platform. So, for example, to make “Hey, Google, …” work on non-Google hardware, where it’s not possible to do the recognition locally, yes, it listens. But unlike Facebook, it’s not recording ambient to collect keywords.

Practically, it’s my understanding that the tree major brands don’t, and it’s only things like Facebook which more or less “violate your trust like this. And other than Facebook, I’m uncertain whether or not any other App does this.


You’ll find that most of the terms and conditions you’ve agreed to on installation of a third party App, grant them pretty broad discretion.

Personally, I tend to not install Apps like that, and use the WebUI from the mobile device browser instead.

If you do that, instead of installing an App, you rob them of their power to eavesdrop effectively. Source: Terry Lambert

How do we know that the Top 3 Voice Recognition Devices like Siri Alexa and Ok Google are not spying on us?

Conclusion:

If you are looking for an all-in-one solution to help you prepare for the AWS Cloud Practitioner Certification Exam, look no further than this AWS Cloud Practitioner CCP CLFC01 book below.


Machine learning is a field of artificial intelligence (AI) concerned with the design and development of algorithms that learn from data. Machine learning algorithms have been used for a variety of tasks, including voice recognition, image classification, and spam detection. In recent years, there has been growing concern about the use of machine learning for surveillance and spying. However, it is important to note that machine learning is not necessarily synonymous with spying. Machine learning algorithms can be used for good or ill, depending on how they are designed and deployed. When it comes to voice-activated assistants such as Siri, Alexa, and OK Google, the primary concern is privacy. These assistants are constantly listening for their wake words, which means they may be recording private conversations without the user’s knowledge or consent. While it is possible that these recordings could be used for nefarious purposes, it is also important to remember that machine learning algorithms are not perfect. There is always the possibility that recordings could be misclassified or misinterpreted. As such, it is important to weigh the risks and benefits of using voice-activated assistants before making a decision about whether or not to use them.

Machine Learning For Dummies
Machine Learning For Dummies

ML For Dummies on iOs [Contain Ads]

Invest in your future today by enrolling in this Azure Fundamentals - Microsoft Azure Certification and Training ebook below. This Azure Fundamentals Exam Prep Book will prepare you for the Azure Fundamentals AZ900 Certification Exam.


ML PRO without ADS on iOs [No Ads, More Features]

ML PRO without ADS on Windows [No Ads, More Features]

ML PRO For Web/Android on Amazon [No Ads, More Features]

Use this App to learn about Machine Learning and Elevate your Brain with Machine Learning Quizzes, Cheat Sheets, Ml Jobs Interview Questions and Answers updated daily.

The App provides:

– 400+ Machine Learning Operation on AWS, Azure, GCP and Detailed Answers and References

– 100+ Machine Learning Basics Questions and Answers

– 100+ Machine Learning Advanced Questions and Answers

– Scorecard

– Countdown timer

With average increases in salary of over 25% for certified individuals, you’re going to be in a much better position to secure your dream job or promotion if you earn your AWS Certified Solutions Architect Associate our Cloud Practitioner certification. Get the books below to for real practice exams:

Use the promo codes: W6XM9XP4TWN9 or T6K9P4J9JPPR or 9LWMYKJ7TWPN or TN4NTERJYHY4 for AWS CCP eBook at Apple iBook store.


Use Promo Codes XKPHAATA6LRL 4XJRP9XLT9XL or LTFFY6JA33EL or HKRMTMTHFMAM or 4XHAFTWT4FN6 for AWS SAA-C03 eBook at Apple iBook store



Use Promo Codes EF46PT44LXPN or L6L9R9LKEFFR or TWELPA4JFJWM for Azure Fundamentals eBook at Apple iBook store.

What are some jobs or professions that have become or will soon become obsolete due to technology, automation, and artificial intelligence?

Machine Learning For Dummies

You can translate the content of this page by selecting a language in the select box.

What are some jobs or professions that have become or will soon become obsolete due to technology, automation, and artificial intelligence?

Technology, automation, and artificial intelligence are changing the world as we know it. They are making some jobs obsolete and giving rise to new professions. For example, machine learning is automating many tasks that were previously done by human beings, such as data entry and analysis. As a result, many jobs that require these skills are disappearing. In their place, new jobs are being created for people who can code algorithms and train machine learning models. Similarly, artificial intelligence is changing the landscape of many professions. It is being used to automate tasks such as customer service, financial analysis, and even medical diagnosis. 

It’s no secret that technology, automation, and artificial intelligence are changing the world of work. Machine learning and artificial intelligence are making inroads in a variety of industries, from healthcare to manufacturing to finance. As these technologies advance, they are increasingly capable of doing the jobs that have traditionally been done by human beings. This is having a major impact on the labor market, with many jobs becoming obsolete or at risk of disappearing altogether.

Some of the jobs that are most at risk of being replaced by machines include:

  • assembly line workers, cashiers,
  • data entry clerks,
  • and customer service representatives.
  • Financial analysis
  • Medical Diagnosis Technicians
  • In many cases, these jobs are already being done by robots or automated systems. As machine learning and artificial intelligence continue to evolve, they will only become more capable of taking on these roles. In the future, we may see even more professions becoming obsolete as a result of technology.

While this change can be disruptive, it also opens up new opportunities for people with the right skills. Those who are able to adapt to the changing world of work will find themselves well-positioned for success in the years to come.

Find latest remote jobs at https://inRealTimeJobs.com

Machine Learning For Dummies
Machine Learning For Dummies

Machine Learning For Dummies

ML PRO without ADS on iOs [No Ads, More Features]

ML PRO without ADS on Windows [No Ads, More Features]


ML PRO For Web/Android on Amazon [No Ads, More Features]

Use this App to learn about Machine Learning and Elevate your Brain with Machine Learning Quizzes, Cheat Sheets, Ml Jobs Interview Questions and Answers updated daily.

The App provides:

– 400+ Machine Learning Operation on AWS, Azure, GCP and Detailed Answers and References

If you are looking for an all-in-one solution to help you prepare for the AWS Cloud Practitioner Certification Exam, look no further than this AWS Cloud Practitioner CCP CLFC01 book below.


– 100+ Machine Learning Basics Questions and Answers

– 100+ Machine Learning Advanced Questions and Answers

Invest in your future today by enrolling in this Azure Fundamentals - Microsoft Azure Certification and Training ebook below. This Azure Fundamentals Exam Prep Book will prepare you for the Azure Fundamentals AZ900 Certification Exam.


– Scorecard

– Countdown timer

– Machine Learning Cheat Sheets

– Machine Learning Interview Questions and Answers

– Machine Learning Latest News

The App covers: Azure AI Fundamentals AI-900 Exam Prep: Azure AI 900, ML, Natural Language Processing, Modeling, Data Engineering, Computer Vision, Exploratory Data Analysis, ML implementation and Operations, S3, SageMaker, Kinesis, Lake Formation, Athena, Kibana, Redshift, Textract, EMR, Glue, GCP PROFESSIONAL Machine Learning Engineer, Framing ML problems, Architecting ML solutions, Designing data preparation and processing systems, Developing ML models, Monitoring, optimizing, and maintaining ML solutions, Automating and orchestrating ML pipelines, Quiz and Brain Teaser for AWS Machine Learning MLS-C01, Cloud Build, Kubeflow, TensorFlow, CSV, JSON, IMG, parquet or databases, Hadoop/Spark, Vertex AI Prediction, Describe Artificial Intelligence workloads and considerations, Describe fundamental principles of machine learning on Azure, Describe features of computer vision workloads on Azure, Describe features of Natural Language Processing (NLP) workloads on Azure , Describe features of conversational AI workloads on Azure, QnA Maker service, Language Understanding service (LUIS), Speech service, Translator Text service, Form Recognizer service, Face service, Custom Vision service, Computer Vision service, facial detection, facial recognition, and facial analysis solutions, optical character recognition solutions, object detection solutions, image classification solutions, azure Machine Learning designer, automated ML UI, conversational AI workloads, anomaly detection workloads, forecasting workloads identify features of anomaly detection work, NLP, Kafka, SQl, NoSQL, Python, DocumentDB, linear regression, logistic regression, Sampling, dataset, statistical interaction, selection bias, non-Gaussian distribution, bias-variance trade-off, Normal Distribution, correlation and covariance, Point Estimates and Confidence Interval, A/B Testing, p-value, statistical power of sensitivity, over-fitting and under-fitting, regularization, Law of Large Numbers, Confounding Variables, Survivorship Bias, univariate, bivariate and multivariate, Resampling, ROC curve, TF/IDF vectorization, Cluster Sampling, etc.

#machinelearning #nlp #computervision #mlops #ml #dataengineering #hadoop #tensorflow #ai #mlsc01 #azurai900 #gcpml #AWSML

 

What are some good datasets for Data Science and Machine Learning?

Data Sciences - Data Analytics

You can translate the content of this page by selecting a language in the select box.

Finding good datasets for Data Science and Machine Learning can be a challenge. There are a lot of dataset out there, but not all of them are good for machine learning. In order to find a good dataset, you need to consider what you want to use the dataset for. If you want to use the dataset for training a machine learning model, then you need to make sure that the dataset is representative of the real-world data that you want to use the model on. The dataset should also be large enough to train a robust model. Another important consideration is whether or not the dataset is open source. Open source datasets are typically better because they have been vetted by the community and are more likely to be of high quality. However, open source datasets can also be more difficult to find. A good place to start looking for datasets is on websites like Kaggle and UC Irvine Machine Learning Repository. These websites contain a variety of high-quality datasets that are free to download and use.

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Person climbing a staircase. Learn Data Science from Scratch: online program with 21 courses Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2


Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

If you are looking for an all-in-one solution to help you prepare for the AWS Cloud Practitioner Certification Exam, look no further than this AWS Cloud Practitioner CCP CLFC01 book below.


Person climbing a staircase. Learn Data Science from Scratch: online program with 21 courses

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Invest in your future today by enrolling in this Azure Fundamentals - Microsoft Azure Certification and Training ebook below. This Azure Fundamentals Exam Prep Book will prepare you for the Azure Fundamentals AZ900 Certification Exam.


Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

With average increases in salary of over 25% for certified individuals, you’re going to be in a much better position to secure your dream job or promotion if you earn your AWS Certified Solutions Architect Associate our Cloud Practitioner certification. Get the books below to for real practice exams:

Use the promo codes: W6XM9XP4TWN9 or T6K9P4J9JPPR or 9LWMYKJ7TWPN or TN4NTERJYHY4 for AWS CCP eBook at Apple iBook store.


Use Promo Codes XKPHAATA6LRL 4XJRP9XLT9XL or LTFFY6JA33EL or HKRMTMTHFMAM or 4XHAFTWT4FN6 for AWS SAA-C03 eBook at Apple iBook store



Use Promo Codes EF46PT44LXPN or L6L9R9LKEFFR or TWELPA4JFJWM for Azure Fundamentals eBook at Apple iBook store.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2


We know you like your hobbies and especially coding, We do too, but you should find time to build the skills that’ll drive your career into Six Figures. Cloud skills and certifications can be just the thing you need to make the move into cloud or to level up and advance your career. 85% of hiring managers say cloud certifications make a candidate more attractive. Start your cloud journey with these excellent books below:

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here
  • Node.js – Async non-blocking event-driven JavaScript runtime built on Chrome’s V8 JavaScript engine.
  • Frontend Development
  • iOS – Mobile operating system for Apple phones and tablets.
  • Android – Mobile operating system developed by Google.
  • IoT & Hybrid Apps
  • Electron – Cross-platform native desktop apps using JavaScript/HTML/CSS.
  • Cordova – JavaScript API for hybrid apps.
  • React Native – JavaScript framework for writing natively rendering mobile apps for iOS and Android.
  • Xamarin – Mobile app development IDE, testing, and distribution.
  • Linux
    • Containers
    • eBPF – Virtual machine that allows you to write more efficient and powerful tracing and monitoring for Linux systems.
    • Arch-based Projects – Linux distributions and projects based on Arch Linux.
  • macOS – Operating system for Apple’s Mac computers.
  • watchOS – Operating system for the Apple Watch.
  • JVM
  • Salesforce
  • Amazon Web Services
  • Windows
  • IPFS – P2P hypermedia protocol.
  • Fuse – Mobile development tools.
  • Heroku – Cloud platform as a service.
  • Raspberry Pi – Credit card-sized computer aimed at teaching kids programming, but capable of a lot more.
  • Qt – Cross-platform GUI app framework.
  • WebExtensions – Cross-browser extension system.
  • RubyMotion – Write cross-platform native apps for iOS, Android, macOS, tvOS, and watchOS in Ruby.
  • Smart TV – Create apps for different TV platforms.
  • GNOME – Simple and distraction-free desktop environment for Linux.
  • KDE – A free software community dedicated to creating an open and user-friendly computing experience.
  • .NET
    • Core
    • Roslyn – Open-source compilers and code analysis APIs for C# and VB.NET languages.
  • Amazon Alexa – Virtual home assistant.
  • DigitalOcean – Cloud computing platform designed for developers.
  • Flutter – Google’s mobile SDK for building native iOS and Android apps from a single codebase written in Dart.
  • Home Assistant – Open source home automation that puts local control and privacy first.
  • IBM Cloud – Cloud platform for developers and companies.
  • Firebase – App development platform built on Google Cloud Platform.
  • Robot Operating System 2.0 – Set of software libraries and tools that help you build robot apps.
  • Adafruit IO – Visualize and store data from any device.
  • Cloudflare – CDN, DNS, DDoS protection, and security for your site.
  • Actions on Google – Developer platform for Google Assistant.
  • ESP – Low-cost microcontrollers with WiFi and broad IoT applications.
  • Deno – A secure runtime for JavaScript and TypeScript that uses V8 and is built in Rust.
  • DOS – Operating system for x86-based personal computers that was popular during the 1980s and early 1990s.
  • Nix – Package manager for Linux and other Unix systems that makes package management reliable and reproducible.

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here
  • JavaScript
  • Swift – Apple’s compiled programming language that is secure, modern, programmer-friendly, and fast.
  • Python – General-purpose programming language designed for readability.
    • Asyncio – Asynchronous I/O in Python 3.
    • Scientific Audio – Scientific research in audio/music.
    • CircuitPython – A version of Python for microcontrollers.
    • Data Science – Data analysis and machine learning.
    • Typing – Optional static typing for Python.
    • MicroPython – A lean and efficient implementation of Python 3 for microcontrollers.
  • Rust
  • Haskell
  • PureScript
  • Go
  • Scala
    • Scala Native – Optimizing ahead-of-time compiler for Scala based on LLVM.
  • Ruby
  • Clojure
  • ClojureScript
  • Elixir
  • Elm
  • Erlang
  • Julia – High-level dynamic programming language designed to address the needs of high-performance numerical analysis and computational science.
  • Lua
  • C
  • C/C++ – General-purpose language with a bias toward system programming and embedded, resource-constrained software.
  • R – Functional programming language and environment for statistical computing and graphics.
  • D
  • Common Lisp – Powerful dynamic multiparadigm language that facilitates iterative and interactive development.
  • Perl
  • Groovy
  • Dart
  • Java – Popular secure object-oriented language designed for flexibility to “write once, run anywhere”.
  • Kotlin
  • OCaml
  • ColdFusion
  • Fortran
  • PHP – Server-side scripting language.
  • Pascal
  • AutoHotkey
  • AutoIt
  • Crystal
  • Frege – Haskell for the JVM.
  • CMake – Build, test, and package software.
  • ActionScript 3 – Object-oriented language targeting Adobe AIR.
  • Eta – Functional programming language for the JVM.
  • Idris – General purpose pure functional programming language with dependent types influenced by Haskell and ML.
  • Ada/SPARK – Modern programming language designed for large, long-lived apps where reliability and efficiency are essential.
  • Q# – Domain-specific programming language used for expressing quantum algorithms.
  • Imba – Programming language inspired by Ruby and Python and compiles to performant JavaScript.
  • Vala – Programming language designed to take full advantage of the GLib and GNOME ecosystems, while preserving the speed of C code.
  • Coq – Formal language and environment for programming and specification which facilitates interactive development of machine-checked proofs.
  • V – Simple, fast, safe, compiled language for developing maintainable software.

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here
  • Flask – Python framework.
  • Docker
  • Vagrant – Automation virtual machine environment.
  • Pyramid – Python framework.
  • Play1 Framework
  • CakePHP – PHP framework.
  • Symfony – PHP framework.
  • Laravel – PHP framework.
    • Education
    • TALL Stack – Full-stack development solution featuring libraries built by the Laravel community.
  • Rails – Web app framework for Ruby.
    • Gems – Packages.
  • Phalcon – PHP framework.
  • Useful .htaccess Snippets
  • nginx – Web server.
  • Dropwizard – Java framework.
  • Kubernetes – Open-source platform that automates Linux container operations.
  • Lumen – PHP micro-framework.
  • Serverless Framework – Serverless computing and serverless architectures.
  • Apache Wicket – Java web app framework.
  • Vert.x – Toolkit for building reactive apps on the JVM.
  • Terraform – Tool for building, changing, and versioning infrastructure.
  • Vapor – Server-side development in Swift.
  • Dash – Python web app framework.
  • FastAPI – Python web app framework.
  • CDK – Open-source software development framework for defining cloud infrastructure in code.
  • IAM – User accounts, authentication and authorization.
  • Chalice – Python framework for serverless app development on AWS Lambda.

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here
  • Big Data
  • Public Datasets
  • Hadoop – Framework for distributed storage and processing of very large data sets.
  • Data Engineering
  • Streaming
  • Apache Spark – Unified engine for large-scale data processing.
  • Qlik – Business intelligence platform for data visualization, analytics, and reporting apps.
  • Splunk – Platform for searching, monitoring, and analyzing structured and unstructured machine-generated big data in real-time.

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works, Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek. Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions. AWS CLI Access (No AWS account required) aws s3 ls s3://commoncrawl/ --no-sign-request s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System. Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS. See all usage examples for datasets listed in this registry. See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images Documentation: allenai.org/data/tqa Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.
Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. AWS CLI Access (No AWS account required) aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details. Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation AWS CLI Access (No AWS account required) aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere. The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data. This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). FlightAware.com has data but you need to pay for a full dataset. The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:
  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights
Airline On-Time Statistics and Delay Causes The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe Download: airports.dat (Airports only, high quality) Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though. flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here