What is machine learning and how does Netflix use it for its recommendation engine?


AI Unraveled: Demystifying Frequently Asked Questions on Artificial Intelligence


What is an online recommendation engine?

Think about examples of machine learning you may have encountered, such as a website like Netflix recommending which video you might want to watch next.
Are the recommendations ever wrong or unfair? We will give an example and explain how this could be addressed.


Machine learning is a field of artificial intelligence that Netflix uses to create its recommendation algorithm. The goal of machine learning is to teach computers to learn from data and make predictions based on that data. To do this, Netflix employs Machine Learning Engineers, Data Scientists, and software developers to design and build algorithms that can automatically improve over time. The Netflix recommendations engine is just one example of how machine learning can be used to improve the user experience. By understanding what users watch and why, the recommendations engine can provide tailored suggestions that help users find new shows and movies to enjoy. Machine learning is also used for other Netflix features, such as predicting which shows a user might be interested in watching next, or detecting inappropriate content. In a world where data is becoming increasingly important, machine learning will continue to play a vital role in helping Netflix deliver a great experience to its users.


Netflix’s recommendation engine is one of the company’s most valuable assets. By using machine learning, Netflix is able to constantly improve its recommendations for each individual user.

Machine learning engineers, data scientists, and developers work together to build and improve the recommendation engine.

  • They start by collecting data on what users watch and how they interact with the Netflix interface.
  • This data is then used to train machine learning models.
  • The models are constantly being tweaked and improved by the team of engineers.
  • The goal is to make sure that each user sees recommendations that are highly relevant to their interests.

Thanks to the work of the team, Netflix’s recommendation engine is constantly getting better at understanding each individual user.


How Does It Work?

In short, Netflix’s recommendation algorithm looks at what you’ve watched in the past and then makes recommendations based on that data. But of course, it’s a bit more complicated than that. The algorithm also looks at data from other users with similar watching habits to yours. This allows Netflix to give you more tailored recommendations.

For example, say you’re a big fan of Friends (who isn’t?). The algorithm knows that a lot of Friends fans also like shows like Cheers, Seinfeld, and The Office. So, if you’re ever feeling nostalgic and in the mood for a sitcom marathon, Netflix will be there to help you out.
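The "fans of Friends also like..." idea is commonly called collaborative filtering. Here is a minimal sketch of it, not Netflix's actual system: the viewers, their watch histories, and the `recommend` helper are all made up for illustration, and similarity is measured with simple set overlap (Jaccard).

```python
# Toy watch histories; every name and title list here is hypothetical.
watched = {
    "ana":   {"Friends", "Cheers", "Seinfeld"},
    "ben":   {"Friends", "The Office", "Seinfeld"},
    "chloe": {"Breaking Bad", "The Wire"},
    "you":   {"Friends"},
}


def jaccard(a, b):
    """Overlap between two sets of titles, from 0 (none) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, histories, top_n=3):
    """Score unseen titles by the similarity of the viewers who watched them."""
    seen = histories[user]
    scores = {}
    for other, titles in histories.items():
        if other == user:
            continue
        sim = jaccard(seen, titles)
        for title in titles - seen:
            scores[title] = scores.get(title, 0.0) + sim
    # Keep titles with a positive score; sort by score, then by name for ties.
    candidates = [t for t in scores if scores[t] > 0]
    ranked = sorted(candidates, key=lambda t: (-scores[t], t))
    return ranked[:top_n]


print(recommend("you", watched))
```

Seinfeld ranks first here because two similar viewers watched it, while nothing from the viewer with no overlap is suggested at all.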

But That’s Not All…

Not only does the algorithm take into account what you’ve watched in the past, but it also looks at what you’re currently watching. For example, let’s say you’re halfway through Season 2 of Breaking Bad and you decide to take a break for a few days. When you come back and finish Season 2, the algorithm knows that you’re now interested in similar shows like Dexter and The Wire. And voila! Those shows will now be recommended to you.

Of course, the algorithm isn’t perfect. There are always going to be times when it recommends a show or movie that just doesn’t interest you. But hey, that’s why they have the “thumbs up/thumbs down” feature. Just give those shows the old thumbs down and never think about them again! Problem solved.

Another angle:

When it comes to TV and movie recommendations, there are two main types of data that are being collected and analyzed:

1) demographic data

2) viewing data.

Demographic data is information like your age, gender, and location. This data is generally used to group people with similar interests together so that they can be served more targeted recommendations. For example, if you’re a 25-year-old woman living in Los Angeles, you might be grouped with other 25-year-old women living in Los Angeles whose viewing habits are similar to yours.


Viewing data is exactly what it sounds like—it’s information on what TV shows and movies you’ve watched in the past. This data is used to identify patterns in your viewing habits so that the algorithm can make better recommendations on what you might want to watch next. For example, if you’ve watched a lot of romantic comedies in the past, the algorithm might recommend other romantic comedies that you might like based on those patterns.
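Spotting a pattern in viewing data can be as simple as counting genres. The sketch below assumes exactly that; the catalog, its genre labels, and the `suggest_by_genre` helper are hypothetical, and a real system would use far richer signals.

```python
from collections import Counter

# Hypothetical catalog mapping each title to a single genre.
catalog = {
    "Notting Hill": "romcom",
    "Crazy Rich Asians": "romcom",
    "Set It Up": "romcom",
    "Heat": "crime",
    "Alien": "sci-fi",
}


def suggest_by_genre(history, catalog):
    """Suggest unwatched titles from the viewer's most-watched genre."""
    genre_counts = Counter(catalog[t] for t in history if t in catalog)
    if not genre_counts:
        return []
    top_genre = genre_counts.most_common(1)[0][0]
    return sorted(
        title
        for title, genre in catalog.items()
        if genre == top_genre and title not in history
    )


print(suggest_by_genre(["Notting Hill", "Set It Up", "Heat"], catalog))
```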

Are the Recommendations Ever Wrong or Unfair?
Yes and no. The fact of the matter is that no algorithm is perfect—there will always be some error involved. However, these errors are usually minor and don’t have a major impact on our lives. In fact, we often don’t even notice them!

The bigger issue with machine learning isn’t inaccuracy; it’s bias. Because algorithms are designed by humans, they often contain human biases that can seep into the recommendations they make. For example, a recent study found that Amazon’s algorithms were biased against women authors because the majority of book purchases on the site were made by men. As a result, Amazon’s algorithms were more likely to recommend books written by men over books written by women—regardless of quality or popularity.

These sorts of biases can have major impacts on our lives because they can dictate what we see and don’t see online. If we’re only seeing content that reflects our own biases back at us, we’re not getting a well-rounded view of the world—and that can have serious implications for both our personal lives and society as a whole.

One of the benefits of machine learning is that it can help us make better decisions. For example, if you’re trying to decide what movie to watch on Netflix, the site will use your past viewing history to recommend movies that you might like. This is possible because machine learning algorithms are able to identify patterns in data.

Another benefit of machine learning is that it can help us automate tasks. For example, if you’re a cashier and have to scan the barcodes of the items someone is buying, a machine learning algorithm can be used to automatically scan the barcodes and calculate the total cost of the purchase. This can save time and increase efficiency.

The Consequences of Machine Learning

While machine learning can be beneficial, there are also some potential consequences that should be considered. One consequence is that machine learning algorithms can perpetuate bias. For example, if you’re using a machine learning algorithm to recommend movies to people on Netflix, the algorithm might only recommend movies that are similar to ones that people have already watched. This could lead to people only watching movies that confirm their existing beliefs instead of challenging them.

Another consequence of machine learning is that it can be difficult to understand how the algorithms work. This is because the algorithms are usually created by trained experts and then fine-tuned through trial and error. As a result, regular people often don’t know how or why certain decisions are being made by machines. This lack of transparency can lead to mistrust and frustration.



What are some ways to increase precision or recall in machine learning?


What are some ways to Boost Precision and Recall in Machine Learning?

Sensitivity vs Specificity?



In machine learning, recall is the ability of the model to find all relevant instances in the data while precision is the ability of the model to correctly identify only the relevant instances. A high recall means that most relevant results are returned while a high precision means that most of the returned results are relevant. Ideally, you want a model with both high recall and high precision but often there is a trade-off between the two. In this blog post, we will explore some ways to increase recall or precision in machine learning.
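To make the two definitions concrete, here is a small sketch that counts true positives (TP), false positives (FP), and false negatives (FN) and derives both metrics from them. The `precision_recall` helper and the example labels are made up for illustration; 1 means "relevant" (for example, spam) and 0 means "not relevant".

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # relevant and returned
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # returned but not relevant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # relevant but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```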



The main way to increase recall is to decrease the number of false negatives, usually at the cost of more false positives. Recall is the share of actual positives that the model catches, TP / (TP + FN), so it rises only when fewer real positives are missed. To achieve that, you can lower your threshold for what constitutes a positive prediction. For example, if you are trying to predict whether or not an email is spam, you might lower the threshold for what constitutes spam so that more emails are classified as spam. This will result in more false positives (emails that are not actually spam being classified as spam) but will also increase recall (more actual spam emails being classified as spam).


Raising the threshold has the opposite effect. Going back to the spam email prediction example, if you raise the threshold for what constitutes spam, fewer emails are classified as spam. This results in more false negatives (actual spam emails not being classified as spam) and therefore lower recall. So to decrease the number of false negatives, you need to lower the threshold, not raise it.



The main way to increase precision is to decrease the number of false positives. Precision is the share of positive predictions that are correct, TP / (TP + FP), so it rises when fewer negatives are wrongly flagged. To achieve that, you can raise your threshold for what constitutes a positive prediction. For example, using the spam email prediction example again, you might raise the threshold for what constitutes spam so that fewer emails are classified as spam. This will result in fewer false positives (legitimate emails being classified as spam) and usually higher precision, but it will also lower recall (more actual spam emails slipping through).

Lowering the threshold has the opposite effect. Going back to the spam email prediction example once more, if you lower the threshold for what constitutes spam, more emails are classified as spam. This will result in fewer true negatives and more false positives (non-spam emails being classified as spam), which usually decreases precision even as recall goes up.
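The effect of moving the threshold is easiest to see by sweeping it over a set of scored examples. This is a sketch with made-up data: `scores` stands in for a spam model's estimated probabilities, `labels` for the true answers (1 = spam), and `metrics_at` is a hypothetical helper.

```python
# Hypothetical model scores and true labels (1 = spam, 0 = not spam).
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]


def metrics_at(threshold, scores, labels):
    """Return (precision, recall) when scores >= threshold are predicted as spam."""
    preds = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(labels, preds))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


for threshold in (0.8, 0.5, 0.25):
    p, r = metrics_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, a threshold of 0.8 gives perfect precision but catches only half the spam, while 0.25 catches all of it at the cost of flagging two legitimate emails.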


To summarize, there are a few ways to increase precision or recall in machine learning. The most direct is to adjust the threshold for classification, which moves the decision boundary and trades one metric against the other. If that is not enough, you can try a different algorithm altogether. It also helps to evaluate with the right metric: if you need a balance between the two rather than one at any cost, track the F1 score, which is the harmonic mean of precision and recall.
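As a quick illustration of the F1 score mentioned above, the sketch below computes it as the harmonic mean of precision and recall. The `f1_score` helper here is written out by hand for clarity and is not taken from any particular library.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A lopsided model is penalized compared with a balanced one.
print(f"{f1_score(1.0, 0.5):.3f}")    # perfect precision, half the recall
print(f"{f1_score(0.75, 0.75):.3f}")  # balanced precision and recall
```

Because the harmonic mean is pulled toward the smaller value, a model cannot get a high F1 by excelling at only one of the two metrics.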


Sensitivity vs Specificity

In machine learning, sensitivity and specificity are two measures of the performance of a model. Sensitivity is the proportion of actual positives that the model correctly predicts as positive (it is the same quantity as recall), while specificity is the proportion of actual negatives that the model correctly predicts as negative.
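Both measures fall straight out of the four confusion-matrix counts. A minimal sketch, with a hypothetical `sensitivity_specificity` helper and made-up counts for a spam filter evaluated on 100 spam and 900 legitimate emails:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # share of actual positives caught
    specificity = tn / (tn + fp) if tn + fp else 0.0  # share of actual negatives cleared
    return sensitivity, specificity


# Hypothetical counts: 80 of 100 spam emails caught, 855 of 900 legitimate emails cleared.
sens, spec = sensitivity_specificity(tp=80, fn=20, tn=855, fp=45)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```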

Google Colab For Machine Learning

State of the Google Colab for ML (October 2022)

Google introduced computing units, which you can purchase much as you would buy compute from AWS or Azure. Pro comes with 100 computing units and Pro+ with 500. Your choice of GPU or TPU and the High-RAM option affect how many computing units you use per hour. If you don’t have any computing units, you can’t use “Premium” tier GPUs (A100, V100), and even the P100 is not viable.

Google Colab Pro+ comes with the Premium tier GPU option, while in Pro, if you have computing units, you are randomly connected to a P100 or a T4. After you use up your computing units, you can buy more, or you can use a T4 for half or most of the time (there can be many times in the day when you can’t get a T4 or any kind of GPU at all). In the free tier, the GPUs offered are mostly the K80 and P4, which perform similarly to a 750 Ti (an entry-level GPU from 2014) but with more VRAM.



For reference, a T4 uses around 2 computing units per hour and an A100 around 15.
As far as is currently known, the computing-unit cost of a GPU tends to fluctuate based on some unknown factor.

Considering those:

  1. For hobbyists and (under)graduate coursework, it is better to use your own GPU if it has more than 4 GB of VRAM and is faster than a 750 Ti, or at least to buy Colab Pro so you can reach a T4 even when you have no computing units left.
  2. For small research companies, non-trivial university research, and probably most other people, Colab is now probably not a good option.
  3. Colab Pro+ can be considered if you want Pro but don’t sit in front of your computer, since a session disconnects after 90 minutes of inactivity on your computer. But this can be overcome with scripts to some extent, so most of the time Colab Pro+ is not a good option.
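To put the plan sizes in perspective, the rough hourly rates quoted above (about 2 units for a T4, about 15 for an A100) can be turned into GPU hours per plan. This is back-of-the-envelope arithmetic only: the rates are the approximate figures from this post and may fluctuate, and the `gpu_hours` helper is hypothetical.

```python
# Approximate figures quoted in the post; actual rates may fluctuate.
UNITS_PER_HOUR = {"T4": 2, "A100": 15}
PLAN_UNITS = {"Pro": 100, "Pro+": 500}


def gpu_hours(plan, gpu):
    """Estimated hours of one GPU type that a plan's computing units would cover."""
    return PLAN_UNITS[plan] / UNITS_PER_HOUR[gpu]


for plan in PLAN_UNITS:
    for gpu in UNITS_PER_HOUR:
        print(f"{plan} on {gpu}: about {gpu_hours(plan, gpu):.1f} hours")
```

At these rates, the 100 units in Pro cover roughly 50 hours on a T4 but fewer than 7 hours on an A100.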

If you have anything more to say, please let me know so I can edit this post with them. Thanks!

Conclusion:


In machine learning, precision and recall trade off against each other; increasing one often decreases the other. There is no single silver bullet solution for increasing either precision or recall; it depends on your specific use case which one is more important and which methods will work best for boosting whichever metric you choose. In this blog post, we explored some methods for increasing either precision or recall; hopefully this gives you a starting point for improving your own models!

 

What are some ways we can use machine learning and artificial intelligence for algorithmic trading in the stock market?

Machine Learning and Data Science Breaking News 2022 – 2023

  • Salary review
    by /u/Late-Night-5837 (Data Science) on March 26, 2023 at 10:40 pm

    Hi all, I have been working as a Data Science consultant in Sydney, Australia for 2 years. I currently make 100k before tax. I am up for salary review in June. I would like to assess the feasibility of a salary increase to AUD ~140k per annum. I have been working as Data Science team lead (through contract) for a major automotive manufacturer and have successfully maintained the engagement across multiple contracts due to success and interest in my work. I would like to stay with this company, but I may pursue new employment if we can't reach a suitable and fair number. Any advice as to if this is an appropriate salary bump would be appreciated. Thanks submitted by /u/Late-Night-5837 [link] [comments]

  • Data analytics certification
    by /u/Obskyquil (Data Science) on March 26, 2023 at 8:39 pm

    Hi, I’m looking to find careers/internships that involve coding and data analysis while the undergrad degree I’m working on is not data science/computer sciences. Is there any way I can get legitimate certificates while I’m working on a project portfolio? I’m using Python and learning R and so far I found Python Institute if that’s trustworthy. submitted by /u/Obskyquil [link] [comments]

  • What was your most absurd technical data science interview like?
    by /u/ShmDoubleO (Data Science) on March 26, 2023 at 8:19 pm

    I just finished a hackerrank test for a position at a barely mid-tier company. This was an initial tech screen. At this point I have a few different jobs under my belt and a few years of experience, I've done a number of data science interviews, I've had some truly absurd ones but the one I just had left me dumbfounded, and I'm curious about other people's experience. Also, I'm curious about what people think of my experience, if I'm being too critical or unrealistic etc. Sorry I know this sounds a little vent-y, pretty mad. The hackerrank test had 3 sections and was only a few hours long: 1.) A question where we had to build a simple and commonly used algorithm, but from scratch using only numpy. This was an algorithm that nobody would ever build from scratch in a real-world role. This was very much a full on build a model, feed it some data, talk about the data a bit, etc. 2.) A machine learning problem where you have to do a bunch of data exploration and visualization, build and tune a model in a heavily time-limited test where your code is being run on some dinky VM. Talk about model results and all of your logic, and make visualizations related to your results. Everything is expected to be very well documented, not just how or why it works but "I did this because, this is what I saw, these are the implications etc." 3.) A medium-level coding question. What I think was absurd about this was not the questions themselves, I think in some cases they were good questions, but rather the fact that they put them on a platform like hackerrank with a pretty unrealistic time limit. Question 2 had the level of complexity and the amount of different tasks that was easily on par with every take-home DS assessment I've had where I've been emailed a csv and a list of questions and given a number of days to solve it using the tools I want to, in a very open-ended manner, with the ability to email the company with any clarifying questions and google anything I want. 
This was something that realistically might take a couple days to "do it right" and a quick version of this would be about as quick and dirty as possible. Question 1 was something that a DS would never do, I can't remember ever seeing somebody implement a model in pure numpy other than in a college course maybe where you're learning about the algo itself. This was more difficult than any high-tier big-tech interview that I've ever had. submitted by /u/ShmDoubleO [link] [comments]

  • How do you come up with questions to answer for projects using “real time” data?
    by /u/darthvardhan95 (Data Science) on March 26, 2023 at 6:18 pm

    Hello! I am looking to build a project over the next couple of weeks where I can use real time data from an API such as Reddit, YouTube, Coinmarketcap, Weather, etc. I want to pull data from the API and store it in a database, then pull it from there to build a model and then store the results back into the database. However, I am struggling to come up with an actual use case/ business problem to tackle for this project. I would greatly appreciate any advice related to projects of this kind and general advice on coming up with problems to convert into project ideas would also be greatly appreciated. Thanks! submitted by /u/darthvardhan95 [link] [comments]

  • Emergency Room Queue -
    by /u/Xoloani (Data Science) on March 26, 2023 at 6:02 pm

    Emergency Room Queue - does anyone know of a software that can visualize how a ER treatment moves along the department? submitted by /u/Xoloani [link] [comments]

  • Building a model for part inventory management
    by /u/meatsweats1000 (Data Science) on March 26, 2023 at 6:01 pm

    I want to preface this post by saying my background is in industrial engineering so I apologize in advance if I use the wrong terminology. I’m an Industrial engineer (maintenance focus) for a large transportation/logistics company that has maintenance shops spread across the country. Every shop has a parts department. The current system for managing parts relies on rudimentary code that is not very effective. Because of this, we have a growing problem with “stock-outs” (not having a part when needed) and this causes extended down time and other issues. I want to build a model to predict potential stock-outs and give an alert to preemptively adjust inventory levels accordingly. We have A TON of historical and current data that I believe can be used to build / train the model. This would be something on the side of the current system. Not replacing it. My question is, what kind of model would you recommend using and how hard would a project like this be? Again, I don’t have a background in data science but I’m a fast learner and have some experience with SQL and a little bit of Python. Any recommendations for resources to develop my skillset for this or just general advice about how to go about a project like this would be greatly appreciated. Let me know if you need any other details and thanks! submitted by /u/meatsweats1000 [link] [comments]

  • [P] SimpleAI : A self-hosted alternative to OpenAI API
    by /u/lhenault (Machine Learning) on March 26, 2023 at 5:31 pm

    Hey everyone, I wanted to share with you SimpleAI, a self-hosted alternative to OpenAI API. The aim of this project is to replicate the (main) endpoints of OpenAI API, and to let you easily and quickly plug in any new model. It basically allows you to deploy your custom model wherever you want, while minimizing the amount of changes on both the server and client sides. It's compatible with the OpenAI client so you don't have to change much in your existing code (or can use it to easily query your API). Whether or not you like the AI-as-a-service approach of OpenAI, I think this project could be of interest to many. Even if you are fully satisfied with a paid API, you might be interested in this if: You need a model fine-tuned on some specific language and don't see any good alternative, or your company data is too sensitive to send to an external service. You've developed your own awesome model, and want a drop-in replacement to switch to yours, to be able to A/B test the two approaches. You're deploying your services in an infrastructure with an unreliable internet connection, so you would rather have your service locally. You're just another AI enthusiast with a lot of spare time and free GPU. I've personally really enjoyed how open the ML(Ops) community has been in the past years, and seeing how the industry seems to be moving towards paid API and black box systems can be a bit worrying. This project might be useful to expose great, community-based alternatives. If that sounds interesting, please have a look at the examples. I also have a blogpost explaining a few more things. Thank you! submitted by /u/lhenault [link] [comments]

  • Have deepfakes become so realistic that they can fool people into thinking they are genuine? [D]
    by /u/YCCY12 (Machine Learning) on March 26, 2023 at 5:19 pm

    I saw this story of a 50 year old Japanese man who face-swapped his face with a young woman's face. His followers didn't suspect anything of the photos he posted until he came clean and revealed his identity. https://www.insider.com/man-who-used-faceapp-pretend-woman-more-popular-than-before-2021-5 Another story I found was of a South Korean youtuber/influencer who became popular, amassed millions of views and then revealed she was deepfaked by a company called dob world. https://www.youtube.com/watch?v=cGycBsawTew Do you think deepfakes are realistic enough that people can't tell they're looking at a deepfake unless told? The celebrity deepfakes seem obvious since we know the celebrity and can usually know what content they're actually in. But if not told, and looking at a non-famous person, are deepfakes obvious when you see them? Especially if made by a company that has a high-quality, large dataset for both faces. It makes me wonder how many influencers are deepfaked or edited so heavily that they look completely different in person. And I don't just mean photoshopping to look skinny, but that their face/identity isn't the same in any way. submitted by /u/YCCY12 [link] [comments]

  • Coping with failure at a project.
    by /u/thelostknight99 (Data Science) on March 26, 2023 at 4:26 pm

    Took over a machine learning model from a colleague. Improved upon it. Now an SDE from the same vertical came up with a slightly simpler heuristic approach and got some small improvement over the current model. And it was disclosed on the same day we were deploying it. Not sure how to react to it. Also spoiled my whole weekend trying to get a small improvement over the model. What should my team and I do? Also, the improvements made by me and by the SDE are still too small to make the stakeholders happy. submitted by /u/thelostknight99 [link] [comments]

  • [D] Favorite tips for staying up to date with AI/Deep Learning research and news?
    by /u/seraschka (Machine Learning) on March 26, 2023 at 4:12 pm

    AI breakthroughs are happening non-stop! What are your approaches to staying up to date? Not perfect, but here's what I do at the moment: I create lists for major categories that interest me, collecting books, articles, blog posts, videos, and discussions. (The choice of the tool for list-making is less important than the habit and workflow.) I capture everything that appears interesting, but defer to reading it later -- I found that it's all about the tricky balance between prioritizing, exploring, and avoiding distractions. I set weekly goals for myself for consuming selected resources, understanding that not everything captured is a priority. (Usually, I set aside 1 hour in the morning at least) I use tools and social platforms like Google Scholar alerts, Papers with Code, Twitter, and newsletters to stay updated. (I just wrote a slightly more lengthy outline of this here: https://sebastianraschka.com/blog/2023/keeping-up-with-ai.html) submitted by /u/seraschka [link] [comments]

  • Silhouette Coefficient from Distance Matrix
    by /u/skurelowech3 (Data Science) on March 26, 2023 at 4:06 pm

    I am having a hard time understanding the silhouette coefficient... if I am given a distance matrix as follows: https://preview.redd.it/ieft7an0x3qa1.png?width=481&format=png&auto=webp&s=1964c5a6433bf443f2358455a1d1d18d7b56b049 Assuming that the clusters are {P1, P2} and {P3, P4}. I want to compute the silhouette coefficient for each point, for each cluster and for the overall clustering, how would I do that? What is the significance/meaning of the coefficient for each case? submitted by /u/skurelowech3 [link] [comments]

  • [R] New article in Nature Medicine describes the risks of using an AI- & ML-based tool known as NarxCare to guide opioid prescription decision making
    by /u/wildcatbluejay24 (Machine Learning) on March 26, 2023 at 4:01 pm

    submitted by /u/wildcatbluejay24 [link] [comments]

  • R = Python + SQL (just my opinion)
    by /u/Donfrds (Data Science) on March 26, 2023 at 4:00 pm

    I have learnt Python and SQL and have just started to get into the R language, and I found it pretty easy to comprehend because it is pretty similar to Python and SQL… Thoughts? submitted by /u/Donfrds [link] [comments]

  • [P] Using ChatGPT plugins with LLaMA
    by /u/balthierwings (Machine Learning) on March 26, 2023 at 3:38 pm

    submitted by /u/balthierwings [link] [comments]

  • [D] GPT4 and coding problems
    by /u/enryu42 (Machine Learning) on March 26, 2023 at 3:25 pm

    https://medium.com/@enryu9000/gpt4-and-coding-problems-8fbf04fa8134 Apparently it cannot solve coding problems which require any amount of thinking. LeetCode examples were most likely data leakage. Such drastic gap between MMLU performance and end-to-end coding is somewhat surprising. <sarcasm>Looks like AGI is not here yet.</sarcasm> Thoughts? submitted by /u/enryu42 [link] [comments]

  • [D] Simple Questions Thread
    by /u/AutoModerator (Machine Learning) on March 26, 2023 at 3:00 pm

    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]

  • Tools to solve the domain gap between source and target data [D]
    by /u/iamheinrich (Machine Learning) on March 26, 2023 at 2:52 pm

    Hey guys, do you know any tools/solutions that help to bridge domain gaps between source and target data? Did you try some that you'd recommend? Cheers! submitted by /u/iamheinrich [link] [comments]

  • I asked ChatGPT for the best random state value
    by /u/liquidInkRocks (Data Science) on March 26, 2023 at 2:31 pm

    This thing is like a student forced to write 1000 words on the topic. me: What is a good random state value to use in general Bot: In machine learning, a random state is used to initialize the random number generator. The random number generator is used to shuffle the data, split the data into training and test sets, initialize the parameters of the model, and make other random decisions during training and evaluation. Choosing a good random state value is important because it can affect the results of your model. If you use a different random state, you may get different results. In general, there is no one "best" random state value to use. The value you choose depends on the specific problem you're working on and the specific data you're working with. However, it's a good practice to set a random state to ensure reproducibility of your results. You can choose any integer value for the random state, but it's common to use numbers like 0, 42, or 2022. It's also common to use cross-validation techniques like k-fold cross-validation, which helps to reduce the variability of your results due to different random state values. submitted by /u/liquidInkRocks [link] [comments]

  • [D] Best practices for fine-tuning NLP models for prompt-based applications?
    by /u/Bouraouiamir (Machine Learning) on March 26, 2023 at 1:37 pm

    I've noticed that the best NLP models are distinguished by how they have been fine-tuned on their data rather than by their size. For example, the LLaMA model has been fine-tuned and achieved a better overall score compared to models with larger parameter counts. (LLaMA's biggest model has 65B parameters, compared to 175B from GPT-3). I'm interested in learning more about the best practices for fine-tuning NLP models, especially techniques that experts at Facebook or Stanford use, with a focus on prompt-based applications. Can anyone share tips on how to fine-tune NLP models effectively for prompt-based applications? What data should be used for fine-tuning, and how should the data be preprocessed? How can we optimize the hyperparameters during fine-tuning? Are there any particular techniques or tools that work best for fine-tuning NLP models for prompt-based applications? Additionally, I'm curious about the format used for the data that is mined for NLP models. What format is best for the data to be in, and how is it typically organized for training and fine-tuning purposes? It's worth noting that my main interest in NLP is prompt-based applications, rather than text completion. submitted by /u/Bouraouiamir [link] [comments]

  • Web 3.0 and Data Science Learning Resources
    by /u/vaibhav05cse (Data Science) on March 26, 2023 at 12:20 pm

    Find a wide range of learning resources on Web 3.0, Blockchain, AI and Data Science. Get regular career guides and interview preparation guides. Stay tuned with Incubity. https://incubity.ambilio.com/learning-center/ Some of the recently published articles: Diving Deeper into Decentralized Finance (DeFi) How to build a NFT Marketplace: A Step-by-step Guide Top 10 Web 3.0 Project Ideas in 2023 Top Stable Diffusion Based Project Ideas in 2023 DataOps: The Key to Streamlining Complex Data Pipelines Blockchain Developer Interview Questions with Answers Stay tuned with us through our social media handles and get regular updates. submitted by /u/vaibhav05cse [link] [comments]


Simple Linear Regression vs. Multiple Linear Regression vs. MANOVA: A Data Scientist’s Guide

As a data scientist, it's important to understand the difference between simple linear regression, multiple linear regression, and MANOVA. This will come in handy when you're working with different datasets and trying to figure out which one to use. Here's a quick overview of each method:

A Short Overview of Simple Linear Regression, Multiple Linear Regression, and MANOVA

Simple linear regression is used to predict the value of a dependent variable (y) based on the value of one independent variable (x). This is the most basic form of regression analysis.

Multiple linear regression is used to predict the value of a dependent variable (y) based on the values of two or more independent variables (x1, x2, x3, etc.). This is more complex than simple linear regression but can provide more accurate predictions when the additional variables carry real information about y.

MANOVA (multivariate analysis of variance) is used to test whether two or more dependent variables (y1, y2, etc.), considered jointly, differ across the groups defined by one or more categorical independent variables. It is not a form of regression: instead of predicting a single outcome, it compares groups while accounting for the correlations among the dependent variables.

So, which one should you use? It depends on your dataset and what you’re trying to learn. If you have one dependent variable and only one independent variable, then simple linear regression will suffice. If you have one dependent variable and multiple independent variables, then multiple linear regression will be more appropriate. And if you want to compare groups on two or more dependent variables at once, then MANOVA is the way to go.

In data science, there are a variety of techniques that can be used to model relationships between variables. Three of the most common techniques are simple linear regression, multiple linear regression, and MANOVA. Although these techniques may appear to be similar at first glance, there are actually some key differences that set them apart. Let’s take a closer look at each technique to see how they differ.


Simple Linear Regression

Simple linear regression is a statistical technique that can be used to model the relationship between a dependent variable and a single independent variable. The dependent variable is the variable that is being predicted, while the independent variable is the variable that is being used to make predictions.
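As a rough illustration of the idea, the sketch below fits a least-squares line using only the Python standard library. The function name `fit_simple_linear_regression` and the hours/grades numbers are made up for this example; they are not taken from any particular library or dataset.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = sum of cross-deviations / sum of squared x-deviations.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data (invented for illustration): hours studied vs. grade.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
grades = [52.0, 54.0, 56.0, 58.0, 60.0]

intercept, slope = fit_simple_linear_regression(hours, grades)
predicted_grade = intercept + slope * 6.0  # prediction for 6 hours of study
```

Here the slope is the estimated change in the dependent variable for a one-unit change in the independent variable, which is the quantity simple linear regression is usually used to estimate.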


Multiple Linear Regression

Multiple linear regression is a statistical technique that can be used to model the relationship between a dependent variable and two or more independent variables. As with simple linear regression, the dependent variable is the variable that is being predicted. However, in multiple linear regression, there can be multiple independent variables that are being used to make predictions.
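A minimal sketch of multiple linear regression follows, assuming the classic normal-equations approach and using only the Python standard library. The helper names (`solve_linear_system`, `fit_multiple_linear_regression`) and the data are hypothetical and exist only for this illustration.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations update both sides together.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_linear_regression(rows, y):
    """Return [intercept, b1, ..., bk] minimizing the sum of squared errors."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept column
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical, noise-free data generated from y = 1 + 2*x1 + 3*x2.
features = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
target = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in features]

coefficients = fit_multiple_linear_regression(features, target)
```

In practice a numerical library routine such as `numpy.linalg.lstsq` is usually preferable to hand-rolled normal equations, which can be numerically unstable when the independent variables are strongly correlated.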


MANOVA

MANOVA (multivariate analysis of variance) is a statistical technique that can be used to test whether groups differ on two or more dependent variables considered together. Unlike simple linear regression or multiple linear regression, which model a single dependent variable, MANOVA requires two or more dependent variables, and they must be continuous. The independent variables are categorical: they define the groups being compared.



When it comes to data modeling, there are a variety of different techniques that can be used. Simple linear regression, multiple linear regression, and MANOVA are three of the most common techniques. Each technique has its own set of benefits and drawbacks that should be considered before deciding which technique to use for a particular project.

We often encounter data points that are correlated. For example, the number of hours studied is correlated with the grades achieved. In such cases, we can use regression analysis to study the relationships between the variables.

Simple linear regression is a statistical method that allows us to predict the value of a dependent variable (y) based on the value of an independent variable (x). In other words, we can use simple linear regression to find out how much y will change when x changes.

Multiple linear regression is a statistical method that allows us to predict the value of a dependent variable (y) based on the values of multiple independent variables (x1, x2, …, xn). In other words, we can use multiple linear regression to find out how much y will change when any of the independent variables changes.

Multivariate analysis of variance (MANOVA) is a statistical method that allows us to compare groups on multiple dependent variables (y1, y2, …, yn) simultaneously. In other words, MANOVA can help us understand whether the dependent variables, taken together, differ between groups.
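To make this concrete, here is a minimal sketch that computes Wilks' lambda, one of the standard MANOVA test statistics, for a one-way design with exactly two dependent variables. It uses the ratio det(E) / det(E + H), where E is the within-group and H the between-group sums-of-squares-and-cross-products matrix. The function name `wilks_lambda` and the data are hypothetical, and the sketch stops at the statistic itself; it does not compute a p-value.

```python
def wilks_lambda(groups):
    """Wilks' lambda for a one-way MANOVA with exactly two dependent variables.

    groups: list of groups, each a list of (y1, y2) observations.
    Values near 1 suggest similar group means; values near 0 suggest
    the groups are well separated.
    """
    all_obs = [obs for group in groups for obs in group]
    n = len(all_obs)
    grand = (sum(o[0] for o in all_obs) / n, sum(o[1] for o in all_obs) / n)

    # Symmetric 2x2 matrices stored as [a, b, c], meaning [[a, b], [b, c]].
    within = [0.0, 0.0, 0.0]  # E: deviations from each group's own mean
    total = [0.0, 0.0, 0.0]   # E + H: deviations from the grand mean
    for group in groups:
        size = len(group)
        mean = (sum(o[0] for o in group) / size, sum(o[1] for o in group) / size)
        for y1, y2 in group:
            d1, d2 = y1 - mean[0], y2 - mean[1]
            within[0] += d1 * d1
            within[1] += d1 * d2
            within[2] += d2 * d2
            t1, t2 = y1 - grand[0], y2 - grand[1]
            total[0] += t1 * t1
            total[1] += t1 * t2
            total[2] += t2 * t2

    def det(m):
        return m[0] * m[2] - m[1] * m[1]

    return det(within) / det(total)


# Hypothetical data: two groups measured on two dependent variables.
group_a = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
group_b = [(11.0, 12.0), (12.0, 11.0), (13.0, 13.0)]

lambda_separated = wilks_lambda([group_a, group_b])  # group means far apart
lambda_identical = wilks_lambda([group_a, group_a])  # identical group means
```

A statistics package would normally convert this statistic into an approximate F test; that step is omitted here to keep the sketch short.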

Simple Linear Regression vs Multiple Linear Regression vs MANOVA: A Comparative Study
The main difference between simple linear regression and multiple linear regression is that simple linear regression predicts the value of a dependent variable from only one independent variable, whereas multiple linear regression predicts it from two or more independent variables. Another difference is that each additional independent variable adds a coefficient to estimate and test, so the risk of overfitting, and of at least one false positive (Type I error) when many coefficients are tested, grows with the number of predictors.

Both simple linear regression and multiple linear regression are typically used to predict the value of a dependent variable for new observations. MANOVA, by contrast, is used to test whether groups differ across several dependent variables.

Conclusion:

In this article, we have seen the key differences between simple linear regression, multiple linear regression, and MANOVA, along with their applications. Simple linear regression should be used when there is only one predictor variable, whereas multiple linear regression should be used when there are two or more predictor variables. MANOVA should be used when there are two or more response variables and the goal is to compare groups. Hope you found this article helpful!


What is Problem Formulation in Machine Learning and Top 4 examples of Problem Formulation in Machine Learning?

Machine Learning (ML) is a field of Artificial Intelligence (AI) that enables computers to learn from data, without being explicitly programmed. Machine learning algorithms build models based on sample data, known as “training data”, in order to make predictions or decisions, rather than following rules written by humans. Machine learning is closely related to and often overlaps with computational statistics, a discipline that also focuses on prediction-making through the use of computers. Machine learning can be applied in a wide variety of domains, such as medical diagnosis, stock trading, robot control, manufacturing and more.

Problem Formulation in Machine Learning

The process of machine learning consists of several steps: first, data is collected; then, a model is selected or created; finally, the model is trained on the collected data and then applied to new data. This process is often referred to as the “machine learning pipeline”. Problem formulation is the second step in this pipeline and it consists of selecting or creating a suitable model for the task at hand and determining how to represent the collected data so that it can be used by the selected model. In other words, problem formulation is the process of taking a real-world problem and translating it into a format that can be solved by a machine learning algorithm.


There are many different types of machine learning problems, such as classification, regression, prediction and so on. The choice of which type of problem to formulate depends on the nature of the task at hand and the type of data available. For example, if we want to build a system that can automatically detect fraudulent credit card transactions, we would formulate a classification problem. On the other hand, if our goal is to predict the sale price of houses given information about their size, location and age, we would formulate a regression problem. In general, it is best to start with a simple problem formulation and then move on to more complex ones if needed.

Some common examples of problem formulations in machine learning are:
Classification: given an input data point (e.g., an image), predict its category label (e.g., dog vs cat).
Regression: given an input data point (e.g., size and location of a house), predict a continuous output value (e.g., sale price).
Prediction: given an input sequence (e.g., a series of past stock prices), predict the next value in the sequence (e.g., future stock price).
Anomaly detection: given an input data point (e.g., transaction details), decide whether it is normal or anomalous (i.e., fraudulent).
Recommendation: given information about users (e.g., age and gender) and items (e.g., books and movies), recommend items to users (e.g., suggest books for someone who likes romance novels).
Optimization: given a set of constraints (e.g., budget) and objectives (e.g., maximize profit), find the best solution (e.g., product mix).
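To make the difference between two of these formulations concrete, here is a minimal sketch that frames the same house records first as a regression problem and then as a classification problem. The records, the 400,000 price threshold, and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical toy records: (square footage, sale price in dollars).
houses = [(850, 210_000), (1400, 340_000), (2100, 515_000), (3000, 760_000)]


def regression_targets(records):
    """Regression formulation: the label is the continuous sale price."""
    return [float(price) for _, price in records]


def classification_targets(records, threshold=400_000):
    """Classification formulation: the label is a category (1 = expensive)."""
    return [1 if price >= threshold else 0 for _, price in records]


# The input features are identical in both formulations; only the label changes.
features = [[sqft] for sqft, _ in houses]
y_reg = regression_targets(houses)
y_clf = classification_targets(houses)

print(y_reg)
print(y_clf)
```

The same raw data supports either formulation; which one is appropriate depends on the question the business actually needs answered.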


Problem Formulation: What this pipeline phase entails and why it’s important

The problem formulation phase of the ML Pipeline is critical, and it’s where everything begins. Typically, this phase is kicked off with a question of some kind. Examples of these kinds of questions include: Could cars really drive themselves?  What additional product should we offer someone as they checkout? How much storage will clients need from a data center at a given time?

The problem formulation phase starts by seeing a problem and thinking “what question, if I could answer it, would provide the most value to my business?” If I knew the next product a customer was going to buy, is that most valuable? If I knew what was going to be popular over the holidays, is that most valuable? If I better understood who my customers are, is that most valuable?

However, some problems are not so obvious. When sales drop, new competitors emerge, or there’s a big change to a company/team/org, it can be easy to say, “I see the problem!” But sometimes the problem isn’t so clear. Consider self-driving cars. How many people think to themselves, “driving cars is a huge problem”? Probably not many. In fact, there isn’t a problem in the traditional sense of the word but there is an opportunity. Creating self-driving cars is a huge opportunity. That doesn’t mean there isn’t a problem or challenge connected to that opportunity. How do you design a self-driving system? What data would you look at to inform the decisions you make? Will people purchase self-driving cars?

Part of the problem formulation phase includes seeing where there are opportunities to use machine learning.


In the following practice examples, you are presented with four different business scenarios. For each scenario, consider the following questions:

  1. Is machine learning appropriate for this problem, and why or why not?
  2. What is the ML problem if there is one, and what would a success metric look like?
  3. What kind of ML problem is this?
  4. Is the data appropriate?

The solutions given in this article represent just one of the many ways you can formulate each business problem.


I)  Amazon recently began advertising to its customers when they visit the company website. The Director in charge of the initiative wants the advertisements to be as tailored to the customer as possible. You will have access to all the data from the retail webpage, as well as all the customer data.

  1. ML is appropriate because of the scale, variety and speed required. There are potentially thousands of ads and millions of customers who need to be served customized ads immediately as they arrive at the site.
  2. The problem is ads that are not useful to customers are a wasted opportunity and a nuisance to customers, yet not serving ads at all is a wasted opportunity. So how does Amazon serve the most relevant advertisements to its retail customers?
    1. Success would be the purchase of a product that was advertised.
  3. This is a supervised learning problem because we have a labeled data point, our success metric, which is the purchase of a product.
  4. This data is appropriate because it is both the retail webpage data as well as the customer data.

II) You’re a Senior Business Analyst at a social media company that focuses on streaming. Streamers use a combination of hashtags and predefined categories to be discoverable by your platform’s consumers. You ran an analysis on unique streamer counts by hashtags and categories over the last month and found that out of tens of thousands of streamers, almost all use only 40 hashtags and 10 categories despite innumerable hashtags and hundreds of categories. You presume the predefined categories don’t represent all the possibilities very well, and that streamers are simply picking the closest fit. You figure there are likely many categories and groupings of streamers that are not accounted for. So you collect a dataset that consists of all streamer profile descriptions (all text), all the historical chat information for each streamer, and all their videos that have been streamed.

  1. ML is appropriate because of the scale and variability.
  2. The problem is the content of streamers is not being represented by the existing categories. Success would be naturally grouping the streamers into categories based on content and seeing if those align with the hashtags and categories that are being commonly used.  If they do not, then the streamers are not being well represented and you can use these groupings to create new categories.
  3. There isn’t a specific outcome variable. There’s no target or label. So this is an unsupervised problem.
  4. The data is appropriate.

III) You’re a headphone manufacturer who sells directly to big and small electronic stores. As an attempt to increase competitive pricing, Store 1 and Store 2 decided to put together the pricing details for all headphone manufacturers and their products (about 350 products) and conduct daily releases of the data. You will have all the specs from each manufacturer and their product’s pricing. Your sales have recently been dropping so your first concern is whether there are competing products that are priced lower than your flagship product.

  1. ML is probably not necessary for this. You can just search the dataset to see which headphones are priced lower than the flagship, then compare their features and build quality.
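The kind of search described here needs nothing more than a filter and a sort. The sketch below uses made-up rows and brand names (all hypothetical) to stand in for the daily pricing release.

```python
# Hypothetical rows from the shared pricing release: (manufacturer, product, price).
products = [
    ("OurBrand", "Flagship X", 199.0),
    ("RivalA", "Studio 7", 179.0),
    ("RivalB", "Bass Pro", 249.0),
    ("RivalC", "Lite 2", 149.0),
]

# Look up the price of our own flagship product.
flagship_price = next(
    price for maker, name, price in products
    if maker == "OurBrand" and name == "Flagship X"
)

# Keep competing products priced below the flagship, cheapest first.
cheaper = sorted(
    (row for row in products if row[0] != "OurBrand" and row[2] < flagship_price),
    key=lambda row: row[2],
)

for maker, name, price in cheaper:
    print(f"{maker} {name}: {price:.2f}")
```

A plain query like this answers the question directly, which is why no model is required for this scenario.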

IV) You’re a Senior Product Manager at a leading ridesharing company. You did some market research, collected customer feedback, and discovered that both customers and drivers are not happy with an app feature. This feature allows customers to place a pin exactly where they want to be picked up. The customers say drivers rarely stop at the pin location. Drivers say customers most often put the pin in a place they can’t stop. Your company has a relationship with the most used maps app for the driver’s navigation so you leverage this existing relationship to get direct, backend access to their data. This includes latitude and longitude, visual photos of each lat/long, traffic delay details, and regulation data if available (i.e., No Parking zones, 3-minute parking zones, fire hydrants, etc.).

  1. ML is appropriate because of the scale and automation involved. It’s not feasible to drive everywhere and write down all the places that are ok for pickup. However, maybe we can predict whether a location is ok for pickup.
  2. The problem is drivers and customers are having poor experiences connecting for pickup, which is pushing customers away from the platform.
    1. Success would be properly identifying appropriate pickup locations so they can be integrated into the feature.
  3. This is a supervised learning problem even though there aren’t any labels, yet. Someone will have to go through a sample of the data to label where there are ok places to park and not park, giving the algorithms some target information.
  4. The data is appropriate once a sample of the dataset has been labeled. There may be some other data that could be included too. What about asking UPS for driver stop information? Where do they stop?

In conclusion, problem formulation is an important step in the machine learning pipeline that should not be overlooked or underestimated. It can make or break a machine learning project; therefore, it is important to take care when formulating machine learning problems.


Step by Step Solution to a Machine Learning Problem – Feature Engineering

Feature Engineering is the act of reshaping and curating existing data to make patterns more apparent. This process makes the data easier for an ML model to understand. Using knowledge of the data, features are engineered and tuned to make ML algorithms work more efficiently.

 

For this problem, imagine a scenario where you are running a real estate brokerage and you want to predict the selling price of a house. Using a specific county dataset and simple information (like the location, total square footage, and number of bedrooms), let’s practice training a baseline model, conducting feature engineering, and tuning a model to make a prediction.


First, load the dataset and take a look at its basic properties.

# Load the dataset
import pandas as pd
import boto3

df = pd.read_csv("xxxxx_data_2.csv")
df.head()

Output:

[Figure: first rows of the housing dataset, as displayed by df.head()]

This dataset has 21 columns:

  • id – Unique id number
  • date – Date of the house sale
  • price – Price the house sold for
  • bedrooms – Number of bedrooms
  • bathrooms – Number of bathrooms
  • sqft_living – Number of square feet of the living space
  • sqft_lot – Number of square feet of the lot
  • floors – Number of floors in the house
  • waterfront – Whether the home is on the waterfront
  • view – Number of lot sides with a view
  • condition – Condition of the house
  • grade – Classification by construction quality
  • sqft_above – Number of square feet above ground
  • sqft_basement – Number of square feet below ground
  • yr_built – Year built
  • yr_renovated – Year renovated
  • zipcode – ZIP code
  • lat – Latitude
  • long – Longitude
  • sqft_living15 – Number of square feet of living space in 2015 (can differ from sqft_living in the case of recent renovations)
  • sqft_lot15 – Number of square feet of lot space in 2015 (can differ from sqft_lot in the case of recent renovations)

This dataset is rich and provides a fantastic playground for the exploration of feature engineering. This exercise will focus on a small number of columns. If you are interested, you could return to this dataset later to practice feature engineering on the remaining columns.

A baseline model

Now, let’s train a baseline model.

People often look at square footage first when evaluating a home. We will do the same in our model and ask how well the cost of the house can be approximated based on this number alone. We will train a simple linear learner model (documentation) and use it as the point of comparison after finishing the feature engineering.

import sagemaker
import numpy as np
from sklearn.model_selection import train_test_split
import time

t1 = time.time()  # record the start time

# Use sqft_living as the only feature and price as the label
ys = np.array(df['price']).astype("float32")
xs = np.array(df['sqft_living']).astype("float32").reshape(-1, 1)

# Split into training (80%), validation (10%), and test (10%) sets
np.random.seed(8675309)
train_features, test_features, train_labels, test_labels = train_test_split(xs, ys, test_size=0.2)
val_features, test_features, val_labels, test_labels = train_test_split(test_features, test_labels, test_size=0.5)

# Train model
linear_model = sagemaker.LinearLearner(role=sagemaker.get_execution_role(),
                                       instance_count=1,
                                       instance_type='ml.m4.xlarge',
                                       predictor_type='regressor')

train_records = linear_model.record_set(train_features, train_labels, channel='train')
val_records = linear_model.record_set(val_features, val_labels, channel='validation')
test_records = linear_model.record_set(test_features, test_labels, channel='test')

linear_model.fit([train_records, val_records, test_records], logs=False)

# Retrieve the test metrics reported by the training job
sagemaker.analytics.TrainingJobAnalytics(linear_model._current_job_name, metric_names=['test:mse', 'test:absolute_loss']).dataframe()

 

If you examine the quality metrics, you will see that the absolute loss is about $175,000.00. This tells us that the model is able to predict within an average of $175k of the true price. For a model based upon a single variable, this is not bad. Let’s try to do some feature engineering to improve on it.
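To see what an average error of this kind measures, the following sketch computes a mean absolute error by hand. It assumes the reported absolute loss is the mean of the absolute differences between predicted and actual prices; the prices below are invented for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all examples."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


actual = [300_000, 450_000, 800_000]     # hypothetical sale prices
predicted = [350_000, 400_000, 600_000]  # hypothetical model outputs

# Errors are 50,000, 50,000 and 200,000, so the average is 100,000.
mae = mean_absolute_error(actual, predicted)
print(mae)
```

Because the errors are not squared, a single badly mispriced house affects this metric less than it affects the mean squared error.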

Throughout the following work, we will constantly be adding to a dataframe called encoded. You will start by populating encoded with just the square footage you used previously.

 


encoded = df[['sqft_living']].copy()

Categorical variables

Let’s start by including some categorical variables, beginning with simple binary variables.

The dataset has the waterfront feature, which is a binary variable. We should change the encoding from 'Y' and 'N' to 1 and 0. This can be done using the map function (documentation) provided by Pandas. It expects either a function to apply to that column or a dictionary to look up the correct transformation.

Binary categorical

Let’s write code to transform the waterfront variable into binary values. The skeleton has been provided below.

encoded['waterfront'] = df['waterfront'].map({'Y': 1, 'N': 0})

You can also encode multi-class categorical variables. Look at the condition column, which gives a score of the quality of the house. Looking into the data source shows that the condition can be thought of as an ordinal categorical variable, so it makes sense to encode it in a way that preserves the order.

Ordinal categorical

Using the same method as for the binary variable above, encode the ordinal categorical variable condition into the numerical range of 1 through 5.

encoded['condition'] = df['condition'].map({'Poor': 1, 'Fair': 2, 'Average': 3, 'Good': 4, 'Very Good': 5})
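One thing to watch with dictionary-based map is that any value missing from the dictionary silently becomes NaN. The sketch below, using a few hypothetical raw values (including a deliberately misspelled one), shows a quick way to list whatever failed to map.

```python
import pandas as pd

condition_order = {"Poor": 1, "Fair": 2, "Average": 3, "Good": 4, "Very Good": 5}

# Hypothetical raw values; "very good" differs in case and will not match.
raw = pd.Series(["Good", "Average", "very good", "Poor"])

mapped = raw.map(condition_order)

# Values absent from the dictionary come back as NaN, so list them explicitly.
unmapped = raw[mapped.isna()].unique().tolist()
print(unmapped)
```

Running a check like this after each encoding step helps catch spelling or casing differences before they turn into missing values in the training data.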

A slightly more complex categorical variable is ZIP code. If you have worked with geospatial data, you may know that the full ZIP code is often too fine-grained to use as a feature on its own. However, there are only 70 unique ZIP codes in this dataset, so we may use them.

However, we do not want to use unencoded ZIP codes. There is no reason that a larger ZIP code should correspond to a higher or lower price, but it is likely that particular ZIP codes would. This is the perfect case to perform one-hot encoding. You can use the get_dummies function (documentation) from Pandas to do this.

Nominal categorical

Using the Pandas get_dummies function,  add columns to one-hot encode the ZIP code and add it to the dataset.

encoded = pd.concat([encoded, pd.get_dummies(df['zipcode'])], axis=1)

In this way, you may freely encode whatever categorical variables you wish. Be aware that for categorical variables with many categories, something will need to be done to reduce the number of columns created.
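One common way to limit the number of columns is to merge infrequent categories into a single bucket before one-hot encoding. The sketch below is one possible approach, not something prescribed by this exercise; the ZIP codes, the min_count threshold, and the "other" label are all illustrative assumptions.

```python
import pandas as pd

# Hypothetical ZIP codes; two of them appear only once.
zips = pd.Series(["98001", "98001", "98001", "98002", "98002", "98003", "98004"])


def group_rare(values, min_count=2, other_label="other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = values.value_counts()
    frequent = counts[counts >= min_count].index
    # Keep frequent categories as-is and substitute the label everywhere else.
    return values.where(values.isin(frequent), other_label)


grouped = group_rare(zips)
one_hot = pd.get_dummies(grouped, prefix="zip")
print(sorted(one_hot.columns))
```

Here four distinct ZIP codes collapse into three columns; on a column with thousands of categories the reduction can be much larger.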

One additional technique, which is simple but can be highly successful, involves turning the ZIP code into a single numerical column by creating a single feature that is the average price of a home in that ZIP code. This is called target encoding.

To do this, use groupby (documentation) and mean (documentation) to first group the rows of the DataFrame by ZIP code and then take the mean of each group. The resulting object can be mapped over the ZIP code column to encode the feature.

Nominal categorical II

Complete the following code snippet to provide a target encoding for the ZIP code.

means = df.groupby('zipcode')['price'].mean()
encoded['zip_mean'] = df['zipcode'].map(means)

Normally, you only either one-hot encode or target encode. For this exercise, leave both in. In practice, you should try both, see which one performs better on a validation set, and then use that method.
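When comparing encodings on a validation set, one caveat is that a target encoding computed over all rows lets validation prices leak into the feature. A possible variation, sketched here with hypothetical data and column values, is to learn the per-ZIP means from the training rows only and fall back to the overall training mean for ZIP codes not seen in training.

```python
import pandas as pd

# Hypothetical training and validation rows.
train = pd.DataFrame({
    "zipcode": ["98001", "98001", "98002", "98002"],
    "price": [300_000.0, 500_000.0, 700_000.0, 900_000.0],
})
valid = pd.DataFrame({"zipcode": ["98001", "98002", "98003"]})

# Learn the per-ZIP mean price from the training rows only.
zip_means = train.groupby("zipcode")["price"].mean()
global_mean = train["price"].mean()

train["zip_mean"] = train["zipcode"].map(zip_means)
# ZIP codes unseen in training fall back to the overall training mean.
valid["zip_mean"] = valid["zipcode"].map(zip_means).fillna(global_mean)

print(valid)
```

This is a sketch of one leakage-aware variant rather than the method used in the exercise above, which computes the means over the full DataFrame for simplicity.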

Scaling

Take a look at the dataset. Print a summary of the encoded dataset using describe (documentation).

encoded.describe()

[Figure: summary of the encoded dataset produced by describe]

One column ranges from 290 to 13540 (sqft_living), another column ranges from 1 to 5 (condition), 71 columns are all either 0 or 1 (one-hot encoded ZIP code), and then the final column ranges from a few hundred thousand to a couple million (zip_mean).

In a linear model, these will not be on equal footing. The sqft_living column will be approximately 13000 times easier for the model to find a pattern in than the other columns. To solve this, you often want to scale features to a standardized range. In this case, you will scale sqft_living to lie between 0 and 1.

Feature scaling

Fill in the code skeleton below to scale the column of the DataFrame to be between 0 and 1.

# Min-max scale sqft_living to the range [0, 1]
sqft_min = encoded['sqft_living'].min()
sqft_max = encoded['sqft_living'].max()
encoded['sqft_living'] = encoded['sqft_living'].map(lambda x: (x - sqft_min) / (sqft_max - sqft_min))

# Min-max scale condition to the range [0, 1]
cond_min = encoded['condition'].min()
cond_max = encoded['condition'].max()
encoded['condition'] = encoded['condition'].map(lambda x: (x - cond_min) / (cond_max - cond_min))


What are some good datasets for Data Science and Machine Learning?


Finding good datasets for Data Science and Machine Learning can be a challenge. There are a lot of datasets out there, but not all of them are good for machine learning. In order to find a good dataset, you need to consider what you want to use the dataset for. If you want to use the dataset for training a machine learning model, then you need to make sure that the dataset is representative of the real-world data that you want to use the model on.


The dataset should also be large enough to train a robust model. Another important consideration is whether or not the dataset is open source. Open source datasets are typically better because they have been vetted by the community and are more likely to be of high quality. However, open source datasets can also be more difficult to find. A good place to start looking for datasets is on websites like Kaggle and UC Irvine Machine Learning Repository. These websites contain a variety of high-quality datasets that are free to download and use.


Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University released a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They also released two machine learning models that represent different approaches to the problem. The benchmark relies on testing techniques that psychologists use to study infants’ behavior, with the aim of accelerating the development of AI that exhibits common sense.


Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works.”

Here’s a good article about this topic


Earth’s population reaches 8 billion


The most used words on every country’s Wikipedia Page


Who works from home in 2022? Rates by industry 


Google Dataset Search


Malware traffic dataset

Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs were generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state.

Author: Here


Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”


Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.

Link1


Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience


Source: LinkedIn data  (see original post)


Tool: Photoshop from my colleague


 
 

The Biggest Source of Power in Every US and Canadian State and Province 


Top 10 largest oil fields by 2021 production


Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.


Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

The Largest Entertainment Streaming Companies

The F word in Popular Movies

r/dataisbeautiful - [OC] The F word in Popular Movies

The easiest words to rhyme – Words that have the most rhymes


Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)


aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0


Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

Suicide rate among countries with the highest Human Development Index


NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link
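For readers unfamiliar with TF/IDF word vectors, the sketch below shows one common variant in Python: term frequency normalized by document length, multiplied by the log of inverse document frequency. The tiny corpus and vocabulary are invented, and the exact weighting used to build the PubMed Diabetes vectors is not stated here, so treat this as an illustration of the idea rather than the dataset's recipe.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary):
    """Return one TF-IDF vector per tokenized document, ordered by `vocabulary`."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vector = []
        for word in vocabulary:
            # Term frequency, normalized by document length.
            tf = counts[word] / len(doc) if doc else 0.0
            # Inverse document frequency; words seen nowhere get weight 0.
            idf = math.log(n_docs / df[word]) if df[word] else 0.0
            vector.append(tf * idf)
        vectors.append(vector)
    return vectors


# Hypothetical toy corpus: three "publications" already split into words.
docs = [
    ["insulin", "glucose", "insulin"],
    ["glucose", "retinopathy"],
    ["insulin", "therapy"],
]
vocabulary = ["insulin", "glucose", "retinopathy", "therapy"]
vectors = tfidf_vectors(docs, vocabulary)

for vector in vectors:
    print([round(weight, 3) for weight in vector])
```

In the real dataset the vocabulary would be the 500-word dictionary, so each publication becomes a 500-dimensional vector.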

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge are available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; they just want to make sure that the data are correctly understood before lots of resources are spent.

Amazon Omics

Store, query, analyze, and generate insights from genomic and other omics data.


Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

 
 
Behshad Behzadi on LinkedIn: Partnering with iCAD to improve breast cancer screening
 

From AI Research to Real-world Clinical Practice:
After a pivotal moment in 2020, when a retrospective study showed our AI technology performed better than radiologists at identifying signs of breast cancer, an important new milestone has been reached today: Google Health announces our first commercial agreement to license our mammography AI research model for integration into real-world clinical practice.

This can make healthcare AI more accessible and eventually save more lives.

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users, for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels (very low, low, high, and very high) for each of the affective states, which are crowd-annotated and correlated with a gold-standard annotation created by a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times and reasons for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meteorological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two-letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

OpenFlights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe.

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, including historical data; they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset of arrests in the US, broken down by race and by state. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University released a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They also released two machine learning models that represent different approaches to the problem. The work relies on testing techniques psychologists use to study infants’ behavior, with the aim of accelerating the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works.”

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state.

Author: Here

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”
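The rule behind these examples can be written down in a few lines of Python. This is only a sketch of the definition as stated above: the `BLOC` lookup table is hypothetical and lists just the places used in the examples, not the full membership of either bloc.

```python
EU = "EU"
US = "US"

# Hypothetical lookup from a place to the bloc it belongs to.
# None marks a place that is in neither the EU nor the US.
BLOC = {
    "Spain": EU,
    "France": EU,
    "Florida": US,
    "Texas": US,
    "Turkey": None,
    "Mexico": None,
}


def is_foreign_born(born_in, lives_in):
    """True if the person was born outside the bloc (EU or US) they live in."""
    residence_bloc = BLOC[lives_in]
    if residence_bloc is None:
        # The chart only covers residents of EU countries and US states.
        raise ValueError(f"{lives_in} is not an EU country or US state")
    return BLOC[born_in] != residence_bloc


examples = [
    ("Spain", "France"),
    ("Turkey", "France"),
    ("Florida", "Texas"),
    ("Mexico", "Texas"),
    ("Florida", "France"),
    ("France", "Florida"),
]
for born_in, lives_in in examples:
    label = "foreign-born" if is_foreign_born(born_in, lives_in) else "NOT foreign-born"
    print(f"Born in {born_in}, living in {lives_in}: {label}")
```

Comparing blocs rather than individual countries or states is what makes a move within the EU, or within the US, not count.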

🇺🇸🇪🇺🗺️

Note: Figures for Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data; Croatia has no data at all.

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience


Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs: 11K+ rows and 30+ attributes of Netflix titles (ratings, earnings, actors, language, availability, movie trailers, and many more).

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6,229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.


The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times and reasons for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
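Having both scheduled and actual times makes it possible to derive a delay for each flight. The sketch below shows that arithmetic in Python. The function name, the timestamp format, and the sample values are made up for illustration and are not the actual BTS column names or layout.

```python
from datetime import datetime

# Assumed timestamp layout for this sketch only; the real BTS files may differ.
TIME_FORMAT = "%Y-%m-%d %H:%M"


def delay_minutes(scheduled: str, actual: str) -> int:
    """Return actual minus scheduled, in whole minutes (negative means early)."""
    delta = datetime.strptime(actual, TIME_FORMAT) - datetime.strptime(scheduled, TIME_FORMAT)
    return int(delta.total_seconds() // 60)


# Made-up example: a departure scheduled for 09:15 that left at 09:42.
late = delay_minutes("2021-04-01 09:15", "2021-04-01 09:42")
early = delay_minutes("2021-04-01 09:15", "2021-04-01 09:10")
print(late, early)
```

A negative result is kept rather than clipped to zero, so early departures stay distinguishable from on-time ones.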

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meteorological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two-letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

OpenFlights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe.

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
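Once downloaded, a file like airports.dat can be read with Python's standard csv module. The sketch below assumes a comma-separated layout with no header row, the column order shown, and \N as the marker for missing values. Verify these assumptions against the OpenFlights documentation before relying on them. The sample row is made up.

```python
import csv
import io

# Assumed column order (not confirmed by this article); the names are our own labels.
COLUMNS = [
    "airport_id", "name", "city", "country", "iata", "icao",
    "latitude", "longitude", "altitude_ft", "utc_offset", "dst",
    "tz_name", "type", "source",
]


def parse_airports(text):
    """Parse airports.dat-style CSV text into a list of dictionaries."""
    airports = []
    for record in csv.reader(io.StringIO(text)):
        row = dict(zip(COLUMNS, record))
        # Treat the assumed missing-value marker \N as None.
        row = {key: (None if value == r"\N" else value) for key, value in row.items()}
        row["latitude"] = float(row["latitude"])
        row["longitude"] = float(row["longitude"])
        airports.append(row)
    return airports


# Made-up sample row with a missing ICAO code.
SAMPLE = r'1,"Example Airport","Example City","Exampleland","EXA",\N,-6.08,145.39,5282,10,"U","Pacific/Port_Moresby","airport","Made-up"'

airports = parse_airports(SAMPLE)
print(airports[0]["name"], airports[0]["iata"], airports[0]["icao"])
```

For a real file, the same function could be fed the text read from disk, for example via open("airports.dat", encoding="utf-8").read().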

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, including historical data, and they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset of arrests in the US, broken down by race and by state. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University released a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They also released two machine learning models that represent different approaches to the problem. The work relies on testing techniques that psychologists use to study infants’ behavior, with the aim of accelerating the development of AI that exhibits common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works.”

Here’s a good article about this topic

Google Dataset Search

Malware traffic dataset

Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state 🇺🇸🇪🇺

Author: Here

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”
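The examples above follow one rule: a person counts as “foreign-born” when their birthplace lies outside the union (EU or US) in which they now live. A minimal Python sketch of that rule follows. The EU and US sets are small illustrative subsets covering only the places named in the examples, and the function names are our own.

```python
# Illustrative subsets only; a real analysis would need the full membership lists.
EU_COUNTRIES = {"Spain", "France"}
US_STATES = {"Florida", "Texas"}


def union_of(place):
    """Return "EU", "US", or None for places outside both unions."""
    if place in EU_COUNTRIES:
        return "EU"
    if place in US_STATES:
        return "US"
    return None


def is_foreign_born(birthplace, residence):
    """Apply the map's definition: born outside the union of residence."""
    home_union = union_of(residence)
    if home_union is None:
        raise ValueError(f"{residence!r} is not in the EU or US subsets used here")
    return union_of(birthplace) != home_union


for born, lives in [("Spain", "France"), ("Turkey", "France"), ("Florida", "Texas"),
                    ("Mexico", "Texas"), ("Florida", "France"), ("France", "Florida")]:
    print(born, "->", lives, ":", is_foreign_born(born, lives))
```

The comparison is made at the level of the union, not the individual country or state, which is why a move from Spain to France or from Florida to Texas does not count.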

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6,229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
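To make the idea of an answer “span” concrete, here is a small Python sketch. It assumes a SQuAD-style record with "context", "answers" (each holding "text" and "answer_start"), and an "is_impossible" flag for unanswerable questions. Treat those field names as an assumption to check against the dataset files. The passage and question are invented.

```python
def answer_span(record):
    """Return the answer text sliced out of the context, or None if unanswerable."""
    if record.get("is_impossible") or not record["answers"]:
        return None
    answer = record["answers"][0]
    start = answer["answer_start"]
    end = start + len(answer["text"])
    span = record["context"][start:end]
    # The stored answer text should match the slice taken from the passage.
    if span != answer["text"]:
        raise ValueError("answer_start does not point at the answer text")
    return span


# Made-up example record (not taken from SQuAD itself).
context = "The Example River flows through the town of Sampleville."
record = {
    "context": context,
    "question": "Which town does the Example River flow through?",
    "answers": [{"text": "Sampleville", "answer_start": context.index("Sampleville")}],
    "is_impossible": False,
}
unanswerable = {"context": context, "question": "Who founded Sampleville?",
                "answers": [], "is_impossible": True}

print(answer_span(record))
print(answer_span(unanswerable))
```

Because every answer is a slice of the passage, a model can be trained to predict two positions (start and end) rather than to generate free text.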

PubMed Diabetes Dataset

The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.
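For readers unfamiliar with TF/IDF weighting, the sketch below computes one common variant in plain Python: term frequency (count divided by document length) multiplied by inverse document frequency (log of the number of documents over the number containing the word). Many variants exist, and the exact weighting used for this dataset is not stated here, so this illustrates the idea only. The tiny corpus and vocabulary are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary):
    """Return one TF-IDF vector per document, ordered like `vocabulary`."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vector = []
        for word in vocabulary:
            tf = counts[word] / len(doc) if doc else 0.0
            idf = math.log(n_docs / doc_freq[word]) if doc_freq[word] else 0.0
            vector.append(tf * idf)
        vectors.append(vector)
    return vectors


# Made-up, pre-tokenized documents and a four-word dictionary.
docs = [["insulin", "glucose", "insulin"], ["glucose", "diet"], ["diet", "exercise"]]
vocabulary = ["insulin", "glucose", "diet", "exercise"]

for vector in tfidf_vectors(docs, vocabulary):
    print([round(weight, 3) for weight in vector])
```

In the dataset described above, the dictionary would have 500 entries, so each publication would map to a 500-dimensional vector.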

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works.”

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs were generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state. 🇺🇸🇪🇺

Author: Here

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”
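The definition above can be sketched as a small Python function. This is only an illustration of the rule as stated: the two sets are deliberately incomplete placeholders containing just the places named in the examples, and the helper names are hypothetical.

```python
# Deliberately incomplete, illustrative sets; a real analysis would list
# every EU member state and every US state.
EU_COUNTRIES = {"Spain", "France"}
US_STATES = {"Florida", "Texas"}


def region_of(place):
    """Collapse EU countries into 'EU' and US states into 'US'."""
    if place in EU_COUNTRIES:
        return "EU"
    if place in US_STATES:
        return "US"
    return place  # any other place is treated as its own region


def is_foreign_born(birthplace, residence):
    """A person is 'foreign-born' when born outside the region they live in."""
    return region_of(birthplace) != region_of(residence)


examples = [
    ("Spain", "France"),
    ("Turkey", "France"),
    ("Florida", "Texas"),
    ("Mexico", "Texas"),
    ("Florida", "France"),
    ("France", "Florida"),
]
results = [is_foreign_born(born, lives) for born, lives in examples]
```

The results list follows the same order as the bullet examples, so it can be compared against them line by line.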

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. Contains 11K+ rows and 30+ attributes of Netflix titles (ratings, earnings, actors, language, availability, movie trailers, and many more).

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6,229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
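To make the idea of a span answer concrete, here is a minimal Python sketch. The field names (context, qas, answers, text, answer_start, is_impossible) follow the SQuAD 2.0 JSON layout as commonly documented and should be treated as an assumption; the passage and questions are invented for illustration.

```python
def answer_span(context, answer):
    """Return (start, end) character offsets of an answer, or None.

    The span is valid only if slicing the context at the stated offset
    reproduces the answer text exactly.
    """
    start = answer["answer_start"]
    end = start + len(answer["text"])
    if context[start:end] != answer["text"]:
        return None
    return start, end


# Invented example in the assumed SQuAD-style layout.
paragraph = {
    "context": "Machine learning is a field of artificial intelligence.",
    "qas": [
        {
            "question": "What is machine learning a field of?",
            "is_impossible": False,
            "answers": [{"text": "artificial intelligence", "answer_start": 31}],
        },
        {
            # Unanswerable questions carry no answer spans.
            "question": "Who invented machine learning?",
            "is_impossible": True,
            "answers": [],
        },
    ],
}

spans = [
    [answer_span(paragraph["context"], a) for a in qa["answers"]]
    for qa in paragraph["qas"]
]
```

A check like this is a cheap sanity test after any preprocessing that might shift character offsets, such as whitespace normalization.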

PubMed Diabetes Dataset

The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.
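As a rough illustration of what a TF/IDF weighted word vector looks like, here is a minimal Python sketch. It uses the textbook weighting (term frequency times log of inverse document frequency); the exact variant used to build the dataset is not stated here, so this should not be read as the dataset's method. The documents and three-word vocabulary are invented.

```python
import math
from collections import Counter


def tfidf_vectors(documents, vocabulary):
    """Return one TF-IDF vector per document, ordered like `vocabulary`."""
    n_docs = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents each word appears.
    doc_freq = {
        word: sum(1 for tokens in tokenized if word in tokens)
        for word in vocabulary
    }
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty documents
        vector = []
        for word in vocabulary:
            tf = counts[word] / total
            idf = math.log(n_docs / doc_freq[word]) if doc_freq[word] else 0.0
            vector.append(tf * idf)
        vectors.append(vector)
    return vectors


# Invented toy corpus; the real dictionary has 500 words.
docs = ["insulin glucose insulin", "glucose diet"]
vocab = ["insulin", "glucose", "diet"]
vectors = tfidf_vectors(docs, vocab)
```

Note that a word appearing in every document, such as "glucose" here, gets a weight of zero under this variant because its inverse document frequency is log(1).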

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge are available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; they just want to make sure that there is a correct understanding of their data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels (very low, low, high, and very high) for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.
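One way to picture the multi-label structure is to encode each snippet as four level indices, one per affective state. The Python sketch below is only an illustration: the integer encoding (0 for very low up to 3 for very high) and the example snippet are assumptions made here, not the dataset's documented file format.

```python
# State and level names come from the description above; the index-based
# encoding is an assumption made for this sketch.
STATES = ("boredom", "confusion", "engagement", "frustration")
LEVELS = ("very low", "low", "high", "very high")


def encode_labels(labels):
    """Map {state: level name} to a tuple of level indices in STATES order."""
    return tuple(LEVELS.index(labels[state]) for state in STATES)


# Invented annotation for a single video snippet.
snippet = {
    "boredom": "low",
    "confusion": "very low",
    "engagement": "high",
    "frustration": "very low",
}
encoded = encode_labels(snippet)
```

Each snippet carries all four labels at once, which is what distinguishes this setup from ordinary single-label classification.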

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6,229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
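
To make the "answer is a span of the passage" idea concrete, here is a minimal sketch. The record below is invented for illustration, and the field names (context, qas, answers, answer_start, is_impossible) are an assumption about the SQuAD 2.0 JSON layout that should be checked against the actual download.

```python
# Illustrative record only; not taken from the real dataset.
record = {
    "context": "Paris is the capital of France.",
    "qas": [
        {
            "id": "q1",
            "question": "What is the capital of France?",
            "is_impossible": False,
            "answers": [{"text": "Paris", "answer_start": 0}],
        },
        {
            "id": "q2",
            "question": "Which country is Paris the capital of?",
            "is_impossible": False,
            "answers": [{"text": "France", "answer_start": 24}],
        },
        {
            "id": "q3",
            "question": "What is the population of Paris?",
            "is_impossible": True,  # unanswerable from this passage
            "answers": [],
        },
    ],
}


def extract_answer(context, qa):
    """Return the answer span sliced out of the context, or None if unanswerable."""
    if qa.get("is_impossible") or not qa["answers"]:
        return None
    answer = qa["answers"][0]
    start = answer["answer_start"]
    span = context[start:start + len(answer["text"])]
    # The stored text should match the characters at that offset.
    if span != answer["text"]:
        raise ValueError(f"offset mismatch for question {qa['id']}")
    return span


for qa in record["qas"]:
    print(qa["question"], "->", extract_answer(record["context"], qa))
```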

PubMed Diabetes Dataset

The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF-IDF weighted word vector from a dictionary of 500 unique words. The README file in the dataset provides more details.

Download Link
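
As a rough illustration of what a TF-IDF weighted word vector looks like, here is a minimal sketch on a toy corpus. The three documents and the four-word dictionary are made up, and the weighting (term frequency times log(N / document frequency)) is one common variant; the dataset's exact scheme is not stated here, so treat this as an assumption.

```python
import math
from collections import Counter

# Toy corpus and dictionary, invented for illustration.
DOCS = [
    "insulin therapy for type 1 diabetes",
    "type 2 diabetes and obesity",
    "insulin resistance in obesity",
]
VOCABULARY = ["insulin", "obesity", "diabetes", "therapy"]


def tfidf_vectors(docs, vocabulary):
    """Return one TF-IDF vector per document, ordered like `vocabulary`."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocabulary}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = []
        for w in vocabulary:
            tf = counts[w] / len(toks) if toks else 0.0
            idf = math.log(n_docs / df[w]) if df[w] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors


vectors = tfidf_vectors(DOCS, VOCABULARY)
for doc, vec in zip(DOCS, vectors):
    print(doc, [round(x, 3) for x in vec])
```

Words that appear in only one document (here "therapy") get a higher weight than words shared across documents, which is the property that makes these vectors useful for classification.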

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge are available as downloads. It is often critical to check with the curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; they just want to make sure that there is a correct understanding of their data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels (very low, low, high, and very high) for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times and reasons for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meteorological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two-letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

OpenFlights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe.

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
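
Once downloaded, airports.dat can be read with a standard CSV parser. The sketch below is a starting point under stated assumptions: the column order in COLUMNS and the sample row are assumed from the OpenFlights documentation as commonly described, so verify both against the file you actually download.

```python
import csv
import io

# Assumed column order for airports.dat (the file has no header row).
COLUMNS = [
    "airport_id", "name", "city", "country", "iata", "icao",
    "latitude", "longitude", "altitude_ft", "utc_offset",
    "dst", "tz", "type", "source",
]

# Illustrative sample row in the assumed layout.
SAMPLE = (
    '507,"London Heathrow Airport","London","United Kingdom","LHR","EGLL",'
    '51.4706,-0.461941,83,0,"E","Europe/London","airport","OurAirports"\n'
)


def parse_airports(text):
    """Parse airports.dat-style CSV text into a list of dictionaries."""
    airports = []
    for row in csv.reader(io.StringIO(text)):
        record = dict(zip(COLUMNS, row))
        # Coordinates are the fields most analyses need as numbers.
        record["latitude"] = float(record["latitude"])
        record["longitude"] = float(record["longitude"])
        airports.append(record)
    return airports


airports = parse_airports(SAMPLE)
print(airports[0]["name"], airports[0]["iata"], airports[0]["latitude"])
```

For the real file, replace SAMPLE with the contents of airports.dat, for example by reading it with open("airports.dat", encoding="utf-8").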

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak to the validity of the data though.

flightradar24.com has lots of data, including historical data; they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset of arrests in the US, broken down by race and by state. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University released a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They also released two machine learning models that represent different approaches to the problem. The work relies on testing techniques psychologists use to study infants’ behavior, with the aim of accelerating the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works.”

Here’s a good article about this topic

Google Dataset Search

Malware traffic dataset

Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state. 🇺🇸🇪🇺

Author: Here

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”
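
The rule behind these examples can be written as a small function. This is a minimal sketch: the two place sets are deliberately tiny and cover only the places named in the examples, and the function names are hypothetical.

```python
# Tiny illustrative sets; a real analysis would list every EU country and US state.
EU_COUNTRIES = {"Spain", "France"}
US_STATES = {"Florida", "Texas"}


def bloc(place):
    """Return 'EU', 'US', or None for places outside both."""
    if place in EU_COUNTRIES:
        return "EU"
    if place in US_STATES:
        return "US"
    return None


def is_foreign_born(birthplace, residence):
    """A person is foreign-born if born outside the bloc they live in."""
    home = bloc(residence)
    if home is None:
        raise ValueError(f"{residence} is not in the EU or US place sets")
    return bloc(birthplace) != home


examples = [
    ("Spain", "France"),
    ("Turkey", "France"),
    ("Florida", "Texas"),
    ("Mexico", "Texas"),
    ("Florida", "France"),
    ("France", "Florida"),
]
for born, lives in examples:
    label = "foreign-born" if is_foreign_born(born, lives) else "NOT foreign-born"
    print(f"Born in {born}, living in {lives}: {label}")
```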

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link
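For readers unfamiliar with TF-IDF weighting, the following is a rough sketch of how such word vectors can be computed. The three toy documents are invented, and the formula is one common textbook variant (term frequency times log inverse document frequency); the exact weighting used to build the PubMed Diabetes vectors is not stated here, so treat this only as an illustration.

```python
import math
from collections import Counter

# Invented toy corpus; the real dataset uses a 500-word dictionary.
docs = [
    "insulin resistance in type 2 diabetes",
    "insulin therapy for type 1 diabetes",
    "retinopathy screening in diabetes patients",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Number of documents in which each word appears at least once.
doc_freq = Counter(w for doc in tokenized for w in set(doc))


def tfidf_vector(tokens):
    """Return one TF-IDF weight per vocabulary word for a single document."""
    counts = Counter(tokens)
    return [
        (counts[w] / len(tokens)) * math.log(n_docs / doc_freq[w])
        for w in vocab
    ]


vectors = [tfidf_vector(doc) for doc in tokenized]
```

A word such as "diabetes" that occurs in every document gets weight zero, while a word confined to one document is weighted up, which is why these vectors are useful features for classifying papers.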

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
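As a rough illustration of predicting interactions from similarities, the sketch below scores a drug-target pair by a similarity-weighted vote of the other drugs' known interactions. This is a simple neighbor-based baseline, not the method of Perlman et al., and the three drugs, two targets, and similarity values are invented.

```python
# Known interactions: rows are drugs A, B, C; columns are targets X, Y (toy data).
interactions = [
    [1, 0],  # drug A binds target X
    [1, 1],  # drug B binds targets X and Y
    [0, 0],  # drug C has no known targets
]

# Hypothetical drug-drug similarity matrix (for example, chemical similarity).
drug_similarity = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]


def score(drug, target):
    """Similarity-weighted average of the other drugs' interactions with the target."""
    weighted_sum = 0.0
    total_similarity = 0.0
    for other, similarity in enumerate(drug_similarity[drug]):
        if other == drug:
            continue  # do not let a drug vote for itself
        weighted_sum += similarity * interactions[other][target]
        total_similarity += similarity
    return weighted_sum / total_similarity if total_similarity else 0.0


# Rank candidate targets for drug C, which has no known interactions.
candidates = sorted(range(2), key=lambda t: score(2, t), reverse=True)
```

With several similarity types, as in this dataset, one option is to compute a score per similarity matrix and combine the scores as features for a classifier.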

Pharmacogenomics Datasets

PharmGKB data and knowledge are available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; they just want to make sure that their data is correctly understood before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users, for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. Each affective state has four label levels (very low, low, high, and very high), which are crowd-annotated and correlated with a gold-standard annotation created by a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, and reasons for delay, reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meteorological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two-letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

OpenFlights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe.

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
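A minimal sketch of reading airports.dat follows. It assumes the layout described in the OpenFlights documentation: a headerless CSV with 14 columns in which missing values are written as `\N`. The field names are chosen here for readability, and the two sample rows are illustrative (the second is entirely made up).

```python
import csv
import io

# Column names are an assumption; the file itself has no header row.
FIELDS = [
    "airport_id", "name", "city", "country", "iata", "icao",
    "latitude", "longitude", "altitude_ft", "utc_offset",
    "dst", "tz", "type", "source",
]

# Illustrative rows in the airports.dat layout; the second row is invented.
SAMPLE = r'''507,"London Heathrow Airport","London","United Kingdom","LHR","EGLL",51.4706,-0.461941,83,0,"E","Europe/London","airport","OurAirports"
9999,"Example Airfield","Exampletown","Exampleland",\N,"XXXX",10.5,20.25,100,\N,\N,\N,"airport","OurAirports"
'''


def parse_airports(text):
    """Parse airports.dat-style text into a list of dicts, mapping \\N to None."""
    rows = []
    for values in csv.reader(io.StringIO(text)):
        if not values:
            continue  # skip blank lines
        row = {k: (None if v == r"\N" else v) for k, v in zip(FIELDS, values)}
        row["latitude"] = float(row["latitude"])
        row["longitude"] = float(row["longitude"])
        rows.append(row)
    return rows


airports = parse_airports(SAMPLE)
```

To use it on the real file, pass the contents of a downloaded airports.dat (read as UTF-8 text) to `parse_airports`.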

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, including historical data; they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset of arrests in the US, broken down by race and by state. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University released a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They also released two machine learning models that represent different approaches to the problem. Both rely on testing techniques that psychologists use to study infants’ behavior, with the aim of accelerating the development of AI that exhibits common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works.”

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state 🇺🇸🇪🇺

Author: Here

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”
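The rule behind these examples can be written as a small function. This is a sketch only: the `UNION_OF` lookup is a hypothetical table covering just the places named in the examples above, not part of the original dataset.

```python
# Hypothetical lookup: which union (EU or US) each example place belongs to.
# None means the place is in neither.
UNION_OF = {
    "Spain": "EU",
    "France": "EU",
    "Florida": "US",
    "Texas": "US",
    "Turkey": None,
    "Mexico": None,
}


def is_foreign_born(birthplace, residence):
    """True when the birthplace lies outside the union (EU or US) of the residence."""
    return UNION_OF[birthplace] != UNION_OF[residence]


examples = [
    ("Spain", "France"),
    ("Turkey", "France"),
    ("Florida", "Texas"),
    ("Mexico", "Texas"),
    ("Florida", "France"),
    ("France", "Florida"),
]
results = [is_foreign_born(born, lives) for born, lives in examples]
```

The key design point is that the comparison happens at the level of the union, not the individual state or country, so moving within the EU or within the US never counts as foreign-born.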

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6,229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

  • Node.js – Async non-blocking event-driven JavaScript runtime built on Chrome’s V8 JavaScript engine.
  • Frontend Development
  • iOS – Mobile operating system for Apple phones and tablets.
  • Android – Mobile operating system developed by Google.
  • IoT & Hybrid Apps
  • Electron – Cross-platform native desktop apps using JavaScript/HTML/CSS.
  • Cordova – JavaScript API for hybrid apps.
  • React Native – JavaScript framework for writing natively rendering mobile apps for iOS and Android.
  • Xamarin – Mobile app development IDE, testing, and distribution.
  • Linux
    • Containers
    • eBPF – Virtual machine that allows you to write more efficient and powerful tracing and monitoring for Linux systems.
    • Arch-based Projects – Linux distributions and projects based on Arch Linux.
  • macOS – Operating system for Apple’s Mac computers.
  • watchOS – Operating system for the Apple Watch.
  • JVM
  • Salesforce
  • Amazon Web Services
  • Windows
  • IPFS – P2P hypermedia protocol.
  • Fuse – Mobile development tools.
  • Heroku – Cloud platform as a service.
  • Raspberry Pi – Credit card-sized computer aimed at teaching kids programming, but capable of a lot more.
  • Qt – Cross-platform GUI app framework.
  • WebExtensions – Cross-browser extension system.
  • RubyMotion – Write cross-platform native apps for iOS, Android, macOS, tvOS, and watchOS in Ruby.
  • Smart TV – Create apps for different TV platforms.
  • GNOME – Simple and distraction-free desktop environment for Linux.
  • KDE – A free software community dedicated to creating an open and user-friendly computing experience.
  • .NET
    • Core
    • Roslyn – Open-source compilers and code analysis APIs for C# and VB.NET languages.
  • Amazon Alexa – Virtual home assistant.
  • DigitalOcean – Cloud computing platform designed for developers.
  • Flutter – Google’s mobile SDK for building native iOS and Android apps from a single codebase written in Dart.
  • Home Assistant – Open source home automation that puts local control and privacy first.
  • IBM Cloud – Cloud platform for developers and companies.
  • Firebase – App development platform built on Google Cloud Platform.
  • Robot Operating System 2.0 – Set of software libraries and tools that help you build robot apps.
  • Adafruit IO – Visualize and store data from any device.
  • Cloudflare – CDN, DNS, DDoS protection, and security for your site.
  • Actions on Google – Developer platform for Google Assistant.
  • ESP – Low-cost microcontrollers with WiFi and broad IoT applications.
  • Deno – A secure runtime for JavaScript and TypeScript that uses V8 and is built in Rust.
  • DOS – Operating system for x86-based personal computers that was popular during the 1980s and early 1990s.
  • Nix – Package manager for Linux and other Unix systems that makes package management reliable and reproducible.

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge are available as downloads. It is often critical to check with the curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; the curators just want to make sure that the data is correctly understood before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users, for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. Each affective state is labeled at one of four levels (very low, low, high, and very high). The labels are crowd annotated and correlated with a gold standard annotation created by a team of expert psychologists. Download it here.
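A multi-label annotation of this kind can be encoded as one integer per affective state. The sketch below is only an illustration of that idea: the dictionary layout and the 0 to 3 integer coding are assumptions, not the dataset's actual file format.

```python
# The four levels and four states named in the description above.
LEVELS = ("very low", "low", "high", "very high")
STATES = ("boredom", "confusion", "engagement", "frustration")


def encode_labels(annotation):
    """Map {state: level name} to a tuple of ints (0-3) in STATES order."""
    return tuple(LEVELS.index(annotation[state]) for state in STATES)


# Hypothetical annotation for a single video snippet.
snippet = {
    "boredom": "very low",
    "confusion": "low",
    "engagement": "very high",
    "frustration": "very low",
}
print(encode_labels(snippet))
```

Each snippet carries four labels at once, which is what distinguishes this setup from ordinary single-label classification.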

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, and reasons for delay, reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meteorological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two-letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

OpenFlights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe.

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
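Files like airports.dat are comma-separated text, so they can be read with a standard CSV parser. The sketch below assumes a 14-column layout (ID, name, city, country, IATA, ICAO, latitude, longitude, altitude, UTC offset, DST, tz database name, type, source); check the OpenFlights documentation before relying on it. The field names and the sample row are hypothetical and chosen for illustration.

```python
import csv
import io

# Assumed column order; the names are illustrative, not official.
AIRPORT_FIELDS = [
    "airport_id", "name", "city", "country", "iata", "icao",
    "latitude", "longitude", "altitude_ft", "utc_offset",
    "dst", "tz_database", "type", "source",
]

# Made-up row in the assumed layout (not a real airport).
SAMPLE = (
    '1,"Example Airport","Example City","Exampleland","EXA","EXAM",'
    '10.5,-20.25,100,0,"N","Etc/UTC","airport","Example"\n'
)


def read_airports(handle):
    """Yield one dict per row, with coordinates converted to floats."""
    for row in csv.reader(handle):
        record = dict(zip(AIRPORT_FIELDS, row))
        record["latitude"] = float(record["latitude"])
        record["longitude"] = float(record["longitude"])
        yield record


airports = list(read_airports(io.StringIO(SAMPLE)))
print(airports[0]["name"], airports[0]["latitude"], airports[0]["longitude"])
```

To read a downloaded file instead of the sample string, pass `open("airports.dat", encoding="utf-8", newline="")` to `read_airports`.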

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, including historical data, and they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset of arrests in the US by race and by state. Download Excel here

  • JavaScript
  • Swift – Apple’s compiled programming language that is secure, modern, programmer-friendly, and fast.
  • Python – General-purpose programming language designed for readability.
    • Asyncio – Asynchronous I/O in Python 3.
    • Scientific Audio – Scientific research in audio/music.
    • CircuitPython – A version of Python for microcontrollers.
    • Data Science – Data analysis and machine learning.
    • Typing – Optional static typing for Python.
    • MicroPython – A lean and efficient implementation of Python 3 for microcontrollers.
  • Rust
  • Haskell
  • PureScript
  • Go
  • Scala
    • Scala Native – Optimizing ahead-of-time compiler for Scala based on LLVM.
  • Ruby
  • Clojure
  • ClojureScript
  • Elixir
  • Elm
  • Erlang
  • Julia – High-level dynamic programming language designed to address the needs of high-performance numerical analysis and computational science.
  • Lua
  • C
  • C/C++ – General-purpose language with a bias toward system programming and embedded, resource-constrained software.
  • R – Functional programming language and environment for statistical computing and graphics.
  • D
  • Common Lisp – Powerful dynamic multiparadigm language that facilitates iterative and interactive development.
  • Perl
  • Groovy
  • Dart
  • Java – Popular secure object-oriented language designed for flexibility to “write once, run anywhere”.
  • Kotlin
  • OCaml
  • ColdFusion
  • Fortran
  • PHP – Server-side scripting language.
  • Pascal
  • AutoHotkey
  • AutoIt
  • Crystal
  • Frege – Haskell for the JVM.
  • CMake – Build, test, and package software.
  • ActionScript 3 – Object-oriented language targeting Adobe AIR.
  • Eta – Functional programming language for the JVM.
  • Idris – General purpose pure functional programming language with dependent types influenced by Haskell and ML.
  • Ada/SPARK – Modern programming language designed for large, long-lived apps where reliability and efficiency are essential.
  • Q# – Domain-specific programming language used for expressing quantum algorithms.
  • Imba – Programming language inspired by Ruby and Python and compiles to performant JavaScript.
  • Vala – Programming language designed to take full advantage of the GLib and GNOME ecosystems, while preserving the speed of C code.
  • Coq – Formal language and environment for programming and specification which facilitates interactive development of machine-checked proofs.
  • V – Simple, fast, safe, compiled language for developing maintainable software.

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University released a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They also released two machine learning models that represent different approaches to the problem. The work relies on testing techniques that psychologists use to study infants’ behavior, with the aim of accelerating the development of AI that exhibits common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works.”

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state 🇺🇸🇪🇺

Author: Here

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”
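The definition above treats the whole EU as one region and the whole US as another, so the check reduces to comparing regions. Here is a minimal sketch of that rule; the country and state sets are deliberately partial and only cover the examples, and the function names are made up for illustration.

```python
# Partial, illustrative membership sets (not complete lists).
EU_COUNTRIES = {"France", "Spain", "Germany", "Poland", "Ireland", "Portugal"}
US_STATES = {"Florida", "Texas"}


def region_of(place):
    """Collapse EU countries to "EU" and US states to "US"."""
    if place in EU_COUNTRIES:
        return "EU"
    if place in US_STATES:
        return "US"
    return place  # any other place counts as its own region


def is_foreign_born(birthplace, residence):
    """True when the person was born outside the region they live in."""
    return region_of(birthplace) != region_of(residence)


examples = [
    ("Spain", "France"),
    ("Turkey", "France"),
    ("Florida", "Texas"),
    ("Mexico", "Texas"),
    ("Florida", "France"),
    ("France", "Florida"),
]
for birthplace, residence in examples:
    print(birthplace, residence, is_foreign_born(birthplace, residence))
```

The sketch assumes the place of residence is always an EU country or a US state, as in the map.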

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021
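The same unsigned listing can be scripted by building the CLI command in Python. This is a small sketch, assuming the AWS CLI is installed and on the PATH; the helper names are hypothetical, and the network call is wrapped in a separate function so that building the command works offline.

```python
import shlex
import subprocess


def list_bucket_command(bucket, prefix=""):
    """Build the `aws s3 ls` argument list for an anonymous (unsigned) request."""
    return ["aws", "s3", "ls", f"s3://{bucket}/{prefix}", "--no-sign-request"]


def list_bucket(bucket, prefix=""):
    """Run the listing; needs the AWS CLI on PATH and network access."""
    result = subprocess.run(
        list_bucket_command(bucket, prefix),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI reports a failure
    )
    return result.stdout.splitlines()


# Print the commands without running them.
print(shlex.join(list_bucket_command("commoncrawl")))
print(shlex.join(list_bucket_command("commoncrawl", "crawl-data/CC-MAIN-2021-17/")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a prefix contains spaces or special characters.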

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6,229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users, for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels (very low, low, high, and very high) for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.
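
As a rough illustration of what "multi-label with four levels per state" means in practice, the sketch below encodes one snippet's annotation as four integers, one per affective state. The integer coding (0 to 3), the dictionary layout, and the sample annotation are assumptions made for this example; check the dataset's own label files for the actual format.

```python
# Order of the four affective states and four intensity levels described above.
STATES = ["boredom", "confusion", "engagement", "frustration"]
LEVELS = ["very low", "low", "high", "very high"]


def encode_labels(annotation):
    """Map a {state: level name} annotation to integer levels (0-3), ordered as STATES."""
    return [LEVELS.index(annotation[state]) for state in STATES]


# Hypothetical annotation for a single video snippet.
snippet = {
    "boredom": "low",
    "confusion": "very low",
    "engagement": "high",
    "frustration": "very low",
}
encoded = encode_labels(snippet)
```

Each snippet therefore carries four targets at once, which is what distinguishes this setup from ordinary single-label classification.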

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times and reasons for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
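
Since the database pairs scheduled with actual times, a delay is simply the difference between the two. The sketch below shows that arithmetic on a made-up record; the field names, the timestamp format, and the 15-minute on-time threshold are assumptions for illustration and should be checked against the BTS documentation before use.

```python
from datetime import datetime


def delay_minutes(scheduled, actual, fmt="%Y-%m-%d %H:%M"):
    """Return actual minus scheduled time in whole minutes (negative means early)."""
    delta = datetime.strptime(actual, fmt) - datetime.strptime(scheduled, fmt)
    return int(delta.total_seconds() // 60)


# Hypothetical flight record; real BTS files use their own column names.
flight = {
    "scheduled_departure": "2019-01-15 08:00",
    "actual_departure": "2019-01-15 08:25",
    "scheduled_arrival": "2019-01-15 10:30",
    "actual_arrival": "2019-01-15 10:40",
}

departure_delay = delay_minutes(flight["scheduled_departure"], flight["actual_departure"])
arrival_delay = delay_minutes(flight["scheduled_arrival"], flight["actual_arrival"])

# Assumed threshold: count a flight as on time if it arrives under 15 minutes late.
ON_TIME_THRESHOLD_MINUTES = 15
on_time = arrival_delay < ON_TIME_THRESHOLD_MINUTES
```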

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meteorological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two-letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

OpenFlights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe.

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
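
The airports.dat file is a headerless comma-separated file, so it can be read with Python's standard csv module. The sketch below parses one sample row; the column order follows the OpenFlights documentation as best I recall it, and the field names are my own labels, so verify both against the project's data page before relying on them.

```python
import csv
import io

# Assumed column order for airports.dat; the names are labels chosen for this sketch.
FIELDS = [
    "airport_id", "name", "city", "country", "iata", "icao",
    "latitude", "longitude", "altitude_ft", "utc_offset",
    "dst", "tz_database", "type", "source",
]

# One illustrative row in the file's format (no header line).
SAMPLE = (
    '1,"Goroka Airport","Goroka","Papua New Guinea","GKA","AYGA",'
    '-6.081689834590001,145.391998291,5282,10,"U",'
    '"Pacific/Port_Moresby","airport","OurAirports"\n'
)


def parse_airports(text):
    """Parse airports.dat-style text into a list of dictionaries."""
    airports = []
    for row in csv.reader(io.StringIO(text)):
        record = dict(zip(FIELDS, row))
        # Coordinates are stored as decimal degrees.
        record["latitude"] = float(record["latitude"])
        record["longitude"] = float(record["longitude"])
        airports.append(record)
    return airports


airports = parse_airports(SAMPLE)
```

To load the real file, pass the contents of airports.dat to parse_airports instead of SAMPLE; missing values in the real data may need extra handling.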

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, including historical data, and they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset of arrests in the US by race and by state. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University released a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They also released two machine learning models that represent different approaches to the problem. The work relies on testing techniques that psychologists use to study infants’ behavior, with the aim of accelerating the development of AI that exhibits common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works.”

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state 🇺🇸🇪🇺

Author: Here

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”
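
The definition and the six examples above can be expressed as a small rule: a person is "foreign-born" when their birthplace is outside the union (EU or US) they currently live in. The sketch below encodes that rule; the place sets are tiny illustrative subsets, not complete lists of EU countries or US states, and the function names are my own.

```python
# Illustrative subsets only; a real analysis would need the full membership lists.
EU_COUNTRIES = {"Spain", "France", "Germany", "Poland"}
US_STATES = {"Florida", "Texas"}


def union_of(place):
    """Return "EU", "US", or None for places outside both."""
    if place in EU_COUNTRIES:
        return "EU"
    if place in US_STATES:
        return "US"
    return None


def is_foreign_born(birthplace, residence):
    """True if the person was born outside the union they currently live in."""
    residence_union = union_of(residence)
    if residence_union is None:
        raise ValueError(f"{residence!r} is not in the EU or the US")
    return union_of(birthplace) != residence_union


examples = [
    ("Spain", "France"),
    ("Turkey", "France"),
    ("Florida", "Texas"),
    ("Mexico", "Texas"),
    ("Florida", "France"),
    ("France", "Florida"),
]
results = [is_foreign_born(born, lives) for born, lives in examples]
```

The results list follows the same order as the six bullet examples.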

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 Migration data, and Croatia has no data at all.

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021
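
For scripting, it can help to split an S3 URI like the one above into its bucket and key prefix, and to build the same anonymous listing command shown earlier. The sketch below does only that; the helper names are hypothetical, and actually running the command assumes the AWS CLI is installed and the network is reachable.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split "s3://bucket/some/prefix" into ("bucket", "some/prefix")."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def anonymous_ls_command(uri):
    """Build the argument list for an unsigned `aws s3 ls` call on the given URI."""
    split_s3_uri(uri)  # validate the URI before building the command
    return ["aws", "s3", "ls", uri, "--no-sign-request"]


bucket, prefix = split_s3_uri("s3://commoncrawl/crawl-data/CC-MAIN-2021-17")
command = anonymous_ls_command("s3://commoncrawl/crawl-data/CC-MAIN-2021-17/")

# To execute it (not done here): subprocess.run(command, check=True)
```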

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
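
To show what "the answer is a span of the passage" looks like in data, here is a minimal sketch using a toy record. It assumes the commonly described SQuAD layout in which each answer carries its text plus a character offset (answer_start), and unanswerable questions are flagged with is_impossible; the passage and questions are invented for illustration.

```python
def answer_span(context, answer):
    """Slice the answer text out of the passage using its character offset."""
    start = answer["answer_start"]
    return context[start:start + len(answer["text"])]


def first_answer(context, qa):
    """Return the first answer span, or None if the question is unanswerable."""
    if qa.get("is_impossible") or not qa["answers"]:
        return None
    return answer_span(context, qa["answers"][0])


# Toy passage and questions (hypothetical, not taken from SQuAD).
context = "The Amazon rainforest covers much of the Amazon basin of South America."
answerable = {
    "question": "Where is the Amazon basin?",
    "is_impossible": False,
    "answers": [
        {"text": "South America", "answer_start": context.find("South America")}
    ],
}
unanswerable = {
    "question": "How many species live in the rainforest?",
    "is_impossible": True,
    "answers": [],
}

span = first_answer(context, answerable)
no_span = first_answer(context, unanswerable)
```

Checking that the sliced span equals the stored answer text is a quick sanity test when loading the real files.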

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge is available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; They just want to make sure that there is a correct understanding of our data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels namely – very low, low, high, and very high for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

OpenFlights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations, and ferry terminals spanning the globe.

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, including historical data. They might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset of arrests in the US by race and by state. Download Excel here

  • Big Data
  • Public Datasets
  • Hadoop – Framework for distributed storage and processing of very large data sets.
  • Data Engineering
  • Streaming
  • Apache Spark – Unified engine for large-scale data processing.
  • Qlik – Business intelligence platform for data visualization, analytics, and reporting apps.
  • Splunk – Platform for searching, monitoring, and analyzing structured and unstructured machine-generated big data in real-time.

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University released a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They also released two machine learning models that represent different approaches to the problem. The work relies on testing techniques that psychologists use to study infants’ behavior, with the aim of accelerating the development of AI that exhibits common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works.”

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs were generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state 🇺🇸🇪🇺

Author: Here

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”
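The rule behind these examples can be written as a small function. This is a sketch only: the two sets below hold just the places mentioned in the examples (a real version would need the full list of EU member countries and US states), and the function name is made up for illustration.

```python
# Illustrative subsets only; a real check needs the complete lists.
EU_COUNTRIES = {"Spain", "France"}
US_STATES = {"Florida", "Texas"}


def is_foreign_born(birthplace: str, residence: str) -> bool:
    """Apply the definition used for the map to one person."""
    if residence in EU_COUNTRIES:
        # In the EU, only people born outside every EU country count.
        return birthplace not in EU_COUNTRIES
    if residence in US_STATES:
        # In the US, only people born outside every US state count.
        return birthplace not in US_STATES
    raise ValueError(f"residence {residence!r} is not covered by this sketch")


print(is_foreign_born("Spain", "France"))    # False
print(is_foreign_born("Turkey", "France"))   # True
print(is_foreign_born("Florida", "Texas"))   # False
print(is_foreign_born("Mexico", "Texas"))    # True
print(is_foreign_born("Florida", "France"))  # True
print(is_foreign_born("France", "Florida"))  # True
```

Note that the test depends on where the person lives: the same birthplace can count as foreign on one side of the Atlantic and not on the other.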

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience


Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. Contains 11K+ rows and 30+ attributes of Netflix titles (ratings, earnings, actors, language, availability, movie trailers, and many more).

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
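A short sketch may help show what "the answer is a span" means in practice. The record below is modeled loosely on the SQuAD JSON layout (an `answers` list with `text` and `answer_start`, plus an `is_impossible` flag for unanswerable questions). The passage, questions, and helper function are invented for illustration and do not come from the dataset.

```python
# Hypothetical passage and questions, written for this example.
context = "The HRRR model is updated hourly."

answerable = {
    "question": "How often is the HRRR model updated?",
    "is_impossible": False,
    # answer_start is a character offset into the context.
    "answers": [{"text": "hourly", "answer_start": 26}],
}

unanswerable = {
    "question": "Who funds the HRRR model?",
    "is_impossible": True,
    "answers": [],
}


def extract_answer(passage: str, qa: dict):
    """Return the answer span sliced from the passage, or None if unanswerable."""
    if qa["is_impossible"]:
        return None
    answer = qa["answers"][0]
    start = answer["answer_start"]
    return passage[start:start + len(answer["text"])]


print(extract_answer(context, answerable))    # hourly
print(extract_answer(context, unanswerable))  # None
```

Because the answer is always a slice of the passage, a model for this task only has to predict a start and end position (or that no answer exists) instead of generating free text.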

PubMed Diabetes Dataset

The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF-IDF weighted word vector from a dictionary of 500 unique words. The README file in the dataset provides more details.
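For readers unfamiliar with TF-IDF weighting, the sketch below computes one common variant (term frequency times the log of inverse document frequency) on three tiny made-up documents. The exact weighting used to build the dataset's vectors is not stated here, so this shows the general idea rather than the dataset's formula.

```python
import math
from collections import Counter


def tf_idf(docs: list) -> list:
    """Return one {term: weight} dict per document (documents must be non-empty)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical tokenized abstracts, not taken from the dataset.
docs = [
    ["insulin", "glucose", "insulin"],
    ["glucose", "diet"],
    ["insulin", "therapy"],
]

vectors = tf_idf(docs)
for vector in vectors:
    print(vector)
```

Terms that are frequent in one document but rare across the collection get the largest weights, and a term that appears in every document gets a weight of zero.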

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
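One simple way to see how similarities can suggest new interactions is a nearest-neighbor heuristic: score a drug-target pair by the highest similarity between that drug and any other drug already known to hit the target. The sketch below uses placeholder names and made-up similarity values. It is an illustrative baseline under those assumptions, not the method of Perlman et al.

```python
# Hypothetical toy data; the names are placeholders, not dataset entries.
known_interactions = {("drug_a", "target_1"), ("drug_b", "target_2")}

# Symmetric drug-drug similarity in [0, 1], keyed by unordered pair.
drug_similarity = {
    frozenset(("drug_a", "drug_c")): 0.8,
    frozenset(("drug_b", "drug_c")): 0.3,
    frozenset(("drug_a", "drug_b")): 0.1,
}


def interaction_score(drug: str, target: str) -> float:
    """Highest similarity between `drug` and any other drug known to hit `target`."""
    candidates = [
        drug_similarity.get(frozenset((drug, other)), 0.0)
        for other, known_target in known_interactions
        if known_target == target and other != drug
    ]
    return max(candidates, default=0.0)


print(interaction_score("drug_c", "target_1"))  # 0.8
print(interaction_score("drug_c", "target_2"))  # 0.3
```

With several similarity types available, as in this dataset, one score could be computed per similarity type and the scores then combined as features for a classifier.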

Pharmacogenomics Datasets

PharmGKB data and knowledge are available as downloads. It is often critical to check with the curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; they just want to make sure that their data is correctly understood before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request


NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, reason of delay. reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meterological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

Open flights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, also historically, they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”
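The definition and the six examples above can be sketched as a small classifier. This is only an illustration: the region sets are abbreviated placeholders, and the function names bloc_of and is_foreign_born are hypothetical, not taken from the original post.

```python
# Abbreviated, illustrative region sets (assumption): a real analysis would
# use the full lists of EU member states and US states.
EU_COUNTRIES = {"Spain", "France", "Germany", "Poland"}
US_STATES = {"Florida", "Texas", "California"}


def bloc_of(place):
    """Return "EU", "US", or "other" for a country or state name."""
    if place in EU_COUNTRIES:
        return "EU"
    if place in US_STATES:
        return "US"
    return "other"


def is_foreign_born(birthplace, residence):
    """A resident counts as foreign-born when the birthplace lies outside
    the bloc (EU or US) in which the person currently lives."""
    home = bloc_of(residence)
    if home == "other":
        raise ValueError("residence must be an EU country or a US state")
    return bloc_of(birthplace) != home


examples = [
    ("Spain", "France"),
    ("Turkey", "France"),
    ("Florida", "Texas"),
    ("Mexico", "Texas"),
    ("Florida", "France"),
    ("France", "Florida"),
]
for born, lives in examples:
    print(born, "->", lives, ":", is_foreign_born(born, lives))
```

Treating each bloc as a single unit is what makes a move from Spain to France, or from Florida to Texas, count as not foreign-born.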

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data; Croatia has no data at all.

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix titles (ratings, earnings, actors, language, availability, movie trailers, and many more).

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6,229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
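To make the "answer is a span, or the question is unanswerable" idea concrete, here is a minimal sketch. The QARecord class and its field names are hypothetical and only loosely modeled on the idea of storing a character offset; they are not the official SQuAD file schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QARecord:
    """Hypothetical record: one passage, one question, one optional answer span."""
    context: str
    question: str
    answer_start: Optional[int] = None  # character offset; None means unanswerable
    answer_text: Optional[str] = None


def extract_answer(record):
    """Return the answer span cut out of the passage, or None if unanswerable."""
    if record.answer_start is None or record.answer_text is None:
        return None
    end = record.answer_start + len(record.answer_text)
    span = record.context[record.answer_start:end]
    if span != record.answer_text:
        raise ValueError("answer offset does not match the passage text")
    return span


passage = "The Common Crawl corpus contains petabytes of data collected since 2008."
answerable = QARecord(
    context=passage,
    question="Since when has the corpus been collecting data?",
    answer_start=passage.index("2008"),
    answer_text="2008",
)
unanswerable = QARecord(context=passage, question="Who funds the corpus?")

print(extract_answer(answerable))
print(extract_answer(unanswerable))
```

Checking that the offset and the stored text agree is a cheap way to catch records whose passage was altered after annotation.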

PubMed Diabetes Dataset

The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary of 500 unique words. The README file in the dataset provides more details.
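As a rough illustration of what a TF/IDF weighted word vector is, the sketch below computes one per document over a fixed dictionary. The weighting used here (term frequency as count divided by document length, inverse document frequency as log(N / df)) is a common textbook variant and an assumption; the dataset's README should be consulted for the exact scheme it uses. The toy documents and the three-word dictionary are made up.

```python
import math
from collections import Counter


def tf_idf_vectors(docs, vocabulary):
    """Return one TF-IDF vector per tokenized document, ordered like vocabulary.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each dictionary word occurs.
    df = {word: sum(1 for doc in docs if word in doc) for word in vocabulary}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vector = []
        for word in vocabulary:
            tf = counts[word] / total if total else 0.0
            idf = math.log(n_docs / df[word]) if df[word] else 0.0
            vector.append(tf * idf)
        vectors.append(vector)
    return vectors


# Made-up toy corpus; the real dataset uses a 500-word dictionary.
docs = [
    ["insulin", "glucose", "insulin"],
    ["glucose", "diet"],
    ["glucose", "insulin"],
]
vocabulary = ["insulin", "glucose", "diet"]
vectors = tf_idf_vectors(docs, vocabulary)
for vector in vectors:
    print([round(weight, 3) for weight in vector])
```

A word that occurs in every document, like "glucose" here, gets weight zero under this variant, which is the point of the inverse document frequency term.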

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
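One simple way to frame "predict new interactions from similarities" is a nearest-neighbor score: a candidate drug inherits evidence for a target from the most similar drug already known to hit that target. The sketch below is only a toy baseline with made-up names and numbers; it uses a single drug-drug similarity and is not the method used by Perlman et al.

```python
def score_pair(drug, target, known, drug_sim):
    """Score a candidate (drug, target) pair by the highest similarity between
    the drug and any other drug already known to interact with the target.

    known    -- set of (drug, target) pairs with a known interaction
    drug_sim -- dict mapping (drug_a, drug_b) to a similarity in [0, 1]
    """
    best = 0.0
    for other, known_target in known:
        if known_target != target or other == drug:
            continue
        # Similarities are treated as symmetric; look the pair up in either order.
        sim = drug_sim.get((drug, other), drug_sim.get((other, drug), 0.0))
        best = max(best, sim)
    return best


# Made-up toy data for illustration only.
known = {("drugA", "T1"), ("drugB", "T2")}
drug_sim = {("drugA", "drugC"): 0.9, ("drugB", "drugC"): 0.2}

for target in ("T1", "T2", "T3"):
    print(target, score_pair("drugC", target, known, drug_sim))
```

With five drug-drug and three target-target similarity types available, a fuller approach would combine several such scores as features for a classifier rather than rely on one.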

Pharmacogenomics Datasets

PharmGKB data and knowledge are available as downloads. It is often critical to check with their curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; they just want to make sure that there is a correct understanding of their data before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request
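The unsigned listing commands quoted in this list for Common Crawl, TCGA, the pancreatic organoid data, and AfSIS all share one shape, so they can be generated from the bucket name. The sketch below is a thin, hypothetical wrapper (list_command and list_bucket are illustrative names); it assumes the AWS CLI is installed, and only list_bucket touches the network.

```python
import subprocess

# Bucket names copied from the commands quoted in this list.
PUBLIC_BUCKETS = [
    "commoncrawl",
    "tcga-2-open",
    "gdc-organoid-pancreatic-phs001611-2-open",
    "afsis",
]


def list_command(bucket, prefix=""):
    """Build the argument list for an unsigned `aws s3 ls` call."""
    return ["aws", "s3", "ls", f"s3://{bucket}/{prefix}", "--no-sign-request"]


def list_bucket(bucket, prefix=""):
    """Run the listing and return its text output.

    Requires the AWS CLI on PATH and network access; raises
    subprocess.CalledProcessError if the command fails.
    """
    result = subprocess.run(
        list_command(bucket, prefix), capture_output=True, text=True, check=True
    )
    return result.stdout


# Printing the commands has no side effects; call list_bucket to execute one.
for bucket in PUBLIC_BUCKETS:
    print(" ".join(list_command(bucket)))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if a prefix ever contains spaces.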

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels (very low, low, high, and very high) for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. Download it here.
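Because every snippet carries one level for each of the four affective states, a label is naturally a vector of four small integers. The encoding below (0 for very low through 3 for very high, states in the order listed) is an assumption made for illustration; the dataset's own files may order or encode the labels differently.

```python
from typing import Dict, List

STATES = ("boredom", "confusion", "engagement", "frustration")
LEVELS = ("very low", "low", "high", "very high")


def encode_labels(labels: Dict[str, str]) -> List[int]:
    """Turn {state: level name} into a four-element integer vector.

    Assumed encoding: index of the level name in LEVELS (0..3),
    one entry per state in the order given by STATES.
    """
    return [LEVELS.index(labels[state]) for state in STATES]


snippet_labels = {
    "boredom": "low",
    "confusion": "very low",
    "engagement": "very high",
    "frustration": "high",
}
print(encode_labels(snippet_labels))
```

Keeping the four states as separate outputs, rather than one combined class, is what makes this a multi-label problem.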

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times and the reason for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).
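With scheduled and actual times in hand, the basic derived quantity is the delay in minutes. The sketch below shows that arithmetic; the 15-minute on-time threshold is a commonly used convention and is an assumption here, since the text above does not state one, and the function names are illustrative rather than column names from the BTS files.

```python
from datetime import datetime

# Assumption: a flight counts as on time if it is less than 15 minutes late.
ON_TIME_THRESHOLD_MIN = 15


def delay_minutes(scheduled, actual):
    """Minutes between scheduled and actual time (negative means early)."""
    return (actual - scheduled).total_seconds() / 60.0


def is_on_time(scheduled, actual, threshold=ON_TIME_THRESHOLD_MIN):
    """True when the delay is below the threshold."""
    return delay_minutes(scheduled, actual) < threshold


scheduled_arrival = datetime(2021, 4, 1, 10, 0)
late_arrival = datetime(2021, 4, 1, 10, 20)
slightly_late = datetime(2021, 4, 1, 10, 5)
early_arrival = datetime(2021, 4, 1, 9, 50)

print(delay_minutes(scheduled_arrival, late_arrival), is_on_time(scheduled_arrival, late_arrival))
print(delay_minutes(scheduled_arrival, slightly_late), is_on_time(scheduled_arrival, slightly_late))
print(delay_minutes(scheduled_arrival, early_arrival), is_on_time(scheduled_arrival, early_arrival))
```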

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meteorological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two-letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

OpenFlights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe.

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
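The .dat files are comma-separated text without a header row, so they can be read with a standard CSV parser. The sketch below assumes a 14-column layout and a \N marker for missing values, as I understand the OpenFlights documentation to describe; verify both against the current documentation before relying on them. The sample row is made up.

```python
import csv
import io

# Assumed column order (verify against the OpenFlights documentation).
FIELDS = [
    "airport_id", "name", "city", "country", "iata", "icao",
    "latitude", "longitude", "altitude_ft", "timezone", "dst",
    "tz_database", "type", "source",
]

NULL_MARKER = r"\N"  # assumed marker for missing values


def parse_airports(text):
    """Parse airports.dat-style text into a list of dictionaries."""
    rows = []
    for raw in csv.reader(io.StringIO(text)):
        record = dict(zip(FIELDS, raw))
        # Replace the null marker with None so missing values are explicit.
        record = {key: (None if value == NULL_MARKER else value)
                  for key, value in record.items()}
        record["latitude"] = float(record["latitude"])
        record["longitude"] = float(record["longitude"])
        rows.append(record)
    return rows


# Made-up sample row for illustration; not a real airport.
SAMPLE = (
    r'1,"Example Airport","Example City","Exampleland","EXA","EXAM",'
    r'12.5,-45.25,100,0,"N",\N,"airport","Example"' + "\n"
)

airports = parse_airports(SAMPLE)
print(airports[0]["name"], airports[0]["latitude"], airports[0]["tz_database"])
```

To use it on the real file, read airports.dat as UTF-8 text and pass its contents to parse_airports.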

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, including historical data; they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset with arrest in US by race and separate states. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), Researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem that relies on testing techniques psychologists use to study infants’ behavior to accelerate the development of AI exhibiting common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1914081 records created from all malware traffic analysis .net PCAP files, from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state 🇺🇸🇪🇺

Author: Here

Percent of “foreign-born” population in each US and EU state or country. For the EU, “foreign-born” mean being born outside of any of the EU countries. For the US, “foreign-born” mean being born outside of any US state.

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, Portugal uses Eurostat 2010 Migration data and Croatia has no data at all

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix (Ratings, earnings, actors, language, availability, movie trailers, and many more)

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6,229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.
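To make the "answer is a span of the passage" idea concrete, here is a minimal Python sketch. The record layout (an `answers` list with `text` and `answer_start`, plus an `is_impossible` flag for unanswerable questions) follows the commonly distributed SQuAD JSON files; the passage and questions are made up for illustration, and `extract_answer` is a hypothetical helper.

```python
from typing import Optional


def extract_answer(context: str, qa: dict) -> Optional[str]:
    """Return the answer span cut out of the passage, or None if unanswerable."""
    if qa.get("is_impossible") or not qa["answers"]:
        return None
    answer = qa["answers"][0]
    start = answer["answer_start"]        # character offset into the passage
    end = start + len(answer["text"])
    span = context[start:end]
    # The stored text should be exactly the slice of the passage it points at.
    if span != answer["text"]:
        raise ValueError("answer_start does not match the answer text")
    return span


# Illustrative passage and questions (not taken from the dataset).
context = "The Amazon rainforest covers much of the Amazon basin of South America."
answerable = {
    "question": "On which continent is the Amazon basin?",
    "answers": [{"text": "South America",
                 "answer_start": context.index("South America")}],
    "is_impossible": False,
}
unanswerable = {
    "question": "How many tree species grow in the rainforest?",
    "answers": [],
    "is_impossible": True,
}

print(extract_answer(context, answerable))
print(extract_answer(context, unanswerable))
```

Because the answer is stored as an offset plus text, a model only has to predict where the span starts and ends rather than generate free text.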

PubMed Diabetes Dataset

The PubMed Diabetes dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.
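For readers who have not met TF/IDF weighting before, the sketch below shows the textbook variant (term frequency times the log of inverse document frequency) over a fixed vocabulary. The exact weighting and preprocessing used to build the dataset are not described here, so treat this as an illustration only; the three toy documents and the three-word vocabulary are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary):
    """Return one TF-IDF vector per document, with one entry per vocabulary word."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each word appear?
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocabulary}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = []
        for w in vocabulary:
            tf = counts[w] / len(toks) if toks else 0.0
            idf = math.log(n_docs / df[w]) if df[w] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors


# Toy corpus (illustrative, not from the dataset).
docs = [
    "insulin resistance in type 2 diabetes",
    "insulin therapy for type 1 diabetes",
    "retinopathy screening in diabetes",
]
vocabulary = ["insulin", "retinopathy", "diabetes"]
vectors = tfidf_vectors(docs, vocabulary)
for doc, vec in zip(docs, vectors):
    print(doc, [round(x, 3) for x in vec])
```

Note that "diabetes" gets weight zero everywhere because it appears in every document, which is exactly the point of the inverse-document-frequency term.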

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link
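As a rough illustration of the prediction task, the sketch below scores a candidate drug-target pair by the most similar drug already known to interact with that target. This is a simple nearest-neighbour baseline chosen for clarity, not the method of Perlman et al.; the drug names, targets, and similarity values are invented, and only a single drug-drug similarity is used instead of the five types in the dataset.

```python
def score_interaction(drug, target, known_pairs, drug_similarity):
    """Score (drug, target) by the most similar drug already known to hit the target."""
    neighbours = [d for (d, t) in known_pairs if t == target and d != drug]
    if not neighbours:
        return 0.0  # no evidence for this target at all
    return max(drug_similarity[frozenset((drug, d))] for d in neighbours)


# Hypothetical known interactions and similarities in [0, 1].
known_pairs = {("drugA", "T1"), ("drugB", "T1"), ("drugC", "T2")}
drug_similarity = {
    frozenset(("drugD", "drugA")): 0.1,
    frozenset(("drugD", "drugB")): 0.7,
    frozenset(("drugD", "drugC")): 0.3,
}

# Rank candidate targets for a new drug, best guess first.
candidates = ["T1", "T2", "T3"]
ranked = sorted(
    candidates,
    key=lambda t: score_interaction("drugD", t, known_pairs, drug_similarity),
    reverse=True,
)
print(ranked)
```

A fuller approach would combine several drug-drug and target-target similarities, for example as features for a classifier.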

Pharmacogenomics Datasets

PharmGKB data and knowledge are available as downloads. It is often critical to check with the PharmGKB curators at feedback@pharmgkb.org before embarking on a large project using these data, to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; the curators just want to make sure that the data are understood correctly before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users, for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. Each affective state is labeled at one of four levels (very low, low, high, and very high); the labels are crowd annotated and correlated with a gold standard annotation created by a team of expert psychologists. Download it here.
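"Multi-label" here means every snippet carries four labels at once, one per affective state. A minimal Python sketch of how such a label could be encoded for a classifier follows; the 0 to 3 integer encoding and the dictionary layout are assumptions made for illustration, not the dataset's file format.

```python
STATES = ("boredom", "confusion", "engagement", "frustration")
LEVELS = ("very low", "low", "high", "very high")


def encode_labels(labels):
    """Map the level name of each affective state to an integer in 0..3."""
    return [LEVELS.index(labels[state]) for state in STATES]


# Hypothetical annotation for one video snippet.
snippet_labels = {
    "boredom": "very low",
    "confusion": "low",
    "engagement": "very high",
    "frustration": "high",
}
print(encode_labels(snippet_labels))
```

A model for this task would then predict four outputs per snippet, each with four possible levels.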

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times and reasons for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meteorological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two-letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here
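The summary counts mentioned above (on-time, delayed, canceled, diverted) can be sketched from raw flight records as follows. The field names are hypothetical, and the 15-minute cutoff for "on time" is an assumption stated here for illustration; check the BTS documentation for the definition that applies to the data you download.

```python
from collections import Counter

# Assumption: a flight counts as on time if it arrives less than 15 minutes late.
ON_TIME_THRESHOLD_MIN = 15


def classify(flight):
    """Put one flight record into exactly one summary category."""
    if flight["cancelled"]:
        return "canceled"
    if flight["diverted"]:
        return "diverted"
    if flight["arr_delay_min"] < ON_TIME_THRESHOLD_MIN:
        return "on_time"
    return "delayed"


def summarize(flights):
    """Count flights per category."""
    return Counter(classify(f) for f in flights)


# Made-up records; arr_delay_min is None when the flight never arrived as planned.
flights = [
    {"arr_delay_min": -3, "cancelled": False, "diverted": False},
    {"arr_delay_min": 14, "cancelled": False, "diverted": False},
    {"arr_delay_min": 15, "cancelled": False, "diverted": False},
    {"arr_delay_min": 42, "cancelled": False, "diverted": False},
    {"arr_delay_min": None, "cancelled": True, "diverted": False},
    {"arr_delay_min": None, "cancelled": False, "diverted": True},
]
summary = summarize(flights)
on_time_share = summary["on_time"] / len(flights)
print(summary, round(on_time_share, 3))
```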

Worldwide flight data

OpenFlights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations, and ferry terminals spanning the globe.

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
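The .dat files are plain comma-separated text without a header row, so Python's `csv` module is enough to read them. The sketch below assumes the column order listed in `FIELDS`, which is how I understand the OpenFlights layout; verify it against the project's documentation before relying on it. The sample row is a made-up placeholder, not a real airport.

```python
import csv
import io

# Assumed column order (no header row in the file); verify before use.
FIELDS = [
    "airport_id", "name", "city", "country", "iata", "icao",
    "latitude", "longitude", "altitude_ft", "utc_offset", "dst",
    "tz", "type", "source",
]

# Hypothetical row in the same shape as the file's rows.
SAMPLE = (
    '1,"Example Airport","Example City","Exampleland","EXA","EXAM",'
    '10.5,-20.25,100,0,"N","Etc/UTC","airport","Example"\n'
)


def parse_airports(text):
    """Parse airports.dat-style text into a list of dictionaries."""
    airports = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        record = dict(zip(FIELDS, row))
        record["latitude"] = float(record["latitude"])
        record["longitude"] = float(record["longitude"])
        airports.append(record)
    return airports


airports = parse_airports(SAMPLE)
print(airports[0]["name"], airports[0]["latitude"], airports[0]["longitude"])
```

For the real file, pass the downloaded text in place of `SAMPLE`.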

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, including historical data; they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset of arrests in the US by race and by state. Download Excel here

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University have come together to release a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They are also releasing two machine learning models that represent different approaches to the problem and that rely on testing techniques psychologists use to study infants’ behavior, with the aim of accelerating the development of AI that exhibits common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works.”

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state 🇺🇸🇪🇺

Author: Here


Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”
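The six examples above boil down to one rule: a resident is "foreign-born" when the birthplace lies outside the bloc (EU or US) they live in. A minimal Python sketch of that rule follows; the `REGION` table is an illustrative subset covering only the places named in the examples, and `is_foreign_born` is a hypothetical helper, not code from the chart's author.

```python
# Illustrative subset: which bloc each place belongs to.
# Places missing from the table (e.g. Turkey, Mexico) belong to neither bloc.
REGION = {
    "Spain": "EU",
    "France": "EU",
    "Florida": "US",
    "Texas": "US",
}


def is_foreign_born(birthplace, residence):
    """Apply the chart's definition for someone living in an EU country or US state."""
    home_bloc = REGION.get(residence)
    if home_bloc is None:
        raise ValueError("the chart only covers residents of the EU or the US")
    return REGION.get(birthplace) != home_bloc


examples = [
    ("Spain", "France"),
    ("Turkey", "France"),
    ("Florida", "Texas"),
    ("Mexico", "Texas"),
    ("Florida", "France"),
    ("France", "Florida"),
]
for born, lives in examples:
    print(f"born in {born}, living in {lives}: {is_foreign_born(born, lives)}")
```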


Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

r/dataisbeautiful - [OC] 35% of "entry-level" jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix titles (ratings, earnings, actors, language, availability, movie trailers, and many more).

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021
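The crawl path above can be taken apart programmatically. The sketch below splits the S3 URI into bucket and key, and reads the crawl identifier as a year plus a week number. Treating `17` as ISO week 17 of 2021 is an assumption about the naming scheme, although it agrees with the "April 2021" label; both helper names are hypothetical.

```python
import datetime
import re


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    if not uri.startswith("s3://"):
        raise ValueError("not an S3 URI")
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key


def parse_crawl_id(crawl_id):
    """Read 'CC-MAIN-YYYY-WW' as (year, week, Monday of that ISO week)."""
    match = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", crawl_id)
    if match is None:
        raise ValueError("unexpected crawl id format")
    year, week = int(match.group(1)), int(match.group(2))
    # Assumption: the second number is an ISO week number.
    monday = datetime.date.fromisocalendar(year, week, 1)
    return year, week, monday


bucket, key = split_s3_uri("s3://commoncrawl/crawl-data/CC-MAIN-2021-17")
year, week, monday = parse_crawl_id(key.split("/")[-1])
print(bucket, key)
print(year, week, monday.strftime("%B %Y"))
```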

 Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

 CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth AfricaFacebook Data for GoodNASA Space Act AgreementNIH STRIDESNOAA Big Data ProgramSpace Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer
researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal
Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program applies a comprehensive genomic approach to determine molecular changes that drive childhood cancers. The goal of the program is to use data to guide the development of effective, less toxic therapies. TARGET is organized into a collaborative network of disease-specific project teams.  TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive. Access it here.

Genome Aggregation Database (gnomAD)

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. Downloads

SQuAD (Stanford Question Answering Dataset)

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. Access it here.

PubMed Diabetes Dataset

The Pubmed Diabetes dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words. The README file in the dataset provides more details.

Download Link

Drug-Target Interaction Dataset

This dataset contains interactions between drugs and targets collected from DrugBank, KEGG Drug, DCDB, and Matador. It was originally collected by Perlman et al. It contains 315 drugs, 250 targets, 1,306 drug-target interactions, 5 types of drug-drug similarities, and 3 types of target-target similarities. Drug-drug similarities include Chemical-based, Ligand-based, Expression-based, Side-effect-based, and Annotation-based similarities. Target-target similarities include Sequence-based, Protein-protein interaction network-based, and Gene Ontology-based similarities. The original task on the dataset is to predict new interactions between drugs and targets based on different types of similarities in the network. Download link

Pharmacogenomics Datasets

PharmGKB data and knowledge are available as downloads. Before embarking on a large project using these data, it is often critical to check with the curators at feedback@pharmgkb.org to be sure that the files and data they make available are being interpreted correctly. PharmGKB generally does NOT need to be a co-author on such analyses; the curators just want to make sure that their data is correctly understood before lots of resources are spent.

Pancreatic Cancer Organoid Profiling

The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://gdc-organoid-pancreatic-phs001611-2-open/ --no-sign-request

Africa Soil Information Service (AfSIS) Soil Chemistry

This dataset contains soil infrared spectral data and paired soil property reference measurements for georeferenced soil samples that were collected through the Africa Soil Information Service (AfSIS) project, which lasted from 2009 through 2018. Documentation

AWS CLI Access (No AWS account required)

aws s3 ls s3://afsis/ --no-sign-request

Dataset for Affective States in E-Environments

DAiSEE is the first multi-label video classification dataset, comprising 9,068 video snippets captured from 112 users, for recognizing the user affective states of boredom, confusion, engagement, and frustration “in the wild”. The dataset has four levels of labels (very low, low, high, and very high) for each of the affective states; the labels are crowd-annotated and correlated with a gold-standard annotation created by a team of expert psychologists. Download it here.
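
The label scheme described above (four affective states, each with one of four levels) can be sketched as a small encoding step in Python. The function name, the dictionary-based annotation format, and the 0 to 3 integer encoding are assumptions made for illustration, not the dataset's actual file format.

```python
# The four affective states and four label levels named in the description.
STATES = ("boredom", "confusion", "engagement", "frustration")
LEVELS = ("very low", "low", "high", "very high")


def encode_labels(annotation: dict) -> list:
    """Map one snippet's per-state levels to integers 0..3, in STATES order."""
    return [LEVELS.index(annotation[state]) for state in STATES]


# Hypothetical annotation for a single video snippet.
snippet = {
    "boredom": "low",
    "confusion": "very low",
    "engagement": "very high",
    "frustration": "high",
}
print(encode_labels(snippet))
```

Because every snippet carries a level for all four states at once, a model for this kind of data predicts four outputs per video rather than a single class.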

NatureServe Explorer Dataset

NatureServe Explorer provides conservation status, taxonomy, distribution, and life history information for more than 95,000 plants and animals in the United States and Canada, and more than 10,000 vegetation communities and ecological systems in the Western Hemisphere.

The data available through NatureServe Explorer represents data managed in the NatureServe Central Databases. These databases are dynamic, being continually enhanced and refined through the input of hundreds of natural heritage program scientists and other collaborators. NatureServe Explorer is updated from these central databases to reflect information from new field surveys, the latest taxonomic treatments and other scientific publications, and new conservation status assessments. Explore Data here

Flight Records in the US

Airline On-Time Performance and Causes of Flight Delays – On_Time Data.

This database contains scheduled and actual departure and arrival times, along with reasons for delay, as reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data is collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS).

FlightAware.com has data but you need to pay for a full dataset.

The anyflights package supplies a set of functions to generate air travel data (and data packages!) similar to nycflights13. With a user-defined year and airport, the anyflights function will grab data on:

  • flights: all flights that departed a given airport in a given year and month
  • weather: hourly meteorological data for a given airport in a given year and month
  • airports: airport names, FAA codes, and locations
  • airlines: translation between two letter carrier (airline) codes and names
  • planes: construction information about each plane found in flights

Airline On-Time Statistics and Delay Causes

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT’s monthly Air Travel Consumer Report, published about 30 days after the month’s end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released. Access it here

Worldwide flight data

OpenFlights: As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe.

Download: airports.dat (Airports only, high quality)

Download: airports-extended.dat (Airports, train stations and ferry terminals, including user contributions)
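
Once downloaded, a file like airports.dat can be read with Python's standard csv module. The sketch below is a minimal example: the column order in `FIELDS` and the sample row are assumptions based on the format the OpenFlights project describes, so check them against the file you actually download.

```python
import csv
import io

# Assumed column order for airports.dat (verify against the OpenFlights docs).
FIELDS = [
    "airport_id", "name", "city", "country", "iata", "icao",
    "latitude", "longitude", "altitude_ft", "utc_offset",
    "dst", "tz", "type", "source",
]

# One sample row in the assumed layout, used instead of a real download.
SAMPLE = (
    '507,"London Heathrow Airport","London","United Kingdom","LHR","EGLL",'
    '51.4706,-0.461941,83,0,"E","Europe/London","airport","OurAirports"\n'
)


def parse_airports(text: str) -> list:
    """Parse comma-separated airport rows into dictionaries."""
    airports = []
    for row in csv.reader(io.StringIO(text)):
        record = dict(zip(FIELDS, row))
        # Coordinates arrive as strings; convert them for numeric use.
        record["latitude"] = float(record["latitude"])
        record["longitude"] = float(record["longitude"])
        airports.append(record)
    return airports


airports = parse_airports(SAMPLE)
print(airports[0]["iata"], airports[0]["latitude"], airports[0]["longitude"])
```

To parse a real file, pass `open("airports.dat", encoding="utf-8").read()` to `parse_airports` instead of `SAMPLE`.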

Bureau of Transportation:

Flightera.net seems to have a lot of good data for free. It has in-depth data on flights and doesn’t seem limited by date. I can’t speak on the validity of the data though.

flightradar24.com has lots of data, including historical data; they might be willing to help you get it in a nice format.

 

2019 Crime statistics in the USA

Dataset of arrests in the US, broken down by race and by state. Download Excel here

  • Database
  • MySQL
  • SQLAlchemy
  • InfluxDB
  • Neo4j
  • MongoDB – NoSQL database.
  • RethinkDB
  • TinkerPop – Graph computing framework.
  • PostgreSQL – Object-relational database.
  • CouchDB – Document-oriented NoSQL database.
  • HBase – Distributed, scalable, big data store.
  • NoSQL Guides – Help on using non-relational, distributed, open-source, and horizontally scalable databases.
  • Contexture – Abstracts queries/filters and results/aggregations from different backing data stores like ElasticSearch and MongoDB.
  • Database Tools – Everything that makes working with databases easier.
  • Grakn – Logical database to organize large and complex networks of data as one body of knowledge.

Researchers from IBM, MIT and Harvard Announced The Release Of DARPA “Common Sense AI” Dataset Along With Two Machine Learning Models At ICML 2021

Building machines that can make decisions based on common sense is no easy feat. A machine must be able to do more than merely find patterns in data; it also needs a way of interpreting the intentions and beliefs behind people’s choices.

At the 2021 International Conference on Machine Learning (ICML), researchers from IBM, MIT, and Harvard University released a DARPA “Common Sense AI” dataset for benchmarking AI intuition. They also released two machine learning models that represent different approaches to the problem. The work relies on testing techniques that psychologists use to study infants’ behavior, with the aim of accelerating the development of AI that exhibits common sense.

Source – Summary – Paper – IBM Blog

100 million protein structures Dataset by DeepMind

DeepMind creates ‘transformative’ map of human proteins drawn by AI. By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works.”

Here’s a good article about this topic

Google Dataset Search

Google Dataset Search

Malware traffic dataset

Comprises 1,914,081 records created from all malware-traffic-analysis.net PCAP files from 2013 to 2021. The logs are generated using Suricata and Zeek.

Originator: ali_alwashali

Percent of “foreign-born” population in each US and EU state or country.

For the EU, “foreign-born” means being born outside of any of the EU countries. For the US, “foreign-born” means being born outside of any US state 🇺🇸🇪🇺

Author: Here

Examples of “foreign-born” in this context:

  • Person born in Spain and living in France is NOT “foreign-born”

  • Person born in Turkey and living in France is “foreign-born”

  • Person born in Florida and living in Texas is NOT “foreign-born”

  • Person born in Mexico and living in Texas is “foreign-born”

  • Person born in Florida and living in France is “foreign-born”

  • Person born in France and living in Florida is “foreign-born”
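
The examples above can be sketched as a small Python function. The membership sets below are deliberately partial and hypothetical (only the places named in the examples, plus a few extras), and the function name is invented for illustration; a real analysis would need the full lists of EU countries and US states.

```python
# Partial, illustrative membership sets (not complete lists).
EU_COUNTRIES = {"Spain", "France", "Germany", "Poland"}
US_STATES = {"Florida", "Texas", "California"}


def region(place: str):
    """Return "EU", "US", or None for a place outside both regions."""
    if place in EU_COUNTRIES:
        return "EU"
    if place in US_STATES:
        return "US"
    return None


def is_foreign_born(birthplace: str, residence: str) -> bool:
    """A person is foreign-born if born outside the region they live in."""
    home = region(residence)
    if home is None:
        raise ValueError("residence must be an EU country or a US state")
    return region(birthplace) != home


for born, lives in [("Spain", "France"), ("Turkey", "France"),
                    ("Florida", "Texas"), ("Mexico", "Texas"),
                    ("Florida", "France"), ("France", "Florida")]:
    print(born, "->", lives, is_foreign_born(born, lives))
```

The key design point is that the comparison is made at the level of the whole region (all of the EU, or all of the US), not the individual country or state.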

🇺🇸🇪🇺🗺️

Note: Poland, Ireland, Germany, Greece, Cyprus, Malta, and Portugal use Eurostat 2010 migration data, and Croatia has no data at all.

Link1

Link2

Link3

Tools: MS Office

Source: Here

35% of “entry-level” jobs on LinkedIn require 3+ years of experience

Source: LinkedIn data  (see original post)

Tool: Photoshop from my colleague

Latest complete Netflix movie dataset

Created from 4 APIs. 11K+ rows and 30+ attributes of Netflix titles (ratings, earnings, actors, language, availability, movie trailers, and many more).

Dataset on Kaggle.

Explore this dataset using FlixGem.com (this dataset is powering this webapp)

Dataset on Google Sheets.

Common Crawl

A corpus of web crawl data composed of over 50 billion web pages. The Common Crawl corpus contains petabytes of data collected since 2008. It contains raw web page data, extracted metadata and text extractions.

AWS CLI Access (No AWS account required)

aws s3 ls s3://commoncrawl/ --no-sign-request

s3://commoncrawl/crawl-data/CC-MAIN-2021-17 – April 2021
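
Crawl identifiers such as CC-MAIN-2021-17 can be turned into an approximate date with a few lines of Python. This sketch assumes the trailing two digits are an ISO week number, which is consistent with the "April 2021" label above but should be treated as an assumption; the function name is invented for illustration.

```python
import datetime
import re


def crawl_week_start(crawl_id: str) -> datetime.date:
    """Return the Monday of the week encoded in a CC-MAIN-YYYY-WW identifier."""
    match = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", crawl_id)
    if match is None:
        raise ValueError(f"unexpected crawl id: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Assumption: WW is an ISO 8601 week number; day 1 is Monday.
    return datetime.date.fromisocalendar(year, week, 1)


start = crawl_week_start("CC-MAIN-2021-17")
print(start.isoformat(), start.strftime("%B %Y"))
```

This can help when choosing which crawl prefix to list with the `aws s3 ls` command shown above.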

Dataset on protein prices

Data on Primary Commodity Prices are updated monthly based on the IMF’s Primary Commodity Price System.

Excel Database

CPOST dataset on suicide attacks over four decades

The University of Chicago Project on Security and Threats presents the updated and expanded Database on Suicide Attacks (DSAT), which now links to Uppsala Conflict Data Program data on armed conflicts and includes a new dataset measuring the alliance and rivalry relationships among militant groups with connections to suicide attack groups. Access it here.

Credit Card Dataset – Survey of Consumer Finances (SCF) Combined Extract Data 1989-2019

You can do a lot of aggregated analysis in a pretty straightforward way there.

Drone imagery with annotations for small object detection and tracking dataset

11 TB dataset of drone imagery with annotations for small object detection and tracking

Download and more information are available here

Dataset License: CDLA-Sharing-1.0

Helper scripts for accessing the dataset: DATASET.md

Dataset Exploration: Colab

NOAA High-Resolution Rapid Refresh (HRRR) Model

The HRRR is a NOAA real-time 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model, initialized by 3km grids with 3km radar assimilation. Radar data is assimilated in the HRRR every 15 min over a 1-h period adding further detail to that provided by the hourly data assimilation from the 13km radar-enhanced Rapid Refresh.

Registry of Open Data on AWS

This registry exists to help people discover and share datasets that are available via AWS resources. Learn more about sharing data on AWS.

See all usage examples for datasets listed in this registry.

See datasets from Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative.

Textbook Question Answering (TQA)

1,076 textbook lessons, 26,260 questions, 6229 images

Documentation: allenai.org/data/tqa

Download

Harmonized Cancer Datasets: Genomic Data Commons Data Portal

The GDC Data Portal is a robust data-driven platform that allows cancer researchers and bioinformaticians to search and download cancer data for analysis.

Genomic Data Commons Data Portal

The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer.

AWS CLI Access (No AWS account required)

aws s3 ls s3://tcga-2-open/ --no-sign-request
