What are some ways to increase precision or recall in machine learning?
Sensitivity vs Specificity?
In machine learning, recall is the ability of the model to find all relevant instances in the data while precision is the ability of the model to correctly identify only the relevant instances. A high recall means that most relevant results are returned while a high precision means that most of the returned results are relevant. Ideally, you want a model with both high recall and high precision but often there is a trade-off between the two. In this blog post, we will explore some ways to increase recall or precision in machine learning.
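As a minimal sketch of these two definitions, the snippet below counts true positives, false positives, and false negatives by hand. The labels are made up for illustration (1 = spam, 0 = not spam), and the function name `precision_recall` is just a name chosen here, not part of any library.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 marks the relevant class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # relevant and returned
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # returned but not relevant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # relevant but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here three of the four emails flagged as spam really are spam (precision), and three of the four actual spam emails were found (recall).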
There are two main ways to increase recall:
by lowering the classification threshold or by improving the model so that it misses fewer positives. Recall is the number of true positives divided by the number of actual positives (true positives plus false negatives), so it rises only when false negatives turn into true positives. To catch more positives, you can lower your threshold for what constitutes a positive prediction. For example, if you are trying to predict whether or not an email is spam, you might lower the threshold for what constitutes spam so that more emails are classified as spam. This will result in more false positives (emails that are not actually spam being classified as spam), which lowers precision, but will also increase recall (more actual spam emails being classified as spam).
To decrease the number of false negatives,
you should not raise your threshold for what constitutes a positive prediction. Going back to the spam email prediction example, raising the threshold so that fewer emails are classified as spam results in more false negatives (actual spam emails not being classified as spam) and therefore decreases recall. If the threshold is already as low as you can afford, the remaining option is a model that separates the two classes better, so that actual spam receives higher scores.
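To see how the threshold moves recall, here is a small sketch on toy data. The scores and labels are invented for illustration (they stand in for the output of some spam classifier), and `confusion_at` and `recall_at` are hypothetical helper names.

```python
# Hypothetical spam scores from some classifier, with the true labels (1 = spam).
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.35, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]


def confusion_at(threshold):
    """Count (TP, FP, FN) when every score >= threshold is predicted as spam."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted_spam = score >= threshold
        if predicted_spam and label == 1:
            tp += 1
        elif predicted_spam and label == 0:
            fp += 1
        elif not predicted_spam and label == 1:
            fn += 1
    return tp, fp, fn


def recall_at(threshold):
    """Recall = TP / (TP + FN) at the given threshold."""
    tp, _, fn = confusion_at(threshold)
    return tp / (tp + fn)


# Lowering the threshold from 0.7 to 0.3 trades false negatives for false positives.
for threshold in (0.7, 0.5, 0.3):
    tp, fp, fn = confusion_at(threshold)
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} recall={recall_at(threshold):.2f}")
```

On this toy data, recall climbs as the threshold drops, while the false positive count climbs with it.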
There are two main ways to increase precision:
by decreasing the number of false positives or by improving the model so that its positive predictions are more often correct. Precision is the number of true positives divided by the number of predicted positives (true positives plus false positives). To decrease the number of false positives, you can raise your threshold for what constitutes a positive prediction. For example, using the spam email prediction example again, you might raise the threshold for what constitutes spam so that fewer emails are classified as spam. This will usually result in fewer false positives (non-spam emails being classified as spam) and therefore higher precision, but it will also decrease recall (fewer actual spam emails being classified as spam).
Conversely, if you lower your threshold for what constitutes a positive prediction, more emails are classified as spam. This will result in fewer true negatives (emails that are not actually spam not being classified as spam) and more false positives, so it tends to decrease precision (more non-spam emails being classified as spam).
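One practical way to use this is to pick the lowest threshold that still meets a precision target, since recall can only fall as the threshold rises. The sketch below does that on invented scores and labels; `precision_recall_at` and `lowest_threshold_for` are hypothetical names, and treating precision as 0.0 when nothing is predicted positive is a convention chosen here.

```python
# Hypothetical spam scores from some classifier, with the true labels (1 = spam).
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.35, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]


def precision_recall_at(threshold):
    """Return (precision, recall) when every score >= threshold is predicted as spam."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # convention: 0.0 if nothing is flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def lowest_threshold_for(target_precision, candidates):
    """Return the smallest candidate threshold whose precision meets the target, or None."""
    for threshold in sorted(candidates):
        precision, _ = precision_recall_at(threshold)
        if precision >= target_precision:
            return threshold
    return None


candidates = [0.3, 0.5, 0.7]
for target in (0.75, 0.9):
    chosen = lowest_threshold_for(target, candidates)
    print(f"target precision {target}: threshold {chosen}, (precision, recall) = {precision_recall_at(chosen)}")
```

A stricter precision target forces a higher threshold, and the recall reported next to it shows what that costs.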
To summarize,
there are a few ways to increase precision or recall in machine learning. The most direct is to adjust the threshold for classification: raise it to favor precision, lower it to favor recall. Another is to change the decision boundary itself, by retraining the model or by using a different algorithm altogether. Choosing the right evaluation metric also matters, although a metric does not change the model by itself. For example, if neither precision nor recall should be sacrificed for the other, you can select models and thresholds using the F1 score, which is the harmonic mean of precision and recall.
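For reference, the F1 score is the harmonic mean of precision and recall. The short sketch below (with made-up precision and recall values, and a hand-written `f1_score` helper rather than a library call) shows why it rewards balance between the two.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical models: one lopsided, one balanced.
lopsided = f1_score(0.9, 0.1)
balanced = f1_score(0.6, 0.6)
print(f"lopsided F1={lopsided:.2f}, balanced F1={balanced:.2f}")
```

The lopsided model has very high precision but scores far lower on F1 than the balanced one, because the harmonic mean is pulled toward the smaller of the two values.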
Sensitivity vs Specificity
In machine learning, sensitivity and specificity are two measures of the performance of a model. Sensitivity is the proportion of actual positives that are correctly predicted by the model (it is the same quantity as recall), while specificity is the proportion of actual negatives that are correctly predicted by the model.
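Both measures fall straight out of the confusion matrix. Here is a minimal sketch with invented counts; the function name `sensitivity_specificity` is chosen for this example only.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of actual positives that were caught
    specificity = tn / (tn + fp)  # share of actual negatives that were left alone
    return sensitivity, specificity


# Hypothetical counts: 50 actual positives and 50 actual negatives.
sensitivity, specificity = sensitivity_specificity(tp=40, fn=10, tn=45, fp=5)
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

Note that specificity uses true negatives, which precision and recall both ignore.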
Google introduced compute units, which you can purchase just as you would cloud computing capacity from AWS, Azure, etc. With Pro you get 100 compute units, and with Pro+ you get 500. The GPU or TPU you select and the High-RAM option affect how many compute units you use per hour. If you don’t have any compute units, you can’t use “Premium” tier GPUs (A100, V100), and even a P100 is not viable.
Google Colab Pro+ comes with the Premium tier GPU option, whereas in Pro, if you have compute units, you are randomly connected to a P100 or a T4. After you use all of your compute units, you can buy more, or you can use a T4 GPU for half or most of the time (there can be many times in the day when you can’t use a T4 or any kind of GPU at all). In the free tier, the GPUs offered are most of the time a K80 or a P4, which perform similarly to a 750 Ti (an entry-level GPU from 2014) but with more VRAM.
For reference, a T4 uses around 2 compute units per hour and an A100 around 15. As far as is currently known, the compute unit cost of GPUs tends to fluctuate based on some unknown factor.
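To put those numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes the approximate hourly rates quoted above stay constant, which the post itself says they do not, so treat the results as ballpark figures only. The constant and function names are made up for this example.

```python
# Approximate figures quoted in this post; actual rates fluctuate.
MONTHLY_UNITS = {"Pro": 100, "Pro+": 500}
UNITS_PER_HOUR = {"T4": 2, "A100": 15}


def gpu_hours(compute_units, units_per_hour):
    """Hours of GPU time a compute-unit balance buys at a fixed hourly rate."""
    return compute_units / units_per_hour


for plan, units in MONTHLY_UNITS.items():
    for gpu, rate in UNITS_PER_HOUR.items():
        print(f"{plan}: about {gpu_hours(units, rate):.1f} hours on a {gpu}")
```

At these rates, the Pro allowance covers roughly 50 hours on a T4 but under 7 hours on an A100.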
For hobbyists and (under)graduate coursework, it is better to use your own GPU if you have something with more than 4 GB of VRAM that is faster than a 750 Ti, or at least to purchase Colab Pro so that you can reach a T4 even when you have no compute units remaining.
For small research companies, for non-trivial research at universities, and probably for most people, Colab is now probably not a good option.
Colab Pro+ can be considered if you want Pro but don’t sit in front of your computer, since a session disconnects after 90 minutes of inactivity on your computer. But this can be overcome to some extent with scripts. So most of the time Colab Pro+ is not a good option.
If you have anything to add, please let me know so I can edit this post to include it. Thanks!
In machine learning, precision and recall trade off against each other; increasing one often decreases the other. There is no single silver bullet solution for increasing either precision or recall; it depends on your specific use case which one is more important and which methods will work best for boosting whichever metric you choose. In this blog post, we explored some methods for increasing either precision or recall; hopefully this gives you a starting point for improving your own models!