What Are the Best Machine Learning Algorithms for Imbalanced Datasets?

Machine Learning Algorithms and Imbalanced Datasets

In machine learning, imbalanced datasets are those where one class heavily outnumbers the others. This can be due to the nature of the problem or simply because more data is available for one class than the others. Either way, imbalanced datasets can pose a challenge for machine learning algorithms. In this blog post, we’ll take a look at which machine learning algorithms are best suited for imbalanced datasets and why they tend to perform better than others.

For example, in a binary classification problem with 100 observations, of which only 10 are positive and the remaining 90 are negative, we say that the dataset is imbalanced. The ratio of positive to negative cases is 10:90, or 1:9.
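
To make the arithmetic concrete, here is a minimal sketch that counts the classes for 10 positives among 100 observations and shows why plain accuracy can mislead on such data. The label list and the "always predict the majority class" baseline are illustrative assumptions, not a real dataset or model.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical labels: 10 positives (1) and 90 negatives (0).
labels = [1] * 10 + [0] * 90

counts = Counter(labels)
ratio = Fraction(counts[1], counts[0])  # positives : negatives, reduced

# A trivial baseline that always predicts the majority class.
majority_class = counts.most_common(1)[0][0]
predictions = [majority_class] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
minority_recall = sum(p == 1 and y == 1 for p, y in zip(predictions, labels)) / counts[1]

print(f"positive:negative = {ratio.numerator}:{ratio.denominator}")
print(f"majority-class baseline accuracy = {accuracy:.2f}")
print(f"recall on the positive class = {minority_recall:.2f}")
```

The baseline scores 90% accuracy without finding a single positive case, which is why metrics such as recall, precision, or F1 are usually reported alongside accuracy on imbalanced data.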

There are a few reasons why some machine learning algorithms tend to perform better on imbalanced datasets than others. First, certain algorithms are designed to handle imbalanced datasets. Second, some algorithms are more robust to outliers, which can be more common in imbalanced datasets. And third, some algorithms are better able to learn from a limited amount of data, which can be an issue when one class is heavily outnumbered by the others.

Some of the best machine learning algorithms for imbalanced datasets include:

  • Support Vector Machines (SVMs)
  • Decision Trees
  • Random Forests
  • Naive Bayes Classifiers
  • k-Nearest Neighbors (kNN)

Of these, SVMs tend to be a popular choice. SVMs work by finding a hyperplane that maximizes the margin between the two classes, which helps to reduce overfitting and improve generalization. A standard SVM is not designed specifically for imbalanced data, but most implementations let you assign a higher misclassification penalty to the minority class. Decision trees and random forests are also popular choices as they are less sensitive to outliers than linear models such as logistic regression. Naive Bayes classifiers are another good choice as they are able to learn from a limited amount of data. kNN can also work with a limited amount of data, although with a small k it is sensitive to noisy points, and it can be computationally intensive for large datasets.
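
One common way to make these algorithms pay more attention to the minority class is class weighting. The sketch below computes inverse-frequency weights with the formula n_samples / (n_classes * class_count), the same heuristic scikit-learn documents for its "balanced" class-weight mode. The function name `balanced_class_weights` and the label list are assumptions made for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight}, where rarer classes get larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced labels: 90 negatives (0) and 10 positives (1).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.3f}")
```

With these weights, an error on a positive example costs nine times as much as an error on a negative one. Many classifiers accept such weights directly; in scikit-learn, for example, `SVC`, `LogisticRegression`, and `RandomForestClassifier` take a `class_weight` parameter.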

There are two main types of machine learning algorithms: supervised and unsupervised. Supervised algorithms tend to perform better on imbalanced datasets than unsupervised algorithms. In this blog post, we will discuss why this is so and look at some examples.

Supervised Algorithms
Supervised algorithms are those where the target variable is known. In other words, we have training data where the correct answers are already given. The algorithm then learns from this data and is able to generalize to new data. Some examples of supervised algorithms are regression and classification.

Unsupervised Algorithms
Unsupervised algorithms are those where the target variable is not known. With unsupervised algorithms, we only have input data, without any corresponding output labels. The algorithm has to learn from the data itself without any guidance. Some examples of unsupervised algorithms are clustering and dimensionality reduction.

Why Supervised Algorithms Perform Better on Imbalanced Datasets
Supervised algorithms perform better on imbalanced datasets because the labels in the training data tell them which cases belong to the minority class, so they can learn which cases are more important. With unsupervised algorithms, all data points are treated equally, regardless of whether they are in the minority or majority class.

For example, in a binary classification problem with an imbalanced dataset, let’s say that we want to predict whether a customer will default on their loan payment or not. We have a training dataset of 1000 customers, out of which only 100 (10%) have defaulted on their loan in the past.

If we use a supervised algorithm like logistic regression, the algorithm will learn from the training data that defaulting on a loan is rare (since only 10% of cases in the training data are positive). This means that it will be more likely to predict correctly that a new customer will not default on their loan (since this is the majority class in the training data). The flip side is that it can miss actual defaulters, so recall on the minority class should be checked alongside overall accuracy.
However, if we use an unsupervised algorithm like k-means clustering, all data points will be treated equally since there is no target variable to guide the algorithm. This means that it might cluster customers who have defaulted on their loans together with those who haven’t.
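
Returning to the supervised side of the loan example, the sketch below puts numbers on it by scoring two hypothetical classifiers on 1000 customers with 100 defaults. The confusion counts for both classifiers are invented for illustration; they are not results from a real model.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


n_customers, n_defaults = 1000, 100

# Classifier A (hypothetical): always predicts "no default".
a_tp, a_fp, a_fn = 0, 0, n_defaults
a_accuracy = (n_customers - a_fp - a_fn) / n_customers
a_precision, a_recall, a_f1 = precision_recall_f1(a_tp, a_fp, a_fn)

# Classifier B (hypothetical): catches 70 of the 100 defaults, with 80 false alarms.
b_tp, b_fp, b_fn = 70, 80, 30
b_accuracy = (n_customers - b_fp - b_fn) / n_customers
b_precision, b_recall, b_f1 = precision_recall_f1(b_tp, b_fp, b_fn)

print(f"A: accuracy={a_accuracy:.2f} recall={a_recall:.2f} f1={a_f1:.2f}")
print(f"B: accuracy={b_accuracy:.2f} recall={b_recall:.2f} f1={b_f1:.2f}")
```

Classifier A has the higher accuracy but never flags a default, while classifier B trades a little accuracy for finding most of the defaulters. Which trade-off is acceptable depends on the relative cost of a missed default versus a false alarm.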

In conclusion, supervised machine learning algorithms tend to perform better on imbalanced datasets than unsupervised machine learning algorithms because they can learn from the training data which cases are more important. 

Some machine learning algorithms tend to perform better on highly imbalanced datasets because they are designed to deal with imbalance or because they can learn from both classes simultaneously. If you are working with a highly imbalanced dataset, then you should consider using one of these algorithms.

Thanks for reading!

How are machine learning techniques being used to address unstructured data challenges?

Machine learning techniques are being used to address unstructured data challenges in a number of ways:

  1. Natural language processing (NLP): NLP algorithms can be used to extract meaningful information from unstructured text data, such as emails, documents, and social media posts. NLP algorithms can be trained to classify text data, identify key terms and concepts, and extract structured data from unstructured text.
  2. Image recognition: Machine learning algorithms can be used to analyze and classify images, enabling the automatic identification and classification of objects, people, and other elements in images. This can be useful for tasks such as image tagging and search, as well as for applications such as security and surveillance.
  3. Audio and speech recognition: Machine learning algorithms can be used to analyze and classify audio data, enabling the automatic transcription and translation of spoken language. This can be useful for tasks such as speech-to-text transcription, as well as for applications such as call center automation and language translation.
  4. Video analysis: Machine learning algorithms can be used to analyze and classify video data, enabling the automatic detection and classification of objects, people, and other elements in video. This can be useful for tasks such as video tagging and search, as well as for applications such as security and surveillance.

Overall, machine learning techniques are being used in a wide range of applications to extract meaningful information from unstructured data, and to enable the automatic classification and analysis of data in a variety of formats.

How are AI and machine learning impacting application development today?

Artificial intelligence (AI) and machine learning are having a significant impact on application development today in a number of ways:

  1. Enabling new capabilities: AI and machine learning algorithms can be used to enable applications to perform tasks that would be difficult or impossible for humans to do. For example, AI-powered applications can be used to analyze and classify large amounts of data, or to automate complex decision-making processes.
  2. Improving performance: AI and machine learning algorithms can be used to optimize the performance of applications, making them faster, more efficient, and more accurate. For example, machine learning algorithms can be used to improve the accuracy of predictive models, or to optimize the performance of search algorithms.
  3. Streamlining development: AI and machine learning algorithms can be used to automate various aspects of application development, such as testing, debugging, and deployment. This can help to streamline the development process and reduce the time and resources needed to build and maintain applications.
  4. Enhancing user experiences: AI and machine learning algorithms can be used to enhance the user experience of applications, by providing personalized recommendations, or by enabling applications to anticipate and respond to the needs and preferences of users.

Overall, AI and machine learning are having a significant impact on application development today, and they are likely to continue to shape the way applications are built and used in the future.

How will advancements in artificial intelligence and machine learning shape the future of work and society?

Advancements in artificial intelligence (AI) and machine learning are likely to shape the future of work and society in a number of ways. Some potential impacts include:

  1. Automation: AI and machine learning algorithms can be used to automate tasks that are currently performed by humans, such as data entry, customer service, and manufacturing. This could lead to changes in the types of jobs that are available and the skills that are in demand, as well as to increased productivity and efficiency.
  2. Job displacement: While automation may create new job opportunities, it could also lead to job displacement, particularly for workers in industries that are more susceptible to automation. This could lead to social and economic challenges, including unemployment and income inequality.
  3. Increased efficiency: AI and machine learning algorithms can be used to optimize and streamline business processes, leading to increased efficiency and productivity. This could lead to economic growth and innovation, and could also help to reduce costs for businesses and consumers.
  4. Enhanced decision-making: AI and machine learning algorithms can be used to analyze large amounts of data and make more informed and accurate decisions. This could lead to improved outcomes in fields such as healthcare, finance, and education, and could also help to reduce bias and improve fairness.

Overall, the impact of AI and machine learning on the future of work and society is likely to be significant and complex, with both potential benefits and challenges. It will be important to consider and address these impacts as these technologies continue to advance and become more widely adopted.

  • [P] Multi Output Regression to predict cost and revenue from ROAS and other features
    by /u/ibraheemn73 (Machine Learning) on July 23, 2024 at 11:14 am

    I am trying to predict expected Cost and Revenue for hotel_name and Channel from user inputs: ROAS (Revenue / Cost), hotel_name, and month (refer to the sample data below). I've attempted using Multioutput Regression and the pymc-marketing library but haven't found a satisfactory solution. The predictions are not close to the real data and show major variability. Could someone suggest a method or a library that might be better suited for this problem?

    Multi Output Model script I have:

    ```python
    import pandas as pd
    import numpy as np
    from warnings import filterwarnings
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.multioutput import MultiOutputRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    filterwarnings('ignore')


    # Define function to exclude outliers
    def exclude_outliers_using_iqr(df, group_columns, columns, multiplier=1.5):
        def exclude_outliers(group):
            for column in columns:
                Q1 = group[column].quantile(0.25)
                Q3 = group[column].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - multiplier * IQR
                upper_bound = Q3 + multiplier * IQR
                group = group[(group[column] >= lower_bound) & (group[column] <= upper_bound)]
            return group

        df = df.groupby(group_columns).apply(exclude_outliers).reset_index(drop=True)
        return df


    # Define function to make new prediction
    def make_new_prediction(new_data, pipeline):
        new_df = pd.DataFrame([new_data])
        new_X = new_df[['month', 'channel_grup', 'market', 'ROAS']]
        new_prediction = pipeline.predict(new_X)
        return new_prediction


    # Define function to predict ROAS for all
    def predict_roas_for_all(df, model, roas, month):
        days_in_month = {
            1: 31, 2: 28, 3: 31, 4: 30, 5: 31, 6: 30,
            7: 31, 8: 31, 9: 30, 10: 31, 11: 30, 12: 31
        }
        num_days = days_in_month[month]
        unique_combinations = df[['channel_grup', 'market']].drop_duplicates()
        predictions = []
        for _, row in unique_combinations.iterrows():
            channel_group = row['channel_grup']
            market = row['market']
            input_data = {
                'channel_grup': channel_group,
                'market': market,
                'ROAS': roas,
                'month': month
            }
            input_df = pd.DataFrame([input_data])
            prediction = model.predict(input_df)
            # The model predicts daily values; scale them to the whole month
            cost = prediction[0][0] * num_days
            revenue = prediction[0][1] * num_days
            prediction_result = {
                'channel_grup': channel_group,
                'market': market,
                'ROAS': roas,
                'month': month,
                'cost': cost,
                'revenue': revenue
            }
            predictions.append(prediction_result)
        predictions_df = pd.DataFrame(predictions)
        return predictions_df


    # Load data
    df = pd.read_csv(r'data.csv')
    df['date'] = pd.to_datetime(df['date'])

    # Define unique hotels list
    hotels_list = df['hotel_name'].unique()

    # Initialize final results list
    final_results = []

    for hotel in hotels_list:
        print(hotel)
        df_hotel = df.loc[
            (df['Revenue'] > 1) &
            (df['hotel_name'] == hotel)
        ].reset_index(drop=True)
        if df_hotel.shape[0] == 0:
            continue

        # Split the 'Search' channel group by channel; keep other groups as they are
        df_hotel.loc[df_hotel['channel_group'] == 'Search', 'channel_grup'] = df_hotel['channel'] + '_' + df_hotel['channel_group']
        df_hotel.loc[df_hotel['channel_grup'].isna(), 'channel_grup'] = df_hotel['channel_group']

        group = df_hotel.groupby(by=['date', 'channel_grup', 'hotel_name', 'market'])[['Cost', 'Revenue']].sum().reset_index()
        group['ROAS'] = (group['Revenue'] / group['Cost']).round(2)

        # Keep the 75% of markets with the most rows
        market_counts = group.groupby(by=['market'])['date'].count().reset_index().sort_values(by=['date'])
        top_market_percent = market_counts.tail(int(np.ceil(0.75 * len(market_counts))))
        top_market_percent = top_market_percent.drop(columns=['date'])
        group = pd.merge(group, top_market_percent, on=['market'], how='right')

        group = exclude_outliers_using_iqr(group, ['market', 'channel_grup'], ['ROAS', 'Cost', 'Revenue'])
        group['month'] = group['date'].dt.month

        X = group[['month', 'channel_grup', 'market', 'ROAS']]
        y = group[['Cost', 'Revenue']]

        categorical_features = ['channel_grup', 'market']
        categorical_transformer = OneHotEncoder(handle_unknown='ignore')
        preprocessor = ColumnTransformer(
            transformers=[
                ('cat', categorical_transformer, categorical_features)
            ],
            remainder='passthrough'
        )
        pipeline = Pipeline(verbose=True, steps=[
            ('preprocessor', preprocessor),
            ('regressor', MultiOutputRegressor(RandomForestRegressor()))
        ])

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        pipeline.fit(X_train, y_train)

        for increment in range(1, 70, 1):
            for month in range(1, 13):
                increment_predictions = predict_roas_for_all(group, pipeline, increment, month)
                increment_predictions['hotel_name'] = hotel
                increment_predictions['increment'] = increment
                increment_predictions['month'] = month
                final_results.append(increment_predictions)

    # Combine all results into a single DataFrame
    final_results_df = pd.concat(final_results, ignore_index=True)
    ```

    Sample data:

    ```python
    import pandas as pd

    data = {
        'hotel_name': [
            'Jumeirah Burj Al Arab', 'Jumeirah Beach Hotel', 'Atlantis The Palm', 'Burj Khalifa Hotel', 'Armani Hotel Dubai',
            'Jumeirah Burj Al Arab', 'Jumeirah Beach Hotel', 'Atlantis The Palm', 'Burj Khalifa Hotel', 'Armani Hotel Dubai',
            'Jumeirah Burj Al Arab', 'Jumeirah Beach Hotel', 'Atlantis The Palm', 'Burj Khalifa Hotel', 'Armani Hotel Dubai'
        ],
        'Channel': [
            'Bing_Search', 'Bing_Search', 'Bing_Search', 'Bing_Search', 'Bing_Search',
            'Google_Search', 'Google_Search', 'Google_Search', 'Google_Search', 'Google_Search',
            'Google_Search', 'Metasearch', 'Metasearch', 'Metasearch', 'Metasearch'
        ],
        'market': [
            'Australia', 'UAE', 'UK', 'US', 'World Wide',
            'Australia', 'Canada', 'UAE', 'UK', 'US',
            'World Wide', 'India', 'UAE', 'UK', 'US'
        ],
        'year': [
            2023, 2023, 2023, 2023, 2023,
            2023, 2023, 2023, 2023, 2023,
            2023, 2024, 2023, 2023, 2023
        ],
        'month': [
            2, 3, 1, 1, 1,
            1, 1, 1, 1, 1,
            1, 4, 2, 2, 2
        ],
        'Cost': [
            38.1, 27.0, 26.2, 426.2, 119.8,
            1177.8, 291.3, 16727.9, 10178.4, 4592.7,
            44880.7, 162.2, 281.8, 45.0, 321.4
        ],
        'Revenue': [
            20946.6, 30081.5, 21308.8, 174064.0, 22784.2,
            105614.4, 13672.4, 509304.4, 692854.5, 353565.6,
            1164871.3, 107757.7, 27406.1, 31325.9, 80625.0
        ],
        'ROAS': [
            549.78, 1114.13, 813.31, 408.41, 190.19,
            89.67, 46.94, 30.45, 68.07, 76.98,
            25.95, 664.35, 97.25, 696.13, 250.86
        ]
    }
    ```

    Thank You! submitted by /u/ibraheemn73 [link] [comments]

  • [P] haipera - an open source tool to instrument Python notebooks & scripts with configs without writing any code
    by /u/dromger (Machine Learning) on July 23, 2024 at 3:24 am

    TL;DR: I made an open source (apache 2) tool (https://github.com/haipera/haipera) to make it easier to do hyperparameter sweeps with simple scripts & notebooks. Hey everyone! I've been doing research in ML / CV for a good 7 years now and I've always been frustrated by how much time I have to spend writing instrumentation code instead of writing algorithms. By instrumentation code, I mean things like: config management, config logging, logging in general, experiment tracking, etc... In my career I've written countless dataclasses, yaml files, json files, and a lot more code to pass the parameters from these config objects through layers of class hierarchies- just to find out whatever experiment I was trying wasn't fruitful and now having to delete what I just added. The oft-repeated meme is 'machine learning researchers just do hyperparameter sweeps' but reality is that we actually write code to pass in these hyper parameters so that we can do the sweeps. This causes even more problems when transferring the code to product teams; product teams get code that has 600 lines of argparse and code that copies from argparse into initializers; code that is often buggy and makes cross-project compatibility hard. I also have a lot of friends who work / worked on config systems to try to solve this- from Hydra to tyro to dysweep to countless internal tools. I was one of them too, but the problem is that these libraries tend to get increasingly complex... as they try to be more 'robust' and less broken. More code to write is almost never the solution. So I wanted to experiment with a new paradigm which throws away instrumentation code entirely and relies on static parsing to instrument code. This means you don't have to ever write a single line of code to enable things like configs for your code. This is recently made possible with the availability of better parsing libraries (like ast and libcst). LLMs hold a lot of exciting potential for this too looking into the future. 
    How does this all work? Given a script like:

    ```python
    num_apples = 100
    apple_price = 3.0
    print("# apples: ", num_apples)
    print("price of an apple: ", apple_price)
    price = num_apples * apple_price
    print("total: ", price)
    ```

    You just do pip install haipera and then you can run the script with haipera run script.py. You can run haipera run script.py --help to see that variables are directly editable from the CLI (right now only supports globals, and primitive types like numbers, bools, strings). You can run something like haipera run script.py --apple-price 1.0 to directly set the parameters from CLI. When you run with haipera, it will create its own experiment folder in reports and populate it with an automatically generated config file which you can rerun directly for reproducibility.

    If you want to do grid sweeps, you can simply pass in multiple arguments like haipera run script.py --num-apples 1,2,3 --apple-price 2.0,3.0,4.0. You can also do other things like haipera run script.ipynb to run a notebook as a script (convenient if you want to develop inside a notebook, but run lots of experiments with configs as scripts) or haipera notebook script.ipynb --opt1 2 to spin up a new variant of the notebook with the provided config. This turns out to be convenient for versioning your notebooks too!

    I'm pretty excited about this library and have been getting feedback from my researcher friends, but I wanted to show you all and gather feedback. We plan to make this much more feature complete (like supporting more types of variables, generally making everything more robust, and adding support for things like GPU profiling instrumentation)- but before that we wanted to hear what people think of this and hear what sorts of features you wish existed in MLOps tooling in general. Let us know what you think! https://github.com/haipera/haipera submitted by /u/dromger [link] [comments]

  • Self-supervised learning weights initialization "after" projection head [D][R]
    by /u/grid_world (Machine Learning) on July 22, 2024 at 8:00 pm

    For most self-supervised learning algorithms (SimCLR, MoCo, BYOL, SimSiam, SwAV, etc.), it's common to have a projection head after the base encoder (which in most cases is a vanilla ResNet-50 CNN). An example of such a projection (taken from SwAV) is:

    ```python
    import torch
    from torch import nn

    projection_head = nn.Sequential(
        nn.Linear(2048, 512),
        nn.BatchNorm1d(512),
        nn.ReLU(inplace=True),
        nn.Linear(512, 128),
    )
    ```

    The output of this projection head is L2-normalized:

    ```python
    x = projection_head(x)
    x = nn.functional.normalize(x, dim=1, p=2)
    ```

    I am trying to initialize a layer after the projection head as:

    ```python
    wts = nn.Parameter(data=torch.empty(40 * 40, 128), requires_grad=True)
    # The projection head outputs weights in the range [-1, 1], so initialize SOM weights to be in that range
    wts.data.uniform_(-1.0, 1.0)
    ```

    Since the output of the projection head is L2-normalized, I am assuming that the input range to "wts" ∈ [-1, 1] and therefore use the uniform initialization above. Is this a correct approach or am I missing something? submitted by /u/grid_world [link] [comments]

  • [D] What are the problems with using Llama in a commercial app?
    by /u/technicallynotlying (Machine Learning) on July 22, 2024 at 6:24 pm

    I searched and saw a thread saying Llama shouldn't be used for commercial purposes, but I can't tell why. I looked at the Meta license for Llama and it says you don't need a license until you have 700M monthly users, a number which there is no way the application I have in mind would ever hit. What am I missing? If I use Llama in a commercial application with far fewer users (maybe 1M per month at the very highest), is there going to be a problem? submitted by /u/technicallynotlying [link] [comments]

  • [D] Supervised Fine-Tuning (SFT)
    by /u/juliannorton (Machine Learning) on July 22, 2024 at 2:03 pm

    Every chatbot in use today, from ChatGPT to custom chatbots built from open-source large language models (LLMs), has been instruction-tuned. An LLM, like any language model, is simply a next-token predictor. To get a vanilla LLM to interact with a user like a chatbot, it must be fine-tuned using tens of thousands of examples of user-and-assistant conversations. This process, called supervised fine-tuning, is a basic building block of productionizing an LLM application. Publicly available LLMs remain general purpose and aren’t suitable for direct use in most business applications because they need to be continuously fine-tuned to produce high-quality results. A modern supervised fine-tuning solution involves something called the low-rank adapter. Low-rank adapters are relatively small matrices (millions, not billions of elements) that sit alongside each layer of the LLM and act as a sidekick. Their job is to translate the inputs and outputs of LLM layers into the proper domain without adding latency in production. During the fine-tuning process, low-rank adapters are trained on gold standard examples to teach an LLM how to respond. If the dataset is high quality and diverse, then the fine-tuned LLM’s output measurably increases in quality with as few as 100 examples as opposed to tens of thousands. Traditionally, these examples would be handcrafted by an expert, but writing them is time-consuming and labor-intensive. At Plum Defense, we automatically generate examples that are on par with human-written ones. This allows for continuous fine-tuning, which increases the quality of the LLM’s responses on an ongoing basis. By combining well-trained low-rank adapters with a well-written system prompt, a machine learning practitioner can produce a robust application that conforms well to the required output and is fast enough to use in production. 
A good system prompt conveys the intention but is concise enough to leave room for retrieval-augmentation (RAG) systems to inject relevant facts into the application. The system prompt’s length also has a direct impact on application latency. The smaller the system prompt, the faster the application’s average response time. With advanced techniques like soft-prompting, the size of the system prompt can be reduced significantly, which speeds up response time. If you’d like to learn more about continuous fine-tuning and soft-prompting system for your production application, shoot me a message. submitted by /u/juliannorton [link] [comments]

  • [P] TTSDS - Benchmarking recent TTS systems
    by /u/cdminix (Machine Learning) on July 22, 2024 at 1:29 pm

    TL;DR - I made a benchmark for TTS, and you can see the results here: https://huggingface.co/spaces/ttsds/benchmark There are a lot of LLM benchmarks out there and while they're not perfect, they give at least an overview of which systems perform well at which tasks. There wasn't anything similar for Text-to-Speech systems, so I decided to address that with my latest project. The idea was to find representations of speech that correspond to different factors: for example prosody, intelligibility, speaker, etc. - then compute a score based on the Wasserstein distances to real and noise data for the synthetic speech. I go more into detail on this in the paper (https://www.arxiv.org/abs/2407.12707), but I'm happy to answer any questions here as well. I then aggregate those factors into one score that corresponds with the overall quality of the synthetic speech - and this score correlates well with human evaluation scores from papers from 2008 all the way to the recently released TTS Arena by huggingface. Anyone can submit their own synthetic speech here, and I will be adding some more models as well over the coming weeks. The code to run the benchmark offline is here. submitted by /u/cdminix [link] [comments]

  • [Discussion] when I can use research models for commercial purpose
    by /u/Frosty-Equipment-692 (Machine Learning) on July 22, 2024 at 11:47 am

    I was going through a research paper in which the authors use a diffusion model for a specific purpose. I thought it could be used for commercial purposes, with huge market opportunities if executed correctly. So I wonder: if I have the research paper's code, model architecture and trained weights, I have three questions. 1. Can I productionize this model and its weights and use them commercially? 2. If not, what if I make some necessary changes to the architecture, train it on a new dataset, or both, and then use it for commercial purposes? 3. When would I get into legal, copyright or license issues? submitted by /u/Frosty-Equipment-692 [link] [comments]

  • [R] Equation requirements for PINNs (Physics-informed Neural networks)
    by /u/its_a_targaryen (Machine Learning) on July 22, 2024 at 11:00 am

    I had a question about the differential equations in the loss term. Typically, in PINNs, we use differential equations of the predicted_output with respect to the input variables in the loss function. For example, if u is the predicted_output and x, y, m are the inputs, the loss function includes terms like du/d(x,y,m). However, what if we only have differential equations for the input variables with respect to other inputs or the output variable? For example: dx/dt=f(x,y,u) and dy/dt=g(x,u). Here, the derivatives of x and y are with respect to time t, and there is no equation for du/d(x,y,m). Is it possible to use a PINN approach in this case, where the loss function is constructed only using dx/dt and dy/dt? submitted by /u/its_a_targaryen [link] [comments]

  • [P] FLUTE - a new CUDA kernel for quantized LLM Inference achieving up to 2.6x latency improvements over vLLM. It extends QLoRA with learnable scales to 4-bit and 3-bit per parameter quantization.
    by /u/radi-cho (Machine Learning) on July 22, 2024 at 8:56 am

    The deployment of large language models (LLMs) is often constrained by memory bandwidth, where the primary bottleneck is the cost of transferring model parameters from the GPU's global memory to its registers. When coupled with custom kernels that fuse the dequantization and matmul operations, weight-only quantization can thus enable faster inference by reducing the amount of memory movement. However, developing high-performance kernels for weight-quantized LLMs presents substantial challenges, especially when the weights are compressed to non-evenly-divisible bit widths (e.g., 3 bits) with non-uniform, lookup table (LUT) quantization. This paper describes FLUTE, a flexible lookup table engine for LUT-quantized LLMs, which uses offline restructuring of the quantized weight matrix to minimize bit manipulations associated with unpacking, and vectorization and duplication of the lookup table to mitigate shared memory bandwidth constraints. At batch sizes < 32 and quantization group size of 128 (typical in LLM inference), the FLUTE kernel can be 2-4x faster than existing GEMM kernels. As an application of FLUTE, we explore a simple extension to lookup table-based NormalFloat quantization and apply it to quantize LLaMA3 to various configurations, obtaining competitive quantization performance against strong baselines while obtaining an end-to-end throughput increase of 1.5 to 2 times. Arxiv: https://arxiv.org/abs/2407.10960 submitted by /u/radi-cho [link] [comments]

  • [R] Neural networks have been trained to accurately predict the optimal geometry of molecules using 50 times less data
    by /u/AIRI_Institute (Machine Learning) on July 22, 2024 at 8:04 am

    An important task of computational chemistry is to find molecular geometries where a local energy minimum is achieved, as these are the most likely configurations in which the molecule undergoes a chemical reaction. Despite recent progress in neural networks for molecular conformation energy prediction, such models are prone to errors due to distribution shifts, leading to inaccurate energy minimization. The quality of energy minimization with neural networks can be improved by providing optimization trajectories as additional training data. Still, obtaining complete optimization trajectories demands a lot of extra computations. A team of researchers developed a new framework called Gradual Optimization Learning Framework (GOLF), consisting of an efficient data-collecting scheme and an external optimizer. The authors demonstrated that, using significantly less additional data, the neural network trained with GOLF performs on par with the oracle on a benchmark of diverse drug-like molecules. The paper is published in the ICLR 2024 conference proceedings. submitted by /u/AIRI_Institute [link] [comments]

  • [P] ModelClash: Dynamic LLM Evaluation Through AI Duels
    by /u/throwquestion111 (Machine Learning) on July 22, 2024 at 7:20 am

    I've developed ModelClash, an open-source framework for LLM evaluation that could offer some potential advantages over static benchmarks:

      • Automatic challenge generation, reducing manual effort
      • Should scale with advancing model capabilities
      • Evaluates both problem creation and solving skills

    The project is in early stages, but initial tests with GPT and Claude models show promising results. GitHub: https://github.com/mrconter1/ModelClash What are your thoughts on how this approach could complement existing LLM evaluation methods? submitted by /u/throwquestion111 [link] [comments]

  • [D] Aggregating token probabilities
    by /u/archiesteviegordie (Machine Learning) on July 22, 2024 at 5:36 am

    What are some good aggregation techniques for scoring a generated sequence from its token probabilities (either the softmax probabilities or the log probabilities)? For example, one could find the key entities in an answer, look up the probabilities of their tokens, and take the median token probability across those entities.
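As a rough illustration of the question above, here is a minimal sketch of a few common ways to collapse per-token log probabilities into one score. The function name `aggregate_token_logprobs` and the example values are hypothetical, not taken from the post; which aggregate is appropriate depends on the use case.

```python
import math
import statistics

def aggregate_token_logprobs(logprobs):
    """Collapse per-token log probabilities into several sequence-level scores."""
    if not logprobs:
        raise ValueError("logprobs must be non-empty")
    total = sum(logprobs)
    mean_lp = total / len(logprobs)
    return {
        "sum_logprob": total,                # log of the joint probability
        "mean_logprob": mean_lp,             # length-normalized log probability
        "geo_mean_prob": math.exp(mean_lp),  # geometric mean of token probabilities
        "perplexity": math.exp(-mean_lp),    # inverse of the geometric mean
        "min_prob": math.exp(min(logprobs)), # the least confident token
        "median_prob": math.exp(statistics.median(logprobs)),
    }

# Hypothetical log probabilities for the tokens of one key entity.
entity_logprobs = [math.log(0.9), math.log(0.5), math.log(0.8)]
scores = aggregate_token_logprobs(entity_logprobs)
print(scores)
```

The sum penalizes longer sequences, while the mean and geometric mean are length-normalized; the minimum and median are less sensitive to a few very confident tokens.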

  • [Discussion] Document Image Restoration
    by /u/atlury (Machine Learning) on July 22, 2024 at 5:25 am

    Here is DocRes, an image restoration model, running in chaiNNer to improve scanned documents. The post shows the original image, followed by the restored image, followed by the chaiNNer model. Going further, I am using Mindee docTR to extract line segments very accurately. The next task I am working on is recognizing font sizes, then font styles, and then using Microsoft Phi-3 or a similar model with OCR capabilities to OCR the text, apply the styles, and restore the image. Links: https://github.com/ZZZHANG-jx/DocRes https://github.com/chaiNNer-org/chaiNNer

  • [P] Best practices in fine tuning OS models with sparse data for custom downstream tasks
    by /u/VBQL (Machine Learning) on July 22, 2024 at 5:03 am

    I have a downstream task where 99+% of the input is context generated by various sources. The actual model output is just a couple of tokens, but the input can vary from 2k tokens all the way up to 10k tokens. I'm therefore trying to fine-tune Mistral 7B v0.3 for this task, given its long context window. Even with a lower learning rate such as 8e-6 with decay, I'm still getting higher and higher training losses per run. The training set consists of the standard input_ids, attention_mask and labels, but due to the nature of the training data, attention_mask and labels are mostly 1s and -100s, respectively. Since the examples also vary wildly in size, I've packed the data into sequences of length 4096 so that the length is constant. My training machine is an AWS trn1n.32xlarge. Are there any suggestions on what I should do here? For anyone curious about the dataset, the original post links to the directly tokenized version of the data.
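To make the data layout described above concrete, here is a small sketch of how such an example might be built, assuming the usual convention that label positions set to -100 are skipped by the loss (this is the default `ignore_index` of PyTorch's cross-entropy loss). The helper names `build_labels` and `supervised_fraction` and the token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def build_labels(prompt_ids, answer_ids):
    """Build one training example where only the answer tokens are supervised."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    attention_mask = [1] * len(input_ids)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

def supervised_fraction(labels):
    """Fraction of positions that actually contribute to the loss."""
    return sum(1 for t in labels if t != IGNORE_INDEX) / len(labels)

# Hypothetical example: 98 context tokens followed by a 2-token answer.
example = build_labels(list(range(98)), [7, 8])
print(supervised_fraction(example["labels"]))
```

Checking this fraction per packed sequence is one way to see how few tokens drive each gradient step when almost everything is context.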

  • [P] Why can't I use single component images for model training.
    by /u/Sherlockgnomes98 (Machine Learning) on July 22, 2024 at 3:54 am

    Hey guys. I'm working on a machine learning project using YOLOv5 to identify components in P&ID diagrams. When I trained on complex diagrams with a few components in each image, the model trained correctly. My next step was to take single images of these components and train the model on those (with variations such as flipped, mirrored, and 90° rotated copies), because that makes it much easier to annotate and build the dataset. But this version only works on single images of valves (I'm trying to detect valves). When I use full diagrams with multiple components, it doesn't detect anything. Is there a reason for this? What can I do to improve the model?

  • [D] Is it okay to use different numbers of query samples across folds in few-shot prototypical networks
    by /u/The_Aoki_Taki (Machine Learning) on July 21, 2024 at 8:39 pm

    I'm currently working on a few-shot learning project using prototypical networks and have encountered a situation that I'd like some advice on. I have a predefined 5-fold cross-validation setup, but there's an inconsistency in the number of examples for Class A across these folds. Specifically, one fold has 6 examples for Class A, while the other folds each have 12 examples for Class A. Given this scenario, is it acceptable to use a different number of query samples during training and testing while maintaining the same number of shots and ways? For instance, can I train the model with a variable number of queries and evaluate each fold with its respective number of queries?
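For context on the question above, here is a minimal sketch of one prototypical-network episode, assuming embeddings are already computed. Prototypes depend only on the support set (the shots), and the loss is averaged over queries, so in this sketch the number of queries changes the variance of the estimate but not its scale. All names and values are hypothetical.

```python
import math

def prototypes(support):
    """support: dict mapping class label -> list of embedding vectors."""
    protos = {}
    for label, vecs in support.items():
        dim = len(vecs[0])
        protos[label] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return protos

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def episode_loss(support, queries):
    """Mean negative log-likelihood over (embedding, label) query pairs."""
    protos = prototypes(support)
    labels = list(protos)
    total = 0.0
    for emb, true_label in queries:
        logits = [-sq_dist(emb, protos[l]) for l in labels]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_z - logits[labels.index(true_label)]
    return total / len(queries)

# Hypothetical 2-way, 2-shot episode with 2D embeddings.
support = {"A": [[0.0, 0.0], [0.0, 2.0]], "B": [[4.0, 0.0], [4.0, 2.0]]}
queries = [([0.0, 1.0], "A"), ([4.0, 1.0], "B")]
loss_two_queries = episode_loss(support, queries)
loss_four_queries = episode_loss(support, queries * 2)  # same queries, repeated
```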

  • [P] ChessGPT, 100,000x smaller than GPT-4, plays chess at 1500 Elo. By finding a skill vector, we can increase its win rate by 2.6x in out-of-distribution games.
    by /u/seraine (Machine Learning) on July 21, 2024 at 7:59 pm

    A previous project trained ChessGPT, a set of 25M and 50M parameter GPT models that can play chess at 1500 Elo. These models are ~100,000x smaller than GPT-4's 1.8T parameters. At Stockfish level 0, the 50M parameter model has a win rate of 70%. However, if the game is initialized with 20 random moves, its win rate drops to 17%. Is this because it can't generalize out of distribution? When considering the task of next-token prediction, a good next token predictor would predict legal but low skill moves if the game begins with random moves. This is what we find with ChessGPT. By adding a skill vector to the model's activations, we can increase its win rate to 43%, or by 2.6x. We don't fully recover the performance gap, but it is a significant fraction. The intervention is very simple, and it's possible that a more sophisticated intervention could further increase its win rate. This model is only trained to predict the next character in PGN strings (1.e4 e5 2.Nf3 …) and is never explicitly given the state of the board or the rules of chess. Despite this, in order to better predict the next character, it learns to compute the state of the board at any point of the game, and learns a diverse set of rules, including check, checkmate, castling, en passant, promotion, pinned pieces, etc. In addition, to better predict the next character it also learns to estimate latent variables such as the Elo rating of the players in the game. We can also use interpretability methods to intervene on the model's internal board state. This work was recently accepted to the 2024 Conference on Language Modeling (COLM) under the title "Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models". More information is available in this post: https://adamkarvonen.github.io/machine_learning/2024/03/20/chess-gpt-interventions.html And the code is here: https://github.com/adamkarvonen/chess_llm_interpretability
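The idea of adding a skill vector to activations can be sketched as follows. This is an illustration under an assumption, not the authors' code: the post does not say how the vector is obtained, so the sketch uses a difference of mean activations between high-skill and low-skill games, which is one common way to build such a steering vector. All names and numbers are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def skill_vector(high_skill_acts, low_skill_acts):
    """Assumed construction: mean high-skill activation minus mean low-skill activation."""
    hi = mean_vector(high_skill_acts)
    lo = mean_vector(low_skill_acts)
    return [h - l for h, l in zip(hi, lo)]

def steer(activation, vector, scale=1.0):
    """Add the scaled skill vector to one activation vector."""
    return [a + scale * v for a, v in zip(activation, vector)]

# Hypothetical 2D activations recorded at one layer.
high = [[1.0, 2.0], [3.0, 4.0]]
low = [[0.0, 1.0], [2.0, 1.0]]
vec = skill_vector(high, low)
steered = steer([0.0, 0.0], vec, scale=0.5)
```

In a real model the addition would be applied to the activations of a chosen layer during the forward pass, and `scale` would be tuned empirically.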

  • [R] Intelligent Digital Agents in the Era of Large Language Models
    by /u/thebigbigbuddha (Machine Learning) on July 21, 2024 at 5:20 pm

    https://doi.org/10.31219/osf.io/f75wz

  • [R] Discussion of ReFT Paper with lead author Zhengxuan Wu
    by /u/FallMindless3563 (Machine Learning) on July 21, 2024 at 4:59 pm

    Hey all, we were lucky enough to have the lead author of the ReFT paper in our Friday paper dive this week, and I thought I'd share the discussion and our notes: https://www.oxen.ai/blog/arxiv-dives-how-reft-works TL;DR: ReFT is a fine-tuning technique that is 15x-60x more parameter-efficient than LoRA. It is very fast to train: about 18 minutes for 1k examples on an A100. I successfully fine-tuned a ReFT on Llama 2 7B in less than 1 minute on an A10 with ~100 examples. It works by operating on the representations in the residual stream instead of the K-V matrices. It adds extra learned parameters, called "interventions", at specific token indices and layers, making it efficient and easy to steer the representations. ReFTs are also composable: for example, you could train one for instruction following and one for German, then apply both to get an instruction-following model in German. The author gives very practical tips and lessons learned while iterating in the lab. The whole discussion is on YouTube as well. Hope you enjoy!

  • [D] On the Neural Developmental Program. How sound do you think is this "learn the encoding" idea, and how far do you think it is from self-programming?
    by /u/HermanHel (Machine Learning) on July 21, 2024 at 4:11 pm

    I just saw this talk from ALIFE 2023, and it looks really interesting. In summary, they used three "policy" networks to grow/generate a "target" network, which is in turn the traditional "policy" network that takes input from the environment and outputs actions. In the end, the agent described in the talk is still a couple of traditional neural networks, only their output is a model instead of a prediction, and training from reward uses only an evolutionary algorithm or PPO. The idea of a structure generating a structure that looks quite like itself sounds absolutely fantastic to a self-programming advocate like me, but by the look of it, it is a sort of fancy generative model for networks, like preferential attachment, except that this time, with the help of reward, it can solve real problems. (I still think it is fantastic and a definitive step out of the current routine of pretraining, fine-tuning, and RLHF.) What do you think of it?
