No.4357
AI is just so batshit retarded that it fails simple probability questions because it gets corrupted by your first prompt.
WRONG WTF ARE YOU THINKING ??
https://chatgpt.com/share/687b9007-fcc0-8002-8369-59e31311b36f
WOW SO YOU CAN DO IT??
https://chatgpt.com/share/687b9055-a368-8002-9736-4c81a3529e6f
No.4358
>>4357
Oh, I understand now... It's giving me the odds that it occurs exactly once because it's been instructed to think in absolute numbers, not human language.
No.4359
>>4357
I'm glad I took all those boring ass corpus linguistics classes in university, they actually help me understand how LLMs work.
No.4362
>>4357
nothing about this is simple, nerd
No.4367
>>4365
I didn't consider that the models eclipse the capacity of a single server GPU.
Apparently not even Grok believes what Brave's AI search told him. Guess I will post this prompt if it comes up.
It basically leaves up llama as the only option.
Also very funny, I run into Grok's message limit after a couple of prompts
No.4378
>>4368
People really, REALLY shouldn't use AI as a source of information. That data there is retarded.
https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/tree/main
An 8B parameter model for end use is like 7GB at the absolute lowest levels of efficiency after quantization. For common usage they're like 4GB. Some companies are pursuing them (or 3B models) so they can fit them on smartphones or similar portable tech. You can fit them on consumer GPUs from a decade ago.
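If you want to sanity-check sizes yourself, the rough rule is just parameters times bits-per-weight divided by 8. A quick Python sketch (the bits-per-weight figures are approximations and real GGUF files carry a bit of overhead):

    # Rough rule of thumb: weights size ≈ parameters * bits-per-weight / 8.
    # The bits-per-weight values below are approximate; actual GGUF files add overhead.
    def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    for label, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
        print(f"8B model at {label}: ~{approx_size_gb(8, bits):.1f} GB")
    # Prints roughly 16, 8.5, and 4.8 GB, which lines up with the file sizes in that repo.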
No.4379
>>4378
Are you sure you're not misunderstanding what it's saying? It's talking about FP16.
No.4380
gehh, I have no idea how to get any of this set up
https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
Is this even the release version? How do I interface with it... so many questions...
No.4381
>>4379
Yeah, I was in the middle of typing more:
fp32 or 16 - used by people with expensive hardware making finetunes. You never touch this stuff as a user. This is an example of AI lacking common sense and making erroneous conclusions since no one would ever 'host' these.
That level of 'precision' is entirely a placebo for end-use. Unfortunately I lack the hardware to do text model merging (hundreds of GB of VRAM), but for image gen there's a crapload of junk data and people only share fp8 and below models online.
No.4382
>>4381
Hosting it is exactly what I have to find out how to do.
No.4383
>>4381
>people only share fp8 and below models online
No wait, fp8 is the default and sometimes people share the fp16, but it's very much an optional thing on the side.
No.4384
>>4383
Alright... I'll just ask you the question...
What VRAM would I need to host Mixtral 8x7B at F32 and F16
No.4385
Presumably I can just follow instructions and get these two options but I'm not sure what sort of requirements are needed
https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
GPT pretty consistently says 90GB of VRAM, but since it's a mix of experts there are ways to have it work in 80GB of memory.
No.4386
also it's just a base so you have to train it into a more specific model for your use case or something
No.4387
>>4384
That's an older model that's been superseded by something better, likely in every regard, but you can look at the sizes here:
https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/tree/main
You'd want to fit the highest quant you can for quality, although the Q8 one is usually seen as well past the point of diminishing returns.
There are new models like QWEN3 that do the 'mixture of experts' thing (very MOE), where it's not all active at once, and despite it being like 72GB I can load it on my 5090. I don't know the science behind it:
https://huggingface.co/unsloth/Qwen3-235B-A22B-GGUF/tree/main/UD-Q2_K_XL
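If it helps, this is roughly how I grab a single quant file out of one of those repos instead of cloning the whole thing (a sketch using the huggingface_hub package; the filename is just an example pattern, check the repo's file list for the real names):

    # Download one specific quant from a GGUF repo rather than every file in it.
    # The filename below is an example pattern -- check the repo's file list.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="TheBloke/Mixtral-8x7B-v0.1-GGUF",
        filename="mixtral-8x7b-v0.1.Q4_K_M.gguf",  # pick the biggest quant that fits your VRAM
    )
    print(path)  # local cache path you can point a loader at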
No.4388
Also the QWEN people JUST came out with a new QWEN3 and it hasn't been quantized by anyone yet.
No.4389
>>4387
Oh wait, you specifically said F32 and F16. Sorry, I'm tired.
I don't know. Those sizes are massive. Maybe 3 A100s? Again, those are the parameters used by people doing finetunes; they've got people with machine learning papers doing stuff.
The corporate "I want an AI model that tells my employees to do [this] when asked" use case is a prompting solution, not a training one.
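For what it's worth, a weights-only back-of-envelope (a rough sketch; I'm assuming Mixtral 8x7B is around 46-47B total parameters since the experts share the attention layers, and this ignores KV cache and other runtime overhead):

    # Weights-only estimate: parameters * bytes-per-parameter.
    # Assumes ~46.7B total parameters for Mixtral 8x7B; runtime overhead comes on top.
    params = 46.7e9
    for label, bytes_per_param in [("FP32", 4), ("FP16", 2)]:
        print(f"{label}: ~{params * bytes_per_param / 1e9:.0f} GB just for the weights")
    # FP32: ~187 GB, FP16: ~93 GB -- multiple 80GB cards either way.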
No.4390
>>4387
So this qwen thing is a better version of mixtral that works on the same system.
And the 70B of llama 3 is the sum of these file sizes?
https://huggingface.co/meta-llama/Llama-3.1-70B/tree/main
I have no idea how you found that convenient list of files.
No.4392
So something like two 80GB GPUs would support all of llama3 70B, and QWEN at max would be similar, but as a mix of experts only two might run at the same time, allowing for slower speed but better memory use.
No.4393
>>4390
It goes like this:
Company releases their weights or whatever you call them. It's that list of
File01 of 40
File02 of 40
that are uploaded on an official account on huggingface. I think you probably could load them directly somehow, but I don't know how and I wouldn't have the hardware to do so.
Then a hobbyist comes around and, uh, quantizes them or something into a few files that are made to be loaded on consumer software like llama.cpp. It ends up in a format like GGUF, which again is something I don't understand, but I know it means that it's efficient and designed to be loaded on something outside of a supercomputer.
This is what my local folder looks like.
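Loading one of those GGUFs is basically this (a minimal sketch with the llama-cpp-python bindings; the filename is just whatever quant you ended up downloading):

    # Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # whichever GGUF you grabbed
        n_ctx=8192,       # context window to reserve memory for
        n_gpu_layers=-1,  # offload every layer to the GPU; lower it if you run out of VRAM
    )
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
        max_tokens=64,
    )
    print(out["choices"][0]["message"]["content"])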
No.4394
>>4393
Oh, which reminds me. Nemotron is a finetune of Llama3 that was sponsored by Nvidia for something and people speak very highly of it. I don't know if it's been superseded by other stuff. I also don't know the rankings for this stuff when it comes to coding or corporate use in general, I just want to roleplay personally.
No.4395
So the quantizations are easier ways to interface with it or are they reductions to make it easier to run on consumer hardware?
If I wanted to load all of LLaMA onto a server, I would download that and use their provided scripts to interface with it (or maybe some of Meta's tooling) and it would process it and output answers.
Or I could get the sloth build which might have some sort of interface to make it have a web UI.
No.4396
But either way, the unrestricted llama 3 70b is something like 130GB and QWEN3 would be similar but a mix of experts
No.4397
Also, what is a finetune vs what that file list is?
No.4398
I guess it's different from a LoRA, where they actually alter the models for a specific use case?
No.4399
>>4395
>So the quantizations are easier ways to interface with it or are they reductions to make it easier to run on consumer hardware
Both, really. It might just be that it became habit that people load a specific filetype over time and it's just that way due to momentum. I know on /g/ in the LLM general (the only decent thread on that board) there are people with enthusiast hardware that load those big ones. They're people that enjoy tinkering with this stuff for the sake of it, though.
>Or I could get the sloth build which might have some sort of interface to make it have a web UI.
This is probably the answer, but I don't know for sure. Maybe check if there's some licensing issue if you're going to be using it for profit?
>>4397
A finetune is uhh.. basically a second stage of training on a model with additional data to try and expand its capabilities in general or for more specific information. You could finetune a general model to know more about anime by feeding it tons of new data specifically about anime and make the Llama3-Anime finetune. People used to do this to try and make local models that excel at ERP. LoRAs are different in that they are also trained on the source model, but they're separate files you can choose to load at runtime. You have more control, but it comes at the cost of needing to load more stuff into memory.
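To make the LoRA part concrete, loading one looks something like this (a rough sketch assuming the transformers and peft packages; the model and adapter names are placeholders, not recommendations):

    # Sketch: a finetune would replace the base model entirely, while a LoRA adapter
    # is loaded on top of the base model at runtime. Names below are placeholders.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base_name = "meta-llama/Meta-Llama-3-8B-Instruct"   # base model
    adapter_name = "someuser/llama3-anime-lora"          # hypothetical LoRA adapter repo

    tokenizer = AutoTokenizer.from_pretrained(base_name)
    base = AutoModelForCausalLM.from_pretrained(base_name, device_map="auto")
    model = PeftModel.from_pretrained(base, adapter_name)  # extra weights in memory, more control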
No.4400
>>4399
I see... So you'd put a model such as llama fp16 onto a GPU (140GB), then I would have a 10GB finetune and then LoRAs which get read when required?
No.4401
onto a GPU array*
No.4402
>>4400
The finetune becomes the loaded model itself. You wouldn't load Llama3, you'd load Llama3-FineTune. The thing with finetunes is that, as with AI training in general, there's no guarantee it will actually improve things. It's quite possible that if you make this hypothetical Llama3-Anime finetune I mentioned, it gains knowledge of Speed Racer but loses knowledge of airplanes.
No.4404
>>4402
I see.. you're actually overwriting the training that came out of the box. Pretty risky.. or high skill... or gacha pro gambling..
No.4405
>>4404
Yeah, with the giant models there are cases where they have to restart the training for various reasons, and these are the data center things where they have 10,000 GPUs working 24/7, so it's a lot of processing thrown down the gutter.
No.4406
As another example, Llama4 was an utter failure despite throwing craploads of resources at it. It led to Zuck poaching talent from OpenAI in a last-ditch effort to save things by throwing billions of dollars around.
No.4407
>>4377
This was the event in reference. Using AI to speed up your workflow is fine, but there's a certain schadenfreude from seeing these "tech bros" get burned for thinking AI is something that it isn't. Even better when they try to interrogate it and hold it accountable. It's honestly baffling that this line of thinking goes all the way up to people who work on them, like the Google guy who got fooled into thinking LaMDA was alive.
I hope to see more of it.
No.4452
I was looking at motherboards and the newest Gigabyte boards bundle LLM software to make use of three PCIe x16/x8 slots.
But I'm wondering how I can do this on any mobo and make use of 11+12GB of RAM, and run something like LLaMA 7B FP16 (~14GB) to test out how to build an AI assistant. Then experiment with the other quantizations.
Any tips?
No.4462
>>4363
RAG is mostly for text, so no.
No.4463
The answers in this thread seem heavily influenced by hobbyist usage, and they are misleading and uninformed.
First of all:
- You didn't mention how big the company is, how many people are going to be using the LLM, nor how frequently it would be used.
- A single GPU is okay for a hobbyist because they're only a single person and prompts come sequentially and infrequently; a single GPU is likely NOT at all adequate for a company unless the same is true: single-person usage, infrequent use, low context window
- If you want RAG, you should be looking at dedicated RAG models, not hobbyist models like Mixtral, Mistral, Qwen, or Llama.
- Number of parameters, precision, and quantization DO have a significant effect. FP32 is non-existent. FP16 is approximately equal to FP8 > INT8 > INT4. Q8 > Q7 > Q6 [...] > Q2. The more parameters (e.g. 207B vs 32B), the better. The higher the precision, the better. The less aggressive the quantization (Q8 over Q4), the better.
You would likely want to look at models such as those by Cohere, such as Command R+, which is a 108B parameter model specifically designed for corporate RAG.
No.4464
>>4463
When you say FP32 is non-existent, do you mean there's no accuracy drop or that it's never used?
No.4466
>>4464
Never used. Only exists in training scenarios. Your options are basically:
FP16, FP8, INT8, and INT4. As precision goes down, compute requirements decrease and the model will run faster and have lower running requirements (lower VRAM, lower GPU utilization; lower electricity cost, higher number of prompts per dollar) -- but there WILL be a penalty in terms of capabilities (lower benchmark results, lower one-shot success).
https://zhuanlan.zhihu.com/p/18736185169
No.4467
You made me check my posts since you called me misleading and uninformed, but nothing you said was different...
He said in the OP his boss wants to try one card, which means the company is small enough that his boss interacts with him and asks questions. I still stand by my words: use the best quality/quant model that will fit on the GPU and allow for context. It's true that I don't actually know corporate stuff, though, as those models, like the Cohere one you mentioned, require you to give them your personal information before you're allowed to download them.
>The more parameters (e.g. 207B vs 32B), the better
This can be misleading since lower parameter models these days can perform better than higher parameter models in the past. All things being equal, though, yeah.
>>4464
Without being some machine learning scientist I don't know how to explain it other than it's treated as extremely wasteful, like a placebo. Maybe it's like an uncompressed PNG versus a compressed PNG or something? There's probably some value out there for people to have 40MB PNGs on their drive instead of 3MB ones, but I don't know what it is. They're still not lossy JPEGs.
No.4468
>>4463
>prompts come sequentially and infrequently
Forgot to expand on why this is significant: a single large context window is fine for an individual, but this doesn't scale when you increase concurrent sessions. Response latency increases because the memory requirements for inference go up with each additional prompt.
This article goes more into depth about the real-world impacts:
https://research.trychroma.com/context-rot
This article gives an idea of how you would calculate your raw overall inference capacity based on model and selected hardware:
https://blogs.vmware.com/cloud-foundation/2024/09/25/llm-inference-sizing-and-performance-guidance/
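To put rough numbers on it, the KV cache is the part that grows with every extra concurrent session. A back-of-envelope sketch (the architecture numbers are what I believe Llama-3-70B uses -- 80 layers, 8 KV heads, head dim 128 -- so treat them as illustrative):

    # KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes-per-element.
    layers, kv_heads, head_dim, bytes_per_elem = 80, 8, 128, 2  # fp16 cache, Llama-3-70B-ish

    def kv_cache_gb(context_tokens: int, sessions: int) -> float:
        per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
        return per_token * context_tokens * sessions / 1e9

    print(kv_cache_gb(8_192, 1))   # ~2.7 GB for one 8K-token session
    print(kv_cache_gb(8_192, 10))  # ~27 GB once ten people hold 8K contexts at the same time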
No.4469
>>4467
>you called me misleading and uninformed
I don't know which posts are yours, but, again, the discussion revolved around hobbyist knowledge, which isn't particularly applicable.
"Finetunes", for example, really are not something you would ever see in a corporate environment (maybe outside of a dedicated AI company, like Anthropic, OpenAI, xAI, IBM, Cohere, etc.). Most finetunes I have seen in the wild are basically tacitly advertised to allow ERP or harmful topics. I simply cannot imagine any serious company willingly using a "Llama Uncensored" finetune. A serious finetune, like Orca, by Microsoft, or Nemotron, by Nvidia, is another matter entirely, but these almost never describe themselves as a finetune -- they're "Orca", or they're "Nemotron" first and foremost, and if someone asks, they're derived from Meta Llama. Furthermore, a typical company is extremely unlikely to produce their own finetune either. Much more likely they just prepend their own proprietary information to the prompt and then use a secondary safety model for scanning user prompts (think like Microsoft Azure's GPT4 API which scans prompts for violence, harmful topics, sexual content, etc -- a smaller safety model prevents wasting tokens and compute on the primary large model). [Fill-in-the-blank Company] isn't renting 100x H200's to make a finetune of Llama 3.2 so that it has encyclopedic knowledge of their product catalog for their company website online chatbot assistant -- they're just going to drop in a ten thousand word, high-level product catalog summary at the beginning of the prompt.
An 8B, or even an older 70B model, is very doubtful to be capable of competent RAG. The first article I posted talks about what they call "context rot" and has a good graphic that gets the idea across. For a smaller model (<100B parameters), you would be looking at the rightmost graphs. If you've ever used one of the "first generation" models that started claiming longer context windows, like Phi3 (3.8B, 14B) or Gemma3 (1B, 4B, 12B, 27B) that claimed a context window of 128K tokens, they're not at all truly capable of that! Like their previous iterations, they really only have a practical context window of 8K tokens. This sort of distinction is important for RAG because users are liable (if not likely) to provide "fuzzy" inputs that don't resemble the data they're looking for.
Suggesting, or at least following OP's lead in talking about, hobbyist models without providing thoughtful alternatives is misleading. Failing to mention how those hobbyist models really aren't capable of what OP is concerned with (RAG) is uninformed.
No.4471
I didn't specify the number of people, but the application is for a company that would probably at most see a 3-person queue for prompts. And you'd feed various real-world conditions/live data into the system and have it use that information to make analyses.
Since you seem to be quite versed in how to apply this in a professional environment, what do you think the likelihood is that a company will ever see a positive return on investment for the capital expenses put into acquiring the GPU hardware to self-host?
Even in the case where you use it as an API service I feel skeptical that it's even worth it.
No.4472
I'm reading through what you've typed, and the more I read, the more my assumption of using two GPUs to run a 70B LLaMA seems less probable than needing to go past those parameters. Because the people who use it won't be very understanding of the limitations of chatbots and will start blaming me that the AI didn't understand their vague query about niche information.
No.4473
>>4471
To be entirely honest, the financial case for AI in these sorts of scenarios is almost entirely headcount reduction: take a given team, off-load certain low-level tasks to an AI, and then reduce the number of entry-level positions you need to fill while retaining experts for oversight roles and babysitting the AI so it's not hallucinating or making improper suggestions/conclusions.
Whether AI makes a positive financial contribution, despite high capital costs, is basically analogous to supermarkets shedding cashier positions and putting in self-checkout. Sure, the company may have a high upfront cost and incur greater shrink (item theft) due to less human oversight, but on net they're benefiting from no longer having to hire 10-20 cashiers. I think for the positions where AI can be used, it will be seen and implemented similarly. Hiring is currently one high-profile example where AI systems have basically entirely eliminated traditional review where someone has to sit down and evaluate resumes. Sure, the AI system may make mistakes, but running the AI should be a much lower expense. Claude 4 Opus is $15 per million input tokens, for instance. If you're currently paying a salaried position $45K for evaluating resumes, conducting interviews, etc., and you can replace that position with an LLM, you're making out like a bandit. A typical resume is maybe 600 words. 1 word is typically about 1.5 tokens, so 900 tokens. For $15, Claude can evaluate over 1,000 resume submissions. The AI doesn't need to sleep. It doesn't demand overtime. It doesn't take time off. The remaining work can get passed along to someone higher and more experienced to conduct interviews and see if the person meshes well with company values and culture and whether they're truly knowledgeable about the position they applied for. So, hypothetically, maybe we take our hiring team down from 10 people to 3. Those 7 who get fired were the ones sending emails, scheduling interview times, reading resumes, etc., but now we just need someone to conduct an interview and get employment contracts sorted. The company itself is likely going to shift some additional work onto the remaining employees while simultaneously keeping them at the same pay grade or offering marginal increases.
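Spelling out that resume arithmetic (input-token pricing only; the model's own output tokens would add a little on top):

    # ~600 words per resume, ~1.5 tokens per word, $15 per million input tokens.
    tokens_per_resume = 600 * 1.5             # ~900 tokens
    resumes_per_million_tokens = 1_000_000 / tokens_per_resume
    print(round(resumes_per_million_tokens))  # ~1,111 resumes evaluated for $15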
For repetitive, bureaucratic mid- to entry-level administrative tasks it's likely going to be a bloodbath over the next 5-10 years. Some companies will integrate it successfully; many will likely use it the same way consulting companies like McKinsey come in and give a legal rationale for firing significant numbers of workers who are salaried and can't just be fired without cause. Others will lazily implement it (re: Blockchain Coffee Company), calling anything even remotely automated "AI", and change minimally. Those who do not respond to change will likely be at a significant financial disadvantage -- not because AI is good at what it will be used for, but because the expenditure reductions it enables will allow the companies that use it successfully to undercut the competition on pricing. In the short-term, companies may genuinely be able to offer lower prices, and this will put pressure on the competition to do the same or go bust, and then prices will return to their previous higher levels, much in the way Uber and Lyft basically eradicated cab companies.
All of that being said, will those capital expenses recoup themselves in the near term? For hardware? It really depends. The issue is hardware depreciates, software increasingly has higher requirements, and you need to get on the upgrade treadmill. That's a big expense for small- to mid-level companies and will likely never make financial sense for anyone but very large companies who have thousands of employees. API usage would be a much better fit because you don't have to concern yourself with hardware capital costs and you can just continually move from one model to the next without worrying about speed or hardware requirements changing. The issue is, of course, the second you start feeding your data to any given API, you start becoming part of their training data and open yourself up to risk that way. If I had to guess, the same way FIPS hardware exists, there will eventually be a market for FIPS-certified LLM APIs; higher costs to keep your info secure and out of public training data. If you're not handling sensitive user data, like processing customer queries, you probably opt for the cheaper tier where your data gets scraped. For intra-company sensitive info, that gets handed to the FIPS LLM API.
This is, of course, all speculation, but it's where I see things headed.
No.4474
>>4469
How petty are you that my offhand "I think nvidia did something" post is worthy of devoting time to "correcting" it by stating the finetune isn't a finetune because it's not labeled a finetune? I even said "I don't know the rankings for this stuff when it comes to coding or corporate use in general". Obviously I wasn't advising him to use ERP models for business.
Good lord, tech people are aggravating. OP get chart guy to help you. I'm going to go spend some time with plants and bugs.
No.4475
>>4474
I mean... It's a technical discussion. It's fine to be accurate but not necessarily precise when it's a low-complexity topic, but precision is important when the contours of discussion are asking how or whether someone should do something. If someone asks how to paint their car, both of these suffice: "buy some rattle cans and spray it on your car" and "sand and use body filler until the surface feels uniform, apply primer and sand again as necessary, and then spray on a coat, waiting between coats, and then apply a clear coat to seal and protect the paint." The latter gives enough information that a layman should understand everything expected of them, but between experts the former is fine.
Omission is by its nature misleading. Is that to say not including every last detail is malicious? No, I never said that. I only just posted today because this is a topic I'm fairly well-versed in and interested in -- I didn't even know which posts were yours, nor was I intentionally trying to single you out. It was fairly presumptuous to come along and declare "you called me misleading" when no offense was meant. I was addressing areas I felt were lacking and adding context where appropriate. If that's not actually appropriate for this board's culture, I'll restrain myself in the future...
No.4476
>>4473
Yeah, I can see what you're saying about it being used not for any sort of "consultant" role, but for menial eye-plus-criteria activity. So long as the government and loans keep pouring money into the companies involved.
Companies like OpenAI and so forth are getting funded hundreds of billions of dollars to be able to do this. When that funding dries up, are the costs going to stay the same, or will we see prices jump from the modest $10/mo of GitHub Copilot to a more significant $25/mo?
Also there are people, like the company in question from the OP, who think that they can stay ahead of the game by using AI as an information consultant.
Anyway, thanks for the info in general, and to everyone else too. Now I know a bit more about how I might be able to test a local LLM in such a way that it could translate to a professional environment.