No.4357
AI is just so batshit retarded that it fails simple probability questions because it gets corrupted by your first prompt.
WRONG WTF ARE YOU THINKING ??
https://chatgpt.com/share/687b9007-fcc0-8002-8369-59e31311b36f
WOW SO YOU CAN DO IT??
https://chatgpt.com/share/687b9055-a368-8002-9736-4c81a3529e6f
No.4358
>>4357
Oh, I understand now... It's giving me the odds that it occurs exactly once because it's been instructed to think in absolute numbers, not human language.
No.4359
>>4357
I'm glad I took all those boring ass corpus linguistics classes in university, they actually help me understand how LLMs work.
No.4362
>>4357
nothing about this is simple, nerd
No.4367
>>4365
I didn't consider that the models eclipse the capacity of a single server GPU.
Apparently not even Grok believes what Brave's AI search told him. Guess I will post this prompt if it comes up.
It basically leaves up llama as the only option.
Also very funny, I run into Grok's message limit after a couple of prompts
No.4378
>>4368
People really, REALLY shouldn't use AI as a source of information. That data there is retarded.
https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/tree/main
An 8B parameter model for end use is like 7GB at the absolute lowest levels of efficiency after quantization. For common usage they're like 4GB. Some companies are pursuing them (or 3B models) so they can fit them on smartphones or similar portable tech. You can fit them on consumer GPUs from a decade ago.
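If you want to sanity-check sizes yourself, the rough rule is just parameters times bits-per-weight divided by 8. A quick Python sketch (the bits-per-weight figures are approximations and real GGUF files carry a bit of overhead):

    # Rough rule of thumb: weights size ≈ parameters * bits-per-weight / 8.
    # The bits-per-weight values below are approximate; actual GGUF files add overhead.
    def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    for label, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
        print(f"8B model at {label}: ~{approx_size_gb(8, bits):.1f} GB")
    # Prints roughly 16, 8.5, and 4.8 GB, which lines up with the file sizes in that repo.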
No.4379
>>4378
Are you sure you're not misunderstanding what it's saying? It's talking about FP16.
No.4380
gehh, I have no idea how to get any of this set up
https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
Is this even the release version? How do I interface with it... so many questions...
No.4381
>>4379
Yeah, I was in the middle of typing more:
fp32 or 16 - used by people with expensive hardware making finetunes. You never touch this stuff as a user. This is an example of AI lacking common sense and making erroneous conclusions since no one would ever 'host' these.
That level of 'precision' is entirely a placebo for end-use. Unfortunately I lack the hardware to do text model merging (hundreds of GB of VRAM), but for image gen there's a crapload of junk data and people only share fp8 and below models online.
No.4382
>>4381
Hosting it is exactly what I have to find out how to do.
No.4383
>>4381
>people only share fp8 and below models online
No wait, fp8 is the default and sometimes people share the fp16, but it's very much an optional thing on the side.
No.4384
>>4383
Alright... I'll just ask you the question...
What VRAM would I need to host Mixtral 8x7B at F32 and F16
No.4385
Presumably I can just follow instructions and get these two options but I'm not sure what sort of requirements are needed
https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
GPT pretty consistently says 90GB of VRAM, but since it's a mix of experts there are ways to have it work in 80GB of memory.
No.4386
also it's just a base so you have to train it into a more specific model for your use case or something
No.4387
>>4384
That's an older model that's been superseded by something better, likely in every regard, but you can look at the sizes here:
https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF/tree/main
You'd want to fit the highest quant you can for quality, although the Q8 one is usually seen as well past the point of diminishing returns.
There are new models like QWEN3 that do the 'mixture of experts' thing (very MOE), where it's not all active at once, and despite it being like 72GB I can load it on my 5090. I don't know the science behind it:
https://huggingface.co/unsloth/Qwen3-235B-A22B-GGUF/tree/main/UD-Q2_K_XL
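If it helps, this is roughly how I grab a single quant file out of one of those repos instead of cloning the whole thing (a sketch using the huggingface_hub package; the filename is just an example pattern, check the repo's file list for the real names):

    # Download one specific quant from a GGUF repo rather than every file in it.
    # The filename below is an example pattern -- check the repo's file list.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="TheBloke/Mixtral-8x7B-v0.1-GGUF",
        filename="mixtral-8x7b-v0.1.Q4_K_M.gguf",  # pick the biggest quant that fits your VRAM
    )
    print(path)  # local cache path you can point a loader at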
No.4388
Also the QWEN people JUST came out with a new QWEN3 and it hasn't been quantized by anyone yet.
No.4389
>>4387
Oh wait, you specifically said F32 and F16. Sorry, I'm tired.
I don't know. Those sizes are massive. Maybe 3 A100s? Again, those are the parameters used by people doing finetunes; they've got people with machine learning papers doing stuff.
The corporate "I want an AI model that tells my employees to do [this] when asked" use case is a prompting solution, not a training one.
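For what it's worth, a weights-only back-of-envelope (a rough sketch; I'm assuming Mixtral 8x7B is around 46-47B total parameters since the experts share the attention layers, and this ignores KV cache and other runtime overhead):

    # Weights-only estimate: parameters * bytes-per-parameter.
    # Assumes ~46.7B total parameters for Mixtral 8x7B; runtime overhead comes on top.
    params = 46.7e9
    for label, bytes_per_param in [("FP32", 4), ("FP16", 2)]:
        print(f"{label}: ~{params * bytes_per_param / 1e9:.0f} GB just for the weights")
    # FP32: ~187 GB, FP16: ~93 GB -- multiple 80GB cards either way.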
No.4390
>>4387
So this qwen thing is a better version of mixtral that works on the same system.
And the 70B of llama 3 is the sum of these file sizes?
https://huggingface.co/meta-llama/Llama-3.1-70B/tree/main
I have no idea how you found that convenient list of files.
No.4392
So something like two 80GB GPUs would support all of llama3 70B, and QWEN at max would be similar, but as a mix of experts only two might run at the same time, allowing for slower speed but better memory use.
No.4393
>>4390
It goes like this:
Company releases their weights or whatever you call them. It's that list of
File01 of 40
File02 of 40
that are uploaded on an official account on huggingface. I think you probably could load them directly somehow, but I don't know how and I wouldn't have the hardware to do so.
Then a hobbyist comes around and, uh, quantizes them or something into a few files that are made to be loaded on consumer software like llama.cpp. It ends up in a format like GGUF, which again is something I don't understand, but I know it means that it's efficient and designed to be loaded on something outside of a supercomputer.
This is what my local folder looks like.
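Loading one of those GGUFs is basically this (a minimal sketch with the llama-cpp-python bindings; the filename is just whatever quant you ended up downloading):

    # Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # whichever GGUF you grabbed
        n_ctx=8192,       # context window to reserve memory for
        n_gpu_layers=-1,  # offload every layer to the GPU; lower it if you run out of VRAM
    )
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
        max_tokens=64,
    )
    print(out["choices"][0]["message"]["content"])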
No.4394
>>4393
Oh, which reminds me. Nemotron is a finetune of Llama3 that was sponsored by Nvidia for something and people speak very highly of it. I don't know if it's been superseded by other stuff. I also don't know the rankings for this stuff when it comes to coding or corporate use in general, I just want to roleplay personally.
No.4395
So the quantizations are easier ways to interface with it or are they reductions to make it easier to run on consumer hardware?
If I wanted to load all of LLaMA onto a server, I would download that and use their provided scripts to interface with it (or maybe some of Meta's tooling) and it would process it and output answers.
Or I could get the sloth build which might have some sort of interface to make it have a web UI.
No.4396
But either way, the unrestricted llama 3 70b is something like 130GB and QWEN3 would be similar but a mix of experts
No.4397
Also, what is a finetune vs what that file list is?
No.4398
I guess it's different from a LoRA, where they actually alter the models for a specific use case?
No.4399
>>4395
>So the quantizations are easier ways to interface with it or are they reductions to make it easier to run on consumer hardware
Both, really. It might just be that it became habit that people load a specific filetype over time and it's just that way due to momentum. I know on /g/ in the LLM general (the only decent thread on that board) there are people with enthusiast hardware that load those big ones. They're people that enjoy tinkering with this stuff for the sake of it, though.
>Or I could get the sloth build which might have some sort of interface to make it have a web UI.
This is probably the answer, but I don't know for sure. Maybe check if there's some licensing issue if you're going to be using it for profit?
>>4397
A finetune is uhh.. basically a second stage of training on a model with additional data to try and expand its capabilities in general or for more specific information. You could finetune a general model to know more about anime by feeding it tons of new data specifically about anime and make the Llama3-Anime finetune. People used to do this to try and make local models that excel at ERP. LoRAs are different in that they are also trained on the source model, but they're separate files you can choose to load at runtime. You have more control, but it comes at the cost of needing to load more stuff into memory.
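To make the LoRA part concrete, loading one looks something like this (a rough sketch assuming the transformers and peft packages; the model and adapter names are placeholders, not recommendations):

    # Sketch: a finetune would replace the base model entirely, while a LoRA adapter
    # is loaded on top of the base model at runtime. Names below are placeholders.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base_name = "meta-llama/Meta-Llama-3-8B-Instruct"   # base model
    adapter_name = "someuser/llama3-anime-lora"          # hypothetical LoRA adapter repo

    tokenizer = AutoTokenizer.from_pretrained(base_name)
    base = AutoModelForCausalLM.from_pretrained(base_name, device_map="auto")
    model = PeftModel.from_pretrained(base, adapter_name)  # extra weights in memory, more control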
No.4400
>>4399
I see... So you'd put a model such as llama fp16 onto a GPU (140GB), then I would have a 10GB finetune and then LoRAs which get read when required?
No.4401
onto a GPU array*
No.4402
>>4400
The finetune becomes the loaded model itself. You wouldn't load Llama3, you'd load Llama3-FineTune. The thing with finetunes is that, as with AI training in general, there's no guarantee it will actually improve things. It's quite possible that if you make this hypothetical Llama3-Anime finetune I mentioned, it gains knowledge of Speed Racer but loses knowledge of airplanes.
No.4404
>>4402
I see.. you're actually overwriting the training that came out of the box. Pretty risky.. or high skill... or gacha pro gambling..
No.4405
>>4404
Yeah, with the giant models there are cases where they have to restart the training for various reasons, and these are the data center things where they have 10,000 GPUs working 24/7, so it's a lot of processing thrown down the gutter.
No.4406
As another example, Llama4 was an utter failure despite throwing craploads of resources at it. It led to Zuck poaching talent from OpenAI in a last-ditch effort to save things by throwing billions of dollars around.
No.4407
>>4377
This was the event in reference. Using AI to speed up your workflow is fine, but there's a certain schadenfreude from seeing these "tech bros" get burned for thinking AI is something that it isn't. Even better when they try to interrogate it and hold it accountable. It's honestly baffling that this line of thinking goes all the way up to people who work on them, like the Google guy who got fooled into thinking LaMDA was alive.
I hope to see more of it.
No.4452
I was looking at motherboards and the newest Gigabyte boards bundle LLM software to make use of three PCIe x16/x8 slots.
But I'm wondering how I can do this on any mobo and make use of 11+12GB of RAM, and run something like LLaMA 7B FP16 (~14GB) to test out how to build an AI assistant. Then experiment with the other quantizations.
Any tips?
No.4462
>>4363
RAG is mostly for text, so no.
No.4463
The answers in this thread seem heavily influenced by hobbyist usage, and they are misleading and uninformed.
First of all:
- You didn't mention how big the company is, how many people are going to be using the LLM, nor how frequently it would be used.
- A single GPU is okay for a hobbyist because they're only a single person and prompts come sequentially and infrequently; a single GPU is likely NOT at all adequate for a company unless the same is true: single-person usage, infrequent use, low context window
- If you want RAG, you should be looking at dedicated RAG models, not hobbyist models like Mixtral, Mistral, Qwen, or Llama.
- Number of parameters, precision, and quantization DO have a significant effect. FP32 is non-existent. FP16 is approximately equal to FP8 > INT8 > INT4. Q8 > Q7 > Q6 [...] > Q2. The more parameters (e.g. 207B vs 32B), the better. The higher the precision, the better. The less aggressive the quantization (Q8 over Q4), the better.
You would likely want to look at models such as those by Cohere, such as Command R+, which is a 108B parameter model specifically designed for corporate RAG.
No.4464
>>4463
When you say FP32 is non-existent, do you mean there's no accuracy drop or that it's never used?
No.4466
>>4464
Never used. Only exists in training scenarios. Your options are basically:
FP16, FP8, INT8, and INT4. As precision goes down, compute requirements decrease and the model will run faster and have lower running requirements (lower VRAM, lower GPU utilization; lower electricity cost, higher number of prompts per dollar) -- but there WILL be a penalty in terms of capabilities (lower benchmark results, lower one-shot success).
https://zhuanlan.zhihu.com/p/18736185169
No.4467
You made me check my posts since you called me misleading and uninformed, but nothing you said was different...
He said in the OP his boss wants to try one card, which means the company is small enough that his boss interacts with him and asks questions. I still stand by my words: use the best quality/quant model that will fit on the GPU and allow for context. It's true that I don't actually know corporate stuff, though, as those models, like the Cohere one you mentioned, require you to give them your personal information before you're allowed to download them.
>The more parameters (e.g. 207B vs 32B), the better
This can be misleading since lower parameter models these days can perform better than higher parameter models in the past. All things being equal, though, yeah.
>>4464
Without being some machine learning scientist I don't know how to explain it other than it's treated as extremely wasteful, like a placebo. Maybe it's like an uncompressed PNG versus a compressed PNG or something? There's probably some value out there for people to have 40MB PNGs on their drive instead of 3MB ones, but I don't know what it is. They're still not lossy JPEGs.
No.4468
>>4463
>prompts come sequentially and infrequently
Forgot to expand on why this is significant: a single large context window is fine for an individual, but this doesn't scale when you increase concurrent sessions. Response latency increases because the memory requirements for inference go up with each additional prompt.
This article goes more into depth about the real-world impacts:
https://research.trychroma.com/context-rot
This article gives an idea of how you would calculate your raw overall inference capacity based on model and selected hardware:
https://blogs.vmware.com/cloud-foundation/2024/09/25/llm-inference-sizing-and-performance-guidance/
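To put rough numbers on it, the KV cache is the part that grows with every extra concurrent session. A back-of-envelope sketch (the architecture numbers are what I believe Llama-3-70B uses -- 80 layers, 8 KV heads, head dim 128 -- so treat them as illustrative):

    # KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes-per-element.
    layers, kv_heads, head_dim, bytes_per_elem = 80, 8, 128, 2  # fp16 cache, Llama-3-70B-ish

    def kv_cache_gb(context_tokens: int, sessions: int) -> float:
        per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
        return per_token * context_tokens * sessions / 1e9

    print(kv_cache_gb(8_192, 1))   # ~2.7 GB for one 8K-token session
    print(kv_cache_gb(8_192, 10))  # ~27 GB once ten people hold 8K contexts at the same time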
No.4469
>>4467
>you called me misleading and uninformed
I don't know which posts are yours, but, again, the discussion revolved around hobbyist knowledge, which isn't particularly applicable.
"Finetunes", for example, really are not something you would ever see in a corporate environment (maybe outside of a dedicated AI company, like Anthropic, OpenAI, xAI, IBM, Cohere, etc.). Most finetunes I have seen in the wild are basically tacitly advertised to allow ERP or harmful topics. I simply cannot imagine any serious company willingly using a "Llama Uncensored" finetune. A serious finetune, like Orca, by Microsoft, or Nemotron, by Nvidia, is another matter entirely, but these almost never describe themselves as a finetune -- they're "Orca", or they're "Nemotron" first and foremost, and if someone asks, they're derived from Meta Llama. Furthermore, a typical company is extremely unlikely to produce their own finetune either. Much more likely they just prepend their own proprietary information to the prompt and then use a secondary safety model for scanning user prompts (think like Microsoft Azure's GPT4 API which scans prompts for violence, harmful topics, sexual content, etc -- a smaller safety model prevents wasting tokens and compute on the primary large model). [Fill-in-the-blank Company] isn't renting 100x H200's to make a finetune of Llama 3.2 so that it has encyclopedic knowledge of their product catalog for their company website online chatbot assistant -- they're just going to drop in a ten thousand word, high-level product catalog summary at the beginning of the prompt.
An 8B, or even an older 70B model, is very doubtful to be capable of competent RAG. The first article I posted talks about what they call "context rot" and has a good graphic that gets the idea across. For a smaller model (<100B parameters), you would be looking at the rightmost graphs. If you've ever used one of the "first generation" models that started claiming longer context windows, like Phi3 (3.8B, 14B) or Gemma3 (1B, 4B, 12B, 27B) that claimed a context window of 128K tokens, they're not at all truly capable of that! Like their previous iterations, they really only have a practical context window of 8K tokens. This sort of distinction is important for RAG because users are liable (if not likely) to provide "fuzzy" inputs that don't resemble the data they're looking for.
Suggesting, or at least following OP's lead in talking about, hobbyist models without providing thoughtful alternatives is misleading. Failing to mention how those hobbyist models really aren't capable of what OP is concerned with (RAG) is uninformed.
No.4471
I didn't specify the number of people, but the application is for a company that would probably at most see a 3-person queue for prompts. And you'd feed various real-world conditions/live data into the system and have it use that information to make analyses.
Since you seem to be quite versed in how to apply this in a professional environment, what do you think the likelihood is that a company will ever see a positive return on investment for the capital expenses put into acquiring the GPU hardware to self-host?
Even in the case where you use it as an API service I feel skeptical that it's even worth it.
No.4472
I'm reading through what you've typed, and the more I read, the more my assumption of using two GPUs to run a 70B LLaMA seems less probable than needing to go past those parameters. Because the people who use it won't be very understanding of the limitations of chatbots and will start blaming me that the AI didn't understand their vague query about niche information.
No.4473
>>4471
To be entirely honest, the financial case for AI in these sorts of scenarios is almost entirely headcount reduction: take a given team, off-load certain low-level tasks to an AI, and then reduce the number of entry-level positions you need to fill while retaining experts for oversight roles and babysitting the AI so it's not hallucinating or making improper suggestions/conclusions.
Whether AI makes a positive financial contribution, despite high capital costs, is basically analogous to supermarkets shedding cashier positions and putting in self-checkout. Sure, the company may have a high upfront cost and incur greater shrink (item theft) due to less human oversight, but on net they're benefiting from no longer having to hire 10-20 cashiers. I think for the positions where AI can be used, it will be seen and implemented similarly. Hiring is currently one high-profile example where AI systems have basically entirely eliminated traditional review where someone has to sit down and evaluate resumes. Sure, the AI system may make mistakes, but running the AI should be a much lower expense. Claude 4 Opus is $15 per million input tokens, for instance. If you're currently paying a salaried position $45K for evaluating resumes, conducting interviews, etc., and you can replace that position with an LLM, you're making out like a bandit. A typical resume is maybe 600 words. 1 word is typically about 1.5 tokens, so 900 tokens. For $15, Claude can evaluate over 1,000 resume submissions. The AI doesn't need to sleep. It doesn't demand overtime. It doesn't take time off. The remaining work can get passed along to someone higher and more experienced to conduct interviews and see if the person meshes well with company values and culture and whether they're truly knowledgeable about the position they applied for. So, hypothetically, maybe we take our hiring team down from 10 people to 3. Those 7 who get fired were the ones sending emails, scheduling interview times, reading resumes, etc., but now we just need someone to conduct an interview and get employment contracts sorted. The company itself is likely going to shift some additional work onto the remaining employees while simultaneously keeping them at the same pay grade or offering marginal increases.
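Spelling out that resume arithmetic (input-token pricing only; the model's own output tokens would add a little on top):

    # ~600 words per resume, ~1.5 tokens per word, $15 per million input tokens.
    tokens_per_resume = 600 * 1.5             # ~900 tokens
    resumes_per_million_tokens = 1_000_000 / tokens_per_resume
    print(round(resumes_per_million_tokens))  # ~1,111 resumes evaluated for $15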
For repetitive, bureaucratic mid- to entry-level administrative tasks it's likely going to be a bloodbath over the next 5-10 years. Some companies will integrate it successfully; many will likely use it the same way consulting companies like McKinsey come in and give a legal rationale for firing significant numbers of workers who are salaried and can't just be fired without cause. Others will lazily implement it (re: Blockchain Coffee Company), calling anything even remotely automated "AI", and change minimally. Those who do not respond to change will likely be at a significant financial disadvantage -- not because AI is good at what it will be used for, but because the expenditure reductions it enables will allow the companies that use it successfully to undercut the competition on pricing. In the short-term, companies may genuinely be able to offer lower prices, and this will put pressure on the competition to do the same or go bust, and then prices will return to their previous higher levels, much in the way Uber and Lyft basically eradicated cab companies.
All of that being said, will those capital expenses recoup themselves in the near term? For hardware? It really depends. The issue is hardware depreciates, software increasingly has higher requirements, and you need to get on the upgrade treadmill. That's a big expense for small- to mid-level companies and will likely never make financial sense for anyone but very large companies who have thousands of employees. API usage would be a much better fit because you don't have to concern yourself with hardware capital costs and you can just continually move from one model to the next without worrying about speed or hardware requirements changing. The issue is, of course, the second you start feeding your data to any given API, you start becoming part of their training data and open yourself up to risk that way. If I had to guess, the same way FIPS hardware exists, there will eventually be a market for FIPS-certified LLM APIs; higher costs to keep your info secure and out of public training data. If you're not handling sensitive user data, like processing customer queries, you probably opt for the cheaper tier where your data gets scraped. For intra-company sensitive info, that gets handed to the FIPS LLM API.
This is, of course, all speculation, but it's where I see things headed.
No.4474
>>4469
How petty are you that my offhand "I think nvidia did something" post is worthy of devoting time to "correcting" it by stating the finetune isn't a finetune because it's not labeled a finetune? I even said "I don't know the rankings for this stuff when it comes to coding or corporate use in general". Obviously I wasn't advising him to use ERP models for business.
Good lord, tech people are aggravating. OP get chart guy to help you. I'm going to go spend some time with plants and bugs.
No.4475
>>4474
I mean... It's a technical discussion. It's fine to be accurate but not necessarily precise when it's a low-complexity topic, but precision is important when the contours of discussion are asking how or whether someone should do something. If someone asks how to paint their car, both of these suffice: "buy some rattle cans and spray it on your car" and "sand and use body filler until the surface feels uniform, apply primer and sand again as necessary, and then spray on a coat, waiting between coats, and then apply a clear coat to seal and protect the paint." The latter gives enough information that a layman should understand everything expected of them, but between experts the former is fine.
Omission is by its nature misleading. Is that to say not including every last detail is malicious? No, I never said that. I only just posted today because this is a topic I'm fairly well-versed in and interested in -- I didn't even know which posts were yours, nor was I intentionally trying to single you out. It was fairly presumptuous to come along and declare "you called me misleading" when no offense was meant. I was addressing areas I felt were lacking and adding context where appropriate. If that's not actually appropriate for this board's culture, I'll restrain myself in the future...
No.4476
>>4473
Yeah, I can see what you're saying about it being used not for any sort of "consultant" role, but for menial eye-plus-criteria activity. So long as the government and loans keep pouring money into the companies involved.
Companies like OpenAI and so forth are getting funded hundreds of billions of dollars to be able to do this. When that funding dries up, are the costs going to stay the same, or will we see prices jump from the modest $10/mo of GitHub Copilot to a more significant $25/mo?
Also there are people, like the company in question from the OP, who think that they can stay ahead of the game by using AI as an information consultant.
Anyway, thanks for the info in general, and to everyone else too. Now I know a bit more about how I might be able to test a local LLM in such a way that it could translate to a professional environment.