Commit Graph

24 Commits

Author SHA1 Message Date
Lucas Freire Sangoi 75120d02f3 Restore lines for DoRA TE keys fix (#2240) 2024-11-06 20:20:57 +00:00
layerdiffusion 44eb4ea837 Support T5&Clip Text Encoder LoRA from OneTrainer
requested in #1727
plus some cleanups and license notices
PS: LoRA requests must include a download URL for at least one LoRA
2024-09-08 01:39:29 -07:00
layerdiffusion a8a81d3d77 fix offline quant lora precision 2024-08-31 13:12:23 -07:00
layerdiffusion 3a9cf1f8e5 Revert partially "use safer codes" 2024-08-31 11:07:28 -07:00
layerdiffusion 70a555906a use safer codes 2024-08-31 10:55:19 -07:00
layerdiffusion 4c9380c46a Speed up quant model loading and inference ...
... based on three observations:
1. torch.Tensor.view on one big tensor is slightly faster than calling torch.Tensor.to on multiple small tensors.
2. torch.Tensor.to with a dtype change is significantly slower than torch.Tensor.view.
3. "baking" the model on GPU is significantly faster than computing on CPU at model load.

This mainly affects inference of Q8_0, Q4_0/1/K and the loading of all quants.
2024-08-30 00:49:05 -07:00
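The first two observations above can be sketched as follows. This is a minimal illustration, not the repository's actual loader, and the function names are hypothetical:

```python
import torch

# Hedged sketch of the loading trick: torch.Tensor.view(dtype) is a
# zero-copy reinterpretation of the underlying bytes, while
# torch.Tensor.to(dtype) runs a conversion kernel per tensor, so many
# small .to() calls mean many tiny kernel launches.
def reinterpret_fast(raw: torch.Tensor) -> torch.Tensor:
    # raw bytes (e.g. read from a .gguf file) viewed as fp16 in one shot
    return raw.view(torch.float16)

def convert_slow(chunks):
    # one dtype conversion per small tensor: real copies, many launches
    return [c.to(torch.float16) for c in chunks]

raw = torch.zeros(1024, dtype=torch.uint8)  # stand-in for file bytes
weights = reinterpret_fast(raw)             # 1024 bytes -> 512 fp16 values
```

The `.view(dtype)` path only changes tensor metadata, which is why batching all blocks into one big byte tensor first pays off.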
layerdiffusion 0abb6c4686 Second Attempt for #1502 2024-08-28 08:08:40 -07:00
layerdiffusion 25662974f8 try to test #1502 2024-08-27 18:42:00 -07:00
layerdiffusion acf99dd74e fix old version of pytorch 2024-08-26 06:51:48 -07:00
layerdiffusion 82dfc2b15b Significantly speed up Q4_0, Q4_1, Q4_K
by precomputing all possible 4-bit dequantization results into a lookup table and using PyTorch indexing to fetch them, rather than actually computing the bit operations.
This should give performance very close to native CUDA kernels, while being LoRA-friendly and more flexible.
2024-08-25 16:49:33 -07:00
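The lookup-table idea above can be sketched with a simplified Q4_0-style layout (each byte packs two 4-bit values, dequantized as `(nibble - 8) * scale`). This is an illustration only; the real GGUF block layout interleaves nibbles differently:

```python
import torch

# Precompute, once, the dequantized pair for every possible byte value:
# row b of the table holds (low_nibble - 8, high_nibble - 8).
b = torch.arange(256)
BYTE_LUT = torch.stack([(b % 16) - 8, (b // 16) - 8], dim=1).float()  # (256, 2)

def dequant_q4_lut(packed: torch.Tensor, scale: float) -> torch.Tensor:
    # inference-time path is pure indexing: no shifts, no masks
    return BYTE_LUT[packed.long()].flatten() * scale

packed = torch.tensor([0x18, 0x7F], dtype=torch.uint8)  # toy packed bytes
out = dequant_q4_lut(packed, 0.5)
```

Indexing a small table is a single gather kernel, which is why this approaches native-kernel speed while keeping the dequantized weights in a form that LoRA patches can be applied to.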
layerdiffusion e60bb1c96f Make Q4_K_S as fast as Q4_0
by baking the layer at model load
2024-08-25 15:02:54 -07:00
layerdiffusion 868f662eb6 fix 2024-08-25 14:44:01 -07:00
layerdiffusion 13d6f8ed90 revise GGUF by precomputing some parameters
rather than computing them in each diffusion iteration
2024-08-25 14:30:09 -07:00
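The revision above can be sketched like this; the module and buffer names are hypothetical, but the shape of the change is the point: anything constant across diffusion iterations moves to load time.

```python
import torch

# Illustrative sketch: the dequant scale is reshaped into its
# broadcast-ready form once at load, instead of inside every forward.
class GGUFLinearSketch(torch.nn.Module):
    def __init__(self, qweight: torch.Tensor, scales: torch.Tensor):
        super().__init__()
        self.register_buffer("qweight", qweight)
        # precomputed once at load time
        self.register_buffer("scale", scales.reshape(-1, 1).float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # the per-iteration path is now only dequant + matmul
        w = self.qweight.float() * self.scale
        return x @ w.t()

layer = GGUFLinearSketch(torch.randint(-8, 8, (4, 8), dtype=torch.int8),
                         torch.ones(4))
y = layer(torch.randn(2, 8))
```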
layerdiffusion 8fd889dcad fix #1336 2024-08-20 08:04:09 -07:00
layerdiffusion 4e8ba14dd0 info 2024-08-19 05:13:28 -07:00
layerdiffusion d38e560e42 Implement some rethinking about LoRA system
1. Add an option to allow users to use the UNet in fp8/gguf but the LoRA in fp16.
2. FP16 LoRAs never need patching; others are re-patched only when LoRA weights change.
3. FP8 UNet + fp16 LoRA is now available in Forge (and arguably only in Forge). This also solves some “LoRA too subtle” problems.
4. Significantly speed up all gguf models (in Async mode) by using an independent thread (CUDA stream) to compute and dequantize at the same time, even when low-bit weights are already on the GPU.
5. Treat “online lora” as a module similar to ControlLoRA so that it is moved to the GPU together with the model when sampling, achieving significant speedup and perfect low-VRAM management at the same time.
2024-08-19 04:31:59 -07:00
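Items 1–3 above amount to running the LoRA branch in full precision alongside a low-bit base path, without ever merging into the quantized weight. A hedged sketch, with hypothetical names and a toy int8 stand-in for the low-bit format:

```python
import torch

# The quantized base weight stays untouched; the LoRA delta is computed
# on the fly in float, so no re-patching is needed and the LoRA effect
# is not lost to low-bit rounding ("LoRA too subtle").
def forward_with_online_lora(x, qweight, dequant, down, up, alpha=1.0):
    w = dequant(qweight)                        # low-bit -> float base path
    y = x @ w.t()
    y = y + alpha * ((x @ down.t()) @ up.t())   # float LoRA branch, never merged
    return y

# toy shapes: out_features=6, in_features=8, rank=2
down, up = torch.randn(2, 8), torch.randn(6, 2)
wq = torch.randint(-8, 8, (6, 8), dtype=torch.int8)
y = forward_with_online_lora(torch.randn(3, 8), wq,
                             lambda q: q.float() * 0.1, down, up)
```

Because the LoRA branch is an ordinary module, it can be moved to the GPU together with the model at sampling time, which is the idea behind item 5.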
layerdiffusion e5f213c21e upload some GGUF supports 2024-08-19 01:09:50 -07:00
layerdiffusion 243952f364 wip qx_1 loras 2024-08-15 17:07:41 -07:00
layerdiffusion 1bd6cf0e0c Support LoRAs for Q8/Q5/Q4 GGUF Models
what a crazy night of math
2024-08-15 05:34:46 -07:00
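The commit above does not spell out its math, but one standard way to patch a LoRA into an already-quantized weight is dequantize, add the delta in float, then requantize. A hedged sketch using a simplified symmetric int8 scheme (not the actual GGUF quant math):

```python
import torch

def patch_quantized(qweight, scale, down, up, alpha=1.0):
    w = qweight.float() * scale                 # dequantize
    w = w + alpha * (up @ down)                 # apply LoRA delta in float
    # refit the scale to the patched weight, then requantize
    new_scale = w.abs().max().clamp(min=1e-8) / 127.0
    new_q = torch.clamp(torch.round(w / new_scale), -127, 127).to(torch.int8)
    return new_q, new_scale

q = torch.randint(-127, 127, (6, 8), dtype=torch.int8)
new_q, s = patch_quantized(q, 0.05, torch.randn(2, 8), torch.randn(6, 2))
```

For block-quantized formats like Q4/Q5/Q8 the refit happens per block rather than per tensor, which is presumably where the "crazy night of math" came in.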
layerdiffusion fd0d25ba8a fix type hints 2024-08-15 03:08:25 -07:00
layerdiffusion 2690b654fd reimplement q8/q5/q4, review, and match official gguf 2024-08-15 02:41:15 -07:00
layerdiffusion 358277e7a0 remove unused files 2024-08-15 01:47:59 -07:00
layerdiffusion 3acb50c40e integrate llama3's GGUF 2024-08-15 01:45:29 -07:00
layerdiffusion 00f1cd36bd multiple lora implementation sources 2024-08-13 07:13:32 -07:00