GenAI Models on your PC using Ollama
Run Large Language Models directly on your Windows, Mac, or Linux system
Index
1. What are inference engines?
2. How do Inference Engines work?
3. What is Ollama?
4. Running LLMs Locally
5. Summary
6. Important links for References
What are inference engines?
Inference engines run Large Language Models. They process input text (such as a user's query), work out the context behind the input, and generate a relevant response.
How do Inference Engines work?
In the classical sense, an inference engine is a system that applies logical rules to a knowledge base to derive new facts or answers.
For example:
Rule: If something is a fish, then it lives underwater.
Fact: Nemo is a fish.
Conclusion (inferred): Nemo lives underwater.
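In code, that rule-application step looks roughly like the minimal sketch below (a made-up one-fact knowledge base, purely illustrative and not how an LLM computes its answers):

# Minimal sketch of classical rule-based inference: chain a fact with a rule
# to derive a new fact. The knowledge base and rule are made up for illustration.
facts = {"nemo": "fish"}                 # fact: Nemo is a fish
rules = {"fish": "lives underwater"}     # rule: if X is a fish, X lives underwater

def infer(entity):
    category = facts.get(entity)
    if category in rules:
        return f"{entity} {rules[category]}"
    return None

print(infer("nemo"))   # -> nemo lives underwater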
For a large language model, the steps in the working of an inference engine are (a code sketch follows this list):
- Tokenization: The input text is broken into smaller units, typically words or subwords.
- Embedding: Each token is assigned a numerical representation, called an embedding, that captures its semantic meaning.
- Contextual Understanding: The inference engine considers the sequence of tokens and their embeddings to understand the context of the input query.
- Prediction: Based on the context and the model’s learned patterns, the inference engine predicts the most likely next token or sequence of tokens.
- Generation: The predicted tokens are combined to form the output text, which could be a continuation of the input, a translation, or a summary.
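As a rough illustration of these steps in code, here is a minimal sketch using the Hugging Face transformers library with GPT-2 as a stand-in model (an assumption for illustration only, not part of Ollama; Ollama runs an equivalent tokenize-predict-generate loop internally for the models it serves):

# Illustrative tokenize -> predict -> generate loop with a small open model (GPT-2).
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Running large language models locally means"
inputs = tokenizer(prompt, return_tensors="pt")    # Tokenization: text -> token IDs

# Prediction and generation: the model repeatedly predicts the most likely next
# token; embedding lookup and contextual attention happen inside the forward pass.
output_ids = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))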
What is Ollama?
Ollama is an open-source platform that lets you run open-source large language models locally on Windows, Linux, or macOS.
Running LLMs Locally
Step 1: Download Ollama
You can visit the Ollama website (see the References section below) to download the package installer for Windows, Linux, or macOS.
To install from the Linux command line, run the following command:
curl -fsSL https://ollama.com/install.sh | sh
The installation completes in a couple of minutes, depending on your internet speed. Once it is installed, we will load the model of our choice. To check that Ollama is installed correctly, open a terminal (Command Prompt on Windows) and type the command below; you should see Ollama's usage/help output.
ollama
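You can also check programmatically that the local Ollama server is up. Below is a minimal sketch in Python, assuming Ollama is listening on its default port 11434 and using its REST API's /api/tags endpoint, which lists locally downloaded models:

# Check that the local Ollama server is reachable and list downloaded models.
import requests

response = requests.get("http://localhost:11434/api/tags", timeout=5)
response.raise_for_status()

models = [m["name"] for m in response.json().get("models", [])]
print("Ollama is running. Local models:", models or "none yet")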
Step 2: Select the Model to run via Ollama
Please visit the model library hosted on the Ollama website. For each model, you can see details such as the model name, model family, size, and the quantization used. For my personal use, I will be running the Gemma 2B model from the Gemma family offered by Google DeepMind.
Step 3: Pull the Model
Pull the model using the following command →
ollama pull gemma:2b
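If you prefer to script this step, the same pull can be issued through Ollama's REST API. A minimal sketch, assuming the default port 11434 and the /api/pull endpoint:

# Pull a model through the local Ollama server's REST API instead of the CLI.
import requests

payload = {"name": "gemma:2b", "stream": False}   # stream=False: wait for completion
response = requests.post("http://localhost:11434/api/pull", json=payload)
response.raise_for_status()

print(response.json())   # e.g. {"status": "success"} once the download finishes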
Step 4: Run the Model
You can see the downloaded model using the command → ollama list
To run it, type the command → ollama run gemma:2b
This starts an interactive command-line interface, a Read-Evaluate-Print Loop (REPL) powered by Ollama, where I can send messages to the LLM and get responses entirely locally. Commands used:
ollama list
ollama run gemma:2b
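Alongside the REPL, the running Ollama server also answers prompts over its REST API, which is handy for scripting. A minimal sketch, assuming the default port 11434, the /api/generate endpoint, and the gemma:2b model pulled above:

# Send a one-off prompt to the local Ollama server and print the reply.
import requests

payload = {
    "model": "gemma:2b",
    "prompt": "Explain in one sentence what an inference engine does.",
    "stream": False,    # return a single JSON object instead of a token stream
}
response = requests.post("http://localhost:11434/api/generate", json=payload)
response.raise_for_status()

print(response.json()["response"])   # the model's generated answer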
To exit the interface, type the command → /bye
To remove the LLM from your local environment, use the commands below (running ollama list before and after confirms the removal):
ollama list
ollama rm gemma:2b
ollama list
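The cleanup can also be done over the REST API. A minimal sketch, assuming the default port 11434 and the /api/delete endpoint:

# Delete a locally downloaded model via the Ollama REST API.
import requests

response = requests.delete("http://localhost:11434/api/delete", json={"name": "gemma:2b"})
response.raise_for_status()
print("gemma:2b removed.")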
Summary
- Inference Engines: These are the brains behind LLMs. They process user input, understand context, and generate responses.
- How Inference Engines Work: They break down input text, analyze meaning, predict responses, and generate output.
- Ollama: This open-source platform allows you to run LLMs on your own computer (Windows, Linux, macOS).
- Running LLMs Locally with Ollama:
  - Download Ollama from the website.
  - Choose a model from the Ollama Model Library.
  - Pull the model using the ollama pull command.
  - Run the model using the ollama run command.
  - Interact with the LLM using the REPL interface.
  - Exit the REPL with /bye.
  - Remove the model with ollama rm.
Important links for References:
- Ollama website → https://ollama.com
- Linux manual installation → docs/linux.md in the ollama/ollama GitHub repository
- Hugging Face Open LLM Leaderboard → the open-llm-leaderboard Space on Hugging Face