{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1bc7af07",
   "metadata": {},
   "source": [
    "# Loading and Sampling"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ded5760f",
   "metadata": {},
   "source": [
    "{nb-download}`Download this as a Jupyter notebook <loading_and_sampling.ipynb>`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e697210",
   "metadata": {},
   "source": [
    "This guide demonstrates how to load samples in `pyXla`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67dbd626",
   "metadata": {},
   "source": [
    "The first step in any landscape analysis with `pyXla` is loading a data sample.\n",
    "In `pyXla`, samples are specified using a regular Python dictionary."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "918f3ce7",
   "metadata": {},
   "source": [
    "The `pyXla` framework defines six types of input, outlined in the table below:\n",
    "\n",
    "|Input|Description|Required|\n",
    "|-----|-----------|--------|\n",
    "|F|File or function specifying objective values|Yes|\n",
    "|X|File or domain specifying solutions|No|\n",
    "|V|File or function specifying violation values|No|\n",
    "|D|File or function specifying distance information|No|\n",
    "|N|File or function specifying neighbourhood information|No|\n",
    "|I|File specifying additional information (work in progress)|No|\n",
    "\n",
    "`F`, `X`, `V`, `D` and `N` are keys in the Python dictionary used to define a sample. The keys `max` and `representation` can also be specified. Only `F` must be present."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c4c8f6d",
   "metadata": {},
   "source": [
    "`pyXla` lets you load a sample in three ways, each corresponding to how the value of an input key (e.g. `F`, `X`) is specified:\n",
    "1. Using file paths or URLs to delimited data files (e.g. CSV files),\n",
    "2. By supplying `pandas` `DataFrame` objects,\n",
    "3. By specifying a function (for all keys except `X`; for `X` a [Sampler](#pyxla.sampling.Sampler) object is supplied).\n",
    "\n",
    "These three methods can be mixed within a single sample."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25eb15be",
   "metadata": {},
   "source": [
    "To load a sample, import the `load_data` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "29f1a1c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyxla import load_data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6a8a9c1",
   "metadata": {},
   "source": [
    "## Defining Samples"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d895148",
   "metadata": {},
   "source": [
    "### Method 1: Load sample using file paths or URLs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f1c73b3",
   "metadata": {},
   "source": [
    "This is the simplest method."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a9b94d8",
   "metadata": {},
   "source": [
    "To define a sample using file paths or URLs, simply write a dictionary as below, supplying the path or URL for each key:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "dac86bf1",
   "metadata": {},
   "outputs": [],
   "source": [
    "sample = {\n",
    "    \"X\": \"../../../data/test_samples/cec2010_c01_2d_F1_V2/cec2010_c01_2d_F1_V2_X.csv\",\n",
    "    \"F\": \"../../../data/test_samples/cec2010_c01_2d_F1_V2/cec2010_c01_2d_F1_V2_F.csv\",\n",
    "    \"V\": \"../../../data/test_samples/cec2010_c01_2d_F1_V2/cec2010_c01_2d_F1_V2_V.csv\",\n",
    "    \"D\": \"../../../data/test_samples/cec2010_c01_2d_F1_V2/cec2010_c01_2d_F1_V2_D.csv\"\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84b99d3e",
   "metadata": {},
   "source": [
    "Once the sample is defined, use the `load_data` function to load the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "ee0f2059",
   "metadata": {},
   "outputs": [],
   "source": [
    "load_data(sample)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39c8686d",
   "metadata": {},
   "source": [
    "The `load_data` function loads the data in place, so the `sample` variable now holds a `pyXla` sample that is ready for analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d65bbc0c",
   "metadata": {},
   "source": [
    "We can check that the dataframe for the `F` input has indeed been loaded:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "ba834680",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>f1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.000046</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.041758</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-0.034587</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.004379</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-0.012440</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         f1\n",
       "0 -0.000046\n",
       "1 -0.041758\n",
       "2 -0.034587\n",
       "3 -0.004379\n",
       "4 -0.012440"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample['F'].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "120d1c10",
   "metadata": {},
   "source": [
    "### Method 2: Load by supplying `pandas` `DataFrames`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13d5abb6",
   "metadata": {},
   "source": [
    "This method is straightforward. Simply supply a dataframe for each key:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "e8b01937",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "X_df = pd.read_csv(\"../../../data/test_samples/cec2010_c01_2d_F1_V2/cec2010_c01_2d_F1_V2_X.csv\", sep=\" \")\n",
    "F_df = pd.read_csv(\"../../../data/test_samples/cec2010_c01_2d_F1_V2/cec2010_c01_2d_F1_V2_F.csv\", sep=\" \")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "2c8e3762",
   "metadata": {},
   "outputs": [],
   "source": [
    "sample = {\n",
    "    \"X\": X_df,\n",
    "    \"F\": F_df,\n",
    "}\n",
    "\n",
    "# load the sample\n",
    "load_data(sample)"
   ]
  },
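  {
   "cell_type": "markdown",
   "id": "b7d2e4a1",
   "metadata": {},
   "source": [
    "The dataframes do not have to come from files. As a minimal sketch, the snippet below builds `X` and `F` in memory with `numpy` and `pandas`. The column names (`x1`, `x2`, `f1`), the sample size and the sphere objective are illustrative assumptions, not requirements documented by `pyXla`.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical in-memory sample: 50 random 2-D points in [-5, 5]^2\n",
    "rng = np.random.default_rng(seed=0)\n",
    "X_mem = pd.DataFrame(rng.uniform(-5, 5, size=(50, 2)), columns=[\"x1\", \"x2\"])\n",
    "\n",
    "# One objective value per solution (sphere function), stored as a one-column dataframe\n",
    "F_mem = pd.DataFrame({\"f1\": (X_mem**2).sum(axis=1)})\n",
    "\n",
    "# Sample definition in the same dictionary form as above\n",
    "sample_mem = {\"X\": X_mem, \"F\": F_mem}\n",
    "```\n",
    "\n",
    "The resulting dictionary can then be passed to `load_data` in the same way as the file-based dataframes."
   ]
  },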
  {
   "cell_type": "markdown",
   "id": "096b4c20",
   "metadata": {},
   "source": [
    "### Method 3: Loading via domain and function specification"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49fe0666",
   "metadata": {},
   "source": [
    "This method allows you to specify a sample *declaratively*. For the `X` key, one specifies the domain of the search space along with the sampling strategy via a [Sampler](#pyxla.sampling.Sampler) object.\n",
    "\n",
    "Currently `pyXla` supports 3 continuous sampling techniques:\n",
    "1. [Random walk sampling](#pyxla.sampling.random_walk_sampling),\n",
    "1. [Adaptive walk sampling](#pyxla.sampling.adaptive_walk_sampling_continuous), and\n",
    "1. [Hilbert curve sampling](#pyxla.sampling.hilbert_curve_sampling)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "994c4030",
   "metadata": {},
   "source": [
    "To use random walk sampling, import the `RandomWalkSampler` class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "31d2d9ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyxla.sampling import RandomWalkSampler"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82ca7c74",
   "metadata": {},
   "source": [
    "Instantiate the sampler:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "987d9866",
   "metadata": {},
   "outputs": [],
   "source": [
    "sampler = RandomWalkSampler(\n",
    "    sample_size=100, step_size=1, dim=2, return_neighbourhood=True\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4be9cb9d",
   "metadata": {},
   "source": [
    "For the remaining input keys, Python functions are supplied. Thus, we can define and load a sample as:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "d41c745d",
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "sample = {\n",
    "    \"X\": sampler,\n",
    "    \"F\": lambda x: (x**2).sum(),\n",
    "    \"V\": lambda x: ((x**2).sum() - 3),\n",
    "    \"D\": lambda x1, x2: ((x1 - x2)**2).sum(), # must take a pair of arguments and return a real number\n",
    "    \"N\": lambda x1, x2: random.choice([True, False]) # must take a pair of arguments and return a bool\n",
    "}\n",
    "\n",
    "# load the sample\n",
    "load_data(sample)"
   ]
  },
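  {
   "cell_type": "markdown",
   "id": "c3e9f6b2",
   "metadata": {},
   "source": [
    "The lambdas above can equally be written as named functions, which makes the expected signatures easier to see. The sketch below assumes each solution is passed as a `numpy` array (or another object supporting `**` and `.sum()`). The function names and the `radius` parameter are hypothetical, and the distance-threshold neighbourhood is just one deterministic alternative to the random choice used above.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def sphere(x):\n",
    "    # Objective: takes one solution, returns a real number\n",
    "    return float((x**2).sum())\n",
    "\n",
    "def violation(x):\n",
    "    # Violation: takes one solution, returns a real number\n",
    "    return float((x**2).sum() - 3)\n",
    "\n",
    "def squared_distance(x1, x2):\n",
    "    # Distance: takes a pair of solutions, returns a real number\n",
    "    return float(((x1 - x2)**2).sum())\n",
    "\n",
    "def is_neighbour(x1, x2, radius=1.0):\n",
    "    # Neighbourhood: takes a pair of solutions, returns a bool\n",
    "    return bool(squared_distance(x1, x2) <= radius**2)\n",
    "\n",
    "# Quick check on two hypothetical solutions\n",
    "a, b = np.array([0.0, 0.0]), np.array([0.5, 0.5])\n",
    "close = is_neighbour(a, b)\n",
    "```\n",
    "\n",
    "These functions would then replace the corresponding lambdas in the sample dictionary, for example `\"F\": sphere` and `\"N\": is_neighbour`."
   ]
  },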
  {
   "cell_type": "markdown",
   "id": "67d0d029",
   "metadata": {},
   "source": [
    "We can confirm that, for instance, the `N` input has been loaded:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "edf63fb9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id1</th>\n",
       "      <th>id2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id1  id2\n",
       "0    0    3\n",
       "1    0    4\n",
       "2    0    6\n",
       "3    0    7\n",
       "4    0    8"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample['N'].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "308310c8",
   "metadata": {},
   "source": [
    "### All the methods can be mixed"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7466b5ee",
   "metadata": {},
   "source": [
    "We can supply values to the input keys using file paths, URLs, dataframes and functions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "92cc807b",
   "metadata": {},
   "outputs": [],
   "source": [
    "sample = {\n",
    "    \"X\": X_df, # dataframe\n",
    "    \"F\": [lambda x: (x**2).sum(), lambda x: x.sum()], # multiple objectives\n",
    "    \"max\": [True, False],\n",
    "    \"V\": lambda x: ((x**2).sum() - 3),\n",
    "    \"D\": \"../../../data/test_samples/cec2010_c01_2d_F1_V2/cec2010_c01_2d_F1_V2_D.csv\", # file path\n",
    "    \"N\": lambda x1, x2: random.choice([True, False])\n",
    "}\n",
    "\n",
    "# load the sample\n",
    "load_data(sample)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebb223b5",
   "metadata": {},
   "source": [
    "## The `max` and `representation` Keys"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aafe3ab7",
   "metadata": {},
   "source": [
    "### The `max` key"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19a2988f",
   "metadata": {},
   "source": [
    "The `max` key is used to specify whether the objective(s) should be maximised or not."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a67440b9",
   "metadata": {},
   "source": [
    "If the `max` key is not supplied, `pyXla` assumes minimisation by default. To specify that the objective should be maximised, simply supply:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "79284e01",
   "metadata": {},
   "outputs": [],
   "source": [
    "sample = {\n",
    "    #...\n",
    "    \"max\": True\n",
    "    #...\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db66f1f7",
   "metadata": {},
   "source": [
    "If there are multiple objectives, you can specify a boolean value for each, or supply a single boolean value if all objectives are to be treated the same way:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "85a3ab1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "sample = {\n",
    "    #...\n",
    "    \"max\": [True, False]\n",
    "    #...\n",
    "}"
   ]
  },
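  {
   "cell_type": "markdown",
   "id": "d5a1c8e3",
   "metadata": {},
   "source": [
    "To make the accepted forms of `max` concrete, the hypothetical helper below (not part of `pyXla`) expands a `max` value into one boolean per objective, following the behaviour described above. The length check is an assumption about sensible behaviour rather than documented `pyXla` behaviour.\n",
    "\n",
    "```python\n",
    "def expand_max(max_spec, n_objectives):\n",
    "    # No `max` key supplied: minimise every objective\n",
    "    if max_spec is None:\n",
    "        return [False] * n_objectives\n",
    "    # A single boolean applies to all objectives\n",
    "    if isinstance(max_spec, bool):\n",
    "        return [max_spec] * n_objectives\n",
    "    # Otherwise expect one boolean per objective\n",
    "    flags = list(max_spec)\n",
    "    if len(flags) != n_objectives:\n",
    "        raise ValueError(\"expected one boolean per objective\")\n",
    "    return flags\n",
    "\n",
    "# Two objectives: maximise the first, minimise the second\n",
    "flags = expand_max([True, False], 2)\n",
    "```"
   ]
  },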
  {
   "cell_type": "markdown",
   "id": "f28bd067",
   "metadata": {},
   "source": [
    "### The `representation` key"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00254613",
   "metadata": {},
   "source": [
    "The `representation` key is (optionally) used to specify whether the sample is from a discrete or continuous domain.\n",
    "\n",
    "It accepts one of the following strings: `discrete`, `binary` or `continuous`. If specified, this allows `pyXla` to use sensible defaults when computing information such as distance. For instance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "1a8f93b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "sample = {\n",
    "    #...sample has no D input\n",
    "    \"representation\": \"continuous\"\n",
    "    #...\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2751fcce",
   "metadata": {},
   "source": [
    "Suppose that the sample above has no `D` input, i.e. it lacks distance information. If the `representation` key is specified (here as `continuous`), `pyXla` will use Euclidean distance by default to compute distance information should a feature that relies on distance be invoked."
   ]
  }
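  ,
  {
   "cell_type": "markdown",
   "id": "e8f4b9a6",
   "metadata": {},
   "source": [
    "For reference, Euclidean distance between every pair of solutions can be computed as in the `numpy` sketch below. This only illustrates the metric on a hypothetical three-point sample; it is not `pyXla`'s implementation, and the format in which `pyXla` stores distance information is not shown here.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Hypothetical sample of three 2-D solutions\n",
    "X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])\n",
    "\n",
    "# Pairwise differences via broadcasting: shape (3, 3, 2)\n",
    "diff = X[:, None, :] - X[None, :, :]\n",
    "\n",
    "# Euclidean distance matrix: shape (3, 3), zeros on the diagonal\n",
    "D = np.sqrt((diff**2).sum(axis=-1))\n",
    "```"
   ]
  }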
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.14.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}