• ML-Agents (V): GridWorld
    • Visual Observations
    • Masking Discrete Actions
    • Environment and training parameters
    • Basic structure of the scene
    • Code analysis
      • Environment initialization code
      • Agent script
        • Initialization and reset
        • Action mask
        • Agent action feedback
        • FixedUpdate()
        • Manual operation code
      • About GridSettings
    • About others
    • Training model
      • Generalization parameter configuration
      • Start training
    • Summary

ML-Agents (V): GridWorld

GridWorld is an interesting example. It is still trained with reinforcement learning; the difference is that the agent learns from visual observations.


As shown in the figure above, the agent is the blue square. It moves one cell at a time (up, down, left, or right) and must reach the green plus (the goal) without touching the red cross.

Visual Observations

Let's first look at what visual observations are. ML-Agents offers two ways to provide visual observations to an agent: CameraSensor and RenderTextureSensor. The images collected by these components are fed into the CNN (convolutional neural network) of the agent's policy, which lets the agent learn from patterns in the observed images. An agent can use visual observations and vector observations at the same time.

Visual observations let the agent capture arbitrarily complex state and are useful when the state is hard to describe with numbers. However, compared with vector observations, training on visual observations is usually less efficient and slower, and sometimes fails entirely. Therefore, use visual observations only when vector observations or ray cast observations (ray-based observations, to be covered later) cannot solve the problem.

Visual observations can come from a Camera or a RenderTexture in the scene. To add one to an agent, attach a Camera Sensor Component or a Render Texture Sensor Component to the agent, then drag the Camera or RenderTexture into the corresponding field (as shown below). An agent can have more than one camera or render texture sensor, and the two kinds can be combined. For each visual observation component you set the image width and height (in pixels) and whether the observation is color or grayscale.


Agents that use the same policy must have the same number of visual observations, and those observations must have the same resolution (including the grayscale setting). In addition, each sensor component on an agent must have its own Sensor Name so that the sensors can be sorted deterministically (names must be unique within one agent, but different agents can have sensor components with the same name).

When using a Render Texture Sensor Component, you can debug with a Canvas: create an object with a Raw Image component and set the RenderTexture as the Raw Image's Texture, as in the following figure.


The GridWorld example shows how to use a RenderTexture for both debugging and observation. Note that in this example a Camera renders into the RenderTexture, which is then used for observation and debugging. To keep the RenderTexture up to date, the Camera must be asked to render every time the agent makes a decision. When a Camera is used directly as the observation, the agent does this automatically.

  • Visual observation summary & best practices
    • To collect visual observations, add a CameraSensor or RenderTextureSensor component to the agent's GameObject
    • Visual observations should generally be used only when vector observations are not sufficient
    • The image should be as small as possible without losing the details required for the decision
    • Images should be grayscale when color information is not needed for the decision
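To get a feeling for why image size and the grayscale setting matter, here is a small Python sketch (not ML-Agents code) that counts how many numbers the network receives for one visual observation. The 84 x 84 resolution and the function names are assumptions made for this illustration.

```python
def visual_observation_shape(width, height, grayscale):
    """Return the (height, width, channels) shape of one visual observation."""
    channels = 1 if grayscale else 3
    return (height, width, channels)


def input_size(shape):
    """Number of scalar values the network receives for one observation."""
    height, width, channels = shape
    return height * width * channels


color = visual_observation_shape(84, 84, grayscale=False)
gray = visual_observation_shape(84, 84, grayscale=True)
print(color, input_size(color))  # (84, 84, 3) 21168
print(gray, input_size(gray))    # (84, 84, 1) 7056
```

Switching to grayscale cuts the input to a third, and halving both width and height cuts it to a quarter, which is why the best practices above ask for the smallest image that still shows what the decision needs.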

Masking Discrete Actions

In addition to visual observations, this example also uses action masking. Let's introduce that concept first.

With discrete actions, we can declare that some actions are not available for the next decision. When the agent is controlled by a neural network, it will then be unable to perform the masked actions. Note that when the agent is controlled by hand (Heuristic behavior type), it can still choose a masked action. To mask certain actions, override the virtual method Agent.CollectDiscreteActionMasks() and call DiscreteActionMasker.SetMask() inside it, as follows:

public override void CollectDiscreteActionMasks(DiscreteActionMasker actionMasker){
    actionMasker.SetMask(branch, actionIndices);
}

Among them:

  • branch: the index (starting from 0) of the action branch you want to mask
  • actionIndices: a list of ints, the indices of the actions the agent cannot perform

The branch here corresponds to the Branches Size setting under Vector Action in the Behavior Parameters component.


For example, suppose an agent has two branches, and the first branch (branch 0) has four actions: "do nothing", "jump", "shoot" and "change weapon", with indices 0, 1, 2 and 3 respectively. If the agent must not jump or change weapon while it is shooting, the code is as follows:

public override void CollectDiscreteActionMasks(DiscreteActionMasker actionMasker){
    if (agent.Action == 2) // pseudocode: the agent's current action is "shoot" (index 2)
        actionMasker.SetMask(0, new int[2] {1, 3}); // mask "jump" and "change weapon" on branch 0
}


  • To mask actions on several branches, call SetMask several times
  • You cannot mask all the actions of a branch
  • Actions in a Continuous action space cannot be masked
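Conceptually, masking means a masked action gets probability zero when the policy samples its next discrete action. The Python sketch below illustrates that idea with a plain softmax; it is only an illustration under that assumption, not how ML-Agents implements masking internally, and the function names and logits are made up for the example.

```python
import math
import random


def masked_probabilities(logits, masked_indices):
    """Softmax over one branch with masked actions forced to probability 0."""
    masked = set(masked_indices)
    if len(masked) >= len(logits):
        raise ValueError("cannot mask every action on a branch")
    # Subtract the largest allowed logit for numerical stability.
    top = max(l for i, l in enumerate(logits) if i not in masked)
    weights = [0.0 if i in masked else math.exp(l - top)
               for i, l in enumerate(logits)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(logits, masked_indices, rng=random):
    """Sample one action index; masked indices are never returned."""
    probs = masked_probabilities(logits, masked_indices)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Branch 0: 0 = do nothing, 1 = jump, 2 = shoot, 3 = change weapon
probs = masked_probabilities([0.1, 2.0, 0.3, 1.5], [1, 3])
print(probs[1], probs[3])  # 0.0 0.0
```

The check at the top mirrors the rule that a branch can never have all of its actions masked: there would be nothing left to sample.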

OK, based on the above, let’s take a look at the grid world example.

Environment and training parameters

First, here are the project parameters, translated from the official documentation:

  • Setting: the scene contains an agent, a goal, and obstacles

  • Goal: the agent must reach the goal while avoiding the obstacles

  • Agents: the environment contains nine agents with the same behavior parameters

  • Agent reward settings

    • -0.01 for each step (to make the agent find the goal along the shortest path)
    • +1.0 if the agent reaches the goal (green plus); the episode ends and the next one starts
    • -1.0 if the agent moves onto an obstacle (red cross); the episode ends and the next one starts
  • Behavior parameters

    • Vector observations: none
    • Vector action space: (Discrete) size of 4 according to the official document, corresponding to the agent's four directions of movement (the agent script analyzed below additionally defines a "no action" value, giving five choices). Action masking is enabled by default in this environment (it can be toggled with the checkbox on the agent component). The trained model shipped with the project was trained with masking turned on. The mask is used here to keep the blue agent from leaving the grid; see the code later.
    • Visual observations: a top-down view of the GridWorld
  • Generalization parameters: gridSize, number of obstacles and number of goals. For a detailed explanation of generalization, please refer to the previous article

  • Benchmark mean reward: 0.8
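To see how these rewards add up over an episode, here is a small Python sketch. It rests on one assumption that should be checked against the agent code: on the last step the terminal reward (+1 or -1) replaces the step penalty instead of being added to it. The helper name is made up for the illustration.

```python
STEP_PENALTY = -0.01
GOAL_REWARD = 1.0
PIT_REWARD = -1.0


def episode_return(steps, outcome):
    """Total reward of an episode that ends on step number `steps`.

    Assumption: the terminal reward replaces the step penalty on the last
    step, so only the first `steps - 1` steps are penalised.
    """
    if steps < 1:
        raise ValueError("an episode has at least one step")
    terminal = {"goal": GOAL_REWARD, "pit": PIT_REWARD}[outcome]
    return (steps - 1) * STEP_PENALTY + terminal


print(round(episode_return(8, "goal"), 2))   # 0.93
print(round(episode_return(21, "goal"), 2))  # 0.8
```

Under this assumption a single successful episode scores 0.8 when the goal is reached on step 21. The benchmark value is a mean over many episodes, including failed ones, so it does not translate directly into a step count.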

Basic structure of the scene

Taking one basic training unit as an example, first look at the Scene view:


Its Hierarchy is:


Here scene is the parent object of the grid area and contains a plane and four walls; RenderTextureAgent, the blue box, is the agent; agentCam is the rendering camera, and the texture it renders is used as the input to the CNN; pit and goal are the obstacle and the goal respectively. In this example they are placed at random positions in the grid at runtime.

Note that the parent node AreaRenderTexture has the Grid Area script attached, which initializes the environment (setting up the walls, randomly placing goals and obstacles) and resets the agent and the environment.

You will also notice that the AreaRenderTexture unit differs from the other training units: its camera renders into a RenderTexture, which appears as a small picture at runtime.


Code analysis

Environment initialization code


using System.Collections.Generic;
using UnityEngine;
using System.Linq;
using MLAgents;
using MLAgents.SideChannels;

public class GridArea : MonoBehaviour
{
    public List<GameObject> actorObjs; // GameObjects of the spawned obstacles and goals
    public int[] players; // one entry per object to spawn: 1 = obstacle, 0 = goal

    public GameObject trueAgent; // the agent
    public GameObject goalPref; // goal prefab
    public GameObject pitPref; // obstacle prefab

    IFloatProperties m_ResetParameters; // generalization parameters

    Camera m_AgentCam; // camera of this training unit; its position and orthographic size are set from gridSize
    GameObject[] m_Objects; // goal and obstacle prefabs, indexed by the values in players

    // Floor and four walls
    GameObject m_Plane;
    GameObject m_Sn;
    GameObject m_Ss;
    GameObject m_Se;
    GameObject m_Sw;

    Vector3 m_InitialPosition; // initial position of the parent object

    public void Start()
    {
        // Parameter initialization
        m_ResetParameters = Academy.Instance.FloatProperties;

        m_Objects = new[] { goalPref, pitPref };

        m_AgentCam = transform.Find("agentCam").GetComponent<Camera>();

        actorObjs = new List<GameObject>();

        var sceneTransform = transform.Find("scene");

        m_Plane = sceneTransform.Find("Plane").gameObject;
        m_Sn = sceneTransform.Find("sN").gameObject;
        m_Ss = sceneTransform.Find("sS").gameObject;
        m_Sw = sceneTransform.Find("sW").gameObject;
        m_Se = sceneTransform.Find("sE").gameObject;
        m_InitialPosition = transform.position;
    }

    /// Set up the environment
    public void SetEnvironment()
    {
        // Position the parent node: the nine training units in the scene are spread out according to gridSize
        transform.position = m_InitialPosition * (m_ResetParameters.GetPropertyWithDefault("gridSize", 5f) + 1);

        // Build the players array, where 1 represents an obstacle and 0 represents a goal
        var playersList = new List<int>();
        for (var i = 0; i < (int)m_ResetParameters.GetPropertyWithDefault("numObstacles", 1f); i++)
        {
            playersList.Add(1);
        }

        for (var i = 0; i < (int)m_ResetParameters.GetPropertyWithDefault("numGoals", 1f); i++)
        {
            playersList.Add(0);
        }
        players = playersList.ToArray();

        // Set the position and scale of the floor and walls; gridSize is the number of cells per side
        var gridSize = (int)m_ResetParameters.GetPropertyWithDefault("gridSize", 5f);
        m_Plane.transform.localScale = new Vector3(gridSize / 10.0f, 1f, gridSize / 10.0f);
        m_Plane.transform.localPosition = new Vector3((gridSize - 1) / 2f, -0.5f, (gridSize - 1) / 2f);
        m_Sn.transform.localScale = new Vector3(1, 1, gridSize + 2);
        m_Ss.transform.localScale = new Vector3(1, 1, gridSize + 2);
        m_Sn.transform.localPosition = new Vector3((gridSize - 1) / 2f, 0.0f, gridSize);
        m_Ss.transform.localPosition = new Vector3((gridSize - 1) / 2f, 0.0f, -1);
        m_Se.transform.localScale = new Vector3(1, 1, gridSize + 2);
        m_Sw.transform.localScale = new Vector3(1, 1, gridSize + 2);
        m_Se.transform.localPosition = new Vector3(gridSize, 0.0f, (gridSize - 1) / 2f);
        m_Sw.transform.localPosition = new Vector3(-1, 0.0f, (gridSize - 1) / 2f);

        // Set up the orthographic camera
        m_AgentCam.orthographicSize = (gridSize) / 2f; // orthographic size of the camera
        m_AgentCam.transform.localPosition = new Vector3((gridSize - 1) / 2f, gridSize + 1f, (gridSize - 1) / 2f); // camera position
    }

    /// Environment reset
    public void AreaReset()
    {
        var gridSize = (int)m_ResetParameters.GetPropertyWithDefault("gridSize", 5f); // number of cells per side
        foreach (var actor in actorObjs)
        {
            // Destroy all current goals and obstacles
            DestroyImmediate(actor);
        }
        SetEnvironment(); // set the environment up again

        actorObjs.Clear();

        // Use a HashSet to draw players.Length + 1 distinct random cell indices
        // (all obstacles and goals, plus 1 for the agent)
        var numbers = new HashSet<int>();
        while (numbers.Count < players.Length + 1)
        {
            numbers.Add(Random.Range(0, gridSize * gridSize));
        }
        var numbersA = Enumerable.ToArray(numbers);

        // x = randomNum / gridSize, y = randomNum % gridSize gives the cell of each object
        for (var i = 0; i < players.Length; i++)
        {
            // Place obstacles and goals at random cells
            var x = (numbersA[i]) / gridSize;
            var y = (numbersA[i]) % gridSize;
            var actorObj = Instantiate(m_Objects[players[i]], transform);
            actorObj.transform.localPosition = new Vector3(x, -0.25f, y);
            actorObjs.Add(actorObj);
        }

        // Reset the agent to a random cell
        var xA = (numbersA[players.Length]) / gridSize;
        var yA = (numbersA[players.Length]) % gridSize;
        trueAgent.transform.localPosition = new Vector3(xA, -0.25f, yA);
    }
}

Overall, the environment reset code is fairly simple; the comments in the code above should be enough to follow it.
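The core of the reset is "pick distinct random cells, then decode each cell index into grid coordinates". The Python sketch below shows that idea in isolation. It uses random.sample in place of the HashSet loop from the C# code, and the function name and the returned dictionary are assumptions made for this illustration.

```python
import random


def place_objects(grid_size, num_obstacles, num_goals, rng=random):
    """Pick distinct cells for every obstacle, every goal and the agent."""
    count = num_obstacles + num_goals + 1  # +1 for the agent
    if count > grid_size * grid_size:
        raise ValueError("not enough cells for all objects")
    # Distinct cell indices in [0, grid_size * grid_size)
    cells = rng.sample(range(grid_size * grid_size), count)
    # Decode index -> (x, z): x = index // grid_size, z = index % grid_size
    positions = [divmod(cell, grid_size) for cell in cells]
    return {
        "obstacles": positions[:num_obstacles],
        "goals": positions[num_obstacles:num_obstacles + num_goals],
        "agent": positions[-1],
    }


layout = place_objects(5, num_obstacles=3, num_goals=2)
print(layout["agent"], layout["goals"], layout["obstacles"])
```

The explicit size check is worth noting: a loop that keeps drawing until it has enough distinct values can never finish if more objects are requested than there are cells, which matters once gridSize and the object counts are sampled independently.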

Agent script

Initialization and reset

In the earlier 3D Ball example, initialization was done in InitializeAgent() and reset in AgentReset(); in a sense, initialization and reset are the same thing. In GridWorld, the initialization and reset of both the environment and the agent are placed in the GridArea.cs script. Beyond that, take a look at the variables declared in the agent script.

using System;
using UnityEngine;
using System.Linq;
using MLAgents;
using UnityEngine.Serialization;

public class GridAgent : Agent
{
    [FormerlySerializedAs("m_Area")] // keeps the serialized reference when the field is renamed; see below
    [Header("Specific to GridWorld")] // shows a descriptive title above the field in the Inspector
    public GridArea area; // environment reset script
    public float timeBetweenDecisionsAtInference; // at inference, the agent acts once every timeBetweenDecisionsAtInference seconds
    float m_TimeSinceDecision; // timer since the last decision
    public Camera renderCamera; // camera that renders into the RenderTexture
    public bool maskActions = true; // action mask switch; a model trained with masking may behave poorly when masking is turned off

    // Action values of the agent
    const int k_NoAction = 0; // no action
    const int k_Up = 1; // move up
    const int k_Down = 2; // move down
    const int k_Left = 3; // move left
    const int k_Right = 4; // move right

    public override void InitializeAgent()
    {
        // empty
    }

    public override void AgentReset()
    {
        area.AreaReset(); // reset the environment and the agent
    }

    // ... the remaining methods are covered in the sections below
}

The code above is mostly explained by its comments. The two Unity attributes it uses are described below.

  • [FormerlySerializedAs(string name)]

    This attribute lets a serialized field in a script be renamed without losing the object it references. Usage is as follows; suppose there is a script Test.cs:


    using UnityEngine;
    public class Test : MonoBehaviour
    {
        public GridArea Area;
        void Start() { }
    }

    Attach it to an object in the scene and drag any object with a matching script onto the Area field.


    If we now rename the Area variable to Area_1, the following happens:


    The original reference is lost. To prevent this, we use the [FormerlySerializedAs] attribute. To demonstrate, first assign the original reference to Area_1 again.


    Then, before renaming Area_1, add the FormerlySerializedAs attribute with the old name, as follows:

    using UnityEngine;
    using UnityEngine.Serialization;
    public class Test : MonoBehaviour
    {
        [FormerlySerializedAs("Area_1")] // old field name, so the serialized reference carries over
        public GridArea Area;
        void Start() { }
    }

    Now look at the reference of the renamed variable:


    The reference is still there. If the referenced object has many serialized parameters of its own, those are not lost either. That said, wiring objects to scripts by dragging is generally not recommended in development: in a large project with many such references, lost references in a scene become hard to track down and manage.

  • [Header(string content)]

    This attribute is simple; just look at its effect directly.


Action mask

With action masks introduced above, let's look at how the mask is set up in GridWorld.

public override void CollectDiscreteActionMasks(DiscreteActionMasker actionMasker)
{
    if (maskActions) // action mask switch
    {
        // Prevent the agent from walking into a wall
        var positionX = (int)transform.position.x; // x position of the agent
        var positionZ = (int)transform.position.z; // z position of the agent
        // Largest coordinate the agent can move to
        var maxPosition = (int)Academy.Instance.FloatProperties.GetPropertyWithDefault("gridSize", 5f) - 1;

        if (positionX == 0)
        {
            // the agent is in the leftmost column and cannot move left
            actionMasker.SetMask(0, new int[] { k_Left });
        }
        if (positionX == maxPosition)
        {
            // the agent is in the rightmost column and cannot move right
            actionMasker.SetMask(0, new int[] { k_Right });
        }
        if (positionZ == 0)
        {
            // the agent is in the bottom row and cannot move down
            actionMasker.SetMask(0, new int[] { k_Down });
        }
        if (positionZ == maxPosition)
        {
            // the agent is in the top row and cannot move up
            actionMasker.SetMask(0, new int[] { k_Up });
        }
    }
}

Compare with the following figure:


Together, the code and the figure show how the mask works. Note again that action masks only apply to the Discrete type, that is, they can only be used when the action space is discrete.
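The mask rule is easy to check outside Unity. The Python sketch below reproduces the four boundary checks, using the same action numbering as the agent script (0 = no action, 1 = up, 2 = down, 3 = left, 4 = right); the function name is made up for the illustration.

```python
NO_ACTION, UP, DOWN, LEFT, RIGHT = range(5)


def masked_actions(x, z, grid_size):
    """Actions that would push the agent out of a grid_size x grid_size grid."""
    max_position = grid_size - 1
    masked = []
    if x == 0:
        masked.append(LEFT)
    if x == max_position:
        masked.append(RIGHT)
    if z == 0:
        masked.append(DOWN)
    if z == max_position:
        masked.append(UP)
    return masked


print(masked_actions(0, 0, 5))  # [3, 2]
print(masked_actions(4, 4, 5))  # [4, 1]
print(masked_actions(2, 2, 5))  # []
```

At most four of the five actions are ever masked, since "no action" is never masked, so the rule that a branch cannot be fully masked always holds here.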

Agent action feedback

Let's take a look at the agent's action feedback function AgentAction():

// Requires "using System;" (ArgumentException) and "using System.Linq;" (Where)
public override void AgentAction(float[] vectorAction)
{
    AddReward(-0.01f); // penalty of 0.01 per step, so that the agent finds the goal as quickly as possible
    // Convert the action value to an integer
    var action = Mathf.FloorToInt(vectorAction[0]);
    // From the action, compute the next position targetPos the agent wants to move to
    var targetPos = transform.position;
    switch (action)
    {
        case k_NoAction:
            // do nothing
            break;
        case k_Right:
            targetPos = transform.position + new Vector3(1f, 0, 0f);
            break;
        case k_Left:
            targetPos = transform.position + new Vector3(-1f, 0, 0f);
            break;
        case k_Up:
            targetPos = transform.position + new Vector3(0f, 0, 1f);
            break;
        case k_Down:
            targetPos = transform.position + new Vector3(0f, 0, -1f);
            break;
        default:
            throw new ArgumentException("Invalid action value");
    }

    // Box-shaped overlap check at the position the agent wants to move to
    var hit = Physics.OverlapBox(
        targetPos, new Vector3(0.3f, 0.3f, 0.3f));
    if (hit.Where(col => col.gameObject.CompareTag("wall")).ToArray().Length == 0)
    {
        // the next position does not touch a wall (an object tagged "wall"), so move the agent
        transform.position = targetPos;

        if (hit.Where(col => col.gameObject.CompareTag("goal")).ToArray().Length == 1)
        {
            // the next position is the goal: reward 1 and end the episode
            SetReward(1f);
            Done();
        }
        else if (hit.Where(col => col.gameObject.CompareTag("pit")).ToArray().Length == 1)
        {
            // the next position is an obstacle (pit): reward -1 and end the episode
            SetReward(-1f);
            Done();
        }
    }
}

With the comments, the code above is not hard to follow. Note the following two points:

  • The parameter vectorAction holds the agent's action vector. The GridWorld agent takes a single discrete action per step, so Branches Size is 1; that action has five choices, k_NoAction, k_Right, k_Left, k_Up and k_Down, so Branch 0 Size is 5.


  • Physics.OverlapBox is a box-shaped overlap query rather than a ray: it returns the colliders that touch or lie inside the box, as shown in the following figure. It also suggests a general approach: compute the next position first, then use an overlap query, a ray cast or a collider at that position to check whether the move is allowed.
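The same "compute the target cell first, then check what is there" logic can be sketched in Python on a plain grid. Two things are assumptions here rather than facts from the Unity code: the grid bounds stand in for the wall colliders, and the terminal reward is taken to replace the step penalty. The function and variable names are made up for the illustration.

```python
NO_ACTION, UP, DOWN, LEFT, RIGHT = range(5)
MOVES = {NO_ACTION: (0, 0), UP: (0, 1), DOWN: (0, -1), LEFT: (-1, 0), RIGHT: (1, 0)}


def step(position, action, grid_size, goals, pits):
    """Return (new_position, reward, done) for one move on the grid."""
    if action not in MOVES:
        raise ValueError("Invalid action value")
    dx, dz = MOVES[action]
    target = (position[0] + dx, position[1] + dz)
    reward = -0.01  # step penalty
    # Cells outside the grid play the role of the walls: the move is blocked.
    if not (0 <= target[0] < grid_size and 0 <= target[1] < grid_size):
        return position, reward, False
    if target in goals:
        return target, 1.0, True   # assumed: terminal reward replaces the penalty
    if target in pits:
        return target, -1.0, True
    return target, reward, False


goals, pits = {(4, 4)}, {(2, 2)}
print(step((3, 4), RIGHT, 5, goals, pits))  # ((4, 4), 1.0, True)
print(step((0, 0), LEFT, 5, goals, pits))   # ((0, 0), -0.01, False)
```

A blocked move still costs the step penalty, which is one reason masking the wall-bound actions helps: the agent does not have to learn from wasted steps that those moves are useless.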



FixedUpdate()

In the agent script we also notice that FixedUpdate() is used to control when the agent's brain makes a decision:

public void FixedUpdate()
{
    WaitTimeInference(); // called on every fixed (physics) step
}

void WaitTimeInference()
{
    if (renderCamera != null)
    {
        // if a render camera is assigned, render it manually on every step
        renderCamera.Render();
    }

    if (Academy.Instance.IsCommunicatorOn)
    {
        // the environment is connected to Python: request a decision on every step
        RequestDecision();
    }
    else
    {
        // not connected: request a decision once timeBetweenDecisionsAtInference seconds have passed
        if (m_TimeSinceDecision >= timeBetweenDecisionsAtInference)
        {
            m_TimeSinceDecision = 0f;
            RequestDecision();
        }
        else
        {
            m_TimeSinceDecision += Time.fixedDeltaTime; // decision timer
        }
    }
}
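The timer logic can be simulated in a few lines of Python to see how often a decision is actually requested. This is only a sketch of the branching described above; the function name is made up, and the step size and interval are chosen as exact binary fractions so that the count is easy to verify by hand.

```python
def count_decisions(num_steps, fixed_delta_time, time_between_decisions, connected):
    """Count how many decisions are requested over num_steps physics steps."""
    time_since_decision = 0.0
    decisions = 0
    for _ in range(num_steps):
        if connected:
            decisions += 1  # training: request a decision on every step
        elif time_since_decision >= time_between_decisions:
            time_since_decision = 0.0
            decisions += 1
        else:
            time_since_decision += fixed_delta_time
    return decisions


print(count_decisions(10, 0.25, 1.0, connected=True))   # 10
print(count_decisions(10, 0.25, 1.0, connected=False))  # 2
```

In the second call a decision falls on every fifth step, not every fourth, because the step that requests the decision does not advance the timer. With this structure the real interval is therefore the configured time plus one fixed step.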

Manual operation code

The manual operation code is as follows:

public override float[] Heuristic()
{
    if (Input.GetKey(KeyCode.D))
    {
        return new float[] { k_Right };
    }
    if (Input.GetKey(KeyCode.W))
    {
        return new float[] { k_Up };
    }
    if (Input.GetKey(KeyCode.A))
    {
        return new float[] { k_Left };
    }
    if (Input.GetKey(KeyCode.S))
    {
        return new float[] { k_Down };
    }
    return new float[] { k_NoAction };
}

The code here is easy to understand. However, if you switch to manual mode you will find that your input in the scene gets no response, because the agent has no Decision Requester component. After adding one and changing Behavior Type to Heuristic Only, the agent can be controlled manually, although the controls are not very comfortable.

About GridSettings

In the scene you will also find the GridSettings.cs script attached to the Main Camera:

using UnityEngine;
using MLAgents;

public class GridSettings : MonoBehaviour
{
    public Camera MainCamera;

    public void Awake()
    {
        // Reposition the main camera whenever gridSize changes
        Academy.Instance.FloatProperties.RegisterCallback("gridSize", f =>
        {
            MainCamera.transform.position = new Vector3(-(f - 1) / 2f, f * 1.25f, -(f - 1) / 2f);
            MainCamera.orthographicSize = (f + 5f) / 2f;
        });
        // test
        //MainCamera.transform.position = new Vector3(-(10 - 1) / 2f, 10 * 1.25f, -(10 - 1) / 2f);
        //MainCamera.orthographicSize = (10 + 5f) / 2f;
    }
}

The purpose here is to position the main camera according to gridSize. For some reason the callback in the source code did not seem to work for me, so I set gridSize to 10 directly (the related gridSize values in GridArea and GridAgent also need to be changed to 10) and used the commented-out test code above to adjust the camera position, with the following result:


As you can see, the camera position adapts to gridSize.
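The two formulas in the callback are simple enough to evaluate by hand. The Python sketch below just computes them for a given gridSize, so you can see where the camera ends up; the function name is made up for the illustration.

```python
def main_camera_settings(grid_size):
    """Camera position (x, y, z) and orthographic size for a given gridSize."""
    offset = -(grid_size - 1) / 2
    position = (offset, grid_size * 1.25, offset)
    orthographic_size = (grid_size + 5) / 2
    return position, orthographic_size


print(main_camera_settings(5))   # ((-2.0, 6.25, -2.0), 5.0)
print(main_camera_settings(10))  # ((-4.5, 12.5, -4.5), 7.5)
```

Both the height and the orthographic size grow linearly with gridSize, so a larger grid is viewed from further away with a wider field.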

About others

This example turns out to use both types of visual observation sensor:


and the other one:


In the project, the AreaRenderTexture training unit uses the Render Texture Sensor Component, while the other agent units use the Camera Sensor Component. You will find that removing agentCam from the AreaRenderTexture training unit has no effect on the output, because that unit trains with the RenderTextureSensor. I am not sure whether I am misunderstanding something, or whether the official example simply wants to demonstrate both sensors here.

Training model

That concludes the analysis of the sample code. Now let's try training the project. First, change the number of obstacles and goals to 3 and 2 to check whether the model shipped with the project was trained with generalization parameters.


From the picture above, the agent looks a bit lost and cannot find the goal right away, which suggests that the shipped model was not trained with generalization parameters. So we introduce generalization parameters in our own training.

Generalization parameter configuration

Based on the generalization parameters in this example, we create a new configuration file gridworld_generalize.yaml under ml-agent\config:

resampling-interval: 5000

gridSize:
    sampler-type: "uniform"
    min_value: 5
    max_value: 15

numObstacles:
    sampler-type: "uniform"
    min_value: 1
    max_value: 5

numGoals:
    sampler-type: "uniform"
    min_value: 1
    max_value: 3

This configuration means that gridSize ranges from 5 to 15 cells, the number of obstacles ranges from 1 to 5, and the number of goals ranges from 1 to 3.
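To make concrete what a uniform range means for the environment, here is a Python sketch that draws one value per parameter and then truncates it the way the `(int)` casts in GridArea do. This is not the ML-Agents sampler implementation; the dictionary and function name are assumptions made for the illustration, with the parameter names taken from the C# code.

```python
import random

SAMPLER_CONFIG = {
    "gridSize": (5, 15),
    "numObstacles": (1, 5),
    "numGoals": (1, 3),
}


def sample_reset_parameters(config, rng=random):
    """Draw one value per parameter uniformly from its [min, max] range."""
    return {name: rng.uniform(low, high) for name, (low, high) in config.items()}


params = sample_reset_parameters(SAMPLER_CONFIG, random.Random(0))
# The environment casts each value to int, which truncates toward zero.
print({name: int(value) for name, value in params.items()})
```

One consequence of the truncation is worth keeping in mind: a value such as 14.99 becomes 14, so with continuous uniform sampling the upper bound of each range (15, 5 and 3 here) is practically never reached.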

Of course, this configuration is just an experiment. I do not have much experience configuring these parameters, but it is enough to see whether the trained model can generalize within a certain range of variation.

Start training

As in the previous article, the training process is not described in detail here; only the key steps are given. cd into the ML-Agents directory, then enter the following training command:

mlagents-learn config/trainer_config.yaml --sampler=config/gridworld_generalize.yaml --run-id=GridWolrd_Gen --train


Compared with 3D Ball, this training run is a bit sluggish. After a while the training steps should start appearing in CMD; then wait for the training to finish. It took me about 35 minutes. Disappointingly, with all three parameters generalized the training result was poor and did not even meet the most basic requirements.

I then tried generalizing only one parameter, the number of obstacles (1 to 4), and trained again for another 30 minutes. Even with only the obstacle count changing randomly, the training result was still poor.

The TensorBoard charts for the three training runs (three generalization parameters, one generalization parameter, and no generalization parameters) are attached below.


According to the charts, both training runs with generalization parameters failed. I do not know the specific reason yet and hope to find out as I learn more about ML-Agents. Indeed, if we put the two models trained with generalization parameters into Unity and run them, the results are not good.


Summary

GridWorld is an interesting example. It trains on images from visual observations, which is closer to how humans learn with their eyes. The downside is that I have not yet managed to train a model that generalizes, which I will look into later.

Writing these articles takes effort, so please note the following statement:

1. The original articles are marked in the blog, and the copyright belongs to Xu Yang (the blogger);

2. Without the permission of the original author, it is not allowed to reprint the contents of this article, otherwise it will be regarded as infringement;

3. Please indicate the source and the original author of the reprint or quotation;

4. For those who do not comply with this statement or use the contents of this article illegally, I reserve the right of investigation according to law.