Configuring application scaling
With IBM Cloud® Code Engine, you don't need to think about scaling your application because the number of running instances of an application is automatically scaled up or down (to zero) based on incoming workloads. With automatic scaling, you don't pay for resources that are not used.
Code Engine monitors the number of requests in the system and scales the application instances up and down to meet the load of incoming requests, including any HTTP connections to your application. These HTTP connections can be requests that come from outside of your project, from other workloads that are running in your project, or from event producers that you might be subscribed to, regardless of where those producers are located. Code Engine automatically replicates application instances and configures the network infrastructure to load balance the requests across all instances.
How scaling works
Code Engine monitors the number of requests in the system and scales the application instances, within the constraints of the application's concurrency settings.
To observe application scaling from the Code Engine console, navigate to your specific application page. While the application is handling requests, the number of running instances is `1` or greater, up to the maximum number of instances that you specified. When the application handles fewer requests than its configured target concurrency, the number of running instances scales down to the configured minimum number of instances. If the minimum number of instances is set to `0` (the default value), the application scales down to zero and the number of instances for the app reflects `0` instances. When the application is scaled down and a request is routed to it, Code Engine scales the application back up and routes the request to the newly created application instance.
For applications to scale down cleanly, your application code must handle the SIGTERM signal. When Code Engine automatically scales down, it sends a SIGTERM signal to your application. If your application does not handle the SIGTERM signal, your app instances stay in `Terminating` status for the time that is configured by the request timeout value (the default is 300 seconds). After the request timeout elapses, Code Engine sends a SIGKILL signal to forcefully stop the app instances that remain in `Terminating` status.
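The following is a minimal sketch of one way a container entrypoint might forward SIGTERM to the application process so the instance can shut down cleanly. The `./server` binary is a placeholder for your own application; adapt the approach to your language or framework.

```sh
#!/bin/sh
# Minimal entrypoint sketch: forward SIGTERM to the app process so it can
# shut down cleanly when Code Engine scales the instance down.
# "./server" is a placeholder for your application binary.

terminate() {
  # Pass the signal on to the application and wait for it to exit.
  kill -TERM "$APP_PID" 2>/dev/null
  wait "$APP_PID"
  exit 0
}

trap terminate TERM

./server &
APP_PID=$!
wait "$APP_PID"
```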
Scaling boundaries
You can configure scaling boundaries by setting a range of values for the minimum and maximum number of instances within which Code Engine autoscales the number of running instances of your application. Configure this range when you create or update applications.
- Minimum number of instances - The minimum number of instances of the app that are always running, even if no requests are processed. When set to `0` (the default), Code Engine removes all instances when no traffic is reaching the application. To always keep a running instance of your application, set this value greater than `0`. When you create or update an application with Code Engine from the console, set the minimum number of instances value in the Autoscaling section of the Resources & scaling tab. With the CLI, specify the `--min-scale` option on the `app create` and `app update` commands.
- Maximum number of instances - The maximum number of instances that can be running for the app. Autoscaling occurs up to this limit. If you set this value to `0`, the application scales as needed, and the app scaling is limited only by the instances allowed by the resource quota for the project of your app. See Limits and quotas for Code Engine. When you create or update an application with Code Engine from the console, set the maximum number of instances value in the Autoscaling section of the Resources & scaling tab. With the CLI, specify the `--max-scale` option on the `app create` and `app update` commands. The default value for this option is `10`. For a CLI example, see the command that follows this list.
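For example, the following sketch keeps one instance warm and caps the app at 20 instances with the CLI. The application name `myapp` and the values shown are placeholders.

```sh
# Always keep one instance running and allow scaling up to 20 instances.
# "myapp" is a placeholder application name.
ibmcloud ce application update --name myapp --min-scale 1 --max-scale 20
```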
When you connect your applications to event producers, remember to account for the frequency and volume of the events from those producers when you set your scale boundaries. For example, if you expect to receive many events at the same time and the processing of each event can take several minutes, then you might need a higher maximum scale value than if each event can be quickly processed. If you set the value too low, you might experience delays in receiving events, or even dropped events due to timeouts while you wait for processing resources to be free.
Autoscaling settings for concurrency and timing
Use the following configuration settings to control application scaling.
- Concurrency - This value indicates how many requests each instance of your app can process at one time. For example, a concurrency value of 100 means that each instance of your code can handle 100 concurrent requests. This value is a "hard limit", meaning that Code Engine does not allow more than the specified number of requests to reach any one instance of your application. Therefore, if your code is single threaded and can process only one request at a time, consider setting concurrency to `1`. When the specified number of requests are sent to all running instances of your application, Code Engine increases the number of instances in preparation for more requests. When you create or update an application with Code Engine from the console, set the max concurrency value in the Autoscaling section of the Resources & scaling tab. With the CLI, specify the `--concurrency` option on the `app create` and `app update` commands.
- Target concurrency - This value acts as a "soft limit", or the number of requests per instance that you want in a loaded system. For example, if you specify `concurrency` as `100` and `target concurrency` as `70`, then Code Engine tries to limit the number of requests per instance to 70. Specifying this option doesn't ensure that requests per instance never exceed 70, because they most likely will during an increase in traffic. However, the buffer between 70 and 100 gives the system time to create new instances and bring the load back down to 70 (or fewer) requests per instance. If `target concurrency` is not specified, the default is the value of `concurrency`. When you create or update an application with Code Engine from the console, set the target concurrency value in the Autoscaling section of the Resources & scaling tab. With the CLI, specify the `--concurrency-target` option on the `app create` and `app update` commands.
- Request timeout - The amount of time, in seconds, within which the application must respond to requests, or else they fail. When you create or update an application with Code Engine from the console, set the request timeout value in the Autoscaling section of the Resources & scaling tab. With the CLI, specify the `--request-timeout` option on the `app create` and `app update` commands.
- Scale-down delay - The amount of time, in seconds, that must pass at reduced concurrency before the application scales down. If you know the pattern of incoming requests to your application, you might choose to specify a value greater than the default of `0` to delay the scaling down of your application. If you use a high value for this option, the average number of concurrently running application instances might increase, which incurs additional cost. Even with a scale-down delay of `0`, the system waits a short amount of time before it scales down the number of instances, to ensure that a drop in request concurrency is stable enough to warrant a scale down. When you create or update an application with Code Engine from the console, set the scale-down delay value in the Autoscaling section of the Resources & scaling tab. With the CLI, specify the `--scale-down-delay` option on the `app create` and `app update` commands. For a CLI example that combines these settings, see the command after this list.
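The following sketch shows how these settings map to CLI options on `app update`. The application name `myapp` and the specific values are placeholders, not recommendations.

```sh
# Allow up to 100 requests per instance (hard limit), aim for 70 per instance
# (soft limit), fail requests after 120 seconds, and wait 60 seconds of reduced
# load before scaling down. "myapp" is a placeholder application name.
ibmcloud ce application update --name myapp \
  --concurrency 100 \
  --concurrency-target 70 \
  --request-timeout 120 \
  --scale-down-delay 60
```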
To illustrate how these values interact: if the maximum number of instances is set to `10` and the concurrency is set to `100`, then the application can process 1000 concurrent requests before requests might be queued. If you expect more than 1000 concurrent requests, consider increasing the maximum number of instances value for your app.
For more information about how Code Engine works, see IBM Cloud Code Engine: Optimizing Application Scaling, Latency, and Throughput, and Tuning IBM Cloud Code Engine Scale-Down Delay to Reduce Application Response Times.
If you set the minimum number of instances equal to the maximum number of instances, then scaling does not occur and the target concurrency and scale-down delay values do not apply. However, the concurrency value is still in effect.
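For example, a sketch of pinning an app to exactly three instances might look like the following. The application name `myapp` and the value `3` are placeholders.

```sh
# Pin the app to exactly 3 instances; autoscaling no longer adjusts the count,
# but the per-instance concurrency limit still applies.
# "myapp" is a placeholder application name.
ibmcloud ce application update --name myapp --min-scale 3 --max-scale 3
```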
Example scenario for autoscaling
Suppose that you have an application that is configured with the following autoscaling settings.
- Min number of instances = 4
- Max number of instances = 20
- Concurrency = 2
With these settings, each instance of your app can process 2 concurrent requests. When there is no traffic, your application automatically scales down to 4 instances. Those 4 instances can handle only 4 * 2 (concurrency) = 8 requests. The application can scale up to a maximum of 20 instances, so that the 20 instances can handle 20 * 2 (concurrency) = 40 requests, assuming sufficient CPU and memory settings for the application.
Suppose that your application is a static web application that requires 6 calls to your application to load a login page. Assume that 10 people log in to your application login page at the same time (10 * 6 = 60 calls). Because 60 calls are greater than the 40 requests that your app can handle, users of the application likely observe slowness as the calls are queued.
After about a minute of no traffic against your app, Code Engine starts to automatically scale down. When your application scales down, some instances are set to `Terminating` status. Code Engine sends a SIGTERM signal to your application, and your code must handle this signal and shut down. If your application does not handle this signal, your app instances stay in `Terminating` status for the time that is configured by the request timeout value (the default is 300 seconds). After the request timeout elapses, Code Engine sends a SIGKILL signal to forcefully stop the app instances in `Terminating` status.
Suppose that 8 instances remain active and 12 instances stay in `Terminating` status for 300 seconds (5 minutes). These 12 instances do not receive any requests. If more than 8 * 2 (concurrency) = 16 requests come in, then Code Engine scales up new instances.
In this example, the concurrency value of `2` is low for this static content. You must determine the autoscaling values to set for your app, which depends on what the implementation of your app can handle with its assigned CPU and memory resources.
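Continuing this scenario as a sketch, you might raise the per-instance concurrency so that fewer instances absorb the same static-content load. The application name `myapp` and the values are illustrative only, not recommendations.

```sh
# Raise per-instance concurrency for lightweight static content so the same
# login burst (60 calls) fits in fewer instances.
# "myapp" and the values shown are placeholders.
ibmcloud ce application update --name myapp --concurrency 50 --min-scale 1 --max-scale 10
```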
Optimize latency and throughput
To optimize your application latency and throughput, understand the pros and cons of some common models and best practices to configure the container concurrency.
Use the setting for concurrency to specify the maximum number of requests that can be processed concurrently per instance when you create or update an application. From the Code Engine console, set the concurrency value in the Runtime section. With the CLI, use the `--concurrency` option (alias `--cn`) with the `app create` or `app update` command.
Model | Pros | Cons |
---|---|---|
Single concurrency, `--cn=1` | Choose the single-concurrency configuration when the application serves a memory-intensive or CPU-intensive workload, because only one request enters the application instance at a time and therefore gets the full amount of CPU and memory that is configured for the instance. | Applications that use the single-concurrency model scale out quickly. The scale-out might introduce additional latency and lower throughput because it is more expensive to create a new application instance than to reuse an existing one. Do not choose this model if requests can be processed concurrently and latency is a critical aspect of the application. |
High concurrency, `--cn=100` (default) or higher | Choose this configuration if your application serves a high volume of HTTP request or response workloads that are not CPU-intensive or memory-intensive and can wait for resources. You might choose this concurrency configuration for an API back end that reads and writes data on create, retrieve, update, and delete operations to a remote database. While some requests wait on I/O, other requests can be processed without impacting overall latency and throughput. | This setting is not optimal when concurrent requests compete for CPU, memory, or I/O, because competition for resources can delay execution and impact latency and throughput negatively. |
Optimal concurrency, `--cn=N` | Choose the optimal concurrency configuration if you know the amount of resources that are required for a single request to meet the response time that you want for your application. For example, in a natural language translation application where the machine learning model for the language conversion is 32 GB and a single translation computation needs about 0.7 vCPU per request, you can choose a configuration of 9 vCPUs and 32 GB of memory per instance. The optimal container concurrency is then about 13 (9 vCPU / 0.7 vCPU). For a CLI sketch of this scenario, see the example after this table. | Do not choose the optimal concurrency configuration if you do not know the resource requirements for your application. Setting the wrong container concurrency can lead to either too aggressive or too lazy scaling, which might impact the latency, error rate, and costs of your application. For more information, see Determining concurrency of your application container. |
Infinite concurrency, `--cn=0` (disabled) | Disabled | The infinite concurrency configuration is not supported in Code Engine. The infinite concurrency configuration forwards as many requests as possible to a single application instance, which can delay the scaling of additional application instances. |
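The following sketch applies the optimal-concurrency scenario from the table with the `--cn` alias. The application name `translator` is a placeholder, and the 9 vCPU and 32 GB sizing comes from the table's example; check the CPU and memory combinations that Code Engine supports before you apply a similar configuration.

```sh
# Size each instance for the translation model and cap requests per instance
# at the computed optimum (about 13 = 9 vCPU / 0.7 vCPU per request).
# "translator" is a placeholder application name.
ibmcloud ce application update --name translator --cpu 9 --memory 32G --cn 13
```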
Determining concurrency of your application container
The optimal container concurrency value is determined by the maximum number of concurrent requests that the application can handle with an acceptable request latency.
The container concurrency has a direct impact on the success rate, latency, and throughput of the application. When the container concurrency value is too high for the application to handle, latency and throughput are impacted negatively, and you might observe `502` and `503` error responses.
When the container concurrency value is too low for the application, the system scales out the application more quickly and spreads the requests across many application instances, introducing additional costs and latency. During a burst of load, a low concurrency value can also lead to temporary `502` responses when the internal buffers of the system overflow.
To determine the container concurrency configuration for your application, examine requests and latency.
1. Create an application and set its concurrency to `1000` (the maximum) and both `min-scale` and `max-scale` to `1`.

   ibmcloud ce application create --name APPNAME --image APPIMAGE --min-scale=1 --max-scale=1 --concurrency=1000

2. Start by sending a high rate of requests. If `502` errors occur, decrease the rate until the result shows a 100% success rate.
3. Find the request latency in the output of Step 2. If the request latency is not acceptable, further decrease the request rate until the latency looks acceptable. Note that the request duration is also important; it makes a difference whether the computation of a request takes `2` seconds or `100` milliseconds. Your acceptable request rate and latency vary according to your application characteristics and your scaling configuration. For example, if the computation within your application takes about 100 ms and your concurrency setting is 10, one application instance is able to process around 100 requests per second with a latency of about 100 ms (ignoring network latency).
4. To calculate the container concurrency value for your application, take the rate from Step 2 (in `req/s`) and divide it by the latency from Step 3 (in seconds): `CC = RATE/LATENCY`. For example, if the rate is `80` requests per second and the latency is `2` seconds, the resulting concurrency is `80 req/s / 2 s = 40`.
5. Update the application to set the container concurrency to the value that you found in the previous step and rerun the workload to check whether the success rate and latency are acceptable.
6. Experiment with the application by increasing the container concurrency value and observing the success rate and latency.
7. Update your application with the optimal container concurrency value and remove the `min-scale` and `max-scale` boundaries to allow the application to scale automatically. See the example command after this list.
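For example, if Step 4 yields a value of 40, the final update in Step 7 might look like the following sketch. `APPNAME` is the placeholder from Step 1, and `0` and `10` are the documented defaults for the scaling boundaries.

```sh
# Apply the measured concurrency and restore the default scaling boundaries
# so the application can autoscale again. APPNAME is a placeholder.
ibmcloud ce application update --name APPNAME --concurrency=40 --min-scale=0 --max-scale=10
```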
Observing how your app scales with the CLI
You can observe the number of running instances of your app with the CLI.
- Create an application with the `app create` command.

  ibmcloud ce application create --name myapp --image icr.io/codeengine/helloworld

- Call the application. You can obtain the URL of your app from the output of the `app create` command, or you can run `ibmcloud ce app get --name myapp --output url`.

  curl https://myapp.4svg40kna19.us-south.codeengine.appdomain.cloud

- Run the `application get` command to display the status of your application. Look for the value for `Running instances`. In this example, the app has `1` running instance. For example,

  ibmcloud ce application get --name myapp
Example output
[...] OK Name: myapp [...] URL: https://myapp.4svg40kna19.us-south.codeengine.appdomain.cloud Cluster Local URL: http://myapp.4svg40kna19.svc.cluster.local Console URL: https://cloud.ibm.com/codeengine/project/us-south/01234567-abcd-abcd-abcd-abcdabcd1111/application/myapp/configuration Status Summary: Application deployed successfully Environment Variables: Type Name Value Literal CE_API_BASE_URL https://api.us-south.codeengine.cloud.ibm.com Literal CE_APP myapp Literal CE_DOMAIN us-south.codeengine.appdomain.cloud Literal CE_PROJECT_ID abcdefgh-abcd-abcd-abcd-1a2b3c4d5e6f Literal CE_REGION us-south Literal CE_SUBDOMAIN abcdabcdab Image: icr.io/codeengine/helloworld Resource Allocation: CPU: 1 Ephemeral Storage: 400M Memory: 4G Revisions: myapp-ds8fn-1: Age: 6m25s Traffic: 100% Image: icr.io/codeengine/helloworld (pinned to fe0446) Running Instances: 1 Runtime: Concurrency: 100 Maximum Scale: 10 Minimum Scale: 0 Timeout: 300 Conditions: Type OK Age Reason ConfigurationsReady true 6m10s Ready true 5m56s RoutesReady true 5m56s Events: Type Reason Age Source Messages Normal Created 6m28s service-controller Created Configuration "myapp" Normal Created 6m28s service-controller Created Route "myapp" Instances: Name Revision Running Status Restarts Age myapp-ds8fn-1-deployment-79bdd76749-khtmw myapp-ds8fn-1 2/2 Running 0 32s
Wait a few minutes, as it can take a few minutes for your app to scale to zero.
- Run the `application get` command again and notice that the value for `Running instances` scaled to zero. When the application handles fewer requests than its configured target concurrency, the application scales the number of running instances down to the configured minimum number of instances. When the `--min-scale` option is set to `0` (the default value), the number of running instances automatically scales to zero.

  ibmcloud ce application get -n myapp
Example output
OK Name: myapp [...] URL: https://myapp.4svg40kna19.us-south.codeengine.appdomain.cloud Cluster Local URL: http://myapp.4svg40kna19.svc.cluster.local Console URL: https://cloud.ibm.com/codeengine/project/us-south/01234567-abcd-abcd-abcd-abcdabcd1111/application/myapp/configuration Image: icr.io/codeengine/helloworld Resource Allocation: CPU: 1 Ephemeral Storage: 400M Memory: 4G Revisions: myapp-ds8fn-1: Age: 12m Traffic: 100% Image: icr.io/codeengine/helloworld (pinned to 548d5c) Running Instances: 0 Runtime: Concurrency: 100 Maximum Scale: 10 Minimum Scale: 0 Timeout: 300 Conditions: Type OK Age Reason ConfigurationsReady true 3m7s Ready true 2m54s RoutesReady true 2m54s Events: Type Reason Age Source Messages Normal Created 3m21s service-controller Created Configuration "myapp" Normal Created 3m20s service-controller Created Route "myapp"
- Call the application again to scale the app up from zero.

  curl https://myapp.4svg40kna19.us-south.codeengine.appdomain.cloud
- Run the `application get` command again and notice that the value for `Running instances` scaled up from zero. For example,

  ibmcloud ce application get -n myapp
Example output
OK Name: myapp [...] URL: https://myapp.4svg40kna19.us-south.codeengine.appdomain.cloud Cluster Local URL: http://myapp.4svg40kna19.svc.cluster.local Console URL: https://cloud.ibm.com/codeengine/project/us-south/01234567-abcd-abcd-abcd-abcdabcd1111/application/myapp/configuration Status Summary: Application deployed successfully Environment Variables: Type Name Value Literal CE_API_BASE_URL https://api.us-south.codeengine.cloud.ibm.com Literal CE_APP myapp Literal CE_DOMAIN us-south.codeengine.appdomain.cloud Literal CE_PROJECT_ID abcdefgh-abcd-abcd-abcd-1a2b3c4d5e6f Literal CE_REGION us-south Literal CE_SUBDOMAIN abcdabcdab Image: icr.io/codeengine/helloworld Resource Allocation: CPU: 1 Ephemeral Storage: 400M Memory: 4G Revisions: myapp-00001: Age: 42s Latest: true Traffic: 100% Image: icr.io/codeengine/helloworld (pinned to 1cee99) Running Instances: 1 Runtime: Concurrency: 100 Maximum Scale: 10 Minimum Scale: 0 Scale Down Delay: 0 Timeout: 300 Conditions: Type OK Age Reason ConfigurationsReady true 25s Ready true 12s RoutesReady true 12s Events: Type Reason Age Source Messages Normal Created 44s service-controller Created Configuration "myapp" Normal Created 43s service-controller Created Route "myapp" Instances: Name Revision Running Status Restarts Age myapp-00001-deployment-d59b87654-xkqh7 myapp-00001 3/3 Running 0 43s
Setting the number of minimum and maximum instances for your app
You can change the number of running instances of your app by updating the minimum and maximum value for the number of app instances.
The default value for the minimum number of instances for your application is `0`. If your app does not receive any traffic, it scales to `0` and no instances of your app are running. In this situation, your application can be slower to respond when it receives traffic again, while it scales back up. You can change this behavior by updating your application and setting the minimum scale to `1`, either in the console or from the CLI.
The default value for the maximum number of instances for your app is `10`. If you set this value to `0`, your app scales as needed. For more information, see Scaling boundaries.
Changing the autoscaling range from the console
To change the range within which Code Engine autoscales the number of running instances for an app from the console, follow these steps.
- Navigate to your app.
- Select the Configuration tab.
- Select the Resources & scaling tab.
- Set the number for minimum and maximum instances for your app.
- (Optional) Review and set the request concurrency and timing settings for autoscaling your app.
- Click Deploy to save your changes and deploy the app revision.
When you update your application, Code Engine creates a new revision of your app and routes traffic to that revision.
Changing the autoscaling range with the CLI
To change the range within which Code Engine autoscales the number of running instances for an app with the CLI, run the `application update` command with the `--min-scale` and `--max-scale` options set to the number of instances that you want for your app, as shown in the example that follows. You can optionally set request concurrency and timing settings for your app. These options include `--concurrency`, `--concurrency-target`, `--request-timeout`, and `--scale-down-delay`. For more information about setting these autoscaling values, see Scaling boundaries and Autoscaling settings for concurrency and timing.
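A minimal sketch of such an update follows. The application name `myapp` and the values are placeholders; combine only the options that you want to change.

```sh
# Change the autoscaling range and, optionally, a timing setting in one update.
# "myapp" and the values shown are placeholders.
ibmcloud ce application update --name myapp --min-scale 2 --max-scale 15 --request-timeout 120
```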
When you update your application, Code Engine creates a new revision of your app and routes traffic to that revision.