How can `f(x) = wx + b` be scalar when it points to something multidimensional?

Hi everyone,

As I’ve been moving through this course, one idea keeps circling in my mind—especially around the function
f(x) = wx + b
used in linear regression.

We’re told it’s just a line in 2D space. And technically, it is. But I can’t shake the sense that this “line” is really the flattened projection of a higher-dimensional object—and that we’re losing conceptual clarity by calling it only a line.

Here’s what I mean:

w (the weight) doesn’t just tilt the line—it transforms the x-axis. It feels like it’s applying a scaling or rotation from an outside influence.

b (the bias) isn’t just a number we add—it acts like an anchor, raising or lowering the entire output space into a new layer of y.

Now, this makes perfect sense if we’re dealing with multiple inputs—where f(x) becomes a plane or a hyperplane. But even with just one variable, I think it’s helpful to realize:

This “line” is really a 2D shadow of a 3D or nD process.

We’re not just drawing from a simple y = mx + b equation; we’re slicing through a larger structure.

Why This Matters (At Least to Me)

In school, we were taught that 2D means two variables—length and width. But here, even when it’s a 2D chart, we’re injecting extra dimensions through parameters like w, b, and eventually through m (number of examples), loss functions, and model behavior.

So yes, the chart is flat—but the system it represents isn’t.

Curious if Others Feel This

Has anyone else thought of f(x) = wx + b as something deeper than a line?

Or maybe wondered why it feels like we’re working in more than two dimensions—but only plotting two?

Would love to hear how you’ve been visualizing it—or if you’ve had to re-train your brain to think “flatter” than it really wants to.

Cheers,
Daniel

2 Likes

The linear regression model is still 2D. The w and b values are the two variables. “length” and “width” are geometric assumptions. But we’re not limited to modeling only geometry.

I do not disagree with your concept that this is a transformation.

The same form holds if x is a matrix, w is a vector, and b is a scalar. Then x times w is a matrix-vector product (each row of x dotted with w), and f(x) is a vector.

The 2D case (of a straight line model) is just a simple example where the w vector has a single element.
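
A minimal NumPy sketch of that form, with made-up numbers purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# General case: X has m examples and n features, w has n entries, b is a scalar.
m, n = 5, 3
X = rng.normal(size=(m, n))   # design matrix, shape (m, n)
w = rng.normal(size=n)        # weight vector, shape (n,)
b = 0.5                       # scalar bias

f = X @ w + b                 # matrix-vector product, bias broadcast -> shape (m,)
print(f.shape)                # (5,)

# The straight-line model is the special case n = 1:
x = rng.normal(size=(m, 1))   # one feature per example
w1 = np.array([2.0])          # the "slope" is a single-element weight vector
print((x @ w1 + b).shape)     # (5,)
```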

1 Like

To bring this intuition to life, consider the following plot. It shows how the cost function J(w, b), which measures how well our line fits the data, depends on the parameters w and b. Each point on this surface corresponds to a possible line (a choice of w, b), and the “best” line corresponds to the lowest point on the surface.
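
For anyone who wants to reproduce a surface like that, here is a rough sketch with a small made-up dataset (everything below is illustrative, not the exact plot):

```python
import numpy as np
import matplotlib.pyplot as plt

# Tiny synthetic dataset roughly following y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0 + np.random.default_rng(0).normal(scale=0.1, size=x.size)

# Evaluate the mean squared error on a grid of (w, b) values
ws = np.linspace(0, 4, 100)
bs = np.linspace(-2, 4, 100)
W, B = np.meshgrid(ws, bs)
J = np.mean((W[..., None] * x + B[..., None] - y) ** 2, axis=-1)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(W, B, J, cmap="viridis")
ax.set_xlabel("w"); ax.set_ylabel("b"); ax.set_zlabel("J(w, b)")
plt.show()
```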

1 Like

Thanks for the response, TMosh—you’re right that the standard form of linear regression is presented as a 2D model: one feature x, one target y, with w and b serving as scalar parameters.

That simplicity is powerful for visualization. But I think there’s a deeper point worth surfacing here:

From Algebraic Form to Geometric Reality
When we write:

f(x) = wx + b

we’re not just describing a line—we’re describing a transformation of input space x into output space y through a linear map parameterized by w, then translated by b.

This is where the “2D” interpretation starts to break down. Why?

Because the Function f(x) Is a Projection

  • w is not just a slope—it’s a scaling factor from a vector space, even in 1D.
  • b is not just a y-intercept—it’s a translation vector, moving the entire function vertically across a higher-dimensional field.

In higher dimensions, when:
x ∈ ℝⁿ, w ∈ ℝⁿ, b ∈ ℝ
we calculate:
f(x) = xᵀw + b
and now the model becomes a dot product + translation. That’s not a “line” in the traditional sense, but the equation of a hyperplane.

So even in the 1D feature case, what appears as a line is better thought of as the intersection of a hyperplane (defined by w and b) with a one-dimensional subspace. It’s a slice, not a standalone object.
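
To make that “slice” picture concrete, here is a small sketch (the weights and reference values below are arbitrary, purely illustrative):

```python
import numpy as np

# A hyperplane in R^3: f(x) = x^T w + b
w = np.array([1.5, -0.7, 2.0])
b = 0.25

def f(x):
    return x @ w + b

# Vary only the first feature, keep the other two fixed at reference values:
x1 = np.linspace(-1, 1, 5)
fixed = np.array([0.4, -0.2])
X = np.column_stack([x1, np.tile(fixed, (x1.size, 1))])

# Along this slice f is affine in x1: slope w[0], intercept b + fixed @ w[1:]
print(f(X))
print(w[0] * x1 + (b + fixed @ w[1:]))   # identical values: the "line" is the slice
```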

So, Is This Really 2D?
Not quite. Mathematically:

  • The inputs (x) form a set in ℝⁿ (often n ≥ 1),
  • The model operates in a parameter space of (w, b), while its outputs f(x) live in ℝ,
  • But the learned relationship (through loss minimization) lives in function space, a higher-order abstraction.

That’s not just geometry—that’s functional analysis.

The dimensional reduction from function space → parametric model → 2D plot is useful for pedagogy, but it can obscure the richness of what’s really happening. And if learners carry the misconception that linear regression is inherently 2D, they may struggle later when intuition breaks down in models with dozens or hundreds of features.

Thanks again for engaging. I think we agree—this is a transformation. My angle is just that we should recognize it always was, even before we scaled it to matrices.

1 Like

If we call something a “transformation” in a beginner-level course with no analytic prerequisites, 80% of the audience is going to be totally lost.

Hence Andrew doesn’t describe it that way.

Thanks, Conscell — I really appreciate your engagement and the visualization you’re pointing toward.

I think where I’m still finding tension isn’t in the cost surface itself (which is clear enough as a mapping of error across parameter space), but rather in how casually we assume that (w, b) live inside the same conceptual frame as (x, y)—when really, they’re operating in parameter space, not input space.

To me, the cost surface isn’t just a hill we’re sliding down—it’s a dimensional echo of an even more complex transformation. Each point on that surface isn’t just a candidate line, it’s a re-expression of the entire system’s behavior at that setting.

So when we say “the best line corresponds to the lowest point,” I agree technically—but I’m also seeing that we’re navigating a compression of a far higher-dimensional decision surface that includes not just w and b, but how they interact with the distribution and orientation of x and y across the dataset.

That’s why I keep circling back to the idea that calling it “just a line” (even in linear regression) undersells the richness of the abstraction.

Appreciate you helping me explore this out loud. Curious to know if you or others ever visualize cost functions not just as hills, but as dynamic relational topologies—shaped by both data and architecture.

Warm regards,
Daniel

1 Like

You are right… I often forget that… I was hyperfocused on planar and dimensional relationships… lol

1 Like

Linear Regression works in higher dimensions as well, as Pavel and Tom have mentioned. w and b define a hyperplane in higher dimensions instead of a line. The cost function is still a hill, but it’s a convex hill in n dimensions. You may well have higher powers in that respect from your work in physics, but I have trouble visualizing that. :smiley:

Oh, yes… I am just getting to that part now… the next lesson I am taking is:

Cost function formula

1 Like

Let’s say we have a ground truth function y = f(x), where x and y can be scalars or vectors (if we are dealing with higher-dimensional structures such as matrices or tensors, we can vectorize them using {\rm vec}(\cdot)). We don’t know the exact form of f, but we have access to a dataset of observations {\mathbb D} = \{ (x^{(i)}, y^{(i)}) \}_{i = 1}^{|{\mathbb D}|}. Our goal is to construct an approximate function f_{\theta}(x), parameterized by \theta (in the case of linear regression we can set \theta = (b \ \ w^\top)^\top and x_0 = 1).

To evaluate how well our model approximates the ground truth, we define a cost function over the dataset \displaystyle J_{\mathbb D}(\theta) = \frac{1}{|\mathbb D|} \sum_{(x^{(i)}, y^{(i)}) \in {\mathbb D}} l(f_\theta(x^{(i)}), y^{(i)}), where l is a chosen loss function (e.g. squared loss for linear regression). So we are going to minimize it: \displaystyle \min_{\theta} J_{\mathbb D}(\theta).

This formulation already hints at a deeper structure. What if f is non-linear and we are using a linear approximation f_\theta? Maybe we could represent f_\theta as f'_\theta \circ g, or as f^{(L)}_{\theta^{[L]}} \circ \dots \circ f^{(1)}_{\theta^{[1]}}. This transforms our cost surface from a simple hill into a non-convex landscape shaped by multiple layers of abstraction.

Moreover, as you pointed out, the cost surface is not just a function of the parameters; it’s implicitly entangled with the geometry and distribution of the data. We can treat the population of input-target pairs as samples drawn from a data-generating process governed by some unknown distribution p. In that case, we can define the cost as the expected loss under this distribution: J_p(\theta) = {\mathbb E}_{(x,y) \sim p}[l(f_\theta(x), y)].

However, in practice the distribution p is unknown and inaccessible. What we have instead is a finite dataset \mathbb D, which we treat as a proxy for the underlying distribution. We minimize the empirical cost J_{\mathbb D}(\theta) in the hope that the resulting parameters \theta will generalize well. But this hope rests on assumptions about the size and representativeness of \mathbb D, as well as about the complexity of f_\theta. And because the cost function is defined over a finite sample, its topology is only an approximation of the underlying landscape. So we can consider each point on that surface as a compressed summary of how a particular parameterization would behave across the entire data distribution.
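
For concreteness, here is a minimal sketch of that empirical objective in code, using the notation above with a made-up dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                                   # |D| = 50 examples, 2 features
y = X @ np.array([1.0, -2.0]) + 0.5 + rng.normal(scale=0.1, size=50)

def f_theta(theta, x):
    # theta[0] is b, theta[1:] is w; prepend x_0 = 1 to fold b into the dot product
    return np.concatenate([[1.0], x]) @ theta

def J_D(theta, X, y):
    # mean squared loss over the finite dataset D (a proxy for the expected loss J_p)
    return np.mean([(f_theta(theta, x_i) - y_i) ** 2 for x_i, y_i in zip(X, y)])

print(J_D(np.array([0.5, 1.0, -2.0]), X, y))   # near-optimal theta -> small cost
print(J_D(np.zeros(3), X, y))                  # arbitrary theta -> larger cost
```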

Thank you, Paul—

I truly appreciate your phrasing here—“undersells the richness of the abstraction” captures my core concern perfectly.

You’re absolutely right: in higher dimensions, w and b define a hyperplane, and the cost function remains convex, forming a surface with a global minimum. But I think what’s often missed—especially in early explanations—is that even in the so-called “simple” 2D case, the system being modeled already has embedded dimensionality. That is, the act of adjusting w and b implies we’re navigating a 3D+ landscape (parameters, loss surface, and input relationships), even if the data input itself is 1D.

It’s not just that the model can be extended to nD—it’s that the process always lives there. We simply project it onto a 2D plane to make it cognitively digestible.
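
As a toy illustration of that “navigation” (made-up data, arbitrary step size), a few gradient-descent steps already trace out (w, b, J) triples on that surface, even though the input is 1D:

```python
import numpy as np

# Synthetic 1D data following y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b, lr = 0.0, 0.0, 0.02
for step in range(5):
    err = w * x + b - y
    J = np.mean(err ** 2)
    print(f"step {step}: (w, b, J) = ({w:.3f}, {b:.3f}, {J:.3f})")
    # gradients of the mean-squared cost with respect to w and b
    w -= lr * np.mean(2 * err * x)
    b -= lr * np.mean(2 * err)
```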

In physics, we often encounter similar projections—like field lines or wavefunctions rendered onto a screen—but we’re always aware that what we’re seeing is a “slice” of a deeper, more entangled structure. That’s what I feel is conceptually missing in how linear regression is initially taught: that awareness that even the line is a projection from a more complex manifold of possibility.

And yes—visualizing nD convexity gets hairy even for us! But I believe reinforcing the dimensional embeddedness of what we’re modeling—even in simple examples—can help learners grasp the elegance and limits of these abstractions much sooner.

Thanks again for the thoughtful response!

—Daniel

2 Likes